hdallatorre committed
Commit 1e463de
1 Parent(s): 0126b6f

Update README.md

Files changed (1): README.md +4 -4
README.md CHANGED
@@ -10,11 +10,11 @@ tags:
 ---
 # segment-nt-multi-species
 
-Segment-NT-multi-species is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
-elements in a sequence at a single nucleotide resolution. It is the result of finetuning the [Segment-NT](https://huggingface.co/InstaDeepAI/segment_nt) model on a dataset encompassing the human genome
+SegmentNT-multi-species is a segmentation model leveraging the [Nucleotide Transformer](https://huggingface.co/InstaDeepAI/nucleotide-transformer-v2-500m-multi-species) (NT) DNA foundation model to predict the location of several types of genomics
+elements in a sequence at a single nucleotide resolution. It is the result of finetuning the [SegmentNT](https://huggingface.co/InstaDeepAI/segment_nt) model on a dataset encompassing the human genome
 but also the genomes of 5 selected species: mouse, chicken, fly, zebrafish and worm.
 
-For the finetuning on the multi-species genomes, we curated a dataset of a subset of the annotations used to train **Segment-NT**, mainly because only this subset of annotations is
+For the finetuning on the multi-species genomes, we curated a dataset of a subset of the annotations used to train **SegmentNT**, mainly because only this subset of annotations is
 available for these species. The annotations therefore concern the 7 main gene elements available from [Ensembl](https://www.ensembl.org/index.html), namely protein-coding gene, 5’UTR, 3’UTR, intron, exon,
 splice acceptor and donor sites.
 
@@ -39,7 +39,7 @@ pip install --upgrade git+https://github.com/huggingface/transformers.git
 A small snippet of code is given here in order to retrieve both logits and embeddings from a dummy DNA sequence.
 
 
-⚠️ The maximum sequence length is set by default at the training length of 30,000 nucleotides, or 5001 tokens (accounting for the CLS token). However, Segment-NT has
+⚠️ The maximum sequence length is set by default at the training length of 30,000 nucleotides, or 5001 tokens (accounting for the CLS token). However, SegmentNT has
 been shown to generalize up to sequences of 50,000 bp. In case you need to infer on sequences between 30kbp and 50kbp, make sure to change the `rescaling_factor`
 argument in the config to `num_dna_tokens_inference / max_num_tokens_nt` where `num_dna_tokens_inference` is the number of tokens at inference
 (i.e 6669 for a sequence of 40008 base pairs) and `max_num_tokens_nt` is the max number of tokens on which the backbone nucleotide-transformer was trained on, i.e `2048`.
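The `rescaling_factor` guidance in the changed lines boils down to simple token arithmetic: the NT tokenizer maps 6 bp to one token and prepends a CLS token, and the backbone was trained on 2,048 tokens. A minimal sketch of that arithmetic under those assumptions (the helper names here are illustrative, not part of the model's API):

```python
# Token arithmetic behind the rescaling_factor override described in the README.
# Assumptions: 6 bp per token plus one CLS token; backbone trained on 2048 tokens.

MAX_NUM_TOKENS_NT = 2048  # max tokens the nucleotide-transformer backbone was trained on


def num_dna_tokens(sequence_length_bp: int) -> int:
    """Token count for a sequence: one token per 6 bp, plus the CLS token."""
    return sequence_length_bp // 6 + 1


def rescaling_factor(sequence_length_bp: int) -> float:
    """num_dna_tokens_inference / max_num_tokens_nt, as described in the README."""
    return num_dna_tokens(sequence_length_bp) / MAX_NUM_TOKENS_NT


# Examples matching the README's numbers:
print(num_dna_tokens(30000))   # 5001 tokens at the default 30,000 bp training length
print(num_dna_tokens(40008))   # 6669 tokens for a 40,008 bp sequence
print(rescaling_factor(40008))  # value to set in the config for that sequence
```

This only applies when inferring on sequences between 30 kbp and 50 kbp; at or below the default training length the stock config needs no change.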