IMGT Index

FASTA format

Definition and description

The FASTA format is a sequence format that begins with a single description line followed by lines of sequence data.
The FASTA format is used as query input for many bioinformatics tools such as BLAST, ClustalW and IMGT/V-QUEST.
The description line starts with a ">" symbol, immediately followed, without a space, by a sequence identifier chosen by the user. The sequence identifier may range from a single character (letter or numeral) to a detailed description, as exemplified in the 'IMGT FASTA header of nucleotide IMGT reference sequences', which contains 15 fields.
It is recommended that all lines of text be shorter than 80 characters. Blanks are ignored; gaps must be represented by dots (.), hyphens (-) or underscores (_). A FASTA file may contain many sequences in FASTA format, but blank lines are not allowed in the middle of the FASTA input.
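The rules above can be sketched as a small reader. This is only a minimal illustration in Python, not an IMGT tool: the names `read_fasta`, `ungap` and `GAP_CHARS` are hypothetical, and the choice to tolerate blank lines at the very start or end of the input (while rejecting them in the middle) is an assumption.

```python
from typing import Iterable, Iterator, Tuple

GAP_CHARS = ".-_"  # gap symbols named in the text: dot, hyphen, underscore


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    description = None
    chunks = []
    blank_seen = False
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line is only an error if more data follows it.
            if description is not None:
                blank_seen = True
            continue
        if blank_seen:
            raise ValueError("blank line in the middle of the FASTA input")
        if line.startswith(">"):
            if description is not None:
                yield description, "".join(chunks)
            description = line[1:]  # everything after the ">" symbol
            chunks = []
        elif description is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            # Blanks inside a sequence line are ignored.
            chunks.append("".join(line.split()))
    if description is not None:
        yield description, "".join(chunks)


def ungap(sequence: str) -> str:
    """Remove gap symbols, leaving only the residues."""
    return sequence.translate(str.maketrans("", "", GAP_CHARS))
```

Because `read_fasta` is a generator, an input error only surfaces once the records are consumed, for example with `list(read_fasta(open_file))`.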

IMGT FASTA header of nucleotide IMGT reference sequences


The IMGT FASTA header of nucleotide IMGT reference sequences contains 15 fields separated by '|':
1. IMGT/LIGM-DB accession number(s)
2. IMGT gene and allele name
3. species
4. IMGT allele functionality
5. exon(s), region name(s), or extracted label(s)
6. start and end positions in the IMGT/LIGM-DB accession number(s)
7. number of nucleotides in the IMGT/LIGM-DB accession number(s)
8. codon start, or 'NR' (not relevant) for non-coding labels
9. +n: number of nucleotides (nt) added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB
10. +n or -n: number of nucleotides (nt) added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB
11. +n, -n, and/or nS: number of nucleotides added, deleted, and/or substituted to correct sequencing errors, or 'not corrected' if the sequencing errors have not been corrected
12. number of amino acids (AA): this field indicates that the sequence is in amino acids
13. number of characters in the sequence: nt (or AA)+IMGT gaps=total
14. partial (if it is)
15. reverse complementary (if it is)

```
>M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+24=320| | |
caggttcagctggtgcagtctggagct...gaggtgaagaagcctggggcctcagtgaag
gtctcctgcaaggcttctggttacaccttt............accagctatggtatcagc
tgggtgcgacaggcccctggacaagggcttgagtggatgggatggatcagcgcttac...
...aatggtaacacaaactatgcacagaagctccag...ggcagagtcaccatgaccaca
gacacatccacgagcacagcctacatggagctgaggagcctgagatctgacgacacggcc
gtgtattactgtgcgagaga
```
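The 15-field layout can be read mechanically by splitting the description line on '|'. The sketch below is only an illustration: the field names in `IMGT_FIELDS` and both function names are hypothetical labels chosen here, not IMGT identifiers. It assumes that the header ends with a trailing '|', as in the example above, and that IMGT gaps are written as dots.

```python
IMGT_FIELDS = (
    "accession", "gene_and_allele", "species", "functionality", "label",
    "positions", "nucleotide_count", "codon_start", "added_5prime",
    "added_or_removed_3prime", "corrections", "amino_acid_count",
    "character_count", "partial", "reverse_complementary",
)


def parse_imgt_header(description_line: str) -> dict:
    """Split an IMGT reference header into its 15 fields (blank fields become '')."""
    text = description_line.lstrip(">").rstrip()
    parts = [part.strip() for part in text.split("|")]
    # A trailing '|' leaves one empty item after field 15; drop it.
    if len(parts) == len(IMGT_FIELDS) + 1 and parts[-1] == "":
        parts.pop()
    if len(parts) != len(IMGT_FIELDS):
        raise ValueError(f"expected {len(IMGT_FIELDS)} fields, found {len(parts)}")
    return dict(zip(IMGT_FIELDS, parts))


def character_count_matches(fields: dict, gapped_sequence: str) -> bool:
    """Check field 13 (residues+gaps=total) against the sequence itself."""
    counted, total = fields["character_count"].split("=")
    residues, gaps = counted.split("+")
    n_gaps = gapped_sequence.count(".")  # assumption: IMGT gaps are dots
    n_total = len(gapped_sequence)
    return (int(residues), int(gaps), int(total)) == (n_total - n_gaps, n_gaps, n_total)


# The example header and sequence shown above.
HEADER = ">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+24=320| | |"
SEQUENCE = (
    "caggttcagctggtgcagtctggagct...gaggtgaagaagcctggggcctcagtgaag"
    "gtctcctgcaaggcttctggttacaccttt............accagctatggtatcagc"
    "tgggtgcgacaggcccctggacaagggcttgagtggatgggatggatcagcgcttac..."
    "...aatggtaacacaaactatgcacagaagctccag...ggcagagtcaccatgaccaca"
    "gacacatccacgagcacagcctacatggagctgaggagcctgagatctgacgacacggcc"
    "gtgtattactgtgcgagaga"
)

fields = parse_imgt_header(HEADER)
consistent = character_count_matches(fields, SEQUENCE)
```

For this example, field 13 reads 296+24=320: the sequence has 320 characters, 24 of which are dots.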

Examples of nucleotide sequences in FASTA format

in IMGT reference directory:
```
>M29672_IGHV1S1*01_Rajeri
GCGGTCGTGCTGAATCAGAAACCGACC...GAGGCGGCAAAGTCTGGAGAGTCCCTCAAACTGACCTGTGTAACCAGCG
GGTTCAGCCTCAGCAGCTCCAAC............GTGCATTGGGTGAAACAAGTCCCCGGGAAAGGGCTGGAGTGGGT
GGCGATCATGTGGTATGATGATGACAAA.........GATTACGCGCCTGCCTTCAGC...GGCCGATTCACTGTTTCC
AGG......GACAGCAGCAATGTCTATCTCCAAATGACCAACCTGAGTCTGGCCGACACGGCCACCTATTACTGTGCG
```
In this example, the description line comprises the IMGT/LIGM-DB sequence accession number (M29672), the allele name (IGHV1S1*01) and the species abbreviation (Rajeri) for Leucoraja erinacea. The abbreviation used by IMGT comprises the first three letters of the Latin genus name followed by the first three letters of the Latin species name.
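The abbreviation rule described above is simple enough to express directly. The function below is a minimal sketch with a hypothetical name, `species_abbreviation`; it assumes the Latin name is given as "Genus species" separated by whitespace.

```python
def species_abbreviation(latin_name: str) -> str:
    """First three letters of the genus followed by the first three of the species."""
    genus, species = latin_name.split()[:2]
    return genus[:3].capitalize() + species[:3].lower()
```

Applied to "Homo sapiens" this gives "Homsap". Note that "Rajeri" is what the rule yields for the genus name Raja (the IMGT/LIGM-DB definition of M29672 reads "R.erinacea"), not for Leucoraja; this is an observation about the rule, not something the text above states.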

in IMGT/LIGM-DB:

```
>M29672|REIGHA|R.erinacea Ig rearranged H-chain mRNA (V-D-J-C region).
cccattcctggagtgtccaagtgtgtgtccgtgctcagagtgatgggggtcgctgtttat
ctctgtctccttctgttctgtctgccaggcgttcgatccgcggtcgtgctgaatcagaaa
ccgaccgaggcggcaaagtctggagagtccctcaaactgacctgtgtaaccagcgggttc
agcctcagcagctccaacgtgcattgggtgaaacaagtccccgggaaagggctggagtgg
gtggcgatcatgtggtatgatgatgacaaagattacgcgcctgccttcagcggccgattc
actgtttccagggacagcagcaatgtctatctccaaatgaccaacctgagtctggccgac
acggccacctattactgtgcggcagccatggggggctctatatactggcttgagtactgg
ggtgcaggaacctcgctgacagtgacttcagaggatgtggttttgccttcagtccacatc
acctcttcctgcaacacggaatctggccaagagatcagcatcctctgtctggtcaaggac
tacctgcctgaggtcatcagtcagacatggtccaccagcagtggggtcatcaacaatgga
ataacaaagtacccaccagtgttgggacaaaacaagaagtacacaatgagcagcttgctg
cgagtctctgtagcagattggaacaggaaaacctactactgcaaggcagggtacaagccg
gacaacatggtgaaaacggagatccagaagcctcaagccccacagctcatcccccttgtt
ccatctccggagactctccacaatcaaacaactgctgtcctgggctgcatgatatctgga
ttctctcctgacaatattaaagtttcctggaaaaaagctggacttaatcaagcgggcgtc
gttctcccatccactccgagaactaacggtggatttgaaacagttgcttacctgccgttg
aatgtggaggaatggaccaacaaacaggaatatacttgtgaagtgacccacgcaccttcc
ggcttcagcgacaagatcaacatgagatatcaagagggtggaaaatgtcccggctgttcg
aagtgtctgccgaagttcatctaccagagtaatctcaatgtgtcgttctcagatggttct
acccagcagtatcattgttgggcaggaaagtgtgaaataaagtaattggctgc
```

In this example, the description line comprises the IMGT/LIGM-DB sequence accession number (M29672), the mnemonic (REIGHA), the definition (extracted from EMBL).

in IMGT/GENE-DB:

```
>AE000658|TRAV4*01|Homo sapiens|F|V-REGION
cttgctaagaccacccag...cccatctccatggactcatatgaaggacaagaagtgaac
ataacctgtagccacaacaacattgctacaaatgattat...............atcacg
tggtaccaacagtttcccagccaaggaccacgatttattattcaaggatacaagaca...
...............aaagttacaaacgaa.....................gtggcctcc
ctgtttatccctgccgacagaaagtccagcactctgagcctgccccgggtttccctgagc
gacactgctgtgtactactgcctcgtgggtgaca
```

In this example, the description line comprises the IMGT/LIGM-DB sequence accession number (AE000658), the allele name (TRAV4*01), the species (Homo sapiens), the functionality of the allele (F for Functional), the label of the coding region (V-REGION). The word "partial" is then added when the coding region is partial in 5' or 3'.
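A header of this shorter form can be read with the same splitting approach. The sketch below is illustrative only: `GeneDbHeader` and `parse_gene_db_header` are hypothetical names, and it assumes that the word "partial", when present, appears in a field after the label. The exact wording of that field is not shown in the text above, so the function only checks for a field beginning with "partial".

```python
from typing import NamedTuple


class GeneDbHeader(NamedTuple):
    accession: str
    allele: str
    species: str
    functionality: str
    label: str
    partial: bool


def parse_gene_db_header(description_line: str) -> GeneDbHeader:
    """Read the five leading fields and flag a later field starting with 'partial'."""
    parts = [part.strip() for part in description_line.lstrip(">").split("|")]
    if len(parts) < 5:
        raise ValueError("expected at least 5 fields separated by '|'")
    partial = any(part.lower().startswith("partial") for part in parts[5:])
    return GeneDbHeader(*parts[:5], partial)


header = parse_gene_db_header(">AE000658|TRAV4*01|Homo sapiens|F|V-REGION")
# Assumption: the gene name is the part of the allele name before the '*'.
gene_name = header.allele.partition("*")[0]
```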
