FASTA format
Definition and description
- The FASTA format is a sequence format that begins with a single description line followed by lines of sequence data.
- The FASTA format is used as query input for many bioinformatic tools such as BLAST, ClustalW, IMGT/V-QUEST etc.
- The description line starts with a ">" symbol, followed by a sequence identifier
(chosen by the user) without space.
The sequence identifier may range from a single character (letter or numeral) to a detailed description as exemplified
in the 'IMGT FASTA header of nucleotide IMGT reference sequences' which contains 15 fields.
It is recommended that all lines of text be shorter than 80 characters. Blanks are ignored; gaps must be represented by dots (.), hyphens (-) or underscores (_). A FASTA file may contain many sequences in FASTA format, but blank lines are not allowed in the middle of the FASTA input.
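The rules above can be sketched as a small reader. This is a minimal illustration, not the parser used by any IMGT tool: the names `read_fasta`, `ungapped` and `GAP_CHARS` are hypothetical, and raising an error on an interior blank line is an assumed way of enforcing the "no blank lines" rule.

```python
from typing import Iterable, Iterator, Tuple

GAP_CHARS = ".-_"  # gap symbols named in the text (hypothetical constant name)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (description, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks = []
    blank_seen = False
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line is only an error if more data follows it.
            blank_seen = header is not None
            continue
        if blank_seen:
            raise ValueError("blank line in the middle of the FASTA input")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data before the first description line")
        else:
            # Blanks inside sequence lines are ignored.
            chunks.append("".join(line.split()))
    if header is not None:
        yield header, "".join(chunks)


def ungapped(sequence: str) -> str:
    """Return the sequence with gap symbols (., -, _) removed."""
    return "".join(ch for ch in sequence if ch not in GAP_CHARS)
```

Called on the lines of a file, `read_fasta` yields one pair per sequence, so a multi-sequence FASTA file can be processed record by record; `ungapped` gives the plain residues when the gaps are not needed.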
IMGT FASTA header of nucleotide IMGT reference sequences
The IMGT FASTA header of nucleotide IMGT reference sequences contains 15 fields separated by '|':
1. IMGT/LIGM-DB accession number(s)
2. IMGT gene and allele name
3. species
4. IMGT allele functionality
5. exon(s), region name(s), or extracted label(s)
6. start and end positions in the IMGT/LIGM-DB accession number(s)
7. number of nucleotides in the IMGT/LIGM-DB accession number(s)
8. codon start, or 'NR' (not relevant) for non coding labels
9. +n: number of nucleotides (nt) added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB
10. +n or -n: number of nucleotides (nt) added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB
11. +n, -n, and/or nS: number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or 'not corrected' if non corrected sequencing errors
12. number of amino acids (AA): this field indicates that the sequence is in amino acids
13. number of characters in the sequence: nt (or AA)+IMGT gaps=total
14. partial (if it is)
15. reverse complementary (if it is)
>M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1| | | | |296+24=320| | |
caggttcagctggtgcagtctggagct...gaggtgaagaagcctggggcctcagtgaag
gtctcctgcaaggcttctggttacaccttt............accagctatggtatcagc
tgggtgcgacaggcccctggacaagggcttgagtggatgggatggatcagcgcttac...
...aatggtaacacaaactatgcacagaagctccag...ggcagagtcaccatgaccaca
gacacatccacgagcacagcctacatggagctgaggagcctgagatctgacgacacggcc
gtgtattactgtgcgagaga
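As an illustration of the 15-field layout, the header can be split on '|' and field 13 checked against the sequence itself. This is a sketch under stated assumptions: the dictionary keys in `IMGT_FIELDS` and the function names are hypothetical labels chosen here, not official IMGT identifiers, and the gap check assumes IMGT gaps are written as dots, as in the example above.

```python
import re

# Hypothetical key names for the 15 fields, in the order listed in the text.
IMGT_FIELDS = (
    "accession", "gene_allele", "species", "functionality", "label",
    "positions", "length", "codon_start", "added_5prime", "changed_3prime",
    "corrections", "aa_count", "char_count", "partial", "reverse_complement",
)


def parse_imgt_header(line: str) -> dict:
    """Split an IMGT reference-sequence header into its 15 fields."""
    parts = [p.strip() for p in line.lstrip(">").split("|")]
    if len(parts) < len(IMGT_FIELDS):
        raise ValueError("expected 15 '|'-separated fields")
    # A trailing '|' yields one extra empty piece, which is ignored.
    return dict(zip(IMGT_FIELDS, parts[:len(IMGT_FIELDS)]))


def char_count_matches(char_count: str, sequence: str) -> bool:
    """Check field 13 ('nt+gaps=total') against the gapped sequence."""
    match = re.fullmatch(r"(\d+)\+(\d+)=(\d+)", char_count)
    if match is None:
        raise ValueError("field 13 is not of the form n+g=total")
    residues, gaps, total = (int(g) for g in match.groups())
    found_gaps = sequence.count(".")  # assumption: gaps are dots
    found = (len(sequence) - found_gaps, found_gaps, len(sequence))
    return found == (residues, gaps, total)


header = (">M99641|IGHV1-18*01|Homo sapiens|F|V-REGION|188..483|296 nt|1"
          "| | | | |296+24=320| | |")
sequence = (
    "caggttcagctggtgcagtctggagct...gaggtgaagaagcctggggcctcagtgaag"
    "gtctcctgcaaggcttctggttacaccttt............accagctatggtatcagc"
    "tgggtgcgacaggcccctggacaagggcttgagtggatgggatggatcagcgcttac..."
    "...aatggtaacacaaactatgcacagaagctccag...ggcagagtcaccatgaccaca"
    "gacacatccacgagcacagcctacatggagctgaggagcctgagatctgacgacacggcc"
    "gtgtattactgtgcgagaga"
)
fields = parse_imgt_header(header)
consistent = char_count_matches(fields["char_count"], sequence)
```

For the IGHV1-18*01 example, the sequence holds 296 nucleotides and 24 gap dots, so `consistent` is true; the positions 188..483 in field 6 also span 483 - 188 + 1 = 296 nucleotides, in agreement with field 7.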
Examples of nucleotide sequences in FASTA format
- in IMGT reference directory:
>M29672_IGHV1S1*01_Rajeri
GCGGTCGTGCTGAATCAGAAACCGACC...GAGGCGGCAAAGTCTGGAGAGTCCCTCAAACTGACCTGTGTAACCAGCG
GGTTCAGCCTCAGCAGCTCCAAC............GTGCATTGGGTGAAACAAGTCCCCGGGAAAGGGCTGGAGTGGGT
GGCGATCATGTGGTATGATGATGACAAA.........GATTACGCGCCTGCCTTCAGC...GGCCGATTCACTGTTTCC
AGG......GACAGCAGCAATGTCTATCTCCAAATGACCAACCTGAGTCTGGCCGACACGGCCACCTATTACTGTGCG
In this example, the description line comprises the IMGT/LIGM-DB sequence accession number (M29672), the allele name (IGHV1S1*01), and the species abbreviation (Rajeri) for Leucoraja erinacea, formerly Raja erinacea. The abbreviation used by IMGT comprises the first three letters of the Latin genus name followed by the first three letters of the Latin species name.
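The abbreviation rule can be written as a one-line helper. This is only a sketch of the rule as stated in the text; the function name `imgt_species_abbreviation` is hypothetical, and the capitalisation (upper-case first letter, lower-case remainder) is inferred from the single example Rajeri.

```python
def imgt_species_abbreviation(latin_name: str) -> str:
    """Build a six-letter abbreviation from a 'Genus species' Latin name."""
    words = latin_name.split()
    if len(words) < 2:
        raise ValueError("expected a Latin name of the form 'Genus species'")
    genus, species = words[0], words[1]
    # First three letters of the genus, then first three of the species.
    return genus[:3].capitalize() + species[:3].lower()
```

With this helper, `imgt_species_abbreviation("Raja erinacea")` returns `"Rajeri"`, the abbreviation that appears in the description line above.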
- in IMGT/LIGM-DB:
>M29672|REIGHA|R.erinacea Ig rearranged H-chain mRNA (V-D-J-C region).
cccattcctggagtgtccaagtgtgtgtccgtgctcagagtgatgggggtcgctgtttat
ctctgtctccttctgttctgtctgccaggcgttcgatccgcggtcgtgctgaatcagaaa
ccgaccgaggcggcaaagtctggagagtccctcaaactgacctgtgtaaccagcgggttc
agcctcagcagctccaacgtgcattgggtgaaacaagtccccgggaaagggctggagtgg
gtggcgatcatgtggtatgatgatgacaaagattacgcgcctgccttcagcggccgattc
actgtttccagggacagcagcaatgtctatctccaaatgaccaacctgagtctggccgac
acggccacctattactgtgcggcagccatggggggctctatatactggcttgagtactgg
ggtgcaggaacctcgctgacagtgacttcagaggatgtggttttgccttcagtccacatc
acctcttcctgcaacacggaatctggccaagagatcagcatcctctgtctggtcaaggac
tacctgcctgaggtcatcagtcagacatggtccaccagcagtggggtcatcaacaatgga
ataacaaagtacccaccagtgttgggacaaaacaagaagtacacaatgagcagcttgctg
cgagtctctgtagcagattggaacaggaaaacctactactgcaaggcagggtacaagccg
gacaacatggtgaaaacggagatccagaagcctcaagccccacagctcatcccccttgtt
ccatctccggagactctccacaatcaaacaactgctgtcctgggctgcatgatatctgga
ttctctcctgacaatattaaagtttcctggaaaaaagctggacttaatcaagcgggcgtc
gttctcccatccactccgagaactaacggtggatttgaaacagttgcttacctgccgttg
aatgtggaggaatggaccaacaaacaggaatatacttgtgaagtgacccacgcaccttcc
ggcttcagcgacaagatcaacatgagatatcaagagggtggaaaatgtcccggctgttcg
aagtgtctgccgaagttcatctaccagagtaatctcaatgtgtcgttctcagatggttct
acccagcagtatcattgttgggcaggaaagtgtgaaataaagtaattggctgc
In this example, the description line comprises the IMGT/LIGM-DB sequence accession number (M29672), the mnemonic (REIGHA), and the definition (extracted from EMBL).
- in IMGT/GENE-DB:
>AE000658|TRAV4*01|Homo sapiens|F|V-REGION
cttgctaagaccacccag...cccatctccatggactcatatgaaggacaagaagtgaac
ataacctgtagccacaacaacattgctacaaatgattat...............atcacg
tggtaccaacagtttcccagccaaggaccacgatttattattcaaggatacaagaca...
...............aaagttacaaacgaa.....................gtggcctcc
ctgtttatccctgccgacagaaagtccagcactctgagcctgccccgggtttccctgagc
gacactgctgtgtactactgcctcgtgggtgaca
In this example, the description line comprises the IMGT/LIGM-DB sequence accession number (AE000658), the allele name (TRAV4*01), the species (Homo sapiens), the functionality of the allele (F for Functional), the label of the coding region (V-REGION). The word "partial" is then added when the coding region is partial in 5' or 3'.
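The shorter IMGT/GENE-DB description line can be read in the same spirit. The sketch below is illustrative only: the key names and the function `parse_gene_db_header` are hypothetical, and it assumes that the word "partial", when present, appears in an additional '|'-separated field after the label, which the text does not specify.

```python
# Hypothetical key names for the five fields described in the text.
GENE_DB_FIELDS = ("accession", "allele", "species", "functionality", "label")


def parse_gene_db_header(line: str) -> dict:
    """Split an IMGT/GENE-DB description line into its named fields."""
    parts = [p.strip() for p in line.lstrip(">").split("|")]
    if len(parts) < len(GENE_DB_FIELDS):
        raise ValueError("expected at least 5 '|'-separated fields")
    record = dict(zip(GENE_DB_FIELDS, parts[:len(GENE_DB_FIELDS)]))
    # Assumption: 'partial' is carried by an extra trailing field.
    extra = parts[len(GENE_DB_FIELDS):]
    record["partial"] = any("partial" in p.lower() for p in extra)
    return record


record = parse_gene_db_header(">AE000658|TRAV4*01|Homo sapiens|F|V-REGION")
```

Here `record["allele"]` is `"TRAV4*01"` and `record["partial"]` is `False`, since the example header carries no "partial" marker.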