IMGT reference sequences
Definition and characteristics
Definition
IMGT reference sequences
are chosen on the basis of one or, whenever possible, several of the following criteria:
- germline sequence
- first sequence published
- longest sequence
- mapped sequence: "mapped" refers to sequences which have been obtained from clones
(phages, cosmids, YACs...) either by subcloning or PCR, and does not apply to sequences
obtained directly from genomic DNA. Note that "mapped" does not refer to chromosomal assignment.
For the immunoglobulins (IG) and T cell receptors (TR), IMGT reference sequences
are defined for the germline
V-GENEs,
D-GENEs,
J-GENEs, and for the
C-GENEs.
The IMGT reference sequence for a given allele with a partial L-V-GENE-UNIT (for V), D-GENE-UNIT (for D), J-GENE-UNIT (for J) or C-GENE-UNIT (for C) will be replaced by a complete sequence when one is available and fully annotated.
Characteristics
Characteristics of the IMGT reference sequences are according to the IMGT-ONTOLOGY concepts.
- Gene name and allele are according to the
IMGT gene name nomenclature
of the 'CLASSIFICATION' concept of IMGT-ONTOLOGY.
- They are described with standardized labels according to the rules of the
IMGT Scientific chart based on the 'DESCRIPTION'
concept of IMGT-ONTOLOGY.
- If gaps are inserted, these gaps are according to the
IMGT unique
numbering ('NUMEROTATION' concept of IMGT-ONTOLOGY).
Presentation
The presentation of the IMGT reference sequences is of three kinds:
IMGT/LIGM-DB reference sequences
They correspond to IMGT/LIGM-DB accession numbers for which any part of the sequence has been
defined as an IMGT reference sequence for one or more given genes. The IMGT/LIGM-DB reference sequences
can be accessed from:
- IMGT/LIGM-DB
- The IMGT/LIGM-DB reference sequences can be queried in the
IMGT/LIGM-DB Keywords module
(Keyword: "IMGT reference sequence"). After obtaining the results,
you can use the "Decrease" option and then, for instance, the "Taxonomy" module, to select more
precisely the required sequences (species, group, etc.).
- IMGT/GENE-DB
- The IMGT/LIGM-DB reference sequences of a given gene and its alleles are provided in each gene entry of
IMGT/GENE-DB
(available in October 2003 for human and mouse).
- IMGT Repertoire Gene tables
- The IMGT/LIGM-DB reference sequences are listed in
Gene tables
(available for species in IMGT Repertoire).
- Sequences from the same germline V-GENE, D-GENE, J-GENE, or from the same
C-GENE are assigned to
- a unique IMGT/LIGM-DB reference sequence if the sequences of the
V-REGION,
D-REGION,
J-REGION or
C-REGION they contain are identical
and therefore represent the same allele, according to
IMGT
allele nomenclature and sequence polymorphisms (accession numbers are shown on the same
line as the IMGT/LIGM-DB reference sequence, and are referred to as "Sequences from the literature"
in Gene tables).
- different IMGT/LIGM-DB reference sequences if the sequences of the
V-REGION, D-REGION, J-REGION or C-REGION they contain are different and therefore
represent different alleles (accession numbers are shown on different lines, under
the same gene name, in Gene tables).
- Note that identical nucleotide sequences from duplicated genes
are assigned to different IMGT/LIGM-DB reference sequences (accession numbers on different
lines, with different gene names, in Gene tables).
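The assignment rules above can be sketched in a few lines of Python. This is only an illustration, not IMGT's actual procedure: the `Record` type, the accession numbers, the gene names and the sequences are hypothetical. Records are grouped by gene name and region sequence, so identical regions of one gene fall into one allele group, different regions of one gene fall into different groups, and identical regions from duplicated genes stay separate.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Record(NamedTuple):
    """Hypothetical record: one sequence containing a V-, D-, J- or C-REGION."""
    accession: str
    gene: str
    region_seq: str


def group_by_allele(records: List[Record]) -> Dict[Tuple[str, str], List[str]]:
    """Group accession numbers by (gene name, region sequence).

    Each key stands for one allele of one gene. The gene name is part of
    the key, so identical sequences from duplicated genes are not merged.
    """
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for rec in records:
        groups[(rec.gene, rec.region_seq.lower())].append(rec.accession)
    return dict(groups)


# Toy data (all values are made up for illustration).
records = [
    Record("ACC0001", "GENE-A", "caggtgcag"),
    Record("ACC0002", "GENE-A", "caggtgcag"),  # same gene, same region: same allele
    Record("ACC0003", "GENE-A", "caggtacag"),  # same gene, different region: other allele
    Record("ACC0004", "GENE-B", "caggtgcag"),  # duplicated gene, identical region
]

groups = group_by_allele(records)
for (gene, seq), accessions in groups.items():
    print(gene, seq, accessions)
```

The sketch does not decide which accession in a group becomes the reference sequence; that choice follows the criteria given in the Definition section.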
IMGT/GENE-DB reference sequences
The IMGT/GENE-DB sequences correspond to the coding region sequences of the
Functional or
ORF
genes (V-REGION, D-REGION, J-REGION, C-REGION), isolated from the IMGT/LIGM-DB sequences.
By definition, there is one sequence for each Functional or ORF allele.
If the C-REGION is encoded by several exons, the sequence is given by exon.
IMGT/GENE-DB reference sequences are provided in FASTA format:
- nucleotide sequences with gaps according to the
IMGT unique numbering
- nucleotide sequences without gaps
- amino acid sequences with gaps according to the IMGT unique numbering
- amino acid sequences without gaps
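As a minimal sketch of how the gapped and ungapped forms relate, the Python snippet below derives an ungapped sequence from a gapped one. It assumes that gaps are written as dots ('.'), which is not stated in this page; the function names and the toy sequence are hypothetical.

```python
GAP_CHAR = "."  # assumption: IMGT gaps are written as dots in the FASTA files


def ungap(seq: str, gap_char: str = GAP_CHAR) -> str:
    """Return the sequence with every gap character removed."""
    return seq.replace(gap_char, "")


def gap_summary(seq: str, gap_char: str = GAP_CHAR) -> tuple:
    """Return (number of residues, number of gaps, total number of characters)."""
    residues = len(ungap(seq, gap_char))
    gaps = len(seq) - residues
    return residues, gaps, len(seq)


gapped = "cag...gtg"  # toy sequence, not taken from an IMGT file
print(ungap(gapped))
print(gap_summary(gapped))
```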
In order to facilitate BLAST searches for expressed
(spliced) sequences in IMGT/LIGM-DB,
and to increase interoperability with HGNC
and external generalist expression databases, IMGT/GENE-DB reference sequences
that comprise several exons will also be provided with the exons artificially joined.
Interoperability with genome databases:
- The IMGT/GENE-DB reference sequences for the allele *01 of the human and mouse IG and TR genes
(except for human IGLC1 and IGLC3, for which it is the allele *02) were sent to
HGNC on 13/02/2003.
IMGT reference directory sequences
The IMGT reference directory
sequences correspond to sequence fragments according to
IMGT Labels, isolated from the
Functional
and ORF
IMGT/LIGM-DB reference sequences, in which gaps are inserted according to the IMGT unique numbering
('NUMEROTATION' concept of IMGT-ONTOLOGY).
By definition, the IMGT reference directory sets contain one sequence for each allele.
Allele names of these sequences are shown in red in
Alignments of alleles.
Sets of the IMGT reference directory are used in
IMGT/V-QUEST and other IMGT tools.
All IMGT reference directory sets can be
downloaded in FASTA format.
FASTA header of IMGT reference directory sequences
The same IMGT coding label can be used for cDNA and genomic sequences. However, in the case of splicing frame 1 (sf1) or splicing frame 2 (sf2) (Aide-mémoire,
Splicing sites), the delimitations of the coding region in cDNA (based on codons) differ by one or two nucleotides from the 5' or 3' end of the corresponding exon in gDNA.
For that reason, the header of the downloadable IMGT reference directory sequences for coding regions indicates, in field 9, the number of nucleotides added in 5' (for example, +1) and, in field 10, the number of nucleotides added or removed in 3' (for example, -1) compared to the corresponding genomic label extracted from IMGT/LIGM-DB.
The FASTA header contains 15 fields separated by '|':
- 1. IMGT/LIGM-DB accession number(s)
- 2. IMGT gene and allele name
- 3. species
- 4. IMGT allele functionality
- 5. exon(s), region name(s), or extracted label(s)
- 6. start and end positions in the IMGT/LIGM-DB accession number(s)
- 7. number of nucleotides in the IMGT/LIGM-DB accession number(s)
- 8. codon start, or 'NR' (not relevant) for non coding labels
- 9. +n: number of nucleotides (nt) added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB
- 10. +n or -n: number of nucleotides (nt) added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB
- 11. +n, -n, and/or nS: number of nucleotides added, deleted, and/or substituted to correct sequencing errors, or 'not corrected' if sequencing errors have not been corrected
- 12. number of amino acids (AA): this field indicates that the sequence is in amino acids
- 13. number of characters in the sequence: nt (or AA)+IMGT gaps=total
- 14. partial (if it is)
- 15. reverse complementary (if it is)
Example:
>X03604|IGHG3*01|Homo sapiens|F|H1|g,901..950|51 nt|1|+1|-1| | |51+0=51| | |
gagctcaaaaccccacttggtgacacaactcacacatgcccacggtgccca
from the Homo sapiens IGHG3 alleles IMGT reference directory file
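The 15-field header described above can be split with a few lines of Python. This is a minimal sketch, not an official IMGT parser: the dictionary keys in `FIELD_NAMES` are hypothetical names chosen here for readability, and the code assumes the header ends with a trailing '|' as in the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical key names, one per field, in the order listed above.
FIELD_NAMES = [
    "accession", "gene_allele", "species", "functionality", "label",
    "positions", "length_nt", "codon_start", "nt_added_5prime",
    "nt_changed_3prime", "corrections", "length_aa", "length_total",
    "partial", "reverse_complement",
]


def parse_imgt_header(line: str) -> Dict[str, Optional[str]]:
    """Split an IMGT reference directory FASTA header into its 15 fields.

    Blank fields are returned as None.
    """
    if not line.startswith(">"):
        raise ValueError("not a FASTA header line")
    parts = line[1:].rstrip("\n").split("|")
    # A trailing '|' produces one extra empty element; drop it.
    if len(parts) == len(FIELD_NAMES) + 1 and parts[-1].strip() == "":
        parts = parts[:-1]
    if len(parts) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(parts)}")
    return {name: (value.strip() or None) for name, value in zip(FIELD_NAMES, parts)}


def split_length_field(field: str) -> Tuple[int, int, int]:
    """Parse field 13, e.g. '51+0=51', into (residues, gaps, total)."""
    left, total = field.split("=")
    residues, gaps = left.split("+")
    return int(residues), int(gaps), int(total)


header = ">X03604|IGHG3*01|Homo sapiens|F|H1|g,901..950|51 nt|1|+1|-1| | |51+0=51| | |"
sequence = "gagctcaaaaccccacttggtgacacaactcacacatgcccacggtgccca"

fields = parse_imgt_header(header)
residues, gaps, total = split_length_field(fields["length_total"])
print(fields["gene_allele"], fields["nt_added_5prime"], fields["nt_changed_3prime"])
print(total == len(sequence))
```

Field 13 gives a simple consistency check: the total should equal the number of characters in the sequence line that follows the header.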