IMGT Scientific chart

IMGT reference sequences

Definition and characteristics
- Definition
- Characteristics
Presentation

Definition and characteristics

Definition

IMGT reference sequences are chosen on the basis of one or, whenever possible, several of the following criteria:

germline sequence
first sequence published
longest sequence
mapped sequence: "mapped" refers to sequences which have been obtained from clones (phages, cosmids, YACs...) either by subcloning or PCR, and does not apply to sequences obtained directly from genomic DNA. Note that "mapped" does not refer to chromosomal assignment.

For the immunoglobulins (IG) and T cell receptors (TR), IMGT reference sequences are defined for the germline V-GENEs, D-GENEs, J-GENEs, and for the C-GENEs.

The IMGT reference sequence for a given allele with a partial L-V-GENE-UNIT (for V), D-GENE-UNIT (for D), J-GENE-UNIT (for J) or C-GENE-UNIT (for C) will be replaced by a complete sequence when one is available and fully annotated.

Characteristics

The characteristics of the IMGT reference sequences follow the IMGT-ONTOLOGY concepts.

Gene and allele names follow the IMGT gene name nomenclature of the 'CLASSIFICATION' concept of IMGT-ONTOLOGY.
The sequences are described with standardized labels according to the rules of the IMGT Scientific chart based on the 'DESCRIPTION' concept of IMGT-ONTOLOGY.
If gaps are inserted, they follow the IMGT unique numbering ('NUMEROTATION' concept of IMGT-ONTOLOGY).

Presentation

IMGT reference sequences are presented in three forms:

IMGT/LIGM-DB reference sequences

They correspond to IMGT/LIGM-DB accession numbers in which any part of the sequence has been defined as an IMGT reference sequence for one or more given genes. The IMGT/LIGM-DB reference sequences can be accessed from:

IMGT/LIGM-DB
- The IMGT/LIGM-DB reference sequences can be queried in the IMGT/LIGM-DB Keywords module (Keyword: "IMGT reference sequence"). After obtaining the results, you can use the "Decrease" option and then, for instance, the "Taxonomy" module, to select more precisely the required sequences (species, group, etc.).
IMGT/GENE-DB
- The IMGT/LIGM-DB reference sequences of a given gene and its alleles are provided in each gene entry of IMGT/GENE-DB (available in October 2003 for human and mouse).
IMGT Repertoire Gene tables
- The IMGT/LIGM-DB reference sequences are listed in Gene tables (available for species in IMGT Repertoire).
  - Sequences from the same germline V-GENE, D-GENE, J-GENE, or from the same C-GENE are assigned to
    - a unique IMGT/LIGM-DB reference sequence if the sequences of the V-REGION, D-REGION, J-REGION or C-REGION they contain are identical and therefore represent the same allele, according to IMGT allele nomenclature and sequence polymorphisms (accession numbers are shown on the same line as the IMGT/LIGM-DB reference sequence, and are referred to as "Sequences from the literature" in Gene tables).
    - different IMGT/LIGM-DB reference sequences if the sequences of the V-REGION, D-REGION, J-REGION or C-REGION they contain are different and therefore represent different alleles (accession numbers are shown on different lines, under the same gene name, in Gene tables).
  - Note that identical nucleotide sequences from duplicated genes are assigned to different IMGT/LIGM-DB reference sequences (accession numbers on different lines, with different gene names, in Gene tables).
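The assignment rule above can be sketched as a grouping by gene name and region sequence. This is only an illustration: the record layout, accession numbers, gene names and sequences below are invented, and real allele assignment follows the IMGT nomenclature rather than this simplified string comparison.

```python
from collections import defaultdict


def group_by_allele(records):
    """Group accessions by (gene name, region sequence).

    records: iterable of (accession, gene, region_sequence) tuples.
    Identical region sequences of the same gene share one group (one allele);
    identical sequences of different genes stay in separate groups.
    """
    groups = defaultdict(list)
    for accession, gene, region in records:
        # Compare sequences case-insensitively
        groups[(gene, region.lower())].append(accession)
    return dict(groups)


# Invented toy data, not real IMGT/LIGM-DB entries
records = [
    ("ACC1", "GENE-A", "caggtg"),  # first sequence of GENE-A
    ("ACC2", "GENE-A", "CAGGTG"),  # identical region: same allele as ACC1
    ("ACC3", "GENE-A", "caggtc"),  # different region: different allele
    ("ACC4", "GENE-B", "caggtg"),  # duplicated gene: kept separate
]
groups = group_by_allele(records)
for (gene, region), accessions in groups.items():
    print(gene, region, accessions)
```

Keying on the gene name as well as the sequence is what keeps identical sequences from duplicated genes on separate lines.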

IMGT/GENE-DB reference sequences

The IMGT/GENE-DB sequences correspond to the coding region sequences of the Functional or ORF genes (V-REGION, D-REGION, J-REGION, C-REGION), isolated from the IMGT/LIGM-DB sequences. By definition, there is one sequence for each Functional or ORF allele. If the C-REGION is encoded by several exons, the sequence is given by exon.

IMGT/GENE-DB reference sequences are provided in FASTA format:

nucleotide sequences with gaps according to the IMGT unique numbering
nucleotide sequences without gaps
amino acid sequences with gaps according to the IMGT unique numbering
amino acid sequences without gaps
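The relation between the gapped and ungapped forms can be sketched as follows. The sketch assumes that IMGT gaps are written as dots ('.') in the FASTA files; the function names and the toy sequence are illustrative and are not part of any IMGT tool.

```python
GAP = "."  # assumption: IMGT gaps appear as dots in the FASTA sequence


def remove_gaps(seq):
    """Return the sequence with all gap characters removed."""
    return seq.replace(GAP, "")


def count_summary(seq):
    """Return a 'residues+gaps=total' string for a gapped sequence."""
    residues = len(remove_gaps(seq))
    gaps = seq.count(GAP)
    return "{}+{}={}".format(residues, gaps, len(seq))


toy = "cag...gtg"  # invented sequence with three gap positions
print(remove_gaps(toy))
print(count_summary(toy))
```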

To facilitate BLAST searches for expressed (spliced) sequences in IMGT/LIGM-DB, and to increase interoperability with HGNC and external generalist expression databases, IMGT/GENE-DB reference sequences with several exons will also be provided with the exons artificially joined.

Interoperability with genome databases:

The IMGT/GENE-DB reference sequences for the allele *01 of the human and mouse IG and TR genes (except for human IGLC1 and IGLC3 for which it is the allele *02) were sent to HGNC on the 13/02/2003.

IMGT reference directory sequences

The IMGT reference directory sequences correspond to sequence fragments according to IMGT Labels, isolated from the Functional and ORF IMGT/LIGM-DB reference sequences, in which gaps are inserted according to the IMGT unique numbering ('NUMEROTATION' concept of IMGT-ONTOLOGY).

By definition, the IMGT reference directory sets contain one sequence for each allele. Allele names of these sequences are shown in red in Alignments of alleles.

Sets of the IMGT reference directory are used in IMGT/V-QUEST and other IMGT tools. All IMGT reference directory sets can be downloaded in FASTA format.

FASTA header of IMGT reference directory sequences
The same IMGT coding label can be used for cDNA and genomic sequences. However, in the case of splicing frame 1 (sf1) or splicing frame 2 (sf2) (Aide-mémoire, Splicing sites), the delimitations of the coding region in cDNA (based on codons) differ by one or two nucleotides from the 5' or 3' end of the corresponding exon in gDNA.
For that reason, the header of the downloadable IMGT reference directory sequences for coding regions indicates, in field 9, the number of nucleotides added in 5' (for example, +1) and, in field 10, the number of nucleotides added or removed in 3' (for example, -1) compared to the corresponding genomic label extracted from IMGT/LIGM-DB.
The FASTA header contains 15 fields separated by '|':

1. IMGT/LIGM-DB accession number(s)
2. IMGT gene and allele name
3. species
4. IMGT allele functionality
5. exon(s), region name(s), or extracted label(s)
6. start and end positions in the IMGT/LIGM-DB accession number(s)
7. number of nucleotides in the IMGT/LIGM-DB accession number(s)
8. codon start, or 'NR' (not relevant) for non coding labels
9. +n: number of nucleotides (nt) added in 5' compared to the corresponding label extracted from IMGT/LIGM-DB
10. +n or -n: number of nucleotides (nt) added or removed in 3' compared to the corresponding label extracted from IMGT/LIGM-DB
11. +n, -n, and/or nS: number of added, deleted, and/or substituted nucleotides to correct sequencing errors, or 'not corrected' if sequencing errors have not been corrected
12. number of amino acids (AA): this field indicates that the sequence is in amino acids
13. number of characters in the sequence: nt (or AA)+IMGT gaps=total
14. partial (if it is)
15. reverse complementary (if it is)

Example:
>X03604|IGHG3*01|Homo sapiens|F|H1|g,901..950|51 nt|1|+1|-1| | |51+0=51| | |
gagctcaaaaccccacttggtgacacaactcacacatgcccacggtgccca

from the Homo sapiens IGHG3 alleles IMGT reference directory file
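
The 15-field header described above can be split into named fields with a short script. This is a minimal sketch, not an IMGT tool: the field names are invented for readability, and the header string is the IGHG3*01 example written on one line, with field 9 taken as +1 (an assumption consistent with the leading "g," and the 51 nt length).

```python
# Illustrative names for the 15 header fields; not official IMGT identifiers
FIELD_NAMES = [
    "accession", "gene_allele", "species", "functionality", "label",
    "positions", "nt_in_accession", "codon_start", "added_5prime",
    "changed_3prime", "corrections", "aa_count", "char_count",
    "partial", "reverse_complement",
]


def parse_imgt_header(header):
    """Split a '|'-separated reference directory header into a dict."""
    parts = header.strip().lstrip(">").split("|")
    # A trailing '|' produces one extra empty element; drop it
    if len(parts) == len(FIELD_NAMES) + 1 and parts[-1] == "":
        parts = parts[:-1]
    if len(parts) != len(FIELD_NAMES):
        raise ValueError("expected 15 fields, got {}".format(len(parts)))
    return {name: value.strip() for name, value in zip(FIELD_NAMES, parts)}


def parse_char_count(value):
    """Turn 'residues+gaps=total' (field 13) into three integers."""
    residues, rest = value.split("+")
    gaps, total = rest.split("=")
    return int(residues), int(gaps), int(total)


# Example header on one line; field 9 (+1) is an assumption, see above
EXAMPLE = (">X03604|IGHG3*01|Homo sapiens|F|H1|g,901..950|51 nt|1|+1|-1"
           "| | |51+0=51| | |")

fields = parse_imgt_header(EXAMPLE)
gene, allele = fields["gene_allele"].split("*")
print(gene, allele, fields["functionality"], fields["label"])
print(parse_char_count(fields["char_count"]))
```

Empty fields (for example "partial" when the sequence is complete) come back as empty strings, so callers can test them directly.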