A simple method to predict protein functions and interactions from aligned sequences; application to MHC superfamily and beta2-microglobulin

Duprat E., Lefranc M.-P. and Gascuel O.

Bioinformatics, 2006, 22, 453-459. Epub 2005 Dec 13
PMID: 16352655

Motivation: The MHC superfamily (MhcSF) comprises not only the MHC class I (MHC-I) proteins of the immune system, but also proteins with an MHC-I-like structure that are involved in a wide variety of biological processes. Noncovalent binding of beta2-microglobulin (B2M) to MHC-I proteins is required for their surface expression and function, whereas MHC-I-like proteins may or may not interact with B2M. This study aims to predict B2M binding (or non-binding) for newly identified MhcSF proteins, in order to decipher their function, understand the molecular recognition mechanisms, and identify deleterious mutations. IMGT standardization of MhcSF protein domains provides a unique numbering of the multiple alignment positions, and thus the conditions needed to develop such a predictive tool.

Method: We combine a simple Bayes classifier with the IMGT unique numbering. Our method comprises two steps: (i) selection of discriminant binary features, each of which associates an alignment position with an amino acid group; (ii) learning of the classifier by estimating the frequencies of the selected features conditional on the B2M binding property.
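
The two steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the amino acid groups, the selection criterion (absolute difference of feature frequencies between the two classes), the Laplace smoothing parameter `alpha`, and the toy four-residue sequences are all assumptions made for illustration. Positions here are plain 0-based indices into already aligned sequences, standing in for the IMGT unique numbering.

```python
import math

# Hypothetical amino acid groups; the grouping used in the paper is not given here.
GROUPS = {
    "hydrophobic": set("AVLIMFWC"),
    "polar": set("STNQYGP"),
    "charged": set("DEKRH"),
}

def feature_value(seq, feature):
    """Binary feature: does the residue at `pos` belong to `group`?"""
    pos, group = feature
    return seq[pos] in GROUPS[group]

def frequency(seqs, feature):
    """Fraction of sequences for which the feature is true."""
    return sum(feature_value(s, feature) for s in seqs) / len(seqs)

def select_features(seqs, labels, k):
    """Step (i), assumed criterion: keep the k features whose frequency
    differs most between binding and non-binding sequences."""
    binders = [s for s, y in zip(seqs, labels) if y]
    others = [s for s, y in zip(seqs, labels) if not y]
    candidates = [(p, g) for p in range(len(seqs[0])) for g in GROUPS]
    candidates.sort(
        key=lambda f: abs(frequency(binders, f) - frequency(others, f)),
        reverse=True,
    )
    return candidates[:k]

def train(seqs, labels, features, alpha=1.0):
    """Step (ii): estimate class priors and smoothed per-class feature frequencies."""
    prior, cond = {}, {}
    for c in set(labels):
        members = [s for s, y in zip(seqs, labels) if y == c]
        prior[c] = len(members) / len(seqs)
        for f in features:
            hits = sum(feature_value(s, f) for s in members)
            cond[(c, f)] = (hits + alpha) / (len(members) + 2 * alpha)
    return prior, cond

def predict(seq, features, model):
    """Return the class with the highest log-posterior under feature independence."""
    prior, cond = model
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for f in features:
            p = cond[(c, f)]
            score += math.log(p if feature_value(seq, f) else 1.0 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Toy aligned sequences (invented); True means "binds B2M".
seqs = ["AKDS", "VKDT", "LRES", "SADK", "TGEA", "NADR"]
labels = [True, True, True, False, False, False]

selected = select_features(seqs, labels, k=2)
model = train(seqs, labels, selected)
pred_binder = predict("IKDS", selected, model)
pred_non_binder = predict("SGDK", selected, model)
```

Working in log space avoids numerical underflow when many features are multiplied, and the smoothing keeps a feature never observed in one class from forcing a zero probability.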

Results: Our dataset contains the aligned sequences of 806 allelic forms of 47 MhcSF proteins, corresponding to 9 receptor types and 4 mammalian species. Eighteen discriminant features are selected; they belong to B2M contact sites or stabilize the molecular structure required for this contact. Three leave-one-out procedures are used to assess classifier performance, corresponding to B2M binding prediction for: (1) new proteins, (2) species not represented in the dataset, (3) new receptor types. High prediction accuracy is obtained: 98%, 94% and 70%, respectively. Application of our classifier to MHC-I proteins of lower vertebrates indicates that these proteins bind B2M and should therefore be expressed on the cell surface by a process similar to that of mammalian MHC-I proteins. These results demonstrate the usefulness and accuracy of our (simple) approach, which should apply to other function or interaction prediction problems.
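
One plausible reading of the three leave-one-out procedures is that all sequences sharing a given protein, species, or receptor type are held out together, and the classifier is trained on the remainder. The sketch below illustrates that reading only; the function names, the toy labels, and the majority-vote stand-in classifier are hypothetical and are not taken from the paper.

```python
from collections import Counter

def leave_one_group_out(samples, labels, groups, fit, predict):
    """Hold out every group in turn, train on the rest, and return
    the overall fraction of held-out samples predicted correctly."""
    correct = 0
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        correct += sum(predict(model, samples[i]) == labels[i]
                       for i in test_idx)
    return correct / len(samples)

# Stand-in classifier (assumption): always predict the majority training label.
def fit_majority(train_samples, train_labels):
    return Counter(train_labels).most_common(1)[0][0]

def predict_majority(model, sample):
    return model

# Invented toy data: sequence identifiers, B2M binding labels, and species.
samples = ["s1", "s2", "s3", "s4", "s5", "s6", "s7"]
labels = [True, True, True, True, False, True, False]
species = ["human", "human", "mouse", "mouse", "mouse", "rat", "rat"]

accuracy = leave_one_group_out(samples, labels, species,
                               fit_majority, predict_majority)
```

Passing a list of protein names or receptor types as `groups` instead of species gives the other two procedures; the `fit` and `predict` callables can be replaced by any classifier with the same calling convention.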