Marsupials and monotremes possess a novel family of MHC class I genes that is lost from the eutherian lineage

Sensitive peptide searches for MHC class I proteins

We first set out to identify all annotated MHC class I proteins in 15 representative
species sampled from across vertebrate life. The selected species comprised human,
mouse, dog, cow, three species of marsupials with sequenced genomes, platypus, three
avian species, a lizard, a frog, and two fish species. Additionally, we selected 4
eukaryotic species known to lack MHC class I genes as negative controls (lamprey,
sea squirt, fruitfly and yeast). Predicted protein sequences from these species were
obtained from Ensembl and searched using profile HMMs representing the MHC class I
APD and the C1-type Ig domain, which are characteristic of MHC class I genes, and
the MHC class II ? domain, with HMMer. The separate domain searches were integrated
and MHC class I proteins predicted using a simple heuristic: proteins were annotated
as predicted MHC class I proteins if they had a significant match to the MHC class
I APD or a weak match to the APD and a significant match to the Ig domain model in
the correct order, with the additional requirement that the MHC class I APD model
matched with higher score than the class II ? domain model. MHC class I genes frequently
encode multiple isoforms; in these cases, we selected the longest protein as the representative
protein. A variety of HMMs were tested (e.g. PFAM, SUPFAM and iteratively constructed
custom models; see Methods for details) and the most sensitive combination was adopted.

Our search identified 348 MHC class I proteins across the 15 jawed vertebrate species
searched (summarized in Table 1). This included all 24 known human and 41 mouse MHC class I proteins with no false
positives. Searches of several negative controls—lamprey, sea squirt, fruitfly and yeast—did not identify any MHC class I proteins. Aligning all PFAM-A domain models to the
set of predicted MHC class I proteins using hmmpfam showed that for each protein the
strongest matches consisted only of the MHC class I APD, Ig and in some cases the
conserved MHC C-terminal domains, with no other unexpected high quality matches. MHC
class II genes were never misidentified as class I genes in the searches of any jawed
vertebrate protein databases. Taken together these observations indicate the approach
has high sensitivity and specificity.

Table 1. Summary of the number of MHC class I genes across species. The number MHC class I
genes identified in each species by searching annotated proteins using customized

Sensitive genome searches for MHC class I genes

Next, we set out to identify any unannotated MHC class I genes in these genomes using
a highly sensitive search method designed to take advantage of the conserved exon/domain
organisation of MHC class I genes (Fig. 1a). Profile HMMs representing the MHC class I APD, C1-type Ig, MHC C-terminal, and
MHC class II ? domains were used to search the six-frame translation of each genome.
The domain matches in the 6-frame translation were transformed back to genomic coordinates
and the ?
1
, ?
2
, Ig and C-terminal domains within the model matches were identified. In each species,
we found thousands of matches to these domains (summarized in Additional file 1: Table S3). For example in the opossum genome, we found 2127 matches to the ?
1
domain, 3571 matches to the ?
2
domain, 5028 matches to the Ig domain and 5546 matches to the MHC C-terminal domain.
The majority of these matches had low scores. However, both isolated and clustered
high scoring matches were also apparent (Fig. 1b). Genomic features matching the expected structure of an MHC class I gene, that is
a chain of ?
1,
?
2
and Ig domains and optionally a C-terminal domain on the same strand and at intron-like
distances (for example Fig. 1c) were identified by aligning a canonical model of an MHC class I gene, taking match
score and the gaps between domains into account (Fig. 1d, and Materials and methods for details). Once again, a variety of HMMs were tested
(e.g. PFAM, SUPFAM, and custom models based on the protein search results; see Methods
for details). The custom models were adopted as the most sensitive.

From the 388,409 domain matches across all species, the genome search identified 361
genomic features possessing the MHC class I gene structure (summarized in Table 1; Additional file 2: Table S4 for details). These included 26 putative MHC class I genes in the human
genome, 49 in mouse, and 40 in the opossum. Again, searches of the negative controls
identified no MHC class I genes, as one would expect. These genomic features included
annotated genes, and both annotated and unannotated pseudogenes. Merging the protein
and genome searches produced a total of 449 MHC class I genes and proteins across
the species searched (Additional file 3: Table S5), including a total of 33 in human, 55 in mouse and 47 in the opossum.

The most dramatic differences between the results of searching annotated class I proteins
and an unbiased search of the whole genome arose in the marsupials and monotremes.
The annotation of the opossum genome (Ensembl Release 75) contains 28 MHC class I
genes, but 40 putative MHC class I genes (genomic features with structural similarity
to MHC class I genes) were identified in the sensitive genome search results. Seven
of the annotated proteins were missed in the genome search, as the corresponding loci
lack Ig domains. Fifteen of the loci identified by the genome search were unannotated
in the Ensembl genebuild. In some cases, de novo gene predictions from genscan or
evidence-based prediction with N-scan (UCSC Genome Browser, accessed 17 April 2015)
did identify overlapping open reading frames, however, these annotations were typically
of poor quality (data not shown) with multiple run-on annotations linking two or more
MHC class I gene features. Five of these unannotated features contained in-frame stops,
including opossum CD1, UH, and a MIC-like gene (MIC2). These in-frame stops may be due to sequencing errors in the draft opossum genome,
polymorphisms in the individual sequenced or the fact that our model does not take
splice sites into account and may erroneously include short segments of intronic sequence
in the domain matches, resulting in the genomic feature going out of frame. In fact,
CD1 is known to be a pseudogene in opossum 44] and does not show evidence of transcription; MIC2 also shows no evidence for transcription;
while UH does show evidence of transcription (data not shown). Consequently, we retain
all genes in our analyses. Thus, a total of 47 putative MHC class I genes were identified.
A similar pattern emerged in other marsupial and monotreme genomes.

Phylogenetic analysis

To annotate these genes and understand the evolutionary relationship between them,
we inferred the phylogenetic relationships between all MHC class I genes identified
in the selected vertebrates using a Markov Chain Monte Carlo (MCMC) method on the
JTT?+?IGF model. Four MCMCs were run (see Additional file 1: Figure S2 for traces of posterior probability) and the consensus tree from the last
500 steps of each run was taken to represent the evolutionary history of the genes
(Fig. 2a). Additionally, a smaller phylogeny consisting of just human and opossum class I
genes and the mouse Mill genes was also inferred by maximum likelihood (Additional
file 1: Figure S3).

Fig. 2. Phylogeny of class Is predicted in representative species spanning the jawed vertebrates
estimated by MCMC on the JTT?+?IG model. Numbers at nodes represent the frequency
with which that split is observed. Gene families are labelled around the outside.
The species label shows the location of classical MHC class I for each species or
group of species. Key gene families or species’ classical class I genes are highlighted
in colour

While support in parts of the trees is low, the phylogenies provide a number of insights
into the evolution of MHC class I genes in vertebrates. The large tree provides additional
evidence for the previous observation that the non-classical MHC class I gene family
MR1 is found only eutherians and marsupials 45]. Similarly, it suggests that the FCGRT, HFE and AZGP1 gene families are specific to eutherians and marsupials. It demonstrates that the
PROCR gene family is found across the amniotes. It suggests that MIC is duplicated in opossum
(md_chain40), though this contains in-frame stops. The small tree supports the previous
observation that marsupials may have a member of the ULBP gene family (ENSMODG00000015798) 46]. It identifies a possible expansion of AZGP1 in opossum (ENSMODG00000024063, ENSMODG00000027380, ENSMODG00000028158, and ENSMODG00000029679).
The phylogenies also reveals two new opossum MHC class I genes that are located in
the MHC, but have not previously been identified, which we have denoted UA3 and UA4. These appear to be closely related to UA1 and UA2.

Strikingly, the phylogenetic tree identifies an extensive and entirely novel clade
of MHC class I genes in marsupials and monotremes, which we have named UT. There are 17 UT family genes identified in the opossum genome, 9 in tammar wallaby, 13 in the Tasmanian
devil and 7 in the platypus. The numbering of UTs is based on location in the gene cluster in the opossum and clear orthology, or
lack of it in other marsupials. Platypus UTs are numbered independently as these appear
to form a distinct clade. This is highlighted by the UT gene tree (Additional file
1: Figure S4), which was estimated using maximum likelihood with the JTT?+?IGF model
and reconciled with the species tree using NOTUNG. No UTs were identified outside of the marsupials and monotremes in our searches.

Chromosomal location

The UT family of MHC class I genes is encoded in a gene cluster on chromosome 1 in the opossum
genome (Fig. 3). This region is approximately 460 kilobases in size. Interestingly, the cluster
is located at an evolutionary breakpoint and is flanked by genomic regions that share
synteny with different chromosomes in human (chr2 and chr20) and mouse (chr6 and chr2).
The tammar wallaby genome assembly (Meug1.0) is highly fragmented and scaffolds are
not mapped to chromosomes. Fluorescence In-Situ Hybridization (FISH) shows that the
UT gene cluster is also located on chromosome 1 in the tammar wallaby genome (Fig. 4), as predicted by conserved synteny between the tammar and opossum 47]. Interestingly, the FISH also shows a signal on the tammar Y chromosome. As all marsupial
genomes sequenced were female, this locus was not detected in genome-wide searches
and the significance of this signal is not yet understood. Based on the digital karyotype
of the Tasmanian devil 22], the UT gene family is also located on chromosome 1.

Fig. 3. Comparative map of the UT gene cluster. a Genomic region containing the UT cluster in opossum showing the non-synteny of flanking genes between the opossum
and human (hs) and mouse (mm) genomes. b Comparative map of UT cluster in opossum, tammar wallaby and Tasmanian devil. The fill colour summarises
the evidence for expression: RT-PCR in opossum thymus detected (green) and not detected
(red); 454 sequencing data (blue); in-frame stop (yellow); not tested (white)

Fig. 4. Fluorescence In-Situ Hybridisation showing location of UT cluster on chr1 in the tammar wallaby. A signal is also observed on the Y chromosome

Sequencing and gene expression

Of the 17 putative opossum UT genes, the expression of 8 genes, consisting of UT4, UT5, UT6, UT8, UT9, UT10, UT15, and UT17, was confirmed in opossum thymus using RT-PCR (Additional file 1: Figure S5). Predicted sequences obtained from our sensitive search method were confirmed
using RT-PCR to obtain amplicon sequences from within exons 2 and 3 (Additional file
4: Table S6). A further 4 UT loci, UT2, UT3, UT7, and UT16, were confirmed as expressed in Roche 454 sequencing data from an opossum thymus
cDNA library (data not shown).

Transcription of tammar wallaby UT8, UT15, UT16, UT17, UT18, UT19, UT20, and UT21 was confirmed in 454 data from thoracic and cervical tammar thymus cDNA libraries.
There was support for Tasmanian devil UT2, UT8, and UT11 in 454 cDNA data from devil spleen, but no UTs were detected in a lymph node library.

Limited transcriptome sequence data is available from immune tissues of other species
of marsupial. An EST (id: 161106CS44009845FFFFF) from brushtail possum immune tissues
with homology to meUT2 was also identified, providing support for the existence of functional UTs in the possum. No UT transcripts were detected in BLASTN searches of 1318 bandicoot thymus EST library,
probably due to the small size of this library. No platypus immune tissues transcriptome
data was available.

Homology mapping

To investigate the function of UT family members we predicted the protein structure of selected UTs (opossum UT4, UT5 and UT8) using homology modelling with the I-TASSER method 43]. Protein structures from the Protein Data Bank (PDB) that were closest to the predicted
models comprised both classical and non-classical MHC class I genes from chicken,
cow, mouse and human (Additional file 1: Table S7; Fig. 5a for an annotated sequence alignment of 7 of the top matches). The structure of the
classical chicken MHC class I protein B21 (3BEV 48]) was the best match for UT8 and appeared in the top 5 templates for all UTs examined. The backbone structural alignment of UT4 with 3BEV and 3P73 49], the top 2 structural analogs for UT4 and UT8, are shown in Fig. 5b. The peptide-binding grooves of UT4, 5 and 8 are shown in Fig. 5c.

Fig. 5. Predicted structure of UT proteins. a Sequence alignment of the ?
1
and ?
2
domains of UT4, 5, 6, 8 with 7 of the top 10 structural analogs from Protein Data Bank (PDB) identified by
I-TASSER. Orange bars show ?-helices. Green arrows highlight ?-strands. Major differences
between the UTs and templates are indicated by arrows, or red lines, and the consequence of these
on the protein structures are shown in Fig. 5c. b Overlay of the backbone of the peptide binding groove of mdUT4 with its top 2 structural analogs, 3BEV_A (ggB21) and 3P73_A (ggYF1). c Superposed model structures of mdUT4, 5 and 8 with the B21 template shows the antigen-binding groove is open, but possibly short. Filled sphere
view shows the ?-helices and ribbons show the ?-sheets of the peptide-binding platform
on the modeled protein structures. The residues indicated with arrows in Fig. 5a cause the binding grooves to be short or narrow (e.g. UT4: the distance between Pro81 C? to Tyr144 OH is 3.0 Å; UT5: the distance between Pro81 C? to Tyr144 OH is 2.9 Å, and a close hydrophobic contact
between
64LTQW 67
and
161MNLY 154
; UT8: the distance between Phe75 C? to Phe138 C? is 2.9 Å)