Characterization of fossilized relatives of the White Spot Syndrome Virus in genomes of decapod crustaceans

Viral sequences in the crab genomic library

Genome assembly

De novo assembly of the M. depressus genomic library resulted in 563 contigs (max. contig size 32,245 bp; N50?= 1,931 bp)
including 20,563 reads. Of these, 13 contigs were chosen for semi-manual scaffolding.
These contigs were composed of 14,972 reads (8.01 % of the whole genomic library)
and the mean coverage achieved 34.1x per site after read mapping. Three scaffolds
with lengths of 100,139, 62,154 and 46,520 bp (208,813 bp in total, GC content 44.1 %)
successfully incorporated all 13 contigs of putative viral origin. Nevertheless, considerable
variation was observed between different reads in terms of single nucleotide polymorphisms
and length variation of STRs. Most of the low-coverage and contig junction regions
were confirmed with PCR and Sanger sequencing with the exception of one low-coverage
region on the scaffold III (primer pairs C9:1 and C9:2: see Additional file 1: Table S1). It was also not possible to fill gaps between the scaffolds.

Reconstructed ORFs

Overall the three scaffolds encompass 67 different ORFs (or their fragments) with
homology to WSSV (Table 1, see also Additional file 1: Table S2). Unlike all other ORFs, the homolog of wsv294 was initially identified
based on positional evidence only: no BLAST hits with an E-value threshold of 0.01
were obtained for its amino acid sequence, but its conserved position between homologs
of wsv295 and wsv293a and moderate similarity to wsv294, identified via pairwise alignment
justify the homologization. In addition, one ORF showed high similarity to inhibitors
of apoptosis proteins (IAPs) from a number of animal species, but not to known WSSV
proteins (see below). Alignments of the annotated ORFs with their WSSV homologs suggest
that most of them are intact and of full length (see Table 1). The cases of internal frameshifts tested with Sanger sequencing (21 positions)
were exclusively associated with either common 454-sequencing artifacts (i.e. homopolymer
effects: 33])) or polymorphic STR. In seven cases the predicted protein sequences were considerably
shorter than their counterparts in the WSSV genome, of which five were confirmed independently
with conventional Sanger sequencing. One of the confirmed truncated ORFs (homolog
of wsv293a), if functional, seems to utilize a non-canonical starting codon ACG.

Table 1. Orthologs of WSSV genes identified on the three viral scaffolds of Metopaulias depressus

Thirty-six (53.7 %) of the reconstructed genes are homologous to previously functionally
characterized genes of WSSV (and other organisms in the case of IAP. Out of the 21
non-structural genes reported for WSSV 1], 2], 8], 34], 35] twelve were found on the reconstructed Metopaulias scaffolds. These include most of the proteins involved in replication and nucleotide
metabolism: the DNA polymerase, helicase, non-specific nuclease, dUTPase and both
subunits of the ribonucleotide reductase, while no homologs of the chimeric thymidine/thymidylate
kinase and the thymidylate synthase were found. Additionally, homologs to both homologous-region-binding
proteins were identified: one of them (wsv021: 36]) is represented by a considerably truncated ORF in M. depressus, while the second one (wsv078: 34]) is presumably of full length (see Table 1).

Out of the 17 major virion proteins currently recognized in WSSV 1] we found homologs of genes coding for nine of them: VP664, VP160B, VP51C (nucleocapsid);
VP95, VP39A, VP26 (tegument); and VP53A, VP28, VP19 (envelope). An intact ORF with
a predicted amino acid sequence similar to the collagen protein from WSSV (wsv001,
VP1684) was also identified, but contained much less of the tandem collagen tripeptide
(Gly-Xaa-Yaa) repeats (388 vs. 26 units).

Among the 550 unscaffolded contigs, 344 had BLASTx hits when searched for the protein
sequences from WSSV. Besides the homologs of WSSV genes already discovered on the
scaffolds, four homologs to additional genes were identified after filtering (Table 2). None of these WSSV proteins were previously functionally characterized.

Table 2. Fragments homologous to WSSV ORFs identified only on non-scaffolded viral contigs
from Metopaulias

Analysis of the BLASTx hits for all contigs taken individually revealed that the vast
majority of WSSV homologs were represented by more than one fragment (median: 3; 5-95 %
quantiles: 1–19). Although the relative contribution of contig overlaps, mis-assembly
and true multiple insertions could not be clearly disentangled, there were seven outliers
with more than 19 fragments each, representing different parts or sequence variants
of the respective ORFs (Table 3).

Table 3. Homologs of WSSV genes represented by more than 20 fragments on scaffolds and/or unscaffolded
contigs

Interspersed repeats

A screen for interspersed repeats was performed both for the WSSV genome and the M. depressus viral scaffolds (Fig. 1). Most of the repeated regions in WSSV correspond to the previously identified “homologous
regions” (hrs) (see 35]). The M. depressus scaffolds possess significantly shorter, but one order of magnitude more abundant
repetitive elements. Nevertheless, no repeats similar to the hrs of WSSV in sequence or structure could be identified.

Fig. 1. Frequency distributions of lengths of individual units in families of interspersed
repeats as identified by RECON in the genome of WSSV and scaffolded viral sequences
from Metopaulias depressus

The longest identified interspersed repeat (see Fig. 1) corresponds to a region on scaffold III intersecting intact homologs of wsv327 and
wsv147 interrupted by a fragment homologous to wsv191, and a similar region with broken
copies of all three ORFs on scaffold II. Among WSSV homologs represented multiple
times the homologs to wsv327 and wsv191 were detected at least thrice: wsv327 with
five fragments of varying lengths, one of them being the intact ORF neighboring a
homolog of wsv332 (the adjacent ORF in the WSSV genome); and wsv191 (the endonuclease)
represented by three different copies, one of them being the intact ORF neighboring
a homolog of wsv192.

Transposon insertions

Fragments of transposon proteins were identified in 36 contigs, with the majority
of them belonging to reverse transcriptase from jockey-like retrotransposons (data
not shown). In 21 cases these contigs contained also fragments similar to WSSV ORFs
(Fig. 2). One of the contigs containing a transposon-like sequence (3’-terminal fragment
of the piggyBac ORF) was incorporated at the end of scaffold I (Fig. 2b). Abrupt coverage drop-out indicates that the transposon-like sequence is not present
in all of the copies of the respective region, which was confirmed by a manual inspection
of alternative assembly paths. Based on this information the region beyond the coverage
decline point was discarded during preparation of the final scaffold sequence.

Fig. 2. Examples of contigs containing transposons in Metopaulias depressus with respective per-base coverage graphs shown above. (a) Contig c152 (partly incorporated in scaffold I) with a fragment homologous to wsv035
and a 3’-fragment of piggyBac transposase. (b) Contig c14 (not scaffolded) with a fragment similar to reverse transcriptase of
jokey-like transposons and a fragment with homology to wsv360

Prevalence of WSSV-like sequences in Metopaulias individuals

Our PCR assay was designed on the basis of the homologs of wsv360 (VP664) and wsv514
(the DNA polymerase) represented on the scaffolds. This choice was motivated by the
functional significance of these genes in WSSV and apparent lack of variation in the
respective fragments in the reference specimen (but see below). The PCR test was positive
for all 22 DNA samples from Metopaulias individuals from across Jamaica (Additional file 2: Figure S1). For twelve individuals the DNA polymerase fragment was Sanger-sequenced
in both directions to produce 715-bp sequences. No nucleotide position was variable
among individuals, but at least one position appeared to be polymorphic in each one
of the sequences: a synonymous A/G variation at position 111, which was also confirmed
for the reference specimen (Additional file 2: Figure S2).

WSSV-like sequences in the shrimp genomic libraries

Although the inventories of the WSSV-like genes were provided in the original publications
17], 18], we decided to perform a new analysis similar to the one performed for the M. depressus library, primarily to clarify gene boundaries and extract protein sequences. For
the P. monodon repetitive families our BLASTx search yielded 76 hits (of which 48 were reported
in the original study) to 48 different WSSV genes (see Additional file 1: Table S3). Seven hits to seven WSSV genes were obtained for the P. japonicus BAC library, all of which were originally reported and numbered (genes 11, 13–18)
(see Additional file 1: Table S4).

Additionally, in both libraries single dUTPase ORFs and several IAP genes were confirmed
in locations indicated in the original publications. All genes were represented by
single continuous ORFs with the exception of one IAP gene from the P. japonicus BAC library, which we interpret as consisting of at least four exons. In contrast
to Koyama and co-authors 17], we consider it separately from the IAP gene next to it and designate them as genes
06A and 06B respectively (see below).

WSSV-like sequences in EST libraries

A conservative search for WSSV-like sequences in all non-human and non-mouse EST libraries
available in the NCBI EST database revealed 12 positive hits with similarity to WSSV
on the protein level only. These hits were distributed among four species, all of
which belong to decapod crustaceans (see Fig. 3).

Fig. 3. Decapod species with fossilized WSSV-like viruses (marked red), with WSSV-like sequences
in ESTs libraries (yellow) or with evidence of absence of WSSV-like sequences in genomes
(green). The cladogram to the left is a simplification of the phylogenetic relationships
between the species considered based on (Tsang et al. 2008; Ma et al. 2011; Tsang
et al. 2014), all other decapod groups are omitted. The numbers of WSSV-like ESTs
discovered in our BLAST analysis are provided in parentheses

Cross-comparison of WSSV and WSSV-like genomes

Diversity and similarity of proteins with homology in WSSV

Out of the 48 unique WSSV homologs in the P. monodon library, 38 (79 %) were found in the M. depressus scaffolds (Fig. 4, Additional file 3). Under the hypothesis of a common ancestor with a genome composition identical to
that of WSSV, the probability of obtaining at least this number of matches by chance
from the set of 106 WSSV genes with known function or homology equals to 0.0017. Furthermore,
if the gene set is expanded to the 180 genes currently hypothesized to be encoded
by the WSSV genome 1], 8] the probability decreases to 3.64?×?10
?12
. Of the seven genes with WSSV homology from the P. japonicus BAC clone all are identifiable in the other shrimp library and only one (wsv306)
is missing from M. depressus (see Additional file 3). BLASTx searches revealed no genes shared by the shrimp repetitive
sequences and crab scaffolds but absent from the WSSV genome, with the noticeable
exception of IAPs (see below).

Fig. 4. Similarities between WSSV-like genes in the crab and shrimp genomic libraries visualized
as a hive-plot. The three axes correspond to 1) genes from WSSV with a reported function
(Leu et al., 2009) and/or homology to sequences from the decapod libraries; 2) homologs
of WSSV-genes from the Metopaulias library and 3) homologs of WSSV-genes identified in the repetitive sequences from
Penaeus monodon (Huang et al., 2011) and P. japonicus (Koyama et al., 2010). Genes on the axes are arranged in the order on the WSSV-CH
genome. Lines connecting homologs are colored according to amino acid similarities
(see Additional file 3). When multiple alignments were available for the same gene only one was chosen.
In cases of multiple paralogs in the penaeid shrimp libraries only one representative
was taken with the preference given to proteins from P. japonicus

The proteins from the shrimp libraries appear significantly less similar to their
homologs in the two other gene sets in comparison to identity values between WSSV
and M. depressus with an average difference in p-distances of 0.2 (Fig. 4, Table 4). Nonetheless all three sets of pairwise distances are strongly correlated with each
other (see Table 4).

Table 4. Comparisons of p-distances between 38 proteins shared by the WSSV genome (W), M. depressus scaffolds (M) and penaeid libraries (P): summary statistics, correlation coefficients
and results of Wilcoxon signed rank tests. Asterisks indicate significance levels
after Holm’s correction: * – p??0.05, **** – p??0.0001

Many of the WSSV-homologs are found several times on different fragments in the P. monodon library. Due to the fact that many of them are represented by incomplete gene sequences,
the exact number of genes with several WSSV paralogs could not be estimated. However,
22 pairs of overlapping fragments representing 19 unique genes could be determined
(see Additional file 3) with the mean (±SD) p-distance of 0.687?±?0.0728. No unambiguous cases of more than two overlapping paralogs
were found. Twelve fragments from P. monodon could be compared to the seven WSSV homologs from the P. japonicus library with the resulting mean p-distance of 0.593?± 0.2340.

Synteny between WSSV and M. depressus scaffolds

The gene order on the Metopaulias viral scaffolds was compared to the WSSV genome annotations (Fig. 5). Overall, 14 syntenic blocks covering 43 genes were identified. Half of the blocks
include just single pairs of neighboring genes, while the largest block spans eight
genes (wsv306-wsv324). No cases of inversions were identified in individual blocks:
neighboring genes show consistently the same orientations in both genomes.

Fig. 5. Synteny conservation between the WSSV genome (represented by WSSV-CH) and Metopaulias depressus viral scaffolds. The scaffolds are labeled with roman letters, with scaffold III
being broken at two points of ambiguous joins (see Material and Methods for details).
Lines are colored according to respective Metopaulias scaffolds

Analysis of breakpoints

In order to enable comparison of gene orders with the data from the shrimp libraries,
we utilized a breakpoint-oriented approach. For the set involving WSSV, M. depressus scaffolds and P. monodon repeat families, 50 pairs of neighboring genes could be inspected for their presence/absence.
Resulting breakpoint distances were nearly identical for all three pairs of viral
genomes: WSSV genome vs. crab scaffolds – 28 breakpoints, WSSV genome vs. shrimp library
– 28 breakpoints, crab scaffolds vs. shrimp library – 30 breakpoints (no statistical
test performed). In total only seven neighboring gene pairs were shared by all three
gene sets (the “-” sign denotes genes on the opposite strand): wsv023/wsv021, -wsv327/wsv332,
wsv306/wsv308, wsv293a/wsv289, wsv139/wsv137, -wsv427/wsv433 and wsv433/wsv440. Due
to the low number of genes n the fragment containing WSSV-like sequences from P. japonicus, its addition decreased the number of gene pairs amenable to the analysis to ten
with only a single pair of neighbors shared by all four gene sets: -wsv327/wsv332,
which was also the sole pair shared by P. japonicus and P. monodon.

Phylogenetic analyses

DNA polymerase

Single genes coding for DNA polymerases were found in WSSV and in the M. depressus library, while the P. monodon library contained two distinct polymerases. Phylogenetic analysis shows three well-supported
major clades corresponding to polymerases from 1) eukaryotes and three viral groups;
2) archaeans, Myoviridae and a halovirus; 3) and the WSSV and WSSV-like polymerases
(Fig. 6). Within this last clade the two polymerases from P. monodon do not cluster together and occupy basal positions in relation to the two other polymerases
(see also Additional file 2: Figure S3).

Fig. 6. Maximum likelihood (ML) phylogenetic tree for DNA polymerase amino acid sequences
(alignment length 833 residues). The tree is formally rooted on the split between
Eukaryota and Archaea. Bootstrap support values???50 % are indicated above the branches.
The proteins from WSSV and from the three decapod genomic libraries are in red

Helicase

The WSSV genome and the M. depressus scaffolds contained single-copy helicase genes each, while two distinct helicase
sequences are identifiable in the P. monodon library. The helicases from WSSV and the decapod libraries cluster in a well-supported
clade presumably at the base of eukaryotic helicases while bacterial helicases form
a distinct clade (predominantly Bacteroidetes) (see Additional file 2: Figure S4). Among the four helicases of viral origin the sequences from P. monodon occupy basal positions and do not form a clade with each other.

Ribonucleotide reductase subunits

The phylogenetic tree reconstructed for the RR1 gene is largely poorly resolved (Fig. 7, see also Additional file 2: Figure S5). The RR1 proteins from WSSV and M. depressus cluster together within a large clade covering opisthokonts. Results for the RR2
alignment show a very similar pattern with similarly low resolution (see Additional
file 2: Figure S6).

Fig. 7. ML phylogenetic tree for ribonucleotide reductase large subunit (RR1) amino acid sequences
(alignment length 769 residues). The tree is rooted by the sequence from Flavobacterium indicum (Bacteria)

TATA-box binding protein

Phylogeny reconstructed for the TBP sequences shows the proteins from WSSV and M. depressus to branch off close to viruses from the NCLDV group with only a weak support (Fig. 8, see also Additional file 2: Figure S7).

Fig. 8. ML phylogenetic tree for TATA-box binding protein (TBP) amino acid sequences (alignment
length 173 residues). The tree is formally rooted by the sequences from Archaea

Endonuclease

The viral endonucleases from WSSV and M. depressus both possess a eukaryotic signal peptide and cluster together at the base of double-stranded
ribonucleases bearing similar signal peptides identified in crustaceans and insects
(Fig. 9, see also Additional file 2: Figure S8).

Fig. 9. ML phylogenetic tree for endonuclease amino acid sequences (alignment length 144 residues).
The root corresponds to the branching of of endonucleases bearing eukaryotic signal-peptide:
see Wynant et al. (2014)

Protein kinase

Among the two WSSV PK enzymes the sole protein kinase identified in M. depressus is presumably an ortholog of WSSV PK1, given the higher similarity and shared location
of the ORF next to the homolog of wsv421. The two kinases from the P. monodon library are roughly equally distant from WSSV PK1 and PK2 and from each other (see
Additional file 3) and no positional information is available for them. The alignment yielded a tree
with poor resolution around the five protein kinases (see Additional file 2: Figure S9). Nevertheless, they form a distinct clade in the PK tree with high bootstrap
support. The two P. monodon proteins group with WSSV PK2, but do not form a cluster with each other, while WSSV
PK1 clusters with the sole PK sequence from M. depressus with weak support. Among the PK-sequences sampled with BLASTp, serine-threonine kinases
were the only class of PKs with explicitly specified activity.

Inhibitor of apoptosis proteins

The tree for the BIR-domains of the IAPs associated with WSSV-like sequences in the
decapod libraries and IAPs from different animal taxa is poorly resolved due to the
very low number of positions (72 amino acids) (Additional file 2: Figure S10). Still, three or four large clades involving IAPs from the three decapod
libraries and four previously identified IAPs from penaeids and a caridean shrimp
are discernible and roughly correspond respectively to the first, second and third
BIR domains of the decapod three-BIR-domain IAPs. Three second BIR domains from the
shrimp repetitive sequences form a separate clade close to the clade of the first
BIR domains. The IAP from M. depressus has only two BIR domains: one of them clusters together with the group of second
domains of other IAPs, and the other one does not belong to any of the three decapod
clades. Another outlying BIR domain is the third BIR domain of the IAP coded by gene
06A from the P. japonicus library. Not all of the IAPs contain RING domains, and the phylogeny reconstructed
based on the respective alignment (35 positions only) shows a grouping of the RING
domains from two IAPs from the two shrimp libraries together with IAPs independently
identified in penaeid and caridean shrimps (see Additional file 2: Figure S11). The RING-domain from the M. depressus IAP does not belong to this clade.

dUTPase

The phylogenetic tree based on the alignment for the dUTPase protein sequences also
lacks resolution (see Additional file 2: Figure S12). The four dUTPases from WSSV and the three decapod libraries do not form
a single monophylum and cluster together with other sequences in different parts of
the tree, albeit with low support values. At the same time, the dUTPases from the
two shrimp libraries are closely related and form a well-supported grouping.