HTJoinSolver: Human immunoglobulin VDJ partitioning using approximate dynamic programming constrained by conserved motifs

May 23, 2015

Partitioning sequences using conserved motifs

Similar to the original JOINSOLVER algorithm, conserved motifs initiate the alignment
process 1]. In preparation for heavy chain VDJ alignment, the rearrangements are split into
smaller regions using the conserved 3â€™ VH-motif â€œTAT TAC TGTâ€ and JH-motif â€œC TGG
GGâ€. If a motif is not found, we fall back to other methods of finding the motif,
which are described below. FigureÂ 1 provides an overview of the partitioning process with an example sequence. In the
figure, many of the V and J nucleotides are replaced with dots to preserve space.
First the conserved motifs are found in the sequence (FigureÂ 1a, the motifs are bold). The sequence is split just before the highly conserved 3â€™
V-motif (FigureÂ 1b). The sequence on the 5â€™ side of the V-motif, which includes the nucleotides encoding
codons 1â€“101 of the V-germline using IMGT numbering 13], is aligned using our approximate backwards DP algorithm (3â€™ to 5â€™). In the figure,
the arrows show the alignment direction. The sequence on the 3â€™ side of the V-motif
consists of the 3â€™ of the V-segment, the VD junction, D-segment, DJ junction, and
the J-segment. Our V-end algorithm, an overlap DP algorithm described below, aligns
the sequence on the 3â€™ side of the V-motif and identifies the end of the V-segment.
The two parts of the V-segment are merged to produce a completely aligned V-segment.
The remainder of the rearrangement consists of the unaligned VD junction, the D-segment,
the DJ junction, and the J-segment.

Figure 1. Overview of V(D)J Partitioning. Partitioning Ig VDJ rearrangements at conserved VH
JH motifs for alignment with the approximate backwards algorithm and other DP algorithms.
a) A VDJ nucleotide sequence before subdivision and algorithm processing. The dots between
CAGâ€¦GTA and GGAâ€¦CAG represent the nucleotides that are omitted for brevity. The V
and J motifs, TAT TAC TGT and C TGG GG, respectively are shown in bold face type.
b) The VDJ rearrangement is divided into 2 sections: the 5â€™ end of the V-segment containing
codons 1â€“101; and the 3â€™ end of the V-segment, the VD junction, the D-segment, the
DJ junction, and the J-segment. The 5â€™ end of the V-segment is aligned backwards (3â€™
to 5â€™), and the reset of the sequence is aligned forwards (5â€™ to 3â€™). The V-end is
identified and the two parts of the V are merged. c) The rest of sequence is split just before the J motif, which is where the J-Start
DP algorithm aligns the sequence to a J gene (left arrow) and determines the 5â€™ end
of the J-segment. The V-end algorithm is also used to identify the 3â€™ end of the JH
(right arrow). The 5â€™ and 3â€™ ends of the J are merged. d) A specialized local DP algorithm is used to align a D-gene within the VD-D-DJ subunit.
In the figure, the V-, D-, and J-segments within each partition of the sequence are
labeled. The intervening nucleotides labeled VD and DJ represent N addition nucleotides
in the junctions.

Next the partitioning method identifies the JH segment using a highly conserved J
motif in a process similar to the VH alignment (FigureÂ 1c). The remaining unaligned sequence is split just before the J-motif. A DP algorithm
is used to align this fragment to J germline genes and to identify the 5â€™ start of
the J-segment. The V-end algorithm is used to align the 3â€™ end of the J-segment. The
two J fragment alignments are merged to produce a fully aligned J-segment. Finally,
the D-segment is matched using a specialized local DP algorithm (FigureÂ 1d) to produce a fully partitioned and aligned rearrangement (FigureÂ 1e). Those nucleotides that are not partitioned with the VH, D, or JH segment are considered
N addition nucleotides.

The approximate backwards algorithm

Unlike the DP algorithms commonly used for sequence alignment, the approximate backwards
algorithm starts from a known point and works backwards to the start of the matrix
formed by the germline gene and the unknown or query sequence. In this study, the
scoring rules are +5 for a match and ?4 for a mismatch; gap opening (indel formation)
is penalized ?30, and gap extension (continuation of indel) is penalized ?1. The numbers
in the matrix elements are the scores of the best alignment up to the bases corresponding
to the column and row. The algorithm starts at the location of the 5â€™ T of the conserved
V-motif in both the germline sequence and the query Ig sequence. The score is initialized
to zero. The algorithm steps one base backwards, toward the 5â€™ end in both sequences.
A traceback, which points to the previous matrix element, is kept to mark the alignment.
As long as the bases match, the algorithm continues stepping in the 5â€™ direction.

If the bases donâ€™t match, the algorithm considers the possibility of a mutation or
an indel. The algorithm steps forward 10 bases (toward the 3â€™ end) and calculates
a small 40?Ã—?40 DP matrix using an algorithm similar to the Needleman-Wunsch algorithm
9],10] to find the best alignment in this small section of the sequences. The small matrix
is initialized by assuming inserts, which fill the elements in last column, and deletions,
which fill the bottom row. Every other element in the small matrix is the maximum
of a match, a mismatch, an insertion, or a deletion using the scoring rules. The tracebacks
maintain the alignment through the small matrix. By calculating all elements of the
small matrix, the algorithm determines the best alignment through the matrix taking
insertions, deletions, mutations, and matches into account. The algorithm continues
aligning from the highest scoring entry in first row or column of the matrix.

FigureÂ 2 demonstrates how the algorithm aligns sequences. In the examples provided in FigureÂ 2, the starting point uses the most 5â€™ T from the conserved V-motif, TAT TAC TGT. The
germline and unknown sequences match for the next 3 bases (G, T, and G), so the algorithm
walks along the diagonal in the 5â€™ direction of the DP matrix. When a mismatch occurs,
it traces back and calculates the score over a small rectangular submatrix. In FigureÂ 2a, which corresponds to a single C to A mutation, the high score traces straight along
the diagonal. In FigureÂ 2b, which corresponds to a two base insert (AA), the traceback has two steps down.
FigureÂ 2c, which has a two base deletion (TC), the traceback has two right steps. In all the
examples in the figure, arrows represent the trace backs, the top rows and first columns
are shown in bold, and the highest scores are circled. The algorithm continues stepping
back in the 5â€™ direction along the diagonal from the high scoring element. Finally,
the algorithm terminates at the top left matrix element. To conserve space in the
figure, the algorithm traced back 1 step and a 5?Ã—?5 square matrix was calculated.
However, these parameters are too small to avoid falling into local maxima, thus,
in actuality, the algorithm uses a 40?Ã—?40 square matrix on a mismatch, and traces
back 10 steps.

Figure 2. Matrix calculation of the alignment score for a sequence with a mutation or indel.
Matrix calculation of the alignment score for a sequence with a mutation or indel.
(a) Matrix for a single nucleotide mismatch. (b) Matrix with a two-base insertion (CG??CAAG). (c) Matrix with a two-base deletion (TC????). The dynamic programming matrix for the
approximate backwards algorithm begins at the initial T of the VH-motif (last row
and column, score?=?0). The algorithm goes backwards along the diagonal until it hits
a mismatch, in which the algorithm backs up a step and generates a submatrix (solid
lines). The algorithm can choose to step up (deletion), step to the left (insertion),
or continue diagonally (match/mismatch). For a deletion or insertion, the score initially
decrease by 10, but subsequent indels have a score decrease of 4. Matches increase
the score by 5, and mismatches decrease the score by 4. The maximum score in the first
column or row (bold box) is selected (circled). The algorithm continues stepping backwards
on the diagonal. Backtraces are shown as arrows, and label the alignment of the sequences.

After completing the alignment, sequences with indels are re-aligned from 5â€™ end of
the alignment to the 3â€™ end of the V-motif. The two alignments are compared for consistency.
This quality control step catches suboptimal alignments caused by local maxima.

3â€™ V-end alignment algorithm

After aligning the 5â€™ end of the V-segment, we align the 3â€™ of the V-segment starting
at the conserved VH-motif. The V-end alignment algorithm, which is a type of overlap
DP algorithm 11], begins at the first â€œTâ€ in the VH-motif for both the query and germline sequences.
A full matrix calculation is required, however the matrix is fairly small because
only a few nucleotides occur after the V-motif in the V-germline genes. The 3â€™ end
of the V-segment is defined by the maximum score, which can occur anywhere in the
matrix. The 5â€™ J-end alignment uses the same algorithm, starting at the â€œCâ€ from the
conserved JH-motif.

J-start alignment algorithm

The start of the J-segment is aligned using a DP algorithm similar to a Needleman-Wunsch
algorithm; however, the maximum score can be anywhere in the DP matrix, and the algorithm
is run backwards from the J-motif. The maximum score corresponds to the start of the
J-segment. The alignment is determined by tracing back from the highest scoring matrix
entry.

D matching alignment algorithm

The D alignment uses a local alignment algorithm similar to a Smith-Waterman algorithm.
A match is given a score of +1, however a mismatch sets the score back to 0. When
the score?=?3, the algorithm looks back 4 steps to check for a potential mutation,
if the score before the match start is greater than 3, the current score is increased
by the previous score. This allows a mismatch to be recognized as a mutation in the
D-segment if and only if there are at least 3 consecutively matching bases on both
sides of the mutation. The probability of randomly matching 3 bases is around 1.5%
or 3% on either side of the D-segment, and 3 was chose to keep the probability of
random matches below 5%. The termination condition is the maximum score anywhere in
the matrix. These modifications to the Smith-Waterman algorithm maintain the basis
of the original JOINSOLVER, matching D-segments based on consecutive matching bases.

Falling back to other algorithms

In the event, the V-motif is not found in the sequence, the algorithm looks for the
motif with one mutation. If the motif with a single mutation is found, the approximate
backwards algorithm is run using the mutated motif location. Because the J-motif is
shorter, mutations are not allowed in the J motif.

When the location of the V-motif is wrong, the alignment is very poor resulting in
a low score. Setting a score threshold of 3.1/base effectively catches an incorrect
motif location. Before relinquishing to a slower algorithm, one last attempt is made
to identify the location of the motif. The sequence is aligned to the first germline
sequence in our database (IGHV1-18*01). The motif location from this alignment is
used to align the query sequence to all other germline genes to determine the best
match or alignment. If the highest scoring alignment is still below 3.1/base, we run
an overlapping sequence DP algorithm for the V or a Smith-Waterman algorithm for the
J alignments. Our only modification is that the end of the V- or J-segment occurs
at the maximum score anywhere in the matrix, not just in the last row or column.

Simulations

Artificial VDJ rearrangements were generated by randomly recombining a VH-, D-, and
JH-germline gene from the JOINSOLVER germline database. A random number of terminal
nucleotides from the 3â€™ V, 5â€™ 3â€™ D and 5â€™ J were removed to mimic exonuclease activity.
Various random numbers of nucleotides were added to the V-D and D-J junctions to mimic
TdT activity. In our simulation, each base had a fixed probability of being mutated
(referred to as the mutation probability). The mutation probability should not be
confused with the mutation frequency, which is the number of mutations is a sequence
divided by the sequence length. The number of indels in our simulation was randomly
selected using a distribution that heavily favors one indel per sequence, P(n)?=?cne^-3(n-1), where n is the number of indels, and c?=?(1-e³)² is a normalization constant. In our simulator, the indel length is selected from
a Poisson distribution. If a length of zero is selected, the distribution is re-sampled.
TableÂ 1 shows the default parameters of our simulations. Our simulations do not intentionally
mimic the molecular mechanism of Activation-Induced Cytidine Deaminase (AID), which
specifically targets the G in RGYW motifs and the C in WRCY motifs or the AT bias
introduced by mismatch repair mechanisms 14].

Table 1. Default parameter used in simulations, unless otherwise mentioned

Two sets of simulations were performed. In the first simulation, 10,000 rearrangements
were generated with mutation probabilities ranging from 0% to 95% in increments of
5% throughout the VDJ sequence. Using these rearrangements, the approximate backwards
algorithm was compared against a complete DP algorithm, to calculate the success rate.
The second set of simulations with 1,000 rearrangements engineered with a mutation
probability of 3.5% was performed to compare HTJoinSolver with the original JOINSOLVER
using the original JOINSOLVER germline database. The results of an alignment are considered
a successful match to a germline if the algorithm returns the correct gene; the exact
allele is not required.

Comparison with a standard dataset of biological VDJ rearrangements

Additionally 13,153 human sequences from the Stanford.S22 dataset 15] were analyzed to compare HTJoinSolver with several other frequently employed algorithms.
Briefly, the Stanford_S22 data were produced by 454 sequencing of peripheral blood
mononuclear cells from a single donor. The S22 individual genotype was determined
using an individual analysis of iHMMune-align 16] results. The Stanford_S22 dataset and the S22 genotype form a standard dataset to
evaluate the performance of VDJ partitioning algorithms. An online evaluation tool,
Evaluation of IGH partitioning tools (17], http://www.emi.unsw.edu.au/~ihmmune/IGHUtilityEval/evalForm.html) compares the performance of VDJ partitioning tools on the Stanford_S22 dataset.
As a performance metric, the tool reports the percentage of V, D, or J assignments
to germlines that are not present in the predetermined S22 genotype.

Germline database

The germline genes used in HTJoinSolver are IMGT reference sequences 18],19], www.imgt.org/download/GENE-DB/IMGTGENEDB-ReferenceSequences.fasta-nt-WithGaps-F+ORF+inframeP, downloaded July 30, 2014). HTJoinSolver provides methods to download and reformat
germline genes from IMGT. These methods allow the user to maintain an up-to-date library
of germline genes usable by HTJoinSolver. When comparing HTJoinSolver with JOINSOLVER,
we used the JOINSOLVER germline database, which was derived from IMGT and included
pseudogenes.

Access to HTJoinSolver

HTJoinSolver is available for public use, and can be downloaded from https://dcb.cit.nih.gov/HTJoinSolver.