TRIg: a robust alignment pipeline for non-regular T-cell receptor and immunoglobulin sequences

T-cell receptors (TR) and immunoglobulins (Ig, also known as antibodies) are essential to the adaptive immune system because they recognize a wide variety of antigens and trigger immune responses [1]. Each TR and Ig gene contains many coding regions, which are classified into variable (V), diversity (D, only in TRβ/δ and IgH genes) and joining (J) regions. For example, TRβ has 67 V, two D, and 13 J regions [2]. To recognize numerous antigens, TR and Ig genes undergo V(D)J recombination (i.e., selection and concatenation of a V, (D), and J region) at the DNA level, generating a large repertoire of structurally diverse receptors [3]. During recombination, diversity is further enhanced by deletion and non-template addition of nucleotides within the so-called complementarity determining region 3 (CDR3), which is crucial for antigen recognition [4]. Knowledge of V(D)J recombination and CDR3 is thus important for studying immune responses.
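The recombination steps described above (selection of one region of each type, deletion at the junction ends, non-template addition, concatenation) can be illustrated with a toy simulation. This is a minimal sketch for intuition only, not part of TRIg: the function name `recombine`, the parameters `max_trim` and `max_insert`, and the short example sequences are all hypothetical, and the uniform random choices do not model real recombination frequencies.

```python
import random


def recombine(v_segments, d_segments, j_segments, rng, max_trim=3, max_insert=4):
    """Toy V(D)J recombination (illustrative assumption, not a biological model).

    Picks one V, an optional D, and one J segment, trims up to `max_trim`
    nucleotides from each junction-facing end, and inserts up to `max_insert`
    random (non-template) nucleotides at each junction.
    """

    def trim_right(seq):
        # Delete 0..max_trim nucleotides from the 3' end.
        k = rng.randint(0, min(max_trim, len(seq)))
        return seq[:len(seq) - k]

    def trim_left(seq):
        # Delete 0..max_trim nucleotides from the 5' end.
        k = rng.randint(0, min(max_trim, len(seq)))
        return seq[k:]

    def n_addition():
        # Non-template addition: 0..max_insert random nucleotides.
        return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_insert)))

    v = trim_right(rng.choice(v_segments))
    j = trim_left(rng.choice(j_segments))
    if d_segments:
        # Loci with D regions: V-N-D-N-J.
        d = trim_left(trim_right(rng.choice(d_segments)))
        return v + n_addition() + d + n_addition() + j
    # Loci without D regions: V-N-J.
    return v + n_addition() + j


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical placeholder sequences, far shorter than real regions.
    v_regions = ["TGTGCCAGCAGC", "TGTGCCAGCAGT"]
    d_regions = ["GGGACAGGGGGC", "GGGACTAGCGGG"]
    j_regions = ["AATGAAGCTTTC", "CTATGGCTACAC"]
    for _ in range(3):
        print(recombine(v_regions, d_regions, j_regions, rng))
```

Even with only two candidates per region type, the trimmed ends and random insertions make most simulated junctions distinct, which is the source of the CDR3 diversity mentioned above.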

Several alignment tools are available for analyzing the complex recombination of TR and Ig genes, e.g., IMGT/V-QUEST [5]. Since the introduction of next-generation sequencing (NGS), which generates large amounts of data, new tools for analyzing TR and Ig sequences have all been geared toward higher speed. These include IMGT/HighV-QUEST [6], Decombinator [7], and the more recent IgBLAST [8] and MiTCR [9]. Despite their distinct algorithms, all of these tools align reads only to the V(D)J regions rather than to the whole gene in order to increase speed. Software for subsequent analysis of diversity and clonality, e.g., tcR [10] and IMEX [11], is also available.

These tools have been quite useful for studying TR and Ig sequences, which are often prepared via a multiplex PCR approach [12, 13] in which multiple primers are designed to target different V and/or J regions. Such amplicon approaches are efficient at capturing regularly recombined TR and Ig genes, but they likely suffer from amplification bias and miss non-regular TR and Ig sequences that arise from aberrant recombination in diseases [14, 15], cancerous cells [16, 17], or even healthy individuals [18]. Although amplification bias can be reduced [19], its complete removal cannot be guaranteed. To avoid amplification bias, the 5′ RACE (rapid amplification of cDNA ends) strategy is promising [20] and has been applied in recent studies of the immune repertoire [21, 22]. In addition, the strategy allows detection of aberrant recombination and non-regular splicing events [23–25].

For RACE data, however, current tools can make mistakes because they all assume regular recombination, an assumption that does not hold for many RACE sequences [26]. To fully utilize RACE data, we propose TRIg to handle non-regular TR and Ig sequences. Unlike all current programs, TRIg aligns reads to the whole immune gene instead of only to the V(D)J regions. With this strategy, TRIg avoids assigning false V(D)J annotations to non-regular immune sequences. The strategy, however, is computationally challenging: full-length TR and Ig genes are relatively long and contain many repeats, so a read may produce multiple hits, among which the authentic ones must be identified. These challenges are addressed in the TRIg pipeline.

On real RACE data, TRIg revealed several types of non-regular TRβ sequences, e.g., expression of the pseudogene J2-2P and concatenation of two J regions or of a J region and an intergenic region. TRIg avoided false V(D)J annotation of those reads, thereby providing a more accurate description of the immune repertoire. Accurate frequencies of V(D)J recombination have been used as biomarkers for health and disease [27, 28]. For such studies, an unbiased and accurate description of the immune repertoire can be obtained using TRIg and RACE data. In addition, TRIg can reveal the varied behavior of TR and Ig genes during maturation, providing material for a deeper understanding of the regulatory mechanisms. We therefore expect TRIg to benefit a wide range of immunological research.