HIA: a genome mapper using hybrid index-based sequence alignment

Hybrid mapping: finding candidate alignment region (CAR)

Hybrid mapping follows the same seed-and-extend approach used by all HT-based tools. A MR (matching region) is a common substring between the reference sequence and the read. Let ‘sp’ be the starting position in the reference sequence where the MR occurs. We will indicate each MR as 3-tuple dv, ro, L, where ‘ro’ (read offset) is the starting position in the read where the MR occurs, ‘dv’ (diagonal value) is defined as dv = sp ? ro, and ‘L’ is the length of the MR. Given a length-m? MR, there are (m? ? q + 1) q-grams having the same diagonal values and consecutive read offsets. Diagonal values having same values implies that the corresponding MRs are close each other in the reference sequence. A CAR (candidate alignment region) is a list of MRs, which are close each other and ordered by ‘ro’. We define a CAR as a seed and align only the unmatched regions in CAR.

The procedure for finding MRs and CARs of a read is as follows: (1) retrieving range of SA of each q-gram using HT and SA; (2) computing diagonal values; (3) sorting by diagonal value and offset; (4) grouping MRs with same diagonal value and successive offsets; (5) merging the adjacent MRs into CARs; (6) sorting the CARs by matched bases in descending order. For example, given a read (r = GCCATG) and q-gram length (q = 2) and the hybrid index constructed in Fig. 1, we can find MRs and CARs as follows:

  1. I.

    Retrieve range of SA of each q-gram using HT and SA

0-th q-gram positions: (GC: SA[10, 11])

1-th q-gram positions: (CC: SA[8])

2-th q-gram positions: (CA: SA[6, 7])

3-th q-gram positions: (AT: SA[3, 4])

4-th q-gram positions: (TG: SA[15])

  1. II.

    Compute diagonal values

(GC, 5, 0), (GC, 11, 0), (CC, 11, 1), (CA, 4, 2), (CA, 11, 2), (AT, ?2, 3), (AT, 4, 3), (TG, 4, 4)

  1. III.

    Sort by diagonal value and offset

(AT, ?2, 3), (CA, 4, 2), (AT, 4, 3), (TG, 4, 4), (GC, 5, 0), (GC, 11, 0), (CC, 11, 1), (CA, 11, 2)

  1. IV.

    Group MRs with same diagonal value and successive offsets

MR0: (AT, ?2, 3, 2) ? (AT, ?2, 3)

MR1: (CATG, 4, 2, 4) ? (CA, 4, 2), (AT, 4, 3), (TG, 4, 4)

MR2: (GC, 5, 0, 2) ? (GC, 5, 0)

MR3: (GCCA, 11, 0, 4) ? (GC, 11, 0), (CC, 11, 1), (CA, 11, 2)

  1. V.

    Merge the adjacent MRs into CARs and set matched bases

CAR0: (MR0; 2)

CAR1: (MR1, MR2; 5)

CAR2: (MR3; 4)

  1. VI.

    Sort the CARs by matched bases

CAR1: (MR1, MR2; 5)

CAR0: (MR3; 2)

CAR2: (MR2; 2).

Although diagonal values of two adjacent MRs are different, they could be located in a same CAR if there were inserted or deleted bases between them. In the case of CAR1, the difference value between diagonal value of MR1 and diagonal value of MR2 is 1 and there is one inserted base (C) between MR1 and MR2. We refer to this value as adjacency and we use the value in order to set the permitted size of insertion and deletion between MRs.

The formula (2) is the cumulative distribution function of X. We are able to calculate the rate of reads with at most k errors according to the formula (2). If we set k = 1.5?, the rate of reads with at most 1.5? errors approach to 0.9 and we can use the q-gram with length m/(1.5? + 1). Using a q-gram with the same length as the common substring decreases the number of MRs and CARs.

Secondly, since a q-gram that occurs in many regions in the sequence is not a good discriminator, such q-gram was given less weight than one that occurs in few regions. This heuristic is based on the inverse document frequency (IDF) commonly used in the field of information retrieval. The IDF is a measure of whether or not the term is common across all documents [16]. Applying this heuristic, we can filter out the less weight q-grams and consequently skip undesirable MRs and CARs.

Finally, given a read (r) and two strings, S1 and S2, both of length m, if the number of matched bases between r and S1 is greater than the number of matched bases between r and S2, then the number of differences between r and S1 is smaller than the number of differences between r and S2. This heuristic can be used to rank CARs by the number of matched bases and to filter out the lower-ranked CARs.