CHiCAGO: robust detection of DNA looping interactions in Capture Hi-C data

In this paper, we present the CHiCAGO algorithm for Capture Hi-C analysis and demonstrate its efficacy in detecting interactions enriched for regulatory chromatin features and relevant GWAS SNPs.

Our approach is based on the assumption that “significant” interactions emerge as outliers on a distance-dependent local background profile. This assumption is shared by most other tools for interaction detection in 3C-like data and seems reasonable for the purposes of identifying regulatory interactions. Indeed, it can be expected that regulatory events such as transcription factor binding will stabilise the chromatin loop, leading to interaction frequencies or retention times beyond those generated by random collisions due to Brownian motion. This expectation is supported by the observation that CHiCAGO-detected interactions are selectively enriched for regulatory chromatin features, even when located in regions with high background interaction levels.

While the conceptual interpretation of “significant” interactions is shared between CHiCAGO and algorithms developed for other types of 4C and Hi-C data, there are key differences in terms of the underlying background model, the normalisation strategy and the multiple testing procedure.

Existing tools model Hi-C background with a broad range of distributions, both discrete (binomial [16, 29], negative binomial [6]) and continuous (Weibull [7, 9], normal [13]). In CHiCAGO, we instead opted for a two-component convolution model that incorporates two count distributions: a negative binomial and a Poisson. In doing so, we were motivated by the fact that distance-dependent Brownian collisions and technical variability are two distinct background count-generating processes whose properties are best learned separately on different subsets of data. Indeed, signals from Brownian collisions ostensibly dominate the background at short distances, to the extent that technical variability is barely detectable. In contrast, at large linear distances between fragments, Brownian collisions are too weak for their count distribution to be estimated directly. Thus, we infer this distribution by extrapolation.
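As a minimal numerical sketch of this convolution idea (not the CHiCAGO implementation; the parameter values below are purely hypothetical), the background count for a fragment pair can be modelled as the sum of a negative binomial Brownian component and a Poisson technical component, with the p value for an observed count taken from the upper tail of their convolution:

```python
import numpy as np
from scipy import stats

def convolution_pvalue(k, mu_b, r, lam_t, max_support=500):
    """P(B + T >= k), where B ~ NegBin with mean mu_b and dispersion r
    (Brownian collisions) and T ~ Poisson(lam_t) (technical noise)."""
    p = r / (r + mu_b)  # scipy's nbinom parameterisation: mean = r*(1-p)/p
    ks = np.arange(max_support + 1)
    pmf_b = stats.nbinom.pmf(ks, r, p)
    pmf_t = stats.poisson.pmf(ks, lam_t)
    pmf_sum = np.convolve(pmf_b, pmf_t)[: max_support + 1]  # pmf of B + T
    # Tail probability, with any mass truncated beyond max_support added back:
    return float(pmf_sum[k:].sum() + (1.0 - pmf_sum.sum()))

# A count far above the combined background mean yields a small p value,
# while a count near the mean does not (illustrative parameters):
p_hi = convolution_pvalue(40, mu_b=3.0, r=2.0, lam_t=0.5)
p_lo = convolution_pvalue(2, mu_b=3.0, r=2.0, lam_t=0.5)
```

In practice, the two components' parameters would be estimated on different subsets of the data, as described above; here they are simply fixed.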

Borrowing information across baits to learn the background model, as CHiCAGO does, requires careful normalisation across interactions. While Hi-C background depends on a number of known parameters, such as fragment length and GC content [10], we, along with others [7, 8, 30], have opted to avoid any specific assumptions about noise structure, particularly given the increased complexity and asymmetric nature of capture Hi-C noise compared with conventional Hi-C. Assuming that interactions are subject to multiplicative bait- and other-end-specific bias, as we did in learning the Brownian background component, parallels the assumptions of the Hi-C iterative correction approach by Imakaev et al. [8] and is generally consistent with data from molecular dynamics simulations of chromatin fibres [18]. In modelling technical noise, we assumed it to be reflected in the numbers of trans-chromosomal interactions involving the same fragment. A similar strategy has been applied independently in a recently published Capture Hi-C study [6]; the same authors also proposed an iterative correction algorithm for Capture Hi-C data [7] (software not publicly released) that may complement the approaches taken here.
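The multiplicative-bias assumption can be illustrated with an ICE-style balancing sketch in the spirit of Imakaev et al. [8] (a toy simulation, not CHiCAGO's normalisation; the shared distance profile across baits and all parameter values are simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy CHi-C-like counts: expected[i, j] = f(d_j) * s_bait[i] * s_oe[j], i.e.
# a distance-decay profile scaled by multiplicative per-fragment biases
# (all baits share one profile here purely for simplicity).
n_baits, n_oes = 60, 300
s_bait = rng.lognormal(0.0, 0.3, n_baits)
s_oe = rng.lognormal(0.0, 0.3, n_oes)
f = 300.0 / (np.arange(n_oes) + 1.0)  # distance-dependent decay
counts = rng.poisson(f * s_oe * s_bait[:, None]).astype(float)

def iterative_correction(m, n_iter=30):
    """ICE-style balancing: alternately divide out row and column scaling
    factors until row/column sums are uniform; the accumulated factors are
    the estimated multiplicative biases."""
    m = m.copy()
    bias_r = np.ones(m.shape[0])
    bias_c = np.ones(m.shape[1])
    for _ in range(n_iter):
        r = m.sum(axis=1); r /= r.mean()
        m /= r[:, None]; bias_r *= r
        c = m.sum(axis=0); c /= c.mean()
        m /= c; bias_c *= c
    return m, bias_r, bias_c

corrected, bias_r, bias_c = iterative_correction(counts)
# The recovered row factors should track the simulated bait biases:
corr = np.corrcoef(bias_r, s_bait)[0, 1]
```

Note that the column factors absorb both the other-end bias and the distance profile in this toy setup; CHiCAGO instead learns the distance dependence and the bias terms separately.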

Multiple testing issues are important in genomic analyses and, in attempting to address these issues, a number of bespoke approaches have been developed [20, 31]. The specific challenge of multiple testing in Hi-C data is that we expect the fraction of true positives to vary depending on the genomic distance between the fragments; in fact, the majority of tests are performed on interactions spanning large distances or located on different chromosomes, where true positive signals are least expected. CHiCAGO’s multiple testing procedure is based on the p value weighting approach by Genovese et al. [15], which is a generalisation of a segment-wise weighting procedure by Sun et al. [32]. These approaches have been used successfully to incorporate prior knowledge in GWAS [33–35] and are emerging in functional genomics analyses [36, 37]. In using the reproducibility of significant calls across replicates as an estimate of the relative true positive rate, we have taken inspiration from the irreproducible discovery rate (IDR) approach [38] used to determine peak signal thresholds in other types of genomics data, such as ChIP-seq.
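A minimal sketch of p value weighting in the sense of Genovese et al. [15]: each test receives a positive weight with unit mean, the weighted p value is p/w, and rejecting when p/w falls below the nominal level preserves error-rate control while shifting power toward tests with a higher prior true-positive rate. The distance-dependent weight function below is purely illustrative, not CHiCAGO's fitted weights:

```python
import numpy as np

def weighted_pvalues(p, w):
    """Genovese-style weighted p values: rescale weights to mean 1,
    then divide each p value by its weight (capped at 1)."""
    w = np.asarray(w, dtype=float)
    w = w / w.mean()  # enforce the mean-1 constraint on the weights
    return np.minimum(np.asarray(p, dtype=float) / w, 1.0)

# Hypothetical distance-dependent weights: tests at short genomic distances,
# where reproducible true positives concentrate, are up-weighted.
distances = np.array([10e3, 50e3, 200e3, 1e6, 10e6])
w = 1.0 / (1.0 + distances / 500e3)  # illustrative decay only
p = np.full(5, 0.01)
wp = weighted_pvalues(p, w)
```

The same nominal p value of 0.01 thus becomes more significant at short range and less significant at long range, reflecting where true positives are expected.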

Note that, in this setting, IDR cannot be used verbatim for choosing signal thresholds, as the relationship between Capture Hi-C signal and reproducibility does not satisfy IDR assumptions, likely because of undersampling issues (not shown). Importantly, conventional false discovery rate (FDR)-based approaches for multiple testing correction [39] are also unsuitable for these data. Indeed, CHi-C observations (read-pair counts) are discrete and many of them are equal to either zero or one. This leads to a highly non-uniform distribution of p values under the null, violating the basic assumption of conventional FDR approaches. The “soft-thresholding” approach used in CHiCAGO shifts the -log-weighted p values such that non-zero scores correspond to observations where the evidence for an interaction exceeds that for a pair of near-adjacent fragments with no reads. More robust thresholds can then be chosen based on custom criteria, such as maximising enrichment of promoter-interacting fragments for chromatin features (Fig. 6; a user-friendly function for this analysis is provided as part of the Chicago R package—see the package vignette provided as Additional file 3). Based on this approach, we chose a signal threshold of 5 for our own analyses.
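The soft-thresholding step can be sketched as follows (a simplified illustration: the baseline score `s0` for a zero-count pair of near-adjacent fragments, and the example p values and weights, are all hypothetical):

```python
import numpy as np

def soft_threshold_score(p, w, s0):
    """Soft-thresholded score: the -log weighted p value, shifted by the
    score s0 of a hypothetical zero-count near-adjacent fragment pair and
    clipped at zero, so only evidence beyond that baseline scores > 0."""
    s = -np.log(p) + np.log(w)
    return np.maximum(s - s0, 0.0)

s0 = 2.0  # hypothetical baseline for a no-read near-adjacent pair
scores = soft_threshold_score(np.array([0.5, 1e-3, 1e-8]),
                              np.array([1.0, 1.0, 0.2]), s0)
significant = scores >= 5  # CHiCAGO-style signal threshold of 5
```

Only the third observation clears both the soft threshold and the final signal threshold of 5 in this toy example.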

The undersampled nature of CHi-C data (particularly at longer distance ranges), although robustly handled by CHiCAGO, may lead to significant sensitivity issues when using thresholded interaction calls in comparative analyses. We therefore suggest performing comparisons based on the continuous score range. Potentially, differential analysis algorithms for sequencing data (such as DESeq2 [40]) may also be used to formally compare the enrichment at CHiCAGO-detected interactions between conditions at the count level, although power will generally be a limiting factor. As undersampling drives down the observed overlap of interactions called on different samples (Additional file 2: Figure S4c), methods such as [41, 42] may be considered for formally ascertaining the consistency between datasets. Additional filtering based on the mean number of reads per detected interaction (e.g., removing calls whose mean N is below 10 reads) will also reduce the impact of undersampling on the observed overlap, but at the cost of decreasing the power to detect longer-range interactions.
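The suggested mean-count filter amounts to a one-line operation on the replicate count matrix; for example (toy counts, cutoff of 10 as in the text):

```python
import numpy as np

# Rows are called interactions, columns are replicate read-pair counts.
counts = np.array([[12,  9, 15],   # interaction A
                   [ 3,  1,  4],   # interaction B (undersampled)
                   [25, 30, 18]])  # interaction C

# Keep only calls whose mean count across replicates reaches the cutoff:
keep = counts.mean(axis=1) >= 10
```

As noted above, raising this cutoff reduces undersampling-driven irreproducibility at the cost of power for longer-range interactions, which tend to have lower counts.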

The p value weighting approach used here is similar in spirit to an empirical Bayesian treatment, with the p value weights related, but not identical, to prior probabilities. Bayesian approaches are widely used (including, recently, for signal detection in conventional Hi-C [43]) and the Bayes factors and posterior probabilities they generate are potentially more intuitive than weighted p values. However, the p value weighting approach used here has the advantage of not making any specific assumptions about the read distributions of “true interactions”, beyond their having a larger mean. Both approaches open the opportunity of incorporating prior knowledge, beyond the dependence of reproducibility on distance—for example, taking into account the boundaries of topologically associated domains (TADs) [44], higher-order contact domains and chromosomal territories. We choose not to do this currently because the exact relationship between these genomic properties and looping interactions still requires further investigation, and incorporating these relationships a priori prevents their investigation in post hoc analyses. Active research in this area makes it likely that much more will be known about the determinants of loop formation in the near future, enabling a more extensive use of prior knowledge in interaction detection, potentially with a formal Bayesian treatment.

The downstream analyses of CHiCAGO results provided in this paper confirm the enrichment of promoter-interacting regions for regulatory features and disease-associated variants. These results demonstrate the enormous potential of CHi-C for both functional genomics and population genetics, and this assay will likely be applied in a multitude of other cell types in the near future. Therefore, user-friendly, open-source software for robust signal detection in these challenging data will be a welcome addition to the toolkits of many bioinformaticians and experimentalists alike. We have developed CHiCAGO with a view to addressing this need. Furthermore, we expect the statistical foundations of CHiCAGO, particularly the convolution background model and the multiple testing procedure, to be potentially useful in a broader range of Hi-C-related assays.