Populational landscape of INDELs affecting transcription factor-binding sites in humans

Identification of TFBS-ID

Fig. 1 shows a schematic representation of the computational pipeline used in all analyses
reported here. To build a catalogue of TFBS-ID, we first indexed all TFBS identified
by the ENCODE project in the human reference genome (hg19 version). Data from the
1000 Genomes project regarding the position of INDELs in the reference genome was
then compared to the position of TFBS and those cases in which an INDEL overlapped
with a TFBS were selected. This strategy rendered us a total of 259,864 TFBS affected
by at least one INDEL. Since a significant fraction of TFBS overlap at the sequence
level, the non-redundant number of TFBS-ID in the above set was 100,182 (an average
of 2.59 TFBS per INDEL). Due to the presence of long INDELs affecting many TFBS at
once, we decided to limit our analysis to those INDELs shorter than 200 bp, which
gave us a total of 99,642 TFBS-ID and 258,686 TFBS. Although the superior limit was
set to 200 bp, the final set of 99,642 TFBS-ID is strongly biased towards shorter
indels. More than 99.8 % of all indels were equal or shorter than 20 bp. Next, TFBS-ID
close (?5 KB) to the transcription start site (TSS) of known human genes (as defined
by the Reference Sequence set) were selected. In total, 7,313 human genes had at least
one TFBS affected by a polymorphic INDEL in the 1000 Genomes dataset. This set of
7,313 genes had a total of 38,339 TFBS affected by INDELs and 10,528 TFBS-ID. A complete
list of this dataset is available at Additional file 1: Table S1. Since many reports have also used a window that flanks the TSS of known
genes 16],17], we have also defined a different window of same size (5 KB) now encompassing 2,5 KB
in each side of a given TSS. For this window, we found that 9,733 human genes had
at least one TFBS affected by a polymorphic INDEL in the 1000 Genomes dataset (with
a total of 69,959 TFBS affected by indels and 14,665 TFBS-ID). The complete dataset
found for the 5 KB window flanking TSS can be found at Additional file 2: Table S2.

Fig. 1. Analysis overview. Schematic representation of the strategy used here to identify
and analyse TFBS affected by polymorphic INDELs in human populations

TFBS-ID showed a biased distribution in terms of location within both 5Kbp windows
proximal to the TSS of known genes. As seen in Fig. 2, their distribution tend to be closer to the TSS of genes (Fig. 2a) (the 3’ end of the 5Kbp window upstream of the TSS) while in the window with the
TSS at center the distribution of TFBS-ID is symmetrical with a slight higher frequency
at the upstream half of the window (Fig. 2b). When we split the TFBS-ID per type of transcription factor, the same biased distribution
is observed for both windows, especially for some transcription factors (Additional
file 3: Figure S1).

Fig. 2. Relative position of TFBS-IDs within the 5 KB window adjacent to TSS of human genes.
Overall distribution of TFBS-ID in both the 5 KB window upstream of TSS (A) and the
5 KB window flanking TSS (B)

We were also interested in knowing what types of TF were more frequently interrupted
by INDELs. A Monte Carlo simulation was performed testing the enrichment of specific
TF within our TFBS final sets. Table 1 lists the top 20 transcription factors enriched for binding sites near genes (both
5 KB windows) and affected by INDELs compared to all TF binding sites near genes.
Some of the TFs shown in Table 1 have already been identified in other analyses. Yokoyama et al.3], for example, have recently shown that hominid-specific binding sites for GATA1 and CTCF are enriched near genes related to sensory-related function and neurological pathways.
CTCF binding sites have also been shown to be under positive selection in several Drosophila
species 18]. POL2 has also been studied in humans and chimpanzees by Kasowski et al.8] who found inter-species divergence in the respective binding affinities.

Table 1. Transcription factors enriched in the set of TFBS-ID close to the TSS of known human
genes. “TF” refers to the name of the transcription factor; “Number of TFBS” refers
to the number of binding sites for the respective TF within the TFBS-ID set; “p-value”
refers to the degree of significance for the respective TF enrichment with the final
TFBS set against all TFBS near genes.

Evaluation of the effect of TFBS-IDs in the expression of corresponding genes

It has been shown that even small changes, like SNVs, in TFBS affect the affinity
of the corresponding transcription factor and consequently the expression of the associated
gene 8]. Therefore, we wondered whether the presence of an INDEL affecting at least one TFBS
would change the expression pattern of the corresponding gene. RNA-Seq data for 465
individuals (all of them from the 1000 Genomes project) from the Geuvadis initiative
14] was used to compare expression and genotype data for the same individual. A statistical
analysis was performed to identify those genes whose presence of a TFBS-ID was associated
to a change in its expression (comparing individuals according to their genotype:
homozygous for the absence of an INDEL, homozygous for the presence of an INDEL and
finally heterozygous individuals). Out of the 7,313 genes with at least one TFBS-ID
in the 5 KB window upstream of TSS, 6,248 were informative for this expression survey.
Out of these 6,248 genes we found that 18.5 % (1,155 genes considering q-value???0.05
as a threshold) had its expression affected by the presence of a TFBS-ID (again by
comparing individuals homozygous for absence of the TFBS-ID, homozygous for the presence
of the TFBS-ID and finally, heterozygous). This is significantly higher than expected
by chance (p-value??10
-5
; OR 1.16). For the window flanking the TSS, we found that 1,804 genes (q-value??=
0.05) had its expression affected by the presence of a TFBS-ID (18.4 % of the total).
Again, this is significantly higher than what one expect by chance (p?=?0.04; OR 1.09).
It is important to emphasize that the effect of the INDEL in the expression of the
corresponding gene is certainly underestimated by our analysis since only one cell
type was evaluated regarding expression. If a given gene is not expressed in the limphoblastoid
cell lineage, no differential expression could be detected. The same is true regarding
the expression of a given transcription factor whose binding site was affected by
the INDEL.

What type of change is observed in the genes associated with a TFBS-ID? For the 5 KB
window upstream of TSS, out of the 1,155 genes whose expression was changed, 654 were
up-regulated and 553 were down-regulated in the individuals carrying a certain TFBS-ID,
a significant difference from the null expectation (p-value??0.01; OR 1.06). We could not observe any difference between the two datasets (up-regulated
and down-regulated genes) regarding the type of transcription factors whose binding
sites were affected by INDELs (q-value??0.3). For the 5 KB window flanking the TSS, we found 990 and 912 up and down-regulated
genes, respectively (a significant difference, p-value?=?0.04). Like for the 5 KB
window upstream of TSS, there was no enrichment of any specific transcription factor
in either gene set (up or down-regulated – q-value =0.6). In both situations, the
sum of up and down-regulated genes does not match the total number of differentially
expressed genes because few genes are present in both lists, due to their different
behaviour depending on the composition of subjects with a given genotype.

TFBS-affecting INDELs with high differentiation between human populations

We next wondered whether we could identify in our set of TFBS-ID alleles that present
a high differentiation between human populations represented in the 1000 Genomes Project.
These frequency differences between populations are considered signatures of geographically
restricted selection and have been used previously to identify regions under positive
selection 13],19]. We restricted this analysis to a set of 911 individuals representing the three major
continental groups: 246 Africans (AFR), 379 Europeans (EUR) and 286 Asians (ASN).
To identify those INDELs with high differentiation between populations, we calculated
the minimal frequency differences (?) of the derived alleles between all pairs of
populations and took into consideration all differences???20 % (????0.2). This threshold
was based in statistical analysis of the distribution of all ? reported here, in which
20 % represents about two standard deviation from the mean (Additional file 4: Figure S2).

For the TFBS-ID identified in the 5 KB window upstream of TSS, this analysis generated
a set of 1109, 507 and 663 TFBS-IDs that have a significant ? in AFR, EUR and ASN,
respectively. When expression data is taken into consideration, 346, 149 and 132 TFBS-ID
(out of the numbers above) seem to affect the expression of the corresponding genes
in AFR, ASN and EUR, respectively. Table 2 reports the top 10 TFBS-ID with highest differentiation for all three populations.
A complete list is presented at Additional file 5: Table S3. For the TFBS-ID identified in the 5 KB window flanking TSS, we found 1482,
679 and 885 that have a significant ? in AFR, EUR and ASN, respectively. A complete
list for the TFBS-ID identified in the 5 KB window flanking TSS is presented at Additional
file 6: Table S4.

Table 2. TFBS-ID within the 5 KB window upstream of TSS and with highest ? in AFR, ASN or EUR.

One interesting gene found in our analysis is MC1R, known to be associated with skin pigmentation in humans 20],21]. A TFBS-ID (rs201097793) associated to this gene has a higher allelic frequency in
AFR (0.70) and ASN (0.64) when compared to EUR (0.17). This supports the suggestion
from Vernot et al.10] that regulatory polymorphisms, under recent selection, have an influence in pigmentation
phenotypes. Another gene reported to have a TFBS-ID with a differential frequency
is VDAC3, a voltage-dependent channel essential for sperm mobility 22]. We found a TFBS-ID (rs145074200) with a higher frequency in AFR (??=?0.26), as similarly
reported by Colonna et al.23] for a different polymorphism in the same gene.

Taste perception has been crucial in human evolution especially for the detection
of toxins. Not surprisingly, bitter taste receptors have been show to be under positive
selection in human populations 24]. Our analysis (Table 2) shows that a TFBS-ID associated with TAS1R3, a sweet receptor, shows a high ? in ASN. Shi and Zhang 25] concluded, based in a comparison of several vertebrate species, that both bitter
and sweet receptors are under positive selection. TAS1R3 is also a component of the dimeric protein TAS1R1/TAS1R3, which is the umami taste receptor 26]. The umami taste is a common feature of many foods in Asia and is reasonable to speculate
that this variant is being selected in Asians 27].

Response to parasites and microbes has been constantly subject to adaptations in human
evolution 28],29]. We found a TFBS-ID (rs139999735) with a higher allelic frequency in AFR (0.34 compared
to 0.11 in ASN and 0.12 in EUR). The gene associated with this TFBS-ID is APIP (APAF1-interacting protein) whose protein has been shown to be an inhibitor of pyroptosis
and apoptosis, both a response to Salmonella infection 30]. Based on that it is predicted that the TFBS-ID would cause a decrease in APIP expression. Indeed, Fig. 3A shows that expression of APIP decreases significantly in individuals homozygous for the TFBS-ID (r s
-0.13; p-value?=?0.012). We propose here that this TFBS-ID is under positive selection
in AFR due to a down-regulation of APIP1 and consequently a better response to Salmonella infection. Fig. 3B shows that a selective sweep analysis supports this proposal. Individuals homozygous
for the presence of the TFBS-ID show a decreased genetic heterogeneity around the
TFBS-ID position (vertical dashed line in Fig. 3B).

Fig. 3. APIP expression is likely adapted in AFR. A TFBS-ID (rs139999735) with a ??=?0.22
in africans and associated with the APIP gene affects gene expression as seen in A. In B, individuals homozygous for the TFBS-ID
(continuous line) had a lower genetic heterogeneity around the INDEL position (vertical
dashed line) when compared to individuals homozygous for absence of the INDEL (dashed
line)

To gain further insights on what types of genes are associated with TFBS-ID showing
a high differentiation between the three human populations, an ontology analysis was
performed. Fig. 4 shows the major GO categories enriched (using a threshold of p???0.01) in the dataset of genes associated to TFBS-ID for each of the three populations
used in this study (5 KB window upstream of TSS). Two GO categories were enriched
in all three populations: “Regulation of Transcription” and “Histone 3’ end mRNA processing”.
“Urea transport” is enriched in both ASN and EUR. All the other categories are enriched
only in one population, as seen in Fig. 4. Overall, there are a large number of categories related to immunological response.
Interesting categories enriched in Africans and Asians are “Response to protozoans”
and “Response to biotic stimulus”, respectively. In Europeans one enriched category
is “UV protection”, known to be under positive selection in this population 31]. For the 5 KB window flanking the TSS, some of the categories seen for the 5 KB window
upstream of TSS are still present (Additional file 7: Figure S3) although several categories clearly linked to recent selection in humans
are missing.

Fig. 4. Ontology analysis for genes associated to TFBS-ID with ???= 0.2 in the respective
population (5 KB window upstream of TSS). Color of the circle refers to the p-value
of the enrichment while size of the circle refers to the numbers of genes within that
GO category

TFBS-ID match regions known to be under positive selection in the human genome

In the last few years, several genome-wide strategies have been used to identify regions
in the human genome that are under positive selection 15],28],29],32],33]. The recent availability of data from the 1000 Genomes project has catalysed such
approach and hundreds of regions have been identified. To evaluate whether our set
of TFBS-IDs correspond to genetic units that are under selection, a comparison was
made with one of the most complete, in terms of the number of metrics used, of such
studies 15].

When we compared our total set of 10,520 TFBS-ID close to the 7,313 human genes (5 KB
window upstream of TSS), we found that 3,499 (33.2 %) matched regions under selection
as defined by Pybus et al.15] within a 95 % confidence interval. With a 99 % confidence interval, we found 797
TFBS-IDs (7.5 %) that matched genomic regions under selection. For the 5 KB window
flanking TSS, we found that 4,747 (32.3 %) TFBS-ID match regions under selection within
a 95 % confidence interval. With a 99 % confidence interval we found 1,061 (7.2 %)
TFBS-ID matching regions under selection. Fig. 5 shows the results for a gene ontology enrichment analysis (p???0.01) with the set of 797 TFBS-ID (5 KB window upstream of TSS) that matched genomic
regions under selection. Three major categories are evident: ontologies associated
with immunological responses, response to radiation and haematological/cardiac processes.
All these processes have been shown to be under recent positive selection in humans
15], 23], 28], 31], 32], 34]. For the set of 1,061 TFBS-ID matching genomic regions under selection and within
the 5 KB window flanking the TSS, we found while some categories are still present,
when compared to the 5 KB window upstream of TSS, several differences exist (Additional
file 8: Figure S4). Overall, the gene ontology analysis presented here (Figs. 4, 5, Additional file 7: Figure S3 and Additional file 8: Figure S4) suggests that the inclusion of a region downstream of TSS diluted the
selection signal observed for the 5 KB window upstream of TSS. This in in accordance
with a recent finding from the GTEx Consortium about a higher frequency of eQTLs located
upstream of TSS 17].

Fig. 5. GO enrichment analysis for TFBS-ID matching regions known to be under selection in
the human genome (5 KB window upstream of TSS). Color of the bars refers to the p-value
of the respective enrichment. Length of the bar refers to the number of genes within
the respective GO category