Selected heterozygosity at cis-regulatory sequences increases the expression homogeneity of a cell population in humans

Data processing for the identification of balanced SNPs

To detect the signature of balancing selection, we used the genotype information from
the East Asian (ASN, n?=?504), African (AFR, n?=?661), and European (EUR, n?=?503)
panels from the 1000 Genomes Project 28] with a focus on SNPs in cis-regulatory or TF binding regions. Our workflow of selection signature detection is
shown in Fig. 1a. For the accurate identification of cis-regulatory regions, we used DNase footprint data across 41 cell types from the ENCODE
Project 29]. Nucleotide resolution analysis of the DNase cleavage patterns enabled us to discover
the footprints of binding TFs ranging in size from 6 to 40 bp. The mean size of the
footprint was 15 bp and the median was 9 bp 30].

We sought to eliminate the potential effects of non-allelic nucleotide variations
among duplicates or repetitive sequences on nucleotide diversity patterns, based on
which balanced polymorphisms can be detected (described in the next section). First,
highly similar sequences collapsed in the reference genome assemblies (GRCh37/hg19)
will result in an unusually high mapping depth. Therefore, to exclude regions with
an exceptionally high depth of aligned short reads and poor mappability, we used the
depth information inferred from the alignment of short-read sequences from the 1000
Genomes Project 31]. Specifically, the SNPs whose 50-kb flanking regions overlapped the regions of the
top 0.1 % of the mapping depth were filtered out. We also applied a mappability filter
to discard SNPs for which any 50-mer overlapping the site could be mapped to more
than one location allowing up to two mismatches. The CRG 50-bp mappability score 32] was available from the UCSC Genome Browser. We also excluded putative genomic duplicated
regions (?1 kb and ?90 % identity) 33], 34], which were not collapsed in the reference genome because of relatively lower sequence
similarity. These correctly annotated duplicates will not cause an unusually high
mapping depth but can cause false allelic variant calls due to the incorrect mapping
of some reads from paralogues.

In addition, we discarded SNPs with minor allele frequencies (MAFs) less than 1 %,
as well as SNPs that violated the Hardy–Weinberg equilibrium (HWE) 35]. A Bonferroni adjusted P value of 0.05 was used as the threshold of the HWE test. It should be noted that,
under strong heterozygote advantage, we should observe an excess of heterozygous individuals
at sites in the vicinity of the site under balancing selection. In other words, a
balancing selection signal could be lost due to such filtering because deviations
from the HWE are expected under heterozygote advantage. However, selection required
to cause HWE violation as distinctly as at the adjusted P value??0.05 should be extremely strong, and such strong selection would be detected
using almost any methods. Moreover, while the methods for detecting balancing selection
capture the genetic signatures of selection in the past, deviation from the HWE reflects
selection acting in the present-day population. Thanks to medical advances, many selective
disadvantages are not so fatal in the contemporary population as they were in the
past. Therefore, HWE filtering will miss only a minor fraction of truly balanced SNPs.
Well-established examples of balancing selection in the human genome, such as the
selection in the MHC region, are not lost because of HWE filtering 36]. In essence, our study observes genome-wide patterns; thus, losing a small portion
of balancing signatures is not critical. On the other hand, the major concern with
sequencing data is mapping error; thus, HWE filtering is necessary to reduce the confounding
effects of regions with these bioinformatic artifacts and the problems caused by many
unidentified paralogues in the human genome.

Statistical analysis for detection of balancing signatures

For screening the departure signals from the expected patterns of neutral variation,
we calculated the Tajima’s D statistic 14] for the 5-kb regions centered on focal SNPs residing within DNase footprint regions.
Tajima’s D is computed based on the difference between the mean number of pairwise differences
and the number of segregating sites. These two diversity measures are scaled such
that they are expected to be the same under neutrality. Excess polymorphisms at intermediate
frequencies result in a positive D value. The statistics were calculated with the following parameters: n is the number of chromosomes, S n
is the number of polymorphic sites observed, p i
is the derived (nonancestral) allele frequency of the ith SNP, and q i
is the ancestral allele frequency of the ith SNP. The Tajima’s D score was given as:

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M2','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M2View MathML/a

where

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M3','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M3View MathML/a

and

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M4','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M4View MathML/a

Negative values of Tajima’s D indicate an excess of rare variations, which is in line with positive selection.
Positive values of Tajima’s D, on the other hand, suggest an excess of common variations in a region which can
be consistent with balancing selection. Among them, only the top 1 % of the distribution
of SNPs was considered to have the footprint of balancing selection and the middle
80 % were used as a control set for all downstream analyses (Additional file 1: Figure S1).

To test whether the increased diversity is a consequence of balancing selection acting
on different alleles at a given locus, the mean pairwise ? was computed within and
between haplogroups that were defined by the allele that each haplotype carried at
the focal SNP (Fig. 4; Additional file 1: Figure S3). For regions with evidence of a long-lived balancing selection (presence
of SNPs segregating commonly in humans and chimpanzees), we calculated the mean pairwise
? in humans and chimpanzees (using the phased PanMap set based on mapping to the chimpanzee
reference 37]) in 500-bp sliding windows. To this end, we identified the orthologous chimpanzee
positions for each human SNP by running the UCSC liftOver tool between the panTro2
and hg19 reference assemblies and selecting the coincident sets of human and chimpanzee
SNPs.

To verify the signatures of Tajima’s D, we also performed a Hudson–Kreitman–Aguade (HKA) test in the EUR panel (Additional
file 1: Figure S2a, b). The HKA test compares the level of polymorphism (within-species
diversity) to the level of substitution (between-species divergence). We conducted
the maximum likelihood HKA test 16] by using the MLHKA software (http://wright.eeb.utoronto.ca/programs/). The surrounding 1-kb regions of the DNase footprint SNPs were compared with 99
neutrally evolved regions that were selected as previously described 11], 38]. The number of segregating sites and pairwise number of between-species differences
in each region were used as the input. Chimpanzee was used as an outgroup in this
analysis. To test for selection, the program was run under a neutral model in which
the number of selected loci was zero, and then under a selection model in which the
surrounding 1-kb region of the focal SNP was regarded as the only selected locus.
Statistical significance was assessed by the likelihood ratio test, in which twice
the difference in log likelihood between the selection and neutral models approximately
follows the x2
distribution with a degree of freedom 1 (number of the selected loci). To ensure the
robustness of the outputs, we applied a chain length of 100,000. For each test site,
the selection parameter k and the P value were obtained from the likelihood ratio test. The selection parameter k indicates a k-fold elevation to diversity over the neutral expectation at the given locus. Therefore,
k??1 supports balancing selection.

Selection that favors the same alleles in all populations lowers F st
, an indicator of population differentiation 17]. Low F st
values can indicate alleles that are maintained at similar frequencies in different
populations against the tendency for neutral drift to cause these frequencies to vary
39], 40]. The pairwise F st
among the three populations (ASN, EUR, and AFR) was calculated as F st
?=?1???H s
/H T
, where H S
denotes the average subpopulation heterozygosity and H T
denotes the total heterozygosity (Additional file 1: Figure S2c). It can also be used to infer adaptive evolution between populations.

Prior to the functional analysis of the identified balanced SNPs, we further excluded
a few candidate SNPs that have a lower Tajima’s D score than that of linked nonsynonymous SNPs because of the concern that those balancing
signals may have been attributed to the regulatory SNPs hitch-hiking the nonsynonymous
variants that are actually subject to balancing selection (Fig. 1; Additional file 1: Figure S4). The linkage disequilibrium (LD) was calculated in each population with
an r
2
??0.9 and a chromosomal distance 500 kb as the parameters of the SNAP proxy tool
(http://www.broadinstitute.org/mpg/snap/) 41].

Enrichment tests for balanced SNPs

To test for the enrichment of the balanced SNPs in open chromatin regions in different
cell types, we obtained the whole-genome DNase hypersensitive site (DHS) datasets
for 125 cell types as provided by the ENCODE Project 42] and those for 349 samples (encompassing 53 distinct tissue types) studied in the
Roadmap Epigenomics Project 43]. We used these footprint data for 41 cell types 30], which were initially used to identify balanced regulatory SNPs. For an enrichment
score that reflects the degree to which a set of given SNPs is overrepresented in
a specific cell type, the ratio of the number of the balanced SNP overlaps to the
number of the control SNP overlaps in each cell type was computed (Fig. 1b; Additional file 1: Figure S5). The full list of the tested cell types along with the enrichment test
results are provided in Additional file 2. To conduct a similar analysis in TF binding regions, we obtained the chromatin immunoprecipitation
sequencing (ChIP-seq) datasets for 161 different TFs from the ENCODE Project 42] (Fig. 1c; Additional file 1: Figure S6). The same enrichment analysis was performed for the TF binding regions.

Analysis of target gene functions

To identify the genes regulated by the cis-regulatory region containing the SNPs of interest, we used a couple of whole-genome
promoter–enhancer mapping datasets. We defined a promoter as the region 500-bp upstream
of the transcription start site of a gene based on GENCODE v.19 annotation and obtained
genome-wide multiple cell-type enhancer information from two previous studies. The
first dataset for the distal enhancer-to-promoter connections was created by using
the correlation of the sequencing tag density between distal DHSs and proximal DHSs
across cell types 44]. The second dataset was developed based on the correlations between enhancer RNA
levels and messenger RNA levels as measured by the Cap Analysis Gene Expression (CAGE)
method across the FANTOM5 panel of ~1000 distinct cell types 45].

To determine the function of the genes that are targeted by the balanced SNPs in cis-regulatory regions, we conducted a gene set enrichment analysis using gene targets
that were identified as described above. The hyper-geometric enrichment test was performed
for each Gene Ontology (GO) term using the online tool Web-Based Gene Set Analysis
Toolkit (WebGestalt) 46]. To reduce the type I error, we conducted the Benjamini–Hochberg (BH) correction
for multiple testing 47]. We selected GO terms with the adjusted P values??0.05 as significantly enriched function. These procedures were repeated
for the set of the balanced SNPs detected in each subpopulation (Additional file 3). A total of 646 target genes mapped to 986 balanced SNPs from the EUR panel, 651
genes mapped to 910 SNPs from the ASN panel, and 1006 genes mapped to 1082 SNPs in
the AFR panel were used in our GO analysis.

Allelic imbalance and motif divergence

To test whether the balanced SNPs have regulatory consequences, we examined the differential
contribution of the two alleles at each locus to overall chromatin accessibility at
the region centered on the focal SNP. We screened all heterozygous variants in the
41 cell types of DNase footprint sequencing BAM files from the ENCODE Project 29]. For a more accurate variant detection, we carried out several clean-up procedures,
including removal of duplicates using the Picard tools (http://broadinstitute.github.io/picard/), and performed local realignment and base quality recalibration using the Genome
Analysis Tool Kit (GATK) 48]. After variant calling, GATK variant filtration was performed to retain the sites
for which the map quality (MQ) was ?30 and the Phred scaled probability that a polymorphism
exists (QUAL) was ?30. We used heterozygous variants with a minimum read depth of
10 in the footprint data and calculated the degree of allelic imbalance as the log2
ratio of the number of reads carrying each allele (Fig. 2a). To set the median of the allele imbalance score of the control SNPs to zero for
the purpose of normalization, we subtracted the median of the control set of SNPs
from the allelic imbalance distribution of the balanced SNPs as well as the control
SNPs in each cell type. As another test for the regulatory effects of the balanced
SNPs, we examined whether the alleles at each SNP result in differential TF binding.
To identify TF binding motifs, we searched the TRANSFAC 49]–51] and JASPAR 52]–55] databases with a P value threshold of 10
?4
using FIMO (http://meme-suite.org/doc/fimo.html) 56]. The motif search was performed for ±20-bp sequences containing either allele at
each footprint SNP. We obtained the number of TFs that were predicted to bind specifically
to one of the two alleles and the number of TFs that were associated with both alleles.
For a motif divergence score, the ratio of the number of allele-specific TFs to the
number of common TFs was used (Fig. 2b). Binding QTLs for five TFs in Yoruban lymphoblastoid cells 21] were used to test the allelic differentiation of the AFR balanced SNPs in TF binding
(Additional file 4). The Fisher’s exact test was performed for the overlap of the QTLs with the balanced
versus control SNPs.

Gene expression association of balanced SNPs

We investigated whether the balanced SNPs affect gene expression. Such effects were
first estimated with a linear regression model using the RNA sequencing data of 465
lymphoblastoid cell lines 57], 58] from the Geuvadis Consortium and the matched whole-genome genotypes retrieved from
the 1000 Genomes Project 28]. We calculated the coefficient of determination (R 2
) and its P value from the linear regression between the genotype of the regulatory SNP and the
expression level of the target gene (Fig. 2c). The Bonferroni adjustment was used to address multiple testing. The target genes
were discovered using the methodologies described in the “Analysis of target gene
functions” section above. In another test, we used the eQTL data for 850 samples in
adipose, lymphoblastoid, and skin cells obtained from the MuTHER (Multiple Tissue
Human Expression Resource) Project 59]. When a footprint SNP was not directly available in the genotyping array used in
the project, we looked for the closest tag SNP within 500 kb in LD (r2
??0.9) with the footprint SNP of interest 44] based on the EUR population. The significance of the association between the genotype
of the footprint SNP or tag SNP and the expression level of the target gene was adjusted
by the Bonferroni correction (Fig. 2d).

Mathematical model of transcriptional noise in homozygotes and heterozygotes

For the concentration of an mRNA (m) and its cognate protein (p), the reaction rate theory of the Central Dogma can be written as:

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M5','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M5View MathML/a

(1)

where ? m
is the rate of transcription, ? p
is the rate of translation, ? m
is the rate of mRNA decay, and ? p
is the rate of protein decay. The function f([TF]) indicates the equilibrium concentration of the transcription initiation complex
consisting of the DNA (D), RNA polymerase (RNA
p
), and transcription factor (TF). If the TF is an activator, f([TF]) is an increasing function. If the TF is a repressor, f([TF]) is a decreasing function. To find an explicit form of f([TF]), we impose:

The equilibrium condition on DNA–protein binding:

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M6','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M6View MathML/a

(2)

where the dissociation constant Ki is defined in Fig. 3b.

From the constancy of the gene copy number (set to 1):

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M7','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M7View MathML/a

(3)

the concentration of the bare, or unbound, DNA satisfies

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M8','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M8View MathML/a

(4)

where R?[RNA
p
]/K1
and T?[TF]/K2
is the concentration of the RNA polymerase and TF scaled by the polymerase–DNA dissociation
constant K1
and the TF–DNA dissociation constant K2
, respectively. The positive and negative effect of TF action is captured by s?K1
/K3
, which is larger than 1 for an activator while smaller than 1 for a repressor. An
activator that promotes the recruitment of RNA
p
(K3
??K1
) leads to s??1 and ???1, whereas for a repressor, the inequalities point in the opposite direction. Rearranging
the terms, we obtain 60]:

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M9','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M9View MathML/a

(5)

where ??=?(1?+?R)/(1?+?sR),???=?f(?)/f(0)?=?(R?+?1)/(R?+?S??1
) is the fold change, and f(0) represents a value corresponding to the basal level of gene expression.

For the cells heterozygous for the given cis-regulatory sequences, where the binding equilibrium between the TF and its cognate
binding site is controlled by two distinct equilibrium (dissociation) constants K2
and ?K2
for the two alleles, the rate equation Eq. 1 can be generalized to:

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M10','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M10View MathML/a

(6)

The corresponding homozygous cells under consideration have a pair of identical regulatory
sequences with a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M11','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M11View MathML/a. For a fair comparison of noise levels between heterozygous and homozygous gene expression,
we require the same mean expression level. That is, from Eq. 5:

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M12','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M12View MathML/a

(7)

where the angled brackets denote the statistical ensemble average. For a gamma-distributed
concentration of the TF 61] with the shape and scale parameters a and b, that is [TF]?~??(a,b), the rescaled random variable z[TF] also follows the gamma distribution as ?(a, zb).

Now we consider the fluctuations, particularly in [TF]. The aim is to compare protein
noise levels between a homozygote with the normalized transcription rate a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M13','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M13View MathML/a and a heterozygote with a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M14','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M14View MathML/a that have the same level of average gene expression. With that, the variance in mRNA
and protein levels is given by:

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M15','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M15View MathML/a

(8a)

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M16','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M16View MathML/a

(8b)

where ????? is the expectation over the statistical ensemble. Thus, the noise level
defined as the squared coefficient of variation:

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M17','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M17View MathML/a

(9a)

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M18','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M18View MathML/a

(9b)

gives a fractional measure of the stochastic fluctuations. Note that the first term
in ? p2
(Eq. 9b), which scales with the inverse of the mean protein level, is reminiscent of a Poisson
process and indicates (i) the so-called intrinsic noise originating from random births
and deaths of the individual protein molecule. Additional terms reflect an extrinsic
noise stemming from (ii) the mRNA noise at a given strength of TF–DNA binding and
from (iii) the allele-dependent variation of TF-binding affinity, which is contained
in a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M19','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M19View MathML/a .

In a steady state, where a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M20','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M20View MathML/a, the protein noise can be additively decomposed as:

a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M21','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M21View MathML/a

(10)

where the zygosity comes into play only through a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M22','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M22View MathML/a. To calculate the noise level in a homozygote with the normalized transcription rate
a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M23','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M23View MathML/a, we first determine a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M24','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M24View MathML/a, as a function of ?, that satisfies a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M25','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M25View MathML/a from the requirement of Eq. 7. Thus, we obtained the difference in the noise level, a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M26','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M26View MathML/a, as a function of ? or other parameters across a range of in vivo biochemical parameters 25], 26], 62]. Figure 3c shows ??2
as a function of ?.

Expression and chromatin noise from single cell data

We employed RNA sequencing results for 62 single cells from the GM12878 cell line
63]. This cell line was derived from the NA12878 sample, for which a fully phased genome
sequence is available 28]. The coefficient of variation, measured as the standard deviation divided by the
mean (a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M27','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M27View MathML/a), is the most direct and unambiguous measure of gene expression noise 64]. We calculated the CV of the read counts normalized across the samples. To work only with genes with uniform
power to detect high or low variance, we excluded transcripts with an average FPKM
100 and hence a saturated CV65]. We fitted the curve using the parameterization a onClick=popup('http://www.genomebiology.com/2016/17/1/164/mathml/M28','MathML',630,470);return false; target=_blank href=http://www.genomebiology.com/2016/17/1/164/mathml/M28View MathML/a to capture the dependence of the CV2
on ?65] (Fig. 3e). We differentiated homozygous and heterozygous loci using the GM12878 genotypes
available from the 1000 Genomes Project 28] and discovered target genes based on the GM12878-specific enhancer–promoter maps
created by the Chromatin Interaction Analysis by Paired-End Tag sequencing (ChIA-PET)
66], chromosome conformation capture technology (Capture Hi-C) 67], and Integrated Methods for Predicting Enhancer Targets (IM-PET) 68]. ChIA-PET interactions with four or more PET counts were used. We ran HOMER (http://homer.salk.edu/homer/ngs/) to identify significant (P??10
-6
) promoter–capture Hi-C interactions. The IM-PET method was used for our main analyses.
We used the combination of the two experimental datasets (ChIA-PET and Capture Hi-C)
to confirm that the observed expression noise-selection relationships are not dependent
on enhancer–promoter mapping data (Additional file 1: Figure S10).

To analyze cell-to-cell variation in chromatin accessibility, we used the assay for
transposase-accessible chromatin sequencing (ATAC-seq) data set for 254 individual
cells from the GM12878 cell line 69]. This dataset was downloaded from the NCBI Gene Expression Omnibus (GEO) under accession
number GSE65360. We modified and ran the analysis scripts provided by the authors.
We used as input the chromosomal coordinates of chromatin accessibility peaks (top
50,000 non-overlapping 500-bp summits) and various features associated with each peak,
including fragment counts and sequence bias scores. For our analysis, we assigned
the Tajima’s D score and heterozygosity in GM12878 to each accessible chromatin peak. We computed
the aggregate measure of cell-to-cell variability in chromatin accessibility for peaks
falling within a given range of Tajima’s D. The standard deviation of the calculated variability was obtained by bootstrapping
cells as previously described 69] (Fig. 3f). For a gene-wise aggregate measure of chromatin accessibility noise, the average
of the cis-regulatory regions connected to the same gene was obtained by interrogating the IM-PET-based
GM12878 enhancer–promoter map 68].

Analysis of gene expression level and transcriptional responsiveness

To test whether the advantage of the selected SNPs lies in flexibility of transcriptional
responses to various external signals, we sought to analyze gene expression data for
the human lymphoblastoid cell line GM12878. This cell line was derived from the NA12878
sample, for which a fully phased genome sequence is available 28]. For the first dataset, we downloaded the data for gene expression changes in response
to doxorubicin treatment from the NCBI GEO under accession number GSE51709 70]. The raw expression measures were treated with the Affymetrix Expression Console
using the gene-level RMA summarization and sketch-quantile normalization methods.
The second gene expression dataset was downloaded from the GEO under accession number
GSE26835. In this dataset, the expression measures were obtained prior to ionizing
radiation and at 2 and 6 h after exposure to 10 Gy of ionizing radiation 71]. This dataset was created with the Affymetrix Expression Console using the MAS5 probe
summary method and global scaling normalization method. For each gene, the degree
of transcriptional responsiveness was calculated as the maximum absolute gene expression
change in response to the treatment among different time points or samples. To link
the heterozygosity and selection strengths of each regulatory SNP with the transcriptional
response of its target gene, we used the GM12878-specific enhancer–promoter maps generated
by the IM-PET 68] methods. The degree of gene expression changes in response to doxorubicin or ionizing
radiation was computed for the control SNPs (corresponding to the middle 80 % of the
Tajima’s D distribution), homozygous SNPs, and heterozygous SNPs with varying Tajima’s D or HKA k (Additional file 1: Figure S11a, b). The degree of chromatin changes in response to TNF-? treatment
was calculated for the same groups of footprint SNPs. We used the ATAC-seq data set
for GM12878 single cells 69] described above (Additional file 1: Figure S11c). The aggregate measure of chromatin accessibility across the single
cells was obtained for each chromatin peak.

We sought to test whether the balanced SNPs are advantageous when their target genes
are highly expressed, in which case extrinsic noise is a dominant source of cell-to-cell
variability. We chose white blood cells for which mRNA sequencing data were available
from the Roadmap Epigenomics Project, namely, CD4 naïve primary cells, CD4 memory
primary cells, CD8 naïve primary cells, and peripheral blood mononuclear primary cells.
The IM-PET-based enhancer–promoter map for CD4 naïve T cells 68] was used to associate the Tajima’s D score of the regulatory SNPs with the expression level of the target genes calculated
as FPKM. Additionally, we used the two Affymetrix datasets used for the above responsiveness
tests, namely, the doxorubicin response data (GSE51709) and the radiation response
data (GSE26835). We performed sample-wise normalization, in which the mean expression
level of all genes in each sample (CD4 naïve primary cells, CD4 memory primary cells,
CD8 naïve primary cells, peripheral blood mononuclear primary cells, GM12878 before
doxorubicin treatment, GM12878 after doxorubicin treatment, GM12878 before irradiation,
and GM12878 after irradiation) was set to zero.