The role of climate and out-of-Africa migration in the frequencies of risk alleles for 21 human diseases


Genetic disease risk for type 2 diabetes follows Out-of-Africa migration patterns

To determine whether the worldwide distribution of risk allele frequencies is compatible
with the serial founder effect model of human migration out of Africa, we regressed
average heterozygosity and average frequency of the risk allele on distance from Africa.
Consistent with previous analyses [16], [17], we found that type 2 diabetes had a significant relationship with distance from
Africa, with risk decreasing as distance from Africa increases; for this disease,
average heterozygosity of the 15 disease risk SNPs regressed on distance from Addis
Ababa had an R² of 0.69 and a slope of –(8.283 × 10^-6) × (distance [km] from Addis Ababa).
This is a slightly higher R² and a steeper slope than those we calculated for average
heterozygosity using all SNPs regressed on distance from Africa (R² = 0.62,
slope = –(4.268 × 10^-6)). When the heterozygosity is adjusted to account for the neutral
expectation (see “linear regressions” section), the R² is reduced from 0.69 to 0.45 (Fig. 1).
This can be compared to the R² values from the regressions, on distance from Africa, of
the average heterozygosity of the 10,000 resampled sets of SNPs that were matched in
allele frequency to the diabetes risk SNPs, for which the mean R² value was 0.33 with a
standard deviation of 0.20. The empirical p-value for comparing the diabetes risk SNPs
with the null distribution of resampled sets is 0.03 (see “Null Distribution” in the
Methods section). However, when corrected for the tests of nine environmental variables
and distance, the adjusted R² of 0.45 is not significantly different from the null
expectation. The slope of the diabetes risk alleles on distance is steeper
than expected for neutral alleles, which might suggest some type of selection against
type 2 diabetes along the direction of Out-of-Africa migration, but it is very similar
to the slope that Ramachandran et al. [8] report for 783 microsatellite loci (slope = –7.68 × 10^-6) and slightly less steep than
the slope Li et al. [9] report using 600,000 SNPs (slope = –1.44 × 10^-5). The average frequency of type 2 diabetes
risk alleles had an R² of 0.57 for distance from Africa, and, after correction for multiple tests, was not
significantly different from that obtained from random alleles. Thus, this is a borderline
case, and we cannot make a strong case for selection on this disease.
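
The regression described above can be illustrated with a minimal sketch of an ordinary least-squares fit of average heterozygosity on distance. The function name `ols_fit` and the toy data are hypothetical and are not taken from the study; the sketch only shows how a slope and R² of the kind reported here are obtained.

```python
from statistics import fmean


def ols_fit(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r_squared).
    """
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = my - slope * mx
    # For a simple linear regression, R^2 equals the squared Pearson correlation.
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical toy data (NOT from the study): one entry per population.
distance_km = [0.0, 2000.0, 5000.0, 9000.0, 15000.0, 22000.0]
avg_het = [0.36, 0.35, 0.33, 0.30, 0.25, 0.20]

slope, intercept, r2 = ols_fit(distance_km, avg_het)
print(f"slope = {slope:.3e} per km, R^2 = {r2:.2f}")
```

A negative slope corresponds to heterozygosity decreasing with distance from Addis Ababa, as expected under the serial founder effect.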

Fig. 1. Regression of average heterozygosity of type 2 diabetes risk alleles on distance from
Africa. Each point represents the average heterozygosity for one of the 61 populations
studied in this paper

Other regressions on distance

For most diseases besides type 2 diabetes, the distribution of risk alleles did not
show a strong correlation with distance from Africa. Although we found a strong relationship
between heterozygosity and distance for Crohn’s disease and Parkinson’s disease (R² = 0.65 and 0.57, respectively), no relationships between distance and
heterozygosity remained significant after correction for multiple tests for SNPs associated with any disease, including those which
Corona et al. [17] suggested showed signatures of migration (Table 1). These discrepancies are likely due to the way in which genetic risk scores for
each population were used by Corona et al. to construct a phylogeny of populations
from which a pattern of migration was inferred by comparing the observed phylogeny
with randomly generated phylogenies. In our approach, we sought an overall trend in
allele frequencies that could be compared directly to expectations under the serial
founder effect out of Africa. As shown by the linear relationship of heterozygosity
on distance from Africa, the effect of genetic drift is constant as populations move
away from Africa. Thus, by searching for a global pattern, we are able to identify
genetic drift caused by sequential subsampling during the migration as opposed to
genetic differentiation events that occur only between pairs of populations. We also
regressed average frequency of the risk alleles on distance. Average frequency of
risk alleles for type 2 diabetes, systemic sclerosis, and polycystic ovary syndrome
showed the highest correlations with distance, followed by pancreatic cancer and alopecia
areata.

Table 1. Correlation coefficient, r, of average heterozygosity and of average risk allele
frequency with distance from Africa, with p-values

No evidence for the thrifty gene hypothesis

The thrifty gene hypothesis was proposed by Neel [21] as a possible explanation of the high prevalence of type 2 diabetes despite the potentially
decreased reproductive fitness of those who have it. He argued that genes that allow
for rapid and efficient metabolism of food due to an overproduction of insulin, and
thus an increased risk for type 2 diabetes in the presence of certain diets, were
likely to be beneficial among hunter-gatherers when food was scarce.

Under the thrifty gene hypothesis, we would expect to see positive selection of type
2 diabetes risk alleles. Because the risk alleles under positive selection would increase
in frequency more than expected for neutral alleles, it is more likely that they will
remain in the migrating populations. Thus average heterozygosity should not decrease
in a pattern similar to that seen for microsatellites [8], which we consider “neutral”. In this study, we cannot distinguish the decrease in
heterozygosity of type 2 diabetes risk alleles from the decrease that is expected
due to Out-of-Africa migration. Further, under the “thrifty late” hypothesis, where
the risk alleles are considered not to have been beneficial until humans migrated
out of Africa [19], we would expect to see positive selection in the populations that are outside of
Africa. To test these hypotheses, we ran two regressions of average heterozygosity
of type 2 diabetes on distance from Africa: one excluding the 11 African populations
and one excluding all populations except the 11 African populations. In the former
case, the R² was 0.78 with a slope of –(9.598 × 10^-6) × (distance [km] from Addis Ababa), which is similar to that reported in our analysis
using all populations, as well as those in Li et al.’s [9] analysis of haplotype heterozygosity using SNPs and Ramachandran et al.’s [8] analysis of microsatellites. When average heterozygosity was regressed on distance
from Addis Ababa using only the African populations, the R² was 0.05, which suggests a random distribution of these risk alleles in Africa. The
average heterozygosity of type 2 diabetes risk alleles in Europe, Asia, and the Americas
decreases with distance from Africa with a similar slope and R² to that of SNPs that are not associated with disease. Although this does not indicate
positive selection on these alleles, it should be stressed that these SNPs are not
known to be causal for type 2 diabetes and our study does not include all SNPs known
to be associated with the disease.

Environmental variables explained more change in disease risk allelic frequency measures
than distance from Africa

For many of the diseases in this study, allelic frequency measures correlated more
closely with environmental variables than with distance from Africa. In particular,
for neuroblastoma, the R² was 0.001 for regression of average heterozygosity on distance, but was 0.58 for average
heterozygosity on latitude. SNP statistics for asthma, prostate cancer, and celiac
disease also showed much higher R² values for regressions on a single environmental variable than for regressions on
distance. To determine the variation of allele frequencies that was not due to drift,
we created adjusted statistics of average risk allele heterozygosity and average risk
allele frequency. For each population, we subtracted the average heterozygosity and
average allele frequency using all SNPs from the average heterozygosity and average
allele frequency of the risk alleles. For many diseases, the regression of these adjusted
statistics on environmental variables had a higher R² than the regressions of the adjusted statistics on distance from Africa.
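
The adjustment described in this paragraph amounts to a per-population subtraction. The following is a minimal sketch under that reading; the function name `adjusted_statistic` and the example values are hypothetical and not taken from the study.

```python
def adjusted_statistic(risk_values, genome_wide_values):
    """Per-population adjusted statistic.

    Each list holds one value per population, in the same order:
    risk_values        -> average over the disease risk SNPs
    genome_wide_values -> average over all SNPs (drift baseline)
    Returns risk minus genome-wide, population by population.
    """
    if len(risk_values) != len(genome_wide_values):
        raise ValueError("both lists need one value per population")
    return [r - g for r, g in zip(risk_values, genome_wide_values)]


# Hypothetical values for three populations (NOT from the study).
risk_het = [0.34, 0.31, 0.27]
all_snp_het = [0.30, 0.29, 0.26]

adj_het = adjusted_statistic(risk_het, all_snp_het)
print([round(v, 2) for v in adj_het])
```

The same function applies to average risk allele frequency; the adjusted values are then regressed on each environmental variable and on distance.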

Some of the environmental variables were correlated with each other and with distance;
the absolute value of the correlations ranged from 0.09, between winter solar radiation
and distance, to 0.84, between summer solar radiation and summer precipitation. The
correlation of latitude and summer radiation is 0.99, and for this reason, we only
included latitude in our analysis. Distance and longitude had a correlation of 0.74,
because most of the Out-of-Africa migration was across longitude lines.
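
A screen for near-collinear predictors of the kind described above could be sketched as follows. The 0.95 threshold, the variable names, and the toy values are assumptions made for illustration only; the source states only that latitude was kept in place of summer radiation because the two were correlated at 0.99.

```python
from itertools import combinations
from math import sqrt
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def highly_correlated_pairs(variables, threshold):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical per-population values (NOT from the study).
env = {
    "latitude": [5.0, 20.0, 35.0, 50.0, 65.0],
    "summer_radiation": [260.0, 240.0, 215.0, 190.0, 170.0],
    "longitude": [30.0, -70.0, 110.0, 10.0, -120.0],
}

# Assumed threshold of 0.95 for flagging redundant predictors.
pairs = highly_correlated_pairs(env, threshold=0.95)
for name_a, name_b, r in pairs:
    print(name_a, name_b, round(r, 2))
```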

Comparison with null distributions of neutral alleles showed significant relationships
for many environmental variables

To determine how our regressions for disease risk SNPs compared to regressions of
the same allelic measures for SNPs that were not chosen because of an association
with disease, we created 10,000 sets of SNPs, matching the number of risk-associated
SNPs and allele frequencies of the risk alleles. As with the disease risk alleles,
we regressed adjusted average heterozygosity and adjusted average frequency of the
resampled alleles on the nine environmental variables. The R² of the disease risk set was compared with the R² values
of the resampled sets, and an empirical p-value was calculated as the proportion
of the resampled sets that had a higher R² than the disease risk set (Fig. 2). Because the null distributions for each disease were created using different numbers
of SNPs with different global allele frequencies, we assumed the diseases were independent
of each other. We applied a Bonferroni correction for the 10 variables (nine environmental
variables and distance from Africa) and the two allelic statistics. We also report
significance adjusted for a false discovery rate of 0.2, which gives us more power
to detect a signal. With an FDR of 0.2, we get eight significant results, and with
a Bonferroni correction for the multiple environments we get four. Although we cannot
infer the mechanism, our results support some selection acting on the risk alleles
that has produced the observed relationship between an environmental variable and
the risk allele statistics.
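
The empirical p-value and the two multiple-testing adjustments can be sketched as below. Several details are assumptions rather than statements from the source: the significance level of 0.05, the count of 20 tests per disease (10 variables times two allelic statistics), and the use of the Benjamini-Hochberg procedure for the FDR, which the text does not name. All function names and values are hypothetical.

```python
def empirical_p_value(observed_r2, null_r2_values):
    """Proportion of resampled sets whose R^2 exceeds the observed R^2."""
    exceed = sum(1 for r2 in null_r2_values if r2 > observed_r2)
    return exceed / len(null_r2_values)


def bonferroni_significant(p_values, alpha, n_tests):
    """Flag p-values that pass a Bonferroni-corrected threshold."""
    return [p <= alpha / n_tests for p in p_values]


def benjamini_hochberg(p_values, fdr):
    """Benjamini-Hochberg step-up procedure (assumed FDR method).

    Returns a list of booleans in the original order of p_values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * fdr:
            cutoff_rank = rank  # largest rank that passes its threshold
    significant = [False] * m
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[i] = True
    return significant


# Hypothetical null R^2 values and p-values (NOT from the study).
null_r2 = [0.1, 0.2, 0.3, 0.5, 0.6]
p_example = empirical_p_value(0.45, null_r2)

p_values = [0.001, 0.04, 0.03, 0.5]
bonf = bonferroni_significant(p_values, alpha=0.05, n_tests=20)
bh = benjamini_hochberg(p_values, fdr=0.2)
print(p_example, bonf, bh)
```

Because the Bonferroni threshold is much stricter than the FDR threshold, fewer results survive it, in line with the four versus eight significant results reported above.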

Fig. 2. Null Distributions. Blue histograms represent the binned R² values for each of 10,000 sets of resampled SNPs regressed on an environmental variable.
Each resampled set contains random SNPs that match the number of risk alleles and
global allele frequency of the risk alleles for that disease. Red lines indicate values
of R², adjusted as in Methods, with 0.45 for type 2 diabetes on distance from Africa (a)
and 0.03 for celiac disease on longitude (b). Before adjustment, the R²
values were 0.69 for type 2 diabetes on distance from Africa and 0.13 for celiac disease
on longitude. The null distributions for these two diseases are different because
each null distribution is created using resampled sets that are matched for number
and global allele frequency of the risk alleles. Our analysis included 15 risk alleles
for type 2 diabetes and 26 for celiac disease

Although correlations with distance from Africa were not significant after Bonferroni
or FDR adjustments for any diseases, the R² values for the disease risk allelic statistics regressed on environmental variables
were significant when compared to the resampled sets for several diseases (Figs. 3 and 4). Summer humidity is significant for three diseases, and latitude, longitude, summer
temperature, and winter radiation each for one. Five of the six diseases for which
we report significant environmental correlations are autoimmune diseases or otherwise
related to immune function. To identify functional categories that are enriched in
certain environments, we ran DAVID [22], [23] to compare enrichment in the risk genes of diseases that showed significant correlation
with an environmental variable to the enrichment in all disease risk genes. We did
not find any significant results upon using the Bonferroni correction.

Fig. 3. P-values for average heterozygosity regressed on environment. P-values are calculated by comparing the R²
of the disease risk allele heterozygosities to the null distribution created using
10,000 resampled sets of SNPs matched for number of and global allele frequency of
disease risk SNPs. * Indicates significance after adjustment for an FDR of 0.2. +
Indicates significance after Bonferroni correction (see “Null Distributions” section
in Methods)

Fig. 4. P-values for average risk allele frequency regressed on environment. P-values are calculated by comparing the R²
of the disease risk allele frequencies to a null distribution created using 10,000
resampled sets of SNPs matched for number of and global allele frequency of disease
risk SNPs. * Indicates significance after adjustment for an FDR of 0.2. + Indicates
significance after Bonferroni correction (see “Null Distributions” section in Methods)

Analysis using Bayenv shows environmental adaptation for specific SNPs and diseases

We ran Bayenv 2.0 [24] to assess whether there was a signal of local environmental adaptation on the disease
risk SNPs in our study. Many disease risk alleles were significant with p-values < 0.05 in Bayenv (Fig. 5). Additionally, most of the disease/environmental variable combinations that we found
to be significant in comparison to the null distributions (see Methods and “Comparison
with null distributions of neutral alleles” section) had at least one risk allele
that was significant in Bayenv. This is consistent with the relationships we found between our
risk allele statistics and environmental variables.

Fig. 5. Enrichment of disease risk SNPs in the 0.05 empirical tail in Bayenv. Enrichment is
indicated by color. Permutations were carried out to determine whether the number
of SNPs with low p-values was more than expected given the total number of risk SNPs for each disease
(see “Enrichment of SNPs with low Bayenv p-values” section in Methods). A star indicates significance at p < 0.05 after Bonferroni correction

We permuted the SNP labels to determine whether certain diseases have more SNPs that
have undergone environmental adaptation than would be expected from a random set of
the same number of SNPs (see “Enrichment of SNPs with low Bayenv p-values” section in Methods), and several diseases showed environmental adaptation. Biliary liver
cirrhosis, alopecia areata, ulcerative colitis, Parkinson’s disease, Crohn’s disease,
systemic sclerosis, and asthma showed significant signals of environmental adaptation
for at least one environmental variable. Alopecia areata showed strong signals of
adaptation for the most environmental variables, including latitude, longitude, summer
temperature, winter temperature, and summer radiation. Interestingly, most of these
diseases are autoimmune, which suggests there is a strong environmental effect on
immune-related genes or diseases.
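
The label permutation described above can be sketched as a resampling test. This is an illustration under stated assumptions, not the procedure given in Methods: random sets of the same size are drawn from all risk SNPs, and the p-value is the fraction of draws with at least as many low-p-value SNPs as the disease set. The function name, the number of permutations, the seed, and the data are all hypothetical.

```python
import random


def enrichment_p_value(in_tail, disease_indices, n_permutations=10000, seed=0):
    """Permutation test for enrichment of low-p-value SNPs in one disease.

    in_tail:         list of booleans over all risk SNPs, True if the SNP
                     falls in the low-p-value tail for one environmental variable
    disease_indices: indices (into in_tail) of the SNPs for one disease
    Returns (observed_count, permutation_p_value).
    """
    rng = random.Random(seed)
    k = len(disease_indices)
    observed = sum(in_tail[i] for i in disease_indices)
    all_indices = range(len(in_tail))
    at_least = 0
    for _ in range(n_permutations):
        # Random set with the same number of SNPs as the disease set.
        sample = rng.sample(all_indices, k)
        if sum(in_tail[i] for i in sample) >= observed:
            at_least += 1
    return observed, at_least / n_permutations


# Hypothetical data (NOT from the study): 100 risk SNPs, 5 in the tail,
# all of which belong to the disease of interest.
in_tail = [True] * 5 + [False] * 95
disease = list(range(5))

observed, p_perm = enrichment_p_value(in_tail, disease, n_permutations=2000)
print(observed, p_perm)
```

The resulting p-values would then be Bonferroni-corrected across diseases and environmental variables, as noted in the caption of Fig. 5.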

Identifying effects of specific environments

Berg and Coop report that Crohn’s disease and ulcerative colitis show signals of adaptation
to principal components of summer and winter environmental variables. Our analysis
confirms these signals and may suggest which environmental variables, as opposed to
the principal components, drive this adaptation. In particular, Berg and Coop find
that Crohn’s disease has a significant correlation with their summer PC2, summer PC1
and winter PC1. Summer PC2 is loaded strongly on precipitation, humidity, and radiation,
and we find strong correlations for Crohn’s disease with summer precipitation and summer
humidity. Similarly, summer PC1 is loaded strongly on winter temperature, which is
correlated with Crohn’s disease in our analysis. Berg and Coop find that summer PC2
is also correlated with ulcerative colitis, and in our analysis ulcerative colitis
is correlated with summer radiation. When compared to the null distributions, none
of our correlations for Crohn’s disease or ulcerative colitis is significant after
FDR or Bonferroni adjustments, but our p-values are similar to those that Berg and Coop report.

Specific genes and environmental factors

Our results for the local environmental adaptation of disease risk SNPs led us to
examine gene annotations of SNPs that had significant p-values from Bayenv as well as R²
values that were significantly higher than those derived from the null distributions
created with sets of resampled SNPs.

Many genes that showed signals of adaptation in Bayenv [24], [25] and our analysis, and all genes that showed both signals with two or more environmental
variables, had functions related to the immune system. In some cases, one SNP was
associated with multiple environments and in other cases, multiple SNPs near the same
gene accounted for the environmental associations. Genes that were significant with
at least two variables include: BTNL2 for alopecia areata, NOD2 for Crohn’s disease,
PLA2R1 for membranous nephropathy, LMO1 for neuroblastoma, and TNPO3, UBE2L3, HLA-DRA,
and HIC2 for systemic lupus erythematosus.

For asthma, several genes were significant in both Bayenv and our analysis. SNPs located
within 10 kb of the DENND1B gene and SNPs within 10 kb of the CRB1 gene were significant
for summer humidity. The DENND1B protein is important in the innate and adaptive immune
response to previously encountered antigens. It is part of the signaling pathway for
inflammatory response, and is associated with moderate to severe cases of asthma [26]. SNPs within 10 kb of the RORA gene and SNPs within 10 kb of the IL2RB gene were
significant for summer radiation flux. SNPs near the RORA gene were also significant
for latitude, summer maximum temperature, and winter minimum temperature. RORA is
a component of the mammalian circadian clock [27], and is involved in lipoprotein metabolism [28] and lipid homeostasis in muscle cells [29]. These functions could help explain the signals of adaptation of this asthma-associated
gene with summer and winter temperature.

For prostate cancer, SNPs located within 10 kb of the KLK3 gene were significant for
summer radiation flux, winter humidity, and latitude in Bayenv. The KLK3 gene produces
prostate-specific antigen (PSA), a widely used biomarker for prostate cancer [30].