The population genetics of wild chimpanzees in Cameroon and Nigeria suggests a positive role for selection in the evolution of chimpanzee subspecies


Dataset preparation

We gathered fecal (n?=?247) and hair samples (n?=?223) from wild-living, non-habituated
chimpanzee populations from 36 remote regions spanning eastern Nigeria through southern
Cameroon (Figure 2 and Additional file 1). Most samples originating from the regions of Belgique (BQ), Boumba Bek (BB), Diang
(DG), Dja Biosphere (DB), Duomo Pierre (DP), Ekom (EK), Lobeke (LB), Mambele (MB)
and Minta (MT) were collected between 2004 and 2006 by the team of BHH and MP. Hair
and tissue samples from Nigeria and Cameroon were collected during a series of studies
from 1994 – 2010 by the team of MKG. We recruited a team of ~9 field assistants for
each mission and base camps were established in the vicinity of forests were chimpanzee
presence was reported by local hunters and villagers. We initially walked transects,
hunters’ paths or elephants paths in search of chimpanzees presence (e.g. night nests,
foot prints and vocalizations). The Campo Ma’an (CP), Cross River (CR), Deng Deng
(DD), Diang (DG), Dja Biosphere (DB), Douala-Edea (DE), Gashaka Gumti (GG), Mbam et
Djerem (MD) and Mount Cameroon (MC) sites are located in National Parks or forests
reserves, whereas the remaining field sites are located in unprotected forests. Fecal
samples were identified to be of likely chimpanzee origin by experienced trackers
and/or by the researchers. Fecal samples (15–20 g) were placed into 30 or 50 mL tubes
and mixed with equal amounts of RNAlater® (Ambion, Austin, TX). Hair samples were collected directly from abandoned sleeping
nests and a minimum of three hairs per nest were stored into glassine envelopes, and
kept dry in silica gel. Fecal samples were inspected to estimate their likely time
of deposition, whereas night nests age was estimated according to a leaf decay index
58]. Time, date, location, longitude, latitude and name of collector were also recorded.
Fecal and hair samples were generally kept at ambient temperature for no longer than
2 weeks and subsequently stored at ?20°C once back in Yaoundé, Cameroon. Samples were
shipped to the United States at ambient temperature, then stored at ?20°C upon receipt.
All samples were transported from Cameroon to the United States in full compliance
with Convention of International Trade in Endangered Species of Wild Fauna and Flora
(CITES) and Center for Disease Control (CDC) export and import regulations. This research
was carried out with IACUC approval from the University at Albany – State University
of New York.

We extracted fecal DNA using the QIAamp Stool DNA Mini kits (Qiagen, Valencia, CA)
following well-established protocols 59]. We briefly resuspended 1.5 mL of fecal RNAlater® mixture in stool lysis buffer and clarified them by centrifugation. We treated the
supernatants with an InhibitEx tablet, subjected them to proteinase K digestion, and
passed them through a DNA binding column. Bound DNA was eluted in 100 ?L elution buffer.
We extracted DNA from hair samples using a chelating resin protocol 60] followed by filtration using Microcon 100 columns (Millipore, Billerica, MA) to concentrate
DNA extracts. We conducted the preparation of the DNA quantification standards, reagents
and reactions according to the Quantifiler® DNA Manufacturer’s protocol 61]. We analyzed the data using a 7500 System v1.2.3 software as an ‘Absolute Quantification
(standard curve)’ assay with the settings recommended in the Quantifiler® Kit User’s
Manual 61].

We used primers L15997 (5?-CACCATTAGCACCCAAAGCT-3?) and H16498 (5?-CCTGAAGTAGGAACCAGATG-3?)
to amplify a 460 – 500-bp mtDNA fragment spanning the hypervariable D-loop region
(HVRI) in all samples that were newly collected for this study, using methods described
in previous studies 10],13],59]. We assembled and aligned the resulting sequences with SEQMAN DNASTAR software (Lasergene,
DNASTAR, Inc., Madison, WI), along with georeferenced sequences from previous studies
9],10],59],62]. We deposited all newly generated sequences from this study in GenBank under accession
numbers KM401682-KM401815.

We used twenty-one autosomal microsatellite loci to produce genotype profiles from
both hair and fecal samples. Additional file 14 lists information about the markers including the primers flanking the selected regions
and the fluorescent dye set (Applied Biosystems, Foster City, CA) chosen. These markers
were originally developed for use in humans 63], but have been shown to be highly informative for use in chimpanzee population genetics
studies 39],64]. We also included the Amelogenin locus to determine the sex of the individual sampled.
We divided these 22 loci into 4 multiplex PCR reactions that were performed using
the Quiagen Multiplex PCR Kit (Qiagen, Valencia, CA) in Eppendorf Mastercyclers (Eppendorf,
Westbury, NY). We used 0.5-1 ng DNA, along with Q-Solution (included in the kit).
We adopted the following PCR conditions: an initialization at 95°C for 15 min, followed
by 40 cycles including a denaturation step at 95°C for 30 sec, an annealing temperature
of 60°C for 1 min, an elongation at 72°C for 1 min and 30 sec, and a final extension
at 72°C for 10 min. For samples that failed to produce reliable genotype profiles,
we increased the amount of DNA template to up to 8 ?L per reaction, regardless of
the DNA quantitation results. Many of the hair samples from the Nigerian locations
had been typed previously for some of the loci selected, and were adjusted to the
differences in base pair sizes due to apparatus and protocol discrepancies 13],65]. All PCR reactions included negative control samples for quality assurance. We analyzed
each multiplex PCR product on an ABI 3130 capillary array genetic analyzer (Applied
Biosystems, Foster City, CA). We determined fragment sizes against a Genescan 600
Liz size standard (Applied Biosystems, Foster City, CA), and allele sizes using the
Genemapper ID version 2.7 software (Applied Biosystems, Foster City, CA). We scored
alleles between three and 6 times to avoid problems associated with allelic dropout,
which frequently occurs when genotyping low-yield DNA samples 66]. Samples that did not include at least 15 (70%) or more loci after multiple attempts
at PCR fragment amplification were excluded from this study.

We subjected all 21 autosomal microsatellite loci to outlier tests using LOSITAN 67]. LOSITAN is a Java based selection detection platform, based on the fdist FST outlier methods 68]. We ran 1,000,000 simulations of the data while (i) assuming a stepwise mutation model, (ii) assuming three populations, (iii) forcing a correct mean FST, and (iv) calculating a “neutral” mean FST. We also subjected the 21 autosomal microsatellites, in addition to the mtDNA HVRI
sequences, to an exact test of Hardy-Weinberg equilibrium 69] using the Arlequin version 3.5 software package 37]. We tested deviations from Hardy-Weinberg equilibrium for each locus against 1,000,000
random permutations of the data.

We calculated pairwise estimates of relatedness for all samples that produced reliable
microsatellite genotypes using the software program COANCESTRY 23]. COANCESTRY implements three inbreeding estimators and seven relatedness estimators 24],70]-76] to estimate relatedness between individuals and inbreeding coefficients from multi-locus
genotype data. We tested 95% confidence intervals for relatedness and inbreeding estimates
for all genotyped individuals against 1000 bootstrap permutations of the data. Samples
estimated to come from the same individual or close relatives, with a relatedness
index 24] above 0.75, were excluded in order to remove all pairs of individuals down to first
cousins from all further data analyses.

mtDNA diversity analysis

We generated haplotype networks for HVRI mtDNA sequences via the median-joining algorithm
of Network 4.5 (http://www.fluxus-engineering.com). Because it allows for reticulation, the median-joining approach for inferring haplotype
relationships is appropriate for the analyses of mtDNA control region sequences, which
exhibits high levels of homoplasy in humans 77],78]. We identified hypermutable sites by post-processing using the Steiner maximum parsimony
algorithm within Network 4.5 and were excluded from the network analyses.

We performed an AMOVA using Arlequin version 3.5 software package 37]. We tested population differentiation values against 10,000 random permutations of
the data. For the AMOVA, we grouped individuals according to their origin either north
or south of the Sanaga River. In addition to the AMOVA, we ran a spatial AMOVA (SAMOVA)
at k?=?2, to confirm the groupings used in the AMOVA, as well as to test how geospatial
partitioning of the sampled populations affected population differentiation 28].

We generated mismatch distributions by computing distribution of the number of pairwise
differences between mtDNA haplotypes using Arlequin version 3.5 software package 37]. We generated distributions by grouping individuals into either two or three separate
groups (as identified by microsatellite cluster analysis), using 100 bootstrap replicates.

Microsatellite genotype analysis

We examined the relationship between genetic differentiation and geographic distance
by carrying out partial Mantel tests on the microsatellite data using the Arlequin
version 3.5 software package 37]. Pairwise FST values were generated between all sampling locations and genetic differentiation
was plotted against straight-line, geographic distance. We fit the data with a linear
regression, calculated a correlation coefficient and tested for statistical significance.

We used the EIGENSOFT software package 79] to perform PCA on individual genotypes to identify individuals that could be grouped
into significantly different populations. We converted microsatellite data into a
false SNP format by scoring the presence or absence of each of n-1 alleles (where
n is the number of alleles in the sample) using a script in MATLAB (The MathWorks,
Natick, MA) described previously 39]. We processed this file in SmartPCA, which produced eigenvectors and eigenvalues.
We conducted this analysis blindly to a priori population labels for individuals in the dataset. We tested the statistical significance
of each eigenvector using Tracy-Widom statistics. Each significant eigenvector recovered
by this PCA approach separates the samples in such a way that the first and subsequent
eigenvectors distinguish, in order, the most to least differentiated populations in
the sample 79]. All analyses using EIGENSOFT were performed blinded to a priori population labels.

We examined population structure and individual ancestry using a Bayesian clustering
approach implemented in the TESS (version 2.3) software package 30]. This program estimates the shared population history of individuals based on their
genotypes and geographic origin under a model that assumes Hardy–Weinberg equilibrium
and linkage equilibrium, thereby making no a priori assumptions regarding population classifications. TESS estimates individual proportions
of ancestry into K clusters, where K is specified for the program in advance across
independent runs and corresponds to the number of putative ancestral populations.
The program then assigns admixture estimates for each individual (Q) from each inferred ancestral population cluster. It also produces posterior predictive
values of Q for geographic areas located in between and around the individual data points.

TESS runs were performed: (i) with a model that allows individuals to have ancestry in multiple populations (CAR
model 80]); (ii) with correlated allele frequencies; and (iii) blinded to a priori population labels. Runs were performed with a burn-in step of 500,000 Markov Chain
Monte Carlo (MCMC) iterations and 1,000,000 MCMC iterations. Fifty runs each for K?=?2
through K?=?8 were performed for all datasets. We processed TESS outputs with CLUMPP
81] and a G-statistic 99% was used to assign groups of runs to a common clustering pattern.
CLUMPP outputs for each K value were plotted with DISTRUCT 82]. We used a combination of methods to infer a maximum number of chimpanzee populations
(KMAX) including, (i) the K value that had the highest average DIC 31], (ii) high stability of clustering patterns between runs, (iii) the KMAX value at which KMAX?+?1 no longer split the cluster distinguished by KMAX83], (iv) correspondence between maximum PPD values from TESS runs and significant eigenvectors
recovered by PCA, and (v) calculating an ad hoc statistic, ?K 32].

We created spatial interpolations of the Q matrix using the option in TESS to generate posterior predictive maps of admixture
proportions. We created a template map of the extent of probable chimpanzee ranges
in Nigeria and Cameroon in ArcMap Version 10 (ESRI Corp., Redlands, CA) in the ASCII-raster
format and input into TESS in order to generate posterior predictive maps of admixture
proportions. Using these predictive maps, we generated spatial interpolations of the
Q matrices by implementing a matrix based vector algorithm in the program TESS Ad-Mixer
33]. These spatial interpolations use color mixing to predict admixture proportions across
space in order to better visualize the spatial partitioning of genetic differentiation.

We calculated allele size range, observed and expected heterozygosity, and an M-Ratio
84] for each locus for three populations as identified by cluster analysis. We tested
these values against 10,000 random permutations of the data using the Arlequin version
3.5 software package 37].

We calculated three measures of population genetic differentiation using the Arlequin
version 3.5 software package 37]: D2, RST and ??2. The D234] genetic distance is based on a model in which genetic drift is the only force influencing
allele frequency differences across populations and is sensitive to recent differentiation
events. RST35] and ??236] are similar to D2, but both assume a stepwise mutation model (SMM). Consequently, RST and ??2 are more likely to capture whether differences in the mutation processes are important
in driving population differentiation and enable assessment of sensitivity to mutation
model assumptions 83]. These latter models differ in that RST is based on the fraction of the total variance in allele size between populations
and is analogous to FST35], whereas ??2 is based on differences in the means of microsatellite allele sizes 36]. Recent work has shown convincing evidence that the loci typed for this study appear
to follow the SMM in both chimpanzees and bonobos 64],85]. D2 calculations were completed on untransformed allele size calls. Since RST and ??2 assume the SMM, we transformed allele sizes to repeat size units prior to analysis
in Arlequin version 3.5 37]. We transformed allele sizes such that the smallest allele for each locus was scored
as n and each subsequent allele was scored as n?+?1. In infrequent cases where repeat
unit sizes did not follow the n?+?1 model, and instead repeat units skipped x repeat(s),
we scored the next allele in the data as (n?+?× +1), as described in a previous study
39]. We determined each pairwise genetic distance calculation by 100,000 replications
in Arlequin, and the significance of these pairwise population genetic distances were
evaluated by a significance test at p??0.05.

We calculated distributions of alleles within and between populations using ADZE 86]. ADZE is a generalized rarefaction approach that counts alleles private to populations
as well as combinations of populations. We used this program to infer (i) the mean number of distinct alleles per locus, per population; (ii) the mean number of private alleles per locus, per population; and (iii) the mean number of uniquely shared private alleles per locus, between populations.

Population history

We used the program IMa 38] to estimate: (i) the population mutation parameter, ?, for both descendant populations (?1 and ?2) and an ancestral population (?A); (ii) rates of migration from population 1 to population 2(m1) and from population 2 to population 1 (m2); (iii) and time of population divergence (t) for all possible pairs of the three assumed populations: P. t. ellioti in western Cameroon Nigeria, P. t. ellioti in central Cameroon, and P. t. troglodytes in southern Cameroon. We used IMa instead of the more recently released IMa2 40] for several reasons. The dataset is composed mostly of microsatellites with poorly
characterized mutation rates 39],64],87]. Also, due to the complexity of the IMa2 model, the high number of microsatellite
alleles, and few number of microsatellite loci, we used the version of the IM model
with the fewest assumptions. Additionaly, IMa2 requires the user to input a well resolved,
rooted phylogeny of the populations to be tested, and based on the results of the
cluster analysis, it was counterproductive to swap through an already well resolved,
and simple phylogeny.

We ran multiple iterations of IMa in parallel, for each paired population, using a
multi-processor computer cluster located at the SUNY College of Nanoscale Science
and Engineering. First, we ran the IMa analysis using “M-Mode” (MCMC Mode) with a
full complement of model parameters, and a broad range of priors for all parameters
(?1, ?2, ?A, m1m2, and t). Each run was performed with heated chains using the two-step scheme 88]. For each run, we ran burn-in replicates for a total of five days. We then reduced
the ranges of the model parameters over repeated runs in order to sample more densely
their respective posterior distributions for each of the three population comparisons.
This resulted in different lengths of analysis for each comparison: the western Cameroon/central
Cameroon comparison generated 1,004,512 genealogies using 58 chains; the western Cameroon/southern
Cameroon comparison generated 684,192 genealogies using 40 chains; and the central
Cameroon/southern Cameroon comparison generated 947,723 genealogies using 38 chains.
After the runs converged toward the same values, we loaded the saved genealogies from
multiple runs into “L-Mode” (Load Trees Mode) in order to calculate values for model
parameters for each pair of populations pooled across these multiple independent runs.

We converted the migration parameters, m1 and m2, into the number of migrants exchanged per generation (Nm) using the equation Nm?=?(?*m)/2. We converted the demographic parameter (t) into years by scaling by mutation rate (t/?). We scaled the t parameter for mtDNA assuming a mutation rate (?) of 1.64 × 10?789]. We scaled the t parameter for microsatellite loci using several microsatellite mutation rates, given
the uncertainty in microsatellite mutation rates in the literature 36],39],64],87]. We used a slow mutation rate of 3.53 × 10?564], an intermediate mutation rate of 7.75 × 10?5 (calculated from the geometric mean of rates from Wegmann and Excoffier 87]), and a fast mutation rate of 1.6 × 10?436],39]. We scaled final values for demographic parameters assuming a 20-year generation
time for chimpanzees 90]. We scaled effective population sizes (NE) using ? and a per generation ? using the equation NE?=??/(4*?*20).