Genetic determination of height-mediated mate choice

Genotype quality control

We obtained data for 152,736 individuals genotyped in phase 1 of the UK Biobank genotyping
programme. These comprise 49,979 individuals selected as part of the BiLEVE study
based on their lung function phenotypes 17], 102,750 individuals selected at random amongst the remaining UK Biobank participants,
and seven individuals with missing information, who were removed from further analyses.
Genotypes were assayed using two platforms, the Affymetrix UK BiLEVE Axiom array for
the BiLEVE cohort and the Affymetrix UK Biobank Axiom array 18] for all remaining individuals. The data consist of genotype calls for 847,441 markers,
approximately 95 % of which are present on both genotyping platforms employed. Details
regarding the genotyping procedure and genotype calling protocols are provided elsewhere
19] and in the following we only summarise any subsequent QC and processing performed.
We excluded individual markers from further analysis if they were multi-allelic, their
overall missingness rate exceeded 2 %, or if they exhibited a strong platform-specific
missingness bias (Fisher’s exact test, P??10
?100
). Individuals were excluded from further analysis if they exhibited excess heterozygosity,
as identified by UK Biobank internal QC procedures 19], if their missingness rate exceeded 5 %, or if their self-reported sex did not match
genetic sex estimated from X chromosome inbreeding coefficients. These criteria resulted
in a reduced dataset of 151,532 individuals. Finally, we filtered out markers exhibiting
a departure from Hardy–Weinberg equilibrium (P??10
?50
) or with minor allele frequency below 0.05 within the subset of couples identified
as ethnically White-British as described below, which left 318,852 SNPs for analysis.
The genotype QC was performed using PLINK 20].

Ethnicity

The UK Biobank cohort includes individuals of diverse ethnicities that may confound
analyses. We therefore identified a core subset of 123,847 individuals of White-British
ethnicity by combining self-reported and genotype information. Specifically, we performed
a principal components analysis (PCA) of all individuals passing genotypic QC using
an LD-pruned set of 99,101 autosomal markers (http://biobank.ctsu.ox.ac.uk/crystal/refer.cgi?id=149744) that passed our SNP QC protocol. Amongst individuals who self-reported their ethnicity
as White-British, we then retained individuals for whom the projections onto the leading
20 genomic PCs fell within 3 standard deviations (SD) of the mean.

Couples

Using household sharing information we identified a set of 105,381 households with
exactly two members in the cohort that we considered to be couples. For 94,651 out
of those 105,381 households, both residents report the same household size and relationship
to other household members to be ‘Husband, wife or partner’ or both ‘Husband, wife
or partner’ and ‘Son and/or daughter (include step-children)’. Hence, for ~90 % of
the pairs we have additional confirmatory information that these were couples. Our
univariate and bivariate analyses included only those couples whose coefficient of
relatedness (r) was less than 0.0625, of which only seven pairs had r 0.025.

Of those 105,381 identified couples, we used 13,068 White-British couples and 3,726
mixed-race couples (where one member of the couple was classified as White-British
and the other as non White-British) that had been genotyped in phase 1. The bulk of
the analyses in the paper were performed on the 13,068 White-British couples. Our
predictions of the partner’s height based on the individual genotype were performed
using 15,437 couples where only one of the partners had been genotyped in phase 1.
For this, we predicted the total additive effect of height for the person genotyped
and estimated the correlation (i.e. the prediction accuracy) of that ‘polygenic score’
with the height of their partner. In this case, and because we could not confirm that
the partners were unrelated using genotypes, we set additional filtering criteria.
We used only pairs where both individuals were self-reported White-British; the genotyped
person was classified as White-British based on genotype; individuals reported different
ages for one or both parents; and individuals had an age difference of less than 10 years,
were of opposite gender, and reported to live with their partner or partner and children.

Phenotype quality control

We defined outliers as males and females that were more than 3 SD from their gender
mean, and removed them from the analyses.

Model fitting

The PCA and all mixed linear models were fitted using DISSECT 21], a software tool designed to perform genomic analyses on large volumes of data in
a high performance computing (HPC) environment without the need to perform mathematical
approximations. DISSECT is an open access software that can be downloaded from our
dedicated web site (http://www.dissect.ed.ac.uk) under a GNU GPL v3 license. The availability of the software and access to the UK
National Supercomputer (ARCHER) allowed us to fit these computationally intense analyses
to the large dataset.

Univariate mixed linear model

We fitted the following univariate mixed linear model:

where ? is the mean term and e i
the residual for individual i. L is the number of fixed effects, x il
being the value for fixed effect l for individual i and ? l
the estimated effect for l. M is the number of markers and z ij
is the standardised genotype of individual i at marker j. The vector of random SNP effects a is distributed as N(0, I? u2
). The phenotypic variance-covariance matrix is var(y) =?V?=?ZZT
?
u2
?+?I?
e2
. The SNP effects are estimated using the equation 16]:

Because ?
j=1M z ij a j
is the total additive genetic effect (g i
) for individual i, this model can also be expressed as,

In this model, the vector of genetic effects g is distributed as N(0,?A?
g2
). Where A is the genetic relationship matrix and ? g2
?=?M? u2
. Accordingly, the total phenotypic variance-covariance matrix is var(y)?=?V?=?A?
g2
?+?I?
e2
. From the equivalence between these two models, DISSECT can estimate the total additive
effect from the equation:

DISSECT estimates ? g2
and ? e2
using the expectation maximization (EM) method for the first step 16], followed by AI REML method steps 22], 23].

Bivariate mixed linear model

We fitted the following bivariate mixed linear model 24]:

where ?i
is a vector of equal mean terms and ei
the vector of residuals for the trait i. Xi
is the incidence matrix of the fixed effects ?i
for the trait i. gi
is the vector of the individuals’ genetic effects for the trait i with covariance matrix:

where Ai
is the genetic relationship matrix between the individuals measured for trait i and Aij
the genetic relationship matrix between the individuals measured for trait i and trait j. are the genetic variance for trait 1, the genetic variance for trait 2 and the genetic
covariance between the two traits, respectively. The phenotypic covariance matrix
(V) is,

where are the environmental variance for trait 1, the environmental variance for trait
2 and the environmental covariance between the two traits, respectively. I is the identity matrix and I12
is a matrix where the elements in row i and column j are 1 if the individual i for the trait 1 is the same than the individual j of the trait 2 and 0 otherwise. As in the univariate case, DISSECT fits the variances
and covariances using the expectation maximization (EM) method for the first step
16], followed by AI REML method steps 22], 23].

Estimation of heritability before assortative mating

The heritability of height, before the population started assortative mating and reached
an equilibrium (h02
), was estimated as , where h2
is the current h2
(assumed to be 0.8), and m is the correlation of breeding values among mates 25].

Permutation based analysis

We swapped the male individuals between pairs of couples where both the male and female
where of similar height. This was achieved by ordering couples by both female and
male heights, and swapping the male individual between pairs of successive couples,
i.e., male 1 with 2, 3 with 4, and so on.

We first confirmed that this approach removed the genetic structure arising due to
assortment by geography by regressing a couple’s relatedness on the distance of birthplaces
(Fig. 1). Furthermore, we examined some of the available covariates that we thought might
be related to social and geographical structure and which individually explained more
than 0.5 % of variation in height, excluding covariates specific to only one sex and
‘Comparative height size at age 10’. For continuous covariates we computed the between
partner Pearson’s correlations for the observed and permuted couples. For categorical
covariates we examined the mutual information between the partner’s covariates as
a measure of their dependence, computing the p value for the null hypothesis of zero mutual information from 1,000 permutations
of the female covariates. We found that our permutation approach severely reduced
dependence between partners in all variables (Additional file 1: Table S3). Although we found statistically significant associations for all but
one variable in the observed data, after permutation, associations for all but three
variables where not significantly different from zero. This was so despite the large
sample size, which would allow us to detect even very small associations. For the
three variables with statistically significant associations in the permuted data,
examination of histograms (Fig. 2) of the observed and permuted data did not suggest the presence of any strong remaining
structure.

Expected heritability of assortative mating driven by phenotype

Let us assume a standardised phenotype (P) which follows the standard additive genetic model, i.e., P i
 = A i
 + E i
, where A i
and E i
are the additive genetic effect and environmental component for individual i, respectively. If P drives mate selection the phenotype of the partner m(i) for a given individual i can be expressed as P m(i)
 = bP i
 + E m(i)
. However under the assumption of an ante-dependence model A i
 ? P i
 ? P m(i)
for selection, where the additive genetic effect A i
influences the phenotype P i
which in turn influences the choice of i’s partner’s phenotype, we have cov(A i
,?P m(i)
) = 0. Hence

with the genetic component being b2var(A i
) = b2h P2var(P i
) where h P2
is the heritability of P. Since for a standardised phenotype b?=?r p
where r p
is the phenotypic correlation and var(P i
) = var(P m(i)
), we have that is the expected heritability of partner’s phenotype.

Now assuming a heritability for height of h2
?=?0.8 and r p
?=?0.26, the expected heritability for choice of mate’s height is 0.043.

Ethical approval

The use of the UK Biobank dataset falls within the study’s ethical approval from the
North West Medical Research Ethics Committee (Reference 11/NW/0382).

Availability of supporting data

The data can be accessed through the UK Biobank (http://www.ukbiobank.ac.uk).