A simple method to separate base population and segregation effects in genomic relationship matrices

Material

In total, 7965 genotyped Fleckvieh (FV) and 4257 genotyped Brown Swiss (BS) and 143
genotyped Original Braunvieh (OB) bulls were available for this study. BS and OB data
were combined (hereafter called BS/OB, n?=?4400) into a single dataset because these two subpopulations actually originated
from a single breed. The term Brown Swiss is used to denote the modern Braunvieh,
which resulted from an exchange of genetic material between Europe and North America.
An OB animal is genetically characterized as a descendant of the old European Braunvieh
population, with no or only minor genetic contributions from the reimported US Brown
Swiss population. This labelling of OB animals within the European Braunvieh population
is not necessarily applied in a uniform manner and small differences in the definition
can occur between countries.

All animals were genotyped with the Illumina BovineSNP50 BeadChip (Illumina, San Diego,
CA). After removing SNPs with low call rates (90 %), minor allele frequencies less
than 2 %, or with a deviation from Hardy-Weinberg equilibrium with P??10
?5
, 37?718 and 41?254 SNPs were retained for the BS/OB and FV datasets, respectively.
Available pedigrees for genotyped animals included 7802 and 16?357 records for the
BS/OB and FV breeds, respectively. BS/OB base animals were assigned to nine groups
(Table 1) according to origin and date of birth. Since the genetic distances between German,
Austrian, Italian and Swiss BS base animals born before 1960 were small (results not
shown), they were combined into one base group called EU
b
. Base FV animals were assigned to 11 groups with nine groups assigned according to
origin and date of birth and two groups assigned to the Red Holstein breed (Table 2).

Table 1. Number of animals per defined base group for the BS/OB population

Table 2. Number of animals per defined base group for FV

We estimated DGV for three milk traits and three conformation traits from a dataset
that was reduced for the last four years of phenotypic data (referred to as the reduced
dataset). Daughter yield deviations (DYD) from the German-Austrian system 22] were used for FV bulls and deregressed MACE (multi-trait across country evaluations)
proofs from Interbull 23] for BS/OB bulls. Deregression was done using the method proposed by Garrick et al.
24]. Group effects were not accounted for in the deregression. Traits analyzed were milk
yield (MY), protein yield (PY), fat yield (FY), stature (STA), feet and legs (FL)
and udder conformation (UD). These traits were a priori assumed to have a large genetic trend and/or to show considerable differences between
base groups. DGV estimated from the reduced dataset were then compared to DYD and
deregressed proofs from the corresponding April 2014 evaluations (current dataset)
according to the guidelines of the Interbull GEBV test 25], 26]. In short, the validation group included bulls with no information on the offspring’s
performances in the reduced dataset but corresponding information in the current dataset.
Current information was assumed to be sufficient for the test when the effective daughter
contribution (EDC) 27] based on offspring performances was equal to at least 20. The remaining bulls from
2010 with an EDC of at least 1 were included into the training set (Calib).

Technically, we tested DGV by a weighted regression of current DYD or deregressed
proofs of the animals in the validation group on their DGV estimated from the reduced
set. The resulting test statistics are the intercept and slope (b) of this regression
as measures of bias and the coefficient of determination (R
2
) of this regression as a measure of the reliability of the DGV. The R
2
values were corrected for the uncertainty in DYD, as proposed by 28], i.e. they were divided by the average reliability of the DYD of validation bulls.

For presentation of results, we divided the animals of the validation group into different
sub-groups. FV validation animals were assigned to two groups: animals from Germany-Austria
(DEA) and others. BS validation animals were also divided into DEA and others, and OB validation animals were assigned to a third validation group (OB). Numbers of animals included in each validation group are in Table 3. The assignments of validation animals to origins used in this investigation for
the purpose of illustration were mainly based on ISO country codes 29] and do not necessarily correspond to assignments based on analyses of genetic contributions
from base groups.

Table 3. Number of animals per validation group for the BS/OB and FV populations and the seven
traits considered

Decomposition of G

Assume a common scenario in genomic prediction with n animals genotyped for m biallelic SNPs. Information on genotypes is collected in an n x m matrix C, using numerical coding that denotes the number of copies of the arbitrarily defined
reference allele (0, 1, 2). Let pT
be the vector of estimated allele frequencies at the m SNPs, which for each SNP j were derived from genotyped animals.


p^j=?i=1nCij2n
(1)

A genomic relationship matrix GT
can be calculated and used in GBLUP using these “current” allele frequencies as:


GT=MM’?j=1m2p^j1?p^j,
(2)

where M is an n x m matrix of recoded genotypes, for which each row (= animal) i of the matrix of numerically coded genotypes C is manipulated in the following manner 30]:


Mi=Ci?1?2pT?0.5.
(3)

Conceptually, this manipulation is equivalent to column-wise centering of C if current allele frequencies are used and if each marker is in Hardy-Weinberg equilibrium
in the genotyped population.

Assume a subdivision of the genotyped population into g groups that systematically differ in allele frequencies, as indicated for example
by sufficiently high F
st
values 31], 32]. Define a g x m matrix P of group-specific allele frequencies that are derived by applying Equation (1) within
each group. Using these group-specific allele frequencies, the vector of genotypes
for each animal can then be centered by applying Equation (3) using the allele frequencies
of the group that it is assigned to. Thus, for animal i assigned to group k with group-specific allele frequencies pk
, the corresponding row in C is manipulated as:


Mi*=Ci?1?2pk?0.5.

A G-matrix corrected for specific allele frequencies for different groups can then be
calculated as:


GS=M*M*’?j=1m2p^j1?p^j,
(4)

with the same denominator as in Equation (2), which is equivalent to expressing this
part of the covariance relative to the overall covariance. The discarded component
of the original covariance structure, which is caused by differences between group
allele frequencies and overall frequencies, can be summarized in a matrix GA
. Treating 2P as a matrix of average “genotypes” of groups, a matrix
M˜

is calculated by manipulating each group’s row g as follows:


MËœg=2Pg?1?2pT?0.5.

Finally, GA
is calculated as
M˜M˜’

divided by the same denominator as in Equations (2) and (4). The g x g matrix GA
can be treated and analyzed in the same manner as the standard G-matrix. It can be expanded to give an n x n matrix GA*
based on:


GA*=QGAQ’,

where Q is the matrix of genetic contributions of each base group to each animal, which can
be calculated as:


Q=TQ*,

where T is a lower triangular matrix that results from decomposing A into TDT’, as described in 33], and Q*
is an n x g design matrix that assigns genotyped animals to groups. Despite this increase in
dimensions, GA*
still has rank (g – 1). Also, note that:


GT=GS+GA*.
(5)

Although this decomposition is straightforward, its dependency on the current allele
frequencies and the grouping of current animals causes some problems due to ambiguous
genetic composition and might not be feasible under practical conditions since new
genotypes have to be successively integrated into the system. To circumvent this problem,
we propose to replace the current allele frequencies with estimates of base allele
frequencies using the estimation procedure developed by Gengler et al. 21]. Using a pedigree that relates genotyped animals to a set of arbitrarily defined
but usually ungenotyped base animals and calculating the conventional relationship
matrix A, the vector of overall base allele frequencies is calculated as a generalized least
squares mean by solving the following equation for each marker j (column of C):


pj*=0.51’A?11?11’A?1cj.
(6)

Similar to conventional estimation of GBV, base animals can be grouped according to
known or assumed population subdivisions and/or generations, when additional differentiation
due to considerable genetic trend has to be taken into account. To estimate base group-specific
allele frequencies, matrix 1 in Equation (6) is replaced by matrix Q. Matrices G T
, GS
and GA*
can then be calculated as described above, using estimates for global and group-specific
base allele frequencies and again GT
?= GS
?+ GA*
, as described above.

Models

In order to study the influence of different definitions of base group on the quality
of prediction, we examined several models. The general model is a standard mixed animal
model with:


y=Xb+Zu+e,

where y is a vector of DYD or deregressed proofs of genotyped animals, b is the vector of fixed effects, u is the vector of random animal effects, incidence matrices X and Z relate observations to levels of b and u, respectively, and e is the residual effect. Furthermore, it is assumed that y?~?N(Xb, V yy
), e?~?N(0,V e
) and u?~?N(0,V uu
), with V yy
?=?V uu
?+?V e
, V e
is diag(1/w)*?
2e
, where w is a vector of weights. The models to be compared are defined in the following.

Standard model (model 0, M0): X?=?1 and V
uu
?=?G
T
?×?? u2
.

Model 1 (M1): X?=?1 and V
uu
?=?G
S
?×?? u2
.

Model 2 (M2): X?=?[1 | Q] and V
uu
?=?G
S
?×?? u2
.

Note that M2 is equivalent to a model that fits standard fixed group effects 34]. Although genomic relationships corrected for unequal base allele frequencies (GS
) are used in M2, it can be shown by least-squares theory that the solutions are identical
to a model that uses GT
, if the same matrix Q is used to estimate the base allele frequencies and to model the fixed group effects
(see Appendix 1). Finally, it can be shown that using the standard genomic relationship
matrix G T
in standard GBLUP (standard model, M0) in the presence of base groups that differ
in allele frequencies gives solutions equivalent to the use of a more specific model
with genetic groups as random effects and equal variances for the base group and the
segregation effects (see Appendix 2), as in the following representation:


X’XX’ZX’QZ’XZ’Z+GS?1Z’QQ’XQ’ZQ’Q+GA??1b^u^g^=X’yZ’yQ’y,

where ??=?? u2
/? e2
and the final estimate for the breeding value is ?=?Q??+?û. We calculated solutions for the standard model using this more specific model, which,
in addition, allowed us to derive estimates for group effects and their prediction
errors.

Models were tested in forward prediction by means of the test described in the sub-section
Material. To better understand the factors that influence the predictive ability of
a specific model for different validation datasets, we analyzed the matrix of base
group contributions (Q) and derived base group estimates, as well as their prediction errors, using M0 and
M2. Differences between group effect estimates were calculated and tested by formulating
linear hypotheses.

Distance measures

We calculated F
st
statistics to illustrate the effects of the proposed decomposition of G. F
st
is a standard measure of genetic distance and can be calculated either by pairwise
analysis of differences in allele frequencies between known or assumed subpopulations
or breeds 18], or by direct calculation from relationship matrices 6] as:


Fst=f˜?f¯1?f¯,

where
f˜

is the mean coancestry over all subpopulations and
f¯
is the average coancestry within a given subpopulation. The term
1?f˜
is the average diversity (heterozygosity) and depends on the coancestry within the
given subpopulation. F
st
values are primarily used as a tool to visualize substructures within groups of animals
6], 10], 35]. An F
st
value of 0.05 can be interpreted as a strong indication of a relevant subdivision
31], 32].