Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to iDASH healthcare privacy protection challenge


Let’s refer to the case group’s data and the control group’s data collectively as
a database and call the data for an individual a record. We can think of the number
of cases, R, and the number of controls, S as fixed. Recall that we assume the control group’s data are known to the public.
Therefore, for a given genotype table, we assume that s0, s1, and s2 are fixed. Then the ?2 statistic can be written as a function of r0 and r1. How the value of the ?2 statistic changes when we change one record in the database is illustrated in Figure
1. In Figure 1, each dot represent a value of the ?2 statistic given r0 and r1. When we change one record in the case group, there are 6 possible changes to the
genotype table: (r0 ? r0 + 1, r1 ? r1), (r0 ? r0 + 1, r1 ? r1 ? 1), (r0 ? r0, r1 ? r1 ? 1), (r0 ? r0 ? 1, r1 ? r1), (r0 ? r0 ? 1, r1 ? r1 + 1), and (r0 ? r0, r1 ? r1 + 1); that is, r0 and r1 cannot increase or decrease by 1 simultaneously. The possible changes are shown as
arrows in Figure 1. A change in the genotype table results in a change in the allelic table, and we
get a new value for the ?2 statistic based on the new allelic table.

Figure 1. Legal moves in the space of genotype tables with fixed R, S, s0, s1, and s2.

Let p* denote a pre-specified threshold p-value and let c denote the ?2 statistic corresponding to p?, the p-value of the ?2 distribution with 1 degree of freedom. Then for a given SNP in the pool of candidate
SNPs, the genotype table of which we denote by D, the shortest Hamming distance is the smallest number of sequential changes made
to D such that the resulting genotype table, D’, satisfies YA(D’) ? c if YA(D) c and YA(D’) c if YA(D) ? c; that is, if we call c the significance threshold, then the goal is to make changes to the “insignificant”
(“significant”) table D so that the ?2 statistic of D’ goes above (below) the significance threshold c, and D’ becomes a “significant” (“insignificant”) table. The Hamming distance score is defined
as h = (shortest Hamming distance) ? 1 if YA(D) ? c and h = ?(shortest Hamming distance) if YA(D) c.

Let’s consider the space of genotype tables, , defined by a genotype table D: for all , D’ shares the same values of s0, s1, s2, R, S, and N with D, but D’ does not necessarily have the same values of r0, r1, and r2 as D. Let 10 = 2s0 + s1 denote the number of major alleles in the control group, and let x = 2r0 + r1 denote the number of major alleles in the control group, then we can write the ?2 statistic of a genotype table as a function of x:

where r0, r1 and x are derived from D’, and 10, R, S and N are the same for D and Dl. For notational convenience, when r0 and r1 are also derived from D, we will simply write the ?2 statistic as YA(D).

Lemma 1 YA is an increasing function of × when xS ? n10R 0, and it is a decreasing function of × when xS ? n10R 0.

Proof. [see Additional file 1].

Additional file 1. Proofs

Format: PDF
Size: 151KB Download file

This file can be viewed with: Adobe Acrobat Reader

To understand the implication of Lemma 1, let’s consider Figure 2. In Figure 2, each dashed line has slope = ?2, representing a value of x, which is defined as x = 2r0 + r1. Because we can consider each dot in Figure 2 to be a unique genotype table in the space of genotype tables with fixed control
data and a fixed number of cases, those tables that lie on the same dashed line will
have the same value of ?2 statistic. Furthermore, because 0 ? r0 + r1 ? R, r0 ? 0 and r1 ? 0, the space of genotype tables, represented as dots, fall within a triangle in Figure
2.

Figure 2. An example of a genotype table, D, in the space of genotype tables with fixed R, S, s0, s1, and s2. Each dot represent a genotype table. Each dashed line has slope = fl2, representing the lines x = 2r0 + r1. The red line is x = (2s0 + s1)R/S = 2r0 + r1, and the two black lines correspond to values of (2r0 + r1) such that YA(r0, r1; ) = c, where c is a pre-specified significance threshold value.

For the moment, let’s treat r0 and r1 as continuous values. In Figure 2, the red solid line represents the line 2r0 + r1 = x = 10R/S and the two solid black lines represent lines 2r0 + r1 = x such that . There are two black lines because by Lemma 1 YA(x) is an increasing function when x n10R/S and it is a decreasing function when x n10R/S; that is, there could be two values of x, say x1 and x2, such that and x1 10R/S x2. Because it is possible that

there could be genotype tables for which only one black line exists or no black line
exists at all; in such cases, we will use the lines 0 = 2r0 + r1 or 2R = 2r0 + r1 wherever appropriate.

In Figure 2, the genotype table D is insignificant and its ?2 statistic is below the threhold value. By Lemma 1, we know that the ?2 statistics of genotype tables, as represented by the dots on Figure 2, are greater than c when they are in the shaded area, outside of the area between the two black lines
and they are smaller than YA(D?) when they are inside the area between the two black lines. Therefore, finding the
Hamming distance score for D is to find the shortest Hamming distance from the genotype table D to genotype tables in the shaded areas.

For genotype tables that are significant, they will fall into the shaded areas in
Figure 2. Then finding the Hamming distance score for a significant genotype table is to find
the shortest Hamming distance from the genotype table in one of the shaded areas to
genotype tables in the non-shaded area.

Proposition 2 Given a significance threshold value c and an insignificant genotype table D (i.e.,
YA
(D) c), if there exists such that , then the shortest Hamming distance is min{H1, H2}, where H1 and H2 are defined as follows:

(i) H1 is the number of changes made to D in the following manner: (1) keep decreasing r0 until the new genotype table, D’, becomes significant (i.e.,); (2) when r0 is minimized but the new table is still insignificant, keep decreasing r1 until the new table becomes significant.

(ii) H2 is the number of changes made to D in the following manner: (1) keep increasing r0 until the new genotype table becomes significant; (2) if r0 can no longer be increased without decreasing r1 and the new table is still insignificant, increase r0 and decrease r1 in each change until the new table becomes significant.

If for all , , then we define the shortest Hamming distance as min, where and are defined as follows:

(i) When r0 and r1 are both minimized but the new table is still insignificant, set to 1 + d1, where d1 is smallest the number of changes needed to minimize r0 and r1.

(ii) When r0 and r1 are both maximized but the new table is still insignificant, set to 1 + d2, where d2 is smallest the number of changes needed to maximize r0 and r1.

Proof. [see Additional file 1].

Proposition 3 Given a significance threshold value c and a significant genotype table D (that is,
YA
(D) ? c), the shortest Hamming distance is min{H1, H2}, where H1 and H2 are defined as follows:

(i) If 2r0 + r1 (2s0 + s1)R/S, set H1 = ?; otherwise, H1 is the number of changes made to D in the following manner: keep decreasing r0 until the new genotype table, D’, becomes insignificant (i.e., YA(D’, D) c).

(ii) If 2r0 + r1 (2s0 + s1)R/S, set H2 = ?; otherwise, H2 is the number of changes made to D in the following manner: keep decreasing r0 until the new genotype table becomes insignificant.

Proof The proof is similar to that of Proposition 2.

Definition 6 (The Hamming distance score) Given a threshold ?2 statistic value c and a genotye table D, the Hamming distance score of D is

where d? is found using Proposition 2 and d+ is found using Proposition 3.

Corollary 4 The sensitivity of the Hamming distance score as defined in Definition 6 is 1.