Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to iDASH healthcare privacy protection challenge

Let's refer to the case group's data and the control group's data collectively as a database and call the data for an individual a record. We can think of the number of cases, R, and the number of controls, S, as fixed. Recall that we assume the control group's data are known to the public. Therefore, for a given genotype table, we assume that s₀, s₁, and s₂ are fixed. Then the χ² statistic can be written as a function of r₀ and r₁. How the value of the χ² statistic changes when we change one record in the database is illustrated in Figure 1. In Figure 1, each dot represents a value of the χ² statistic given r₀ and r₁. When we change one record in the case group, there are 6 possible changes to the genotype table: (r₀ → r₀ + 1, r₁ → r₁), (r₀ → r₀ + 1, r₁ → r₁ − 1), (r₀ → r₀, r₁ → r₁ − 1), (r₀ → r₀ − 1, r₁ → r₁), (r₀ → r₀ − 1, r₁ → r₁ + 1), and (r₀ → r₀, r₁ → r₁ + 1); that is, r₀ and r₁ cannot increase or decrease by 1 simultaneously. The possible changes are shown as arrows in Figure 1. A change in the genotype table results in a change in the allelic table, and we get a new value for the χ² statistic based on the new allelic table.
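The six legal moves can be enumerated mechanically. The following is a minimal sketch, assuming a genotype table is represented by the pair (r0, r1) with r2 = R − r0 − r1 implied; the function name `legal_moves` is hypothetical and not taken from the paper.

```python
def legal_moves(r0, r1, R):
    """Return the (r0, r1) pairs reachable by changing one record in the case group."""
    # The six changes listed in the text; (+1, +1) and (-1, -1) are absent
    # because one record can only move from one genotype to another.
    deltas = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]
    moves = []
    for d0, d1 in deltas:
        n0, n1 = r0 + d0, r1 + d1
        # All three counts (r0, r1 and the implied r2) must stay non-negative.
        if n0 >= 0 and n1 >= 0 and n0 + n1 <= R:
            moves.append((n0, n1))
    return moves


print(legal_moves(2, 2, 10))   # interior table: all six moves are available
print(legal_moves(0, 0, 10))   # corner table: only two moves remain
```

Tables on the boundary of the space have fewer than six neighbours, which is why the check on r0, r1 and r0 + r1 is needed.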

Figure 1. Legal moves in the space of genotype tables with fixed R, S, s₀, s₁, and s₂.

Let p* denote a pre-specified threshold p-value and let c denote the χ² statistic corresponding to p*, where the p-value is computed from the χ² distribution with 1 degree of freedom. Then for a given SNP in the pool of candidate SNPs, the genotype table of which we denote by D, the shortest Hamming distance is the smallest number of sequential changes made to D such that the resulting genotype table, D', satisfies Y_A(D') ≥ c if Y_A(D) < c and Y_A(D') < c if Y_A(D) ≥ c; that is, if we call c the significance threshold, then the goal is to make changes to the "insignificant" ("significant") table D so that the χ² statistic of D' goes above (below) the significance threshold c, and D' becomes a "significant" ("insignificant") table. The Hamming distance score is defined as h = (shortest Hamming distance) − 1 if Y_A(D) ≥ c and h = −(shortest Hamming distance) if Y_A(D) < c.
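As an illustration of these two definitions, the sketch below converts a threshold p-value into c and applies the sign convention of the score. It relies on the fact that a χ² variable with 1 degree of freedom is the square of a standard normal variable; the function names are hypothetical.

```python
from statistics import NormalDist


def chi2_threshold(p_star):
    """Chi-square (1 df) value whose upper-tail probability equals p_star."""
    # P(Z^2 >= c) = p_star  <=>  sqrt(c) is the (1 - p_star / 2) normal quantile.
    z = NormalDist().inv_cdf(1.0 - p_star / 2.0)
    return z * z


def hamming_score(shortest_distance, is_significant):
    """Score h as defined in the text: d - 1 for significant tables, -d otherwise."""
    if is_significant:
        return shortest_distance - 1
    return -shortest_distance


c = chi2_threshold(0.05)
print(round(c, 3))                 # about 3.841
print(hamming_score(1, True))      # significant table one change from the boundary
print(hamming_score(1, False))     # insignificant table one change from the boundary
```

With this convention, significant tables receive scores of 0 or more and insignificant tables receive scores of −1 or less.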

Let's consider the space of genotype tables defined by a genotype table D: every table D' in this space shares the same values of s₀, s₁, s₂, R, S, and N with D, but D' does not necessarily have the same values of r₀, r₁, and r₂ as D. Let n₁₀ = 2s₀ + s₁ denote the number of major alleles in the control group, and let x = 2r₀ + r₁ denote the number of major alleles in the case group; then we can write the χ² statistic of a genotype table as a function of x:

where r₀, r₁ and x are derived from D', and n₁₀, R, S and N are the same for D and D'. For notational convenience, when r₀ and r₁ are also derived from D, we will simply write the χ² statistic as Y_A(D).
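For reference, one explicit form of the statistic as a function of x can be written down under two assumptions: that Y_A is the standard allelic χ² test on the 2 × 2 allelic table, and that N = R + S. This form is consistent with the sign condition xS − n₁₀R used in Lemma 1; the notation Y_A(x), with the dependence on D left implicit, is a simplification adopted here.

```latex
Y_A(x) \;=\; \frac{2N\,\bigl(xS - n_{10}R\bigr)^{2}}
                  {R\,S\,\bigl(x + n_{10}\bigr)\,\bigl(2N - x - n_{10}\bigr)},
\qquad x = 2r_0 + r_1,\quad n_{10} = 2s_0 + s_1 .
```

Here x + n₁₀ is the total number of major alleles in the database and 2N − x − n₁₀ is the total number of minor alleles, so the statistic is zero exactly when x = n₁₀R/S.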

Lemma 1 Y_A is an increasing function of x when xS − n₁₀R > 0, and it is a decreasing function of x when xS − n₁₀R < 0.

Proof. [see Additional file 1].
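The behaviour described in Lemma 1 can be checked numerically. This is a minimal sketch, assuming Y_A is the standard allelic χ² statistic written in terms of x and n₁₀ with N = R + S; the function name and the example values (R = 20, S = 30, n₁₀ = 36) are illustrative and not taken from the paper.

```python
def allelic_chi2(x, n10, R, S):
    """Allelic chi-square statistic as a function of x = 2*r0 + r1 (assumed form)."""
    N = R + S
    major = x + n10            # major alleles in cases and controls combined
    minor = 2 * N - major      # minor alleles in cases and controls combined
    if major == 0 or minor == 0:
        return 0.0             # degenerate table; treated as 0 in this sketch
    return 2 * N * (x * S - n10 * R) ** 2 / (R * S * major * minor)


R, S, n10 = 20, 30, 36
turning_point = n10 * R // S   # x * S - n10 * R = 0 at x = 24 for these values
values = [allelic_chi2(x, n10, R, S) for x in range(2 * R + 1)]

# Decreasing while x*S - n10*R < 0, increasing while x*S - n10*R > 0.
print(all(values[x] > values[x + 1] for x in range(turning_point)))
print(all(values[x] < values[x + 1] for x in range(turning_point, 2 * R)))
```

Both checks hold for this example, and the statistic is exactly zero at the turning point.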

Additional file 1. Proofs

To understand the implication of Lemma 1, let's consider Figure 2. In Figure 2, each dashed line has slope = −2, representing a value of x, which is defined as x = 2r₀ + r₁. Because we can consider each dot in Figure 2 to be a unique genotype table in the space of genotype tables with fixed control data and a fixed number of cases, those tables that lie on the same dashed line will have the same value of the χ² statistic. Furthermore, because 0 ≤ r₀ + r₁ ≤ R, r₀ ≥ 0 and r₁ ≥ 0, the space of genotype tables, represented as dots, falls within a triangle in Figure 2.

Figure 2. An example of a genotype table, D, in the space of genotype tables with fixed R, S, s₀, s₁, and s₂. Each dot represents a genotype table. Each dashed line has slope = −2, representing the lines x = 2r₀ + r₁. The red line is x = (2s₀ + s₁)R/S = 2r₀ + r₁, and the two black lines correspond to values of (2r₀ + r₁) at which the χ² statistic Y_A equals c, where c is a pre-specified significance threshold value.

For the moment, let's treat r₀ and r₁ as continuous values. In Figure 2, the red solid line represents the line 2r₀ + r₁ = x = n₁₀R/S and the two solid black lines represent lines 2r₀ + r₁ = x at which the χ² statistic equals c. There are two black lines because, by Lemma 1, Y_A(x) is an increasing function when x > n₁₀R/S and a decreasing function when x < n₁₀R/S; that is, there could be two values of x, say x₁ and x₂, at which the statistic equals c, with x₁ < n₁₀R/S < x₂. Because the statistic may stay below c for every attainable x on one or both sides of n₁₀R/S, there could be genotype tables for which only one black line exists or no black line exists at all; in such cases, we will use the lines 0 = 2r₀ + r₁ or 2R = 2r₀ + r₁ wherever appropriate.
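The positions of the two black lines can be located numerically. The sketch below treats x as continuous and uses bisection, which is valid because Lemma 1 makes the statistic monotone on each side of n₁₀R/S. The statistic is assumed to be the standard allelic χ² form, and `threshold_lines` is a hypothetical helper name.

```python
def allelic_chi2(x, n10, R, S):
    """Allelic chi-square statistic as a function of x = 2*r0 + r1 (assumed form)."""
    N = R + S
    major = x + n10
    minor = 2 * N - major
    if major == 0 or minor == 0:
        return 0.0             # degenerate table; treated as 0 in this sketch
    return 2 * N * (x * S - n10 * R) ** 2 / (R * S * major * minor)


def _bisect(f, inside, outside, c, tol):
    # Invariant: f(inside) < c <= f(outside). Shrink the bracket below tol.
    while abs(outside - inside) > tol:
        mid = (inside + outside) / 2.0
        if f(mid) < c:
            inside = mid
        else:
            outside = mid
    return outside


def threshold_lines(c, n10, R, S, tol=1e-9):
    """Return (x1, x2); fall back to x = 0 or x = 2R when a black line does not exist."""
    def f(x):
        return allelic_chi2(x, n10, R, S)

    center = n10 * R / S       # the red line, where the statistic is zero
    x1 = 0.0 if f(0.0) < c else _bisect(f, center, 0.0, c, tol)
    x2 = 2.0 * R if f(2.0 * R) < c else _bisect(f, center, 2.0 * R, c, tol)
    return x1, x2


x1, x2 = threshold_lines(3.841, n10=36, R=20, S=30)
print(x1 < 36 * 20 / 30 < x2)
```

When the threshold is too large to be reached on either side, the function returns the boundary lines 0 and 2R, mirroring the convention in the text.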

In Figure 2, the genotype table D is insignificant and its χ² statistic is below the threshold value. By Lemma 1, we know that the χ² statistics of genotype tables, as represented by the dots in Figure 2, are at least c when they are in the shaded area, outside of the area between the two black lines, and they are smaller than c when they are inside the area between the two black lines. Therefore, finding the Hamming distance score for D amounts to finding the shortest Hamming distance from the genotype table D to genotype tables in the shaded areas.

Genotype tables that are significant fall into the shaded areas in Figure 2. Finding the Hamming distance score for a significant genotype table then amounts to finding the shortest Hamming distance from the genotype table in one of the shaded areas to genotype tables in the non-shaded area.

Proposition 2 Given a significance threshold value c and an insignificant genotype table D (i.e., Y_A(D) < c), if there exists a table D' in the space of genotype tables defined by D such that Y_A(D') ≥ c, then the shortest Hamming distance is min{H₁, H₂}, where H₁ and H₂ are defined as follows:

(i) H₁ is the number of changes made to D in the following manner: (1) keep decreasing r₀ until the new genotype table, D', becomes significant (i.e., Y_A(D') ≥ c); (2) when r₀ is minimized but the new table is still insignificant, keep decreasing r₁ until the new table becomes significant.

(ii) H₂ is the number of changes made to D in the following manner: (1) keep increasing r₀ until the new genotype table becomes significant; (2) if r₀ can no longer be increased without decreasing r₁ and the new table is still insignificant, increase r₀ and decrease r₁ in each change until the new table becomes significant.

If Y_A(D') < c for all tables D' in the space, then we define the shortest Hamming distance as the minimum of the following two quantities:

(i) When r₀ and r₁ are both minimized but the new table is still insignificant, the first quantity is set to 1 + d₁, where d₁ is the smallest number of changes needed to minimize r₀ and r₁.

(ii) When r₀ and r₁ are both maximized but the new table is still insignificant, the second quantity is set to 1 + d₂, where d₂ is the smallest number of changes needed to maximize r₀ and r₁.

Proof. [see Additional file 1].
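The procedure in Proposition 2 can be sketched as two greedy walks, one lowering x = 2r₀ + r₁ and one raising it. This is an illustrative sketch under stated assumptions, not the authors' implementation: the statistic is taken to be the standard allelic χ² form, "maximizing r₀ and r₁" in the fallback case is interpreted as reaching r₀ = R, and all function names are hypothetical. The input table is expected to be insignificant.

```python
def allelic_chi2(x, n10, R, S):
    """Allelic chi-square statistic as a function of x = 2*r0 + r1 (assumed form)."""
    N = R + S
    major = x + n10
    minor = 2 * N - major
    if major == 0 or minor == 0:
        return 0.0             # degenerate table; treated as 0 in this sketch
    return 2 * N * (x * S - n10 * R) ** 2 / (R * S * major * minor)


def shortest_distance_insignificant(r0, r1, n10, R, S, c):
    """Shortest Hamming distance for an insignificant table, following Proposition 2."""
    def significant(a, b):
        return allelic_chi2(2 * a + b, n10, R, S) >= c

    # H1: decrease r0 first; once r0 is 0, decrease r1.
    a, b, down = r0, r1, 0
    while not significant(a, b) and (a > 0 or b > 0):
        if a > 0:
            a -= 1
        else:
            b -= 1
        down += 1
    down_hit = significant(a, b)

    # H2: increase r0 first; once r0 + r1 = R, trade one r1 for one r0 per change.
    a, b, up = r0, r1, 0
    while not significant(a, b) and a < R:
        if a + b < R:
            a += 1
        else:
            a, b = a + 1, b - 1
        up += 1
    up_hit = significant(a, b)

    reached = [h for h, hit in ((down, down_hit), (up, up_hit)) if hit]
    if reached:
        return min(reached)
    # No significant table exists in the space: use 1 + d1 and 1 + d2.
    return min(1 + down, 1 + up)


print(shortest_distance_insignificant(9, 8, n10=36, R=20, S=30, c=3.841))
```

In this example the table has x = 26; three increases of r₀ bring x to 32, where the statistic first exceeds c, whereas the downward walk needs five changes.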

Proposition 3 Given a significance threshold value c and a significant genotype table D (that is, Y_A(D) ≥ c), the shortest Hamming distance is min{H₁, H₂}, where H₁ and H₂ are defined as follows:

(i) If 2r₀ + r₁ < (2s₀ + s₁)R/S, set H₁ = ∞; otherwise, H₁ is the number of changes made to D in the following manner: keep decreasing r₀ until the new genotype table, D', becomes insignificant (i.e., Y_A(D') < c).

(ii) If 2r₀ + r₁ > (2s₀ + s₁)R/S, set H₂ = ∞; otherwise, H₂ is the number of changes made to D in the following manner: keep increasing r₀ until the new genotype table becomes insignificant.

Proof. The proof is similar to that of Proposition 2.

Definition 6 (The Hamming distance score) Given a threshold χ² statistic value c and a genotype table D, the Hamming distance score of D is

h(D) = −d⁻ if Y_A(D) < c, and h(D) = d⁺ − 1 if Y_A(D) ≥ c,

where d⁻ is found using Proposition 2 and d⁺ is found using Proposition 3.

Corollary 4 The sensitivity of the Hamming distance score as defined in Definition 6 is 1.
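The sensitivity claim can be checked by brute force on a small example. The sketch below does not use the procedures of Propositions 2 and 3; instead it computes shortest Hamming distances with a breadth-first search over the six legal moves, then compares the scores of neighbouring tables. The statistic is assumed to be the standard allelic χ² form, the names are hypothetical, and the sketch assumes the space contains both significant and insignificant tables.

```python
from collections import deque


def allelic_chi2(x, n10, R, S):
    """Allelic chi-square statistic as a function of x = 2*r0 + r1 (assumed form)."""
    N = R + S
    major = x + n10
    minor = 2 * N - major
    if major == 0 or minor == 0:
        return 0.0             # degenerate table; treated as 0 in this sketch
    return 2 * N * (x * S - n10 * R) ** 2 / (R * S * major * minor)


def neighbours(r0, r1, R):
    """Tables reachable by changing one case record."""
    deltas = [(1, 0), (1, -1), (0, -1), (-1, 0), (-1, 1), (0, 1)]
    return [(r0 + a, r1 + b) for a, b in deltas
            if r0 + a >= 0 and r1 + b >= 0 and r0 + a + r1 + b <= R]


def distance_to(targets, R):
    """Multi-source BFS: fewest single-record changes to reach any table in targets."""
    dist = {t: 0 for t in targets}
    queue = deque(targets)
    while queue:
        cur = queue.popleft()
        for nxt in neighbours(cur[0], cur[1], R):
            if nxt not in dist:
                dist[nxt] = dist[cur] + 1
                queue.append(nxt)
    return dist


def hamming_scores(n10, R, S, c):
    """Hamming distance score for every table (r0, r1) in the space."""
    tables = [(r0, r1) for r0 in range(R + 1) for r1 in range(R + 1 - r0)]
    sig = {t for t in tables if allelic_chi2(2 * t[0] + t[1], n10, R, S) >= c}
    insig = set(tables) - sig
    to_sig, to_insig = distance_to(sig, R), distance_to(insig, R)
    return {t: (to_insig[t] - 1 if t in sig else -to_sig[t]) for t in tables}


R = 20
scores = hamming_scores(n10=36, R=R, S=30, c=3.841)
largest_jump = max(abs(scores[t] - scores[n])
                   for t in scores for n in neighbours(t[0], t[1], R))
print(largest_jump)            # changing one record moves the score by at most 1
```

The search is exhaustive, so it is practical only for small R, but it gives an independent check on scores obtained with the greedy procedures.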