Amazon Research Introduces 3A (Approximate, Adapt, Anonymize): A Framework For Privacy-Preserving Training Data Release For Machine Learning


Data synthesis has been proposed as a feasible technique for sharing and analyzing sensitive data in a way that is both ethically and legally acceptable. In industries that handle sensitive, personally identifiable information, such as healthcare, the considerable legal, ethical, and trust problems associated with training and deploying machine learning models slow the development of this technology and its potential benefits. Depending on the privacy definition and objectives, it is possible to create a dataset that enables accurate machine learning (ML) model training without sacrificing privacy. For instance, data that cannot be used to identify a specific individual may be exempt from the GDPR.

Researchers at Amazon have developed a framework for creating synthetic data that protects privacy while preserving its usefulness for machine learning. They are interested in methods that:

  1. Approximate the true data distribution.
  2. Maintain machine learning utility (ML models trained on the data release perform similarly to models trained on true data).
  3. Preserve privacy by making the data release mechanism differentially private (DP).

In this effort, they rely on differential privacy, which, unlike weaker privacy criteria such as k-anonymity, has been shown to guard against the identification of specific individuals.
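For reference, the standard definition: a randomized mechanism M is (ε, δ)-differentially private if, for any two datasets D and D′ differing in a single record and any set of outputs S, Pr[M(D) ∈ S] ≤ e^ε · Pr[M(D′) ∈ S] + δ. Smaller ε and δ mean that any single individual's presence in the dataset has less influence on what the mechanism releases.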

More specifically, they propose studying a family of data generation algorithms M that, given an initial dataset D = {(X_i, Y_i)}_{i=1..n} with n data points X_i and labels Y_i, generate a synthetic dataset D̃ = M(D) that does the following:

1. Approximate the underlying data distribution: estimate a parametric density p(x) by optimizing a log-likelihood objective.

2. Adapt the estimated data distribution so that a classifier trained on samples from it incurs a loss close to that of a classifier trained on the actual data. The overall optimization must balance two objectives: L1, which encourages faithfully preserving the data distribution, and L2, which encourages matching the classifier loss.

3. Anonymize by ensuring that the entire data release mechanism satisfies (ε, δ)-differential privacy, which makes it improbable that the involvement of any single data point can be identified. In other words, ensure the algorithm for releasing data is differentially private. A minimal sketch of this three-step pipeline follows below.
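To make the three steps concrete, here is a minimal Python sketch, assuming scikit-learn's GaussianMixture as the density estimator. The function names and the noise calibration are illustrative assumptions, not the authors' implementation; the adapt step is stubbed out because the paper uses a KIP-based loss approximator, and a real deployment would also need to account for the privacy cost of the fitting step itself.

```python
# Illustrative sketch only -- hypothetical names, not the paper's code.
import numpy as np
from sklearn.mixture import GaussianMixture

def gaussian_sigma(sensitivity, epsilon, delta):
    """Standard Gaussian-mechanism noise scale for (epsilon, delta)-DP."""
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon

def release_synthetic(X, n_components=10, epsilon=1.0, delta=1e-5, sensitivity=1.0):
    # 1. Approximate: fit a parametric density p(x) by maximizing log-likelihood.
    gmm = GaussianMixture(n_components=n_components).fit(X)

    # 2. Adapt: adjust p(x) so a classifier trained on its samples matches the
    #    loss of one trained on real data (the paper balances objectives L1 and
    #    L2 via a Kernel Inducing Point loss approximator; omitted here).

    # 3. Anonymize: sample from the density and add calibrated Gaussian noise.
    X_syn, _ = gmm.sample(len(X))
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    return X_syn + np.random.normal(0.0, sigma, X_syn.shape)
```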

Privacy is ensured by an improved version of random mixing, which releases combinations of data points rather than individual data points, providing a "safety in numbers" defense against re-identification. This overall architecture can be implemented in several ways. In this work, they evaluate ClustMix, a straightforward algorithm that implements these three phases. They select a Gaussian Mixture Model as the density estimator and a Kernel Inducing Point meta-learning algorithm as the loss approximator (to allow a trade-off between maintaining density and classifier fidelity).
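The cluster-based mixing idea can be sketched as follows. This is a simplified reading, assuming k-means for clustering, with hypothetical function names and a fixed noise scale rather than the paper's exact mechanism or privacy accounting.

```python
# Simplified sketch in the spirit of cluster-based mixing (hypothetical names).
import numpy as np
from sklearn.cluster import KMeans

def clustmix_release(X, n_clusters=20, mix_size=8, sigma=0.1, seed=0):
    """Release noisy convex combinations of points near each cluster centroid."""
    rng = np.random.default_rng(seed)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X)
    released = []
    for c in range(n_clusters):
        members = X[km.labels_ == c]
        if len(members) < mix_size:
            continue  # too few points to hide any individual contribution
        # Average the mix_size points closest to the centroid: a convex
        # combination of related points ("safety in numbers").
        dists = np.linalg.norm(members - km.cluster_centers_[c], axis=1)
        mix = members[np.argsort(dists)[:mix_size]].mean(axis=0)
        # Additive Gaussian noise; sigma must be calibrated to the DP budget.
        released.append(mix + rng.normal(0.0, sigma, size=mix.shape))
    return np.asarray(released)
```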

Their main contributions are the flexible privacy-preserving data generation framework described above and the introduction of cluster-based rather than random mixing for preserving differential privacy, which enables significant accuracy gains over previously published methods. Creating new training examples by taking convex combinations of existing data points has been successfully leveraged in machine learning, e.g., for data augmentation, learning with redundancy in distributed settings, and, more recently, private machine learning.

Their differentially private (DP) data release technique uses random mixtures (convex combinations of a randomly picked subset of a dataset) and additive Gaussian noise. While some of these algorithms explicitly strive to retain the original data distribution, most sample points at random and neglect the geometry of the data. As a result, low-density areas near decision boundaries may not be preserved, which can reduce the data's downstream value for machine learning. Moreover, combinations of random samples may fail to retain particular data distributions, such as skewed and multimodal continuous variables. Their method maintains the data distribution by sampling from the vicinity of cluster centroids instead of sampling at random. By mixing related data points rather than random ones, noisy mixes can more closely approximate the original data distribution, losing less utility than competing techniques while providing a stronger DP guarantee.
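One intuition for why mixing helps, under the standard assumption that each record's norm is bounded by some B: swapping one of the k points in an average moves the average by at most 2B/k, so the Gaussian noise needed for a given (ε, δ) shrinks as k grows. The numbers below are purely illustrative, not results from the paper.

```python
# Illustration (assumed values): averaging k bounded points shrinks
# per-record sensitivity, so less noise is needed for the same (eps, delta).
import numpy as np

B, eps, delta = 1.0, 1.0, 1e-5   # norm bound and privacy budget (assumed)
for k in (1, 8, 32):
    sensitivity = 2.0 * B / k    # swapping one point moves the mean <= 2B/k
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    print(f"k={k:2d}  sensitivity={sensitivity:.4f}  noise sigma={sigma:.4f}")
```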

