Blog Post
Evaluating and Overcoming the Limitations of Synthetic Data
As the need for balancing data value with data privacy grows, there has been an influx of offerings claiming to provide “synthetic data” as a solution. What is synthetic data, you may ask? “Synthetic data” is described as artificially generated data that contains “properties of the original data” without disclosing the “actual original data.” Products and solutions providing this capability often position “synthetic data” as a silver bullet—providing complete data value while ensuring record-level privacy.
The reality, however, is much more nuanced and complex. Although seemingly compelling at the surface, synthetic data has limitations as an approach because of fundamental mathematical constraints. Perfectly preserving both privacy and data value in a single dataset is mathematically impossible, and even close approximations of this goal range from far beyond the state of the art (think teleportation) to mathematically proven false (think perpetual motion).
Unfortunately, deciphering legitimate claims from misrepresentations requires a relatively deep knowledge of the field. To make matters worse, the consequences of deploying a spurious solution are potentially catastrophic, including breaches of customer privacy and regulatory violations.
The intention of this blog post is twofold:
- First, to categorize claims about synthetic data to enable enterprises to make informed decisions in this emerging space.
- Second, to describe a solution for how synthetic data can be leveraged by the enterprise to solve specific problems, given its fundamental limitations.
Part 1 – Evaluating Claims about Synthetic Data
Claims about synthetic data made in the market today fall under one of four categories:
- Mathematically impossible—these claims are provably false, as they violate fundamental mathematical principles.
- Theoretically possible, but far beyond the frontier of modern research—these claims, although theoretically possible, are highly implausible given what is known today among leading experts.
- At the frontier of modern research—these claims are achievable by a team of leading experts in the field of differential privacy and synthetic data generation, and are indicative of a credible product.
- Legacy approaches—these methods are widely used, but they demonstrably compromise both data privacy and data value.
The following sections discuss each of these four categories in greater detail, to help inform enterprise data and information security teams on how to interpret claims made in this complex space.
Category #1: Mathematically impossible
This first category covers synthetic datasets that claim to be statistically “identical” to the original dataset while also preserving any form of privacy. This category may also include descriptions that claim:
- Synthetic datasets that provide high-accuracy statistics for any use case.
- Synthetic datasets that preserve record-level properties (e.g., anonymized longitudinal record of a patient, being able to join multiple synthetic datasets) and also claim to be differentially private.
These claims simply cannot be believed, as they have been mathematically proven to be false. This does not mean that they are difficult to achieve or that they cannot be achieved with modern knowledge—it means something much stronger. The claims in this category contradict the fundamental principles of mathematics and are not achievable with any technology, including future technologies beyond the frontier of our imagination. Any product claiming to satisfy any of the above claims is not at all credible.
Category #2: Theoretically possible, but far beyond the frontier of modern research
This second category covers synthetic datasets that claim broad capabilities to service a wide variety of analysis types and use cases while meeting the standard of differential privacy. This category may include descriptions that claim:
- Synthetic datasets that are differentially private and offer high utility for exploratory analysis/research, even for a single use case.
- Synthetic datasets that are differentially private and can be used broadly for training machine learning models.
- Synthetic datasets that claim to preserve a large number of statistical properties of the original dataset to a high degree of accuracy and to preserve any form of privacy.
Although the claims described have not been proven impossible, they are not achievable with modern technology. There are thousands of highly qualified researchers in the fields of machine learning, differential privacy, and statistics who work on synthetic data, and the academic literature does not provide any mechanisms that are even close to supporting any of these claims.
If a product is claiming to achieve any of the above, it should be viewed with skepticism—it is most likely that the product is severely compromising privacy or not actually delivering on the data utility claims. In practice, organizations that make these claims are either misrepresenting their technology, or potentially worse, do not have the sophistication to understand the limitations of the techniques they are borrowing from the literature.
The most common failure patterns we have seen in products that make Category 2-style claims are:
- Not using a rigorous approach to privacy. For example, the product may use algorithms that simply add noise to data and misrepresent this as meeting the standard of “differential privacy.”
- Using a rigorous algorithm but parameterizing it incorrectly. For example, the product may be using a state-of-the-art algorithm from the literature but using unreasonable parameters that effectively void all privacy assurances.
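Both failure patterns can be made concrete with a toy sketch. The example below (all names are hypothetical; this is not a description of any particular product) uses the standard Laplace mechanism for a counting query. A counting query has sensitivity 1, so noise drawn from Laplace(scale = 1/ε) gives ε-differential privacy. Ad hoc noise addition skips this calibration entirely (failure pattern 1), and even the rigorous algorithm becomes meaningless when ε is set unreasonably large (failure pattern 2):

```python
import numpy as np

def dp_count(data, predicate, epsilon):
    """Epsilon-differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so drawing noise from
    Laplace(scale = 1 / epsilon) yields epsilon-differential privacy.
    """
    true_count = sum(1 for x in data if predicate(x))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

ages = [34, 29, 41, 52, 38, 45, 60, 27]

# Reasonable parameterization: epsilon = 0.5 adds meaningful noise.
print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))

# Failure pattern 2: the same rigorous algorithm with epsilon = 1000
# adds essentially no noise, so the output reveals the exact count.
# The code "uses differential privacy," but the assurance is void.
print(dp_count(ages, lambda a: a >= 40, epsilon=1000.0))
```

This is exactly why the parameterization questions below matter: the code path is identical in both calls, and only the value of ε separates a real guarantee from a vacuous one.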
An organization can validate products or organizations that make these claims by considering the following lines of discovery:
- Confirm that the team has peer-reviewed academic experience in statistical privacy/differential privacy. PhD-level experience in fields such as machine learning, pure mathematics, or cryptography is useful, but not sufficient, for developing a working solution in this space.
- Confirm external validation of claims. It is highly unusual (if not impossible) for a commercial solution to be extremely far ahead, algorithmically, from the state of the art globally. There should be external, third-party validation from an independent expert in differential privacy, or peer review, of the approach.
- Review the literature references supporting development. An advancement that actually delivers on such claims should be based on an extensive analysis of the literature on differentially private data release. The product should clearly articulate how it differentiates its operation and why it is so far beyond the current state of the art.
- Ask questions about the implementation details. The developers should be able to give clear answers to the following questions:
- What is the proof of privacy for the algorithm?
- How is the algorithm parametrized in the implementation?
- How did they verify that this parametrization provided sufficient privacy empirically?
- What are the specific types of computations that are supported privately?
- What are the classes of computation types that are not supported?
- What is the bound on accuracy of the supported computation types?
- What is the intuition behind the approach/structural property of the dataset that differentiates between the supported and non-supported computation types?
Category #3: At the frontier of modern research
Category 3 includes algorithms that match the descriptions listed below, are being developed by leading experts in differential privacy, and have been published in top conferences and journals. This is the forefront of the state of the art. This category may include descriptions that claim:
- Differentially private synthetic datasets for a pre-determined set of queries (where the queries are specified upfront, and the synthetic dataset is generated to be accurate on those queries).
- Differentially private synthetic datasets for a particular model type (these datasets are accurate for a specific machine learning model training algorithm—for instance, a specific tree-based model or a specific type of neural network).
It is certainly plausible that a commercial product achieves the claims above, and caveats in the product description such as “the dataset is only accurate for the following set of computations” lend credibility to the solution. Any product that makes claims in this category should be able to respond to questions similar to those described for validating products claiming to be in Category #2.
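As a toy illustration of the first claim type—accuracy only for a pre-determined query workload—the sketch below (hypothetical names, assuming numpy; a simplification of published histogram-based approaches, not any product's implementation) releases a differentially private histogram and samples synthetic records from it. Queries expressible over the histogram bins remain accurate; every other statistic carries no guarantee, which is precisely the caveat a credible product states upfront:

```python
import numpy as np

def dp_synthetic_from_histogram(values, bins, epsilon, n_synth):
    """Synthetic data accurate for one pre-specified query workload:
    counts over a fixed set of bins.

    A noisy histogram is released via the Laplace mechanism (histogram
    sensitivity is 1: one record changes one bin count by at most 1),
    then synthetic records are sampled from the noisy distribution.
    """
    counts, edges = np.histogram(values, bins=bins)
    noisy = counts + np.random.laplace(0.0, 1.0 / epsilon, size=len(counts))
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum()
    # Sample a bin for each synthetic record, then a value within it.
    idx = np.random.choice(len(probs), size=n_synth, p=probs)
    return np.random.uniform(edges[idx], edges[idx + 1])

ages = np.random.randint(18, 90, size=1000)
synth = dp_synthetic_from_histogram(ages, bins=8, epsilon=1.0, n_synth=1000)
```

Range queries over these bins will be close to the truth; a correlation with another column, or any statistic finer than the bin width, may be arbitrarily wrong—the "pre-determined set of queries" limitation in miniature.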
Category #4: Legacy approaches
These approaches for generating synthetic data do not provide any proven privacy (or even utility) guarantees and are behind the state of the art. These approaches may apply techniques including:
- Hashing personally identifiable information (PII) fields.
- Redacting or limiting sensitive elements (e.g., truncating birth dates to ages, deleting last names).
- Aggregating data (e.g., k-anonymity).
- Adding noise to data without formal guarantees (often products will do this and conflate it with differential privacy).
- Random data generation: creating “fake” records that match the frequency of the true data in some way.
- Data masking – a “catch-all” term for combinations of the above.
For specific use cases where the end user is trusted and the use case is very limited and well defined, these Category 4 methods may still be appropriate. For example, generating “test data” for testing applications in a developer environment without moving sensitive data to the developer environment could be a reasonable use case for these methods. However, use cases in which there are substantial privacy risks—for instance, sharing data across boundaries, such as with partners, across business lines, or across national borders—require much more robust solutions.
Part 2 – Overcoming the Limitations of Synthetic Data
At LeapYear, we have spent five years understanding and developing some of the most rigorous approaches for unlocking value from sensitive data with mathematically proven privacy assurances. We have collaborated with leading researchers in the field, implemented the results of 100s of academic papers, and produced novel algorithms.
We have found that, given the state of the art, synthetic data is a valuable feature of a privacy-preserving machine learning system, but it is not sufficient as a standalone product. The reasons for this are outlined above in greater detail, but to summarize: it is not possible to preserve every statistical property of the underlying data while still preserving any notion of privacy.
The LeapYear platform implements a broad range of privacy-preserving algorithms within a single differentially private system. Unlike a synthetic data approach, this platform is context aware. This means that, instead of trying to release an entire dataset that preserves both privacy and analytical utility for every use case (which is mathematically impossible), LeapYear uses privacy-preserving algorithms to compute a result for the specific computation requested by the analyst. With this approach, the algorithm only needs to preserve a specific statistical property each time a query is executed while maintaining privacy, rather than every statistical property. The last 15 years of research in differential privacy have demonstrated that this model is achievable for a broad range of analytical algorithms, which effectively cover the entire data science workflow. LeapYear has implemented these algorithms in a single platform.
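The per-query model can be illustrated with a minimal sketch (hypothetical function names; an illustration of the general technique, not LeapYear's implementation). Rather than releasing a whole dataset, each requested statistic is computed with noise calibrated to that one query's sensitivity:

```python
import numpy as np

def dp_mean(values, lower, upper, epsilon):
    """Answer one specific query (a mean) with differential privacy,
    rather than releasing a synthetic dataset.

    Values are clipped to [lower, upper], so changing one record shifts
    the mean by at most (upper - lower) / n; noise is calibrated to
    exactly that sensitivity, for exactly this query.
    """
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(values)
    noise = np.random.laplace(0.0, sensitivity / epsilon)
    return float(np.mean(clipped) + noise)

salaries = [52_000, 61_000, 48_500, 75_000, 58_200, 66_300]
print(dp_mean(salaries, lower=0, upper=200_000, epsilon=1.0))
```

Because the mechanism only has to protect this single statistic, the noise can be far smaller than what a whole-dataset release would require—this is the mathematical reason the context-aware model sidesteps the impossibility results discussed in Part 1.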
Even in an environment that supports privacy-preserving analytics and machine learning for a broad class of computations, there are clear use cases for synthetic data. Valid use cases that necessitate synthetic data are when an organization:
- Would like to obtain value from a sensitive, third-party dataset while leveraging its own computing environment.
- Is sharing data with a third party that is using its own proprietary machine learning algorithm.
For these use cases, LeapYear provides a synthetic data module within its broader platform. LeapYear’s synthetic data algorithm falls into category #3, but because of the functionality of the broader platform, it does not have the same limitations as synthetic data-only approaches. An analyst can use the privacy-preserving capabilities of the broader platform in combination with synthetic data generation to:
- Filter to the appropriate dataset for use in synthetic data generation.
- Confirm the accuracy of synthetically generated data.
- Apply the outputs of algorithms built on the synthetic data to the original dataset (for example, model scoring and benchmarking).
- Run queries, analyses, and models on the original dataset through privacy-preserving mechanisms when using a synthetic dataset is not appropriate.
This enables enterprises to systematically leverage synthetically generated data for use cases that require it, without being restricted by the fundamental mathematical limitations (which would either compromise data value or privacy) of a solution that is limited only to synthetic data release.
This blog post provides insight into the nuanced and complex challenges that surround the use of “synthetic data” to achieve the dual goals of data utility and data privacy. Given the importance of protecting sensitive datasets and the fundamental mathematical limitations of a generalized synthetic data approach, prospective users of the technology should carefully understand and evaluate potential solutions. This document provides some of the context necessary to assess the credibility of synthetic data products.
At LeapYear, we have spent nearly five years understanding and implementing some of the most rigorous approaches for unlocking value from sensitive data assets. Our product provides a broad differentially private platform for manipulating and computing on highly sensitive datasets, augmented with a targeted synthetic data solution to overcome the innate limitations in this field. If you’d like to discuss differential privacy or synthetic data in more detail, please contact us, and we’d be happy to share our expertise.
Acknowledgment: We’d like to thank Aaron Roth for providing feedback on the state of the academic literature in this space.