As the need for balancing data value with data privacy grows, there has been an influx of offerings claiming to provide “synthetic data” as a solution. What is synthetic data, you may ask? “Synthetic data” is described as artificially generated data that contains “properties of the original data” without disclosing the “actual original data.” Products and solutions providing this capability often position “synthetic data” as a silver bullet—providing complete data value while ensuring record-level privacy.
The reality, however, is much more nuanced and complex. Although seemingly compelling on the surface, synthetic data has fundamental mathematical limitations as an approach. Perfectly preserving both privacy and data value in a single dataset is mathematically impossible, and even close approximations of this goal range from far beyond the state of the art (think teleportation) to mathematically proven impossible (think perpetual motion).
Unfortunately, deciphering legitimate claims from misrepresentations requires a relatively deep knowledge of the field. To make matters worse, the consequences of deploying a spurious solution are potentially catastrophic, including breaches of customer privacy and regulatory violations.
The intention of this blog post is twofold: first, to explain the fundamental mathematical limitations of synthetic data as a standalone privacy solution; and second, to help enterprise data and information security teams assess the credibility of synthetic data products.
Claims about synthetic data made in the market today fall into one of four categories: (1) claims that are mathematically proven to be impossible; (2) claims that are beyond the current state of the art; (3) claims that reflect the published state of the art; and (4) approaches that offer no proven privacy or utility guarantees.
The following sections discuss each of these four categories in greater detail, to help inform enterprise data and information security teams on how to interpret claims made in this complex space.
This first category covers synthetic datasets that claim to be statistically “identical” to the original dataset while preserving any meaningful form of privacy. This category may also include descriptions that claim:
These claims simply cannot be believed, as they have been mathematically proven to be false. This does not merely mean that they are difficult to achieve or unattainable with modern knowledge; it means something much stronger. The claims in this category contradict fundamental principles of mathematics and are not achievable with any technology, including future technologies beyond the frontier of our imagination. Any product claiming to satisfy any of the above is not credible.
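To see why "identical statistics plus privacy" collapses, consider a classic reconstruction argument: if a release answers enough aggregate statistics about a dataset exactly, the individual records can be recovered from the aggregates alone. The sketch below uses hypothetical data and the simplest possible query family (prefix counts); it is an illustration of the principle, not any vendor's algorithm.

```python
# Toy reconstruction attack: if a release preserves every subset-count
# statistic of a private 0/1 attribute exactly, the records themselves
# can be recovered. Data here is hypothetical.

secret = [1, 0, 1, 1, 0]  # one private bit per person

# Exact answers to prefix-count queries:
# "how many of the first k people have the attribute?"
prefix_counts = [sum(secret[:k]) for k in range(len(secret) + 1)]

# Differencing adjacent exact answers recovers each individual's bit,
# so the "aggregate" release discloses every record.
recovered = [prefix_counts[k + 1] - prefix_counts[k] for k in range(len(secret))]

assert recovered == secret
```

Real reconstruction attacks in the literature use noisier, more general query families, but the lesson is the same: exact preservation of all statistics is equivalent to releasing the data itself.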
This second category covers synthetic datasets that claim broad capabilities to service a wide variety of analysis types and use cases while meeting the standard of differential privacy. This category may include descriptions that claim:
Although the claims described have not been proven impossible, they are not achievable with modern technology. Thousands of highly qualified researchers in machine learning, differential privacy, and statistics work on synthetic data, and the academic literature does not provide any mechanism that comes close to supporting these claims.
If a product is claiming to achieve any of the above, it should be viewed with skepticism—it is most likely that the product is severely compromising privacy or not actually delivering on the data utility claims. In practice, organizations that make these claims are either misrepresenting their technology, or potentially worse, do not have the sophistication to understand the limitations of the techniques they are borrowing from the literature.
The most common failure patterns we have seen in products that make Category 2-style claims are:
An organization can validate products or organizations that make these claims by considering the following lines of discovery:
Category 3 includes algorithms that achieve results similar to those described below. These algorithms are being developed by leading experts in differential privacy and have been published in top conferences and journals; they represent the forefront of the state of the art. This category may include descriptions that claim:
It is certainly plausible that a commercial product achieves the claims above, and caveats in the product description, such as “the dataset is only accurate for the following set of computations,” lend credibility to the solution. Any product that makes claims in this category should be able to respond to questions similar to those described for validating products claiming to be in Category 2.
This category covers approaches for generating synthetic data that provide no proven privacy (or even utility) guarantees and that lag behind the state of the art. These approaches may apply techniques including:
For specific use cases where the end user is trusted and the use case is very limited and well defined, these Category 4 methods may still be appropriate. For example, generating “test data” for testing applications in a developer environment without moving sensitive data to the developer environment could be a reasonable use case for these methods. However, use cases in which there are substantial privacy risks—for instance, sharing data across boundaries, such as with partners, across business lines, or across national borders—require much more robust solutions.
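As a concrete and deliberately naive illustration of a Category 4 approach, the sketch below generates test records by sampling each column independently from its empirical distribution. The column names and data are hypothetical. This preserves per-column value ranges and frequencies, which is often enough for exercising application code, but it destroys cross-column relationships and carries no formal privacy guarantee, which is why it is only defensible in limited, trusted settings like the developer-environment example above.

```python
import random

def make_test_data(rows, n, seed=0):
    """Sample n fake rows, drawing each column independently from its
    empirical marginal. No formal privacy or utility guarantee."""
    rng = random.Random(seed)
    # Collect the observed values for each column.
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    # Each fake row mixes values across real rows, independently per column.
    return [{key: rng.choice(values) for key, values in columns.items()}
            for _ in range(n)]

# Hypothetical sensitive rows and the test data derived from them.
real = [{"age": 34, "zip": "94103"}, {"age": 58, "zip": "10001"}]
fake = make_test_data(real, n=5)
```

Note that rare values can still leak through such a generator (a unique value in the real data may appear verbatim in the fake data), which is one reason these methods fail the more demanding use cases discussed below.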
At LeapYear, we have spent five years understanding and developing some of the most rigorous approaches for unlocking value from sensitive data with mathematically proven privacy assurances. We have collaborated with leading researchers in the field, implemented the results of hundreds of academic papers, and produced novel algorithms.
We have found that, given the state of the art, synthetic data is a valuable feature of a privacy-preserving machine learning system, but it is not sufficient as a standalone product. The reasons for this are outlined above in greater detail, but to summarize: it is not possible to preserve every statistical property of the underlying data while still preserving any notion of privacy.
The LeapYear platform implements a broad range of privacy-preserving algorithms within a single differentially private system. Unlike a synthetic data approach, the platform is context aware: instead of trying to release an entire dataset that preserves both privacy and analytical utility for every use case (which is mathematically impossible), LeapYear uses privacy-preserving algorithms to compute a result for the specific computation requested by the analyst. With this approach, each query needs to preserve only one specific statistical property while maintaining privacy, rather than every statistical property at once. The last 15 years of research in differential privacy have demonstrated that this model is achievable for a broad range of analytical algorithms, which effectively cover the entire data science workflow, and LeapYear has implemented these algorithms in a single platform.
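To make the per-query model concrete, here is a minimal, hypothetical sketch of answering a single counting query with the Laplace mechanism, a basic building block of differential privacy. This is not LeapYear's implementation; a production system would additionally track privacy budgets across queries, compute per-query sensitivity, and handle composition.

```python
import random

def dp_count(records, predicate, epsilon, seed=None):
    """Differentially private count of records matching a predicate.

    Adds Laplace noise with scale sensitivity/epsilon; a counting
    query has sensitivity 1 (one person changes the count by at most 1).
    """
    rng = random.Random(seed)
    true_count = sum(1 for r in records if predicate(r))
    # The difference of two Exponential(rate=epsilon) draws is
    # Laplace-distributed with scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise

# Hypothetical query: how many individuals are age 40 or older?
ages = [23, 37, 41, 52, 29, 61]
noisy = dp_count(ages, lambda a: a >= 40, epsilon=1.0, seed=0)
```

The key point is that the noise is calibrated to this one query's sensitivity, so only the requested statistic needs to be protected, rather than every statistic of the dataset simultaneously.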
Even in an environment that supports privacy-preserving analytics and machine learning for a broad class of computations, there are clear use cases for synthetic data. Valid use cases that necessitate synthetic data are when an organization:
For these use cases, LeapYear provides a synthetic data module within its broader platform. LeapYear’s synthetic data algorithm falls into Category 3, but because of the functionality of the broader platform, it does not have the same limitations as synthetic-data-only approaches. An analyst can use the privacy-preserving capabilities of the broader platform in combination with synthetic data generation to:
This enables enterprises to systematically leverage synthetically generated data for use cases that require it, without being restricted by the fundamental mathematical limitations (which would either compromise data value or privacy) of a solution that is limited only to synthetic data release.
This blog post provides insight into the nuanced and complex challenges surrounding the use of “synthetic data” to meet the dual goals of data utility and data privacy. Given the importance of protecting sensitive datasets and the fundamental mathematical limitations of a generalized synthetic data approach, prospective users of the technology should carefully understand and evaluate potential solutions. This document provides some of the context necessary to assess the credibility of synthetic data products.
At LeapYear, we have spent nearly five years understanding and implementing some of the most rigorous approaches for unlocking value from sensitive data assets. Our platform provides broad differentially private capabilities for manipulating and computing on highly sensitive datasets, augmented with a targeted synthetic data module to overcome the innate limitations of synthetic data alone. If you’d like to discuss differential privacy or synthetic data in more detail, please contact us; we’d be happy to share our expertise.
Acknowledgment: We’d like to thank Aaron Roth for providing feedback on the state of the academic literature in this space.