Blog Post
The Safe Enterprise: How to Share Sensitive Data With LeapYear’s Practical Differential Privacy
The race to share and monetize sensitive information is just beginning. Cloud computing, artificial intelligence, and machine learning, coupled with new privacy-enhancing technologies, enable enterprises to unlock unprecedented value from data. The demand for sensitive data is growing, as enterprises across industries ranging from financial services to healthcare seek to share data with partners and securely monetize it to create new income streams.
To date, privacy, security, and regulatory concerns have necessarily and severely restricted access to sensitive information. The same methods from machine learning that create value from sensitive data also pose risks to privacy: As these data sets are exposed to data scientists, third parties, and models, the surface area for data exfiltration is rapidly growing.
Traditional methods to protect sensitive information such as data masking, aggregation, and redaction have time and time again failed to solve this problem. These are heuristic-based approaches: They provide no provable guarantees and have been repeatedly reverse engineered, compromising highly sensitive personally identifiable information (PII), such as medical and financial data, as well as confidential business data and intellectual property (IP).
Restricting access to sensitive data outright is not an ideal solution either. Strict controls limit data value, slow the pace of innovation, and do not provide meaningful defense against privacy attacks.
The massive potential value of mining and sharing sensitive data sets, paired with the sophistication of modern privacy attacks, has created a time-sensitive and lucrative problem for key stakeholders to solve. These individuals include data scientists, security officers, IT organizations, data engineers, compliance officers, business stakeholders, and regulators. Fortunately, there is a new way to securely share sensitive information. Cryptography-based technologies protect underlying data sets from reverse-engineering or other forms of exposure, enabling firms to sell sensitive data to create profitable new revenue streams.
Traditional techniques for protecting sensitive information
For half a century, the field of cryptography has facilitated the secure, electronic sharing of sensitive information—credit card transactions, file transfers, and communications—protected by cryptographic protocols. Data sharing has been made possible by providing rigorous, mathematically proven guarantees. It’s not an overstatement to say that data encryption has enabled modern business to flourish: enabling secure communications among partners and instant, automated decisions made on real-time data flows and analytics.
The field of data privacy, however, has historically been devoid of such foundations. The only known approaches for exposing data for analysis while still attempting to maintain confidentiality have been heuristic based: in other words, guesswork. These approaches involved redacting certain fields that were considered especially sensitive, such as personally identifiable information (PII); or implementing business rules, such as “reveal the age of a person but not their birthdate” or “only show a result if there are more than 20 people in the sample.”
It is a well-known fact that these techniques do not work: Countless studies and real-world data breaches have demonstrated that these approaches can be reverse engineered. Every time a privacy protection technique is breached, practitioners propose a new, slightly more sophisticated technique, such as adding noise to release “synthetic data,” performing aggregations such as “k-anonymity,” or instituting a broad class of statistical techniques called “data masking.” Ultimately, each technique has been broken, resulting in a compromise of highly sensitive information that harms individuals and institutions alike. In the absence of better alternatives, these insecure sensitive data sharing approaches continue to be used across enterprises today, resulting in process inefficiencies, loss of value, and the risk of data compromise.
What is Differential Privacy?
Differential privacy emerged 15 years ago as a solution to this problem. Differential privacy is both a rigorous, mathematical definition of data privacy and the foundation for privacy-enhancing technology that enables the secure sharing of sensitive information.
When a statistic, algorithm, or analytical procedure meets the standard of differential privacy, it means that no individual record significantly impacts the output of that analysis. The mathematics behind differential privacy ensures that the output does not contain information that can be used to draw conclusions about any record in the underlying data set.
With differential privacy, there are mathematically proven limits on the amount of information released by the supporting privacy-enhancing technology, also known as an information-theoretic guarantee. The beauty of this approach is that certain variables no longer matter. If you’re using technology that relies on differential privacy, parties won’t be able to maliciously exploit your data. It doesn’t matter how sophisticated adversaries are, what computational resources they have, or what other information they have access to: They won’t be able to reverse-engineer your sensitive data or draw conclusions about any individuals or businesses featured in these data sets.
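This guarantee has a precise formulation in the research literature (sketched here in the standard notation; the symbols are not from this post): a randomized algorithm M is ε-differentially private if, for every pair of data sets D and D′ that differ in a single record, and every set S of possible outputs,

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,]
```

The privacy parameter ε bounds how much any single record can shift the output distribution: the smaller ε is, the less an observer can learn about any individual from the result.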
How to implement enterprise-grade differential privacy
Despite its power to enable secure sharing of sensitive data, differential privacy technologies have not been widely implemented across enterprises. There are specialized implementations by Apple and Google for targeted use cases, as well as several academic research projects. However, no commercial solutions have emerged—until now. LeapYear is one of the first to offer a commercial-grade privacy-enhancing technology platform that uses differential privacy-based guarantees to enable large-scale, secure sensitive data sharing and monetization.
One reason enterprises have yet to do widespread implementations of technology using differential privacy is that it is not an algorithm or technique: It is a mathematical definition of privacy. The definition pertains to analyses: An analysis is or is not differentially private. Intuitively, if the outputs of an algorithm are insensitive to individual records, then the algorithm is differentially private. If the outputs are sensitive to individual records, then the algorithm is not differentially private. The standard analyses in a data scientist’s toolkit—counts, means, regressions, and other tools—are not inherently differentially private. They have to be re-designed in a way that satisfies the definition, which is a hard mathematical problem.
The process of re-designing algorithms usually involves introducing precisely calibrated variability, or randomization, into the computation itself to hide the contribution of any individual records or data elements. The challenge is to introduce variability into data sets in a way that satisfies the standard of differential privacy without compromising the analytical utility of the result. Over the past 15 years, theoretical computer scientists, statisticians, and mathematicians have written thousands of papers on the topic of differential privacy and how to implement it.
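As a concrete illustration of calibrated randomization, consider the classic Laplace mechanism from the literature applied to a count query: because adding or removing one record changes a count by at most 1 (its "sensitivity"), adding Laplace noise with scale 1/ε satisfies ε-differential privacy. This is a minimal sketch for intuition, not LeapYear's implementation; the function names are our own.

```python
import random

def laplace_noise(scale: float) -> float:
    """Sample from Laplace(0, scale): the difference of two iid exponentials."""
    return random.expovariate(1.0 / scale) - random.expovariate(1.0 / scale)

def private_count(records, predicate, epsilon: float) -> float:
    """Differentially private count via the Laplace mechanism.

    A count has sensitivity 1 (adding or removing one record changes it
    by at most 1), so Laplace noise with scale 1/epsilon yields epsilon-DP.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)

# Example: an approximate answer to "how many values exceed 50?"
# The noise hides whether any particular record is in the data set.
values = list(range(100))
noisy_answer = private_count(values, lambda v: v > 50, epsilon=0.5)
```

Note the trade-off the text describes: a smaller ε gives a stronger privacy guarantee but a noisier, less useful answer, which is exactly why calibrating this variability well is hard.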
Most of these papers attempt to address how, for various analytical procedures, the definition of differential privacy can be satisfied. Unfortunately, the academic literature is of mixed quality and implementability. Some papers contain errors (for instance, they are not truly differentially private), and others cannot be practically implemented (they are impossible to translate into working code or introduce too much error when run on real-world data sets). In summary, differential privacy is a standard that is complex and challenging to meet.
To enable differential privacy for the enterprise, providers need to build several components into privacy-enhancing technology and overcome technical challenges.
The following are a few examples:
- A differentially private platform must support the complete range of analytical functions used by enterprise analysts and data scientists. These range from aggregates, statistical functions, and data operations to machine learning algorithms. Ensuring that the system contains correct and accurate implementations of these differentially private algorithms requires understanding the literature, identifying the best theoretical results, and mapping them to working, production software. This process contains many challenges in statistical analysis and machine learning.
- For a platform to be differentially private, every single computation it processes needs to meet this standard. In other words, analysts must be unable to run computations that are not differentially private. This makes designing truly privacy-protecting systems difficult. Mathematicians and technologists must work together to ensure that there is no way to subvert the differentially private computations to exfiltrate the data.
- The platform needs to rigorously track composition. It is not sufficient for individual computations to be differentially private: The entire set of programs that are ever executed must also be so. Tracking composition correctly and allowing for continued value to be derived from sensitive data is a challenge that has bedeviled data scientists and led to innovations in information theory.
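Composition tracking can be made concrete with the simplest rule from the literature, basic sequential composition: the ε values of successive queries add up, so the platform can enforce a total privacy budget per data set. The toy tracker below is a sketch of that idea under our own names; production systems use tighter, more sophisticated accounting than simple addition.

```python
class PrivacyBudget:
    """Tracks cumulative privacy loss under basic sequential composition.

    Each epsilon-DP query consumes epsilon from a fixed total budget;
    once the budget is exhausted, further queries are refused.
    """

    def __init__(self, total_epsilon: float):
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    def charge(self, epsilon: float) -> None:
        """Reserve epsilon for one query, or refuse if it would exceed the budget."""
        if self.spent + epsilon > self.total_epsilon:
            raise PermissionError("privacy budget exhausted")
        self.spent += epsilon

    def remaining(self) -> float:
        return self.total_epsilon - self.spent

budget = PrivacyBudget(total_epsilon=1.0)
budget.charge(0.25)  # first query
budget.charge(0.25)  # second query
print(budget.remaining())  # prints 0.5
```

The hard part the text alludes to is doing better than this naive addition: advanced composition theorems let many more queries run within the same total budget, which is where the real engineering and information-theoretic innovation lies.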
- To be practically deployed in the real world, a differentially private platform needs to scale, in many cases, to petabyte size data sets. This requires solving technical challenges in scalable and distributed computing.
These are just a few examples of the challenges that must be addressed to implement a practical differential privacy platform in a way that is both secure and usable. So, it’s not surprising that developing a differential privacy platform for the enterprise requires technical talent across a diverse set of technical fields. Our focus at LeapYear has been to assemble outstanding engineers with different specialties and align their efforts towards creating and innovating our privacy-enhancing technology platform.
The scope and breadth of differential privacy poses an additional challenge, which is mapping an abstract mathematical standard to solutions that address real-world business problems. Assembling the technical team and developing the core technology are only steps in the journey. Implementing differential privacy requires extensive experience in developing practical solutions for vertical-specific enterprise data challenges.
The LeapYear team provides expertise in enterprise security, regulatory frameworks, data sets, analytical workflows, and business objectives. We use these insights to map our privacy-protecting technology to achieve the goals of specific use cases across industries. Without the right data and business context, differential privacy can be incorrectly applied, thus compromising analytical utility and potentially voiding any and all privacy guarantees.
LeapYear has a dedicated group of solutions architects who have vertical expertise: a foundational understanding of differential privacy; and experience in implementing these mathematical guarantees to solve business problems across healthcare, technology, financial services, and government.
What enterprises can achieve with differential privacy
Prior to differential privacy, access to information was effectively synonymous with access to data: To obtain insights from data, one had to have access to data. However, the reality is that these are distinct ideas. The value of differential privacy is that it provides a rigorous framework for drawing a hard line between them. For the first time, differential privacy enables individuals who previously could not obtain access to data sets to still derive valuable insights from the information contained therein.
With LeapYear’s differential privacy platform, enterprise teams can now leverage highly sensitive data sets in ways that they could not even imagine before.
Here are a few examples of how institutions are leveraging LeapYear’s platform to use sensitive data sets in new, practical ways:
- Healthcare. Multiple top-five U.S. health insurance companies can now make personal health information (PHI) data on 100M+ individuals available to third parties to understand the effectiveness of therapies, without exposing any private information about individual patients.
- Retail banking. Global banks share insights from customer data across borders (countries with strict data residency requirements) with partners (such as co-brand card partners) and across lines of business (such as with investment research), without exposing individual customer data across these boundaries.
- Capital markets. Multiple top-10 brokers are analyzing data across their institutional clients’ holdings and trades to develop information products for clients. These products provide market color that enables clients to understand financial markets more deeply while ensuring that one client can never see or infer proprietary information about another.
- Technology. Several global technology firms are making user data available for research and business partnerships while ensuring that no information about individual user activity on their platforms can be viewed, exfiltrated, or reconstructed.
Differential privacy sets the standard in privacy-enhancing technologies
The progress of many major technological innovations can be traced back to a single idea. This idea tends to serve as a foundational pillar on which the future is built. For instance, the explosion of modern computing can be traced back to the transistor, the internet to networking protocols, and information security to public key cryptography. Differential privacy has the right properties to be a pillar for ensuring the privacy of sensitive information: It is intuitive, generalizable, and broadly applicable. The academic community has already settled on differential privacy as the de facto standard for privacy research.
As with many technologies, there is a gap to be bridged between theory and practice for differential privacy. However, in the case of differential privacy, this gap is particularly significant: Across 15 years of theoretical research, academic teams did not produce a viable commercial solution.
LeapYear has harnessed the expertise of our team of researchers, engineers, and business strategists to bridge this gap and create a practical platform for sharing sensitive data based on differential privacy. In partnership with some of the largest stewards of sensitive data, such as credit card and healthcare companies, we have demonstrated that this gap is very much surmountable.
Our vision is for every industry to have a secure foundation on which to build innovative sensitive data-driven applications and unlock value in ways previously unimaginable, all while protecting the privacy of individuals and institutions to a degree never thought possible before.
How will your organization use differential privacy to share and monetize data?