
Ensuring Privacy and Utility in Today’s Healthcare Data Ecosystem
In today’s digital economy, healthcare providers have had to become not only doctors, nurses, and practitioners, but also leading experts in data analytics, privacy, and usage. The growing value of machine learning and artificial intelligence to the enterprise has only hastened this interdependence between healthcare and sensitive data.
However, when you overlay the current regulatory environment and the public’s stance on privacy, a conundrum arises: how do we ensure both data privacy and data utility, when both are required in today’s healthcare data ecosystem?
HIPAA and similar laws and regulations provide guidelines for protecting data and maintaining data privacy. Unfortunately, some of the techniques employed to satisfy those requirements do not adequately protect the individuals in the underlying data sets. Many examples exist today, in healthcare and other major sectors, where theoretically anonymized data sets have been combined with external data to re-identify individuals. Often the methodology by which data has been de-identified lacks mathematical rigor, resulting in significant privacy concerns. As noted by Dr. Yves-Alexandre de Montjoye, a computer scientist at Imperial College London and author of a recent paper on re-identification, “[we] need to move beyond de-identification… Anonymity is not a property of a data set, but is a property of how you use it.”
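To make the risk concrete, consider a minimal, hypothetical sketch of such a linkage attack in Python, using pandas (all records and values below are invented for illustration). Joining a “de-identified” health data set to a public data set on shared quasi-identifiers, such as ZIP code, birth date, and sex, is often enough to restore names:

```python
import pandas as pd

# Hypothetical "de-identified" health records: names removed, but
# quasi-identifiers (ZIP code, birth date, sex) retained.
health = pd.DataFrame({
    "zip": ["02138", "02139"],
    "birth_date": ["1945-07-31", "1962-03-14"],
    "sex": ["F", "M"],
    "diagnosis": ["hypertension", "diabetes"],
})

# Hypothetical public data (e.g., a voter roll) that carries names
# alongside the same quasi-identifiers.
voters = pd.DataFrame({
    "name": ["Jane Doe", "John Smith"],
    "zip": ["02138", "02139"],
    "birth_date": ["1945-07-31", "1962-03-14"],
    "sex": ["F", "M"],
})

# A simple join on the shared columns re-attaches names to diagnoses,
# re-identifying every individual in the "anonymized" set.
linked = health.merge(voters, on=["zip", "birth_date", "sex"])
print(linked[["name", "diagnosis"]])
```

No single column here is identifying on its own; it is the combination that betrays identity, which is the pattern behind many published re-identification results.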
While current practices satisfy regulatory requirements to de-identify data, they fall short of addressing the conundrum of protecting individuals’ privacy while enabling data usage and utility at scale. Worse, many of these techniques rely on heuristic approaches, such as simply removing or aggregating columns, that ultimately devalue the data set and compromise the integrity of downstream applications.
Achieving Data Privacy and Utility
When we step back and think about the conundrum, we can identify a few key elements that yield a statistically meaningful data set for immediate application and downstream use, all while meeting the dual goals of ensuring data privacy and preserving data utility. Those elements and their tenets are:
- Data Value—Any approach considered should maximize data value and allow all of the data to be used to build insights. We must avoid, wherever possible, wholesale removal of fields or partial reduction of value by aggregating up (e.g., allowing three-digit ZIP codes but not five-digit entries).
- Data Privacy—We should insist on privacy that can be quantified. Getting there should not rely on a random mixture of rules, heuristics, and gut feelings. Quantifiable approaches, grounded in math and cryptography, can show us the risks associated with sharing data in a tangible, measurable manner.
- Data Control—We believe that, in the long term, data should stay with the data owners. Modern software, security, and remote-access techniques should combine to let the analyst do their job without taking physical control of the data set; a minimal sketch of this idea follows the list.
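As a rough illustration of that last tenet, here is a sketch (the function and names are invented, not a reference implementation) of a query interface in which the analyst receives only aggregate answers while the row-level records never leave the data owner’s system:

```python
# Invented example: the data owner exposes aggregate queries only;
# the analyst never receives row-level records.
ALLOWED_AGGREGATES = {"count", "mean"}

def answer_query(records, column, aggregate):
    """Answer an aggregate query over the owner's data without
    exposing any individual record to the analyst."""
    if aggregate not in ALLOWED_AGGREGATES:
        raise ValueError("only aggregate queries are permitted")
    values = [record[column] for record in records]
    if aggregate == "count":
        return len(values)
    return sum(values) / len(values)

# The owner holds the records; the analyst sees only the result.
records = [{"age": 54}, {"age": 61}, {"age": 47}]
print(answer_query(records, "age", "mean"))  # 54.0
```

An aggregate-only gate is not itself a privacy guarantee (small or repeated aggregates can still leak), but it shows the control model: analysis travels to the data, not the reverse.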
Differential Privacy: A Way Forward
There are always new technologies on the horizon, but one in particular, a mathematical standard called differential privacy, is well suited to address the first two must-haves of data value and data privacy. When combined with modern software architectures, it helps to ensure data control as well.
Differential privacy is a mathematical standard of privacy that provides a quantifiable, tangible means to help safeguard data privacy from many forms of compromise. A leading voice in differential privacy, Aaron Roth, associate professor of computer science at the University of Pennsylvania, explained, “What differential privacy promises is that nobody, no matter what they might already know, should be able to distinguish between the real world, in which your data was used, and the ideal world in which it wasn’t, substantially better than random guessing.”
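A minimal sketch can make the “quantifiable” part concrete. The classic mechanism for an epsilon-differentially-private count adds Laplace noise scaled to 1/epsilon (the counts below are hypothetical; only NumPy is assumed):

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count, epsilon):
    """Release a count with Laplace noise calibrated to epsilon.

    A counting query has sensitivity 1 (adding or removing one
    person changes the count by at most 1), so noise with scale
    1/epsilon satisfies epsilon-differential privacy.
    """
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical count of patients with a given diagnosis.
true_count = 1234

# Smaller epsilon = stronger privacy guarantee, noisier answer.
for epsilon in (0.1, 1.0):
    print(epsilon, round(laplace_count(true_count, epsilon)))
```

The privacy loss is captured by the single parameter epsilon, which is exactly what makes the guarantee measurable rather than a matter of heuristics.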
Differential privacy is rooted in deep academic study stretching back 15 years, with a large body of experts researching, peer-reviewing, and publishing the details of statistical and modeling functions. Companies such as Apple and Google (in Chrome) have used the technique to protect their users’ data, and it is gaining more traction each day.
However, this privacy technique alone is not enough. To be useful, it must be incorporated into a practical delivery mechanism—such as software—that can be used by a non-expert, supported by a traditional IT team, and applied broadly across data set types and statistical or analytical workflows. It’s not a trivial problem, but one that can be—and is being—solved today.
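For a sense of what such non-expert-friendly software might look like, here is a hedged sketch; the PrivateDataset class and its parameters are invented for illustration, and production libraries handle far more (privacy budgets across queries, more query types, floating-point subtleties):

```python
import numpy as np

class PrivateDataset:
    """Invented sketch: a wrapper that hides noise calibration so a
    non-expert analyst only picks a privacy budget (epsilon)."""

    def __init__(self, values, lower, upper):
        # Clipping to known bounds caps any one record's influence,
        # which is what makes the sensitivity analysis below valid.
        self._values = np.clip(np.asarray(values, dtype=float), lower, upper)
        self._lower, self._upper = lower, upper
        self._rng = np.random.default_rng()

    def mean(self, epsilon):
        # One record can move a bounded mean by at most (upper - lower) / n,
        # so Laplace noise with scale sensitivity / epsilon suffices.
        n = len(self._values)
        sensitivity = (self._upper - self._lower) / n
        noise = self._rng.laplace(0.0, sensitivity / epsilon)
        return float(self._values.mean() + noise)

# The analyst asks a familiar question; the privacy math stays internal.
ages = PrivateDataset([54, 61, 47, 70, 33], lower=0, upper=100)
print(ages.mean(epsilon=1.0))
```

The point is the division of labor: the analyst chooses the question and the privacy budget, while the software owns the sensitivity analysis and noise calibration.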
Finally, when the dual goals of data privacy and data utility are met, valuable use cases become possible.
- Cross-Organization Data Sharing—In the healthcare ecosystem, we see streamlined and functional relationships between payers, providers, and pharmaceutical companies focused on combining data for faster, better outcomes.
- Internal Data Sharing—In many companies, sharing data across lines of business or services is difficult. Legal and confidentiality restrictions may silo data, making effective use a challenge. Appropriate technology can enable teams to share data securely and privately, achieving maximum utility of the data.
It’s time to think differently about what it means to maintain the dual goals of data privacy and data utility. Siloed data is not the answer, nor is failing to honor data privacy rights and obligations. Don’t accept the status quo—look for new solutions that will not only simplify compliance but also amplify the ability to learn and to build the insights that drive the future of healthcare.