We can all agree that privacy is a good thing; but what does it really mean, especially in the context of machine learning and data sharing? Human beings are adept at dealing with the ambiguities and nuances of language, but if we are going to try and commit our intuitions to code, we have to be much more precise. This can be surprisingly tricky, and it is illustrative to consider a brief history of technical approaches to data privacy.
Traditional approaches to data privacy have relied on the belief that we could “anonymize” or “de-identify” data. At the simplest level, think about this as “removing names” and other uniquely identifying features such as social security numbers. However, if I release a data set of medical records to promote research into new ML-based disease screening techniques, have I protected privacy if I have scrubbed off the names and other unique identifiers from the medical records?
A case like this happened in the mid-1990s, when the Group Insurance Commission of Massachusetts decided to publicly release information summarizing the hospital visits for all state employees. Of course, being sensitive to privacy, they “de-identified” these hospital records; however, they did retain in each record the useful demographic information about patients, such as their age, gender, and zip code. Although none of these features on their own were uniquely identifying, if you think about it for a few minutes, it will not come as a big surprise that in combination, a small number of idiosyncratic features can be sufficient to uniquely identify an individual even within a large population. This was the case: Latanya Sweeney (who was a Ph.D. student at MIT at the time and is now a professor at Harvard) was able to cross-reference the nominally “de-identified” hospital records with the voter registration roles in Cambridge Massachusetts (which also included the age, gender, and zip code of residents, this time attached to their names) and was able to re-attach the names to many of the hospital records, including that of Bill Weld, the governor of Massachusetts at the time.
What followed was many years of work that attempted to “patch up” the vulnerabilities exposed by the most recent attack, without bothering to really try and define what we meant by privacy. This was essentially a cat-and-mouse game between privacy researchers and attackers, and it was ultimately a losing game for the privacy researchers. Let’s take a look at just one attempt: k-anonymity. We could look at Sweeney’s original re-identification attack and indicate what went wrong: even though there were no individual attributes that uniquely identified patients in the hospital records, there were combinations of attributes (zip code, age, and gender) that collectively did. The idea behind k-anonymity was to ensure that this does not happen. Via a combination of data redaction and data coarsening (e.g., reporting ages in 10-year ranges rather than the exact age), k-anonymity ensures that no combination of attributes matches fewer than k records. Consider the following example taken from Chapter 1 of my recent book, The Ethical Algorithm. Suppose I have a 56-year-old female neighbor and I know that she has been treated at the Hospital of the University of Pennsylvania. That hospital has released the following 2-anonymous table of medical records:
Because the table is 2-anonymous, I cannot identify the record corresponding to my neighbor—what I know about her matches two records. Thus far, k-anonymity has accomplished what it set out to on its own terms; however, you might already suspect that this has not captured everything we might want from a definition of privacy because although I cannot identify her record, I have already learned something about her—she either has HIV or colitis, which she may well not want me to know. But the problem goes deeper. Suppose I know that my neighbor has also been a patient at another local hospital, Pennsylvania Hospital, and suppose that they too have released nominally anonymized medical records, say the following 3-anonymous table:
|*||60-70||Male||191**||Flu Like Symptoms|
Although in isolation, my knowledge of my neighbor only matches three records in this table. I can cross-reference it with the other table, and the guarantee of privacy entirely evaporates: I can now re-identify her record once again and learn her diagnosis.
The problem here (and with many other attempts to “patch up” failed attempts at anonymization) is the implicit assumption that the released data exists in isolation. In reality, a wealth of other information is available for cross-referencing, and it is difficult to anticipate what kind of extra information might now or in the future be available to a possible attacker. Narayanan and Shmatikov discuss this well in Myths and Fallacies of “Personally Identifiable Information.”
The real breakthrough came in 2006, when Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith proposed differential privacy (for which they subsequently won the prestigious Godel Prize). Unlike previous approaches, differential privacy is a definition and not a technique; it specifies what we mean by privacy. Informally, the premise is this: suppose some scientific study was performed or some machine learning algorithm was trained entirely independently of you—the study had no access to your data and you had no effect on the outcome in any way. In this case, you should not consider your privacy to have been violated because nobody even had access to your data. Let’s call this the “ideal world.” What differential privacy promises is that nobody, no matter what they might already know, should be able to distinguish between the real world, in which your data was used, and the ideal world in which it was not, substantially better than random guessing. If we buy the premise that your privacy was not violated in the “ideal world” in which your data was not used and that there is no way to reliably distinguish the real world from the ideal world, then we should also view your privacy as having been only minimally violated in the real world. What do these words, substantially and minimally, refer to? Differential privacy comes with a parameter that allows you to quantitatively dial up or down the privacy guarantee; informally, the parameter controls the strength of the word substantially. It is a strength that differential privacy is parametric, because an early lesson that anyone studying data privacy learns is that privacy does not come for free. Asking for more privacy will have the cost of either diminished data utility or the need to collect more data. Deciding how to trade-off privacy and utility is inherently a policy decision, and the parametric nature of differential privacy allows this decision to be made with our eyes open.
A common initial reaction to hearing about differential privacy is that it indeed corresponds to a strong promise of privacy, but one that might be too strong. It would not be interesting if differential privacy precluded deriving insight from data. Fortunately, 15 years of research has shown that this is not the case: differential privacy is in fact compatible with a wide range of statistical analyses and machine learning techniques. Moreover, it has started being widely adopted by big tech companies including Apple and Google and by government entities such as the US Census Bureau. As more tools and services become available, making it easier for organizations to adopt differentially private data analysis techniques, I expect that we will see differential privacy becoming even more widespread.
Differential privacy is a fascinating topic, and many resources are now available to learn more at different levels of sophistication. For someone wanting to dive right into the mathematics of differential privacy, the textbook that I wrote with Cynthia Dwork remains a good place to start. For those wanting a thorough but non-mathematical introduction, I recommend this law review article written by a number of leading experts at Harvard. For a general audience introduction to differential privacy (as well as ways to think carefully about embedding other “ethical norms” into algorithms), I recommend my recent book with Michael Kearns, The Ethical Algorithm.