This post addresses a question people often ask when they start learning about differential privacy and its power to protect individual privacy: “Are you just adding noise to the data?” It is an important question that deserves a careful answer, because there are several ways to apply differential privacy, each with its own purposes, strengths, and weaknesses. This post begins with some background on why “randomness” solves the problem of privacy. Then, we review how companies such as Apple, Google (with Chrome), and LeapYear use differential privacy in different ways to solve a wide variety of privacy problems, and why “just adding noise to the data” oversimplifies how privacy is achieved.
The topic of privacy, in general, concerns the control of information. Many of the services we use every day handle data that we expect to remain private.
Many factors contribute to the control, and the loss, of privacy. Much of the focus is on the malicious actor: someone purposely trying to steal and expose private information. Unfortunately, loss of information control is not entirely the work of hackers and other bad actors. There are examples in the recent past where, through either a lack of understanding or weaknesses in defined standards, private information was left open to attack and re-identification. Aside from situations where the analysts themselves become the problem and exfiltrate the data (think Edward Snowden), AOL famously published over half a million web search queries of its users with associated (hashed) usernames. Even though actual usernames were not released in the data set, the information in the queries was more than sufficient to re-identify many individuals. By combining multiple pieces of information, attackers can single out a user. For example, a user who searched for “Moving service near <zip code>” and “<surname> genealogy” in the AOL release could be tied to a specific individual by a determined attacker.
Perhaps counter-intuitively, unique identification does not require a uniquely identifying attribute in the data set, such as a name, username, or social security number, because a quasi-identifier can be created by combining several ordinary attributes. For example, Sweeney showed that 87% of the U.S. population can be re-identified using only three pieces of information: date of birth, sex, and ZIP code. This brings us to the topic of randomness.
As a general approach, calibrated randomness is critical for protecting privacy because it adds plausible deniability for individuals in a data set. A simple example of a privacy scheme for collecting sensitive information from a population is randomized response (RR). In RR, each individual is asked to respond to a question: in this example, the question is “Are you a smoker?” The respondent is instructed to answer in the following manner.
Flip a coin.
- If the coin lands on heads, answer the question truthfully.
- If the coin lands on tails, answer “yes” regardless of the truth.
For a large number of truthful respondents, the expected number of individuals who responded “yes” includes the half who flipped tails (and were forced to say “yes”) plus the true smokers among the half who flipped heads. If p is the true fraction of smokers, the expected fraction of “yes” responses is 1/2 + p/2, so p can be estimated from the observed responses as 2 × (fraction of “yes”) − 1.
Plausible deniability arises because a “yes” answer does not prove the respondent smokes: there is a 50% chance they answered “yes” simply because the coin landed on tails. The responses of the sample population are still useful because the fraction of the population who smokes can be inferred from them.
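As a quick sanity check, the scheme above can be simulated. This is a minimal sketch (the function names and the synthetic population are illustrative, not from the original post), assuming, as described above, that heads means a truthful answer and tails forces a “yes”:

```python
import random

def randomized_response(is_smoker, rng):
    """One respondent's answer: heads -> truthful, tails -> forced "yes"."""
    if rng.random() < 0.5:  # heads
        return is_smoker
    return True             # tails: answer "yes" regardless of the truth

def estimate_smoker_fraction(answers):
    """Invert E[yes] = 1/2 + p/2 to recover p from the noisy answers."""
    yes_frac = sum(answers) / len(answers)
    return 2 * yes_frac - 1

rng = random.Random(0)
true_p = 0.3  # synthetic population: 30% smokers
population = [rng.random() < true_p for _ in range(100_000)]
answers = [randomized_response(s, rng) for s in population]
print(round(estimate_smoker_fraction(answers), 2))  # close to the true fraction 0.3
```

No individual answer reveals whether that respondent smokes, yet the aggregate estimate converges on the true fraction as the sample grows.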
In the RR example above, we assumed that the coin toss is fair, with a 50% chance of landing on heads. Calibrating this randomness changes the trade-off between accuracy and the privacy of individuals: for example, a coin biased toward heads yields more truthful answers and therefore more accurate statistics, but gives each respondent weaker plausible deniability.
The RR example explains what it means to add randomness in a purposeful way and some of the trade-offs we may consider.
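The calibration knob can be made concrete by generalizing the coin to a biased one. In this sketch (the p_heads parameter and function names are illustrative assumptions), heads still means a truthful answer and tails still forces a “yes”:

```python
import random

def rr_estimate(data, p_heads, rng):
    """Randomized response with a biased coin: with probability p_heads the
    respondent answers truthfully, otherwise they are forced to say "yes".
    Then E[yes] = p_heads * p + (1 - p_heads), which we invert to estimate p."""
    answers = [(x if rng.random() < p_heads else True) for x in data]
    yes_frac = sum(answers) / len(answers)
    return (yes_frac - (1.0 - p_heads)) / p_heads

rng = random.Random(1)
data = [rng.random() < 0.3 for _ in range(100_000)]  # true smoker fraction: 0.3
for p_heads in (0.5, 0.75, 0.9):
    # A coin more biased toward heads (higher p_heads) gives a lower-variance
    # estimate, but weaker plausible deniability for each respondent.
    print(p_heads, round(rr_estimate(data, p_heads, rng), 3))
```

At p_heads = 1.0 every answer is truthful: maximum accuracy, no privacy. At p_heads = 0.5 we recover the fair-coin scheme. The bias is the dial that trades one against the other.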
Let’s discuss the differences in protecting privacy in two cases: first, where the data is being actively collected, and second, where the data has already been collected and stored. Below we explore the important considerations and differences in how privacy should be preserved in each.
Let’s consider the use cases of Apple and Chrome. For these consumer products, the respective companies want to collect information about the usage patterns of their users and devices. This information inherently contains private details, and not all consumers would be comfortable with it being collected and stored in some remote data center. To reduce consumer fears about privacy, both Apple and Chrome protect the privacy of data “in motion” using differential privacy techniques, adding randomness to their data collection pipelines. As in the coin-flip example above, the randomness is calibrated to give an acceptable trade-off between accuracy and privacy, and all participating users follow the same defined pipeline. Once the data is collected, the information is private with respect to individuals and can still be used to gain a statistical understanding of patterns in the data set. Let’s address an important consideration in this approach.
Accounting for the randomness added in the pipeline is not a major drawback for Apple and Chrome because the set of queries is defined before data collection occurs: the randomness can be calibrated precisely to the queries they will run later. However, this approach precludes altering the questions after the data has been collected, meaning that free-form analytics is not realistically possible if you want to preserve privacy.
Adding noise to the data in motion is necessary if the data collection pipeline requires privacy. However, many organizations have already collected data that is inaccessible to the analysts who want to gain insights from the data owing to regulations or policies preventing access. Let’s consider this case next.
When the data is already collected and centralized and an analyst wants to ask arbitrary statistical queries, we might consider adding noise directly to the data set. However, this is not a good solution: it restricts the ability to perform useful analytics, and it provides no provable protection against a determined attacker.
Indeed, it could be argued that adding randomness directly to the data never truly allows you to simultaneously protect privacy and generate value from it. Fortunately, a better and simpler way to achieve privacy exists: add randomness to the process of computing a statistic using differential privacy. “Output perturbation,” i.e., directly adding randomness to the statistic, reduces the burden on analysts and lets them treat the answer to a query as an estimate of the true statistic. An output-perturbation version of privately estimating the number of smokers in a database would simply add differentially private randomness to the count statistic.
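A minimal sketch of output perturbation via the Laplace mechanism, one standard way to make a count differentially private (the helper names and the synthetic database are illustrative assumptions):

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse CDF of a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_count(records, predicate, epsilon, rng):
    """Output perturbation: compute the true count, then add noise.
    A count has sensitivity 1 (adding or removing one person changes it by
    at most 1), so Laplace noise with scale 1/epsilon yields
    epsilon-differential privacy."""
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
smokers = [rng.random() < 0.3 for _ in range(10_000)]  # synthetic database
noisy = dp_count(smokers, lambda s: s, epsilon=0.1, rng=rng)
print(round(noisy))  # close to the true count; noise std is about 14 at epsilon=0.1
```

Note that the underlying records are untouched: only the released statistic is perturbed, and the noise does not grow with the size of the database.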
Differential privacy behaves like RR with respect to the magnitude of randomness: a larger amount of randomness makes the result less accurate. However, it has the added benefit that the noise can be calibrated to each statistic in the query. The randomness required for computing a count does not depend on any properties of the data and can be calculated completely separately from that of a mean query. Adding noise to the computation allows the addition of the minimum amount of noise required for that particular computation to protect privacy. This means that we can protect a complete machine learning workflow: we do not need to restrict the analysts’ ability to query, investigate, and model data sets to some pre-set conditions to achieve privacy, because the privacy of the underlying data can be protected on the fly.
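To illustrate per-statistic calibration, here is a hedged sketch of a differentially private mean over bounded values (the bounds, helper names, and synthetic column are assumptions for illustration). The noise scale depends only on the bounds and the record count for this particular query, not on any other statistic being computed:

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) via the inverse CDF of a uniform draw."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def dp_mean(values, lower, upper, epsilon, rng):
    """A mean of n values clipped to [lower, upper] has sensitivity
    (upper - lower) / n: changing one record moves the mean by at most that
    much. The noise is calibrated to this query alone, so the mechanism adds
    only the minimum noise this particular computation requires."""
    n = len(values)
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)

rng = random.Random(7)
ages = [rng.randint(18, 90) for _ in range(50_000)]  # synthetic column
print(round(dp_mean(ages, 0, 120, epsilon=0.5, rng=rng), 2))
```

Because sensitivity shrinks as 1/n, the noise on a mean over a large data set is tiny: the analyst gets a nearly exact answer while every individual record stays protected.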
Finally, an approach that adds randomness to the computation allows the seamless inclusion of new data sets. The data does not need pre-processing, there are no restrictions on the types of queries available, and the privacy of individual records is mathematically provable. This is the approach LeapYear takes on our platform; it ensures maximum data privacy and data utility (including the ability of analysts to do their job in a natural way).
Many approaches can be used for protecting the privacy of individuals in a data set. Apple and Chrome have specifically used differential privacy techniques to protect the data collected from their devices and consumers. Although it may seem useful, directly adding noise to data is not effective when the data is already collected and centralized: this technique restricts the ability to perform useful analytics and provides no provable protection from malicious actors. In contrast, differential privacy can be applied to the computation itself, adding noise to the statistic and supporting a method of querying and analyzing data that is both useful and private.
1 Differential privacy (DP) is a branch of applied mathematics concerned with protecting an individual’s privacy while still allowing analysts to learn about the statistics and trends within a population. DP solves these privacy issues by using carefully calibrated randomness to hide details about individuals while maintaining the trends and statistics of groups. LeapYear uses DP to let analysts immediately access data that was previously unavailable to them and gain new insights while protecting the privacy of individuals. Apple and Chrome developed solutions based on DP for collecting information about how people use their products without sending precise information about a user’s behavior back to their respective centralized data warehouses.