
Noise in Data vs Noise in Computation
This blog post discusses a question people often ask when they start learning about differential privacy[1] and its power to protect individual privacy: “Are you just adding noise to the data?” It is an important question that deserves a proper answer, because there are several approaches to applying differential privacy, each with its own purposes, strengths, and weaknesses. This post begins with some background on why “randomness” solves the problem of privacy. Then, we review how Apple, Google (in Chrome), and LeapYear use differential privacy in different ways to solve a wide variety of privacy problems, and why “just adding noise to the data” oversimplifies how privacy is achieved.
Why is randomness important for privacy?
Privacy, in general, is about controlling information. Consider a few of the services and safeguards we rely on when we want our data to be private:
- When we shop online, we rely on HTTPS encryption to keep eavesdroppers from reading information on the wire.
- We rely on services like PayPal and Apple Pay to centralize and control payment information, reducing the number of possible points of data breaches.
- We also rely on regulatory frameworks such as HIPAA to define how our personal information should be handled and shared.
Many factors contribute to the control, and loss, of privacy. Much of the focus is on the malicious actor: someone purposely trying to steal and expose private information. Unfortunately, loss of control over information is not entirely the work of hackers and other bad actors. There are recent examples where, through a lack of understanding or weaknesses in defined standards, private information was left open to attack and re-identification. Setting aside cases where the analysts themselves become the problem and exfiltrate the data (think Edward Snowden), AOL famously published over half a million web search queries of its users along with associated (hashed) usernames[2]. Even though actual usernames were not released in the data set, the information in the queries was more than sufficient to re-identify many individuals. By combining multiple pieces of information, attackers can single out a user. For example, a user who searched for “Moving service near <zip code>” and “<surname> genealogy” in the AOL release might be traced back to a specific individual by a determined attacker.
Perhaps counter-intuitively, unique identification does not require a uniquely identifying attribute in the data set, such as a name, username, or social security number, because quasi-identifiers can be created by combining several attributes. For example, Sweeney showed that 87% of the U.S. population can be re-identified using only three pieces of information: date of birth, sex, and ZIP code[3]. This brings us to the topic of randomness.
As a general approach, calibrated randomness is critical for protecting privacy because it adds plausible deniability for individuals in a data set. A simple example of a privacy scheme for collecting sensitive information from a population is randomized response (RR). In RR, each individual is asked to respond to a question: in this example, the question is “Are you a smoker?” The respondent is instructed to answer in the following manner.
Flip a coin:
- if it lands on heads, respond truthfully;
- otherwise, say “yes.”
For a large number of respondents who follow the protocol, the expected set of “yes” answers includes
- 50% of the non-smokers, who were forced to respond “yes” because they flipped tails,
- 50% of the smokers, who were forced to respond “yes” because they flipped tails, and
- 50% of the smokers, who responded truthfully because they flipped heads.
Plausible deniability arises because a “yes” answer does not necessarily mean the respondent smokes: there is a 50% chance that the coin landed on tails and forced a “yes” regardless of the truth. The responses are still useful in aggregate because the fraction of the population who smokes can be inferred from them.
In the RR example above, we assumed that the coin toss is fair and has a 50% chance of landing on heads. Calibrating this randomness controls the trade-off between the accuracy of the results and the privacy of individuals. For example, the trade-off changes if:
- The hypothetical coin were biased to give 99% heads. Then, nearly all responses would be truthful. The data would be much more accurate but barely private.
- The hypothetical coin were biased to give 1% heads. Then, nearly all responses would be “yes.” The information would be private, but at the cost of losing almost all meaningful information about the population.
The RR example explains what it means to add randomness in a purposeful way and some of the trade-offs we may consider.
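To make the trade-off concrete, here is a minimal sketch in Python (hypothetical code; the coin’s bias toward heads is exposed as a p_heads parameter) that simulates the RR scheme above and recovers an estimate of the smoking rate from the noisy answers:

```python
import random

def randomized_response(is_smoker: bool, p_heads: float = 0.5) -> bool:
    """One respondent's answer: tell the truth on heads, say 'yes' on tails."""
    if random.random() < p_heads:   # coin landed on heads
        return is_smoker            # truthful answer
    return True                     # forced "yes"

def estimate_smoking_rate(answers: list, p_heads: float = 0.5) -> float:
    """Debias the observed 'yes' rate using
    P(yes) = (1 - p_heads) + p_heads * true_rate."""
    observed_yes = sum(answers) / len(answers)
    return (observed_yes - (1.0 - p_heads)) / p_heads

# Simulate 100,000 respondents, 30% of whom actually smoke.
population = [random.random() < 0.30 for _ in range(100_000)]
for p_heads in (0.99, 0.50, 0.01):  # nearly truthful -> fair coin -> mostly forced "yes"
    answers = [randomized_response(s, p_heads) for s in population]
    raw_yes = sum(answers) / len(answers)
    print(p_heads, round(raw_yes, 3), round(estimate_smoking_rate(answers, p_heads), 3))
```

With a fair coin, the raw answers say little about any one respondent, yet the debiased estimate of the smoking rate remains close to the true 30%; biasing the coin shifts that balance between accuracy and privacy.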
Let’s discuss protecting privacy in two cases: first, where the data is being actively collected, and second, where the data has already been stored. Below we explore the important considerations and how privacy should be preserved in each case.
Randomness and Data in Motion: Adding Noise During Data Collection
Let’s consider the use cases of Apple and Chrome. For these consumer products, the respective companies want to collect information about the usage patterns of their users and devices. This telemetry inherently contains private information, and not all consumers would be comfortable with it being collected and stored in some remote data center. To reduce consumer fears about privacy, both Apple[4] and Chrome[5] protect data “in motion” using differential privacy techniques, adding randomness to their data collection pipelines. As in the coin flip example above, the randomness is calibrated to give an acceptable trade-off between accuracy and privacy, and all participating users follow the same defined pipeline. Once collected, the data is private with respect to the individuals it describes yet can still be used to gain a statistical understanding of patterns in the data set. Let’s address some important considerations in this approach:
- Adding noise to the data pipeline means that you are changing something about the data: variances can increase, means may shift, and correlations may change.
- Because of the first point, the downstream analyst must know precisely how the noise was added so that they can subtract its effects from their statistics.
- The ability to change what data is collected or how randomness is applied is constrained, because any change may create inconsistencies with previously collected data.
Accounting for the randomness added to the data in the pipeline is not a major drawback for Apple and Chrome because the set of queries is defined before data collection occurs: the randomness can be calibrated precisely to the set of queries they will perform later. However, this approach precludes the ability to alter the questions after the data has been collected, meaning that free-form analytics is not realistically possible if you want to preserve privacy.
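As a rough sketch of what randomization in the collection pipeline can look like (a minimal one-bit example, not Apple’s or Chrome’s actual encodings, which are considerably more elaborate), each client can randomize a single boolean telemetry flag before it leaves the device, with the flip probability set by the privacy parameter epsilon, and the server can debias only the aggregate:

```python
import math
import random

def randomize_flag(true_value: bool, epsilon: float) -> bool:
    """Client side: keep the true flag with probability e^eps / (1 + e^eps),
    otherwise flip it. Only the randomized bit leaves the device."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    return true_value if random.random() < p_keep else not true_value

def estimate_rate(reports: list, epsilon: float) -> float:
    """Server side: debias the observed frequency of 'True' reports to
    estimate the population rate; individual reports stay deniable."""
    p_keep = math.exp(epsilon) / (1.0 + math.exp(epsilon))
    observed = sum(reports) / len(reports)
    # E[observed] = p_keep * rate + (1 - p_keep) * (1 - rate)
    return (observed - (1.0 - p_keep)) / (2.0 * p_keep - 1.0)

# Simulate 100,000 devices, 30% of which have the flag set.
epsilon = 1.0
true_flags = [random.random() < 0.30 for _ in range(100_000)]
reports = [randomize_flag(f, epsilon) for f in true_flags]
print(round(estimate_rate(reports, epsilon), 3))  # close to 0.30
```

Note that the server must know exactly how the clients randomized their reports (the value of epsilon here) in order to remove the bias, which is the second consideration above, and changing epsilon later would make old and new reports incompatible, which is the third.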
Adding noise to the data in motion is necessary if the data collection pipeline itself requires privacy. However, many organizations have already collected data that, owing to regulations or policies restricting access, is inaccessible to the analysts who want to gain insights from it. Let’s consider this case next.
Randomness and Data at Rest
When data has already been collected and centralized, and an analyst wants to ask arbitrary statistical queries, one option is to add noise directly to the data set. However, several factors make this a poor solution.
- Similar to the data in motion case, adding noise to the data means that you are changing something about the data; therefore, the analyst needs to account for these alterations of the statistics in their work.
- In addition, depending on the query, different amounts of randomness should be added to the data to achieve the desired accuracy trade-off; this can be a very complicated problem to tackle for active and dynamic data investigations and analyses.
- The analyst must be perfectly trustworthy: if they have access to both the noisy data and the details of the randomness applied, they may be able to reverse engineer the original records.
- The randomness itself can be attacked: an intelligent attacker can query the data set repeatedly, work out the randomness scheme that was applied, and back out the original data.
It could be argued that adding randomness directly to the data is never a practical way to simultaneously protect privacy and generate value from the data. Fortunately, a better and simpler way to achieve privacy exists: adding randomness to the process of computing a statistic, using differential privacy. “Output perturbation,” that is, adding randomness directly to the statistic, reduces the burden on analysts and allows them to treat the answer to a query as an estimate of the true statistic. An output perturbation version of privately estimating the number of smokers in a database would simply add differentially private randomness to the count statistic.
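As a minimal sketch of output perturbation (using the standard Laplace mechanism, not any specific vendor’s implementation), a private count adds noise calibrated to the query’s sensitivity: adding or removing one person changes a count by at most 1, so Laplace noise with scale 1/epsilon suffices.

```python
import numpy as np

def private_count(flags: list, epsilon: float) -> float:
    """Differentially private count via the Laplace mechanism.
    A count has sensitivity 1 (one person changes it by at most 1),
    so Laplace noise with scale 1/epsilon gives epsilon-DP."""
    true_count = sum(flags)
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical database: 3,500 smokers out of 10,000 records.
smokers = [True] * 3_500 + [False] * 6_500
print(private_count(smokers, epsilon=0.5))  # ~3500, off by a few units of noise
```

The underlying records are never modified; only the released statistic is perturbed, and the analyst can treat it as an estimate of the true count.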
Differential privacy behaves like RR with respect to the magnitude of randomness: a larger amount of randomness makes the result less accurate. However, it has the added benefit that the noise can be calibrated specifically to each statistic in a query. The randomness required for computing a count does not depend on any properties of the data and can be calculated completely separately from the randomness for a mean query. Adding noise to the computation means adding only the minimum amount of noise that particular computation requires to protect privacy. As a result, a complete machine learning workflow can be protected: analysts do not need their ability to query, investigate, and model data sets restricted by pre-set conditions, because the privacy of the underlying data can be protected on the fly.
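To illustrate per-statistic calibration (again a sketch under simplifying assumptions: values are clipped to a publicly known range and the record count is treated as public), the sensitivity of a mean over n bounded records is (hi - lo)/n, so the noise added to a mean can be far smaller than a blanket perturbation of the data would require.

```python
import numpy as np

def private_mean(values, lo: float, hi: float, epsilon: float) -> float:
    """Differentially private mean of values clipped to [lo, hi].
    Replacing one record shifts the mean by at most (hi - lo) / n,
    so that sensitivity sets the scale of the Laplace noise."""
    clipped = np.clip(values, lo, hi)
    sensitivity = (hi - lo) / len(clipped)
    return float(clipped.mean() + np.random.laplace(loc=0.0, scale=sensitivity / epsilon))

ages = np.random.randint(18, 90, size=10_000)         # hypothetical data
print(private_mean(ages, lo=18, hi=90, epsilon=0.5))  # close to the true mean
```

Each statistic receives exactly the noise its own sensitivity requires, which is what makes free-form, on-the-fly analysis compatible with privacy.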
Finally, an approach that adds randomness to the computation allows the seamless inclusion of new data sets. The data needs no pre-processing, there are no restrictions on the types of queries available, and the privacy of individual records is mathematically proven. This is the approach LeapYear takes on our platform, and it ensures maximum data privacy and data utility (including the ability of analysts to do their job in a natural way).
Summary
Many approaches can be used for protecting the privacy of individuals in a data set. Apple and Chrome have specifically used differential privacy techniques to protect the data collected from their devices and consumers. Although it may seem useful, directly adding noise to data is not effective when the data is already collected and centralized. This technique restricts the ability to perform useful analytics and provides no provable protection from malicious actors. In contrast, differential privacy can be applied to the computation itself, adding noise to the statistic and supporting both a useful and private method to query and analyze data.