Differential Privacy Knowledge Base
This knowledge base is intended to be a non-exhaustive list of academic papers that are useful in understanding the current implementation of LeapYear’s privacy-preserving Machine Learning and Analytics platform.
For half a century, the field of cryptography has facilitated the secure sharing of electronic information—credit card transactions, file transfers, and communications—protected by cryptographic protocols. This has been possible through rigorous, mathematically proven guarantees. The assurances provided by the protocols to both individuals and businesses are backed by unassailable mathematical proof and the protocols have therefore stood the test of time.
The field of data privacy, however, has historically been devoid of such foundations. The only known approaches for exposing data for analysis while still attempting to maintain confidentiality have been heuristic-based: in other words, guesswork. These approaches involved redacting certain fields that were considered especially sensitive, such as PII, or applying rules such as “reveal the age of a person but not their birthdate” or “only show a result if there are more than 20 people in the sample.”
It is a well-known fact that these techniques do not work; countless studies and real-world breaches have demonstrated that these approaches can be reverse engineered. Every time an approach is breached, privacy practitioners propose a new, slightly more sophisticated technique, such as adding noise to release “synthetic data,” performing aggregations such as “k-anonymity,” or instituting a broad class of statistical techniques called “data masking.” Ultimately, each technique has been broken, resulting in a compromise of highly sensitive information and harming both individuals and institutions. In the absence of an alternative, these approaches continue to be used in the enterprise today, perpetuating the related inefficiencies, loss of data value, and vulnerabilities.
Differential privacy emerged 15 years ago as a solution to these issues. Differential privacy is a rigorous, mathematical definition of data privacy. When a statistic, algorithm, or analytical procedure meets the standard of differential privacy, it means that no individual record significantly impacts the output of that analysis—the mathematics behind differential privacy ensures that the output does not contain information that can be used to draw conclusions about any record in the underlying data set. With differential privacy, there is a mathematical bound on the amount of information released by the system, also known as an information theoretic guarantee. The beauty of the approach is that the sophistication level of an adversary, the amount of computational resources they use, or how much outside information they have access to does not matter; when differential privacy is satisfied, the information required to draw conclusions about individuals’ private information is just not sufficient.
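For reference, the formal guarantee described above can be stated in one line. The following is the standard ε-differential privacy definition from the literature (Dwork et al., 2006); the symbols M (the randomized mechanism), D and D' (datasets differing in one record), and S (any set of outputs) are the conventional ones and are not specific to LeapYear.

```latex
% \varepsilon-differential privacy: for every pair of datasets D, D' differing
% in a single record, and every set S of possible outputs of mechanism M,
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon}\,\Pr[\,M(D') \in S\,].
```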
Histograms are commonly used to visualize the distribution of data. A differentially private histogram is composed of differentially private counts in bins of equal width. LeapYear’s histogram algorithm is further optimized to ensure higher accuracy for small counts by incorporating wavelet transforms.
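As a minimal sketch of the basic building block (not LeapYear's production algorithm, which additionally incorporates wavelet-based optimizations as described below), the following assumes numeric data, fixed equal-width bins, the Laplace mechanism, and the add/remove-one-record notion of neighboring datasets; all function and variable names are illustrative.

```python
import numpy as np

def dp_histogram(values, bin_edges, epsilon, rng=None):
    """Differentially private histogram via the Laplace mechanism.

    Each record falls into exactly one bin, so under add/remove-one-record
    neighbors the L1 sensitivity of the vector of counts is 1, and Laplace
    noise with scale 1/epsilon per bin suffices.
    """
    rng = rng or np.random.default_rng()
    counts, _ = np.histogram(values, bins=bin_edges)
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=counts.shape)
    return counts + noise

# Example: ages binned into decades, epsilon = 0.5 (illustrative values).
ages = np.array([23, 34, 45, 37, 29, 41, 52, 33])
print(dp_histogram(ages, bin_edges=np.arange(0, 101, 10), epsilon=0.5))
```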
Given a set D of tuples defined on a domain Omega, we study differentially private algorithms for constructing a histogram over Omega to approximate the tuple distribution in D. Existing solutions for the problem mostly adopt a hierarchical decomposition approach, which recursively splits Omega into sub-domains and computes a noisy tuple count for each sub-domain, until all noisy counts are below a certain threshold. This approach, however, requires that we (i) impose a limit h on the recursion depth in the splitting of Omega and (ii) set the noise in each count to be proportional to h. The choice of h is a serious dilemma: a small h makes the resulting histogram too coarse-grained, while a large h leads to excessive noise in the tuple counts used in deciding whether sub-domains should be split. Furthermore, h cannot be directly tuned based on D; otherwise, the choice of h itself reveals private information and violates differential privacy. To remedy the deficiency of existing solutions, we present PrivTree, a histogram construction algorithm that adopts hierarchical decomposition but completely eliminates the dependency on a pre-defined h. The core of PrivTree is a novel mechanism that (i) exploits a new analysis on the Laplace distribution and (ii) enables us to use only a constant amount of noise in deciding whether a sub-domain should be split, without worrying about the recursion depth of splitting. We demonstrate the application of PrivTree in modelling spatial data, and show that it can be extended to handle sequence data (where the decision in sub-domain splitting is not based on tuple counts but a more sophisticated measure). Our experiments on a variety of real datasets show that PrivTree considerably outperforms the states of the art in terms of data utility.
In recent years, many approaches to differentially privately publish histograms have been proposed. Several approaches rely on constructing tree structures in order to decrease the error when answering large range queries. In this paper, we examine the factors affecting the accuracy of hierarchical approaches by studying the mean squared error (MSE) when answering range queries. We start with one-dimensional histograms, and analyze how the MSE changes with different branching factors, after employing constrained inference, and with different methods to allocate the privacy budget among hierarchy levels. Our analysis and experimental results show that combining the choice of a good branching factor with constrained inference outperforms the current state of the art. Finally, we extend our analysis to multi-dimensional histograms. We show that the benefits from employing hierarchical methods beyond a single dimension are significantly diminished, and when there are 3 or more dimensions, it is almost always better to use the Flat method instead of a hierarchy.
Privacy-preserving data publishing has attracted considerable research interest in recent years. Among the existing solutions, ε-differential privacy provides the strongest privacy guarantee. Existing data publishing methods that achieve ε-differential privacy, however, offer little data utility. In particular, if the output data set is used to answer count queries, the noise in the query answers can be proportional to the number of tuples in the data, which renders the results useless. In this paper, we develop a data publishing technique that ensures ε-differential privacy while providing accurate answers for range-count queries, i.e., count queries where the predicate on each attribute is a range. The core of our solution is a framework that applies wavelet transforms on the data before adding noise to it. We present instantiations of the proposed framework for both ordinal and nominal data, and we provide a theoretical analysis on their privacy and utility guarantees. In an extensive experimental study on both real and synthetic data, we show the effectiveness and efficiency of our solution.
Machine learning (ML) models, ranging from generalized linear models (GLMs) to random forests, are commonly used for prediction tasks. To control privacy leakage from ML models, which are known to easily memorize their inputs, and to prevent the likes of model inversion attacks, LeapYear has implemented a number of differentially private techniques. Examples include empirical risk minimization and sufficient statistics perturbation for GLMs, and sufficient statistics invariance for Random Forest.
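To make sufficient statistics perturbation concrete, here is a minimal sketch for linear regression in the spirit of the SSP/AdaSSP line of work cited below; the norm bounds, budget split, ridge term, and noise calibration constants are simplified placeholders rather than LeapYear's actual parameters, and a production implementation would calibrate the sensitivities exactly.

```python
import numpy as np

def ssp_linear_regression(X, y, epsilon, x_bound=1.0, y_bound=1.0,
                          ridge=1e-3, rng=None):
    """Sufficient statistics perturbation (SSP) for linear regression.

    Assumes each row of X has L2 norm <= x_bound and |y| <= y_bound, so the
    sensitivities of X^T X and X^T y are bounded. The exact sensitivity
    constants depend on the norm bounds and neighboring-dataset convention;
    they are simplified here for illustration.
    """
    rng = rng or np.random.default_rng()
    n, d = X.shape
    eps_half = epsilon / 2.0  # split the budget across the two statistics

    xtx = X.T @ X
    xty = X.T @ y

    # Perturb the sufficient statistics, keeping the noisy X^T X symmetric.
    upper = rng.laplace(scale=x_bound**2 / eps_half, size=(d, d))
    noise_xtx = np.triu(upper) + np.triu(upper, 1).T
    noise_xty = rng.laplace(scale=x_bound * y_bound / eps_half, size=d)

    # Solve the (regularized) normal equations with the noisy statistics.
    theta = np.linalg.solve(xtx + noise_xtx + ridge * n * np.eye(d),
                            xty + noise_xty)
    return theta
```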
Privacy-preserving machine learning algorithms are crucial for the increasingly common setting in which personal data, such as medical or financial records, are analyzed. We provide general techniques to produce privacy-preserving approximations of classifiers learned via (regularized) empirical risk minimization (ERM). These algorithms are private under the ε-differential privacy definition due to Dwork et al. (2006). First we apply the output perturbation ideas of Dwork et al. (2006) to ERM classification. Then we propose a new method, objective perturbation, for privacy-preserving machine learning algorithm design. This method entails perturbing the objective function before optimizing over classifiers. If the loss and regularizer satisfy certain convexity and differentiability criteria, we prove theoretical results showing that our algorithms preserve privacy, and provide generalization bounds for linear and nonlinear kernels. We further present a privacy-preserving technique for tuning the parameters in general machine learning algorithms, thereby providing end-to-end privacy guarantees for the training process. We apply these results to produce privacy-preserving analogues of regularized logistic regression and support vector machines. We obtain encouraging results from evaluating their performance on real demographic and benchmark data sets. Our results show that both theoretically and empirically, objective perturbation is superior to the previous state-of-the-art, output perturbation, in managing the inherent tradeoff between privacy and learning performance.
One of the most effective algorithms for differentially private learning and optimization is objective perturbation. This technique augments a given optimization problem (e.g. deriving from an ERM problem) with a random linear term, and then exactly solves it. However, to date, analyses of this approach crucially rely on the convexity and smoothness of the objective function, limiting its generality. We give two algorithms that extend this approach substantially. The first algorithm requires nothing except boundedness of the loss function, and operates over a discrete domain. Its privacy and accuracy guarantees hold even without assuming convexity. This gives an oracle-efficient optimization algorithm over arbitrary discrete domains that is comparable in its generality to the exponential mechanism. The second algorithm operates over a continuous domain and requires only that the loss function be bounded and Lipschitz in its continuous parameter. Its privacy analysis does not require convexity. Its accuracy analysis does require convexity, but does not require second order conditions like smoothness. Even without convexity, this algorithm can be generically used as an oracle-efficient optimization algorithm, with accuracy evaluated empirically. We complement our theoretical results with an empirical evaluation of the non-convex case, in which we use an integer program solver as our optimization oracle. We find that for the problem of learning linear classifiers, directly optimizing for 0/1 loss using our approach can out-perform the more standard approach of privately optimizing a convex-surrogate loss function on the Adult dataset.
We revisit the problem of linear regression under a differential privacy constraint. By consolidating existing pieces in the literature, we clarify the correct dependence of the feature, label and coefficient domains in the optimization error and estimation error, hence revealing the delicate price of differential privacy in statistical estimation and statistical learning. Moreover, we propose simple modifications of two existing DP algorithms: (a) posterior sampling, (b) sufficient statistics perturbation, and show that they can be upgraded into adaptive algorithms that are able to exploit data-dependent quantities and behave nearly optimally for every instance. Extensive experiments are conducted on both simulated data and real data, which conclude that both AdaOPS and AdaSSP outperform the existing techniques on nearly all 36 data sets that we test on.
Privacy-preserving data mining has become an active focus of the research community in the domains where data are sensitive and personal in nature. For example, highly sensitive digital repositories of medical or financial records offer enormous values for risk prediction and decision making. However, prediction models derived from such repositories should maintain strict privacy of individuals. We propose a novel random forest algorithm under the framework of differential privacy. Unlike previous works that strictly follow differential privacy and keep the complete data distribution approximately invariant to change in one data instance, we only keep the necessary statistics (e.g. variance of the estimate) invariant. This relaxation results in significantly higher utility. To realize our approach, we propose a novel differentially private decision tree induction algorithm and use it to create an ensemble of decision trees. We also propose feasible adversary models to infer about the attribute and class label of unknown data in presence of the knowledge of all other data. Under these adversary models, we derive bounds on the maximum number of trees that are allowed in the ensemble while maintaining privacy. We focus on the binary classification problem and demonstrate our approach on four real-world datasets. Compared to the existing privacy preserving approaches we achieve significantly higher utility.
Fundamental methods in statistics, such as comparisons across groups or reducing dimensionality, can be made differentially private. As an example, LeapYear’s implementation of principal components analysis (PCA) involves the addition of calibrated Gaussian noise to the covariance matrix before it is diagonalized.
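A minimal sketch of that approach (in the spirit of the Analyze Gauss algorithm whose abstract appears below) might look as follows; the unit row-norm assumption and the noise_sigma parameter are illustrative, and calibrating noise_sigma to a target (ε, δ) from the matrix sensitivity is omitted.

```python
import numpy as np

def dp_pca(X, k, noise_sigma, rng=None):
    """Differentially private PCA by perturbing the covariance matrix.

    Assumes each row of X has L2 norm <= 1, so changing one row changes
    X^T X by a matrix of bounded norm; noise_sigma should be calibrated to
    the desired (epsilon, delta) from that sensitivity (not shown here).
    """
    rng = rng or np.random.default_rng()
    d = X.shape[1]

    cov = X.T @ X  # second-moment / unnormalized covariance matrix

    # Symmetric Gaussian noise matrix.
    upper = rng.normal(scale=noise_sigma, size=(d, d))
    noise = np.triu(upper) + np.triu(upper, 1).T

    # Diagonalize the noisy covariance and keep the top-k eigenvectors.
    eigvals, eigvecs = np.linalg.eigh(cov + noise)
    top_k = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return top_k  # d x k matrix of approximate principal components
```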
Hypothesis tests are a crucial statistical tool for data mining and are the workhorse of scientific research in many fields. Here we study differentially private tests of independence between a categorical and a continuous variable. We take as our starting point traditional nonparametric tests, which require no distributional assumption (e.g., normality) about the data distribution. We present private analogues of the Kruskal-Wallis, Mann-Whitney, and Wilcoxon signed-rank tests, as well as the parametric one-sample t-test. These tests use novel test statistics developed specifically for the private setting. We compare our tests to prior work, both on parametric and nonparametric tests. We find that in all cases our new nonparametric tests achieve large improvements in statistical power, even when the assumptions of parametric tests are met.
The simplest and most widely applied method for guaranteeing differential privacy is to add instance-independent noise to a statistic of interest that is scaled to its global sensitivity.
However, global sensitivity is a worst-case notion that is often too conservative for realized dataset instances. We provide methods for scaling noise in an instance-dependent way and demonstrate that they provide greater accuracy under average-case distributional assumptions.
Specifically, we consider the basic problem of privately estimating the mean of a real distribution from i.i.d. samples. The standard empirical mean estimator can have arbitrarily high global sensitivity. We propose the trimmed mean estimator, which interpolates between the mean and the median, as a way of attaining much lower sensitivity on average while losing very little in terms of statistical accuracy.
To privately estimate the trimmed mean, we revisit the smooth sensitivity framework of Nissim, Raskhodnikova, and Smith (STOC 2007), which provides a framework for using instance-dependent sensitivity. We propose three new additive noise distributions which provide concentrated differential privacy when scaled to smooth sensitivity. We provide theoretical and experimental evidence showing that our noise distributions compare favorably to others in the literature, in particular, when applied to the mean estimation problem.
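For contrast with the instance-dependent approach described in the abstract above, the worst-case baseline it improves on is easy to state for a mean over values known to lie in a bounded interval. The sketch below shows only that global-sensitivity baseline (under the replace-one-record notion of neighbors), with illustrative names and bounds; it is not the trimmed-mean or smooth-sensitivity estimator itself.

```python
import numpy as np

def laplace_mean(values, lower, upper, epsilon, rng=None):
    """Mean with Laplace noise scaled to its global sensitivity.

    For n values clamped to [lower, upper], replacing one record changes the
    mean by at most (upper - lower) / n, so Laplace noise with scale
    (upper - lower) / (n * epsilon) gives epsilon-DP. The trimmed-mean
    approach in the paper above instead scales noise to an instance-dependent
    (smooth) sensitivity, which is usually much smaller on well-behaved data.
    """
    rng = rng or np.random.default_rng()
    clipped = np.clip(values, lower, upper)
    n = len(clipped)
    sensitivity = (upper - lower) / n
    return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)
```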
Linear regression is one of the most prevalent techniques in machine learning; however, it is also common to use linear regression for its explanatory capabilities rather than label prediction. Ordinary Least Squares (OLS) is often used in statistics to establish a correlation between an attribute (e.g. gender) and a label (e.g. income) in the presence of other (potentially correlated) features. OLS assumes a particular model that randomly generates the data, and derives t-values representing the likelihood of each real value to be the true correlation. Using t-values, OLS can release a confidence interval, which is an interval on the reals that is likely to contain the true correlation; and when this interval does not intersect the origin, we can reject the null hypothesis as it is likely that the true correlation is non-zero. Our work aims at achieving similar guarantees on data under differentially private estimators. First, we show that for well-spread data, the Gaussian Johnson-Lindenstrauss Transform (JLT) gives a very good approximation of t-values; secondly, when JLT approximates Ridge regression (linear regression with l2-regularization) we derive, under certain conditions, confidence intervals using the projected data; lastly, we derive, under different conditions, confidence intervals for the "Analyze Gauss" algorithm (Dwork et al., 2014).
We consider the problem of privately releasing a low dimensional approximation to a set of data records, represented as a matrix A in which each row corresponds to an individual and each column to an attribute. Our goal is to compute a subspace that captures the covariance of A as much as possible, classically known as principal component analysis (PCA). We assume that each row of A has L2 norm bounded by one, and the privacy guarantee is defined with respect to addition or removal of any single row. We show that the well-known, but misnamed, randomized response algorithm, with properly tuned parameters, provides nearly optimal additive quality gap compared to the best possible singular subspace of A. We further show that when AᵀA has a large eigenvalue gap – a reason often cited for PCA – the quality improves significantly. Optimality (up to logarithmic factors) is proved using techniques inspired by the recent work of Bun, Ullman, and Vadhan on applying Tardos's fingerprinting codes to the construction of hard instances for private mechanisms for 1-way marginal queries. Along the way we define a list culling game which may be of independent interest. By combining the randomized response mechanism with the well-known Follow the Perturbed Leader algorithm of Kalai and Vempala we obtain a private online algorithm with nearly optimal regret. The regret of our algorithm even outperforms all the previously known online non-private algorithms of this type. We achieve this better bound by, satisfyingly, borrowing insights and tools from differential privacy!
Differential privacy is a rigorous standard of privacy that can protect against any arbitrarily sophisticated privacy attack. There are several relaxations of strict differential privacy that still offer strong, quantifiable protection but are better suited to real-world use cases. These relaxations allow queries to be answered with higher accuracy than under pure DP.
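As one concrete example of such a relaxation, Rényi differential privacy (introduced in the second abstract below) bounds a Rényi divergence of order α between the output distributions on neighboring datasets; the statement below follows the standard form in the literature.

```latex
% (\alpha, \epsilon)-Renyi differential privacy: for all neighboring D, D',
D_{\alpha}\bigl(M(D) \,\|\, M(D')\bigr)
  \;=\; \frac{1}{\alpha-1}\,
        \log \mathbb{E}_{x \sim M(D')}\!
        \left[\left(\frac{\Pr[M(D)=x]}{\Pr[M(D')=x]}\right)^{\!\alpha}\right]
  \;\le\; \epsilon .
```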
"Concentrated differential privacy" was recently introduced by Dwork and Rothblum as a relaxation of differential privacy, which permits sharper analyses of many privacy-preserving computations. ...We present an alternative formulation of the concept of concentrated differential privacy in terms of the Rényi divergence between the distributions obtained by running an algorithm on neighboring inputs. With this reformulation in hand, we prove sharper quantitative results, establish lower bounds, and raise a few new questions. We also unify this approach with approximate differential privacy by giving an appropriate definition of "approximate concentrated differential privacy".
We propose a natural relaxation of differential privacy based on the Rényi divergence. Closely related notions have appeared in several recent papers that analyzed composition of differentially private mechanisms. We argue that this useful analytical tool can be used as a privacy definition, compactly and accurately representing guarantees on the tails of the privacy loss. We demonstrate that the new definition shares many important properties with the standard definition of differential privacy, while additionally allowing tighter analysis of composite heterogeneous mechanisms.
Differential privacy has seen remarkable success as a rigorous and practical formalization of data privacy in the past decade. This privacy definition and its divergence based relaxations, however, have several acknowledged weaknesses, either in handling composition of private algorithms or in analyzing important primitives like privacy amplification by subsampling. Inspired by the hypothesis testing formulation of privacy, this paper proposes a new relaxation, which we term "f-differential privacy" (f-DP). This notion of privacy has a number of appealing properties and, in particular, avoids difficulties associated with divergence based relaxations. First, f-DP preserves the hypothesis testing interpretation. In addition, f-DP allows for lossless reasoning about composition in an algebraic fashion. Moreover, we provide a powerful technique to import existing results proven for original DP to f-DP and, as an application, obtain a simple subsampling theorem for f-DP.
In addition to the above findings, we introduce a canonical single-parameter family of privacy notions within the f-DP class that is referred to as "Gaussian differential privacy" (GDP), defined based on testing two shifted Gaussians. GDP is focal among the f-DP class because of a central limit theorem we prove. More precisely, the privacy guarantees of any hypothesis testing based definition of privacy (including original DP) converge to GDP in the limit under composition. The CLT also yields a computationally inexpensive tool for analyzing the exact composition of private algorithms.
Taken together, this collection of attractive properties renders f-DP a mathematically coherent, analytically tractable, and versatile framework for private data analysis. Finally, we demonstrate the use of the tools we develop by giving an improved privacy analysis of noisy stochastic gradient descent.
This paper concerns the challenge of protecting confidentiality while making statistically useful data and analytical outputs available for research and policy analysis. In this context, the confidentiality protection measure known as differential privacy is an attractive methodology because of its clear definition and the strong guarantees that it promises. However, concerns about differential privacy include the possibility that in some situations the guarantees may be so strong that statistical usefulness is unacceptably low. In this paper, we propose an example of a relaxation of differential privacy that allows confidentiality protection to be balanced against statistical usefulness. We give a practical illustration of the relaxation implemented as Laplace noise addition for confidentiality protection of contingency and magnitude tables. Tables are amongst the most common types of output produced by national statistical agencies, and these outputs are often protected by noise addition.
Shortly after it was first introduced in 2006, differential privacy became the flagship data privacy definition. Since then, numerous variants and extensions were proposed to adapt it to different scenarios and attacker models. In this work, we propose a systematic taxonomy of these variants and extensions. We list all data privacy definitions based on differential privacy, and partition them into seven categories, depending on which aspect of the original definition is modified.
These categories act like dimensions: variants from the same category cannot be combined, but variants from different categories can be combined to form new definitions. We also establish a partial ordering of relative strength between these notions by summarizing existing results. Furthermore, we list which of these definitions satisfy some desirable properties, like composition, post-processing, and convexity by either providing a novel proof or collecting existing ones.
There always exists a tradeoff between the privacy of data and the accuracy (utility) of results, and with differential privacy, this tradeoff can be quantified. LeapYear has implemented, and improved upon, several techniques from the literature that strictly improve the utility of results while maintaining the same level of privacy guarantees.
We show that it is possible to significantly improve the accuracy of a general class of histogram queries while satisfying differential privacy. Our approach carefully chooses a set of queries to evaluate, and then exploits consistency constraints that should hold over the noisy output. In a post-processing phase, we compute the consistent input most likely to have produced the noisy output. The final output is differentially-private and consistent, but in addition, it is often much more accurate. We show, both theoretically and experimentally, that these techniques can be used for estimating the degree sequence of a graph very precisely, and for computing a histogram that can support arbitrary range queries accurately.
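A toy instance of the consistency idea in the abstract above: given independent noisy estimates of a parent count and its child counts (all with equal noise variance), the least-squares solution under the constraint parent = sum(children) redistributes the discrepancy evenly. The numbers and function names below are illustrative only.

```python
import numpy as np

def consistent_two_level(noisy_parent, noisy_children):
    """Least-squares post-processing for a one-level count hierarchy.

    Finds counts minimizing the squared distance to the noisy estimates
    subject to parent == sum(children), assuming equal noise variances.
    Post-processing never weakens the privacy guarantee of the noisy inputs.
    """
    noisy_children = np.asarray(noisy_children, dtype=float)
    k = len(noisy_children)
    # Discrepancy between the noisy parent and the sum of the noisy children,
    # spread evenly across the k children (and implicitly the parent).
    gap = noisy_parent - noisy_children.sum()
    children = noisy_children + gap / (k + 1)
    parent = children.sum()
    return parent, children

# Example: noisy total of 103, noisy sub-counts of 40 and 55.
print(consistent_two_level(103.0, [40.0, 55.0]))
```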
Differential privacy has become the dominant standard in the research community for strong privacy protection. There has been a flood of research into query answering algorithms that meet this standard. Algorithms are becoming increasingly complex, and in particular, the performance of many emerging algorithms is data dependent, meaning the distribution of the noise added to query answers may change depending on the input data. Theoretical analysis typically only considers the worst case, making empirical study of average case performance increasingly important. In this paper we propose a set of evaluation principles which we argue are essential for sound evaluation. Based on these principles we propose DPBench, a novel evaluation framework for standardized evaluation of privacy algorithms. We then apply our benchmark to evaluate algorithms for answering 1- and 2-dimensional range queries. The result is a thorough empirical study of 15 published algorithms on a total of 27 datasets that offers new insights into algorithm behavior (in particular the influence of dataset scale and shape) and a more complete characterization of the state of the art. Our methodology is able to resolve inconsistencies in prior empirical studies and place algorithm performance in context through comparison to simple baselines. Finally, we pose open research questions which we hope will guide future algorithm design.
Traditional approaches to differential privacy assume a fixed privacy requirement ϵ for a computation, and attempt to maximize the accuracy of the computation subject to the privacy constraint. As differential privacy is increasingly deployed in practical settings, it may often be that there is instead a fixed accuracy requirement for a given computation and the data analyst would like to maximize the privacy of the computation subject to the accuracy constraint. This raises the question of how to find and run a maximally private empirical risk minimizer subject to a given accuracy requirement. We propose a general "noise reduction" framework that can apply to a variety of private empirical risk minimization (ERM) algorithms, using them to "search" the space of privacy levels to find the empirically strongest one that meets the accuracy constraint, and incurring only logarithmic overhead in the number of privacy levels searched. The privacy analysis of our algorithm leads naturally to a version of differential privacy where the privacy parameters are dependent on the data, which we term ex-post privacy, and which is related to the recently introduced notion of privacy odometers. We also give an ex-post privacy analysis of the classical AboveThreshold privacy tool, modifying it to allow for queries chosen depending on the database. Finally, we apply our approach to two common objective functions, regularized linear and logistic regression, and empirically compare our noise reduction methods to inverting the theoretical utility guarantees of standard private ERM algorithms and a stronger, empirical baseline based on binary search.
In the tradeoff between privacy and utility (accuracy), trading privacy for more utility usually manifests as a reduction in the number of queries that an analyst is allowed to run. A good mechanism for differential privacy will allow analysts to run many queries, each of which meets some utility standard. LeapYear has implemented several techniques to increase the number of queries an analyst can run, without compromising privacy or utility guarantees.
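The baseline these techniques improve upon is basic sequential composition: running k mechanisms that are each ε_i-differentially private over the same data yields a mechanism that is (Σ ε_i)-differentially private, so a fixed total budget directly limits how many queries can be answered. In standard notation (nothing LeapYear-specific):

```latex
% Basic composition: if M_1, \dots, M_k are \epsilon_1, \dots, \epsilon_k-DP
% and M(D) = (M_1(D), \dots, M_k(D)), then for all neighboring D, D' and all
% output sets S,
\Pr[\,M(D) \in S\,] \;\le\;
  \exp\!\left(\sum_{i=1}^{k}\epsilon_i\right)\Pr[\,M(D') \in S\,],
% i.e. M is (\sum_{i=1}^{k}\epsilon_i)-differentially private.
```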
Composition is one of the most important properties of differential privacy (DP), as it allows algorithm designers to build complex private algorithms from DP primitives. We consider precise composition bounds of the overall privacy loss for exponential mechanisms, one of the fundamental classes of mechanisms in DP. We give explicit formulations of the optimal privacy loss for both the adaptive and non-adaptive settings. For the non-adaptive setting in which each mechanism has the same privacy parameter, we give an efficiently computable formulation of the optimal privacy loss. Furthermore, we show that there is a difference in the privacy loss when the exponential mechanism is chosen adaptively versus non-adaptively. To our knowledge, it was previously unknown whether such a gap existed for any DP mechanisms with fixed privacy parameters, and we demonstrate the gap for a widely used class of mechanism in a natural setting. We then improve upon the best previously known upper bounds for adaptive composition of exponential mechanisms with efficiently computable formulations and show the improvement.
Noisy Max and Sparse Vector are selection algorithms for differential privacy and serve as building blocks for more complex algorithms. In this paper we show that both algorithms can release additional information for free (i.e., at no additional privacy cost). Noisy Max is used to return the approximate maximizer among a set of queries. We show that it can also release for free the noisy gap between the approximate maximizer and runner-up. This free information can improve the accuracy of certain subsequent counting queries by up to 50%. Sparse Vector is used to return a set of queries that are approximately larger than a fixed threshold. We show that it can adaptively control its privacy budget (use less budget for queries that are likely to be much larger than the threshold) in order to increase the amount of queries it can process. These results follow from a careful privacy analysis.
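A small sketch of the mechanics described above, returning both the index of the approximate maximizer and the noisy gap to the runner-up; the Laplace noise scale used here is an illustrative, conservative choice for sensitivity-1 queries rather than the exact calibration and privacy proof given in the paper.

```python
import numpy as np

def noisy_max_with_gap(query_answers, epsilon, sensitivity=1.0, rng=None):
    """Report Noisy Max, also returning the noisy gap to the runner-up.

    Adds independent Laplace noise to each query answer and returns the
    index of the largest noisy value. The cited paper shows the gap between
    the top two noisy values can be released at no extra privacy cost; the
    noise scale below is illustrative rather than the paper's calibration.
    """
    rng = rng or np.random.default_rng()
    answers = np.asarray(query_answers, dtype=float)
    noisy = answers + rng.laplace(scale=2.0 * sensitivity / epsilon,
                                  size=answers.shape)
    order = np.argsort(noisy)[::-1]
    winner, runner_up = order[0], order[1]
    return winner, noisy[winner] - noisy[runner_up]
```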
Iterative algorithms, like gradient descent, are common tools for solving a variety of problems, such as model fitting. For this reason, there is interest in creating differentially private versions of them. However, their conversion to differentially private algorithms is often naive. For instance, a fixed number of iterations are chosen, the privacy budget is split evenly among them, and at each iteration, parameters are updated with a noisy gradient. In this paper, we show that gradient-based algorithms can be improved by a more careful allocation of privacy budget per iteration. Intuitively, at the beginning of the optimization, gradients are expected to be large, so that they do not need to be measured as accurately. However, as the parameters approach their optimal values, the gradients decrease and hence need to be measured more accurately. We add a basic line-search capability that helps the algorithm decide when more accurate gradient measurements are necessary. Our gradient descent algorithm works with the recently introduced zCDP version of differential privacy. It outperforms prior algorithms for model fitting and is competitive with the state-of-the-art for (ϵ,δ)-differential privacy, a strictly weaker definition than zCDP.
When implementing a differentially private system based on the theory in the academic literature, there are many subtleties that must be accounted for. The literature usually does not describe how to implement the research in practice, and these subtleties can sometimes completely invalidate the differential privacy guarantees, resulting in compromise of the very data the system sought to protect. In addition to working around these “gotchas”, LeapYear’s implementation of differential privacy has been audited by an independent panel of academics and researchers.
We describe a new type of vulnerability present in many implementations of differentially private mechanisms. In particular, all four publicly available general purpose systems for differentially private computations are susceptible to our attack. The vulnerability is based on irregularities of floating-point implementations of the privacy-preserving Laplacian mechanism. Unlike its mathematical abstraction, the textbook sampling procedure results in a porous distribution over double-precision numbers that allows one to breach differential privacy with just a few queries into the mechanism.
We propose a mitigating strategy and prove that it satisfies differential privacy under some mild assumptions on available implementation of floating-point arithmetic.
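The following is a deliberately simplified sketch of the two pieces discussed above: the textbook inverse-CDF Laplace sampler that the attack exploits, and a snapping-style mitigation. It omits details that a real implementation must get right (sampling the uniform variate over the full set of doubles, exact rounding, and the precise privacy accounting), so it illustrates the structure only.

```python
import math
import random

def textbook_laplace(scale, rng=random):
    # Inverse-CDF sampling of Laplace noise. Mironov (CCS 2012) shows this
    # textbook procedure produces a "porous" set of double-precision outputs
    # that can be exploited to breach differential privacy.
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def snapped_laplace(true_value, scale, bound, rng=random):
    # Simplified sketch of the snapping mitigation: clamp the true value,
    # add Laplace noise, round the result to a multiple of lam (a power of
    # two at least as large as the scale), and clamp again to [-bound, bound].
    lam = 2.0 ** math.ceil(math.log2(scale))
    clamped = max(-bound, min(bound, true_value))
    noisy = clamped + textbook_laplace(scale, rng)
    snapped = round(noisy / lam) * lam
    return max(-bound, min(bound, snapped))
```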
The Sparse Vector Technique (SVT) is a fundamental technique for satisfying differential privacy and has the unique quality that one can output some query answers without apparently paying any privacy cost. SVT has been used in both the interactive setting, where one tries to answer a sequence of queries that are not known ahead of time, and in the non-interactive setting, where all queries are known. Because of the potential savings on privacy budget, many variants for SVT have been proposed and employed in privacy-preserving data mining and publishing. However, most variants of SVT are actually not private. In this paper, we analyze these errors and identify the misunderstandings that likely contribute to them. We also propose a new version of SVT that provides better utility, and introduce an effective technique to improve the performance of SVT. These enhancements can be applied to improve utility in the interactive setting. Through both analytical and experimental comparisons, we show that, in the non-interactive setting (but not the interactive setting), the SVT technique is unnecessary, as it can be replaced by the Exponential Mechanism (EM) with better accuracy.
Differential privacy is an information theoretic constraint on algorithms and code. It provides quantification of privacy leakage and formal privacy guarantees that are currently considered the gold standard in privacy protections. In this paper we provide an initial set of "best practices" for developing differentially private platforms, techniques for unit testing that are specific to differential privacy, guidelines for checking if differential privacy is being applied correctly in an application, and recommendations for parameter settings. The genesis of this paper was an initiative by Facebook and Social Science One to provide social science researchers with programmatic access to a URL-shares dataset. In order to maximize the utility of the data for research while protecting privacy, researchers should access the data through an interactive platform that supports differential privacy.
The intention of this paper is to provide guidelines and recommendations that can generally be re-used in a wide variety of systems. For this reason, no specific platforms will be named, except for systems whose details and theory appear in academic papers.
Relational operations such as WHERE, JOIN, and UNION require special consideration from a differential privacy solution because they can drastically change the amount of randomization required to protect privacy. Most differential privacy solutions do not track this and thus either cannot support these operations or will provide very inaccurate or non-private results to queries involving them. LeapYear supports all of these operations and correctly accounts for the privacy implications of any query that uses them.
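As a small illustration of why joins need this special handling (using pandas purely for the example; the table and column names are hypothetical), a single individual's row in one table can match many rows in another, so a simple count over the joined table no longer has sensitivity 1:

```python
import pandas as pd

# One row per customer in `customers`, but many rows per customer in `orders`.
customers = pd.DataFrame({"customer_id": [1, 2, 3]})
orders = pd.DataFrame({"customer_id": [1, 1, 1, 1, 2],
                       "amount": [5, 7, 3, 9, 4]})

joined = customers.merge(orders, on="customer_id", how="inner")

# Removing customer 1 removes 4 joined rows, not 1: the sensitivity of a
# COUNT(*) over the join grows with the maximum join multiplicity, so the
# noise (or a bound on per-individual contributions) must scale accordingly.
print(len(joined))                              # 5
print(len(joined[joined["customer_id"] != 1]))  # 1
```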
We report on the design and implementation of the Privacy Integrated Queries (PINQ) platform for privacy-preserving data analysis. PINQ provides analysts with a programming interface to unscrubbed data through a SQL-like language. At the same time, the design of PINQ's analysis language and its careful implementation provide formal guarantees of differential privacy for any and all uses of the platform. PINQ's unconditional structural guarantees require no trust placed in the expertise or diligence of the analysts, substantially broadening the scope for design and deployment of privacy-preserving data analysis, especially by non-experts.
We present an approach to differentially private computation in which one does not scale up the magnitude of noise for challenging queries, but rather scales down the contributions of challenging records. While scaling down all records uniformly is equivalent to scaling up the noise magnitude, we show that scaling records non-uniformly can result in substantially higher accuracy by bypassing the worst-case requirements of differential privacy for the noise magnitudes. This paper details the data analysis platform wPINQ, which generalizes the Privacy Integrated Query (PINQ) to weighted datasets. Using a few simple operators (including a non-uniformly scaling Join operator) wPINQ can reproduce (and improve) several recent results on graph analysis and introduce new generalizations (e.g., counting triangles with given degrees). We also show how to integrate probabilistic inference techniques to synthesize datasets respecting more complicated (and less easily interpreted) measurements.
Differential privacy enjoys increasing popularity thanks to both a precise semantics for privacy and effective enforcement mechanisms. Many tools have been proposed to spread its use and ease the task of the concerned data scientist. The most promising among them completely discharge the user of the privacy concerns by transparently taking care of the privacy budget. However, their implementation proves to be delicate, and introduces flaws by falsifying some of the theoretical assumptions made to guarantee differential privacy. Moreover, such tools rely on assumptions leading to overapproximations which artificially reduce utility. In this paper we focus on a key mechanism that tools do not support well: sampling. We demonstrate an attack on PINQ (McSherry, SIGMOD 2009), one of these tools, relying on the difference between its internal mechanics and the formal theory for the sampling operation, and we study a range of sampling methods and show how they can be correctly implemented in a system for differential privacy.
The sensitivity of a function is the maximum change of its output for a unit change of its input. In this paper we present a method for determining the sensitivity of SQL queries, seen as functions from databases to datasets, where the change is measured in the number of rows that differ. Given a query, a database schema and a number, our method constructs a formula that is satisfiable only if the sensitivity of the query is bigger than this number. Our method is composable, and can also be applied to SQL workflows. Our results can be used to calibrate the amount of noise that has to be added to the output of the query to obtain a certain level of differential privacy.
While there is growing research on differentially private machine learning algorithms like extreme gradient boosting and deep learning, it is often the case that they cannot be implemented at scale. To enable such complex modeling methods on sensitive data, LeapYear can generate synthetic data that is distributionally equivalent. Our method incorporates generative adversarial networks to produce datasets that maintain statistical relations between variables, while protecting the privacy of the training data.
When sharing data among researchers or releasing data for public use, there is a risk of exposing sensitive information of individuals in the data set. Data synthesis (DS) is a statistical disclosure limitation technique for releasing synthetic data sets with pseudo individual records. Traditional DS techniques often rely on strong assumptions of a data intruder's behaviors and background knowledge to assess disclosure risk. Differential privacy (DP) formulates a theoretical approach for a strong and robust privacy guarantee in data release without having to model intruders' behaviors. Efforts have been made aiming to incorporate the DP concept in the DS process. In this paper, we examine current Differentially Private Data Synthesis (DIPS) techniques for releasing individual-level surrogate data for the original data, compare the techniques conceptually, and evaluate the statistical utility and inferential properties of the synthetic data via each DIPS technique through extensive simulation studies. Our work sheds light on the practical feasibility and utility of the various DIPS approaches, and suggests future research directions for DIPS.