Cryptographic Techniques and the Privacy Problems They Solve
This blog post will discuss current state-of-the-art cryptographic techniques and the privacy and security problems they solve for the enterprise. The specific techniques we will cover are:
- Secure multiparty computation (SMPC)
- Fully homomorphic encryption (FHE)
- Differential privacy (DP)
Although many other technologies exist in the data privacy and security space,1 we will explore these three specifically because they (1) are among a rare class of approaches that provide rigorous guarantees and (2) are often confused due to ambiguity in terminology.
SMPC, FHE, and DP address fundamentally different use cases for enterprises and have little to no overlap. In this blog post, we will address the following questions for each technology:
- What is a simple, intuitive example?
- What does the approach protect against?
- What does the approach not protect against?
- What are implementation challenges?
Then, we will provide a table with example enterprise data protection use cases and cover which technology can address each case.
Secure multiparty computation (SMPC) addresses the concerns of multiple organizations who want to run some calculation that involves all their data without trusting each other or a central server to (1) store the data or (2) do the calculation. The assumption is that the input data is sensitive, but the output (and any information it implies about the input) is not sensitive.
Simple example: Investigating fraudulent bank accounts
Two banks each have a small list of accounts in their databases that have been flagged for fraud. They want to know if the other bank has flagged these same individuals and, if so, to obtain additional information about their activity. The banks do not want to share unnecessary information with each other, such as information about customers not already in the other bank’s database.
The banks can create tables that only contain the accounts they have flagged. Using SMPC, they can carry out an encrypted join on the tables, which allows them to see only the accounts they have in common. Any information about accounts not in common will be encrypted and not visible to the other bank. Note that SMPC will only be feasible because the size of the tables is small, the computation (join) is pre-defined, and the output of the computation (the common, flagged accounts) is not considered sensitive information.
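One way to picture the encrypted join is as a private set intersection. The sketch below is a toy, single-process simulation of a Diffie-Hellman-style intersection. It is an assumption about how such a join might be realized, not a description of any particular SMPC product. The modulus, the function names (`hash_to_group`, `new_secret`, `common_accounts`), and the account identifiers are hypothetical, and the parameters are far too small for real use.

```python
import hashlib
import secrets
from math import gcd

# Toy modulus (a Mersenne prime). Illustrative only, not a secure parameter choice.
P = 2**127 - 1


def hash_to_group(item: str) -> int:
    """Map an account identifier to an integer in [1, P - 1]."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def new_secret() -> int:
    """Pick a private exponent that is invertible mod P - 1, so blinding is a bijection."""
    while True:
        key = secrets.randbelow(P - 2) + 2
        if gcd(key, P - 1) == 1:
            return key


def blind(items, key):
    """First layer: hash each item and raise it to the owner's secret exponent."""
    return [pow(hash_to_group(item), key, P) for item in items]


def reblind(values, key):
    """Second layer: the other party raises already-blinded values to its own exponent."""
    return [pow(value, key, P) for value in values]


def common_accounts(bank_a_items, bank_b_items):
    """Simulate both banks in one process and return bank A's view of the overlap."""
    a_key, b_key = new_secret(), new_secret()
    # A sends blinded items to B; B re-blinds them and returns them in the same order.
    a_double = reblind(blind(bank_a_items, a_key), b_key)
    # B sends its blinded items to A; A re-blinds them with its own key.
    b_double = set(reblind(blind(bank_b_items, b_key), a_key))
    # Exponentiation commutes, so shared items produce identical double-blinded values.
    return [item for item, tag in zip(bank_a_items, a_double) if tag in b_double]
```

In this sketch each bank only ever sees the other bank's blinded values, so accounts outside the intersection stay hidden, while the matched accounts are revealed by design.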
What does it protect against?
For a computation that involves data from multiple parties, SMPC keeps the system executing the computation from exposing the input data. In an SMPC implementation, a server typically receives data from multiple parties to execute a computation. SMPC ensures that, even if the server is compromised, the underlying data will be encrypted (even during the computation) and therefore protected from the party compromising the server. In other words, SMPC removes the need for a trusted third party.
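A common building block behind this kind of guarantee is additive secret sharing. The post does not name a specific protocol, so the following is only a minimal sketch under that assumption: each party splits its private value into random shares, and no single compute node ever holds enough shares to recover an input. The modulus and the function names (`share`, `secure_sum`) are hypothetical.

```python
import secrets

# All arithmetic is done modulo a fixed public number (toy choice).
MODULUS = 2**61 - 1


def share(value: int, n_parties: int):
    """Split a value into n random shares that sum to the value modulo MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_values):
    """Simulate a sum in which each node only ever sees one share per input."""
    n = len(private_values)
    all_shares = [share(value, n) for value in private_values]
    # Node j receives the j-th share of every input and publishes only a partial sum.
    partial_sums = [
        sum(all_shares[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # Combining the partial sums reveals the total, and nothing else.
    return sum(partial_sums) % MODULUS
```

Any individual share is a uniformly random number, which is why compromising a single node in this sketch reveals nothing about the inputs. The final total, however, is revealed in the clear.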
What does it not protect against?
SMPC does not provide any protection against what the outputs of the computation contain or reveal. If the outputs include or imply sensitive information, SMPC does nothing to hide it. In fact, a computation run under SMPC can be written to output the entire sensitive dataset.
Example 1: As a simple example, suppose multiple banks are using SMPC to share credit card data. If someone asks, “What are the names, account numbers, and SSNs of the banks’ common customers?” then the SMPC algorithm will output names, account numbers, and SSNs of the common customers.
Example 2: As a more subtle example, suppose the banks want to train a credit default model on the combined data. The SMPC system will train an encrypted model on encrypted data. Although SMPC will ensure that an attacker will not be able to learn credit card information by compromising the model training server, it does not protect against attackers’ using the model to infer the credit card information. For most machine learning models, this is a very real threat.
As a result, SMPC is only appropriate when a business problem can be solved with predetermined calculations that the parties are confident do not output sensitive information. SMPC is not appropriate for scenarios in which:
- The end users or parties formulating the queries are not completely trusted (because the queries could be written to simply output the sensitive data).
- The use case involves dynamically querying the data, such as statistical analysis/data science use cases. Even with a limited set of allowed queries, previous attacks have shown that queries, such as summary statistics, can reconstruct entire datasets. The queries could also simply output the data.
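To make the risk from summary statistics concrete, here is a small sketch of a differencing attack: two sum queries that each look harmless, but whose difference is one person's exact value. The data and the function names (`total_balance`, `differencing_attack`) are made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical records; in practice these would sit behind a query interface.
BALANCES = {"alice": 5200, "bob": 9100, "carol": 1300}


def total_balance(records: Dict[str, int], exclude: Optional[str] = None) -> int:
    """An 'aggregate only' query: the sum of balances, optionally filtering one name out."""
    return sum(value for name, value in records.items() if name != exclude)


def differencing_attack(records: Dict[str, int], target: str) -> int:
    """Recover one individual's balance from two aggregate answers."""
    return total_balance(records) - total_balance(records, exclude=target)
```

Running the computation under SMPC would not change this outcome, because the attack uses only the outputs.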
What are the implementation challenges?
Deploying an SMPC system requires each party to use the same software configured to a network able to communicate for the purposes of answering queries. Each instance of the SMPC system will have access to each of the data owners’ data that is to be shared.
SMPC is computationally intensive and introduces several orders of magnitude of overhead. Because the computation is I/O bound (the parties spend most of their time exchanging messages over the network), adding computational resources does not help. As a result, SMPC is only applicable for running a few pre-set queries on small datasets.
In traditional encryption, data is typically encrypted at rest and in motion. However, the data needs to be decrypted for a computation to execute. Fully homomorphic encryption (FHE) is a technique that enables arbitrary computations on an encrypted dataset without ever requiring that the dataset be decrypted, even when executing the computation. FHE enforces a paradigm in which the analyst or end user has full, unrestricted access to the underlying data, but the storage/compute environment never has access to data in the clear (because it is always encrypted). FHE is useful in situations in which an organization wants to use another organization’s physical hardware (such as a cloud computing provider) but believes the cloud provider has a risk of being compromised.
Simple example: Secret bank accounts
Consider a bank with clients who have strict requirements for secrecy and confidentiality. If the bank wishes to host banking information in the cloud, it is exposed to risk if the cloud provider is compromised or subpoenaed by a government agency. With FHE, the bank can store the banking information in the cloud and ensure that only the employees of the bank can use the data for basic analytics.
What does it protect against?
FHE keeps a vulnerability in the storage or computing environment from compromising the data. For instance, in our simple example above, even if there was a bad actor at the cloud provider, or if the cloud provider was hacked, the bank’s data would still be protected. This is because the data is never decrypted in the compute environment, even when the computation is running.
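A full FHE scheme is too involved for a short listing, so the sketch below swaps in the Paillier cryptosystem, which is only additively homomorphic (it supports sums on ciphertexts, not arbitrary computation). It is meant to illustrate the idea of computing on data that stays encrypted, not to represent FHE itself. The primes are toy-sized and the function names are hypothetical.

```python
import secrets
from math import gcd

# Toy key material: two known Mersenne primes. Real keys use far larger random primes.
P, Q = 2**31 - 1, 2**61 - 1
N = P * Q                 # public key
N_SQ = N * N
PHI = (P - 1) * (Q - 1)   # private key
MU = pow(PHI, -1, N)      # modular inverse of PHI (requires Python 3.8+)


def encrypt(message: int) -> int:
    """Encrypt an integer in [0, N) under the public key N."""
    while True:
        r = secrets.randbelow(N - 1) + 1
        if gcd(r, N) == 1:
            break
    return (1 + message * N) % N_SQ * pow(r, N, N_SQ) % N_SQ


def decrypt(ciphertext: int) -> int:
    """Decrypt with the private key; only the data owner can do this."""
    return (pow(ciphertext, PHI, N_SQ) - 1) // N * MU % N


def add_encrypted(c1: int, c2: int) -> int:
    """Run by the untrusted server: multiplying ciphertexts adds the plaintexts."""
    return c1 * c2 % N_SQ
```

Here the server calls only `add_encrypted` and never sees a plaintext or the private key. The data owner decrypts the result, which also means the owner (and anyone holding the key) sees the output in the clear.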
What does it not protect against?
FHE does not keep anyone who is able to utilize the data (for any purpose) from compromising all the underlying data. This is because (1) the result set of a query must be decrypted by the end user for it to be usable and (2) FHE allows arbitrary computation, including queries that could exfiltrate sensitive data. Effectively, with FHE, anyone who can get value from the data has full access to all information in the dataset.
What are the implementation challenges?
FHE has several implementation challenges:
- Performance costs: Even the best encryption schemes have a massive performance cost—several orders of magnitude slower than the “unencrypted computation.” FHE is only appropriate for a few queries on very small datasets.
- Storage costs: Encrypted data files can be substantially larger than the same data in the clear. The increased financial cost of storing a substantially larger dataset may invalidate the original purpose of utilizing FHE, which typically is leveraging the cloud for more efficient processing and storage of sensitive data.
- Functional support: FHE does not directly support many standard functions, including data transformations or joins, and therefore often needs to be augmented with other protocols (which can weaken security guarantees). For example, FHE schemes natively evaluate arithmetic computations on ciphertexts; lookups and filtering operations are not directly available.
- Application security: For applications that are not custom-built for an FHE implementation to leverage data protected by FHE, the data needs to be decrypted in the application. This means that the application needs to reside in a trusted environment. In the use case of a bank that needs to store account information in the cloud without revealing private data to the cloud provider, the application servers would still need to reside on the bank’s premises. Otherwise, the data would be in the clear in the cloud while being processed by the application, thus defeating the purpose of using FHE.
Differential privacy (DP) enables statistics and machine learning on a dataset while ensuring that information about individual records in the dataset cannot be extracted or inferred.
Simple example: Data sharing for analytics
Suppose a company has sensitive information on its customers and would like a third party to be able to develop reports and predictive models on the data. To preserve data value, the company wants to ensure that all information about its customers can be used in the reports and models. However, the company is concerned that the third party could obtain access to sensitive information about its customers while querying and modeling on the data.
With DP, the third party can compute statistics and models but cannot draw conclusions about individual records. This is because DP quantifies the probability that computational results can be combined to infer properties of individual records and ensures that this probability never exceeds an acceptable risk threshold by introducing calibrated randomness into the statistical result.
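As one illustration of "calibrated randomness," the sketch below applies the Laplace mechanism to a counting query. It assumes the standard result that a count changes by at most 1 when one record is added or removed, so noise with scale `1 / epsilon` is used. The function names are hypothetical, and this is a single mechanism in isolation, not a complete DP system. Python's `random` module and floating-point noise are also not suitable for production use.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of records matching the predicate.

    A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)
```

A smaller `epsilon` means more noise and stronger protection for any single record. Each individual answer is randomized, but the noise averages out to zero, so the statistic remains useful in aggregate.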
What does it protect against?
DP protects against query results or models revealing information about individual records in the underlying dataset. It ensures not only that the outputs are free of sensitive information but also that the outputs cannot be combined with each other or with external information to compromise any information about individual records in the dataset. In other words, while DP allows all data fields to be used for analytics, it ensures that no specific data point about an individual can be learned.
What does it not protect against?
DP does not protect the compute environment from being compromised. For example, if the company in the above scenario stored its data in the clear, and if its database were compromised directly, then DP would not provide any protection. DP protects against the end user’s utilizing computed results to infer properties of individuals in the data. It does not protect against vulnerabilities in the infrastructure used to store or manage the data.
What are the implementation challenges?
Unlike FHE and SMPC, DP is computationally efficient. However, implementing DP correctly in an enterprise has several critical requirements. For example, a production DP implementation requires:
- An automated system that uses advanced information theoretic techniques to measure information exposure.
- Intelligent algorithms for automatically parametrizing differentially private algorithms (i.e., controlling the level of randomization to preserve accuracy).
- Differentially private algorithms that can support the broad range of computations in a modern analytics and data science ecosystem.
Systems that do not address these requirements are not truly differentially private, and they can be easily exploited to compromise sensitive data. In fact, most systems that attempt to achieve DP only add noise to outputs; these systems are not secure and degrade data value.
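One piece of what separates a DP system from plain noise addition is tracking cumulative exposure across queries. The sketch below uses basic sequential composition, under which the epsilons of successive queries simply add up. This is the simplest accounting rule and an assumption on our part; the techniques referred to above may be considerably more sophisticated. The class and exception names are hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a query would push total privacy loss past the allowed limit."""


class PrivacyBudget:
    """Track cumulative epsilon under basic sequential composition."""

    def __init__(self, total_epsilon: float):
        if total_epsilon <= 0:
            raise ValueError("total_epsilon must be positive")
        self.total_epsilon = total_epsilon
        self.spent = 0.0

    @property
    def remaining(self) -> float:
        return self.total_epsilon - self.spent

    def spend(self, epsilon: float) -> None:
        """Charge one query's epsilon, refusing it if the budget would be exceeded."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        # Small tolerance so floating-point rounding does not reject an exact fit.
        if self.spent + epsilon > self.total_epsilon + 1e-12:
            raise BudgetExceeded("query refused: privacy budget exhausted")
        self.spent += epsilon
```

Without a check like this, an analyst could repeat the same noisy query many times and average the noise away.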
Finally, DP should be implemented only for use cases that are statistical in nature, such as reporting, analytics, and machine learning. Use cases such as search or data retrieval are not a good fit for DP.
Table of use cases:
|Use case 1: FHE| |
|---|---|
|Description|A healthcare CIO wants to employ a third-party risk assessment tool that can take in a patient’s data and output a risk assessment. The third party has its own risk model that it does not want to share with the healthcare company, and the healthcare company does not want to share its data with the third party. The healthcare company can use FHE to encrypt the data, provide the encrypted data to the third party, allow the model to be scored on the encrypted data, and get the risk scores in the clear.|
|How is the problem solved by the technology?|FHE ensures that the data is always encrypted in the cloud, even during compute. Therefore, the cloud provider can never see the data and never holds decryption keys, even during compute.|
|Why don’t the limitations of the technology apply?|Because the dataset is only a single user/patient, the performance limitations are less relevant; the dataset on which the encrypted computation is running is just one record. Note that FHE would not be a useful approach if the third party itself needed to analyze the data, or if the third party needed to be given access to outputs from the data. By being able to analyze the data or see outputs, the third party could easily output/reconstruct the sensitive dataset.|
|Why don’t the other approaches apply?|DP: The CIO is trying to obtain predictions for individual records from the third party and does not intend to expose any information to that third party. If the CIO wanted to allow the third party to run analytics or build models on the data and to have the ability to view or utilize the results of those analyses, then DP would be required to protect the sensitive information. SMPC: Only one party with sensitive data is involved, so SMPC would needlessly complicate the solution.|
|Use case 2: SMPC| |
|---|---|
|Description|Two banks each have a small list of accounts in their databases that have been flagged for fraud. They want to know if the other bank has flagged these same individuals and, if so, to obtain additional information about those individuals’ activity. The banks do not want to share unnecessary information with each other, such as information about customers who are not already in the other bank’s database.|
|How is the problem solved by the technology?|SMPC ensures that the banks can each create a table of the flagged accounts and join it with the other bank’s table in an encrypted form. The banks will only be able to decrypt the matched results, not to decrypt information on accounts for which there are no matches.|
|Why don’t the limitations of the technology apply?|Because the dataset is small, and because the computation is a simple join, the performance limitations are less relevant. The banks must accept that all information about the fraudulent accounts will be exposed to the other bank. As this is an acceptable risk, and since the banks are not supporting interactive querying, analytics, or model development, the risk of extracting or reconstructing information about the other bank’s non-flagged customers does not apply.|
|Why don’t the other approaches apply?|DP: The banks are trying to view information about specific individuals, not trying to carry out statistical analysis or model development, so the threat model is different from the one DP addresses. DP protects against all information from individual records being disclosed while preserving statistical properties and enabling machine learning. FHE: The banks don’t want to compute information about the data, such as applying a model. More than one party is involved, and FHE does not support use cases where multiple parties need access to the data.|
|Use case 3: DP| |
|---|---|
|Description|A bank wants to share information about its clients’ transactions with analysts. The bank wants to ensure the analysts can run statistical queries and build machine learning models on the data, but it does not want to compromise any information about its clients’ transactions. The transactions are confidential because they contain PII (e.g., credit card transactions) or proprietary IP (e.g., trading data from institutional clients). The analysts could be internal, cross LOB, cross border, or third party.|
|How is the problem solved by the technology?|An advanced DP implementation solves the problem by (1) ensuring that the analysts can run a broad range of queries, including data exploration, manipulation, statistics, and model training, without revealing any information about individual records (client transactions); (2) scaling to multi-petabyte clusters; and (3) providing an architecture that allows for third party (and multi-party) data analytics with the same privacy guarantees.|
|Why don’t the limitations of the technology apply?|The computations are statistical in nature and do not require outputting individual records, and the bank is concerned with keeping the analysts from learning sensitive information from the data, not protecting that data from the storage/compute provider.|
|Why don’t the other approaches apply?|FHE: The bank wants to keep the analysts from having access to clients’ transactions. In an FHE implementation, the analysts would be able to see all sensitive information, including PII and proprietary trades. SMPC: This approach only applies when there is a predefined set of computations. In an environment in which the analyst can query the data, the entire dataset is easily inferred, resulting in a privacy breach. Aside from the fact that the security model is not relevant to the use case, both FHE and SMPC would introduce substantial latency given the data (the transaction data is large), the types of computations (machine learning is compute intensive), and the use case (interactive analytics require rapid results; analysts cannot wait days/weeks/months for a single query to evaluate).|
Differential privacy, secure multiparty computation, and fully homomorphic encryption are complex and powerful technologies in modern cryptography. Although the language used to describe these techniques as “privacy-preserving computing” is similar, they are fundamentally different approaches that address distinct threat models and business use cases. There is little to no overlap in the problems these techniques solve.
Generally, when implemented correctly and in line with the state of the art:
- Fully homomorphic encryption is designed for use cases in which the direct users of a dataset (and those who use downstream applications) are trusted with all the sensitive information, but the compute environment, such as a public cloud, is not trusted with sensitive data. Given the computational overhead, FHE is appropriate only when the data is small and the operations are not computationally intensive.
- Secure multiparty computation is designed for use cases in which multiple parties do not want to share data with each other but would like to carry out a fixed set of known, simple operations on the data, such as “who are our common customers?” The parties trust that the outputs of these operations will not be used to compromise the data. Given the computational overhead, SMPC is appropriate only when the data is small and the operations are not computationally intensive.
- Differential privacy is designed for use cases in which one or more parties would like to share data for analytics and machine learning but do not trust other parties with the record-level information in the dataset. Differential privacy allows for insights to be learned from the data without revealing the record-level information, even in an interactive query environment. Additionally, this approach scales to complex queries on massive datasets with little performance impact. Differential privacy is not appropriate for use cases in which the end user is completely trusted with access to the full dataset or in which the intent is to prevent the end user from gaining any level of insight from the dataset.
Security and privacy practitioners must carefully evaluate the nuances of their use cases and each of these approaches before deciding which technology (or combination of solutions) is most suitable for their problem.