Supporting Privacy-Preserving Analytics in modern IT environments
VP of Solutions Architecture
Jan 7, 2020
This blog post will address some considerations for IT professionals as they start enabling Privacy-Preserving Analytics (PPA) in their firms. PPA, a term increasingly used by academics and businesspeople alike, is surprisingly poorly defined; we therefore first address this issue. At LeapYear, we define PPA as a technology that allows an organization to create valuable and accurate analytics from a dataset without disclosing, directly or indirectly, sensitive information about the individuals or entities upon which these analytics are based.
The question of how to enable PPA is an interesting one given the complexity of modern IT environments. Here at LeapYear, we work with some of the world’s largest companies across the Fortune 500 to enable PPA at scale, in a variety of verticals including healthcare, financial services, technology, and other regulated industries. The rest of this post explores some of the lessons we’ve learned from our experiences.
The IT landscape at these large companies is characterized by a variety of systems and tools that must work together seamlessly. Centralized notions of identity and access management need to be taken into account, and infrastructure may be spread across both on-premises and cloud environments. Data may be housed in silos across the organization, each with its own regulatory and compliance regime and its own idiosyncrasies.
The following three areas of consideration will maximize the probability of success when implementing or enabling PPA across your organization:
What are you protecting against, and why?
Clearly define the threat model. What entities need to be protected, and what standards of protection must you adhere to? From which personas or individuals does this data need to be protected (for example, internal business analysts or external analysts)? The answers to these questions are critical to choosing a solution architecture that best meets the needs of all stakeholders involved. If the primary threat is an internal analyst potentially compromising data, the PPA tool should integrate with your existing identity solutions and provide some level of alerting and tracking of the analyst’s activity. If the primary threat is a consumer of a static deliverable (such as a report or a set of model parameters), the PPA tool should have some way of tracking the privacy risk associated with these deliverables that is independent of the process used to create them.
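To make the idea of tracking privacy risk per deliverable more concrete, here is a minimal sketch of a ledger that charges each released report or model against a cap. It assumes privacy risk can be expressed as an additive "privacy loss" number per dataset, which is one common approach but not something this post prescribes. The class names, the cap value, and the deliverable names are all hypothetical.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a release would push cumulative privacy loss past the cap."""


@dataclass
class PrivacyLedger:
    # Hypothetical cap on cumulative privacy loss for one dataset (assumption).
    cap: float
    # Maps deliverable name -> privacy loss charged for that deliverable.
    releases: dict = field(default_factory=dict)

    def spent(self) -> float:
        """Total privacy loss charged so far across all deliverables."""
        return sum(self.releases.values())

    def remaining(self) -> float:
        """Privacy loss still available before the cap is reached."""
        return self.cap - self.spent()

    def record(self, deliverable: str, loss: float) -> None:
        """Charge a deliverable against the cap, refusing if it would overflow."""
        if loss <= 0:
            raise ValueError("loss must be positive")
        if self.spent() + loss > self.cap:
            raise BudgetExceeded(f"releasing {deliverable!r} would exceed the cap")
        self.releases[deliverable] = self.releases.get(deliverable, 0.0) + loss


# Example usage with made-up deliverables and loss values.
ledger = PrivacyLedger(cap=1.0)
ledger.record("q1_report", 0.25)
ledger.record("model_params", 0.5)
print(ledger.releases, ledger.remaining())
```

Because the ledger is keyed by deliverable rather than by analyst or query session, the record of risk stays attached to the report or model itself, independent of how it was produced.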
Clearly define the value proposition. What business outcome will PPA enable in your chosen scenario? Understanding the use case and the resulting impact on relevant products and services, as well as on surrounding teams, will help contextualize any required trade-offs. Consider the case where an organization is sharing data with a third party for the purposes of interactive analytics and machine learning, as many of our customers do. Providing access to third parties may require a specific set of controls to be in place, as well as the use of infrastructure that is logically or physically separate from the internal infrastructure. Interactive analytics will likely require elastic infrastructure that can be scaled up to meet compute demands. In cases where the economic benefit to the business is high (e.g., third parties pay for access to these data and insights), there will be less pressure on the overall project budget or resource consumption.
How does PPA integrate with existing application systems?
Use a common notion of identity and authentication. An enterprise-grade implementation of PPA should be able to consume a centralized notion of identity (for example, SAML-based SSO backed by an LDAP directory) to drive both authentication and access control.
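As an illustration of identity driving access control, the sketch below maps directory group memberships to permissions inside a PPA tool. It assumes the SSO layer has already authenticated the user and handed over a simple dictionary of attributes. The group names, permission names, and the shape of the identity dictionary are hypothetical and not taken from any particular product.

```python
# Hypothetical mapping from directory groups to PPA permissions (assumption).
GROUP_PERMISSIONS = {
    "ppa-analysts": {"run_query"},
    "ppa-admins": {"run_query", "manage_datasets", "view_audit_log"},
}


def permissions_for(groups):
    """Union of the permissions granted by each group the user belongs to."""
    perms = set()
    for group in groups:
        # Unknown groups grant nothing.
        perms |= GROUP_PERMISSIONS.get(group, set())
    return perms


def is_allowed(identity, action):
    """Check an action against the groups asserted for an authenticated user."""
    return action in permissions_for(identity.get("groups", []))


# Example identity, as it might look after the SSO layer validated the login.
analyst = {"user": "jdoe", "groups": ["ppa-analysts", "finance"]}
print(is_allowed(analyst, "run_query"), is_allowed(analyst, "manage_datasets"))
```

Keeping the mapping keyed on centrally managed groups means that removing a user from a directory group revokes their PPA access without any change in the PPA tool itself.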
Decide how PPA will fit into your application and performance monitoring framework. An application is only useful if it’s available and performant. Any PPA solution should either come with its own monitoring framework or be able to easily integrate as a source into your existing monitoring solutions.
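One way a PPA service could act as a source for an existing monitoring solution is to expose a small set of availability and latency metrics. The sketch below is a hypothetical in-process collector. The class name and metric names are assumptions, and a real deployment would export these values through whatever agent or protocol your monitoring stack already uses.

```python
import time
from collections import defaultdict


class QueryMetrics:
    """Counts successful and failed calls and accumulates their wall-clock time."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.total_seconds = 0.0

    def observe(self, fn, *args, **kwargs):
        """Run fn, recording its outcome and duration; exceptions propagate."""
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            self.counts["success"] += 1
            return result
        except Exception:
            self.counts["error"] += 1
            raise
        finally:
            self.total_seconds += time.perf_counter() - start

    def snapshot(self):
        """Return a plain dict that a monitoring agent could scrape or push."""
        total = self.counts["success"] + self.counts["error"]
        return {
            "queries_total": total,
            "queries_failed": self.counts["error"],
            "avg_latency_seconds": self.total_seconds / total if total else 0.0,
        }


# Example usage with a stand-in for a real analytics query.
metrics = QueryMetrics()
metrics.observe(lambda: sum(range(1000)))
print(metrics.snapshot())
```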
How does PPA integrate into existing infrastructure?
Careful infrastructure planning is key to success. PPA should be planned inside of the existing infrastructure, regardless of whether it is hybrid, cloud, or on-premises (and should accommodate any of these deployment models!). A PPA provider should be able to work with you to help tune the application and maintain balance between performance and cost.
PPA should leverage existing data infrastructure. A mature PPA solution will easily integrate with your existing data architecture. Technologies such as JDBC/ODBC connectors and blob storage are critical and enable the federation of analytics across multiple disparate data sources and underlying datastore technologies.
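The federation idea can be sketched as running the same aggregate against each data source and combining the partial results. In the example below, in-memory SQLite databases stand in for the JDBC/ODBC-connected datastores. The table name, columns, and data are made up, and no privacy protection is applied here; the sketch shows only the federation step.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory database standing in for one siloed datastore."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE visits (region TEXT, n INTEGER)")
    conn.executemany("INSERT INTO visits VALUES (?, ?)", rows)
    return conn


def federated_totals(sources):
    """Run the same aggregate on every source and merge the partial sums."""
    totals = {}
    query = "SELECT region, SUM(n) FROM visits GROUP BY region"
    for conn in sources:
        for region, partial in conn.execute(query):
            totals[region] = totals.get(region, 0) + partial
    return totals


# Example usage with two hypothetical silos.
silo_a = make_source([("east", 3), ("west", 2)])
silo_b = make_source([("east", 4)])
print(federated_totals([silo_a, silo_b]))
```

Pushing the aggregate down to each source means only summary rows, not raw records, leave each silo before they are combined.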
When implemented in partnership with business, data, and IT stakeholders, PPA and machine learning projects will significantly benefit an organization. As a result of unlocking access to sensitive data, companies can realize new economic gains through information-based products and services. The IT team is an integral part of this journey, owning key systems and architectural details that will determine many aspects of the success of PPA implementation. If your organization is considering a project in the realm of PPA, we’d love to help you along your journey. Please reach out to us and we will be happy to share our experience.