Sharing without sharing: separating the myth from reality
Can you have the privacy cake and eat it, too? We frame this question in the context familiar to data-driven companies that have to balance legal requirements against the advantages of data collaboration.
Data is undoubtedly an increasingly valuable resource. Society, organisations, and countries are noticing more and more how the use of data shapes the services they get, the products they create, and even their security. Data regulation has emerged aiming to create guardrails around who may process personal data and for which purposes. However, these guardrails are as much a safety net as they are a constraint, and the same duality applies to data usage.
While data privacy regulation creates a safety net for individuals when it comes to the processing of their personal data, it also creates a degree of data isolation that strongly constrains data collaboration between two or more parties, especially when those collaborations cross jurisdictions. The notion that some data may only be shared with third parties under certain restrictive conditions (or not at all) is not particularly new. It has long been the standard in industries with highly sensitive data, such as healthcare or finance. With data privacy regulation coming into play, however, this notion has expanded to any personal data, sensitive or not. And with courts and authorities interpreting data privacy regulation in an increasingly restrictive manner, the bar for compliant data collaborations keeps rising.
The inherent data dilemma
Privacy expectations and regulations don't live in a vacuum. Alongside the increasing demand for user privacy, there is an increasing demand for services and products that use ever more diverse data to augment user experiences, be that more personalized products, more personalized AI recommendations, or better and more personalized healthcare. Services like these are data intensive not only in the quality of the data they require but also in its diversity and volume. It is apparent by now that these two trends are seemingly moving in opposite directions.
On one hand, there is a trend towards keeping data close and isolated for control, privacy, and security. On the other, there is the trend of data sharing: bringing together data from multiple sources to increase the accuracy, fairness, and inclusiveness of analytics and machine learning products and services.
Fortunately, this dilemma is not necessarily a binary choice; it can be a spectrum, with full access to the raw data on one end and pure noise (or encryption) on the other. Where you position yourself on that spectrum is the trade-off you make between data utility and data privacy. Everything in between is ultimately a risk calculation that the data owner has to make, and as in most risk-based scenarios, there are no silver bullets.
Privacy Enhancing Technologies: The silver bullet?
Knowing that there is no silver bullet is one thing. Stopping the search for one is another. An umbrella of technologies called privacy-enhancing technologies (PETs) has moved into the spotlight from the (until recently) mostly theoretical world. PETs propose a solution to the dilemma with a paradox: “share data without sharing it” or "collaborate on data without moving it". The fact that this sounds like a privacy silver bullet has driven their popularity. However, when confronted with real-world applications, users of PETs invariably discover that a proper risk assessment of the data collaboration is still necessary.
This article attempts to address the difficult problem of how to think about the effect of PETs when assessing the risk of data collaboration. After all, the umbrella of PETs encompasses a wide variety of technologies that use different approaches to attempt to solve more than one type of privacy problem.
Wait, there is more than one type of privacy?
The central idea of privacy is "I want to make sure personal data is not disclosed to someone I don't trust with it". Phrased that way, privacy seems like a straightforward problem. Complexity arises, however, when you start picking apart the terms "personal data" and "trust". Specifically, we separate "who do I trust, and with what?" from "what constitutes personal data?".
A typical scenario that illustrates both sets of challenges involves two parties who want to extract insights (from simple queries to advanced machine learning and AI) from their combined data, but who do not trust each other enough to fully reveal that data to one another, or whom data privacy regulation does not allow to do so.
To truly assess and demonstrate the help that PETs provide, we separate privacy concerns into two categories. We believe this helps clearly show the benefits and shortcomings of each technology, and demystifies any steps that seem “magic”.
The first question to ask in any given data collaboration is "Who do I have to trust with my data?". The precise answer is "it depends", but what we see most often is that the collaborating parties have to distribute trust across three groups:
- The collaboration analyst(s),
- A third-party vendor who offers a platform for that collaboration (more often than not in the form of a software "black box"),
- The collaborating parties' and/or their vendor's infrastructure provider (more often than not, a cloud provider).
Essentially, the groups above can be separated into two types of actors: “authorized users” who are supposed to get some information out of the data, and everyone else.
When people speak colloquially about privacy, keeping “everyone else” out is often what they mean. This is essentially the traditional concept of data security, so we will call this type of privacy “input security” to highlight that relationship.
Input security or “How to keep strangers out of my data”
Input security protects against unauthorised access, so that data can only be accessed by those who are supposed to access it. In the example above, the third-party vendor and/or their cloud provider have nothing to do with the actual collaboration and should have no access to the data at all. And even though a lot of marketing has been built around the security and privacy of vendor platforms and infrastructure providers, the truth remains that, with standard approaches, data sits unencrypted and vulnerable during processing. Standard encryption at rest and in transit is therefore not enough to protect data from prying eyes when the processing happens in a location outside my control.
The PETs in this category ensure that data analysis happens only on encrypted data. This means that no unauthorised party can see the data, even if the computation runs on a machine they control. Additionally, they give best-of-breed vendors the tools to prove that they don’t (and can’t) interfere with that data, even if it is software they manage. The technologies in this bucket are Confidential Computing (CC), Secure Multi-Party Computation (SMPC), and Fully Homomorphic Encryption (FHE).
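To make the “compute without seeing” idea a bit more concrete, here is a minimal, illustrative sketch of additive secret sharing, the basic building block behind SMPC. The party names, values, and modulus are made up for illustration; real SMPC protocols add secure channels, integrity checks, and support for far richer computations, so treat this as a toy rather than an implementation of any particular product.

```python
# Toy sketch of the "compute on data you cannot see" idea behind
# input-security PETs, using additive secret sharing (the core trick in SMPC).
# All names and numbers here are illustrative assumptions.
import random

MODULUS = 2**61 - 1  # arithmetic happens modulo a large prime

def share(value, n_parties=2):
    """Split a secret into random-looking shares that sum back to it."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

# Two organisations each hold a count they do not want to reveal.
count_a, count_b = 1200, 850

# Each party splits its input, keeping one share and sending the other away.
a1, a2 = share(count_a)
b1, b2 = share(count_b)

# Each compute node only ever sees one random-looking share per input...
partial_1 = (a1 + b1) % MODULUS
partial_2 = (a2 + b2) % MODULUS

# ...yet recombining the partial results yields the correct joint total.
joint_total = (partial_1 + partial_2) % MODULUS
assert joint_total == count_a + count_b
print(joint_total)  # 2050, computed without any node seeing a raw input
```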
If you are interested in learning more about this, we recently co-authored a report about this problem/solution pair with UBS, Nordea, Caixabank, and Intel.
Output privacy or “How to prevent analysts from learning personal data”
In some ways, the problem of input security is relatively simple: some parties are not supposed to learn anything, so I have to keep them out completely. It is a hard problem, but a clearly defined one. Output privacy is not like that. In a collaboration, while I don’t trust my counter-party’s analysts with everything, I still expect them to learn something from my data; just not the things I don’t want them to. Unpacking this “threat model” makes it obvious that quite a few assumptions need to be made, from what data I am trying to protect to how sophisticated my “attacker” might be. Ultimately these decisions define the risk I am willing to take, keeping in mind that the more I anonymize the data, the more information I remove, eventually turning it into pure noise. There is a common saying: the most anonymous data is no data.
It is becoming obvious that a one-size-fits-all solution will be hard to find for this problem. Indeed, PETs that deal with output privacy are most of the time based on statistical methods that minimize the risk of exposing sensitive information to an (overly) curious or even malicious analyst: techniques that range from simply removing sensitive columns, to enforced aggregation, to parametrized noise added to the output. At the end of the day, however, no technology is going to substitute for human risk thresholding. Some PETs that help make that decision easier and/or more efficient are Synthetic Data, Differential Privacy, and Federated Learning. If you want to learn more about how these technologies actually work, feel free to contact us or check the previous blog of this series on AI-generated Synthetic Data.
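To give a flavour of the “parametrized noise on the output” idea, here is a minimal sketch in the spirit of differential privacy: a counting query whose exact answer is hidden behind calibrated Laplace noise. The records, the predicate, and the epsilon value are all made up for illustration, and a real deployment would also have to track a privacy budget across repeated queries.

```python
# Toy sketch of one output-privacy technique: add Laplace noise, calibrated
# to the query's sensitivity, before releasing an aggregate to the analyst.
# Records, predicate, and epsilon below are illustrative assumptions.
import numpy as np

def noisy_count(records, predicate, epsilon=1.0):
    """Return a count whose exact value is masked by Laplace noise.

    A counting query changes by at most 1 when a single person is added
    or removed (sensitivity = 1), so noise drawn from Laplace(1/epsilon)
    bounds what the analyst can learn about any one individual.
    """
    true_count = sum(1 for r in records if predicate(r))
    noise = np.random.laplace(loc=0.0, scale=1.0 / epsilon)
    return true_count + noise

# Hypothetical records; the analyst only ever sees the noisy answer.
records = [{"age": a} for a in (34, 67, 45, 71, 29, 80, 55)]
print(noisy_count(records, lambda r: r["age"] > 60, epsilon=0.5))
```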
Are input security and output privacy really that different?
Yes, and it is very useful to make the distinction. As we demonstrate above, input security is more closely related to traditional notions of security, and those notions do not necessarily include protection against how much authorized users of a system can learn about the data subjects. The widespread misconception that privacy and security are the same is what motivated us to write this blog. We continuously find that even people well versed in the privacy space may not appreciate the importance of this distinction. We blame it (among other things) on the convoluted way companies in the PET space talk about their technologies and trust models.
Input security and output privacy are distinct concepts, and separating them makes some complex topics easier to understand. An input security technology allows you to perform computations on protected data, denying access to anyone who, until now, could see or manipulate the data in plaintext while it was being processed. This, however, tells you nothing about what computations will be performed on the encrypted data. Code that leaks personal data can still be executed in an encrypted setting, and will still leak that personal data to whoever is supposed to receive the output.
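A contrived sketch of that point: both functions below could run inside an encrypted environment, and input-security PETs would keep the vendor and the infrastructure provider out in either case, yet the first one hands raw personal data to whoever legitimately receives the result. The field names are hypothetical.

```python
# Input security constrains WHO can see the data during processing,
# not WHAT the computation reveals to the authorized recipient of the output.
# Field names below are illustrative assumptions.

def leaky_query(customer_rows):
    # Runs happily on "protected" infrastructure, but the authorized
    # recipient of the output still learns every individual record.
    return [(row["email"], row["salary"]) for row in customer_rows]

def aggregated_query(customer_rows):
    # Same infrastructure, but the output reveals only a single average.
    return sum(row["salary"] for row in customer_rows) / len(customer_rows)
```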
On the other hand, any method for ensuring output privacy does not automatically protect your data from being leaked to a vendor or an infrastructure provider who is not supposed to have access to the data at all. This becomes especially relevant when output privacy is applied outside the data owners’ premises or through black-box software. In both cases, it becomes very hard (or even impossible) to guarantee that an external party cannot access the data.
PETs are a tool, not the solution
In the end, it is not about knowing the ins and outs of every single technology, but about being able to ask the right questions. At Decentriq we see every day that no PET is going to substitute for human decision-making in a deeply human problem space such as privacy. The best a technology (and a technology vendor) can do is help make these decisions as transparent and easy as possible.
In our next blog, we are going to dive a bit deeper into what PET providers really mean when they say “Your data never leaves your premises”, and how it affects data localization and sovereignty decisions.