This article was contributed by Tianhui Michael Li, founder of The Data Incubator, and Maxime Agostini, cofounder and CEO of Sarus.
Why differential privacy overcomes many of the fatal flaws of de-identification and data masking
If data is the new oil, then privacy is the new environmentalism. With the growing use of data comes the need for strong privacy protections. Indeed, robust data privacy protects against rogue employees spying on users (as has been reported at Uber and Google), against well-meaning employees who have been hacked, and against leaks when data is shared between departments or companies. Unfortunately, the conventional approaches to protecting the privacy of shared data are fundamentally flawed and have suffered several high-profile failures. With enhanced regulatory scrutiny from Europe’s GDPR, California’s CCPA, and China’s PIPL, failures can cost companies millions in fines.
In response, companies have focused on band-aid solutions, such as risk assessments and data masking, performed by already overtapped compliance teams. These solutions are slow, burdensome, and often inaccurate. We make the case that companies should use differential privacy, which is fast becoming the gold standard for protecting personal data and has been implemented in privacy-sensitive applications by industry leaders like Google, Apple, and Microsoft. Differential privacy is now emerging as not only the more secure solution but also a lighter-weight one that can enable safe corporate data collaboration. Companies are embracing differential privacy as they look to capture the $3 trillion of value McKinsey estimates will be generated by data collaboration.
Data masking is vulnerable to attackers with side information
The common industry solution, data masking, sometimes called de-identification, leaves companies vulnerable to privacy breaches and regulatory fines. In its simplest form, it aims to make data records anonymous by removing all personally identifiable information (PII), meaning anything that is sufficient to identify a single individual. Such identifiers can be obvious (name, email, phone number, social security number) or less so (IP address, internal ID, date of birth, or any unique combination of the above). For example, for medical data, HIPAA's Safe Harbor method lists 18 identifiers that must be removed for a dataset to qualify as de-identified. There is no shortage of masking techniques: deletion, substitution, perturbation, hashing, shuffling, redaction, and so on. Each comes with its own parameters for making it harder to re-identify an individual. But while data masking is a first attempt at anonymization, it does not make datasets anonymous.
In 1996, Massachusetts released the hospital records of its state employees in a noble attempt to foster research on improving healthcare and controlling costs. The governor at the time, William Weld, assured the public that their medical records would be safe and that the state had taken pains to de-identify the dataset by removing critical PII. Little did he know that MIT graduate student Latanya Sweeney would take on the challenge of re-identifying the data. By purchasing voter roll data, she was able to learn the governor’s birth date and zip code, which, when combined with his sex, uniquely identified his hospital visit in the dataset. In a final theatrical flourish, she even mailed Governor Weld’s health care records to his office. This famous case is a reminder that, as long as there is something potentially unique left in the de-identified record, someone with the right “side information” may use it to carry out a re-identification attack. Indeed, even sharing simple aggregates, like sums and averages, can be enough to re-identify users given the right side information.
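To make the mechanics of such a linkage attack concrete, here is a minimal Python sketch. Everything in it is invented for illustration: the records, the field names, and the `link_records` helper are assumptions, not the data or method used in the Massachusetts case. It simply joins a "de-identified" table to a public roll on birth date, zip code, and sex, and keeps the matches that are unique.

```python
from collections import defaultdict

# Fields left in the "de-identified" data that also appear in public records.
QUASI_IDENTIFIERS = ("birth_date", "zip_code", "sex")

def link_records(deidentified, public_roll):
    """Return {name: record} for people whose quasi-identifiers match exactly one record."""
    by_key = defaultdict(list)
    for record in deidentified:
        by_key[tuple(record[q] for q in QUASI_IDENTIFIERS)].append(record)

    matches = {}
    for person in public_roll:
        key = tuple(person[q] for q in QUASI_IDENTIFIERS)
        candidates = by_key.get(key, [])
        if len(candidates) == 1:  # a unique match is a re-identification
            matches[person["name"]] = candidates[0]
    return matches

# Entirely synthetic example data (hypothetical people and diagnoses).
hospital_records = [
    {"birth_date": "1950-01-01", "zip_code": "00001", "sex": "M", "diagnosis": "diagnosis-1"},
    {"birth_date": "1972-06-15", "zip_code": "00002", "sex": "F", "diagnosis": "diagnosis-2"},
    {"birth_date": "1972-06-15", "zip_code": "00002", "sex": "F", "diagnosis": "diagnosis-3"},
]
voter_roll = [
    {"name": "Person A", "birth_date": "1950-01-01", "zip_code": "00001", "sex": "M"},
    {"name": "Person B", "birth_date": "1972-06-15", "zip_code": "00002", "sex": "F"},
]

reidentified = link_records(hospital_records, voter_roll)
```

In this toy data, Person A is re-identified because no one else shares the same three attributes, while Person B is not, because two records are indistinguishable on those fields. No name ever appeared in the hospital table.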
Data masking is slow, manual, and burdens already-overtapped compliance teams
Regulators have long understood that de-identification is not a silver bullet, because of re-identification with side information. When regulators defined anonymous or de-identified information, they refrained from giving a precise definition and deliberately opted for a practical one based on the reasonable risk of someone being re-identified. GDPR mentions “all the means reasonably likely to be used,” whereas CCPA defines de-identified to be “information that cannot reasonably identify” an individual. The ambiguity of both definitions places the burden of privacy risk assessment on the compliance team. For each supposedly de-identified dataset, the team needs to show that there is no reasonable risk of re-identification. To meet those standards and keep up with proliferating data sharing, organizations have had to beef up their compliance teams.
This appears to have been the process that Netflix followed when it launched a million-dollar prize to improve its movie recommendation engine in 2006. The company publicly released a stripped-down version of its dataset, containing roughly 100 million movie ratings from about 500,000 subscribers, enabling anyone in the world to develop and test prediction engines that could beat its own. The company appears to have deemed the risk of re-identification based on user film ratings negligible. Nonetheless, researchers from UT Austin were able to leverage user ratings of movies as a “fingerprint” to tie a user’s private Netflix reviews to their public IMDb reviews. The IMDb accounts sometimes had real user names, while the corresponding Netflix accounts often had extra movie reviews not in the public IMDb accounts. Some of these extra reviews revealed apparent political affiliations, religious beliefs, sexual preferences, and other potentially sensitive information. As a result, Netflix ended up settling a privacy lawsuit for an undisclosed amount.
Data masking strategies can always be adjusted in an attempt to meet the growing pressure to protect privacy but their intrinsic limitations mean they will never fully meet expectations. While Governor Weld’s re-identification may seem obvious in retrospect, the Netflix re-identification case highlights how side information can be difficult to anticipate, especially as users are increasingly prone to share previously private yet seemingly innocuous information on social media. Accurate risk assessments for privacy attacks are an unrealistic ask for compliance teams; they are perilous at best and futile at worst. Nonetheless, organizations have responded with lengthier reviews and more stringent data masking requirements that sometimes amputated the business value of the resulting data. This manual approach to protecting privacy has led to a significant slowdown in data projects, high cost of compliance, significant data engineering load, and missed opportunities.
Differential privacy to the rescue
By studying the risk of re-identification more thoroughly, researchers were able to better articulate the fundamental requirements for information to be anonymous. They realized that a robust definition of anonymity should not rely on what side information may be available to an attacker. This led to the definition of differential privacy in 2006 by Cynthia Dwork, then a researcher at Microsoft, and her collaborators. It quickly became the gold standard for privacy and has been used in global technology products like Chrome, the iPhone, and LinkedIn. Even the US Census Bureau used it for the 2020 census.
Differential privacy solves the problem of side information by looking at the most powerful attacker possible: an attacker who knows everything about everyone in a population except for a single individual. Let’s call her Alice. When releasing information to such an attacker, how can you protect Alice’s privacy? If you release exact aggregate information for the whole population (e.g., the average age of the population), the attacker can compute the difference between what you shared and the expected value of the aggregate with everyone but Alice. You just revealed something personal about Alice.
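The arithmetic behind this attack is short enough to spell out. The sketch below uses made-up ages and a hypothetical helper, `recover_missing_value`, and assumes the attacker knows how many people are in the population and everyone's age except Alice's.

```python
def exact_mean(values):
    return sum(values) / len(values)

def recover_missing_value(published_mean, known_values):
    """Differencing attack: total for everyone minus the total the attacker already knows."""
    population_size = len(known_values) + 1
    return published_mean * population_size - sum(known_values)

# Hypothetical data: the attacker knows these five ages but not Alice's.
known_ages = [34, 45, 29, 61, 52]
alice_age = 38  # the secret

# The data owner publishes the exact average age of all six people.
published_mean = exact_mean(known_ages + [alice_age])

# The attacker recovers Alice's age (up to floating-point rounding).
recovered = recover_missing_value(published_mean, known_ages)
```

Nothing about Alice was released directly, yet one exact aggregate plus side information is enough to reconstruct her value.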
The only way out is to not share the exact aggregate information but to add a bit of random noise to it and share only the slightly noisy aggregate. Even for the most well-informed of attackers, differential privacy makes it impossible to confidently deduce what value Alice contributed. Note that although we have talked about simple insights like sums and averages, the same possibilities for re-identification apply to more sophisticated insights like machine learning or AI models, and the same differential privacy techniques can protect privacy there by adding noise when training models. This gives us the right tools to find the optimal tradeoff: adding more noise makes it harder for a would-be attacker to re-identify Alice’s information, but at a greater loss of data fidelity for the data analyst. Fortunately, in practice, there is a natural alignment between differential privacy and statistical significance. After all, an insight that is not differentially private depends too much on just one individual, and in that case it is not statistically significant either. Used properly, differential privacy should not get in the way of statistically significant insights, and neither differential privacy nor statistical significance is typically of concern at “big data” scales. In short, differential privacy bounds the worst-case effectiveness of even the most powerful attacker.
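One standard way to add such noise is the Laplace mechanism, sketched below for an average. This is an illustration under stated assumptions, not a production implementation: the function names, the age bounds, and the privacy parameter `epsilon` (smaller means more noise and stronger protection) are choices made for this example, the number of records is treated as public, and naive floating-point noise sampling like this has known weaknesses that hardened libraries address.

```python
import random

def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # follows a Laplace distribution centered at 0 with that scale.
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def private_mean(values, lower, upper, epsilon, rng):
    """Noisy mean of `values`, assuming the record count is public."""
    # Clip so that no single record can move the mean by more than
    # (upper - lower) / n; this bound is the query's sensitivity.
    clipped = [min(max(v, lower), upper) for v in values]
    sensitivity = (upper - lower) / len(clipped)
    true_mean = sum(clipped) / len(clipped)
    return true_mean + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical usage with made-up ages.
rng = random.Random(42)
ages = [34, 45, 29, 61, 52, 38]
noisy_average_age = private_mean(ages, lower=0, upper=100, epsilon=1.0, rng=rng)
```

The noise scale is the sensitivity divided by `epsilon`, so it shrinks as the dataset grows. With only six records the noise swamps the answer; with millions of records it becomes negligible, which is the alignment with statistical significance described above.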
With differential privacy, producing privacy-preserving analytics or machine learning models calls for a new way of interacting with personal data. The traditional approach was to run data through a data-masking pipeline before providing the altered data to the data analyst. With differential privacy, no data (whether masked or not) is sent to an analyst. Instead, an analyst submits queries and a system runs those on the data and adds appropriate noise. This paradigm works for both business analytics and machine learning use cases. It also fits very well with the modern data infrastructures where the data is often stored and processed on distributed systems with data practitioners working remotely.
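The query-based workflow can be sketched as a small class that keeps the raw records private and only returns noisy answers. The class name `PrivateQueryServer`, its method, and the idea of capping total `epsilon` spent across queries (a "privacy budget", which this article does not discuss) are assumptions added for illustration; without some such cap, an analyst could repeat a query many times and average the noise away.

```python
import random

class PrivateQueryServer:
    """Holds raw records; analysts only ever see noisy query results."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self._remaining_epsilon = total_epsilon
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        """Count records matching `predicate`, plus Laplace noise."""
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining_epsilon:
            raise RuntimeError("privacy budget exhausted")
        self._remaining_epsilon -= epsilon

        true_count = sum(1 for record in self._records if predicate(record))
        # Adding or removing one person changes a count by at most 1,
        # so the Laplace noise scale is 1 / epsilon.
        scale = 1 / epsilon
        noise = self._rng.expovariate(1 / scale) - self._rng.expovariate(1 / scale)
        return true_count + noise

# Hypothetical usage: the analyst submits a query, never touching the rows.
server = PrivateQueryServer(
    [{"age": 34}, {"age": 45}, {"age": 29}, {"age": 61}],
    total_epsilon=1.0,
    seed=7,
)
over_forty = server.noisy_count(lambda record: record["age"] >= 40, epsilon=0.5)
```

The analyst's code never sees `_records`; it only receives a number that is close to the true count when the dataset is large relative to the noise.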
Differential privacy doesn’t just better protect user privacy; it can do so automatically for new datasets without lengthy, burdensome privacy risk assessments. This is critical for companies looking to stay nimble as they capture part of what McKinsey estimates is $3 trillion of value generated by data collaboration. Traditional compliance committees are costly, might take months to deliberate on a single case, and make fallible pronouncements about privacy. Additionally, each dataset and data project calls for a bespoke data masking strategy and an ad hoc anonymization pipeline, adding yet another burden to stretched data engineering resources. In some cases, compliance may even forbid sharing of data if no viable masking technique is known. With differential privacy, we can let the math and computers algorithmically determine how much noise needs to be added to meet the protection standards, cheaply, quickly, and reliably. Much as distributed computing frameworks like Hadoop and Spark made it easy to scale data and computation, differential privacy is making it easier to scale privacy protection and data governance.
To achieve anonymization, organizations have long relied on applying various data masking techniques to de-identify data. As the anecdotes about Massachusetts Governor Weld and Netflix have shown, and privacy research has proven, as long as there is exact information left in the data, one may use it to carry out re-identification attacks. Differential privacy is the modern, secure, mathematically rigorous, and practical way to protect user privacy at scale.
Maxime Agostini is the cofounder and CEO of Sarus, a privacy company that lets organizations leverage confidential data for analytics and machine learning. Prior to Sarus, he was cofounder and CEO of AlephD, a marketing tech company that he led until a successful acquisition by Verizon Media.
Tianhui Michael Li is the founder of The Data Incubator, an eight-week fellowship to help Ph.D.s and postdocs transition from academia into industry. It was acquired by Pragmatic Institute. Previously, he headed monetization data science at Foursquare and has worked at Google, Andreessen Horowitz, J.P. Morgan, and D.E. Shaw.