Data anonymization techniques that maintain a 1:1 relationship between personal information and the people to whom it relates are appropriate in certain use cases. But depending on these techniques in live production environments leaves companies, along with their users and employees, vulnerable to cyber attacks.
Not all data breaches lead to disaster for the companies that have been breached. Virtually every mid-sized to large company uses some form of data protection, such as data anonymization, which is meant to make data useless to anyone outside the company. The problem is that many companies still use anonymization techniques that were sufficient 10 years ago but fail today. Here are four outdated techniques that companies should use only sparingly (when a use case calls for re-identification), and never in production environments.
Pseudonymization (also called de-identification by regulators in other countries, including the US)
Pseudonymization, which involves replacing personally identifiable information (PII) with an artificial number or code and creating a new data table, is not true anonymization. It makes data an easy target for a privacy attack, because the attributes left in the table can still be combined to single out individuals: 63% of the US population is uniquely identifiable by combining gender, date of birth and zip code alone. Most companies, furthermore, do not pseudonymize enough data to eliminate the chance of re-identification.
This is why pseudonymized data must fulfill the same GDPR requirements as the original personal data. That alone should be a signal that this approach is not sufficient to protect data.
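To make the linkage risk concrete, here is a minimal Python sketch. Everything in it is hypothetical: the records, the `pseudonymize` and `reidentify` helpers, and the "public" dataset are invented for illustration and do not describe any particular product. It replaces names with random tokens, then shows that joining on gender, date of birth and zip code recovers every name.

```python
import secrets

# Hypothetical records; all names and values are invented for illustration.
records = [
    {"name": "Alice Smith", "gender": "F", "dob": "1985-03-12", "zip": "02139", "diagnosis": "asthma"},
    {"name": "Bob Jones", "gender": "M", "dob": "1990-07-04", "zip": "10001", "diagnosis": "diabetes"},
    {"name": "Carol White", "gender": "F", "dob": "1985-03-12", "zip": "94105", "diagnosis": "migraine"},
]

def pseudonymize(rows):
    """Replace each name with a random token; return the new table and the lookup table."""
    lookup = {}
    output = []
    for row in rows:
        token = secrets.token_hex(8)
        lookup[token] = row["name"]
        new_row = {k: v for k, v in row.items() if k != "name"}
        new_row["id"] = token
        output.append(new_row)
    return output, lookup

def quasi_key(row):
    """The three attributes an attacker can often find in other sources."""
    return (row["gender"], row["dob"], row["zip"])

def reidentify(pseudonymized, public_rows):
    """Link pseudonymized rows to a named dataset on gender, date of birth and zip code."""
    index = {}
    for row in public_rows:
        index.setdefault(quasi_key(row), []).append(row["name"])
    matches = {}
    for row in pseudonymized:
        candidates = index.get(quasi_key(row), [])
        if len(candidates) == 1:  # a unique combination means the person is re-identified
            matches[row["id"]] = candidates[0]
    return matches

pseudo_table, lookup_table = pseudonymize(records)

# Hypothetical public dataset (for example a voter roll) carrying names with the same attributes.
public = [{k: r[k] for k in ("name", "gender", "dob", "zip")} for r in records]

recovered = reidentify(pseudo_table, public)
print(recovered == lookup_table)  # the attacker rebuilt the lookup table without ever seeing it
```

The attacker never touches the lookup table. The untouched columns do the work, which is why replacing direct identifiers alone is not enough.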
Permutation
Permutation involves swapping data points between records to de-identify a person. All data is retained in the table, but some of it is moved to a different row. For instance, names could be switched around. There are several problems here: one is that there is still a high risk of re-identification, and another is low statistical performance. When you move data around, you lose many of the correlations, insights and relations among columns, rendering the data useless for many analytical purposes. So permutation is not only unsafe, but also wrecks your data.
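The loss of correlation is easy to demonstrate. The sketch below uses invented ages and incomes and a hand-written `pearson` helper (both are assumptions for illustration). It shuffles one column independently of the other and compares the correlation before and after.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical table in which income rises linearly with age.
ages = [23, 31, 38, 44, 52, 59, 63, 70]
incomes = [1000 * age + 5000 for age in ages]

# Permutation: every income value stays in the table, but lands on a different row.
rng = random.Random(7)  # fixed seed so the run is repeatable
permuted_incomes = incomes[:]
rng.shuffle(permuted_incomes)

original_corr = pearson(ages, incomes)
permuted_corr = pearson(ages, permuted_incomes)
print(f"correlation before: {original_corr:.3f}")
print(f"correlation after:  {permuted_corr:.3f}")
```

The column totals and averages survive the shuffle, but the relationship between the two columns does not, and that relationship is usually what analysts wanted from the data.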
Randomization
Another classic anonymization approach, randomization entails modifying characteristics according to predefined randomized patterns. For example, one randomization technique is perturbation (not to be confused with permutation), which adds systematic noise to the data to obscure it, for instance by adding or subtracting a certain number of days from a date throughout the table. While some correlations can be preserved, and it may be harder for hackers to retrieve accurate personal data, it is certainly not impossible. The risk of re-identification is still high, and for that reason, randomization is risky.
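Here is a small sketch of why a table-wide date shift is fragile. The patient IDs, the dates and the 17-day offset are all made up, and the example covers only the constant-offset variant described above, not per-record random noise. If an attacker knows one true date, they can infer the offset and undo it for every row.

```python
from datetime import date, timedelta

# Hypothetical admission dates keyed by an opaque record ID.
admissions = {
    "rec_001": date(2023, 1, 15),
    "rec_002": date(2023, 2, 3),
    "rec_003": date(2023, 2, 28),
    "rec_004": date(2023, 4, 9),
}

# Perturbation: shift every date by the same secret number of days.
SECRET_SHIFT = timedelta(days=17)  # assumed value, chosen for illustration
perturbed = {rec_id: d + SECRET_SHIFT for rec_id, d in admissions.items()}

# The attacker learns a single true value from outside the table,
# for example their own admission date.
known_id, known_true_date = "rec_001", date(2023, 1, 15)

# One known pair is enough to recover the offset and reverse the whole table.
inferred_shift = perturbed[known_id] - known_true_date
recovered = {rec_id: d - inferred_shift for rec_id, d in perturbed.items()}

print(inferred_shift.days)      # 17
print(recovered == admissions)  # True
```

The same property that preserves correlations (the intervals between dates are unchanged) is what makes the shift easy to reverse.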
Generalization
Generalization is another well-known anonymization technique that reduces the granularity of the data representation to preserve privacy. The main goal is to replace specific values with generic but semantically consistent values, for instance replacing a specific date with a month, or replacing a specific age with an age range. Unfortunately, as with randomization, the risk of re-identification is high, and generalization impedes the statistical performance of the data.
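As a rough illustration, the sketch below generalizes a few invented records (age to a ten-year range, visit date to a month, zip code to its first three digits) and then counts how many records share each generalized combination. The field names and the `generalize` helper are hypothetical. Any combination that occurs only once still points to a single person.

```python
from collections import Counter

# Hypothetical records; all values are invented for illustration.
people = [
    {"age": 34, "visit": "2023-03-14", "zip": "02139"},
    {"age": 37, "visit": "2023-03-02", "zip": "02141"},
    {"age": 36, "visit": "2023-03-27", "zip": "02118"},
    {"age": 71, "visit": "2023-03-09", "zip": "02139"},
    {"age": 45, "visit": "2023-08-21", "zip": "94105"},
]

def generalize(row):
    """Coarsen each value: age to a decade range, date to a month, zip to a 3-digit prefix."""
    low = row["age"] // 10 * 10
    age_range = f"{low}-{low + 9}"
    month = row["visit"][:7]          # "YYYY-MM"
    zip_prefix = row["zip"][:3] + "**"
    return (age_range, month, zip_prefix)

generalized = [generalize(row) for row in people]
group_sizes = Counter(generalized)

# Combinations shared by only one record remain unique after generalization.
unique_groups = [combo for combo, size in group_sizes.items() if size == 1]

for combo, size in group_sizes.items():
    print(combo, size)
print(f"{len(unique_groups)} of {len(people)} records are still unique")
```

In this toy table, three records blend into one group, while the 71-year-old and the only August visitor remain alone in theirs. Coarsening further would hide them, at the cost of even less analytical detail.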
All four of these “anonymization” techniques lead to data that’s not completely anonymous. In all four scenarios, the datasets maintain a 1:1 link between each record and a specific person, and these links are the very reason re-identification is possible.
Companies using any of these approaches in a production environment would be well served to move to more modern approaches, such as synthetic data, that have been proven to eliminate PII while maintaining the integrity of the data. Some data scientists even claim to prefer synthetic data over real data sets, because synthetic data sets can be analyzed, shared via the cloud, and used for AI model development and software testing without risk of re-identification or regulatory compliance issues.
One thing is for certain—bad actors will continue to try to get to your data. Don’t make their jobs easier by using antiquated techniques to protect it.