What Is Data Anonymization & De-Identification in 2024: Is It Truly Anonymous?
Services that collect data of real-life people usually need to anonymize or de-identify that data in order to process, transfer or sell it. However, anonymization techniques are not always enough to keep someone from identifying an individual. This article discusses anonymization techniques, how they can be reversed and how you can protect yourself.
Online services collect a lot of data. Tech giants like Google, Microsoft and Meta (formerly Facebook) make a business of harvesting user data and selling it to data brokers and advertisers. They try to assuage fears of misuse of this data by claiming to anonymize and de-identify your data, but is data anonymization truly anonymous?
Key Takeaways: Data Anonymization
- Data anonymization and de-identification techniques attempt to hide the identities of people within a dataset by removing directly or indirectly identifying data.
- Many anonymization methods are subpar and can lead to your anonymized data being linked back to you.
- Exercising caution on the internet and using a VPN can help to mitigate some of the faults of data anonymization.
Read on as we explain how data anonymization and de-identification work and why you should still take precautions, even when using services that anonymize data. We’ll also suggest a few services you can use to attempt to keep as low a profile as possible, even if being truly anonymous is nigh on impossible.
Data Anonymization Definition
Data anonymization is a method of data processing that removes personally identifiable information from a set of personal data so that it can no longer be tied to a particular data subject. Certain data protection laws, like the GDPR, place strict rules on how businesses and other data controllers handle personal data, but they exempt data that has been truly anonymized, which gives controllers a strong reason to anonymize what they collect and store.
There are numerous ways to properly anonymize data in a way that can still be useful for analytical purposes. Unfortunately, this kind of data analysis isn’t always the most lucrative use of all that sensitive data. Instead, it’s much more profitable to sell that data to advertisers, who can use it to target individuals who fall within certain categories.
This means that most online services don’t use effective methods of anonymization, and instead leave breadcrumbs that could still be used to identify a person.
What Is Identifiable Data?
Identifiable data is simply any data that can be used to point to a specific individual. However, there are several levels to identifiable data.
- Direct private identifiers: Names, emails, addresses, social security numbers
- Indirect identifiers: Dates of birth, zip codes, IP addresses, license plates, geolocation data
- Data pointing to multiple individuals: Movie preferences, food preferences, school attendance
- Data that cannot point to any individual: Aggregated census data, anonymous survey data
De-Identification vs Anonymization vs Pseudonymization
- De-identification refers to the removal of directly identifying data from a data subject’s profile. Some potentially identifying data is left behind, and this can include sensitive data, such as race, gender or political affiliation. De-identified data can be reidentified.
- Anonymization, when properly executed, should scrub all identifiable data from a dataset, including indirect identifiers. If it does retain indirectly identifiable data, it should ideally employ data masking, generalization or aggregation, ensuring that it can’t be used to identify any individual. True data anonymization is irreversible.
- Pseudonymization removes direct identifiers from a profile, but replaces them with pseudonyms. While this is less private than anonymization, pseudonymization is necessary in certain fields of science and is frequently used in medical trials and other scientific studies.
Data Anonymization Techniques
There are several methods that a data controller can use to anonymize data. However, only a few can create an irreversibly anonymous data set.
1. Deleting Direct and Indirect Identifiers
Deleting direct identifiers is the easiest (or laziest) way to de-identify data. Under the GDPR, data treated this way is still considered personal data rather than anonymous data, because de-identified data can be reversed by data matching.
2. Pseudonymization or Tokenization
We already discussed this method, which replaces private identifiers with false identifiers called pseudonyms or tokens. This method provides no more protection than deletion does, and it might be even easier to reverse if done improperly.
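To make the idea concrete, here is a minimal sketch of tokenization in Python. It assumes a keyed hash (HMAC-SHA256) as the way to generate tokens, which is one common approach and not necessarily what any particular service uses. The function name pseudonymize, the sample records and the key are all hypothetical.

```python
import hashlib
import hmac

def pseudonymize(value: str, secret_key: bytes) -> str:
    """Return a stable token for `value` using keyed HMAC-SHA256."""
    # Normalize so "Alice@example.com " and "alice@example.com" match.
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()
    return "user_" + digest[:16]

# Hypothetical key for illustration; whoever holds it can regenerate
# the tokens, so it must be stored separately from the data.
SECRET_KEY = b"example-key-do-not-reuse"

records = [
    {"email": "alice@example.com", "diagnosis": "asthma"},
    {"email": "bob@example.com", "diagnosis": "diabetes"},
    {"email": "Alice@example.com ", "diagnosis": "flu"},
]

# Replace the direct identifier with a token; keep everything else.
pseudonymized = [
    {"token": pseudonymize(r["email"], SECRET_KEY), "diagnosis": r["diagnosis"]}
    for r in records
]
```

The same person always gets the same token, which is what makes pseudonymized data useful for studies that follow a subject over time. It is also why the remaining attributes can still be matched against other datasets.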
3. Data Masking
Masking involves replacing data within a set with fictitious data. This can be done on the fly, with information being altered as one gains data access. Alternatively, a data controller can create an augmented set of data, while keeping a backup of the original dataset.
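As a rough illustration, the sketch below masks values on the fly in Python, leaving the stored record untouched. It uses simple character masking, which is only one form of masking, and the helper names mask_email, mask_digits and masked_view are hypothetical.

```python
def mask_email(email: str) -> str:
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = email.partition("@")
    if not domain:
        return "*" * len(email)
    return local[:1] + "*" * (len(local) - 1) + "@" + domain

def mask_digits(value: str, visible: int = 4) -> str:
    """Replace every digit except the last `visible` ones with '*'."""
    to_hide = max(sum(ch.isdigit() for ch in value) - visible, 0)
    out = []
    for ch in value:
        if ch.isdigit() and to_hide > 0:
            out.append("*")
            to_hide -= 1
        else:
            out.append(ch)
    return "".join(out)

def masked_view(record: dict) -> dict:
    """Return a masked copy; the original record is not modified."""
    return {
        **record,
        "email": mask_email(record["email"]),
        "ssn": mask_digits(record["ssn"]),
    }

original = {"email": "alice@example.com", "ssn": "123-45-6789", "plan": "premium"}
view = masked_view(original)
# view["email"] is "a****@example.com" and view["ssn"] is "***-**-6789"
```

Because the original record still exists behind the masked view, this protects data only from whoever sees the view, not from whoever holds the source.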
4. Introducing Statistical Noise
Statistical noise is deliberate distortion added to a dataset to make it more difficult to pinpoint an individual. There are several ways to introduce noise to a dataset and obscure the original data:
- Data generalization: Rounding values or presenting them as ranges instead of exact numbers
- Data perturbation: Altering values across the whole dataset in a consistent way, such as rounding to a fixed base or adding random noise
- Data swapping: Rearranging attribute values so that some information is exchanged between data subjects within the same dataset
This is one of the best methods of data anonymization, as it generates a dataset that contains random noise and differs significantly from the initial dataset.
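The three approaches above can be sketched in a few lines of Python. This is a simplified illustration, not a production recipe: the function names, the sample rows and the choice of Gaussian noise for perturbation are our own assumptions.

```python
import random

def generalize_age(age: int, width: int = 10) -> str:
    """Generalization: report an age as a range instead of an exact number."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def perturb(values, scale, rng):
    """Perturbation: add zero-mean random noise to each numeric value."""
    return [round(v + rng.gauss(0, scale), 2) for v in values]

def swap_column(rows, column, rng):
    """Swapping: shuffle one attribute so values move between data subjects."""
    shuffled = [row[column] for row in rows]
    rng.shuffle(shuffled)
    return [{**row, column: value} for row, value in zip(rows, shuffled)]

rng = random.Random(2024)  # fixed seed so the sketch is repeatable

rows = [
    {"age": 23, "income": 31000, "city": "Leeds"},
    {"age": 37, "income": 52000, "city": "York"},
    {"age": 41, "income": 47000, "city": "Bath"},
    {"age": 58, "income": 68000, "city": "Hull"},
]

age_ranges = [generalize_age(r["age"]) for r in rows]   # e.g. 37 becomes "30-39"
noisy_incomes = perturb([r["income"] for r in rows], scale=1000, rng=rng)
swapped = swap_column(rows, "city", rng)
```

Note the trade-off: the more noise you add, the harder reidentification becomes, but the less accurate any analysis of the data will be.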
5. Data Aggregation
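Often regarded as the only truly irreversible method of data anonymization, aggregation presents collected data as aggregated values, such as totals or averages, with no attributes attached to each other. This holds only if each group is large enough that a value can’t point to a single person.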
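Here is a minimal Python sketch of aggregation, assuming we only want to publish how many people fall into each group. The function name aggregate_counts and the minimum group size of three are hypothetical choices; hiding small groups is a common precaution, since a group of one is effectively an individual record.

```python
from collections import Counter

def aggregate_counts(rows, column, min_group_size=3):
    """Count rows per value of `column`, omitting groups that are too small."""
    counts = Counter(row[column] for row in rows)
    return {value: n for value, n in counts.items() if n >= min_group_size}

rows = [
    {"zip3": "021", "condition": "asthma"},
    {"zip3": "021", "condition": "flu"},
    {"zip3": "021", "condition": "asthma"},
    {"zip3": "021", "condition": "diabetes"},
    {"zip3": "100", "condition": "flu"},
    {"zip3": "100", "condition": "flu"},
    {"zip3": "100", "condition": "asthma"},
    {"zip3": "945", "condition": "diabetes"},  # a group of one
]

by_region = aggregate_counts(rows, "zip3")
# by_region is {"021": 4, "100": 3}; the single "945" row is not published
```

The published result contains only totals per region, so there is no row left to link a condition to a person.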
6. Synthetic Data
Synthetic data is not real data. Rather, it’s a dataset that’s derived from a real one using various algorithms. Methods that create artificial datasets are the most secure, as the data stored in them is truly fictitious, while offering enough statistical similarity to the original to be useful for research.
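For a sense of how this works, the Python sketch below builds artificial rows by sampling each column independently from the values seen in the real dataset. This is a deliberately naive approach that we chose for brevity: it roughly preserves how common each value is, but it discards relationships between columns, and real synthetic data generators use far more sophisticated models. The function name synthesize and the sample data are hypothetical.

```python
import random

def synthesize(rows, n, rng):
    """Build `n` artificial rows by sampling each column independently."""
    columns = list(rows[0].keys())
    # One pool of observed values per column; frequent values are drawn more often.
    pools = {c: [r[c] for r in rows] for c in columns}
    return [{c: rng.choice(pools[c]) for c in columns} for _ in range(n)]

rng = random.Random(7)  # fixed seed so the sketch is repeatable

real_rows = [
    {"age_range": "20-29", "city": "Leeds", "plan": "free"},
    {"age_range": "30-39", "city": "York", "plan": "premium"},
    {"age_range": "30-39", "city": "Leeds", "plan": "free"},
    {"age_range": "50-59", "city": "Bath", "plan": "free"},
]

synthetic_rows = synthesize(real_rows, n=5, rng=rng)
```

A synthetic row can still coincide with a real one by chance, especially in small datasets like this one, so a careful generator would also check for and remove such collisions.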
How Can De-Identified or Anonymized Data Be Re-Identified?
Services that claim to de-identify user data promise to scrub your data of any identifying information. However, what information counts as identifying? Legally, this includes things like your name, IP address, email or any other contact information. In practice, though, it takes a lot less to actually identify you.
Let’s play a game. In this game, you’re describing a friend to your other friends, but without using their name or address. You resort to describing them by their hobbies, what they do for a living and what college they attended. You can bet that your friends will be able to identify the person pretty quickly. This is called re-identification.
In a notorious case, a Catholic priest was forced to resign his position after a religious newsletter publisher managed to identify him using data that was legally purchased from a broker. The data included usage records from the gay hookup app Grindr, as well as location data showing visits to gay bars, his home and his workplace.
Reidentification Methods
To reidentify a dataset means to reattach a real-world person to it. This is relatively easy for a data broker to do, as they have multiple data points to compare between datasets. Here are some ways anonymized data can be de-anonymized.
1. Data Matching or Data Linking
A data broker might have your name and email address, plus a few more tidbits of information — say, your school, race, gender and birthday. It can then gather anonymized data from an online service, which doesn’t include your name and email, but does include those other data points.
The data broker just needs to compare the new dataset to the old one to find where they overlap, and they will effectively have reidentified that anonymized data.
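The comparison described here takes very little code. The Python sketch below is a simplified illustration with made-up people and records; the function name link and the choice of school, gender and birthday as the shared data points are assumptions taken from the example above.

```python
def link(known, released, keys):
    """Attach names to 'anonymized' rows that match exactly one known person."""
    index = {}
    for person in known:
        index.setdefault(tuple(person[k] for k in keys), []).append(person)

    matches = []
    for record in released:
        candidates = index.get(tuple(record[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match reidentifies the row
            matches.append({**record, "name": candidates[0]["name"]})
    return matches

# What the broker already holds (fictional people).
known = [
    {"name": "Alice Smith", "school": "State U", "gender": "F", "birthday": "1990-04-12"},
    {"name": "Bob Jones", "school": "State U", "gender": "M", "birthday": "1988-11-02"},
]

# An "anonymized" dataset: no names or emails, but the same indirect identifiers.
released = [
    {"school": "State U", "gender": "F", "birthday": "1990-04-12", "searches": "migraine treatment"},
    {"school": "City College", "gender": "M", "birthday": "1975-01-30", "searches": "car loans"},
]

reidentified = link(known, released, keys=["school", "gender", "birthday"])
# reidentified contains one row, now carrying the name "Alice Smith"
```

The more indirect identifiers two datasets share, the more likely each combination is to be unique, and the more rows can be matched this way.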
2. Exploiting Flawed Anonymization
If the data controller deletes direct identifiers, they may have left plenty of indirect identifiers in the dataset, which can be used to identify individuals and reveal sensitive information about them.
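One way to spot this kind of flaw is to count how many records share each combination of indirect identifiers. The Python sketch below does that, borrowing the idea from k-anonymity, a standard privacy measure that this article does not otherwise cover. The function names and sample rows are hypothetical.

```python
from collections import Counter

def group_sizes(rows, quasi_identifiers):
    """Count how many rows share each combination of indirect identifiers."""
    return Counter(tuple(r[q] for q in quasi_identifiers) for r in rows)

def k_anonymity(rows, quasi_identifiers):
    """Size of the smallest group; 1 means at least one person stands alone."""
    return min(group_sizes(rows, quasi_identifiers).values())

def unique_records(rows, quasi_identifiers):
    """Rows whose combination of indirect identifiers appears only once."""
    sizes = group_sizes(rows, quasi_identifiers)
    return [r for r in rows if sizes[tuple(r[q] for q in quasi_identifiers)] == 1]

# Names were deleted, but zip code, birth year and gender were left in.
rows = [
    {"zip": "02138", "birth_year": 1945, "gender": "M", "condition": "hypertension"},
    {"zip": "02139", "birth_year": 1980, "gender": "F", "condition": "asthma"},
    {"zip": "02139", "birth_year": 1980, "gender": "F", "condition": "flu"},
    {"zip": "02139", "birth_year": 1980, "gender": "F", "condition": "asthma"},
]

qi = ["zip", "birth_year", "gender"]
k = k_anonymity(rows, qi)            # 1, so the dataset is not safe to release
at_risk = unique_records(rows, qi)   # the single 1945 record
```

Any record that stands alone like this can be tied to a person by anyone who knows those three facts about them.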
3. Pseudonym or Token Reversal
If a dataset was pseudonymized using an inferior technique, such as simple character substitution, or if there was a key used to assign pseudonyms, there are ways for the pseudonyms to be reversed. For example, if one learns the key, it’ll be easy to reverse the pseudonymization.
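To show how little effort a reversal can take, the Python sketch below attacks a different but similarly weak technique of our own choosing: pseudonyms made by hashing an email address with no secret key. Since anyone can compute the same hash, an attacker with a list of candidate emails simply hashes each one and looks for matches. The function names and addresses are hypothetical.

```python
import hashlib

def weak_pseudonym(email: str) -> str:
    """An inferior technique: an unkeyed SHA-256 hash of the identifier."""
    return hashlib.sha256(email.encode("utf-8")).hexdigest()

def reverse_pseudonyms(tokens, candidate_emails):
    """Hash every candidate and map matching tokens back to emails."""
    lookup = {weak_pseudonym(e): e for e in candidate_emails}
    return {t: lookup[t] for t in tokens if t in lookup}

# Tokens as they might appear in a released dataset.
released_tokens = [
    weak_pseudonym("alice@example.com"),
    weak_pseudonym("bob@example.com"),
    weak_pseudonym("carol@example.com"),
]

# Emails the attacker already has, for example from an unrelated leak.
candidates = ["alice@example.com", "bob@example.com", "dave@example.com"]

recovered = reverse_pseudonyms(released_tokens, candidates)
# Two of the three tokens are reversed; carol's is not in the candidate list
```

Adding a secret key to the hash blocks this particular attack, but only for as long as the key stays secret.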
4. Advanced AI Models
Advances in data analysis technology have made it much simpler to reidentify de-identified data. With recent breakthroughs in AI, technologies like machine learning algorithms in conjunction with other statistical methods can now be used to put a name to an individual in an anonymized dataset.
HIPAA & De-Identification
If you’re at all familiar with privacy laws in the U.S., you’ll have at least heard of the Health Insurance Portability and Accountability Act, or HIPAA for short. HIPAA is a privacy act that governs how data controllers handle sensitive medical data. Among other things, it restricts who can access a data subject’s medical history without their authorization.
However, HIPAA allows anonymized data to be used for things like medical research. If the data isn’t properly anonymized, it can still be used to identify a person. This famously happened in 1996, when Latanya Sweeney was able to uncover the entire medical history of Massachusetts Governor Bill Weld by comparing two commercially available datasets that cost only $20 to purchase.
Thankfully, HIPAA regulations have changed since then, though the risk of improperly anonymized data still exists.
How to Keep Your Data Private
Although the only way to keep your data private is to smash your router to pieces and never connect to the internet again, that may be too drastic for most people. If you want to keep your private life private, while still enjoying the luxury of the internet, here are a few things you can do.
1. Cut Out or Minimize Social Media Use
Social media is one of the biggest privacy risks currently facing internet users. Most social media is centered around exposing the details of your life. Limiting social media use to communication will reduce your digital footprint significantly.
Additionally, you could use a pseudonym and burner email or phone number, and not provide any real data when signing up for a social media account.
2. Read Privacy Policies Before Providing Consent
Privacy policies reveal what kind of data a service collects. If you’re uncomfortable with the amount of data needed to use a service, you should look for an alternative.
3. Block Cookies
In the EU and some other jurisdictions, websites are legally required to obtain your consent before using non-essential cookies. Some cookies are necessary for a website to work, but you should opt out of all cookies that aren’t. Here’s a detailed guide on how to delete cookies.
4. Log Out of Your Browser, Google, Apple or Microsoft Account
If you’re logged in to a browser, it’s highly likely that your email address will be linked to your internet activity. Using Google and other search engines when logged in will result in your search history being logged. If possible, log out of your device’s Microsoft or Apple account as well to minimize the data those companies can collect.
5. Use a VPN
VPNs make your browsing harder to trace by encrypting your internet traffic and disguising your IP address. This prevents your ISP, the government or hackers from seeing what you’re doing online, and it means the sites you visit see the VPN server’s IP address rather than your own. Keep in mind that a VPN does not stop services from tracking you through cookies or accounts you’re logged in to. Here is our list of the best VPNs, in case you need help making a choice.
6. Erase Your Data From Data Broker Archives
Under the GDPR and similar laws, you’re entitled to have your stored data deleted. You can contact a data broker to let them know that you want your data deleted.
However, there are too many data brokers to contact individually, so a service like Surfshark’s Incogni or DeleteMe (read our DeleteMe review) can help you automate that process. You only need to provide your email and a few other details, and the data removal service will request deletion of your data wherever possible. More on this in our guide on how to remove yourself from data collection sites.
Final Thoughts: Data Anonymization & De-Identification
Staying anonymous online is practically impossible. The internet is vast, and we use so many services that one can never truly be sure they’re anonymous. Even if a service claims to anonymize or de-identify your data, there are still ways to link that data back to you.
What’s probably more frightening is the rapid advance of publicly available AI models like ChatGPT that can make reidentifying your data even easier.
What are your thoughts on the matter? Do you trust services that claim to anonymize your data? Do you use a VPN to protect yourself online? Let us know in the comments below, and as always, thank you for reading.
FAQ
Anonymization is the process of removing personally identifiable data from a dataset.
The goal of anonymizing data is to be able to analyze it without compromising the privacy of the people included in the dataset.
Pseudonymization is a de-identification technique where identifying data is replaced with a pseudonym or token. Unlike true anonymization, it can be reversed.