Data Anonymization
What is Data Anonymization?
Data anonymization is the permanent and complete alteration or removal of sensitive or personally identifiable information (PII) from a dataset. PII includes, but is not limited to, any information that may lead to the identification of an individual, such as an address, telephone number, photographic image, date of birth, or social security number.
Data Anonymization at a Glance
Data anonymization is:
- Process of removing private or confidential information from raw data
- Results in anonymous data that cannot be associated with anyone or anything, such as individuals, organizations, or projects
Why perform data anonymization?
Protect:
- Identities
- Private activities
- Sensitive information
How to perform data anonymization:
Techniques include:
- Aggregation
- Blurring
- Customized anonymization
- Directory replacement
- Encryption
- Generalization
- Hashing
- Masking out
- Nulling out
- Number and date variance
- Perturbation
- Pseudonymization
- Randomization
- Scrambling or Shuffling
- Substitution
- Suppression
Interest in data anonymization continues to increase as privacy concerns grow. Because data anonymization allows for de-identifying original confidential data, it helps organizations meet strict privacy mandates from private, industry, and government rules that impact data protection. In addition, if data is anonymized and an organization suffers a breach, there is no requirement to report it, because technically no sensitive information has been leaked.
Personal data that has been rendered anonymous in such a way that the individual is not or no longer identifiable is no longer considered personal data. For data to be truly anonymized, the anonymization must be irreversible.
General Data Protection Regulation’s directive on data anonymization
The process of data anonymization converts sensitive information into a format that can no longer be associated with the source data. In most cases, once this data is stripped of sensitive or personally identifying elements, those elements can never be re-associated with the original data.
Data Anonymization Techniques
The process of data anonymization uses various de-identification techniques to make sure that sensitive information is no longer identifiable.
Aggregation
Data is stored in aggregate for summarization. For example, individuals’ dates of birth are not included in a dataset, but the number of people of each age group is known. Aggregation is often used for data anonymization when information is being sold.
Custom Anonymization
This data anonymization method involves creating and implementing a unique technique or a combination of several techniques that utilize scripts or an application.
Directory Replacement
Makes changes to the sensitive data but maintains consistent relations between other values. The data is pseudonymized in this way. To anonymize, the separately stored information that identifies the sensitive information must be deleted.
Encryption
Encrypting data makes it unreadable to anyone without a decryption key, rather than removing sensitive data.
Generalization
Data is grouped to obfuscate it. Examples of generalization are using only states rather than cities or area codes rather than complete telephone numbers.
Hashing
With hashing, data points are replaced with a hash (i.e., an alphanumeric string).
Nulling Out
Sensitive data is removed and deleted from the data set, with those data elements becoming null values.
Perturbation
Data elements are altered slightly. Therefore, this data aggregation technique can only be used if accurate data is not needed.
Pseudonymization
Substitutes the identifying data with a placeholder, but the data can be re-identified with the correct information.
Randomization
Involves noise addition (i.e., adding or multiplying randomized numbers) to express sensitive data elements imprecisely.
Scrambling / Shuffling
With scrambling or shuffling, the letters or digits in the sensitive data are mixed.
Disadvantages of Data Anonymization
Allowing for Re-Identification
Performing data anonymization in a way that can be reversed is inherently risky. Having a way to reverse the de-identification leaves open the possibility of an unauthorized user being able to reveal sensitive information that was thought to be protected.
Differing Criteria and Processes for Data Anonymization
Varying techniques for data anonymization yield different results when de-identifying sensitive information in datasets. In addition, oversight of the processes used to perform these methods varies, which can also impact the efficacy of the techniques.
Lack of Regulation for Data Anonymization
Performing data anonymization on some information can cause problems related to compliance and legal requirements. For example, some may have rules about producing certain records that could be impossible if they anonymized, such as:
- Data protection laws—local, state, federal, and international
- E-discovery
- Export control regulations
- Industry compliance regulations
- Internal data security rules
Loss of Data Quality
Some data anonymization techniques devalue data by diluting it so much that it is difficult for the data to be used.
Data Anonymization and GDPR
When considering General Data Protection Regulation (GDPR) compliance, note that anonymized data is out of the scope of personal data. According to Recital 26 of GDPR, “…information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”
Based on this definition, it could be argued that either data anonymization or data pseudonymization would meet the requirements for protecting PII. Data anonymization or data pseudonymization decisions made for GDPR compliance come down to an acceptable level of risk relative to your organization’s effort.
Data pseudonymization techniques are more straightforward than those for data anonymization and can achieve data protection set forth by GDPR. But, there are risks that the data could be de-identified. Data anonymization is more complicated, but virtually eliminates the risk of PII data being compromised.
Data Anonymization vs. De-Identification
Using PII or electronic medical records (EMF), for example, the differences between data anonymization and de-identification are:
Data Anonymization | De-Identification |
Any links between a person and PII data have been irreversibly broken, making it virtually impossible to re-establish the identity of a person in the original record. | Personal identifiers in a record have been extracted, and it would be very difficult to re-establish the identity of a person in the original record. |
The biggest difference between de-identification and data anonymization is that there is no re-identification of anonymized records, because the links back to the subjects are irreversibly broken. However, with de-identification, the people and associated records can be put back together.
There are legal implications for choosing data anonymization or de-identification. An example is with the use of medical records and human tissues in biomedical research, which is controlled by both The Common Rule (Title 45 Code of Federal Regulations, Part 46, Protection of Human Subjects) and the Standards for Privacy of Individually Identifiable Health Information, Final Rule (part of HIPAA).
In this scenario, HIPPA only requires de-identification, but the Department of Health and Human Services requires data anonymization. This is just one of many examples of the challenges and considerations associated with data anonymization vs. de-identification.
Data Anonymization Best Practices
Data anonymization best practices start with the designation of a person or team to manage and evaluate requests as well as develop related forms, processes, and documentation. This role’s responsibilities should include working with requesters to:
- Understand the reason for the request.
- Analyze the data and identify what data elements need to be anonymized.
- Evaluate the challenges and risks.
- Gather information related to required resources and budget.
- Ensure that approved projects are successfully completed.
Balance Data Anonymization Risks and Benefits
The use case should drive decisions about the best way to handle data anonymization. This will direct the right level of data anonymization (for example, whether data pseudonymization is a viable option or not).
A balance needs to be struck between the degree of risk involved in re-identification and how the data will be used. This is because the more data is anonymized, the less value it has for analysis. When choosing the best data anonymization technique, always consider laws and regulations, the nature of the information, and its intended use.
Egnyte has experts ready to answer your questions. For more than a decade, Egnyte has helped more than 16,000 customers with millions of customers worldwide.
Last Updated: 23rd December, 2021