Data anonymization refers to the process of either encrypting or removing personally identifiable information from data sets, such as names and email addresses.
The practice is increasingly used in sectors such as healthcare, fintech and advertising as a result of heightened concerns about personal privacy.
However, the tricky part is that anonymized data can often be re-identified. In fact, given sufficient data, it is not that difficult for machine learning to recover identities.
Research in the US has shown that a dataset with 15 demographic attributes, including age, sex and marital status, can be used to correctly re-identify 99.98 percent of people.
Some researchers have even developed tools that let anyone check their own anonymity.
Given information such as postal code, gender and birth date, the software estimates the chance that your identity can be recovered from a set of anonymized data.
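The intuition behind such tools can be sketched in a few lines. This is a toy illustration, not the researchers' actual software: it simply counts how many records in an "anonymized" dataset share your combination of quasi-identifiers, and treats a unique combination as a near-certain re-identification. All names and data below are hypothetical.

```python
from collections import Counter

def reidentification_chance(records, quasi_identifiers):
    """Toy estimate: if k records share your quasi-identifier
    combination, the chance a matching record is yours is 1/k."""
    counts = Counter(
        tuple(r[key] for key in quasi_identifiers) for r in records
    )
    def chance(person):
        key = tuple(person[key] for key in quasi_identifiers)
        return 1.0 / counts[key] if key in counts else 0.0
    return chance

# Hypothetical "anonymized" dataset: names removed, demographics kept.
data = [
    {"postal_code": "2000", "gender": "F", "birth_date": "1990-04-01"},
    {"postal_code": "2000", "gender": "F", "birth_date": "1990-04-01"},
    {"postal_code": "3000", "gender": "M", "birth_date": "1985-12-25"},
]

chance = reidentification_chance(
    data, ["postal_code", "gender", "birth_date"]
)
print(chance(data[2]))  # unique combination -> 1.0
print(chance(data[0]))  # shared by two records -> 0.5
```

Real tools use far more sophisticated statistical models, but the principle is the same: the rarer your attribute combination, the easier you are to single out.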
In 2012, the UK’s Department of Education held an event to showcase anonymized data.
During the event, a student quickly recognized himself in a set of anonymized data: one record concerned an exam on which he had scored highly, and only a handful of students had taken that exam.
We are frequently assured that anonymization protects our privacy, but these de-identification practices have proved far from adequate. To meet the new challenges, more robust anonymization standards are needed.
Data security technology has grown rapidly in recent years. Differential privacy, homomorphic encryption and federated learning are three technologies that we should keep a close eye on.
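Of the three, differential privacy is the easiest to illustrate briefly. The core idea is to answer statistical queries with carefully calibrated random noise, so that no single person's presence in the dataset can be inferred from the answer. The sketch below, using hypothetical data, adds Laplace noise to a counting query; the `epsilon` parameter controls the privacy-accuracy trade-off (smaller epsilon means stronger privacy and noisier answers).

```python
import math
import random

def dp_count(values, predicate, epsilon):
    """Differentially private count: the true count plus Laplace noise.
    A counting query has sensitivity 1, so the noise scale is 1/epsilon."""
    true_count = sum(1 for v in values if predicate(v))
    # Sample Laplace(0, 1/epsilon) noise via the inverse CDF.
    u = random.random() - 0.5
    noise = -(1.0 / epsilon) * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))
    return true_count + noise

# Hypothetical ages; the true answer to "how many are over 40?" is 3.
ages = [23, 35, 41, 29, 52, 47, 33]
print(dp_count(ages, lambda a: a > 40, epsilon=1.0))
```

Each query returns a slightly different noisy answer, which is what prevents an attacker from pinning down any individual's record.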
This article appeared in the Hong Kong Economic Journal on Aug 21.
Translation by Julie Zhu