Guess Who? Netflix & Data Anonymisation

What?

Data anonymisation is the process of protecting sensitive or private information by erasing or encrypting the details that identify an individual in a dataset. This can include direct identifiers such as names or locations, and even subjective information such as ratings or opinions can count as personal data. Sometimes a piece of information is not enough to identify someone on its own; however, combined with other data, it can be.

Why?

The General Data Protection Regulation (GDPR, retained in UK law as the UK GDPR) outlines a set of rules to protect users and provide transparency. It focuses mainly on data that can be used to identify an individual. The Netflix Prize case study is an example of how this protection can be compromised accidentally:

“The Netflix Prize” was an open competition in which contestants were asked to build the best model to predict user ratings for films based on their previous ratings, with no other information provided.

Narayanan and Shmatikov (2006) applied their de-anonymisation methodology to the dataset. Using the Internet Movie Database (IMDb) as a source of background knowledge, they successfully identified the Netflix records of known users, uncovering their apparent political preferences and other potentially sensitive information.

The following year, Netflix cancelled a second Prize competition that it had already announced. The decision was a response to a lawsuit and to privacy concerns raised by the Federal Trade Commission.

How?

There are several approaches to data anonymisation, each with its own advantages and disadvantages. Where the emphasis is on data security, some methods are more effective than others; however, there are circumstances where the data must also remain interpretable, so the choice is usually a trade-off between privacy and utility.

Data masking hides data behind altered values; examples include character shuffling, encryption, and word or character substitution. This should make reverse engineering or detection much harder, if not impossible. However, it can be cumbersome and time-consuming, and it can negatively alter the underlying data.
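As a rough sketch of what two of these masking approaches might look like in Python (the helper names are my own, not from any particular library):

```python
import random


def mask_value(value: str, keep: int = 1) -> str:
    """Substitute all but the first `keep` characters with 'x'."""
    return value[:keep] + "x" * (len(value) - keep)


def shuffle_value(value: str, seed: int = 42) -> str:
    """Shuffle the characters so the original value is obscured."""
    chars = list(value)
    random.Random(seed).shuffle(chars)
    return "".join(chars)
```

Note the downside mentioned above: `mask_value("Alice")` gives `"Axxxx"`, which has destroyed most of the underlying value, so anything downstream that needed the real name can no longer use it.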

Pseudonymisation replaces private identifiers with fake identifiers. It preserves data accuracy and integrity, which lends itself to Machine Learning applications while also protecting privacy. It may not be enough on its own, though, if the remaining information can still be combined to identify an individual.
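A minimal sketch of pseudonymisation, assuming records are plain dictionaries with a `name` field (the `user_NNNN` scheme is illustrative):

```python
import itertools


def pseudonymise(records, field="name"):
    """Replace each distinct value in `field` with a stable fake identifier.

    The mapping from real to fake values is returned separately; in practice
    it would be stored securely (or discarded) rather than shipped with the data.
    """
    counter = itertools.count(1)
    mapping = {}
    out = []
    for rec in records:
        real = rec[field]
        if real not in mapping:
            mapping[real] = f"user_{next(counter):04d}"
        out.append({**rec, field: mapping[real]})
    return out, mapping
```

Because the same person always maps to the same pseudonym, per-user structure (such as rating histories) survives intact, which is exactly why this works well for Machine Learning but also why background knowledge can still re-identify people.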

Generalisation removes identifying information or groups data so that it is no longer identifiable; an example is removing the house number from an address. This tends to be more acceptable, although it may still be unsuitable if other information can be combined with it to identify an individual.
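Both flavours of generalisation, dropping a field and banding a value, might be sketched like this (the address format is an assumption for illustration):

```python
def generalise_address(address: str) -> str:
    """Drop a leading house number, keeping only the street name."""
    parts = address.split(" ", 1)
    if len(parts) == 2 and parts[0].isdigit():
        return parts[1]
    return address


def generalise_age(age: int, width: int = 10) -> str:
    """Group an exact age into a band, e.g. 34 -> '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"
```

The band width is the knob: wider bands give stronger anonymity but coarser data, which is the same trade-off discussed throughout this section.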

Data perturbation introduces noise into the data or randomises values within a field. Too little perturbation leads to poor anonymisation, whereas too much reduces the value of the data: rounding an age to the nearest year still makes it relatively simple to match to an individual, whereas rounding to the nearest 25 years makes the data almost unusable.
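A sketch of both perturbation styles, noise injection and coarse rounding, where the `scale`/`nearest` parameters are the privacy-versus-utility dial described above:

```python
import random


def perturb(value: float, scale: float, seed=None) -> float:
    """Add zero-mean Gaussian noise; larger `scale` means stronger anonymisation."""
    rng = random.Random(seed)
    return value + rng.gauss(0, scale)


def round_to(value: int, nearest: int) -> int:
    """Round to the nearest multiple, e.g. an age of 37 to the nearest 25 -> 25."""
    return round(value / nearest) * nearest
```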

Synthetic data replaces the original records with artificially generated ones that are representative of them. For the result to be useful for techniques such as Machine Learning, the artificial data should follow similar values and patterns to the original, which can be characterised by statistics such as means, standard deviations and medians.
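As the simplest possible sketch, assuming a single numeric field that is roughly normally distributed, you could fit a mean and standard deviation to the real values and sample new ones (real synthetic-data tools model far richer structure than this):

```python
import random
import statistics


def synthesise(values, n, seed=0):
    """Draw n artificial values from a normal distribution fitted to the originals."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]
```

As the text notes, this also lets you generate more rows than you originally had, since you are sampling from the fitted distribution rather than copying records.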

Hashing converts data to a fixed-length code. Reversing a hash is technically possible but impractical; the same input always gives the same output, and no two inputs should, in practice, produce the same output. However, raw hashes do not preserve the value of the data for machine learning, so a common workaround is to hash and then group the results.
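Using Python's standard `hashlib`, the hash-and-group workaround might look like this (the bucket count of 16 is an arbitrary choice for illustration):

```python
import hashlib


def hash_value(value: str) -> str:
    """Return a fixed-length SHA-256 digest of the input (64 hex characters)."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def hash_group(value: str, buckets: int = 16) -> int:
    """Hash, then reduce to a small number of groups, keeping some ML utility."""
    return int(hash_value(value), 16) % buckets
```

The grouped value behaves like a categorical feature with `buckets` levels, so a model can still use it, while the raw identity stays hidden behind the hash.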

Conclusion 

As a data scientist, my personal preference is to generalise the data, as it also helps with overfitting and small sample groups. However, creating synthetic data is also of great interest, as it allows you to create more data than you currently have (assuming you know the fields and their statistical measures). Whichever approach is appropriate, it is useful to understand the context and advantages of each.

References

Imperva (n.d.) Anonymization [Online] Available at: https://www.imperva.com/learn/data-security/anonymization/ [Accessed 21 February 2022]

Narayanan, A. and Shmatikov, V. (2006) 'How To Break Anonymity of the Netflix Prize Dataset' [Online] Available at: arXiv:cs/0610105 [Accessed 21 February 2022]
