The data we collect

by Phil Hesketh

February 28, 2020

Being aware of what we’re recording, and what kind of processing we need to do to it, not only helps us communicate our intent but can also save a lot of work in synthesis. Thinking about this ahead of time gives us a great opportunity to be creative in finding ways to tell meaningful stories that contain less PII.

This article is part 2 of a series of 4 on best practices around informed consent. If you missed it, you can read Part 1: Informed consent best practice.

Now that we know a little bit about the lawful basis for processing information, it helps to understand the differences between the types of information we might collect.

There are three different categories of information we might process:

  • Personally identifiable information
  • De-identified (or pseudonymised) information
  • Anonymised information

Personally identifiable information

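Data protection law defines personal data, often called personally identifiable information (PII), as follows (this wording comes from the UK Data Protection Act 1998, which preceded the GDPR):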

Data which relate to a living individual who can be identified, either:

a) from those data, or

b) from those data and other information which is in the possession of, or is likely to come into the possession of, the data controller, and includes any expression of opinion about the individual and any indication of the intentions of the data controller or any other person in respect of the individual.

There are two distinct types of PII: direct identifiers and indirect identifiers.

Although this list is not exhaustive, it adds a bit of colour to the distinction:

Direct identifiers:

  • First and last names
  • Personal identification numbers: SSN, passport number, driver’s license, etc.
  • Personal addresses: street address or email
  • Personal telephone number
  • Photographic images (particularly of face or other identifying characteristics), fingerprints, or handwriting
  • Biometric data: retina scans, voice signatures, or facial geometry
  • Information identifying personally owned property: VIN number or title number
  • Asset information: Internet Protocol (IP) or Media Access Control (MAC) addresses that consistently link to a particular person

Indirect identifiers:

  • Date of birth
  • Place of birth
  • Business telephone number
  • Business mailing or email address
  • Race
  • Religion
  • Geographical indicators
  • Employment information
  • Medical information
  • Education information
  • Financial information

Examples sourced from the University of Pittsburgh.

De-identified information

The Anonymisation Decision Making Framework published by UKAN is a great reference for understanding the complexities around anonymising your data. According to UKAN, de-identification is:

A process of removing or masking direct identifiers in personal data such as a person’s name, face or voice, address, Driving Licence or other unique number associated with them.

De-identification is sometimes called pseudonymisation.
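
To make the idea concrete, here is a minimal Python sketch of de-identifying interview records. It is only one way to do it, and everything specific in it is an assumption made for illustration: the field names, the set of fields treated as direct identifiers, and the "P-" participant code format are all hypothetical.

```python
import secrets

# Hypothetical field names: which fields count as direct identifiers
# depends entirely on what your own study collects.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def de_identify(records):
    """Swap direct identifiers for a random participant code.

    Returns the de-identified records plus a separate key table that
    maps each code back to the identifiers that were removed.
    """
    de_identified = []
    key_table = {}
    for record in records:
        code = "P-" + secrets.token_hex(4)
        while code in key_table:  # extremely unlikely, but never reuse a code
            code = "P-" + secrets.token_hex(4)
        # Removed identifiers go into the key table only.
        key_table[code] = {k: v for k, v in record.items() if k in DIRECT_IDENTIFIERS}
        # Everything else stays with the research data.
        kept = {k: v for k, v in record.items() if k not in DIRECT_IDENTIFIERS}
        de_identified.append({"participant": code, **kept})
    return de_identified, key_table


# Made-up example record.
interviews = [
    {"name": "Alex Example", "email": "alex@example.com", "role": "nurse", "quote": "..."},
]

de_identified, key_table = de_identify(interviews)
print(de_identified)
```

The key table is the reason this counts as de-identification rather than anonymisation: anyone who holds it can reverse the process, so it makes sense to store it separately from the research data and limit who can see it.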

In the past I’ve confused de-identification with anonymisation, telling participants that I would anonymise their data when in fact I was technically de-identifying it. After I changed the terminology in my consent forms, participants began asking what de-identification meant. That was a good thing: it started a conversation about how I was handling the information they gave me, which seemed to reassure people and helped me build trust.

Removing direct identifiers alone isn’t enough to fully anonymise someone, because combining data sets can dramatically reduce the number of people a record could belong to, which increases the likelihood of that person being identified.

The New York Times reported that a study titled “Unique in the Shopping Mall: On the Reidentifiability of Credit Card Metadata” demonstrated that “knowing just four random pieces of information was enough to re-identify 90 percent of the shoppers as unique individuals and to uncover their records.”
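
The narrowing effect is easy to see on a toy data set. The sketch below is not the method used in the study; it simply counts how many records become one of a kind as more indirect identifiers are combined. The records and field names are made up for illustration.

```python
from collections import Counter

# Made-up records that are already de-identified: no names,
# only indirect identifiers.
participants = [
    {"job": "nurse", "city": "Leeds", "birth_year": 1985},
    {"job": "nurse", "city": "Leeds", "birth_year": 1990},
    {"job": "nurse", "city": "York", "birth_year": 1985},
    {"job": "teacher", "city": "Leeds", "birth_year": 1985},
    {"job": "teacher", "city": "York", "birth_year": 1990},
    {"job": "nurse", "city": "Leeds", "birth_year": 1985},
]


def unique_share(records, fields):
    """Fraction of records whose combination of `fields` appears exactly once."""
    combos = Counter(tuple(r[f] for f in fields) for r in records)
    singled_out = sum(1 for r in records if combos[tuple(r[f] for f in fields)] == 1)
    return singled_out / len(records)


for fields in (["job"], ["job", "city"], ["job", "city", "birth_year"]):
    print(fields, round(unique_share(participants, fields), 2))
```

With these six records, nobody is unique by job alone, half are unique by job and city, and two thirds are unique once birth year is added, even though no direct identifier is present.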

Anonymised information

So if re-identification is so easy to do, what actually counts as anonymisation?

UKAN defines anonymisation as:

A process of ensuring that the risk of somebody being identified in the data is negligible.

Anonymisation involves doing more than simply de-identifying the data, and often requires that data be further altered or masked in some way in order to prevent statistical linkage.
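
One common way of altering data is generalisation: replacing precise values with coarser ones so that more people share the same combination. The Python sketch below is an illustration under assumptions, not UKAN’s process: the field names, the ten-year age bands, the postcode truncation and the sample records are all hypothetical, and a larger smallest group lowers the risk of linkage without guaranteeing anonymity.

```python
from collections import Counter


def generalise(record):
    """Coarsen indirect identifiers: exact age becomes a ten-year band,
    and a full postcode is cut down to the part before the space."""
    low = (record["age"] // 10) * 10
    return {
        "age_band": f"{low}-{low + 9}",
        "area": record["postcode"].split()[0],
        "answer": record["answer"],
    }


def smallest_group(records, fields):
    """Size of the smallest group of records sharing the same values for `fields`."""
    groups = Counter(tuple(r[f] for f in fields) for r in records)
    return min(groups.values())


# Illustrative records only.
raw = [
    {"age": 31, "postcode": "LS1 4AB", "answer": "yes"},
    {"age": 34, "postcode": "LS1 7XY", "answer": "no"},
    {"age": 38, "postcode": "LS1 2CD", "answer": "yes"},
    {"age": 52, "postcode": "YO1 9ZZ", "answer": "no"},
    {"age": 57, "postcode": "YO1 3EF", "answer": "yes"},
]

masked = [generalise(r) for r in raw]
print(smallest_group(raw, ["age", "postcode"]))
print(smallest_group(masked, ["age_band", "area"]))
```

In the raw records every person is alone in their group. After generalisation the smallest group holds two people, so no combination of age band and area points at a single participant.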

This is a lot harder to do. It involves knowing what other data sets the controller (you or your organisation) holds that could be linked back to people. The same applies to anyone you share your findings with.

What can I do about it?

In practice, a good start is to ask only for the information you really need to answer your research question. The more information you have, the greater the opportunity for statistical linkage, so not asking for it in the first place is prudent.
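
If responses arrive with more fields than the research question needs (from a form tool or a shared spreadsheet, say), the same principle can be applied at the point of intake. This is a minimal sketch; the list of needed fields and the sample response are hypothetical.

```python
# Hypothetical: the only fields this study needs to answer its research question.
NEEDED_FIELDS = {"role", "years_in_role", "feedback"}


def minimise(response):
    """Keep only the needed fields and report which ones were dropped."""
    kept = {k: v for k, v in response.items() if k in NEEDED_FIELDS}
    dropped = sorted(set(response) - NEEDED_FIELDS)
    return kept, dropped


# Made-up response containing two fields the study never needed.
response = {
    "role": "nurse",
    "years_in_role": 6,
    "feedback": "The rota tool is hard to use on a phone.",
    "date_of_birth": "1985-03-14",
    "home_city": "Leeds",
}

kept, dropped = minimise(response)
print(kept)
print(dropped)
```

Listing the dropped fields is a useful prompt in itself: if a field keeps turning up in that list, it probably shouldn’t be on the form at all.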

If you do collect additional information, how quickly you synthesise it, draw insights and then delete everything you don’t need to share can help reduce the risk of unintended re-identification.

Additionally, establishing greater control over who has access to the information can prevent future statistical linkage.

Anonymisation becomes complicated very quickly and I’m certainly not attempting to cover it in detail here. The aforementioned Anonymisation Decision Making Framework from UKAN is a great reference if you want to read more about this. There is also something called the Motivated Intruder Test, which can be used to understand how robust your process is.

This article is part 2 of 4 around best practice in informed consent. Next up, part 3: Decision making capacity.
