It’s a hard fact that you can’t protect or manage something you can’t find or don’t know you have. How do you price insurance for your home if you don’t know what’s in it or what it’s worth? How can you establish a budget if you don’t know how much you spend every month and on what? How can you be sure a critical application server is secured if everyone uses the same admin account, and you don’t know who knows the password? Attempts to understand and manage risks around personal data across an organization suffer from this same problem of trying to implement a concrete control against something abstract and ephemeral, and it results in controls that just don’t work.
In his indispensable paper "A Taxonomy of Privacy", Daniel J. Solove discusses a series of privacy-related harms that relate to data processing, including linking together personal information from otherwise unrelated data sources to develop a more complete profile of the user or to identify a previously anonymous individual. In a system designed for privacy, this set of problems can be described as "linkability" or reidentification vulnerabilities. The design of a system and the ability to combine existing data in new ways create the potential for exceeding the purpose for which the data was collected or for which the organization has some justification for processing, and of course open up the possibility for malicious shenanigans as well.
To my thinking, though, the problem is less about controlling the ability to link data stores together in novel ways and more about a general lack of imagination about what really constitutes personal data. In the early days of GDPR when forward-thinking organizations were planning their compliance strategies, one of the biggest challenges was figuring out how to document data processing operations as required by Article 30. The language in the regulation seems pretty straightforward, but these companies discovered that they didn’t have even the most cursory understanding of their data collection and processing use cases, or what data they had, where it might be stored, and where it might be going, and that figuring it all out was a cracking big problem. Thus began grand efforts to build enterprise data maps, driven by questionnaires and collation efforts, and fueling a growing market of automated data discovery and classification tools.
These strategies are fine and appropriate as far as they go. However, the problem remains that when we think about personal data, we're mainly thinking about it as a forms problem. We describe it as structured data in well-documented and explicable data schemas, and as unstructured data in the form of messy fileshares stuffed with documents, spreadsheets, images, and other digital flotsam. But to understand the problem with a true privacy focus, you have to consider the data in potentia as well.
Imagine a simple application used for employee benefits provisioning that maintains records about employees, spouses, and dependents. The employee record might include data elements such as name, date of birth, date of hire, gender, marital status, department code, job code, etc., and the spouse/beneficiary records might be expected to be similar, excluding hire date, and so on. There's nothing terribly scurrilous going on here; however, with the right privileges, the organization operating this application is one simple join away from generating a list of employees with same-sex spouses. This gets into potentially dangerous territory, and constitutes a class of data the organization almost certainly doesn't have a legal basis to retain and process.
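The scenario above can be sketched concretely. This is a toy illustration only: the schema, table names, and records are invented for this example, not drawn from any real benefits system.

```python
import sqlite3

# Toy schema for the hypothetical benefits application described above.
# All table names, columns, and records are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, gender TEXT);
CREATE TABLE spouses   (emp_id INTEGER, name TEXT, gender TEXT);

INSERT INTO employees VALUES (1, 'Alice', 'F'), (2, 'Bob', 'M'), (3, 'Carol', 'F');
INSERT INTO spouses   VALUES (1, 'Dana', 'F'), (2, 'Erin', 'F');
""")

# One simple join derives a sensitive category that appears in neither
# table's schema: employees with same-sex spouses.
rows = conn.execute("""
    SELECT e.name
    FROM employees e
    JOIN spouses s ON s.emp_id = e.emp_id
    WHERE e.gender = s.gender
""").fetchall()

print([r[0] for r in rows])  # -> ['Alice']
```

Note that no column anywhere in the schema is labeled "sexual orientation"; the sensitive data item exists only in potentia, until the query is run.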
The point is this: a typical organization has a lot more personal data in its grasp than can be readily identified by simply looking at the data schema and scanning its document hoard for a list of regex hits corresponding to known personal data types. This is part of what makes good anonymization so fiendishly difficult, and makes the over-reliance on data discovery automation so risky; some types of personal data only exist when the question about them is being asked. If you have a generous corpus of personal data plus a team of well-trained data scientists with quality tooling, and your risk analysis begins and ends with a simple list of extant data items, then your data handling and retention policy is probably a lot more aspirational than you'd like to believe, and your privacy risk exposure quite a bit higher.
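The limitation of pattern-based discovery can be sketched in a few lines. The patterns and the record below are invented for this illustration; real discovery tools are far more sophisticated, but the blind spot is the same in kind.

```python
import re

# Naive discovery: scan a record for regex hits on known personal-data
# patterns (just email and US-style SSN here -- illustrative only).
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

record = "Alice, alice@example.com, gender: F, spouse gender: F"

hits = {label for label, pat in PATTERNS.items() if pat.search(record)}
print(hits)  # -> {'email'}

# The scanner flags the email address, but the sensitive *derived* fact --
# a same-sex marriage implied by the two gender fields taken together --
# matches no pattern and goes unflagged.
```

The catalog produced by such a scan lists only extant data items; the derived categories remain invisible until someone asks the question.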