Your Personal Data Problem Is Worse Than You Think

It’s a hard fact that you can’t protect or manage something you can’t find or don’t know you have. How do you price insurance for your home if you don’t know what’s in it or what it’s worth? How can you establish a budget if you don’t know how much you spend every month and on what? How can you be sure a critical application server is secured if everyone uses the same admin account and you don’t know who knows the password? Attempts to understand and manage risks around personal data across an organization suffer from this same problem: you end up trying to implement concrete controls against something abstract and ephemeral, and the result is controls that just don’t work.

In his indispensable paper “A Taxonomy of Privacy,” Daniel J. Solove discusses a series of privacy-related harms that relate to data processing, including linking together personal information from otherwise unrelated data sources to develop a more complete profile of a person or to identify a previously anonymous individual. In a system designed for privacy, this set of problems can be described as “linkability” or re-identification vulnerabilities. The design of a system, and the ability to combine existing data in new ways, creates the potential for exceeding the purpose for which the data was collected or for which the organization has some other justification for processing it, and of course it opens up the possibility for malicious shenanigans as well.

To my thinking, though, the problem is less about controlling the ability to link data stores together in novel ways and more about a general lack of imagination about what really constitutes personal data. In the early days of GDPR, when forward-thinking organizations were planning their compliance strategies, one of the biggest challenges was figuring out how to document data processing operations as required by Article 30. The language in the regulation seems pretty straightforward, but these companies discovered that they didn’t have even the most cursory understanding of their data collection and processing use cases: what data they had, where it might be stored, and where it might be going. Figuring it all out was a cracking big problem. Thus began grand efforts to build enterprise data maps, driven by questionnaires and manual collation, fueling a growing market of automated data discovery and classification tools.

These strategies are fine and appropriate as far as they go. However, the problem remains that when we think about personal data, we’re mainly thinking about it as a forms problem. We describe it as structured data in well-documented and explicable data schemas, and as unstructured data in the form of messy fileshares stuffed with documents, spreadsheets, images, and other digital flotsam. But to understand the problem with a true privacy focus, you have to consider the data in potentia as well.

Imagine a simple application used for employee benefits provisioning that maintains records about employees, spouses, and dependents. The employee record might include data elements such as name, date of birth, date of hire, gender, marital status, department code, job code, and so on, and the spouse/dependent records might be expected to be similar, minus the hire date. There’s nothing terribly nefarious going on here; however, with the right privileges, the organization operating this application is one simple join away from generating a list of employees with same-sex spouses. That gets into potentially dangerous territory and constitutes a class of data the organization almost certainly doesn’t have a legal basis to retain and process.
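To make that concrete, here is a minimal sketch of the “one simple join,” using pandas and entirely hypothetical table and column names; in a real application this would just as easily be a single SQL join against the benefits database.

    import pandas as pd

    # Hypothetical, innocuous-looking benefits records. Nothing here is
    # especially sensitive beyond ordinary HR data elements.
    employees = pd.DataFrame([
        {"employee_id": 1, "name": "A. Chen",   "gender": "M", "marital_status": "married"},
        {"employee_id": 2, "name": "D. Flores", "gender": "F", "marital_status": "married"},
        {"employee_id": 3, "name": "S. Patel",  "gender": "M", "marital_status": "single"},
    ])
    spouses = pd.DataFrame([
        {"employee_id": 1, "spouse_name": "J. Chen",   "gender": "M"},
        {"employee_id": 2, "spouse_name": "M. Flores", "gender": "M"},
    ])

    # One join and one comparison produce a category of personal data that
    # appears nowhere in either schema.
    joined = employees.merge(spouses, on="employee_id", suffixes=("_employee", "_spouse"))
    same_sex_spouses = joined[joined["gender_employee"] == joined["gender_spouse"]]
    print(same_sex_spouses[["name", "spouse_name"]])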

The point is this: a typical organization has a lot more personal data in its grasp than can be readily identified by simply looking at the data schema and scanning its document hoard for a list of regex hits corresponding to known personal data types. This is part of what makes good anonymization so fiendishly difficult, and what makes over-reliance on data discovery automation so risky; some types of personal data only exist when the question about them is being asked. If you have a generous corpus of personal data plus a team of well-trained data scientists with quality tooling, or if your risk analysis begins and ends with a simple list of extant data items, your data handling and retention policies are probably a lot more aspirational than you’d like to believe, and your privacy risk exposure quite a bit higher.
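To see why pattern-based discovery alone can’t surface this kind of data, consider a toy scanner. The patterns below are made up for illustration rather than drawn from any particular product, but the gap is the same: a scanner can only flag the data types someone thought to define in advance.

    import re

    # A few illustrative patterns for well-known personal data types.
    PII_PATTERNS = {
        "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
        "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
        "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    }

    def scan(text: str) -> dict:
        """Return matches for each known personal data type found in the text."""
        return {label: pattern.findall(text)
                for label, pattern in PII_PATTERNS.items()
                if pattern.search(text)}

    # The scanner happily flags the obvious items...
    print(scan("Contact: pat@example.com, SSN 123-45-6789, phone 555-867-5309"))
    # ...but a derived fact like 'this employee has a same-sex spouse' never matches,
    # because it has no signature until someone runs the query that creates it.
    print(scan("employee_id=1 employee_gender=M spouse_gender=M"))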

A Lesson in Wrecking User Trust

The news was breathless, and the outrage was savage. Audacity is now spyware! Remove it immediately! Fury! Betrayal! Torches! Pitchforks!

First, a little background. Audacity is a venerable open source audio processing application that’s been around since 2000. It has a vast and devoted user base (including your humble author) and developer community. It’s one of the most recognized and recommended audio recording and editing tools for amateur audio enthusiasts and content producers, though I’m sure there’s some penetration into the pro ranks as well, and it’s routinely rated in the top echelon of open source applications, alongside other stalwarts such as LibreOffice, VLC, GIMP, and the Linux operating system itself. It’s not a minor project.

So there was some concern when it was acquired by Muse Group in April 2021. FOSS proponents are famously prickly when it comes to any perceived threat to software freedoms and in their distrust of corporate motives, and their suspicions seemed to bear fruit with an almost-immediate pull request adding Google Analytics and Yandex telemetry to the desktop application, which had never had any kind of data collection or phone-home capability before. The draft update to the policy notice discussing the new behavior didn’t do much to soften the message, including language such as “Data necessary for law enforcement, litigation and authorities’ requests…” and noting that while the data would be stored in a limited fashion on Muse Group’s servers in the EEA, it might be transferred to their offices in the USA and Russia.

The reaction was swift and unforgiving. The request itself was inundated with thousands of negative votes and rage-fueled comments, and the project was forked multiple times within hours. Muse Group quickly changed course with a post from new maintainer Martin Keary promising to drop the telemetry features and attempting to respin the whole mess as a “bad communication/coordination blunder,” but the damage was already done.

This entire kerfuffle was not the result of a regulatory or compliance failure. There was no breach report, no official complaint, no data protection authority investigation. The proposed changes hadn’t even been merged back to the main branch. Muse Group communicated directly to the user and developer community what they were planning to implement, and that’s when all hell broke loose.

User trust is fragile, and it doesn’t require a big misstep to break it. I’ve seen many organizations take very tentative steps toward privacy entirely in response to the fear of regulatory enforcement, only to lose interest when regulators weren’t immediately showing up on their doorsteps with clipboards and sledgehammers. The problem for these organizations is that privacy regulation is something of a trailing indicator of public sentiment, so being fined or clobbered with some other enforcement action isn’t always the most pressing risk. If your organization still treats privacy as a cost of doing business, or you’re doing the absolute minimum to meet regulatory requirements in your target markets, you might wake up to discover that the real enforcers aren’t overworked state attorneys general or European data protection authorities, but your own customers.