Your Personal Data Problem Is Worse Than You Think

It’s a hard fact that you can’t protect or manage something you can’t find or don’t know you have. How do you price insurance for your home if you don’t know what’s in it or what it’s worth? How can you establish a budget if you don’t know how much you spend every month and on what? How can you be sure a critical application server is secured if everyone uses the same admin account, and you don’t know who knows the password? Attempts to understand and manage risks around personal data across an organization suffer from this same problem of trying to implement a concrete control against something abstract and ephemeral, and the result is controls that just don’t work.

In his indispensable paper “A Taxonomy of Privacy,” Daniel J. Solove discusses a series of privacy-related harms that relate to data processing, including linking together personal information from otherwise unrelated data sources to develop a more complete profile of the user or to identify a previously anonymous individual. In a system designed for privacy, this set of problems can be described as “linkability” or reidentification vulnerabilities. The design of a system and the ability to combine existing data in new ways create the potential for exceeding the purpose for which the data was collected or for which the organization has some justification for processing, and of course open up the possibility for malicious shenanigans as well.

To my thinking, though, the problem is less about controlling the ability to link data stores together in novel ways and more about a general lack of imagination about what really constitutes personal data. In the early days of GDPR, when forward-thinking organizations were planning their compliance strategies, one of the biggest challenges was figuring out how to document data processing operations as required by Article 30. The language in the regulation seems pretty straightforward, but these companies discovered that they didn’t have even the most cursory understanding of their data collection and processing use cases, or what data they had, where it might be stored, and where it might be going, and that figuring it all out was a cracking big problem. Thus began grand efforts to build enterprise data maps, driven by questionnaires and collation exercises and fueling a growing market of automated data discovery and classification tools.

These strategies are fine and appropriate as far as they go. However, the problem remains that when we think about personal data, we’re mainly thinking about it as a forms problem. We describe it as structured data in well-documented and explicable data schemas, and as unstructured data in the form of messy fileshares stuffed with documents, spreadsheets, images, and other digital flotsam. But to understand the problem with a true privacy focus, you have to consider the data in potentia as well.

Imagine a simple application used for employee benefits provisioning that maintains records about employees, spouses, and dependents. The employee record might include data elements such as name, date of birth, date of hire, gender, marital status, department code, job code, etc., and the spouse/beneficiary records might be expected to be similar, excluding hire date, and so on. There’s nothing terribly scurrilous going on here; however, with the right privileges, the organization operating this application is one simple join away from generating a list of employees with same-sex spouses. This gets into potentially dangerous territory and constitutes a class of data the organization almost certainly doesn’t have a legal basis to retain and process.
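To make the mechanics concrete, here’s a minimal sketch using Python and an in-memory SQLite database; the table and column names (employees, beneficiaries) are hypothetical stand-ins, not any real benefits schema. Neither table records anything about sexual orientation, yet the join at the end produces exactly that category of data.

# A minimal sketch of the "one simple join away" problem, using invented
# table and column names rather than any real benefits system's schema.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Two innocuous-looking tables: neither stores sexual orientation.
cur.execute("CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT, gender TEXT, dept_code TEXT)")
cur.execute("CREATE TABLE beneficiaries (emp_id INTEGER, name TEXT, gender TEXT, relationship TEXT)")

cur.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", [
    (1, "Alice Smith", "F", "ENG"),
    (2, "Bob Jones", "M", "FIN"),
])
cur.executemany("INSERT INTO beneficiaries VALUES (?, ?, ?, ?)", [
    (1, "Carol Smith", "F", "spouse"),
    (2, "Dana Jones", "F", "spouse"),
])

# The join: a sensitive data set that appears nowhere in the schema itself.
rows = cur.execute("""
    SELECT e.name, b.name
    FROM employees e
    JOIN beneficiaries b ON b.emp_id = e.emp_id
    WHERE b.relationship = 'spouse' AND b.gender = e.gender
""").fetchall()
print(rows)  # [('Alice Smith', 'Carol Smith')]

The point isn’t the SQL; it’s that the sensitive category exists only at query time, which is why a schema review alone won’t surface it.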

The point is this: a typical organization has a lot more personal data in its grasp than can be readily identified by simply looking at the data schema and scanning its document hoard for a list of regex hits corresponding to known personal data types. This is part of what makes good anonymization so fiendishly difficult, and makes the over-reliance on data discovery automation so risky; some types of personal data only exist when the question about them is being asked. If you have a generous corpus of personal data plus a team of well-trained data scientists with quality tooling, and your risk analysis begins and ends with a simple list of extant data items, your data handling and retention policy is probably a lot more aspirational than you’d like to believe and your privacy risk exposure quite a bit higher.
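To illustrate what pattern-based discovery actually sees, here’s a toy Python scanner with a couple of common regexes (email, US SSN); the patterns and the sample text are invented for the sketch. It happily flags the obvious identifiers, but a derived category like the spousal join above never appears as a string to match, so no regex will ever find it.

# A toy pattern-based "data discovery" scan: it can only report strings
# that match a known pattern, never data that exists only when derived.
import re

PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def scan(text):
    """Return every match for each known personal-data pattern."""
    return {label: rx.findall(text) for label, rx in PATTERNS.items()}

doc = "Contact alice@example.com, SSN 123-45-6789, beneficiary: Carol (spouse)."
print(scan(doc))
# {'email': ['alice@example.com'], 'us_ssn': ['123-45-6789']}
# Nothing here will ever flag "employee with a same-sex spouse."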

A Lesson in Wrecking User Trust

The news was breathless, and the outrage was savage. Audacity is now spyware! Remove it immediately! Fury! Betrayal! Torches! Pitchforks!

First, a little background. Audacity is a venerable open source audio processing application that’s been around since 2000. It has a vast and devoted user base (including your humble author) and developer community. It’s one of the most recognized and recommended audio recording and editing tools for amateur audiophiles and content producers, though I’m sure there’s some penetration into the pro ranks as well, and it’s routinely rated in the top echelon of open source applications, alongside other stalwarts such as LibreOffice, VLC, GIMP, and the Linux operating system itself. It’s not a minor project.

So there was some concern when it was acquired by Muse Group in April of last year. FOSS proponents are famously prickly about any perceived threat to software freedoms and famously distrustful of corporate motives, and their suspicions seemed to bear fruit with an almost-immediate pull request adding Google Analytics and Yandex telemetry to the desktop application, which had never had any kind of data collection or phone-home capabilities before. The draft update to the policy notice discussing the new behavior didn’t do much to soften the message, including language such as “Data necessary for law enforcement, litigation and authorities’ requests…” and noting that while the data would be stored in a limited fashion on Muse Group’s servers in the EEA, they might transfer it to their offices in the USA and Russia.

The reaction was swift and unforgiving. The request itself was inundated with thousands of negative votes and rage-fueled comments, and the project was forked multiple times within hours. Muse Group quickly changed course with a post from new maintainer Martin Keary promising to drop the telemetry features and attempting to respin the whole mess as a “bad communication/coordination blunder,” but the damage was already done.

This entire kerfuffle was not the result of a regulatory or compliance failure. There was no breach report, no official complaint, no data protection authority investigation. The proposed changes hadn’t even been merged back to the main branch. Muse Group communicated directly to the user and developer community what they were planning to implement, and that’s when all hell broke loose.

User trust is fragile, and it doesn’t require a big misstep to break it. I’ve seen many organizations taking very tentative steps toward privacy entirely in response to the fear of regulatory enforcement, only to lose interest when regulators weren’t immediately showing up on their doorsteps with clipboards and sledgehammers. The problem for these organizations is that privacy regulations are something of a trailing indicator for public sentiment, so being fined or clobbered with some other enforcement action isn’t always the most pressing risk. If your organization is still considering privacy as a cost of doing business, or you’re doing the absolute minimum to meet regulatory requirements in your target markets, you might wake up to discover that the real enforcers aren’t over-worked state district attorneys or European data protection authorities, but your own customers.

Just Because You Can Do Something…

Right before this year’s midterm elections, the New York Times ran a piece about a pair of helpful little social shaming apps with the superficially noble goal of driving voter turnout. VoteWithMe and OutVote are quasi-social networking apps that match people in your contact list against scraped public voter records, including party affiliation and whether a given individual voted in the last election. They can then assist you in reminding—or pressuring, hounding, humiliating, whatever—the laggards in your group to get to the polls. Predictably, these apps also helpfully collected other bits of personal information that weren’t entirely necessary to achieve the stated goals of furthering civic engagement.

What these apps are doing is probably kinda mostly legal, because the data being collected and presented to the user appears to be based solely on public records, and most privacy regulations contain some sort of derogation regarding the use of data that has already been publicly disclosed. But the fact that it’s technically possible to do something and that the current regulatory environment allows you to get away with it without serious legal repercussions doesn’t mean that you should just forge on ahead.

If you’ve been following the privacy business for any length of time, you’re undoubtedly familiar with the now-infamous Strava case. But here’s a quick recap: Strava is a San Francisco-based social network that allows its users to share their workouts based on GPS data captured from wearables such as Fitbits. In November of 2017, Strava released a new “Global Heatmap” feature, in which they highlighted running and cycling routes and related data from their users’ collected GPS points. It was a nifty new way to visualize this massive data hoard, but it was quickly discovered to reveal the location and internal navigable routes of a number of secret military installations around the world.

You can argue that this was the result of a massive OPSEC failure on the part of the affected Western militaries and intelligence organizations, and you can also make a strong case for this resulting from weaknesses in Strava’s notification and consent management practices, further underscored by the changes Strava later implemented in its privacy policy and opt-out functions for Heatmap data gathering. The key point, though, is that personal data is highly nuanced, it’s relatively simple to inadvertently reveal information that wasn’t explicitly included in the original data set, and novel uses for existing data can result in disclosures the data subjects never consented to in the first place.

Exactly how one votes is generally confidential in the US. Your ballots aren’t released to the public domain, and there are no legal circumstances I’m aware of in which you would be required to divulge that information. However, that’s really an academic point in today’s highly polarized political environment; if I know your party affiliation and whether you voted at all in the last election, I can likely deduce how you and everyone else in your district voted using data analysis techniques no more sophisticated than sorting a spreadsheet.
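As a toy illustration of just how little analysis that takes, here’s a sketch in Python against a made-up voter file containing nothing but party affiliation and a turnout flag; the field names and records are invented, not drawn from any real data set.

# Inferring a "likely vote" from public registration and turnout data is
# essentially a filter and a tally; the data below is invented.
from collections import Counter

voter_file = [
    {"name": "A. Doe", "party": "Party X", "voted_last_election": True},
    {"name": "B. Roe", "party": "Party Y", "voted_last_election": True},
    {"name": "C. Poe", "party": "Party X", "voted_last_election": False},
]

# "Likely vote" is just party affiliation, filtered by whether they showed up.
likely_votes = {v["name"]: v["party"] for v in voter_file if v["voted_last_election"]}
print(likely_votes)                    # {'A. Doe': 'Party X', 'B. Roe': 'Party Y'}
print(Counter(likely_votes.values()))  # a crude district-level tally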

We don’t have to go very far back in history to find instances in which political affiliation caused dangerous problems, and the NYT piece offers up a small catalog of ways aggregated voter information could be put to malicious purposes. There’s a reason that political affiliation is treated with the same gravity as health data in Europe, where historical memory is a bit longer than in the US. My deeper concern has more to do with the ethical considerations, the optics, and how the intersection of the two affects the desired outcomes.

One of the most important considerations in deciding how to use personal data for any purpose is to first ask if the proposed use is something the data subject would expect and consider reasonable. If I provide my mobile number to a specialty online retailer in the process of making a purchase, I would consider it reasonable for that retailer to call me on that number with a question about my order, to let me know there was a shipping delay, etc. I would consider it reasonable for my primary care physician to disclose aspects of my medical history to an EMT if I were unconscious and heading to the hospital in the back of an ambulance. I would even consider it reasonable if a sales rep called me to pitch something if I’d dropped my business card in a fishbowl at that company’s booth at a conference. But! I would not consider it reasonable for my political party affiliation and past voting behavior to be disclosed to a client who happened to have my mobile number in her address book, and to have that information used to allow her to start an argument with me about my politics.

Which brings us to what this means for outcomes. I would love to see data on how these apps performed against their goals while in-market. I don’t have any doubt they did drive some additional turnout in tech-savvy districts, but I’m also willing to bet they created some unnecessary tension, turned some people off the process, and created a pool of data ripe for misuse in the future. It’s a surprisingly tone-deaf effort at a time when public concern over privacy and the misuse of personal data is the highest it’s been in a very long time.

Security Perception and Supply Chain Poisoning: Those Pesky Little Chips

By now it’s inevitable you’ve run across the Bloomberg article discussing how nefarious forces in China managed to surreptitiously add malicious chips to server motherboards supplied by Super Micro, and how those boards ended up in servers deployed in marquee cloud providers’ data centers, including those operated by Amazon and Apple, and maybe even the US military and intelligence agencies. These chips supposedly provide China’s People’s Liberation Army the ability to secretly monitor server and network activity as well as alter I/O to the affected machines’ CPUs.

But if you’ve read the Bloomberg article, you’ve also likely caught the wave of meta-journalism around the topic, including vociferous denials from both Amazon and Apple that anything of the sort was discovered, as well as statements from some of the “officials” referred to in the original story that muddy the whole thing up.

This is a deliciously insidious problem, because it perfectly feeds standard conspiracy theory tropes. The big cloud providers will deny their hardware has been irreparably compromised, because of course they would—to do otherwise is to cast doubt on the safety and security of the bulk of their offerings. The involved “officials” will make similar denials as well, because they don’t want to be in the headline above the fold when the cloud industry goes into a tailspin.

I’m not trying to stoke conspiracy chatter around this topic, or to suggest that the creeping body horror scenario of little chips embedded in everything, watching every move we make, is the new reality, or that the companies and officials we trust to ferret out these little nasties are covering the whole thing up to quell panic. I don’t have the data in front of me I would need to make a conclusive judgment one way or the other. Is it possible that surveillance chips are being slipped into production boards? Sure, though—wow—what a complicated attack that would be. If there aren’t chips inserted into the boards, is it still possible that there is unseen surveillance code buried in the firmware or microcode? Sure, that’s also possible and a little more probable, actually. The real takeaway, however, is that the end functionality of the entire stack from silicon to user space software depends on a complex web of trust between parties that increasingly don’t trust each other.

We’re already well past the point at which we should have stopped treating security as the final lick of paint before dropping alpha code to paying customers. But this is another strong case for getting serious about defense in depth: the perimeter doesn’t trust the Internet; the segment doesn’t trust the backbone; the network doesn’t trust the host; the host doesn’t trust the network, the software, or the user; and absolutely everything is monitored and logged (and someone or something, preferably a properly-tuned SIEM, is actually doing something smart with the logs). Given that so many organizations haven’t even begun flirting with the bare minimum of information security, we’re probably not ready to get too wound up about secret little Chinese chips.