Reidentification as Basic Science (Re-Identification Symposium)

By Michelle Meyer

This post is part of Bill of Health's symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. We'll continue to post contributions into next week. —MM

Arvind Narayanan (Ph.D. 2009) is an Assistant Professor of Computer Science at Princeton. He studies information privacy and security and has a side-interest in technology policy. His research has shown that data anonymization is broken in fundamental ways, for which he jointly received the 2008 Privacy Enhancing Technologies Award. Narayanan is one of the researchers behind the “Do Not Track” proposal. His most recent research direction is the use of Web measurement to uncover how companies are using our personal information.

Narayanan is an affiliated faculty member at the Center for Information Technology Policy at Princeton and an affiliate scholar at Stanford Law School’s Center for Internet and Society. You can follow him on Twitter at @random_walker.

By Arvind Narayanan

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

Let me elaborate on why reidentification algorithms are interesting and important. First, they yield fundamental insights about people — our interests, preferences, behavior, and connections — as reflected in the datasets collected about us. Second, as is the case with most basic science, these algorithms turn out to have a variety of applications other than reidentification, both for good and bad. Let us consider some of these.

First and foremost, reidentification algorithms are directly applicable in digital forensics and intelligence. Analyzing the structure of a terrorist network (say, based on surveillance of movement patterns and meetings) to assign identities to nodes is technically very similar to social network deanonymization. A reidentification researcher I know, a U.S. citizen, tells me he has been contacted more than once by intelligence agencies to apply his expertise to their data.

Homer et al.’s work on identifying individuals in DNA mixtures is another great example of how forensics algorithms are inextricably linked to privacy-infringing applications. In addition to DNA and network structure, writing style and location trails are other attributes that have been utilized both in reidentification and forensics.

It is not a coincidence that the reidentification literature often uses the word “fingerprint” — this body of work has generalized the notion of a fingerprint beyond physical attributes to a variety of other characteristics. Just like physical fingerprints, there are good uses and bad, but regardless, finding generalized fingerprints is a contribution to human knowledge. A fundamental question is how much information (i.e., uniqueness) there is in each of these types of attributes or characteristics. Reidentification research is gradually helping answer this question, but much remains unknown.

It is not only people that are fingerprintable — so are various physical devices. A wonderful set of (unrelated) research papers has shown that many types of devices, objects, and software systems, even supposedly identical ones, have unique fingerprints: blank paper, digital cameras, RFID tags, scanners and printers, and web browsers, among others. The techniques are similar to reidentification algorithms, and once again straddle security-enhancing and privacy-infringing applications.

Even more generally, reidentification algorithms are classification algorithms for the case when the number of classes is very large. Classification algorithms categorize observed data into one of several classes, i.e., categories. They are at the core of machine learning, but typical machine-learning applications rarely need to consider more than several hundred classes. Thus, reidentification science is helping develop our knowledge of how best to extend classification algorithms as the number of classes increases.
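To make the analogy concrete, here is a minimal sketch in Python (all names and numbers are invented) of reidentification viewed as classification: each known individual in an auxiliary dataset is treated as a class of their own, and an anonymized record is assigned to the nearest class, or to no one if the match is weak. Real reidentification algorithms add careful scoring, sparsity handling, and robustness to noise, but the classification skeleton is the same.

```python
import numpy as np

# Toy auxiliary data: one attribute vector per known individual.
# In a real setting these could be movie ratings, location visits, etc.
aux = {
    "alice": np.array([5, 0, 3, 0, 1], dtype=float),
    "bob":   np.array([0, 4, 0, 2, 0], dtype=float),
    "carol": np.array([1, 0, 4, 0, 5], dtype=float),
}

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return 0.0 if denom == 0 else float(u @ v) / denom

def classify(anonymous_record, aux, threshold=0.8):
    """Assign the record to the best-matching 'class' (person),
    or to no one if the match is not confident enough."""
    scores = {name: cosine(anonymous_record, vec) for name, vec in aux.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None

# An "anonymized" record that largely overlaps with Carol's attributes.
print(classify(np.array([1, 0, 5, 0, 4], dtype=float), aux))  # -> 'carol'
```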

Moving on, research on reidentification and other types of “leakage” of information reveals a problem with the way data-mining contests are run. Most commonly, some elements of a dataset are withheld, and contest participants are required to predict these unknown values. Reidentification allows contestants to bypass the prediction process altogether by simply “looking up” the true values in the original data! For an example and more elaborate explanation, see this post on how my collaborators and I won the Kaggle social network challenge. Demonstrations of information leakage have spurred research on how to design contests without such flaws.
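As a hedged illustration of the "lookup" pattern (this is not the actual Kaggle attack, which matched graph structure across the anonymized contest graph and a public social network), the sketch below shows how a contestant who can join de-identified test records to a public copy of the same data on a few quasi-identifiers can read off the withheld target values instead of predicting them. All column names and values are invented.

```python
import pandas as pd

# Public copy of (roughly) the same underlying data, with the target present.
public = pd.DataFrame({
    "zip": ["08540", "08540", "10027"],
    "birth_year": [1980, 1975, 1990],
    "sex": ["F", "M", "F"],
    "target": [1, 0, 1],          # the value the contest asks us to predict
})

# De-identified contest test set: names and IDs removed, target withheld.
test = pd.DataFrame({
    "row_id": [101, 102],
    "zip": ["10027", "08540"],
    "birth_year": [1990, 1975],
    "sex": ["F", "M"],
})

# "Prediction" by lookup: join on quasi-identifiers and read off the target.
leaked = test.merge(public, on=["zip", "birth_year", "sex"], how="left")
print(leaked[["row_id", "target"]])
```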

If reidentification can cause leakage and make things messy, it can also clean things up. In a general form, reidentification is about connecting common entities across two different databases. Quite often in real-world datasets there is no unique identifier, or it is missing or erroneous. Just about every programmer who does interesting things with data has dealt with this problem at some point. In the research world, William Winkler of the U.S. Census Bureau has authored a survey of “record linkage”, covering well over a hundred papers. I’m not saying that the high-powered machinery of reidentification is necessary here, but the principles are certainly useful.
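For concreteness, here is a minimal record-linkage sketch using only the Python standard library: records are paired by approximate name similarity plus an agreeing secondary field. The field names and threshold are illustrative; the methods Winkler surveys use much more careful blocking and probabilistic scoring, but the flavor is similar.

```python
from difflib import SequenceMatcher

def name_similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(records_a, records_b, threshold=0.85):
    """Greedy linkage: pair each record in A with its most similar record in B,
    requiring an agreeing year and a name-similarity score above the threshold."""
    links = []
    for ra in records_a:
        candidates = [rb for rb in records_b if rb["year"] == ra["year"]]
        if not candidates:
            continue
        scored = [(name_similarity(ra["name"], rb["name"]), rb["id"]) for rb in candidates]
        score, match_id = max(scored)
        if score >= threshold:
            links.append((ra["id"], match_id))
    return links

a = [{"id": 1, "name": "Jon A. Smith", "year": 1982},
     {"id": 2, "name": "Maria Garcia", "year": 1990}]
b = [{"id": "x", "name": "John A Smith", "year": 1982},
     {"id": "y", "name": "M. Garcia", "year": 1990}]

print(link(a, b))  # [(1, 'x')] -- "M. Garcia" scores below the threshold against "Maria Garcia"
```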

In my brief life as an entrepreneur, I utilized just such an algorithm for the back-end of the web application that my co-founders and I built. The task in question was to link a (musical) artist profile from last.fm to the corresponding Wikipedia article based on discography information (linking by name alone fails in any number of interesting ways). On another occasion, for the theory of computing blog aggregator that I run, I wrote code to link authors of papers uploaded to arXiv to their DBLP profiles based on the list of coauthors.
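In both of those tasks the matching signal is a set of associated items (album titles in one case, coauthor names in the other), so even a plain Jaccard-overlap score gets surprisingly far. Here is a toy sketch of that idea with invented data, rather than the actual code from either project:

```python
def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a or b) else 0.0

def best_match(query_items, candidates, threshold=0.4):
    """candidates: dict mapping candidate id -> set of associated items
    (e.g. album titles for an artist, coauthor names for an author)."""
    scored = {cid: jaccard(query_items, items) for cid, items in candidates.items()}
    best = max(scored, key=scored.get)
    return best if scored[best] >= threshold else None

# Link a last.fm-style artist to a Wikipedia-style article by shared albums.
lastfm_albums = {"ok computer", "kid a", "in rainbows"}
wikipedia = {
    "Radiohead": {"pablo honey", "ok computer", "kid a", "in rainbows"},
    "Muse": {"showbiz", "absolution", "drones"},
}
print(best_match(lastfm_albums, wikipedia))  # -> 'Radiohead'
```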

There is more, but I’ll stop here. The point is that these algorithms are everywhere.

If the algorithms are the key, why perform demonstrations of privacy failures? To put it simply, algorithms can’t be studied in a vacuum; we need concrete cases to test how well they work. But it’s more complicated than that. First, as I mentioned earlier, keeping the privacy conversation intellectually honest is one of my motivations, and these demonstrations help. Second, in the majority of cases, my collaborators and I have chosen to examine pairs of datasets that were already public, and so our work did not uncover the identities of previously anonymous subjects, but merely helped to establish that this could happen in other instances of “anonymized” data sharing.

Third, and I consider this quite unfortunate, reidentification results are taken much more seriously if researchers do uncover identities, which naturally gives us an incentive to do so. I’ve seen this in my own work — the Netflix paper is the most straightforward and arguably the least scientifically interesting reidentification result that I’ve co-authored, and yet it has received by far the most attention, all because it was carried out on an actual dataset published by a company rather than demonstrated hypothetically.

My primary focus on the fundamental research aspect of reidentification guides my work in an important way. There are many, many potential targets for reidentification — despite all the research, data holders often (rationally) act like nothing has changed and continue to make data releases with “PII” removed. So which dataset should I pick to work on?

Focusing on the algorithms makes it a lot easier. One of my criteria for picking a reidentification question to work on is that it must lead to a new algorithm. I’m not at all saying that all reidentification researchers should do this, but for me it’s a good way to maximize the impact I can hope for from my research, while minimizing controversies about the privacy of the individuals represented in the datasets I study.

I hope this post has given you some insight into my goals, motivations, and research outputs, and an appreciation of the fact that there is more to reidentification algorithms than their application to breaching privacy. It will be useful to keep this fact in mind as we continue the conversation on the ethics of reidentification.

Thanks to Vitaly Shmatikov for reviewing a draft.

Comments

  1. Arvind,

    Thanks for this interesting post.
    I agree with your point that, where possible, reID experiments should be conducted on real datasets with known identities. However, technically, these experiments suffer from two caveats that end-to-end reID experiments do not.

    First, they are prone to ascertainment bias. The records of self-identifying people might be biased towards people who are more open about sharing information. Thus, the inference algorithm has more hints and available data for the reID attempts, which might artificially boost the reID success rate. In addition, attitudes and norms towards data sharing have been found to correlate with socio-economic status, ethnic group, and sex. This again introduces self-ascertainment bias into the experiment. So there is a risk that the success of the algorithm on the identified data does not provide a reliable answer for the data as a whole.

    Second, there is a risk of over-fitting of the algorithm or flawed analysis. Generally, best scientific practice is to conduct blind experiments. Multiple lines of empirical study have shown that honest scientific mistakes tend to be made in ways that favor the alternative hypothesis and make experiments work. See this recent Nature review on the replicability of the scientific literature and the importance of blind experiments: https://bit.ly/ZdV0KK

    End-to-end reID experiments that uncover hidden identities are not subject to these technical and scientific caveats. They are more powerful not because they are more “sexy” but because of their scientific merits.

  2. Thanks, Arvind, for this post, and thanks to Yaniv for the comment. Two questions for both of you:

    (1) Yaniv suggests that conducting re-ID research with people who self-identify might yield a biased sample that limits the usefulness of the resulting data. As you both know, sample bias is a perennial problem in research. Perhaps most notably, those of us who volunteer to participate in research are almost certainly different from those who decline. And so a rule that requires consent from research participants almost certainly affects the generalizability of the results. And yet, outside the realm of minimal-risk research, emergency research, and a few other exceptional areas, we nevertheless generally require consent. I wonder whether either of you has a view on this question of using self-identifying people in light of the broader framework that governs human subjects research.

    (2) Arvind writes that one of his “criteria for picking a reidentification question to work on is that it must lead to a new algorithm.” Yaniv, I gather that you have a different perspective on this, based on your view of the value of replications of re-ID studies. Can you both speak a bit more about this issue?

  3. Very interesting piece indeed, thanks.
    Yes, it is unfortunate that to be taken seriously, these demonstrations have to work on real people. I assume you mean it’s regrettable that real privacy is breached. Now we’re getting close to the gist of this symposium: many of these demonstrations breach people’s privacy.
    Many scientific experiments have a human downside, a cost. We all know this. The ethical question is whether the ends justify the means. I don’t see any of our reidentification proponents yet tackling this question with respect to privacy. In fact some are quite irritated by the suggestion that privacy has been breached. They insist that the PGP consent contemplated the downside, warning participants that they should expect no privacy, so what’s the problem? Well, if privacy is breached, even if it’s predictable and the participants have accepted that, the question remains: what are the ethics of the third parties doing the breaching?

  4. Yaniv, yes, that’s a very good point.

    It looks like there are two slightly different types of biases being talked about here.

    1. A user chooses to participate in a service that doesn’t claim to offer any privacy, and is thus not representative of a user who participates in a service that claims to deidentify users.

    2. A user who is already using a service consents to participating in a study on reidentification, and is thus not representative of a typical user of the service.

    To me the latter bias is stronger. The former doesn’t bother me quite as much. But I can see how people may draw the line at different places.

    The above is, incidentally, one answer to Michelle’s question: if we’re researching reidentification, the variable that affects the probability of consent (privacy preference) also directly affects what we’re measuring (probability of reidentification). So it’s not straightforward to require researchers to get consent, even if it’s the norm for other types of studies. A study on (say) cancer doesn’t have this peculiar problem, because privacy preferences can be assumed to be uncorrelated (or only minimally correlated) with cancer genes.
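    To make that concrete, here is a tiny simulation sketch (all numbers invented): if privacy preference lowers both the chance of consenting and the chance of being reidentifiable, then the reidentification rate measured only on consenters overstates the rate in the population as a whole.

    ```python
    import random

    random.seed(0)
    population = []
    for _ in range(100_000):
        privacy_pref = random.random()            # 0 = shares everything, 1 = very private
        consents = random.random() < (1 - 0.8 * privacy_pref)
        reidentifiable = random.random() < (0.9 - 0.6 * privacy_pref)
        population.append((consents, reidentifiable))

    overall = sum(r for _, r in population) / len(population)
    consenters = [r for c, r in population if c]
    print(f"re-ID rate, everyone:   {overall:.2f}")                          # ~0.60
    print(f"re-ID rate, consenters: {sum(consenters)/len(consenters):.2f}")  # noticeably higher
    ```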

    1. Thanks, Arvind. Let me push back a bit on your claim that consent doesn’t have a serious biasing effect on other kinds of research, including cancer trials. There’s good reason to believe that so-called “volunteer bias” (and its perhaps slightly less significant cousin, informed consent bias) afflicts all manner of research with human subjects, not just research involving privacy. Participants who agree to enroll in research have been found to be physically healthier and more intelligent, and they are likely more socially outgoing, less risk-averse, and so on. All of these traits, and many others that probably correlate with willingness to participate, are likely to affect a wide range of study outcomes. Consenting participants may also be more compliant than the general population, leading to a trial that suggests an intervention is quite efficacious, followed by disappointing implementation in the real world, where it is much less effective than predicted (the efficacy-effectiveness gap). And indeed, informed consent has been found to affect the therapeutic response of study drugs, and to “increase the apparent efficacy of both the tested [cancer] agent and the placebo, and decrease the perceived difference between the two.” These links are really just the tip of the iceberg. If volunteer (or consent) bias isn’t unique to privacy research, then it’s fair to ask whether that research should be uniquely excused from consent requirements.
