Applying Information Privacy Norms to Re-Identification Demonstrations (Re-Identification Symposium)

This is the first post in Bill of Health‘s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. We’ll have more contributions throughout the week. Background on the symposium is here. You can call up all of the symposium contributions by clicking here (or by clicking on the “Re-Identification Symposium” category link at the bottom of any symposium post). —MM

By Stephen Wilson

I’m fascinated by the methodological intersections of technology and privacy – or rather the lack of intersection, for it appears that a great deal of technology development occurs in blissful ignorance of information privacy norms.  By “norms” I mainly mean the widely legislated OECD Data Protection Principles (see Graham Greenleaf, Global data privacy laws: 89 countries, and accelerating, Privacy Laws & Business International Report, Issue 115, Special Supplement, February 2012).

Standard data protection and information privacy regulations world-wide are grounded in a reasonably common set of principles; these include, amongst other things, that personal information should not be collected if it is not needed for a core business function, and that personal information collected for one purpose should not be re-used for unrelated purposes without consent. These sorts of privacy formulations tend to be technology neutral; they don’t much care about the methods of collection but focus instead on the obligations of data custodians regardless of how personal information has come to be in their systems. That is, whether you collect personal information from the public domain, from a third party, or by synthesising it from other data sources, you are generally accountable under the Collection Limitation and Use Limitation principles just as if you had collected that personal information directly from the individuals concerned.

I am aware of two distinct re-identification demonstrations that have raised awareness of the issues recently.  In the first, Yaniv Erlich used what I understand are new statistical techniques to re-identify a number of subjects who had donated genetic material anonymously to the 1000 Genomes project. He did this by correlating genes in the published anonymous samples with genes in named samples available from genealogical databases. The 1000 Genomes consent form reassured participants that re-identification would be “very hard”. In the second notable demo, Latanya Sweeney re-identified volunteers in the Personal Genome Project using her previously published method of matching a few demographic values (such as date of birth, sex and postal code) extracted from the otherwise anonymous records.
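The demographic-matching idea is simple enough to sketch in a few lines. The following toy example (all names and records are hypothetical, and the scoring is deliberately simplified) joins "anonymous" records to a named public roll on the quasi-identifier triple of date of birth, sex and postal code; a unique match amounts to a re-identification:

```python
# Minimal sketch of a quasi-identifier linkage attack (hypothetical data).
# "Anonymous" health records and a named public roll are joined on the
# quasi-identifier triple (date of birth, sex, postal code).

anonymous_records = [
    {"dob": "1945-07-31", "sex": "M", "zip": "02138", "diagnosis": "X"},
    {"dob": "1972-03-14", "sex": "F", "zip": "02139", "diagnosis": "Y"},
]

public_roll = [
    {"name": "W. Example", "dob": "1945-07-31", "sex": "M", "zip": "02138"},
    {"name": "A. Nother",  "dob": "1980-01-01", "sex": "F", "zip": "02140"},
]

def link(anon, named):
    """Return (name, record) pairs whose quasi-identifiers match uniquely."""
    index = {}
    for person in named:
        key = (person["dob"], person["sex"], person["zip"])
        index.setdefault(key, []).append(person["name"])
    matches = []
    for record in anon:
        names = index.get((record["dob"], record["sex"], record["zip"]), [])
        if len(names) == 1:  # a unique match is a re-identification
            matches.append((names[0], record))
    return matches

print(link(anonymous_records, public_roll))
```

The point of the sketch is that neither data set is sensitive on its own; the privacy harm arises entirely from the join, which is why uniqueness of the quasi-identifier combination in the population is what matters.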

A great deal of the debate around these cases has focused on the consent forms and the research subjects’ expectations of anonymity. These are important matters for sure, yet for me the ethical issue in re-identification demonstrations is more about the obligations of third parties doing the identification who had nothing to do with the original informed consent arrangements.  The act of recording a person’s name against erstwhile anonymous data represents a collection of personal information.  The implications for genomic data re-identification are clear.


Online Symposium on the Law, Ethics & Science of Re-identification Demonstrations

By Michelle Meyer

Over the course of the last fifteen or so years, the belief that “de-identification” of personally identifiable information preserves the anonymity of those individuals has been repeatedly called into question by scholars and journalists. It would be difficult to overstate the importance, for privacy law and policy, of the early work of “re-identification scholars,” as I’ll call them. In the mid-1990s, the Massachusetts Group Insurance Commission (GIC) released data on individual hospital visits by state employees in order to aid important research. As Massachusetts Governor Bill Weld assured employees, their data had been “anonymized,” with all obvious identifiers, such as name, address, and Social Security number, removed. But Latanya Sweeney, then an MIT graduate student, wasn’t buying it. When, in 1996, Weld collapsed at a local event and was admitted to the hospital, she set out to show that she could re-identify his GIC entry. For twenty dollars, she purchased the full roll of Cambridge voter-registration records, and by linking the two data sets, which individually were innocuous enough, she was able to re-identify his GIC entry. As privacy law scholar Paul Ohm put it, “In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.”

Sweeney’s demonstration led to important changes in privacy law, especially under HIPAA. But that demonstration was just the beginning. In 2006, the New York Times was able to re-identify one individual (and only one individual) in a publicly available research dataset of the three-month AOL search history of over 600,000 users. The Times demonstration led to a class-action lawsuit (which settled out of court), an FTC complaint, and soul-searching in Congress. That same year, Netflix began a three-year contest, offering a $1 million prize to whoever could most improve the algorithm by which the company predicts how much a particular user will enjoy a particular movie. To enable the contest, Netflix made publicly available a dataset of the movie ratings of 500,000 of its customers, whose names it replaced with numerical identifiers. In a 2008 paper, Arvind Narayanan, then a graduate student at UT-Austin, along with his advisor, showed that by linking the “anonymized” Netflix prize dataset to the Internet Movie Database (IMDb), in which viewers review movies, often under their own names, many Netflix users could be re-identified, revealing information that was suggestive of their political preferences and other potentially sensitive information. (Remarkably, notwithstanding the re-identification demonstration, after awarding the prize in 2009 to a team from AT&T, in 2010, Netflix announced plans for a second contest, which it cancelled only after tussling with a class-action lawsuit (again, settled out of court) and the FTC.) Earlier this year, Yaniv Erlich and colleagues, using a novel technique involving surnames and the Y chromosome, re-identified five men who had participated in the 1000 Genomes Project — an international consortium to place, in an open online database, the sequenced genomes of (as it turns out, 2500) “unidentified” people — who had also participated in a study of Mormon families in Utah.
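The Netflix/IMDb attack works even without clean quasi-identifiers: a person’s set of movie ratings is itself so distinctive that a handful of matching (movie, rating) pairs can single them out. The toy sketch below (hypothetical data; the scoring is a deliberate simplification of the actual Narayanan–Shmatikov algorithm, which also tolerates fuzzy ratings and dates) scores each named candidate by exact rating agreements and accepts the best candidate only if it clearly beats the runner-up:

```python
# Toy sketch of sparse-data record linkage, loosely in the spirit of the
# Narayanan-Shmatikov Netflix/IMDb attack. All data are hypothetical and
# the scoring is deliberately simplified.

anon_ratings = {  # pseudonymous user id -> {movie title: star rating}
    "u1": {"Movie A": 5, "Movie B": 1, "Movie C": 4},
    "u2": {"Movie A": 3, "Movie D": 2},
}

public_reviews = {  # real name -> {movie title: star rating}
    "alice": {"Movie A": 5, "Movie B": 1, "Movie C": 4, "Movie E": 2},
    "bob":   {"Movie D": 5},
}

def best_match(target, candidates, margin=2):
    """Score each candidate by exact (movie, rating) agreements with the
    target record; accept the top candidate only if it beats the runner-up
    by `margin`, so shared popular tastes alone cannot trigger a match."""
    scores = []
    for name, ratings in candidates.items():
        hits = sum(1 for movie, stars in target.items()
                   if ratings.get(movie) == stars)
        scores.append((hits, name))
    scores.sort(reverse=True)
    top = scores[0]
    runner_up = scores[1] if len(scores) > 1 else (0, None)
    if top[0] - runner_up[0] >= margin:
        return top[1]
    return None  # no sufficiently distinctive match

print(best_match(anon_ratings["u1"], public_reviews))
```

The margin test is the key design choice: it is what lets the attack claim high confidence from only a few overlapping ratings, and it is also why “removing names” does nothing to protect a dataset whose rows are individually distinctive.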

Most recently, Sweeney and colleagues, using the same technique she used to re-identify Weld in 1997, re-identified participants in Harvard’s Personal Genome Project (PGP), who are warned of this risk. As a scholar of research ethics and regulation, and also a PGP participant, I found this latest demonstration particularly interesting. Although much has been said about the appropriate legal and policy responses to these demonstrations (my own thoughts are here), there has been very little discussion about the legal and ethical aspects of the demonstrations themselves. As a modest step in filling that gap, I’m pleased to announce an online symposium, to take place here at the Bill of Health the week of May 20th, that will address both the scientific and policy value of these demonstrations and the legal and ethical issues they raise. Participants fill diverse stakeholder roles (data holder; data provider, i.e., research participant; re-identification researcher; privacy scholar; research ethicist) and will, I expect, have a range of perspectives on these questions:

Misha Angrist
Madeleine Ball
Daniel Barth-Jones
Yaniv Erlich
Beau Gunderson
Stephen Wilson
Michelle Meyer
Arvind Narayanan
Paul Ohm
Latanya Sweeney
Jennifer Wagner

I hope readers will join us on May 20.

UPDATE: You can call up all of the symposium contributions, in reverse chronological order, by clicking here.