Reidentification as Basic Science (Re-Identification Symposium)

By Michelle Meyer

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. We’ll continue to post contributions into next week. —MM

Arvind Narayanan (Ph.D. 2009) is an Assistant Professor of Computer Science at Princeton. He studies information privacy and security and has a side-interest in technology policy. His research has shown that data anonymization is broken in fundamental ways, for which he jointly received the 2008 Privacy Enhancing Technologies Award. Narayanan is one of the researchers behind the “Do Not Track” proposal. His most recent research direction is the use of Web measurement to uncover how companies are using our personal information.

Narayanan is an affiliated faculty member at the Center for Information Technology Policy at Princeton and an affiliate scholar at Stanford Law School’s Center for Internet and Society. You can follow him on Twitter at @random_walker.

By Arvind Narayanan

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.

I Never Promised You a Walled Garden (Re-Identification Symposium)

By Misha Angrist

Dear Michelle:

You know I respect your work immensely: your paper on the heterogeneity problem will be required reading in my classes for a long time to come.

But as far as this forum goes, I feel like I need both to push back and seek clarity. I’m missing something.

As you know, the PGP consent form includes a litany of risks that accompany the decision to make one’s genome and medical information public with no promises of privacy and confidentiality. These risks range from the well documented (discovery of non-paternity) to the arguably more whimsical (“relatedness to criminals or other notorious figures.”), including the prospect of being cloned. You write:

Surely the fact that I acknowledge that it is possible that someone will use my DNA sequence to clone me (not currently illegal under federal law, by the way) does not mean that I have given permission to be cloned, that I have waived my right to object to being cloned, or that I should be expected to be blasé or even happy if and when I am cloned.

Of course not. No one is asking you to be silent, blasé or happy about being cloned (your clone, however, tells me she is “totally psyched”).

But I don’t think it’s unfair to ask that you not be surprised that PGP participants were re-identified, given the very raison d’être of the PGP.

I would argue that the PGP consent process is an iterative, evolving one that still manages to crush HapMap and 1000 Genomes, et al., w/r/t truth in advertising (as far as I know, no other large-scale human “subjects” research study includes an exam). That said, the PGP approach to consent is far from perfect and, given the inherent limitations of informed consent, never will be perfect.

But setting that aside, do you really feel like you’ve been sold a bill of goods? Your deep (and maybe sui generis) understanding of the history of re-identification demonstrations makes me wonder how you could have been shocked or even surprised by the findings of the Sweeney PGP paper.

And yet you were. As your friend and as a member of the PersonalGenomes.org Board of Directors, this troubles and saddens me. In the iterative and collaborative spirit that the Project tries to live by, I look forward to hearing about how the PGP might do better in the future.

In the meantime, I can’t help but wonder: Knowing what you know and having done your own personal cost-benefit analysis, why not quit the PGP? Why incur the risk?

Warm regards,

Misha

Reflections of a Re-Identification Target, Part I: Some Information Doesn’t Want To Be Free (Re-Identification Symposium)

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. Please note that Bill of Health continues to have problems receiving some comments. If you post a comment to any symposium piece and do not see it within half an hour or so, please email your comment to me at mmeyer @ law.harvard.edu and I will post it. —MM

By Michelle N. Meyer

I wear several hats for purposes of this symposium, in addition to organizer. First, I’m trained as a lawyer and an ethicist, and one of my areas of scholarly focus is research regulation and ethics, so I see re-identification demonstrations through that lens. Second, as a member of the advisory board of the Social Science Genetics Association Consortium (SSGAC), I advise data holders about ethical and regulatory aspects of their research, including issues of re-identification. I may have occasion to reflect on this role later in the symposium. For now, however, I want to put on my third hat: that of data provider to (a.k.a. research participant in) the Personal Genome Project (PGP), the most recent target of a pair of re-identification “attacks,” as even re-identification researchers themselves seem to call them.

In this first post, I’ll briefly discuss my experience as a target of a re-identification attack. In my discussions elsewhere about the PGP demonstrations, some have suggested that re-identification requires little or no ethical justification where (1) participants have been warned about the risk of re-identification; (2) participants have given blanket consent to all research uses of the data they make publicly available; and/or (3) the re-identification researchers are scholars rather than commercial or criminal actors.

In explaining below why I think each of these arguments is mistaken, I focus on the PGP re-identification demonstrations. I choose the PGP demonstrations not to single them out, but rather for several other reasons. First, the PGP attacks are the case studies with which, for obvious reasons, I’m most familiar, and I’m fortunate to have convinced so many other stakeholders involved in those demonstrations to participate in the symposium and help me fill out the picture with their perspectives. I also focus on the PGP because some view it as an “easy” case for re-identification work, given the features I just described. Therefore, if nonconsensual re-identification attacks on PGP participants are ethically problematic, then much other nonconsensual re-identification work is likely to be as well. Finally, although today the PGP may be somewhat unusual in being so frank with participants about the risk of re-identification and in engaging in such open access data sharing, both of these features, and especially the first, shouldn’t be unusual in research. To the extent that we move towards greater frankness about re-identification risk and broader data sharing, trying to achieve clarity about what these features of a research project do — and do not — mean for the appropriateness of re-identification demonstrations will be important.

Having argued here about how not to think about the ethics of re-identification studies, in a later post, I plan to provide some affirmative thoughts about an ethical framework for how we should think about this work.

Data Sharing vs. Privacy: Cutting the Gordian Knot (Re-Identification Symposium)

PGP participants and staff at the 2013 GET Conference. Photo credit: PersonalGenomes.org, license CC-BY

By Madeleine Ball

Scientists should share. Methods, samples, and data — sharing these is a foundational aspect of the scientific method. Sharing enables researchers to replicate, validate, and build upon the work of colleagues. As Isaac Newton famously wrote: “If I have seen further it is by standing on the shoulders of giants.”

When scientists study humans, however, this impulse to share runs into another motivating force — respect for individual privacy. Clinical research has traditionally been conducted using de-identified data, and participants have been assured privacy. As digital information and computational methods have increased the ability to re-identify participants, researchers have become correspondingly more restrictive with sharing. Solutions are proposed in an attempt to maximize research value while protecting privacy, but these can fail — and, as Gymrek et al. have recently confirmed, biological materials themselves contain highly identifying information through their genetic material alone.

When George Church proposed the Personal Genome Project in 2005, he recognized this inherent tension between privacy and data sharing. He proposed an extreme solution: cutting the Gordian knot by removing assurances of privacy:

If the study subjects are consented with the promise of permanent confidentiality of their records, then the exposure of their data could result in psychological trauma to the participants and loss of public trust in the project. On the other hand, if subjects are recruited and consented based on expectation of full public data release, then the above risks to the subjects and the project can be avoided.

Church, GM “The Personal Genome Project” Molecular Systems Biology (2005)

Thus, the first ten PGP participants — the PGP-10 — identified themselves publicly.

Breaking Good: A Short Ethical Manifesto for the Privacy Researcher

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. We’ll have more contributions throughout the week, and extending at least into early next week. Background on the symposium is here. You can call up all of the symposium contributions here (or by clicking on the “Re-Identification Symposium” category link at the bottom of any symposium post).

By Yaniv Erlich

1. Increase the general knowledge. Like any other scientific discipline, privacy research strives to increase our knowledge about the world. You are breaking bad if your actions are aimed at revealing intimate details about people or, worse, at exploiting these details for your own benefit. This is not science. This is just ugly behavior. Ethical privacy research aims to deduce technical commonalities about vulnerabilities in systems, not about the individuals in these systems. This should be your internal compass.

This rule immediately implies that your published findings should communicate only the information needed to deduce general rules. Any shocking, juicy, or intimate detail revealed during your study is not relevant and should not be included in your publication.

Some people might gently (or aggressively) suggest that you should not publish your findings at all. Do not let that make you too nervous. Simply remind them that the ethical ground of your actions is increasing the general knowledge. Therefore, communicating your algorithms, hacks, and recipes is an ethical obligation, and without it your actions cannot truly be regarded as research. “There is no ignorabimus … whatever in natural science. We must know — we will know!” the great mathematician David Hilbert once said. His statement also applies to privacy research.

Re-Identification Is Not the Problem. The Delusion of De-Identification Is. (Re-Identification Symposium)

By Michelle Meyer

This is the second post in Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. We’ll have more contributions throughout the week, and extending at least into early next week. Background on the symposium is here. You can call up all of the symposium contributions by clicking here (or by clicking on the “Re-Identification Symposium” category link at the bottom of any symposium post).

By Jen Wagner, J.D., Ph.D.

Before I actually discuss my thoughts on the re-identification demonstrations, I think it would be useful to provide a brief background on my perspective.

Identification≠identity

My genome is an identifier. It can be used in lieu of my name, my visible appearance, or my fingerprints to describe me sufficiently for legal purposes (e.g. a “Jane Doe” search or arrest warrant specifying my genomic sequence). Nevertheless, my genome is not me. It is not the gist of who I am, past, present, or future. In other words, I do not believe in genetic essentialism.

My genome is not my identity, though it contributes to my identity in varying ways (directly and indirectly; consciously and subconsciously; discretely and continuously). Not every individual defines his/her self the way I do. There are genomophobes who may shape their identity in the absence of their genomic information and even in denial of and/or contradiction to their genomic information. Likewise, there are genomophiles who may shape their identity with considerable emphasis on their genomic information, in the absence of non-genetic information and even in denial of and/or contradiction to their non-genetic information (such as genealogies and origin beliefs).

My genome can tell you probabilistic information about me, such as my superficial appearance, health conditions, and ancestry. But it won’t tell you how my phenotypes have developed over my lifetime or how they may have been altered (e.g. the health benefits I noticed when I became vegetarian, the scar I earned when I was a kid, or the dyes used to hide the grey hairs that seem proportional to time spent on the academic job market). I do not believe in genetic determinism. My genomic data is of little research value without me (i.e. a willing, able, and honest participant), my phenotypic information (e.g. anthropometric data and health status), and my environmental information (e.g. data about my residence, community, life exposures, etc). Quite simply, I make my genomic data valuable.

As a PGP participant, I did not detach my name from the genetic data I uploaded into my profile. In many ways, I feel that the value of my data is maximized and the integrity of my data is better ensured when my data is humanized.

Applying Information Privacy Norms to Re-Identification Demonstrations (Re-Identification Symposium)

This is the first post in Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. We’ll have more contributions throughout the week. Background on the symposium is here. You can call up all of the symposium contributions by clicking here (or by clicking on the “Re-Identification Symposium” category link at the bottom of any symposium post). —MM

By Stephen Wilson

I’m fascinated by the methodological intersections of technology and privacy – or rather the lack of intersection, for it appears that a great deal of technology development occurs in blissful ignorance of information privacy norms. By “norms” in the main I mean the widely legislated OECD Data Protection Principles (see Graham Greenleaf, Global data privacy laws: 89 countries, and accelerating, Privacy Laws & Business International Report, Issue 115, Special Supplement, February 2012).

Standard data protection and information privacy regulations world-wide are grounded in a reasonably common set of principles; these include, amongst other things, that personal information should not be collected if it is not needed for a core business function, and that personal information collected for one purpose should not be re-used for unrelated purposes without consent. These sorts of privacy formulations tend to be technology neutral; they don’t much care about the methods of collection but focus instead on the obligations of data custodians regardless of how personal information has come to be in their systems. That is, it does not matter whether you collect personal information from the public domain or from a third party, or synthesise it from other data sources: you are generally accountable under the Collection Limitation and Use Limitation principles in the same way as if you had collected that personal information directly from the individuals concerned.

I am aware of two distinct re-identification demonstrations that have raised awareness of the issues recently. In the first, Yaniv Erlich used what I understand are new statistical techniques to re-identify a number of subjects who had donated genetic material anonymously to the 1000 Genomes project. He did this by correlating genes in the published anonymous samples with genes in named samples available from genealogical databases. The 1000 Genomes consent form reassured participants that re-identification would be “very hard”. In the second notable demo, Latanya Sweeney re-identified volunteers in the Personal Genome Project using her previously published method, which relies on a few demographic values (such as date of birth, sex and postal code) extracted from the otherwise anonymous records.
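To make the demographic point concrete, here is a minimal sketch in Python of how one might measure how many records in a dataset are singled out by the combination of date of birth, sex and postal code alone. The records and the function name `unique_fraction` are made up for illustration; this is not Sweeney's code, only a rough picture of why a handful of ordinary fields can act as an identifier.

```python
from collections import Counter

# Hypothetical, made-up records: (date_of_birth, sex, postal_code).
# None of these correspond to real people or to any real dataset.
records = [
    ("1970-01-15", "F", "02138"),
    ("1970-01-15", "F", "02138"),
    ("1982-06-30", "M", "02139"),
    ("1991-11-02", "F", "02140"),
]


def unique_fraction(rows):
    """Return the share of rows whose field combination appears exactly once."""
    if not rows:
        return 0.0
    counts = Counter(rows)
    # A row is "singled out" when no other row shares all of its values.
    unique = sum(1 for row in rows if counts[row] == 1)
    return unique / len(rows)


print(unique_fraction(records))
```

In this toy data the first two rows share all three values and so hide each other, while the last two rows are each the only one with their combination. The larger the unique share, the less protection removing names provides.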

A great deal of the debate around these cases has focused on the consent forms and the research subjects’ expectations of anonymity. These are important matters for sure, yet for me the ethical issue in re-identification demonstrations is more about the obligations of third parties doing the identification who had nothing to do with the original informed consent arrangements. The act of recording a person’s name against erstwhile anonymous data represents a collection of personal information. The implications for genomic data re-identification are clear.

Online Symposium on the Law, Ethics & Science of Re-identification Demonstrations

By Michelle Meyer

Over the course of the last fifteen or so years, the belief that “de-identification” of personally identifiable information preserves the anonymity of those individuals has been repeatedly called up short by scholars and journalists. It would be difficult to overstate the importance, for privacy law and policy, of the early work of “re-identification scholars,” as I’ll call them. In the mid-1990s, the Massachusetts Group Insurance Commission (GIC) released data on individual hospital visits by state employees in order to aid important research. As Massachusetts Governor Bill Weld assured employees, their data had been “anonymized,” with all obvious identifiers, such as name, address, and Social Security number, removed. But Latanya Sweeney, then an MIT graduate student, wasn’t buying it. When, in 1996, Weld collapsed at a local event and was admitted to the hospital, she set out to show that she could re-identify his GIC entry. For twenty dollars, she purchased the full roll of Cambridge voter-registration records, and by linking the two data sets, which individually were innocuous enough, she was able to re-identify his GIC entry. As privacy law scholar Paul Ohm put it, “In a theatrical flourish, Dr. Sweeney sent the Governor’s health records (which included diagnoses and prescriptions) to his office.”
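The linking step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`birth_date`, `sex`, `zip_code`), the function name `link`, and every record below are hypothetical and invented for the example, and the sketch is not a reconstruction of the actual GIC or voter-roll data or of Sweeney's own procedure.

```python
from collections import defaultdict

# Assumed field names shared by both datasets (hypothetical).
QUASI_IDENTIFIERS = ("birth_date", "sex", "zip_code")


def link(deidentified, named, keys=QUASI_IDENTIFIERS):
    """Pair each de-identified record with a name when exactly one named
    record shares all of its quasi-identifier values."""
    index = defaultdict(list)
    for person in named:
        index[tuple(person[k] for k in keys)].append(person["name"])

    matches = []
    for record in deidentified:
        candidates = index.get(tuple(record[k] for k in keys), [])
        # Only an unambiguous (single-candidate) match counts as a re-identification.
        if len(candidates) == 1:
            matches.append((candidates[0], record))
    return matches


# Made-up "anonymized" hospital records: names removed, demographics kept.
hospital = [
    {"birth_date": "1950-03-01", "sex": "M", "zip_code": "00001", "diagnosis": "X"},
    {"birth_date": "1975-09-12", "sex": "F", "zip_code": "00002", "diagnosis": "Y"},
]

# Made-up public roll: names plus the same demographic fields.
voter_roll = [
    {"name": "Person A", "birth_date": "1950-03-01", "sex": "M", "zip_code": "00001"},
    {"name": "Person B", "birth_date": "1975-09-12", "sex": "F", "zip_code": "00002"},
    {"name": "Person C", "birth_date": "1975-09-12", "sex": "F", "zip_code": "00002"},
]

matches = link(hospital, voter_roll)
for name, record in matches:
    print(name, record["diagnosis"])
```

Here the first hospital record links to exactly one named person, while the second stays ambiguous because two people on the roll share its values. Neither dataset reveals much on its own; the disclosure comes from the join.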

Sweeney’s demonstration led to important changes in privacy law, especially under HIPAA. But that demonstration was just the beginning. In 2006, the New York Times was able to re-identify one individual (and only one individual) in a publicly available research dataset of the three-month AOL search history of over 600,000 users. The Times demonstration led to a class-action lawsuit (which settled out of court), an FTC complaint, and soul-searching in Congress. That same year, Netflix began a three-year contest, offering a $1 million prize to whoever could most improve the algorithm by which the company predicts how much a particular user will enjoy a particular movie. To enable the contest, Netflix made publicly available a dataset of the movie ratings of 500,000 of its customers, whose names it replaced with numerical identifiers. In a 2008 paper, Arvind Narayanan, then a graduate student at UT-Austin, along with his advisor, showed that by linking the “anonymized” Netflix prize dataset to the Internet Movie Database (IMDb), in which viewers review movies, often under their own names, many Netflix users could be re-identified, revealing information that was suggestive of their political preferences and other potentially sensitive information. (Remarkably, notwithstanding the re-identification demonstration, after awarding the prize in 2009 to a team from AT&T, in 2010, Netflix announced plans for a second contest, which it cancelled only after tussling with a class-action lawsuit (again, settled out of court) and the FTC.) Earlier this year, Yaniv Erlich and colleagues, using a novel technique involving surnames and the Y chromosome, re-identified five men who had participated in the 1000 Genomes Project — an international consortium to place, in an open online database, the sequenced genomes of (as it turns out, 2500) “unidentified” people — who had also participated in a study of Mormon families in Utah.

Most recently, Sweeney and colleagues re-identified participants in Harvard’s Personal Genome Project (PGP), who are warned of this risk, using the same technique she used to re-identify Weld in 1997. This latest demonstration piqued my interest both as a scholar of research ethics and regulation and as a PGP participant. Although much has been said about the appropriate legal and policy responses to these demonstrations (my own thoughts are here), there has been very little discussion about the legal and ethical aspects of the demonstrations themselves. As a modest step in filling that gap, I’m pleased to announce an online symposium, to take place here at the Bill of Health the week of May 20th, that will address both the scientific and policy value of these demonstrations and the legal and ethical issues they raise. Participants fill diverse stakeholder roles (data holder, data provider — i.e., research participant, re-identification researcher, privacy scholar, research ethicist) and will, I expect, have a range of perspectives on these questions:

Misha Angrist

Madeleine Ball

Daniel Barth-Jones

Yaniv Erlich

Beau Gunderson

Stephen Wilson

Michelle Meyer

Arvind Narayanan

Paul Ohm

Latanya Sweeney

Jennifer Wagner

I hope readers will join us on May 20.

UPDATE: You can call up all of the symposium contributions, in reverse chronological order, by clicking here.