Ethical Concerns, Conduct and Public Policy for Re-Identification and De-identification Practice: Part 3 (Re-Identification Symposium)

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. Background on the symposium is here. You can call up all of the symposium contributions by clicking here. —MM

By Daniel C. Barth-Jones

In Part 1 and Part 2 of this symposium contribution, I wrote about a number of re-identification demonstrations and their reporting, both in the popular press and in scientific communications. However, even beyond the ethical concerns that I’ve raised about the accuracy of some of these communications, there are additional ethical, “scientific ethos”, and pragmatic public policy considerations involved in the conduct of re-identification research and de-identification practice that warrant more thorough discussion and debate.

First Do No Harm

Unless we believe that the ends always justify the means, even obtaining useful results for guiding public policy (as was the case with the PGP demonstration attack’s validation of “perfect population register” issues) doesn’t necessarily mean that the conduct of re-identification research is on solid ethical footing. Yaniv Erlich’s “A Short Ethical Manifesto for the Privacy Researcher” blog post, contributed as part of this symposium, provides this wise admonition: “Do no harm to the individuals in your study. If you can prove your point by a simulation on artificial data – do it.” This is very sound ethical advice in my opinion. I would argue that the re-identification risks for those individuals in the PGP study who had supplied 5-digit Zip Code and full date of birth were already understood to be unacceptably high (if these persons were concerned about being identified) and that no additional research whatsoever was needed to demonstrate this point. However, if additional arguments needed to be made about the precise levels of the risks, this could have been adequately addressed through the use of probability models. I’d also argue that the “data intrusion scenario” uncertainty analyses which I discussed in Part 1 of this symposium contribution already accurately predicted the very small re-identification risks found for the sort of journalist and “nosy neighbor” attacks directed at the Washington hospital data. When strong probabilistic arguments can be made regarding potential re-identification risks, there is little purpose to be served by undertaking actual re-identifications that can impact specific persons.
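
To give a concrete sense of the kind of probability model being alluded to here, the sketch below (a minimal illustration, not Barth-Jones’s actual analysis) estimates how many residents of a single ZIP code would be expected to be unique on {5-digit Zip Code, date of birth, gender} without attempting any actual re-identification. The population figure and the uniform-allocation assumption are deliberately crude hypotheticals.

```python
import math

def expected_uniques(population, bins):
    """Expected count and share of people who are population-unique on a set of
    quasi-identifiers, assuming individuals fall uniformly at random into one
    of `bins` equally likely combinations (a deliberately crude model)."""
    p_unique = math.exp(-(population - 1) / bins)  # P(no one else shares my combination)
    return population * p_unique, p_unique

# Hypothetical numbers: one ZIP code with 10,000 residents, and
# ~36,525 possible birth dates (100 years) x 2 recorded genders.
count, share = expected_uniques(population=10_000, bins=36_525 * 2)
print(f"Expected uniques in this ZIP code: {count:.0f} ({share:.0%} of residents)")
```

Under these toy assumptions the large majority of residents would be expected to be unique, which is exactly the kind of conclusion that can be reached, and debated, without touching any identifiable person’s record.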

Looking more broadly, it is more debatable whether the earlier January re-identification attacks by the Erlich lab on the CEPH – Utah Residents with Northern and Western European Ancestry (CEU) participants were warranted by virtue of the attack having exposed a previously underappreciated risk. However, I think an argument could likely be made that, given the prior work by Gitschier, which had already revealed the re-identification vulnerabilities of CEU participants, the CEU portion of the Science paper also might not have served any additional purpose in directly advancing the science needed for the development of good public policy. Without the CEU re-identifications, though, it is unclear whether the surname inference paper would have been published (at least by a prominent journal like Science), and it also seems quite unlikely that it would have attracted nearly the same level of media attention.


Press and Reporting Considerations for Recent Re-Identification Demonstration Attacks: Part 2 (Re-Identification Symposium)

By Michelle Meyer

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. Background on the symposium is here. You can call up all of the symposium contributions by clicking here. —MM

Daniel C. Barth-Jones, M.P.H., Ph.D., is an HIV and infectious disease epidemiologist. His work in the area of statistical disclosure control and implementation under the HIPAA Privacy Rule provisions for de-identification is focused on the importance of properly balancing competing goals of protecting patient privacy and preserving the accuracy of scientific research and statistical analyses conducted with de-identified data. You can follow him on Twitter at @dbarthjones.

Forecast for Re-identification: Media Storms Continue…

In Part 1 of this symposium contribution, I wrote about the re-identification “media storm” started in January by the Erlich lab’s “Y-STR” re-identifications, which made use of the relationship between Short Tandem Repeats (STRs) on the Y chromosome and paternally inherited surnames. Within months of that attack, April and June brought additional re-identification media storms, this time surrounding re-identification of Personal Genome Project (PGP) participants and a separate attack matching 40 persons within the Washington State hospital discharge database to news reports. However, as has sometimes been the case with past reporting on other re-identification risks that I have written about, accurate and legitimate characterization of re-identification risks has, unfortunately, once again been overshadowed by distortive and exaggerated reporting on some aspects of these re-identification attacks. Indeed, a careful review of both the popular press coverage and the scientific communications for these recent re-identification demonstrations reveals some highly misleading statements, the most egregious of which incorrectly informs more than 112 million persons (more than one-third of the U.S. population) that they are at potential risk of re-identification when they would not actually be unique and, therefore, re-identifiable. While each separate reporting concern that I’ve addressed here is important in and of itself, the broader pattern that can be observed in these communications about re-identification demonstrations raises some serious concerns about the impact that such distortive reporting could have on the development of sound and prudent public policy for the use of de-identified data.

Reporting Fail (and after-Fails)

University of Arizona law professor Jane Yakowitz Bambauer was the first to call out the distortive “reporting fail” for the PGP “re-identifications” in her blog post on the Harvard Law School Info/Law website. Bambauer pointed out that a Forbes article (written by Adam Tanner, a fellow at Harvard University’s Department of Government and a colleague of the re-identification scientist) covering the PGP re-identification demonstration was misleading with regard to a number of aspects of the actual research report released by Harvard’s Data Privacy Lab. The PGP re-identification study attempted to re-identify 579 persons in the PGP study by linking their “quasi-identifiers” {5-digit Zip Code, date of birth and gender} to both voter registration lists and an online public records database. The Forbes article led with the statement that “more than 40% of a sample of anonymous participants” had been re-identified. (This dubious claim was also repeated in subsequent reporting by the same author in spite of Bambauer’s “call out” of the inaccuracy explained below.) However, the mischaracterization of this data as “anonymous” really should not have fooled anyone beyond the most casual readers. In fact, approximately 80 individuals among the 579 were “re-identified” only because they had their actual names included within the file names of the publicly available PGP data. Some two dozen additional persons had their names embedded within the PGP file names, but were also “re-identifiable” by matching to voter and online public records data. Bambauer pointed out that the inclusion of the named individuals was “not relevant to an assessment of re-identification risk because the participants were not de-identified,” and quite correctly added that “Including these participants in the re-identification number inflates both the re-identification risk and the accuracy rate.”

As one observer humorously tweeted after reading Bambauer’s blog piece,

It’s like claiming you “reidentified” people from their high school yearbook.
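
Joking aside, for readers who want to see what the linkage step described above looks like mechanically, here is a schematic sketch; it is not the Data Privacy Lab’s actual code, and all file names and column names are hypothetical.

```python
import pandas as pd

# Hypothetical inputs: a "de-identified" PGP export and a voter-roll extract,
# each carrying the three quasi-identifiers {5-digit ZIP, date of birth, gender}.
pgp = pd.read_csv("pgp_profiles.csv")      # columns: profile_id, zip5, dob, gender
voters = pd.read_csv("voter_roll.csv")     # columns: name, zip5, dob, gender

# Join the two tables on the quasi-identifiers.
candidates = pgp.merge(voters, on=["zip5", "dob", "gender"], how="inner")

# A profile is only plausibly "re-identified" when it matches exactly one voter;
# multiple matches mean the quasi-identifier combination is not unique.
match_counts = candidates.groupby("profile_id")["name"].nunique()
unique_matches = match_counts[match_counts == 1]
print(f"{len(unique_matches)} profiles matched exactly one voter record")
```

As the last step makes explicit, a claimed match is only as strong as the uniqueness of the quasi-identifier combination, which is one reason counting participants whose names were already embedded in the file names inflates the apparent success rate.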


Public Policy Considerations for Recent Re-Identification Demonstration Attacks on Genomic Data Sets: Part 1 (Re-Identification Symposium)

By Michelle Meyer

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. We’ll have more contributions throughout the week. Background on the symposium is here. You can call up all of the symposium contributions by clicking here. —MM

Daniel C. Barth-Jones, M.P.H., Ph.D., is an HIV and infectious disease epidemiologist. His work in the area of statistical disclosure control and implementation under the HIPAA Privacy Rule provisions for de-identification is focused on the importance of properly balancing competing goals of protecting patient privacy and preserving the accuracy of scientific research and statistical analyses conducted with de-identified data. You can follow him on Twitter at @dbarthjones.

Re-identification Rain-makers

The media’s “re-identification rain-makers” have been hard at work in 2013 ceremoniously drumming up the latest anxiety-inducing media storms. In January, a new re-identification attack providing “surname inferences” from genomic data was unveiled and the popular press and bloggers thundered, rattled and raged with headlines ranging from the more staid and trusted voices of major newspapers (like the Wall Street Journal’s: “A Little Digging Unmasks DNA Donor Names. Experts Identify People by Matching Y-Chromosome Markers to Genealogy Sites, Obits; Researchers’ Privacy Promises ‘Empty’”) to near “the-sky-is-falling” hysteria in the blogosphere where headlines screamed: “Your Biggest Genetic Secrets Can Now Be Hacked, Stolen, and Used for Target Marketing” and “DNA hack could make medical privacy impossible”. (Now, we all know that editors will sometimes write sensational headlines in order to draw in readers, but I have to just say “Please, Editors… Take a deep breath and maybe a Xanax”.)

The more complicated reality is that, while this recent re-identification demonstration provided some important warning signals for future potential health privacy concerns, it was not likely to have been implemented by anyone other than an academic re-identification scientist; nor would it have been nearly so successful had the researchers not carefully selected targets who were particularly susceptible to re-identification.
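
For readers who want a concrete picture of what “surname inference” means mechanically, the toy sketch below shows the basic idea of looking up a Y-chromosome STR profile in a public genealogy table; it is emphatically not the Erlich lab’s actual algorithm (which scores partial haplotype matches across many markers and then narrows candidates using age and state), and the repeat counts and surnames are invented.

```python
# Toy illustration of Y-STR surname inference. The marker names are real Y-STR
# loci, but every repeat count and surname below is made up for illustration.
GENEALOGY_DB = {
    # (DYS19, DYS390, DYS391) repeat counts -> surnames recorded for that haplotype
    (14, 24, 11): ["Doe"],
    (15, 23, 10): ["Smith", "Schmidt"],
}

def infer_surnames(y_str_profile):
    """Exact-match lookup only; a real genealogy query would rank fuzzy matches
    rather than require an exact haplotype hit."""
    return GENEALOGY_DB.get(tuple(y_str_profile), [])

print(infer_surnames((14, 24, 11)))   # -> ['Doe']
```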

As I’ve written elsewhere, from a public policy standpoint, it is essential that re-identification scientists and the media accurately communicate re-identification risk research, because public opinion should, and does, play an important role in setting priorities for policy-makers. There is no “free lunch”. Considerable costs come with incorrectly evaluating the true risks of re-identification, because de-identification practice directly affects the scientific accuracy of research conducted with de-identified data and, in turn, the quality of the healthcare decisions based on that research. Properly balancing disclosure risks and statistical accuracy is crucial because some popular de-identification methods can unnecessarily, and often undetectably, degrade the accuracy of de-identified data for multivariate statistical analyses. Poorly conducted de-identification may fail to protect privacy, while the overuse of de-identification methods in cases where they do not produce meaningful privacy protections can quickly lead to undetected and life-threatening distortions in research and produce damaging health policy decisions.

So, what is the realistic magnitude of re-identification risk posed by the “Y-STR” surname inference re-identification attack methods developed by Yaniv Erlich’s lab? Should *everyone* really be fearful that this “DNA Hack” has now made their “medical privacy impossible”?

An Open Letter From a Genomic Altruist to a Genomic Extrovert (Re-Identification Symposium)

By Michelle Meyer

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. We’ll continue to post contributions throughout the week. —MM

Dear Misha:

In your open letter to me, you write:

No one is asking you to be silent, blasé or happy about being cloned (your clone, however, tells me she is “totally psyched”).

First things first: I have an ever-growing list of things I wish I had done differently in life, so let me know when my clone has learned how to read, and I’ll send it on over; perhaps her path in life will be sufficiently similar to mine that she’ll benefit from at least a few items on the list.

Moving on to substance, here’s the thing: some people did say that PGP participants have no right to complain about being re-identified (and, by logical extension, about any of the other risks we assumed, including the risk of being cloned). It was my intention, in that post, to articulate and respond to three arguments that I’ve encountered, each of which suggests that re-identification demonstrations raise few or no ethical issues, at least in certain cases. To review, those arguments are:

  1. Participants who are warned by data holders of the risk of re-identification thereby consent to be re-identified by third parties.
  2. Participants who agree to provide data in an open access format for anyone to do with it whatever they like thereby give blanket consent that necessarily includes consent to using their data (combined with other data) to re-identify them.
  3. Re-identification is benign in the hands of scholars, as opposed to commercial or criminal actors.

I feel confident in rejecting the first and third arguments. (As you’ll see from the comments I left on your post, however, I struggled, and continue to struggle, with how to respond to the second argument; Madeleine also has some great thoughts.) Note, however, two things. First, none of my responses to these arguments was meant to suggest that I or anyone else had been “sold a bill of goods” by the PGP. I’m sorry that I must have written my post in such a way that it lent itself to that interpretation. All I intended to say was that, in acknowledging the PGP’s warning that re-identification by third parties is possible, participants did not give third parties permission to re-identify them. I was addressing the relationship between re-identification researchers and data providers more than that between data providers and data holders.

Second, even as to re-identification researchers, it doesn’t follow from my rejection of these three arguments that re-identification demonstrations are necessarily unethical, even when conducted without participant consent. Exploring that question is the aim, in part, of my next post. What I tried to do in the first post was clear some brush and push back against the idea that under the PGP model — a model that I think we both would like to see expand — participants have given permission to be re-identified, “end of [ethical] story.”

Reidentification as Basic Science (Re-Identification Symposium)

By Michelle Meyer

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. We’ll continue to post contributions into next week. —MM

Arvind Narayanan (Ph.D. 2009) is an Assistant Professor of Computer Science at Princeton. He studies information privacy and security and has a side-interest in technology policy. His research has shown that data anonymization is broken in fundamental ways, for which he jointly received the 2008 Privacy Enhancing Technologies Award. Narayanan is one of the researchers behind the “Do Not Track” proposal. His most recent research direction is the use of Web measurement to uncover how companies are using our personal information.

Narayanan is an affiliated faculty member at the Center for Information Technology Policy at Princeton and an affiliate scholar at Stanford Law School’s Center for Internet and Society. You can follow him on Twitter at @random_walker.

By Arvind Narayanan

What really drives reidentification researchers? Do we publish these demonstrations to alert individuals to privacy risks? To shame companies? For personal glory? If our goal is to improve privacy, are we doing it in the best way possible?

In this post I’d like to discuss my own motivations as a reidentification researcher, without speaking for anyone else. Certainly I care about improving privacy outcomes, in the sense of making sure that companies, governments and others don’t get away with mathematically unsound promises about the privacy of consumers’ data. But there is a quite different goal I care about at least as much: reidentification algorithms. These algorithms are my primary object of study, and so I see reidentification research partly as basic science.


I Never Promised You a Walled Garden (Re-Identification Symposium)

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. We’ll continue to post contributions into next week. —MM

By Misha Angrist

Dear Michelle:

You know I respect your work immensely: your paper on the heterogeneity problem will be required reading in my classes for a long time to come.

But as far as this forum goes, I feel like I need both to push back and seek clarity. I’m missing something.

As you know, the PGP consent form includes a litany of risks that accompany the decision to make one’s genome and medical information public with no promises of privacy and confidentiality. These risks range from the well-documented (discovery of non-paternity) to the arguably more whimsical (“relatedness to criminals or other notorious figures”), including the prospect of being cloned. You write:

Surely the fact that I acknowledge that it is possible that someone will use my DNA sequence to clone me (not currently illegal under federal law, by the way) does not mean that I have given permission to be cloned, that I have waived my right to object to being cloned, or that I should be expected to be blasé or even happy if and when I am cloned.

Of course not. No one is asking you to be silent, blasé or happy about being cloned (your clone, however, tells me she is “totally psyched”).

But I don’t think it’s unfair to ask that you not be surprised that PGP participants were re-identified, given the very raison d’être of the PGP.

I would argue that the PGP consent process is an iterative, evolving one that still manages to crush HapMap and 1000 Genomes, et al., w/r/t truth in advertising (as far as I know, no other large-scale human “subjects” research study includes an exam). That said, the PGP approach to consent is far from perfect and, given the inherent limitations of informed consent, never will be perfect.

But setting that aside, do you really feel like you’ve been sold a bill of goods? Your deep–and maybe sui generis–understanding of the history of de-identification demonstrations makes me wonder how you could have been shocked or even surprised by the findings of the Sweeney PGP paper.

And yet you were. As your friend and as a member of the PersonalGenomes.org Board of Directors, this troubles and saddens me. In the iterative and collaborative spirit that the Project tries to live by, I look forward to hearing about how the PGP might do better in the future.

In the meantime, I can’t help but wonder: Knowing what you know and having done your own personal cost-benefit analysis, why not quit the PGP? Why incur the risk?

Warm regards,

Misha

Reflections of a Re-Identification Target, Part I: Some Information Doesn’t Want To Be Free (Re-Identification Symposium)

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. Please note that Bill of Health continues to have problems receiving some comments. If you post a comment to any symposium piece and do not see it within half an hour or so, please email your comment to me at mmeyer @ law.harvard.edu and I will post it. —MM

By Michelle N. Meyer

I wear several hats for purposes of this symposium, in addition to organizer. First, I’m trained as a lawyer and an ethicist, and one of my areas of scholarly focus is research regulation and ethics, so I see re-identification demonstrations through that lens. Second, as a member of the advisory board of the Social Science Genetics Association Consortium (SSGAC), I advise data holders about ethical and regulatory aspects of their research, including issues of re-identification. I may have occasion to reflect on this role later in the symposium. For now, however, I want to put on my third hat: that of data provider to (a.k.a. research participant in) the Personal Genome Project (PGP), the most recent target of a pair of re-identification “attacks,” as even re-identification researchers themselves seem to call them.

In this first post, I’ll briefly discuss my experience as a target of a re-identification attack. In my discussions elsewhere about the PGP demonstrations, some have suggested that re-identification requires little or no ethical justification where (1) participants have been warned about the risk of re-identification; (2) participants have given blanket consent to all research uses of the data they make publicly available; and/or (3) the re-identification researchers are scholars rather than commercial or criminal actors.

In explaining below why I think each of these arguments is mistaken, I focus on the PGP re-identification demonstrations. I choose the PGP demonstrations not to single them out, but rather for several other reasons. First, the PGP attacks are the case studies with which, for obvious reasons, I’m most familiar, and I’m fortunate to have convinced so many other stakeholders involved in those demonstrations to participate in the symposium and help me fill out the picture with their perspectives. I also focus on the PGP because some view it as an “easy” case for re-identification work, given the features I just described. Therefore, if nonconsensual re-identification attacks on PGP participants are ethically problematic, then much other nonconsensual re-identification work is likely to be as well. Finally, although today the PGP may be somewhat unusual in being so frank with participants about the risk of re-identification and in engaging in such open access data sharing, both of these features, and especially the first, shouldn’t be unusual in research. To the extent that we move towards greater frankness about re-identification risk and broader data sharing, trying to achieve clarity about what these features of a research project do — and do not — mean for the appropriateness of re-identification demonstrations will be important.

Having argued here about how not to think about the ethics of re-identification studies, in a later post, I plan to provide some affirmative thoughts about an ethical framework for how we should think about this work.


Data Sharing vs. Privacy: Cutting the Gordian Knot (Re-Identification Symposium)

PGP participants and staff at the 2013 GET Conference. Photo credit: PersonalGenomes.org, license CC-BY

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. Please note that Bill of Health continues to have problems receiving some comments. If you post a comment to any symposium piece and do not see it within half an hour or so, please email your comment to me at mmeyer @ law.harvard.edu and I will post it. —MM

By Madeleine Ball

Scientists should share. Methods, samples, and data — sharing these is a foundational aspect of the scientific method. Sharing enables researchers to replicate, validate, and build upon the work of colleagues. As Isaac Newton famously wrote: “If I have seen further it is by standing on the shoulders of giants.”

When scientists study humans, however, this impulse to share runs into another motivating force — respect for individual privacy. Clinical research has traditionally been conducted using de-identified data, and participants have been assured privacy. As digital information and computational methods have increased the ability to re-identify participants, researchers have become correspondingly more restrictive with sharing. Solutions are proposed in an attempt to maximize research value while protecting privacy, but these can fail — and, as Gymrek et al. have recently confirmed, biological materials themselves contain highly identifying information through their genetic material alone.

When George Church proposed the Personal Genome Project in 2005, he recognized this inherent tension between privacy and data sharing. He proposed an extreme solution: cutting the Gordian knot by removing assurances of privacy:

If the study subjects are consented with the promise of permanent confidentiality of their records, then the exposure of their data could result in psychological trauma to the participants and loss of public trust in the project. On the other hand, if subjects are recruited and consented based on expectation of full public data release, then the above risks to the subjects and the project can be avoided.

Church GM, “The Personal Genome Project,” Molecular Systems Biology (2005)

Thus, the first ten PGP participants — the PGP-10 — identified themselves publicly.


Breaking Good: A Short Ethical Manifesto for the Privacy Researcher

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. We’ll have more contributions throughout the week, extending at least into early next week. Background on the symposium is here. You can call up all of the symposium contributions here (or by clicking on the “Re-Identification Symposium” category link at the bottom of any symposium post).

Please note that Bill of Health continues to have problems receiving some comments. If you post a comment to any symposium piece and do not see it within half an hour or so, please email your comment to me at mmeyer @ law.harvard.edu and I will post it. —MM

By Yaniv Erlich

1. Increase the general knowledge – Like any other scientific discipline, privacy research strives to increase our knowledge about the world. You are breaking bad if your actions are aimed at revealing intimate details of people or, worse, at exploiting these details for your own benefit. This is not science. This is just ugly behavior. Ethical privacy research aims to deduce technical commonalities about vulnerabilities in systems, not about the individuals in these systems. This should be your internal compass.

This rule immediately implies that your published findings should communicate only the information relevant to deducing general rules. Any shocking/juicy/intimate detail that was revealed during your study is not relevant and should not be included in your publication.

Some people might gently (or aggressively) suggest that you should not publish your findings at all. Do not get too nervous about that. Simply remind them that the ethical ground of your actions is increasing the general knowledge. Therefore, communicating your algorithms, hacks, and recipes is an ethical obligation, and without that your actions cannot be truly regarded as research. “There is no ignorabimus … whatever in natural science. We must know — we will know!”, the great mathematician David Hilbert once said. His statement also applies to privacy research.


Re-Identification Is Not the Problem. The Delusion of De-Identification Is. (Re-Identification Symposium)

By Michelle Meyer

This is the second post in Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. We’ll have more contributions throughout the week, extending at least into early next week. Background on the symposium is here. You can call up all of the symposium contributions by clicking here (or by clicking on the “Re-Identification Symposium” category link at the bottom of any symposium post).

Please note that Bill of Health continues to have problems receiving some comments. If you post a comment to any symposium piece and do not see it within half an hour or so, please email your comment to me at mmeyer @ law.harvard.edu and I will post it. —MM

By Jen Wagner, J.D., Ph.D.

Before I actually discuss my thoughts on the re-identification demonstrations, I think it would be useful to provide a brief background on my perspective.

Identification ≠ identity

My genome is an identifier. It can be used in lieu of my name, my visible appearance, or my fingerprints to describe me sufficiently for legal purposes (e.g. a “Jane Doe” search or arrest warrant specifying my genomic sequence). Nevertheless, my genome is not me. It is not the gist of who I am – past, present, or future. In other words, I do not believe in genetic essentialism.

My genome is not my identity, though it contributes to my identity in varying ways (directly and indirectly; consciously and subconsciously; discretely and continuously). Not every individual defines his/her self the way I do. There are genomophobes who may shape their identity in the absence of their genomic information and even in denial of and/or contradiction to their genomic information. Likewise, there are genomophiles who may shape their identity with considerable emphasis on their genomic information, in the absence of non-genetic information and even in denial of and/or contradiction to their non-genetic information (such as genealogies and origin beliefs).

My genome can tell you probabilistic information about me, such as my superficial appearance, health conditions, and ancestry. But it won’t tell you how my phenotypes have developed over my lifetime or how they may have been altered (e.g. the health benefits I noticed when I became vegetarian, the scar I earned when I was a kid, or the dyes used to hide the grey hairs that seem proportional to time spent on the academic job market). I do not believe in genetic determinism. My genomic data is of little research value without me (i.e. a willing, able, and honest participant), my phenotypic information (e.g. anthropometric data and health status), and my environmental information (e.g. data about my residence, community, life exposures, etc). Quite simply, I make my genomic data valuable.

As a PGP participant, I did not detach my name from the genetic data I uploaded into my profile. In many ways, I feel that the value of my data is maximized and the integrity of my data is better ensured when my data is humanized.
