Reflections of a Re-Identification Target, Part I: Some Information Doesn’t Want To Be Free (Re-Identification Symposium)

This post is part of Bill of Health's symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. You can call up all of the symposium contributions here. Please note that Bill of Health continues to have problems receiving some comments. If you post a comment to any symposium piece and do not see it within half an hour or so, please email your comment to me at mmeyer @ law.harvard.edu and I will post it. —MM

By Michelle N. Meyer

I wear several hats for purposes of this symposium, in addition to organizer. First, I’m trained as a lawyer and an ethicist, and one of my areas of scholarly focus is research regulation and ethics, so I see re-identification demonstrations through that lens. Second, as a member of the advisory board of the Social Science Genetics Association Consortium (SSGAC), I advise data holders about ethical and regulatory aspects of their research, including issues of re-identification. I may have occasion to reflect on this role later in the symposium. For now, however, I want to put on my third hat: that of data provider to (a.k.a. research participant in) the Personal Genome Project (PGP), the most recent target of a pair of re-identification “attacks,” as even re-identification researchers themselves seem to call them.

In this first post, I’ll briefly discuss my experience as a target of a re-identification attack. In my discussions elsewhere about the PGP demonstrations, some have suggested that re-identification requires little or no ethical justification where (1) participants have been warned about the risk of re-identification; (2) participants have given blanket consent to all research uses of the data they make publicly available; and/or (3) the re-identification researchers are scholars rather than commercial or criminal actors.

In explaining below why I think each of these arguments is mistaken, I focus on the PGP re-identification demonstrations. I choose the PGP demonstrations not to single them out, but rather for several other reasons. First, the PGP attacks are the case studies with which, for obvious reasons, I’m most familiar, and I’m fortunate to have convinced so many other stakeholders involved in those demonstrations to participate in the symposium and help me fill out the picture with their perspectives. I also focus on the PGP because some view it as an “easy” case for re-identification work, given the features I just described. Therefore, if nonconsensual re-identification attacks on PGP participants are ethically problematic, then much other nonconsensual re-identification work is likely to be as well. Finally, although today the PGP may be somewhat unusual in being so frank with participants about the risk of re-identification and in engaging in such open access data sharing, both of these features, and especially the first, shouldn’t be unusual in research. To the extent that we move towards greater frankness about re-identification risk and broader data sharing, trying to achieve clarity about what these features of a research project do — and do not — mean for the appropriateness of re-identification demonstrations will be important.

Having argued here about how not to think about the ethics of re-identification studies, I plan in a later post to offer some affirmative thoughts about an ethical framework for how we should think about this work.

My Experience

I believe in the importance of research, including genetic research. Our current practices, including — but not at all limited to — medicine, are not nearly as safe, effective, and generally evidence-based as we pretend. And so knowledge production — and how we govern it through statutes, regulations, case law, and the ethical norms on which these were explicitly based — are tremendously and broadly important.

I also believe that for research, and genetic research in particular, to be accurate and to maximally progress, we will often need data from large numbers of participants (this is also a foundational principle of the SSGAC). Finally, innovation works best when all comers have open access to the resulting data. And in the case of genetic research, for all the reasons that Jen and Madeleine eloquently articulate, we need genomic data to be wedded to rich phenotypic and environmental data.

The PGP shares this commitment to open access (see § 9.1 of the consent document in effect when I enrolled and when Latanya Sweeney began her re-identification work on the PGP in September of 2011). This commitment means that any genetic or trait data a participant chooses to give to the PGP will not be held in confidence but “will be made available via a publicly accessible website and database” (see § 10.1) — with one exception:

10.2 Association of Your Name With Your Data. The PGP will not intentionally associate your name with your genomic or trait data or other information that is published to the PGP’s public website and database. The PGP will not intentionally publicly identify you by name as a participant in the PGP without your prior consent. However, as described above, because of the identifiable nature of the information you are providing to the study and generated about you by the study, it is possible that one or more third parties may identify you as a participant in the PGP and associate your published data and information with your name or other information that you have not provided to the PGP and may not have wished to be publicly disclosed.

That is, the PGP does not allow participants to be anonymous vis-à-vis the project directors; indeed, they insist on confirming participants’ identity and want to be sure, for obvious reasons, that participants provide their own samples, and not those of others (see § 4.2 of the consent document). But after the first ten PGP participants (the “PGP-10”), the PGP no longer insisted that participants associate themselves with their data publicly, and indeed itself promised not to intentionally “out” participants to others. Of course, plenty of PGP participants happily out themselves. But plenty do not. (Admittedly, discerning whether a PGP participant intends to associate her profile with her identity or not is a tricky business, as there is currently no field into which participants can choose to enter their name, or not. The PGP’s plans to give participants the option of adding their names and/or photos will greatly help clarify when a participant has chosen to associate herself with her profile and when she has unintentionally uploaded a 23andMe file with her name embedded in the file name, for instance. That said, of the 1,130 participants whose profiles Sweeney reviewed, only 579 had provided their full zip code, date of birth, and gender in the relevant fields. It seems unlikely that many participants who exclude this information have nevertheless intentionally associated themselves with their PGP profile or wish to be associated with it.)

I wanted to donate my genotype and phenotype data to a project committed to open access research, but I was not ready to associate my name with that data. It is probably not coincidental that of the re-identified PGP participants who have told the media that they did not care that they were outed, most were either independently wealthy or of retirement age. I, by contrast, am at neither stage. And so I enrolled in the PGP a few years ago, after scoring the necessary 100% on an exam that tested my knowledge of genetics and the risks of sharing my genomic, trait, and environmental information. I read and signed this consent document, and knowingly assumed the risks that the PGP might unintentionally leak my name and that a third party might intentionally “hack” my profile. Like the countless other risks I assume every day, including crossing the street, I assumed these risks because I deemed that the value of the underlying activity outweighed the magnitude of the harm, discounted by the probability that it would occur. I submitted plenty of health records and other phenotype data and uploaded my raw 23andMe data. I submitted my saliva sample and now wait for the results of my whole genome sequencing.
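One way to make that calculus explicit (my gloss; the notation below is illustrative, not anything in the consent document): participation makes sense only if its value exceeds the sum of the disclosed harms, each weighted by its probability.

```latex
% A hedged formalization of the assumption-of-risk calculus described
% above; V, H_i, and p_i are illustrative notation, not the PGP's.
\[
  V_{\text{participation}} \;>\; \sum_{i} p_i \, H_i
\]
% where H_i is the magnitude of the i-th disclosed harm (e.g.,
% re-identification) and p_i is the participant's estimate of the
% probability that it occurs.
```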

I wasn’t able to attend the PGP’s annual GET (Genomes Environments Traits) Conference earlier this month, but I happened to be streaming it live and chatting with other PGPers on Twitter when one of them tweeted a link to this Forbes.com article:

A Harvard professor has re-identified the names of more than 40% of a sample of anonymous participants in a high-profile DNA study, highlighting the dangers that ever greater amounts of personal data available in the Internet era could unravel personal secrets.

From the onset, the [PGP] has warned participants of the risk that someone someday could identify them, meaning anyone could look up the intimate medical histories that many have posted along with their genome data. That day arrived on Thursday.

With a little digging, I learned that there had in fact been two re-identification demonstrations involving the PGP. As described in the Forbes.com article, Latanya Sweeney used the algorithm based on zip code, birth date, and gender that she made famous in her 1997 re-identification of Massachusetts Governor Bill Weld — and also read some participants’ names directly from their decompressed 23andMe files. And Yaniv Erlich used the algorithm based on Y-chromosome data and surnames that he published in Science earlier this year. Both had booths at the GET Conference demonstrating their techniques.
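To make the first technique concrete, here is a minimal sketch of quasi-identifier linkage, the general approach behind Sweeney's demonstration. It is not her actual code; the file names and column names are hypothetical stand-ins.

```python
# Minimal sketch of quasi-identifier linkage (not Sweeney's actual code).
# Assumes two hypothetical CSVs: an "anonymous" study export and a public
# roster (e.g., a voter list) that share zip, birth date, and gender.
import pandas as pd

profiles = pd.read_csv("pgp_profiles.csv")   # profile_id, zip, birth_date, gender, ...
roster = pd.read_csv("public_roster.csv")    # name, zip, birth_date, gender

quasi_ids = ["zip", "birth_date", "gender"]

# Join the two datasets on the shared quasi-identifiers.
linked = profiles.merge(roster, on=quasi_ids, how="inner")

# A profile is confidently re-identified only when exactly one roster
# name matches its quasi-identifier combination.
matches = linked.groupby("profile_id")["name"].nunique()
unique = matches[matches == 1]
print(f"{len(unique)} of {len(profiles)} profiles linked to exactly one name")
```

The power of the attack comes entirely from the join: neither dataset names the profile on its own, but the combination of three mundane fields is unique for a large share of the population.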

After a little more digging, I found the paper in which Latanya reported her algorithm. Once I had the chance to read about it and to compare it to the information I had provided in my PGP profile page, I concluded with about 99.9% certainty that I was not among those who had been re-identified by either attack: I had not provided all of the information used in Latanya’s algorithm. Heeding the PGP’s own pop-up warning, I had scrubbed my 23andMe file of my name before uploading it (although I leave room for a 0.1% chance that I somehow did not thoroughly scrub it). And my 23andMe results confirmed my longstanding suspicion that I do not have a Y chromosome, so that takes care of Yaniv’s study.
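That scrubbing step, incidentally, can be checked mechanically. A hedged sketch (the archive path is hypothetical) of inspecting a compressed 23andMe export for the name-in-the-file-name leak described above:

```python
# Check whether a compressed 23andMe export leaks a name via its internal
# file name, one of the vectors reported in the PGP demonstration.
# "genome_export.zip" is a hypothetical path.
import zipfile

with zipfile.ZipFile("genome_export.zip") as zf:
    for info in zf.infolist():
        # e.g., "genome_Jane_Doe_v4.txt" would leak a participant's name
        print(info.filename)
```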

Since I signed a document acknowledging that this was “possible,” I was of course not surprised that PGP participants could be re-identified by third parties. But I was surprised that many observers viewed the fact that PGPers had been re-identified as, variously, really cool or no big deal.

Assuming the Risk of Being Re-Identified Does Not Constitute Consent To Be Re-Identified

The most common argument that has been made for why PGP participants who were re-identified have no basis for complaint is that they explicitly consented to be re-identified. By signing the consent document, each PGP participant formally acknowledges that, although the PGP won’t intentionally name her,

because of the identifiable nature of the information you are providing to the study and generated about you by the study, it is possible that one or more third parties may identify you as a participant in the PGP and associate your published data and information with your name or other information that you have not provided to the PGP and may not have wished to be publicly disclosed. (§ 10.2)

See also § 7.1(a)(v). In this entire episode, what has surprised me the most is how readily so many people have concluded that assuming the risk of re-identification constitutes giving permission to be re-identified. Consider some of the other risks assumed by PGP participants:

  • Although the PGP won’t do so, a third party could share your PGP data with your health care provider or insurer or include it in your medical record (§ 10.4)
  • A third party could use your DNA sequence to “make synthetic DNA and plant it at a crime scene, or otherwise use it to falsely identify you” (§ 7.1(a)(iii)(4)); at least some PGP participants are more concerned about this risk than the risk of re-identification (see paragraph 8 of the Forbes.com article)
  • You could be subject to actual or attempted employment, insurance, financial, or other forms of discrimination or negative treatment due to the public disclosure of your genetic and trait information (§ 7.1(a)(iv))
  • If you decide not to enroll in the study and not to publish any of the data you have given the PGP, it “is still possible that your DNA sequence data will be publicly disclosed due to unintended data breaches, including hacking or other activities outside of the procedures authorized by the PGP” (§ 7.1(b))
  • “[A] third party could access your publicly available sequence data or other information, change it and republish it to [falsely] suggest that you had a propensity for a disease or other detrimental trait” (§ 7.1(c))
  • “[I]t may one day be possible for a third party to use, without your or the PGP’s authorization, cell lines or biological materials derived from your cell lines for new or unexpected reproductive or other purposes, including cloning.” (§ 7.2(a))

Some of these third-party behaviors would give rise to legally cognizable claims, while others would merely fall into the category of Deeply Obnoxious Things People Can Do To One Another Thanks to the Internet. As Steve's post reminds us, the legality of re-identification varies depending on jurisdiction, and even in those jurisdictions where it is not explicitly forbidden, it should probably be considered an open, or at least evolving, legal question in these relatively early days of big data and re-identification. In any event, to focus on the (il)legality of behaviors that participants have acknowledged may (not will) occur is to miss the point, which is that assuming the risk that an event will occur does not constitute giving permission for someone to ensure that that event does, in fact, occur. Surely the fact that I acknowledge that it is possible that someone will use my DNA sequence to clone me (not currently illegal under federal law, by the way) does not mean that I have given permission to be cloned, that I have waived my right to object to being cloned, or that I should be expected to be blasé or even happy if and when I am cloned.

I would go so far as to say that it is not even clear that it is less ethically problematic to nonconsensually re-identify someone who has been warned of the risk of re-identification than it is to nonconsensually re-identify someone who has not been so warned. If a pickpocket targets two people, only one of whom is walking in an area where a sign is posted that reads Warning: Pickpockets May Lurk Here, do we think the pickpocket's actions toward his warned victim were any less egregious than his actions toward the unwarned victim?

Unlike Steve (in a comment here), I don’t see the PGP consent document as primarily a way for the project to immunize itself from liability (although perhaps I’m being naïve); I see it primarily as a manifestation of its laudable ethical commitment to be honest with prospective participants about what the PGP (and other data holders) can and cannot guarantee. Yet, if it’s ethical for data holders to tell data providers that their anonymity cannot be guaranteed, then it becomes very important to understand what it does — and does not — mean for participants to acknowledge that disclosure. It would be bizarre if the PGP and other data holders who follow suit were punished for their frankness by a policy that treats disclosure and acknowledgement of risk by data holders and data providers, respectively, as permission that obviates the need for consent.

The Ethics of Naming Public Profiles

So I don't think that PGP participants explicitly consented to be re-identified. But should we be troubled by that? Two arguments that have been made in the wake of the PGP demonstrations suggest that we should not be troubled by nonconsensual re-identification demonstrations, at least under certain circumstances. I am persuaded by neither.

First, it has been noted, re-identification demonstrations rely — mostly* — on publicly available information. (*Demonstrators usually confirm the accuracy of their re-identification with the data holder [or, occasionally, with the individuals themselves], but I bracket for the time being the questions such scoring processes raise.) Assuming that the data providers consented to having what we might call the “predicate data” publicly available, how can simply looking at two or more sets of such publicly available predicate data, and then drawing inferences from their combination, raise any ethical issues?

The answer is that re-identification demonstrators do not just passively and casually look; as Steve suggests, they seek to generate new data that is not in the public domain (or privately given to them) — such as the fact that a rich online profile belongs to Jane Doe of 123 Apple Lane. Of course, just about all research generates new information that is not only generalizable beyond the participants but is also “about” the participants themselves. By studying Facebook “likes,” for example, researchers were able to predict users’ race, age, IQ, sexuality, personality, substance use and political views with degrees of accuracy that ranged from 60% to 95%. Notably, participants consented to that study (through https://mypersonality.org/wiki/doku.php).

Yaniv rightly advises his fellow re-identification researchers:

Your study is quite likely to utilize data from research participants, various Internet websites, and computational tools. These wonderful resources are usually available under a policy or some terms of use. You must adhere to these terms. Do not re-identify datasets with policy use that prohibits that.

A case can be made that re-identification, without consent, violates even the PGP’s generous policy of allowing all comers to have a go at participant data. To review, the understanding that the PGP establishes with participants is as follows:

  1. you choose what data to provide to the PGP (except your name, which is mandatory), but
  2. whatever data you do provide to the PGP (except your name) will be published for all to see and analyze in pursuit of any research question, without reconsenting you (“blanket consent”).

What are we to make of re-identification research within this framework? One view is that re-identification research is simply one kind of study to which PGP (and similar) participants have waived their right to give or refuse specific consent. But the fact that re-identification research necessarily and by design creates additional information that the participant chose not to provide complicates this analysis. If we interpret tenet (2), above, to preclude participants from objecting to research that creates new personally identifiable information, we thereby significantly erode tenet (1) — participants’ right to choose which data they donate to science. This complicates the case that re-identification studies are fully acceptable within the ethos of the PGP.

Note, too, that if we accept the argument that participants giving blanket consent to research using their data thereby consent to being re-identified, then we must also accept that participants have consented to have their names published along with the algorithms used to re-identify them. After all, that, too, is simply a research use of publicly available data to which participants have given blanket consent. If our intuition is that participants have not, in fact, consented to be publicly associated with their data, then we should doubt any intuition that they have consented to be re-identified in the “privacy” of a scholar’s lab, either.

We’re Researchers; We Wear the White Hats

That observation casts doubt on a second dubious argument about why we should not be especially worried about nonconsensual re-identification demonstrations: namely, because they are benign compared to re-identification by commercial and criminal entities. Re-identification demonstrators typically and somewhat ironically justify the value of their work by pointing to the dangers of data mining in these other hands: by showing data holders and data providers how their personal data is vulnerable, and by suggesting ways to close those vulnerabilities, demonstrators provide valuable information about the dangers of re-identification and how to mitigate them. Their admitted good intentions and even good works do not, however, mean that privacy researchers' own re-identification efforts need no justification.

It's true that re-identification scholars usually refrain from publishing re-identified names (although most commercial and criminal entities refrain from doing so as well). More importantly, it's surely the very rare re-identification scholar who uses identification to stigmatize, discriminate against, or steal from her target (although much commercial re-identification and other forms of data mining are also used for purposes that many find benign, such as more effectively suggesting books or movies they may like). This is all well and good. But it doesn't mean that no privacy intrusion occurs when a scholar re-identifies someone without their consent.

Years ago, very early in my doctoral training in ethics, I spent quite a bit of time at a major teaching hospital rotating through all the ICUs and other major units in order to learn about ethical issues in context. One day I was shadowing a busy physician. Before she rushed off to do something else, she handed me the chart of a patient with sickle cell disease and told me to go talk to him in the examining room. “He’s an interesting guy,” she said before disappearing, “an interesting case.” I dutifully shuffled in, not sure exactly what questions to ask him. As it turned out, I wouldn’t have to come up with any, as he quickly began questioning me. He wanted to know who I was, and I explained that I was getting a PhD in ethics. He wanted to know why I was at the hospital, and I explained that I was there to learn. And then he wanted to know whether I thought it was ethical to look at his chart without his permission.

Oh.

We know from considerable empirical research that individuals’ privacy preferences are highly contextual. Those preferences tend to depend not only on what kind of information is disclosed, but by whom, to whom, and for what purpose. I may not care much or at all that some company’s IT hack whom I’ll never meet has associated my name with potentially sensitive information in order to target me for a particular type of pharmaceutical or baldness cure. But I may care very much that a colleague at my own university, who I may see in the bookstore or the faculty lounge, has re-identified me, whatever her purpose.

Producing generalizable knowledge is valuable for society, and it’s valuable (and enjoyable) for researchers who are understandably and rightly passionate about their work. But the risks of human subjects research are concentrated on the subjects, while the benefits of knowledge production tend to be diffused. It’s true that the more thoughtful re-identification scholars often go out of their way to provide benefits not only to society but also to re-identification targets themselves. But such benefits, even when they are provided to targets, do not exhaust the ethical obligations of re-identification researchers.

To see why, imagine a guild of security experts who patrol the streets looking for especially gullible tourists. They pick the tourists' pockets, then immediately tap them on the shoulder, return their wallets, and explain how they can protect themselves from the real pickpockets in the future. The harm done to the targets arguably would be de minimis. And the security tips could be of real value. Even so, we can imagine that many would not welcome these pickpocket Samaritans.

The risk-benefit profile of nonconsensual re-identification research is even more uncertain. In the case of a re-identification attack, the wallet cannot be returned; the privacy bell cannot be unrung. Once a re-identification scholar learns that a rich online profile belongs to Jane Doe of 123 Apple Lane, she cannot unlearn that fact. Nor, given the considerable heterogeneity of people’s preferences about research risks and benefits, can it be assumed that Jane will necessarily value the security tips bestowed upon her after the fact more than she disvalued being re-identified without her consent. In this respect, nonconsensual re-identification attacks are more like the old New York City squeegee guys than our imaginary guild of pickpocket Samaritans.

I’ve resisted here — quite strongly, at times — some of the arguments that I’ve encountered that re-identification under certain circumstances requires little or nothing in the way of ethical justification. But re-identification demonstrations can have value, and they can be done ethically. In my next post, I’ll offer some thoughts about best practices for both re-identification researchers and data holders.

2 thoughts to “Reflections of a Re-Identification Target, Part I: Some Information Doesn’t Want To Be Free (Re-Identification Symposium)”

  1. Thank you Michelle for writing this! You’ve outlined with a lot of clarity some excellent reasoning for concerns about re-identification research. As a “data holder” my thoughts leap to thinking about solutions, of course, but I expect you’ll have some similarly excellent thoughts in your next post.

    After thinking about what you’ve written here, there was one point you made where I wonder whether it “proves too much” (https://en.wikipedia.org/wiki/Proving_too_much).

    From this section: “Note, too, that if we accept the argument that participants giving blanket consent to research using their data thereby consent to being re-identified, then we must also accept that participants have consented to have their names published along with the algorithms used to re-identify them.”

    I wonder how this same reasoning would apply to similar situations predicting information potentially considered sensitive from biological profiling data (genome, microbiome, immune repertoire, brain scan). For example: homosexuality, antisocial personality disorder (ASPD), history of STD infection (from immune or microbiome). Would we consider it acceptable for researchers to try to predict these? Would we consider it acceptable to publish a specific list of participants they strongly predict to have these traits (let’s say by name, from folks that shared names)? I suggest we instinctively feel the same way here, that publishing specific name-associated predictions is less acceptable than privately performing research on making these predictions.

    The abstraction here is that researchers try to predict information A based on information B. Sometimes A could be sensitive information.

    I don’t have an answer, and I don’t think this means your thoughts are wrong. Maybe there’s simply a lot of gray areas (as with so many ethical issues). Some related thoughts I’ve had, and have reached no conclusions:

    * Maybe it’s harder to take other hypothetical predictions of sensitive information seriously because it’s currently not possible with particularly high accuracy.

    * In 2009, researchers published that Watson’s ApoE haplotype could be predicted from surrounding DNA, and they pointedly do not describe performing the analysis on his data, out of respect for his privacy. It’s unclear to me whether they did or did not perform that analysis privately (they neither admit nor deny). https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2986051/

    * On that topic, ApoE may be an example of such sensitive information. GET-Evidence is a freely shared database that is used to create genome interpretations. (https://evidence.personalgenomes.org). These interpretations are the “genome reports” participants see, and we link the interpretations to their public PGP profiles. Thus, we have published Alzheimer’s risk predictions linked to specific participants.

    * Maybe our instinct is that the ApoE report is okay because participants get to see these reports before the data goes public, and participants tend to be highly aware of this particular case. How would we feel if someone added a new similarly “strongly predictive about sensitive information” item to GET-Evidence and the participant reports automatically reflected this new prediction?

    * More concretely: what if someone added “surname prediction” to GET-Evidence’s genome interpretation?

    1. Thinking about it more, identity (in the form of “someone’s name”) is a very different sort of “sensitive information” — a key that unlocks orders of magnitude more information. Maybe it’s a very different beast.
