Press and Reporting Considerations for Recent Re-Identification Demonstration Attacks: Part 2 (Re-Identification Symposium)

By Michelle Meyer

This post is part of Bill of Health’s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. Background on the symposium is here. You can call up all of the symposium contributions by clicking here. —MM

Daniel C. Barth-Jones, M.P.H., Ph.D., is an HIV and infectious disease epidemiologist. His work in the area of statistical disclosure control and implementation under the HIPAA Privacy Rule provisions for de-identification is focused on the importance of properly balancing competing goals of protecting patient privacy and preserving the accuracy of scientific research and statistical analyses conducted with de-identified data. You can follow him on Twitter at @dbarthjones.

Forecast for Re-identification: Media Storms Continue…

In Part 1 of this symposium contribution, I wrote about the re-identification “media storm” started in January by the Erlich lab’s “Y-STR” re-identifications, which made use of the relationship between Short Tandem Repeats (STRs) on the Y chromosome and paternally inherited surnames. Within months of that attack, April and June brought additional re-identification media storms, this time surrounding the re-identification of Personal Genome Project (PGP) participants and a separate attack matching 40 persons within the Washington State hospital discharge database to news reports. However, as has sometimes been the case with past reporting on other re-identification risks (a pattern I have written about before), accurate and legitimate characterization of re-identification risks has, unfortunately, once again been overshadowed by distortive and exaggerated reporting on some aspects of these re-identification attacks. A careful review of both the popular press coverage and the scientific communications for these recent re-identification demonstrations reveals some highly misleading communications, the most egregious of which incorrectly informs more than 112 million persons (more than one third of the U.S. population) that they are at potential risk of re-identification when they would not actually be unique and, therefore, re-identifiable. While each separate reporting concern that I’ve addressed here is important in and of itself, the broader pattern that can be observed in these communications about re-identification demonstrations raises some serious concerns about the impact that such distortive reporting could have on the development of sound and prudent public policy for the use of de-identified data.

Reporting Fail (and after-Fails)

University of Arizona law professor Jane Yakowitz Bambauer was the first to call out the distortive “reporting fail” for the PGP “re-identifications” in her blog post on the Harvard Law School Info/Law website. Bambauer pointed out that a Forbes article (written by Adam Tanner, a fellow at Harvard University’s Department of Government and a colleague of the re-identification scientist) covering the PGP re-identification demonstration was misleading with regard to a number of aspects of the actual research report released by Harvard’s Data Privacy Lab. The PGP re-identification study attempted to re-identify 579 persons in the PGP study by linking their “quasi-identifiers” {5-digit Zip Code, date of birth and gender} to both voter registration lists and an online public records database. The Forbes article led with the statement that “more than 40% of a sample of anonymous participants” had been re-identified. (This dubious claim was also repeated in subsequent reporting by the same author, in spite of Bambauer’s “call out” of the inaccuracy explained below.) However, the mischaracterization of this data as “anonymous” really should not have fooled anyone beyond the most casual readers. In fact, approximately 80 individuals among the 579 were “re-identified” only because they had their actual names included within file names of the publicly available PGP data. Some two dozen additional persons had their names embedded within the PGP file names, but were also “re-identifiable” by matching to voter and online public records data. Bambauer points out that the inclusion of the named individuals was “not relevant to an assessment of re-identification risk because the participants were not de-identified,” and quite correctly adds that “Including these participants in the re-identification number inflates both the re-identification risk and the accuracy rate.”

As one observer humorously tweeted after reading Bambauer’s blog piece,

It’s like claiming you “reidentified” people from their high school yearbook.

Yet the Forbes coverage was not nearly as distortive as some of the other epic reporting fails coming from the MIT Technology Review or, worse, Health Security Solutions.com, from which the naïve reader would come away with the profoundly incorrect understanding that the study

“…elucidated the genome of more than 1,000 survey participants for the Personal Genome Project” and “using only zip code, birthdate, and gender in conjunction with access to voter registration records, …accurately re-identified 84 to 97 percent of participants when using first-name variations and nicknames.” [Emphasis Added]

Amazingly, each echo of the original distortive Forbes reporting on the PGP re-identification seems to get more distant from the truth and even more broadly disseminated. The Guardian, the second most popular British newspaper website, reported on August 12, 2013 that supposedly:

“In one recent example, a Harvard professor was able to re-identify almost half of participants in a genetics study by cross-referencing records from its results database with publicly available information. The whole re-identification process was done without individuals’ names, and using only three pieces of data – gender, age and postal code.” [Emphasis Added]

One really has to wonder whether the Guardian reporter who wrote this wildly inaccurate paragraph had even bothered to read the original Data Privacy Lab research paper, because the simple objective facts regarding the use of the already-embedded names were, at least, spelled out quite clearly there.

Some of this reporting is just so wrong that it might be laughable if not for the fact that it was sure to be believed by a large percentage of readers who won’t bother to read the source materials and fact-check this inaccurate reporting. Although the more egregious reporting failures should clearly warrant public retractions from the editors of these news organizations, thus far, no such corrections have been issued. And if retractions are ever offered, it’s doubtful that the effort put into consistently correcting the entire record would be sufficient to properly offset inaccurate public perceptions regarding how the PGP attack and other re-identification demonstrations have been repeatedly portrayed in the press. (I’ll have more to say about the media’s role in re-identification reporting toward the end of this blog piece.)

Correcting the Record

The manuscript describing the PGP attack indicated that 241 of the 579 PGP participants providing 5-digit Zip Code, gender and full birthdate (42 percent) were able to be matched to unique names. However, once the study results are corrected to properly remove individuals who were “re-identified” only by virtue of their embedded names, a total of about 161 (= 241 – 80) persons (28 percent) were able to be uniquely matched to either voter or public records data. False positive (incorrect) match percentages were noted in the report’s text as 16 percent (i.e., with 84 percent correct), but perhaps as low as 3 percent (with 97 percent correct) if allowance was made for possible nicknames for those who could be matched to a unique individual. After additional adjustments removing false positive matches are made, the percentage of confirmed re-identifications using the combined re-identification data sources was apparently between 23 percent (135/579) and 27 percent (156/579) based on the information reported by the Data Privacy Lab.[1] It is notable, though, that the false positive rates for this study were likely importantly lower than would otherwise have been expected, because the researchers also possessed access to the more than 100 names that were embedded within the PGP data file names.
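For readers who wish to trace the arithmetic behind these corrections, here is a minimal sketch (the variable names are mine) using only the counts reported above:

```python
# Corrected PGP re-identification arithmetic, using only the counts
# summarized above from the Data Privacy Lab manuscript.
participants = 579          # PGP participants providing ZIP5, full DOB, and gender
unique_name_matches = 241   # matched to unique names (the reported "42 percent")
embedded_names_only = 80    # "re-identified" solely via names embedded in file names

corrected = unique_name_matches - embedded_names_only   # 161
print(f"{corrected}/{participants} = {corrected / participants:.0%}")  # ~28%

# Confirmed re-identifications after also removing false positive matches
# (a range, because the report's tables and text do not total consistently;
# see footnote 1):
for confirmed in (135, 156):
    print(f"{confirmed}/{participants} = {confirmed / participants:.1%}")  # 23.3%, 26.9%
```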

The achieved 23 to 27 percent re-identification results (without reliance on the 80 cases via embedded names only) in the PGP study come in at about one third of the widely renowned previous 87 percent re-identification estimate for the United States population using 5-digit Zip Code, date of birth and gender.

112 Million Disserved: Not as Unique as You’ve Been Told…

Amazingly though, the distortions described above were not by any means the most egregious public-trust “fail” involved in this re-identification demonstration and its reporting. In connection with the PGP re-identifications, the Harvard Data Privacy Lab constructed a “How Unique Are You?” website providing a service that allows individuals to determine how unique their demographics may be, and thus how easily they might be re-identified, from the combination of their ZIP Code, date of birth, and gender.

Information directing readers to the website was provided in the research paper posted online by the Data Privacy Lab and in two Forbes news articles which mentioned the PGP re-identification attack, and was also promoted by Harvard’s Institute for Quantitative Social Science on Twitter and Facebook. The website service, however, provides extremely distorted results with regard to the true risk of being unique and, therefore, at potential risk of re-identification. More than 112 million people (more than one third of the total U.S. population) would be reported as being “easily identifiable” (based on the 2010 Decennial Census data) when, in fact, they wouldn’t be unique on their combination of ZIP Code, date of birth, and gender.

Of course, the U.S. Census data does not report results by date of birth – it only provides age in years, gender and Zip Code Tabulation Area within the 2010 Census “PCT12” table that underlies these calculations. However, the precise expectation for the number of individuals born on a given day and month of the year can be correctly calculated under the same assumption stated on the Data Privacy Lab website that “all birthdays are equally likely and evenly distributed”. Simple logic tells us that whenever more than 365 people fall into only 365 days within a year, some birthdays must be shared by more than one person. Yet analysis of the results returned by the website shows that all individuals are informed that they are “Easily identifiable by birthdate (about 1)” until the number of persons in their {year of birth, gender and Zip Code} combination reaches 730 (2 × 365) individuals.[2] The extent of this error is astonishing. For example, with 729 persons in the same {year of birth, gender, Zip Code} equivalence class, the probability of an individual having a unique birth day is actually only 13.6 percent. Even with just 365 persons in a birth year (one for each day in the year), when they fall into the days of the year by random chance, nearly two thirds of the individuals would turn out not to be unique, because they would have randomly fallen into a day occupied by more than one person.
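The correct calculation is straightforward. Under the website’s own stated assumption of uniform, independent birthdays, a person in an equivalence class of n people sharing the same {year of birth, gender, Zip Code} is unique on their full birthdate with probability (364/365)^(n−1). A minimal sketch (the function name is mine):

```python
def p_unique_birthdate(n: int, days: int = 365) -> float:
    """Probability that a given person's birth day-and-month is shared by
    none of the other n - 1 people in their {birth year, gender, Zip Code}
    equivalence class, assuming uniform, independent birthdays."""
    return ((days - 1) / days) ** (n - 1)

for n in (365, 729, 730):
    print(f"n = {n:>3}: P(unique) = {p_unique_birthdate(n):.1%}")
# n = 365: P(unique) = 36.9%  (so nearly two thirds are NOT unique)
# n = 729: P(unique) = 13.6%  (yet the website still reports "about 1")
# n = 730: P(unique) = 13.5%
```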

When the correct statistical calculations are applied to the 2010 Census data, we find that a total of 194,806,415 persons (62.3 percent) would be unique on the basis of these three characteristics. The errant Data Privacy Lab website approach, however, would report some 307,512,197 persons (98.4 percent of the U.S. population) as being unique and easily identifiable. It should be noted that the correct 62 percent calculation closely tracks with the 61 percent (for the 1990 Census) and 63 percent (for the 2000 Census) uniqueness results previously reported by Philippe Golle in 2006, when he was unable to validate the 87 percent estimate produced in 1997. The website’s errant calculation method would inform only 1.6 percent of the U.S. population that they were not “easily identifiable” when, in reality, about 38 percent would not be unique.
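To see how the two approaches diverge, the correct aggregate calculation sums n × (364/365)^(n−1) over every {age, gender, ZCTA} cell of the PCT12 table, while the website’s rule instead counts everyone in any cell smaller than 730 as unique. A sketch with illustrative, made-up cell counts (the real analysis would iterate over the full 2010 Census table, yielding the 62.3 versus 98.4 percent figures above):

```python
# Illustrative only: a handful of made-up {age, gender, ZCTA} cell sizes.
cell_counts = [12, 48, 150, 365, 729, 1500]

def expected_unique(n: int, days: int = 365) -> float:
    # Each of the n people in the cell is unique on their full
    # birthdate with probability ((days - 1) / days) ** (n - 1).
    return n * ((days - 1) / days) ** (n - 1)

total = sum(cell_counts)
correct = sum(expected_unique(n) for n in cell_counts)
naive = sum(n for n in cell_counts if n < 730)   # the website's cutoff rule

print(f"correct expectation: {correct:.0f}/{total} unique ({correct / total:.1%})")
print(f"website's rule:      {naive}/{total} unique ({naive / total:.1%})")
```

With these made-up cells, the cutoff rule declares 1,304 of 2,804 people unique while the correct expectation is roughly 412, illustrating the direction and scale of the inflation.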

Given that the “How Unique Are You” website’s miscalculations greatly exceed the earlier 87 percent estimate (which was finally acknowledged as an “upper bound” in the Data Privacy Lab’s PGP paper), it’s hard to understand how the Harvard team could present such a severe distortion of the scientific reality when constructing their website without recognizing and correcting this error. The bottom line, though, is that by not reporting the correct probabilities of uniqueness, the website would misinform more than 112 million persons who would not actually be unique that they are at potential risk of re-identification. No matter how an error of this magnitude occurred, the incorrect results produced by this much-publicized website serve as a powerful tool for fear-mongering on an issue that has been addressed and mitigated by the HIPAA Privacy Rule for over a decade now.

Begging the Question?

One particularly troubling aspect of the Data Privacy Lab report of the PGP re-identification is that it cited my previous paper on the famous 1997 re-identification of Massachusetts Governor William Weld as a motivation for the PGP attacks.  In that paper, I used U.S. Census data and Golle’s correct probability calculations to show that the probability of there being an unregistered male voter who shared Weld’s full birthday in his same Zip Code would have been somewhere between 32 and 38 percent – a result that is a far cry from the purportedly definitive “re-identification” that has been repeatedly portrayed in the continued reporting of the “Weld” attack. The Data Privacy Lab PGP report states that:

“Recently, others challenged whether there really is any vulnerability to being reidentified by date of birth, gender and ZIP, citing a lack of documented examples and being confused about whether Weld was re-identified because he was targeted or because his demographics were unique [Barth-Jones, 2012 SSRN citation]; begging the question to be revisited. Can people be re-identified by date of birth, gender, and 5-digit ZIP?” [Emphasis Added]

The problem with this misguided and misdirecting citation is that my paper, in fact, never “challenged whether there really is any vulnerability to being reidentified by date of birth, gender and ZIP”. Rather, my paper clearly stated (on page 9) that:

“With the benefit of hindsight, it is apparent that the Weld re-identification has served an important illustration of privacy risks that were not adequately controlled prior to the advent of the HIPAA Privacy Rule in 2003. It is now quite clear that simple combinations of high resolution variables, like birthdates and ZIP codes, can put an unacceptable portion of the population at risk for potential re-identification. So even though Weld’s “re-identification” plainly wasn’t achievable using the now-famous Cambridge voter list linkage attack, his unfortunate collapse in 1996 has still been a positive force for American privacy policy. The message learned from the Weld re-identification is now widely understood: We must pay attention to the potential for re-identification when data has only been stripped of the directly identifying information such as names and addresses.” [Emphasis from original repeated here]

Moreover, I also strongly disagree that the questions I had raised about the famous Weld re-identification (or the broader issues involving the challenges of building near-perfect population registers) were in any way “begging the question to be revisited” with new individual-subject demonstration attacks like the PGP re-identifications. On the contrary, had I been serving on an Institutional Review Board (IRB) tasked with protecting the rights and welfare of human research subjects in this particular study, I would have strongly questioned whether this latest attack could possibly have produced any advancement of the re-identification science needed to support sound public policy, because even 29 percent re-identification risks (which I had already estimated using probabilistic methods in my 2012 Weld re-identification paper) are simply unacceptably high if re-identification were to be attempted with these quasi-identifiers.

In point of fact, the Data Privacy Lab’s PGP re-identification study has actually provided an important validation of the “myth of the perfect population register” concerns that I raised in my SSRN paper on the William Weld re-identification attack. Rather than the famous 87 percent re-identification rate, which has been repeated so often that it has nearly reached the status of an urban myth, the “real world” achievable re-identification risks reported in the April 2013 PGP re-identification study appear to be around one fourth to one third of the 1997 theoretical results. The PGP 23-to-27 percent re-identification rates, even though obtained by combining two separate outside data sources and exerting a good deal of effort, are still lower than the 29 percent of potential re-identifications predicted in my paper for the famous Cambridge, Massachusetts, re-identification attacks, as can be easily seen in Figure 1 from page 4 of my paper, reproduced below.[3]

Figure 1. Estimated Proportion of the Cambridge Population Subject to Potential Re-identification Risk

There is no question that risks of 29 percent (or even 23 percent) are simply unacceptable as possible uncontrolled re-identification risks and would not qualify as de-identified under the HIPAA Privacy Rule de-identification provisions. However, if the HIPAA Safe Harbor standards, which allow reporting of only 3-digit Zip Codes and year of birth, had been applied to the PGP data, the achievable re-identifications would have plummeted to the point where it is unlikely that a single person could have been re-identified on the basis of their 3-digit Zip Code, Year of Birth and Gender.
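A rough back-of-envelope calculation (the inputs below are my own illustrative assumptions, not figures from the study) shows why Safe Harbor generalization is so effective: replacing full birthdate with birth year, and 5-digit with 3-digit Zip Codes, increases the expected size of each demographic equivalence class by several orders of magnitude.

```python
# Back-of-envelope comparison of average equivalence-class sizes.
# All inputs are rough, illustrative assumptions.
population = 309e6            # approximate 2010 U.S. population
zip5, zip3 = 42_000, 900      # approximate counts of 5- and 3-digit Zip areas
sexes, birth_years, days = 2, 80, 365

cells_quasi = zip5 * sexes * birth_years * days   # {ZIP5, gender, full DOB}
cells_safe = zip3 * sexes * birth_years           # {ZIP3, gender, birth year}

print(f"avg per {{ZIP5, gender, DOB}} cell: {population / cells_quasi:.2f}")   # ~0.13
print(f"avg per {{ZIP3, gender, YOB}} cell: {population / cells_safe:,.0f}")   # ~2,146
```

An average occupancy well below one means that most occupied cells contain exactly one person, so uniqueness is the norm; an average of roughly two thousand persons per cell means that essentially no one is unique.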

The report from the Data Privacy Lab explains their much lower than expected achieved re-identification risks as stemming from: 1) temporal mismatches in the data (the quintessential issue of “data divergence” explained in some detail in the Elliot and Dale reference from my Weld paper); 2) use of incomplete voter data; 3) data quality problems; and 4) the earlier 87 percent prediction finally being acknowledged as a theoretical “upper bound” rather than a proper estimate.

These are, in fact, the very same “perfect population register” complications mentioned in my earlier work. That these important limitations faced by such re-identification attempts have now been so unambiguously illustrated by the Data Privacy Lab’s own research can, I believe, be taken as a positive sign for the future balancing of public policy with regard to real-world re-identification risks.

However, as gratifying as it is to have my points about the practical limits of re-identification validated, I don’t believe this outcome could have served as sufficient motivation for a re-identification demonstration which took the unnecessary step of re-attaching identities to human subjects. I’ll say more about my reasons for this after I’ve addressed the June Washington State data re-identification attack and its reporting.

Tempest in a Teapot

The second demonstration attack to be discussed here broke in the news on the morning of June 5, 2013, when Jordan Robertson, a reporter for Bloomberg News, tweeted that he was “thrilled to be presenting results of a yearlong hospital-privacy project….” The results of this lengthy collaboration between Bloomberg News and the Harvard Data Privacy Lab were presented at a Health Privacy Summit in Washington, DC; reported in an online manuscript; and covered on three separate dates by Bloomberg News under the headlines “States’ Hospital Data for Sale Puts Privacy in Jeopardy”, “Patients ID’d From Hospital Records Trigger State Reviews”, and “Your Medical Records Are for Sale”. The Data Privacy Lab manuscript posed the research question, “Can patients be re-identified in today’s State health data?” and, once again, cited my earlier work re-examining the Weld re-identification as a motivation for these new attacks.

This re-identification demonstration and the associated reporting warrant a complete examination with the same scrutiny that I’ve applied to other re-identification demonstrations regarding the methods, results, resultant policy assertions, initial media reporting and the subsequent, increasingly distortive blog/tweet “echo-chamber” references; but I’ll just briefly summarize some key points here.

The basic facts associated with this latest re-identification attack are fairly straightforward, although the justifications for its conduct and reporting are much more muddled:

  • The Washington State Comprehensive Hospital Abstract Reporting System (CHARS) Hospital Discharge Database used in the attack contained records for 648,384 hospitalizations in 2011 and is released to provide public health personnel, consumers, purchasers, payers, providers, and researchers with information for making informed decisions on health care.
  • A search of the LexisNexis newspaper archive for news stories printed in 2011 in Washington State newspapers containing the word “hospitalization” and referring to 2011 hospitalizations yielded 66 distinct news stories referencing 111 persons, of whom 90 were selected as “re-identification” targets by virtue of their name or address being already present within the news article.
  • Through a complicated computer-matching process requiring detailed understanding of hospital claims coding specifications (and several pages of the manuscript to describe), 35 unique matches were made between the news stories and the hospitalization records. A human investigator also spent two days attempting to re-identify five news reports which could not be matched through the automated computer process, corresponding to stories about a Congressman, a soccer player, a sky-diving accident victim and two other individuals. All five hospitalization records corresponding to these news reports were found by the investigator during the two days allotted to this task. In all, a total of 40 individuals (35 by the computer matching algorithm and five by intensive web-based “detective” work) were able to be re-identified by the project team. (A generic sketch of this kind of linkage appears after this list.)
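The manuscript describes the actual matching process in detail; purely as an illustration of the general news-to-discharge linkage pattern, here is a sketch in which every field name and matching criterion is hypothetical, not taken from the study:

```python
# Purely illustrative: the general shape of a news-to-discharge linkage.
# Field names and criteria here are hypothetical; the actual process
# relied on detailed hospital claims coding specifications.
from datetime import date

news_reports = [
    {"zip": "98101", "age": 34, "gender": "M", "event_date": date(2011, 7, 4)},
]
discharges = [
    {"zip": "98101", "age": 34, "gender": "M", "admit_date": date(2011, 7, 4),
     "record_id": "A-0001"},
    {"zip": "98101", "age": 34, "gender": "M", "admit_date": date(2011, 7, 5),
     "record_id": "A-0002"},
]

def candidate_matches(report, records, window_days=2):
    """Discharge records agreeing on demographics and admitted within
    window_days of the event described in the news report."""
    return [r for r in records
            if r["zip"] == report["zip"]
            and r["age"] == report["age"]
            and r["gender"] == report["gender"]
            and abs((r["admit_date"] - report["event_date"]).days) <= window_days]

for report in news_reports:
    hits = candidate_matches(report, discharges)
    # Only a single surviving candidate counts as a unique match;
    # here two records survive, so no re-identification is claimed.
    print("unique match" if len(hits) == 1 else f"{len(hits)} candidates")
```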

The question posed by this “re-identification” demonstration was “Can patients be re-identified in today’s State health data?” and the Data Privacy Lab’s results indicated that, even though the data failed to meet the HIPAA Safe Harbor de-identification standard, only 1 in 16,200 (40/648,384) hospitalizations could indeed be found within the hospital data in this experiment (when names or addresses and information in the news reports for the target individuals were already known). In fact, 99.994 percent of the Washington State hospitalization records weren’t successfully attacked by the efforts of the researcher/journalist team, but, of course, that wasn’t the headline. Instead, the Health Privacy Summit at which this demonstration attack was originally presented posted the following summary of the demonstration results to their website:

News From the Summit:

One of the Summit’s most notable presentations came from Harvard Professor Latanya Sweeney and Bloomberg News’ Jordan Robertson, who presented their findings that anyone can re-identify patients using publicly available information. By purchasing state health data sets for $50 and matching the data with newspaper articles, they were able to identify who had been hospitalized for what specific conditions. This means that anybody—a financial institution, an employer, an insurance company, a person snooping on friends and family—can see your private health information by piecing together different bits of information about you that is easily available to the public.

[Emphasis from original repeated here]

The total disconnect between the web blurb above and the actual research results could not be more striking. The extremely small (0.000062) realized risk from this concerted attack is, in fact, a mere fraction (about one sixth) of the already very small 1 in 2,500 risk reported for HIPAA Safe Harbor de-identification by Dr. Sweeney in public testimony. Moreover, HHS has repeatedly acknowledged the existence of very small re-identification risks for HIPAA de-identified data and the necessity of such an allowance in order to support important societal interests such as comparative effectiveness studies, policy assessment, life sciences research and other endeavors. This position is clearly reiterated in the most recent HHS HIPAA de-identification guidance. Even though the Washington State CHARS data failed to meet the HIPAA Safe Harbor de-identification requirements (due to its inclusion of some higher-risk data elements such as 5-digit Zip Codes and dates more specific than the year), the achievable re-identification risk from this “yearlong” effort by this team against the nearly 650,000 hospitalization records was extremely small.

To provide some relative proportion for these odds of re-identification: the lifetime odds that you will be struck by lightning (assuming you live to age 80) are 1 in 6,250, which is more than two and a half times higher than the odds of a Washingtonian hospitalized in 2011 being re-identified by this attack (1 in 16,200). It speaks volumes about our ability to think rationally about risks once fear has been invoked, and about our understandable dread of the unknown, that we don’t lose any sleep over the possibility that we might be hit by lightning, yet this attack – with much lower risks – invokes our apprehensions and concerns.
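The odds arithmetic behind these comparisons, as a quick sketch:

```python
# Quick arithmetic behind the risk comparisons above.
hospitalizations = 648_384    # 2011 WA CHARS records
reidentified = 40             # total re-identified by the project team

attack_risk = reidentified / hospitalizations
print(f"realized risk: {attack_risk:.6f} (about 1 in {1 / attack_risk:,.0f})")
print(f"records not re-identified: {1 - attack_risk:.3%}")        # 99.994%

safe_harbor_risk = 1 / 2_500  # Safe Harbor risk cited in public testimony
lightning_risk = 1 / 6_250    # lifetime odds of a lightning strike (to age 80)
print(f"fraction of Safe Harbor risk: {attack_risk / safe_harbor_risk:.2f}")  # ~1/6
print(f"lightning is {lightning_risk / attack_risk:.1f}x more likely")        # ~2.6x
```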

Even more to the point, in Jordan Robertson’s interview on Bloomberg Television (at about 2:28 into the video), he stated that “what I was left with from the story was an affirmation of the HIPAA standard. It’s called Safe Harbor… and more than anything it would appear to me that if States and other entities want to sell this data, if they do it to the HIPAA standard, we wouldn’t be having this conversation. We wouldn’t be able to do that investigation.” Notably, Dr. Sweeney has also publicly asserted on the website of Privacert, Inc. (which utilizes her patented “Risk Assessment Server”) that “a dataset is HIPAA compliant if no more people are identifiable in the subject data release than would be identifiable if the data release satisfied the HIPAA safe harbor provisions”.

So some real mental contortionism is required to reconcile this team’s statements that Safe Harbor risks (1/2,500) constitute acceptable levels of HIPAA-compliant risk with the position that the clearly much smaller risks (1/16,200) realized for the subjects in this demonstration (and solely due to the direct actions of the team behind this attack) do not. Notably, the Washington State CHARS data was apparently not released under an associated data use agreement providing contractual restrictions prohibiting re-identification or contacting of individuals in the data, as is required by the HIPAA Privacy Rule for “Limited Data Sets”.[4] In my opinion, it is extremely difficult to find any public policy-advancing justification for undertaking and publicizing this demonstration attack on the medical privacy of 40 individuals, eight of whom were directly contacted by the Bloomberg reporter.[5] The demonstrated re-identification risks are so small that it is highly unlikely that anyone other than re-identification researchers and news reporters (who escape normal economic incentives by being paid to undertake the work in spite of the extremely small success rate, and who have the professional motivation and combined skills/expertise to implement the matching attack) would be driven to undertake such an attack.

Conflicting Goals and Actions

A fundamental tension exists in the conduct and reporting of re-identification demonstrations: they seek to expose vulnerabilities, but they should also avoid promoting any of the harms that those vulnerabilities can create. Re-identification scientists presumably care quite deeply about the risks and associated potential privacy harms of re-identification. After all, they have devoted their considerable academic talents and careers to reducing these risks. Researchers viewing themselves as white hat hackers are seeking to alert the world to the re-identification risks that they perceive. Yet by undertaking these attacks, at least in cases where media attention has been focused on particular individuals, or where the media has distorted their results, they can actually promote the very privacy risks that they seek to redress, and additionally increase potential harms to the long-term welfare of a society which relies on de-identified data for many medical and public health research purposes.

Unfortunately, we’ve recently been provided with a prime example of the joint actions of re-identification researchers and the media exacerbating re-identification risks for a specific individual and directly heightening possible privacy harms. Adam Tanner, the same Forbes contributor who first publicized the PGP attack back in April, recently wrote a follow-up article in which he re-identified, and then contacted, a specific person within the PGP data holdings. What makes this matter even worse, though, is that the article included both sensitive medical information and a number of specific characteristics from the PGP data that might be used to potentially re-identify the individual.[6]

Although the sensitive and quasi-identifying information reported in the article was also already present within the PGP data available online, it existed in a state of considerable practical obscurity in which it was very unlikely to have been subject to a privacy intrusion but for the actions of the re-identification researcher and the journalist.[7] Instead, this re-identified PGP participant had sensitive medical information exposed in a national news report and faced a situation where the person’s complete identity would be much more likely to be subject to full revelation than would have been the case if this information had not been purposely broadcast as part of efforts to publicize this re-identification demonstration.

Professors Woodrow Hartzog and Frederic Stutzman have written importantly about the concept that obscurity is a critical component of online privacy. While, admittedly, it provides an imperfect form of privacy protection, from a practical perspective obscurity functioned effectively to provide privacy to the PGP participants, and it likely would have continued to do so if not for the academic and journalistic professional incentives that led to this much-publicized intrusion into the privacy of a specific PGP research participant.

Similar ethical “Catch-22”s exist related to pursuing media exposure for re-identification demonstrations and, in the process, causing direct privacy intrusions, as with the recent Bloomberg reporting on the Washington State hospital discharge database, or furthering data subjects’ re-identification risks and potential harms, as with the 2011 Harvard Berkman Center for Internet & Society “Data Privacy Meltdown” involving the University of Wisconsin’s Michael Zimmer. Zimmer showed that supposedly “anonymous” published social network data could be cracked to identify the university from which the data was sourced; but, in doing so, his actions quite clearly brought the identities of the data subjects closer to full revelation than they were prior to all of the media attention. Yet, as was recently reported in a broader computer security/privacy context, even Facebook (which hasn’t been renowned for visionary leadership in the privacy realm) has white hat hacker rules which clearly recognize that there are significant ethical issues with demonstrating vulnerabilities by using the accounts of real people without their permission, advising that “Exploiting bugs to impact real users is not acceptable behavior for a white hat.” Obviously, there is value in the scientific verification of demonstration attacks in order to quantify the extent of incorrect (false positive) re-identifications, but publicizing details in news reports which publicly expose and increase the re-identification potential for the individuals who have been attacked seems unnecessary and irreconcilably at odds with the ultimate goal of preventing privacy harms.

Furthermore, in those cases where re-identification risks have been exaggerated in order to draw more attention to the potential threat in question, even more damaging long-term societal harms can be realized through what Professor Jane Yakowitz Bambauer has termed “The Tragedy of the Data Commons”. In this modern version of the tragedy of the commons, distortedly inflated re-identification risks would incentivize each individual to remove their personal data from the data commons in order to avoid the perceived high risks of re-identification. However, the collective health and public policy research benefits that are routinely derived from the data commons would importantly degenerate if unfounded fears of re-identification led to changes in public policy that curtailed research with properly de-identified data. Thus, from a public policy perspective, there is a strong motivation to assure that re-identification risks from demonstration attacks are accurately represented and reported in ways that do not falsely inflate fears about the real-world likelihood of re-identifications resulting from de-identified data.

Press Reporting of Re-identification Demonstrations

My comments here would be incomplete without some final statements about the critical role that both the scientific and popular press have played in the way re-identification research has been conducted and conveyed. It is unfortunate that this symposium has not included contributions from members of the media, because media reporting of re-identification demonstrations has held an especially influential and highly intertwined role with regard to the work of re-identification scientists. I believe press coverage of re-identification research has demonstrated a troubling tendency, quite notable in some examples, to go beyond mere presentation of the facts and fair consideration of countervailing opinions.

For example, although the recent PGP attack was able to achieve, at best, a 27 percent re-identification rate (once the results for re-identification by embedded names have been removed), in spite of having matched to both voter and online public record data sets, both the long-debunked 87 percent estimate and the famous associated William Weld “re-identification” continue to receive unabated citations in the popular press and as leading examples in influential scientific publications (including the “Science as an Open Enterprise” report of the prestigious Royal Society). Just last month, the Christian Science Monitor’s reference to the Weld re-identification went so far as to misinform the public that:

“Using just three bits of data, Latanya Sweeney showed how to identify everyone – including Weld…”  [Emphasis Added]

These well-worn “old chestnuts” (much beloved by privacy alarmists as easily communicated privacy folklore) have unreasonably dominated the re-identification risk discussion — particularly given that the PGP attacks were able to achieve only a fraction of the famous 87 percent level even when assisted by already having names for almost one fifth of the sample. It’s time to move on to more accurate communication about re-identification risks, even though this will admittedly involve a more challenging and detailed communication task.

It is, of course, critical that the public be accurately informed about re-identification risks and that re-identification concerns are covered by the press. But a problem arises in that the media preferentially reports on new and startling problems such as re-identification risks. This tendency has an important downside because there is a counter-balancing harm to societal interests (for research conducted with de-identified data) that can be realized by exaggeration of re-identification risks.

To be perfectly fair, this is not really a “press problem”; it is a broader human problem. We all want to know about threats. Our heightened concern and accentuated reactivity to threats and dangers is quite understandable from an evolutionary perspective. But, of course, it’s inherently biasing. People need to know about threats and deserve to be informed, but if the supposed threat turns out not to be well founded, or not as great as initially suspected, that simply isn’t seen as newsworthy. Consequently, it is hard to make the pitch to an editor, or the case for prominent placement or a decent word count, for an “Eh, well, the threat is not so great” article.

Of course, the very same issue exists for scientific publications. Statistical associations get attention and publication, but lack of association and negative results – not so much. It’s the classic publication bias or “file drawer” effect. Fortunately, we have suitable scientific methods for appropriately dealing with such biases through meta-analyses, and (as I’ve mentioned in my Part 1 blog post) for re-identification risk assessments and public policy evaluations through uncertainty and sensitivity analyses.

Still, the role of a well informed and fair media is critical in helping the public to better understand the importance of balancing the competing goals of protecting patient privacy and preserving the accuracy of scientific research and statistical analyses conducted with de-identified data.

Attentive adherence to the basic ethical codes of journalism includes:

  • testing the accuracy of information from sources,
  • making sure that headlines and article leads do not misrepresent,
  • not reporting information that could cause harm or discomfort to re-identification subjects when they could be affected adversely by news coverage,
  • not contacting or otherwise intruding into re-identification subjects’ privacy without a truly overriding public need,
  • avoiding real or perceived conflicts of interest, and
  • remaining free of associations and activities that compromise the integrity, or damage the credibility, of re-identification reporting.

In cases where the media has failed to achieve these essential ethical standards, I believe it is the duty of re-identification researchers to take an active role in correcting any resulting inaccuracies, exaggerations and distortions in press reporting. These are all necessary components for effectively delivering the complete story that the public deserves with regard to potential re-identification risks, and for properly maintaining the press’s obligations to the public trust.

Closing Comments

Accurate, complete and critically-minded reporting is key to having a well-informed public and to assuring that policy makers are equipped to properly balance protecting patient privacy against preserving the accuracy and utility of de-identified data. Yet even when re-identification demonstrations are accurately and ethically communicated, re-identification research still faces some complex ethical considerations in its conduct. In Part 3 of this symposium contribution I will address a number of admittedly quite complex ethical issues regarding beneficence and justice, and assuring our ethical equipoise between potential privacy harms and the very real benefits that result from the advancement of science and the healthcare improvements accomplished with de-identified data.


[1] Note: some of the numbers in the Data Privacy Lab report do not total correctly between those reported in the tables and text, thus leading to this range of 23-27 percent.

[2] Professor Jane Yakowitz Bambauer first noticed this critical discrepancy and called it to my attention.

[3] Also see paragraph 2 on page 5 of my Weld paper.

[4]  See the HIPAA Limited Data Set standards at §164.514(e).

[5]   It’s worth pointing out that contacting re-identified individuals is itself a privacy intrusion that is specifically prohibited by the HIPAA Privacy Rule specifications for Data Use Agreements for HIPAA Limited Data Sets. Presumably the reporter didn’t have legal obligations for his actions in this specific case, but I will raise further questions for discussion about the ethical standards for journalists covering re-identification attacks later in this essay.

[6]   Because of this, I’ve purposefully decided not to directly cite the article here. I’ve made this decision in order to hopefully minimize my having a role in further advancing this privacy intrusion, but even with the very small comparative readership for this academic blog versus the circulation of the Forbes publishing company, sorting out an ethically appropriate action here left me contemplating the ethical “Principle of Double Effect”.

[7]   And, thanks to some mitigating efforts, hopefully now only in a very marginally increased way, by myself as well.
