By Michelle Meyer
This post is part of Bill of Health‘s symposium on the Law, Ethics, and Science of Re-Identification Demonstrations. Background on the symposium is here. You can call up all of the symposium contributions by clicking here. —MM
Daniel C. Barth-Jones, M.P.H., Ph.D., is a HIV and infectious disease epidemiologist. His work in the area of statistical disclosure control and implementation under the HIPAA Privacy Rule provisions for de-identification is focused on the importance of properly balancing competing goals of protecting patient privacy and preserving the accuracy of scientific research and statistical analyses conducted with de-identified data. You can follow him on Twitter at @dbarthjones.
Forecast for Re-identification: Media Storms Continue…
In Part 1 of this symposium contribution, I wrote about the re-identification “media storm” started in January by the Erlich lab’s “Y-STR” re-identifications which made use of the relationship between Short Tandem Repeats (STRs) on the Y chromosome and paternally inherited surnames. Within months of that attack, April and June brought additional re-identification media storms; this time surrounding re-identification of Personal Genome Project (PGP) participants and a separate attack matching 40 persons within the Washington State hospital discharge database to news reports. However, as I have written has sometimes been the case with past reporting on other re-identification risks, accurate and legitimate characterization of re-identification risks has, unfortunately, once again been over-shadowed by distortive and exaggerated reporting on some aspects of these re-identification attacks. Unfortunately, a careful review of both the popular press coverage and scientific communications for these recent re-identification demonstrations displays some highly misleading communications, the most egregious of which incorrectly informs more than 112 million persons (more than one third of the U.S. population) that they are at potential risk of re-identification when they would not actually be unique and, therefore, re-identifiable. While each separate reporting concern that I’ve addressed here is important in and of itself, the broader pattern that can be observed for these communications about re-identification demonstrations raises some serious concerns about the impact that such distortive reporting could have on the development of sound and prudent public policy for the use of de-identified data.
Reporting Fail (and after-Fails)
University of Arizona law professor Jane Yakowitz Bambauer was the first to call out the distortive “reporting fail” for the PGP “re-identifications” in her blog post on the Harvard Law School Info/Law website. Bambauer pointed out that a Forbes article (written by Adam Tanner, a fellow at Harvard University’s Department of Government, and colleague of the re-identification scientist) covering the PGP re-identification demonstration was misleading with regard to a number of aspects of the actual research report released by Harvard’s Data Privacy Lab. The PGP re-identification study attempted to re-identify 579 persons in the PGP study by linking their “quasi-identifiers” {5-digit Zip Code, date of birth and gender} to both voter registration lists and an online public records database. The Forbes article led with the statement that “more than 40% of a sample of anonymous participants” had been re-identified. (This dubious claim was also repeated in subsequent reporting by the same author in spite of Bambauer’s “call out” of the inaccuracy explained below.) However, the mischaracterization of this data as “anonymous” really should not have fooled anyone beyond the most casual readers. In fact, approximately 80 individuals among the 579 were “re-identified” only because they had their actual names included within file names of the publically available PGP data. Some two dozen additional persons had their names embedded within the PGP file names, but were also “re-identifiable” by matching to voter and online public records data. Bambauer points out that the inclusion of the named individuals was “not relevant to an assessment of re-identification risk because the participants were not de-identified,” and quite correctly adds that “Including these participants in the re-identification number inflates both the re-identification risk and the accuracy rate.”
As one observer humorously tweeted after reading Bambauer’s blog piece,
“It’s like claiming you “reidentified” people from their high school yearbook”.
Read More
You must be logged in to post a comment.