Big Data, Genetics, and Re-Identification

by Zachary Shapiro

While all scientific research produces data, genomic analysis is unusual in that it inherently produces vast quantities of it. Every human genome contains roughly 20,000-25,000 genes, so even the most routine genomic sequencing or mapping generates enormous amounts of data. Furthermore, next-generation sequencing techniques are being pioneered to allow researchers to sequence genomes quickly. These advances have dramatically reduced both the time and the cost of sequencing a given genome. Along with novel methods of sequencing genomes, there have been improvements in storing and sharing genomic data, particularly through computer- and internet-based databases, giving rise to Big Data in the field of genetics.

While big data has proven useful for genomic research, the aggregation of so much data could give rise to new ethical concerns. One concern is that promises of privacy made to individual participants might be undermined if subjects can be re-identified.

Re-identification of individual participants from de-identified data contained in genetic databases can occur when researchers apply algorithms that cross-reference numerous data sets with the available genetic information. This can enable diligent researchers to re-identify specific individuals, even from data sets that are thought to be anonymized. Such re-identification represents a genuine threat to the privacy of the individual, as a researcher could learn about genetic risk factors for diseases, or other sensitive health and personal information, by combing through an individual’s genetic information.
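To make the idea of cross-referencing concrete, here is a minimal sketch in Python. It is a toy illustration, not the method used in any of the studies cited in this post: all records are synthetic, and the field names (`birth_year`, `zip3`, `sex`, `risk_variant`) and the `link` function are assumptions chosen for the example. It shows how a "de-identified" research record can be tied to a named public record when a combination of ordinary attributes happens to be unique.

```python
from collections import defaultdict

# Synthetic "de-identified" research records: names are removed,
# but ordinary attributes (quasi-identifiers) remain. All values are made up.
research_records = [
    {"sample_id": "S1", "birth_year": 1975, "zip3": "021", "sex": "F", "risk_variant": True},
    {"sample_id": "S2", "birth_year": 1982, "zip3": "100", "sex": "M", "risk_variant": False},
    {"sample_id": "S3", "birth_year": 1982, "zip3": "100", "sex": "M", "risk_variant": True},
]

# Synthetic public records that carry names alongside the same attributes.
public_records = [
    {"name": "Person A", "birth_year": 1975, "zip3": "021", "sex": "F"},
    {"name": "Person B", "birth_year": 1982, "zip3": "100", "sex": "M"},
    {"name": "Person C", "birth_year": 1982, "zip3": "100", "sex": "M"},
    {"name": "Person D", "birth_year": 1990, "zip3": "606", "sex": "F"},
]

# Hypothetical choice of attributes shared by both data sets.
QUASI_IDENTIFIERS = ("birth_year", "zip3", "sex")


def key(record):
    """Build the attribute combination used to compare records."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(research, public):
    """Return {sample_id: name} for research records that match exactly one public record."""
    names_by_key = defaultdict(list)
    for person in public:
        names_by_key[key(person)].append(person["name"])

    matches = {}
    for record in research:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:  # a unique match means the record is re-identified
            matches[record["sample_id"]] = candidates[0]
    return matches


reidentified = link(research_records, public_records)
print(reidentified)  # {'S1': 'Person A'}
```

In this toy data, sample S1 is re-identified because only one named person shares its attribute combination, while S2 and S3 stay ambiguous because two people share theirs. The more outside data sets an attacker can combine, the more often a combination becomes unique.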

In recent years, concerns about re-identification from genetic information have proven to be far from theoretical. Indeed, several groups of researchers have demonstrated that re-identification is possible, even with the limited information available in any one particular data set.[1]

As the internet facilitates the aggregation of personal information, the potential for re-identification is likely to increase in the coming years. Because of this, re-identification is an issue that must be addressed when conducting genetic research, as the usual promises of anonymity might be rendered moot by the threat of re-identification.

The potential for re-identification should change the way that researchers discuss anonymized genetic databases that will become available for large-scale research. Participants have to understand that while the information will not be linked to them in a traditional sense, there is a potential for re-identification, depending on the availability of other information. Re-identification does not mean that ethical genetic research is doomed, but researchers cannot ignore the risk it presents. Rather, researchers should explain that the risk remains extremely small, and that any re-identification is incredibly unlikely to cause any genuine problems for the research participant. Furthermore, by discussing the risk of re-identification directly, researchers can ensure that participants are fully informed and can give meaningful consent.

There are also easy steps that researchers can take to help reduce the risk of re-identification. Researchers can try to better control access to sensitive genetic data, so that only established researchers will have access to the information. Furthermore, researchers should establish, and enforce, sanctions against anyone found to have deliberately attempted to re-identify individuals from research data. Combating re-identification is an important job, and it is encouraging to see that researchers are attempting to generate novel ideas concerning how to reduce any risk of re-identification.[2]

In the meantime, it is crucial that researchers begin grappling with how to talk with participants about re-identification. If presented incorrectly, the small risk of re-identification could seriously dissuade individuals from participating in essential genetic research. This would be a truly unfortunate situation, one that could turn the small threat of re-identification into something that severely damages public trust in the genetic research process.

However, if researchers modify the way they discuss anonymity, privacy, and consent with participants in genetic research so that expectations can be managed, then research can proceed ethically and respectfully, even with the potential for re-identification.

[1] See Nature; 2013; Schadt et al.

[2] See e.g.,;;

2 thoughts to “Big Data, Genetics, and Re-Identification”

  1. “….researchers should explain that the risk remains extremely small, and that any re-identification is incredibly unlikely to cause any genuine problems for the research participant.”

    What is the absolute and relative risk? What is the number needed to harm? What problems are included in this global statement? Subjects who are employees of healthcare institutions which are self insured run multiple risks of their electronic medical records being accessed inappropriately, having their subject data re-identified and having their employment and careers placed at risk, just off the top of my head. What evidence do you have to support the claim of minimal risk with unlikely consequences?

  2. Thanks for your comment! These are excellent points, and I do not mean to suggest that there should be no concern. Nobody can say what the exact numbers needed would be, but we must remember that the risk of individual harm should be balanced against the fact that genetic research brings tangible benefits to large numbers of individuals in society.

    I based my claim on early research concerning the Genetic Information Nondiscrimination Act (GINA), which suggests that there is perhaps less cause for concern about actual instances of genetic discrimination than previously thought. Since 2010, there has been an annual average of roughly 48 cases reaching merit resolution, and damages have not been substantial, averaging less than $1 million in total annual awards. While there has been documentation of discrimination in life insurance, a review of existing data led researchers to conclude that “with the notable exception of studies on Huntington’s disease, none of the studies reviewed here (or their combination) brings irrefutable evidence of a systemic problem of GD that would yield a highly negative societal impact.” (Joly, Feze, & Simard, 2013).

    Of course, re-identification could spread these risks to a larger number of participants in genetic research, and thus should not be ignored. However, what I meant to suggest was that the risk does not make ethical research impossible.
