Big Data, Genetics, and Re-Identification

by Zachary Shapiro

While all scientific research produces data, genomic analysis is unusual in that it inherently produces vast quantities of it. Every human genome contains roughly 20,000-25,000 genes, so even the most routine genomic sequencing or mapping generates enormous amounts of data. Furthermore, next-generation sequencing techniques are being pioneered to allow researchers to sequence genomes quickly. These advances have dramatically reduced both the time and the cost of sequencing a given genome. Along with novel methods of sequencing genomes, there have been improvements in storing and sharing genomic data, particularly through computer- and internet-based databases, giving rise to Big Data in the field of genetics.

While Big Data has proven useful for genomic research, the aggregation of so much data could give rise to new ethical concerns. One concern is that promises of privacy made to individual participants might be undermined if subjects can be re-identified.

Re-identification of individual participants from de-identified data in genetic databases can occur when researchers apply algorithms that cross-reference the available genetic information against numerous other data sets. This can enable diligent researchers to re-identify specific individuals, even from data sets that are thought to be anonymized. Such re-identification represents a genuine threat to individual privacy, because a researcher combing through a re-identified individual’s genetic information could learn about genetic risk factors for diseases, or other sensitive health and personal information.
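
To make the cross-referencing idea concrete, here is a minimal Python sketch of one simple form it could take: joining a "de-identified" data set to a public one on shared attributes. Everything in it is an assumption made for illustration. The records are invented toy data, the choice of ZIP code, birth year, and sex as linking fields is hypothetical, and the names `QUASI_IDENTIFIERS`, `key`, and `link` are not drawn from any real tool. Actual re-identification methods may work quite differently.

```python
from collections import defaultdict

# Toy, invented records. The "de-identified" set has no names, but it keeps
# a few ordinary attributes next to a genetic finding.
deidentified = [
    {"zip": "02139", "birth_year": 1975, "sex": "F", "variant": "BRCA1 risk variant"},
    {"zip": "02139", "birth_year": 1982, "sex": "M", "variant": "no flagged variant"},
    {"zip": "10027", "birth_year": 1975, "sex": "F", "variant": "APOE e4 carrier"},
]

# A second, hypothetical public data set that carries names and the same attributes.
public = [
    {"name": "Person A", "zip": "02139", "birth_year": 1975, "sex": "F"},
    {"name": "Person B", "zip": "02139", "birth_year": 1982, "sex": "M"},
    {"name": "Person C", "zip": "02139", "birth_year": 1982, "sex": "M"},
]

# Assumed linking fields shared by both data sets.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Build the tuple of linking attributes for one record."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def link(deidentified_records, public_records):
    """Return (name, variant) pairs for records matching exactly one named person."""
    names_by_key = defaultdict(list)
    for person in public_records:
        names_by_key[key(person)].append(person["name"])

    matches = []
    for record in deidentified_records:
        candidates = names_by_key.get(key(record), [])
        # Only an unambiguous match counts as a re-identification here.
        if len(candidates) == 1:
            matches.append((candidates[0], record["variant"]))
    return matches


reidentified = link(deidentified, public)
print(reidentified)
```

In this toy example only the first record is linked to a name: the second matches two people and stays ambiguous, and the third matches no one. The point of the sketch is that each additional data set brought into the comparison can narrow the candidates further, which is why aggregation raises the risk.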
