By Timo Minssen (CeBIL, UCPH), Sara Gerke & Carmel Shachar
A recent US lawsuit highlights crucial challenges at the interface of data utility, patient privacy & data misuse
The huge prospects of artificial intelligence and machine learning (ML), as well as the increasing trend toward public-private partnerships in biomedical innovation, underscore the importance of effective governance and regulation of data sharing in the health and life sciences. Cutting-edge biomedical research strongly demands high-quality data to ensure safe and effective health products. It is often argued that greater access to individual patient data collections stored in hospitals’ medical records systems may considerably advance medical science and improve patient care. However, as public and private actors attempt to gain access to such high-quality data to train their advanced algorithms, a number of sensitive ethical and legal aspects also need to be carefully considered. Besides giving rise to safety, antitrust, trade secrets, and intellectual property issues, such practices have raised serious concerns with regard to patient privacy, confidentiality, and the commitments made to patients via appropriate informed consent processes.
A recent lawsuit, Dinerstein v. Google, accusing the University of Chicago (UC) of sharing identifiable patient data with Google has pushed the growing privacy concerns into the U.S. spotlight. Let us start with the background to the suit.
On 17 May 2017, UChicago Medicine announced a collaboration with Google “to study ways to use data in electronic medical records to make discoveries that could improve the quality of health care.” With the help of new ML techniques, the new collaboration aimed “to create predictive models that could help prevent unplanned hospital readmissions, avoid costly complications and save lives.” By combining ML tools developed by Google with UChicago Medicine’s health care predictive modeling expertise, Michael Howell hoped back in 2017 to unlock more of the valuable information stored in electronic medical records “to create predictive algorithms that could alert physicians and nurses about patients’ risks for problems.” This breakthrough would be particularly useful for free text or images such as doctors’ notes or X-rays, which are difficult to extrapolate with traditional tools of epidemiology and statistics.
In 2018, npj Digital Medicine published the results of a study that analyzed electronic health record (EHR) data from 216,221 patients who were hospitalized for at least 24 hours at the University of California, San Francisco (UCSF), from 2012 to 2016, or at UChicago Medicine from 2009 to 2016. The shared data sets contained information about crucial factors such as diagnoses, procedures, medications, provider orders, patient demographics, and vital signs, with a volume of 46,864,534,945 data points, including clinical notes. Google, together with researchers at UCSF, UC, and Stanford University, took a deep learning approach (deep learning being a subset of ML) to produce predictions across different healthcare domains, including death, readmissions, length of stay, and diagnoses.
A very important difference between the UCSF data set and the data set provided by UChicago Medicine was that the “dates of service” were maintained in the UChicago Medicine data set. In addition, the UChicago Medicine data set also contained “de-identified, free-text medical notes.” This additional information in the UChicago Medicine data set provided the basis for a class action complaint that was filed on 26 June 2019 by the law firm Edelson PC on behalf of Matt Dinerstein and all others similarly situated. Matt Dinerstein is a former patient who was hospitalized twice at UChicago Medicine in 2015. Google, UC, and the UC Medical Center were named as defendants.
In particular, the complaint accuses both UC and Google of violating the Health Insurance Portability and Accountability Act (HIPAA) by sharing and receiving hundreds of thousands of patients’ records that contained sufficient information for the tech giant to re-identify the patients. The complaint claims that the sharing and receiving of the datestamps, along with free-text notes data, “would be a prima facie violation of HIPAA.” Google and UC announced that all shared data were “de-identified” and in compliance with HIPAA. However, the complaint contests this and highlights that “in reality, these records were not sufficiently anonymized and put the patients’ privacy at grave risk.” In particular, it emphasizes that Google has access (e.g., through Android phones and mobile apps such as Waze and Maps) to a vast amount of information that empowers the tech company to potentially re-identify medical records. The complaint also alleges that UC did not obtain patients’ express consent before sharing their medical records with Google, which pursues commercial purposes.
It is worth noting that Google is not a HIPAA-covered entity, and thus health data collected by the tech giant usually does not fall under HIPAA. In contrast, the California Consumer Privacy Act of 2018 (CCPA) and the European General Data Protection Regulation (GDPR) are broader in their scope. We noted this here and here. The problem that de-identified data under HIPAA may become re-identifiable when combined with other data sets is known as “data triangulation.” Read more here.
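To make the idea of data triangulation concrete, the following is a minimal, purely illustrative sketch in Python. It uses invented toy data and hypothetical field names (`admit_date`, `arrived`, and so on), and it is not a description of anything Google or UC did or could do. It only shows the general mechanism at issue: if records stripped of names retain dates of service, and a second data set ties named individuals to the same dates, a unique match on dates can link a person to a record.

```python
from collections import defaultdict

# Hypothetical "de-identified" records: names removed, dates of service kept.
deidentified_records = [
    {"record_id": "r1", "admit_date": "2015-03-02", "discharge_date": "2015-03-05"},
    {"record_id": "r2", "admit_date": "2015-03-02", "discharge_date": "2015-03-09"},
    {"record_id": "r3", "admit_date": "2015-06-11", "discharge_date": "2015-06-12"},
]

# Hypothetical auxiliary data: dates on which a known person was at the hospital.
auxiliary_visits = [
    {"person": "person_x", "arrived": "2015-03-02", "left": "2015-03-05"},
    {"person": "person_y", "arrived": "2015-06-11", "left": "2015-06-12"},
    {"person": "person_z", "arrived": "2015-09-01", "left": "2015-09-02"},
]


def triangulate(records, visits):
    """Link a person to a record only when the date pair matches exactly one record."""
    by_dates = defaultdict(list)
    for rec in records:
        by_dates[(rec["admit_date"], rec["discharge_date"])].append(rec["record_id"])

    links = {}
    for visit in visits:
        candidates = by_dates.get((visit["arrived"], visit["left"]), [])
        if len(candidates) == 1:  # ambiguous or missing matches are not linked
            links[visit["person"]] = candidates[0]
    return links


links = triangulate(deidentified_records, auxiliary_visits)
print(links)
```

In this toy example, two of the three individuals are linked to a record because their date pair is unique in the record set. If several records shared the same dates, the sketch would leave them unlinked, which is why coarsening or removing dates reduces this kind of risk.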
In our next piece, we will evaluate the significance of this suit for the use of big datasets, as well as the convergence of medical and health data.
This research is supported by a Novo Nordisk Foundation grant for a Collaborative Research Programme (grant agreement number NNF17SA027784).