
Understanding Racial Bias in Medical AI Training Data

By Adriana Krasniansky

Interest in artificial intelligence (AI) for health care has grown at an astounding pace: the global AI health care market is expected to reach $17.8 billion by 2025, and AI-powered systems are being designed to support medical activities ranging from patient diagnosis and triage to drug pricing.

Yet, as researchers across the technology and medical fields agree, “AI systems are only as good as the data we put into them.” When AI systems are trained on patient datasets that are incomplete, or that underrepresent or misrepresent certain populations, they stand to develop discriminatory biases in their outcomes. In this article, we present three examples that demonstrate the potential for racial bias in medical AI arising from training data.

Background: Understanding AI Decision-Making

AI has been broadly defined by Stanford professor John McCarthy (and adopted by the FDA) as the science and engineering of making intelligent machines, especially intelligent computer programs. Many AI techniques, particularly in the medical community, involve designing and training algorithms to learn from and act on data.

With such techniques, an AI algorithm is designed to optimize for a certain goal established by its creator. The algorithm is then given “training data,” from which it extracts information patterns and develops corresponding approaches, or “rules,” to optimize for that goal. After further training and validation, the AI algorithm applies these patterns to real-world data. If the model is adaptive, it will continue to refine its patterns as it operates.

Training datasets that omit certain demographics, provide skewed or incomplete representation, or disproportionately represent one group over another can train AI models to incorporate these biases into their real-world functioning. Such situations might create “self-fulfilling prophecies” that confirm existing social biases or create new forms of bias altogether.
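The mechanism is easy to demonstrate in miniature. The sketch below is purely illustrative: it uses synthetic data and a one-feature threshold “classifier,” with groups, score ranges, and numbers invented for the example. A training set that is 95% one group produces a decision rule tuned to that majority, and accuracy for the minority group collapses:

```python
import random

random.seed(0)

def sample(group, label, n):
    # Synthetic one-dimensional "lesion score." In this toy world, group B's
    # benign scores (~0.6) overlap group A's malignant scores (~0.6).
    base = 0.4 if group == "A" else 0.8
    shift = 0.2 if label == 1 else -0.2
    return [(base + shift + random.gauss(0, 0.05), label) for _ in range(n)]

# Training set mirrors a skewed dataset: 95% group A, 5% group B.
train = (sample("A", 1, 475) + sample("A", 0, 475)
         + sample("B", 1, 25) + sample("B", 0, 25))

def acc_at(threshold, data):
    # Fraction of examples correctly split by a single global cutoff.
    return sum((score > threshold) == (label == 1)
               for score, label in data) / len(data)

# "Training" is just picking the global threshold with the best overall
# accuracy -- which is dominated by the majority group.
best_t = max((t / 100 for t in range(100)), key=lambda t: acc_at(t, train))

# Evaluate per group on balanced held-out samples.
test_a = sample("A", 1, 200) + sample("A", 0, 200)
test_b = sample("B", 1, 200) + sample("B", 0, 200)
print(f"group A accuracy: {acc_at(best_t, test_a):.2f}")
print(f"group B accuracy: {acc_at(best_t, test_b):.2f}")
```

The model scores well overall and well for group A, yet misclassifies most of group B's benign lesions as malignant, because the overall accuracy it optimized for barely penalizes errors on the underrepresented group.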


Example 1: Computer Visioning & Melanoma Diagnosis

Computer visioning is a field of AI that trains algorithms to understand the content of digital images. It translates street signs, steers autonomous vehicles, and recognizes faces. Yet the data used to train computer visioning software often contains bias. When comparing three major face-recognition technologies, MIT researchers found that the programs incorrectly classified less than 1% of light-skinned men but more than one-third of dark-skinned women.

What happens when such visioning algorithms are used to diagnose skin conditions? In 2016, a team of researchers built an AI model to identify melanomas from clinical images. They trained the algorithm on more than 100,000 photographs of skin lesions labeled “malignant” or “benign.” The algorithm detected 95% of the melanomas and 82.5% of the benign moles; but, as science writer Stephanie Dutchen points out, more than 95% of the images used in training showed white skin.

Dutchen poses important questions: “If the model were implemented in a broader context, would it miss skin cancers in patients of color? Would it mistake darker skin tones for lesions and overdiagnose cancers instead? Or would it perform well?” The stakes of both false negatives and false positives are high: either patients’ cancers are missed, or doctors and patients come to mistrust a potentially life-saving tool. Further, the research does not make clear how intersectional groups were represented in the data or how they would be affected.


Example 2: Speech Recognition, Natural Language Processing, & Alzheimer’s Presentation

There have been strong developments in the application of AI to voice: speech recognition and natural language processing (NLP) algorithms analyze human language and other linguistic factors. Researchers are testing how well such algorithms could help diagnose conditions ranging from PTSD and dementia to heart disease.

Like computer visioning, voice analysis is only as good as its training data. Take the case of Winterlight Labs, a Toronto-based company that developed a speech-based cognitive assessment to detect Alzheimer’s disease. In 2016, after the company published its research in the Journal of Alzheimer’s Disease, Quartz reported that it realized its technology was accurate only for English speakers of a specific Canadian dialect, a consequence of its training data. For people evaluated by the model who were not native speakers, a tic, pause, or grammatical inconsistency might be mistaken for a disease marker.

In voice, the medical community has multiple responsibilities to address. First, it must construct datasets that are representative of the patient population; this is particularly important given that disorders can present differently across races, cultures, and societies. Second, it must consider patient context when applying medical AI: personal factors, such as native language or comfort with technology, should be weighed alongside a model’s performance.
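One concrete first step toward representative datasets is simply auditing a dataset’s composition before any model is trained. A minimal sketch of such a check; the metadata field and the counts below are hypothetical, not drawn from Winterlight’s actual data:

```python
from collections import Counter

# Hypothetical metadata for a voice dataset: each sample is tagged with the
# speaker's native language/dialect. Field name and counts are illustrative.
samples = ([{"native_language": "en-CA"}] * 470
           + [{"native_language": "fr-CA"}] * 20
           + [{"native_language": "zh"}] * 10)

counts = Counter(s["native_language"] for s in samples)
total = sum(counts.values())
for lang, n in counts.most_common():
    print(f"{lang}: {n}/{total} ({n / total:.0%})")

# A 94% skew toward one dialect group is exactly the kind of imbalance
# that produces blind spots like the one Winterlight discovered.
```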


Example 3: Machine Learning & Hospital Care Management

So far, we’ve provided examples of AI being used for patient diagnosis; yet, much of the impact of AI in healthcare has been to improve care delivery. Health care networks are turning to machine learning (ML) algorithms to review large sets of anonymized patient data and identify blockages, rebalance resources, or locate gaps in the care system. (Note: ML is a broad category of AI that overlaps with other techniques in this article.)

Marshall Chin, Professor of Healthcare Ethics at the University of Chicago Medicine, provides an example of how machine learning algorithms can perpetuate racial bias based on training data. He describes a data analytics project at UCM that applied ML to help decrease patients’ length of hospital stay. If an algorithm could identify which patients were likely to be discharged early, those patients could be assigned case managers to ensure no blockages delayed their discharge. The model was built using clinical data and zip code. Chin describes the problem that emerged:

If you live in a poor…or a predominantly African-American neighborhood, you were more likely to have a longer length of stay. So, the algorithm would have led to the paradoxical result of the hospital providing additional case management resources to a predominantly white…population to get them out of the hospital earlier, instead of to a more socially at-risk population who really should be the ones that receive more help.

Chin describes how the data analytics team not only recognized their model’s bias, but worked with colleagues to champion new systems for addressing algorithmic bias. Their experience, however, is not isolated: zip code-based training data has proven to be a proxy for race in several cases, notably on Facebook’s advertising platform.
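Chin’s example can be reduced to a toy sketch. Suppose a model simply predicts length of stay from the historical average for a patient’s zip code (the zip labels and stay lengths below are invented for illustration); a “fast-track” policy built on those predictions then routes case managers away from the very neighborhood that needs them:

```python
from collections import defaultdict
from statistics import mean

# Invented records: (zip_code, length_of_stay_in_days). "zip_A" stands in
# for a neighborhood where social factors lengthen stays; "zip_B" does not.
records = ([("zip_A", d) for d in (6, 7, 8, 7, 9)]
           + [("zip_B", d) for d in (3, 4, 3, 4, 3)])

# "Model": predicted stay = historical mean stay for the patient's zip code.
by_zip = defaultdict(list)
for zip_code, days in records:
    by_zip[zip_code].append(days)
predicted = {z: mean(days) for z, days in by_zip.items()}

# Policy: case managers go to patients predicted to be discharged early --
# so the socially at-risk neighborhood receives the least extra help.
for zip_code, stay in sorted(predicted.items()):
    plan = "assign case manager" if stay < 5 else "no extra help"
    print(f"{zip_code}: predicted stay {stay:.1f} days -> {plan}")
```

Because zip code correlates with race and with the social factors driving longer stays, the model’s “neutral” optimization reproduces exactly the paradox Chin describes.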

Adriana Krasniansky

Adriana Krasniansky is a graduate student at the Harvard Divinity School studying the ethical implications of new technologies in healthcare, and her research focuses on personal and societal relationships to devices and algorithms. More specifically, Adriana is interested in how technology can support and scale care for aging and disability populations. Adriana previously worked as a writer and consultant in the technology field, managing projects involving artificial intelligence and robotics. Her work has been featured in publications including The Atlantic, Quartz, and PSFK.