Combating Bias in Healthcare AI: A Conversation with Dr. Marzyeh Ghassemi

Many artificial intelligence (AI) models have been deployed and are benefiting end users across a wide variety of applications, whether that’s optimized driving routes served up via an app, face recognition in photo sorting, or personalized recommendations for a new audiobook. Beyond these everyday applications, AI is being implemented in healthcare settings as well. But many of these models are trained and evaluated on limited data generated by imperfect humans, which means that their performance may be significantly worse for some subgroups, e.g., women. These disparities are often multifaceted, but there is now an opportunity in AI and health to stop replicating existing inequities in health and instead use AI to find and fix them.

Dr. Marzyeh Ghassemi leads the “Healthy Machine Learning” lab at MIT, a group focused on using machine learning to improve delivery of robust, private, fair, and equitable healthcare. Ghassemi is an Assistant Professor at MIT in Electrical Engineering and Computer Science (EECS) and the Institute for Medical Engineering & Science (IMES), and a principal investigator at MIT’s Jameel Clinic in Machine Learning and Health.

In her research, Ghassemi explores how biases within the healthcare system can be reflected in technologies. For instance, organ transplant allocation systems rely heavily on risk scores that may be miscalibrated and on individual centers’ preferences about operation risk. Addressing such inequities is a central part of Ghassemi’s research goal to create actionable insights in human health that lead to more equitable healthcare.

Ghassemi shares more about what can lead to biased artificial intelligence systems, why they’re a problem, and how they can be fixed.

How would you describe health equity?

To me, health equity means that people have the same access to, experience with, and outcomes from, the healthcare system. We have not achieved health equity. We know that minority subgroups and even women - who are not a minority globally - have worse access, experiences and outcomes for many conditions.

Figure from Ethical Machine Learning in Health Care Annual review

What does it mean for an AI technology to have "bias"?

Biased AI happens in health settings when biased human knowledge or practice is used to train models. From there, human biases are essentially built into the models. We are training high-capacity AI models with data. Data reflects what we actually do, not just what we think we do. If we show an AI model millions of examples of data where a subgroup of patients is consistently underdiagnosed or treated poorly, then the model is likely to learn those associations. We are seeing biased AI replicate or extend societal biases that permeate medical knowledge and practice because care providers are human, and the systems that they work in are all human-driven.

What role might AI have in achieving health equity?

We’re at a point where AI can do several clinical tasks at or above the level of human clinicians. But what if I picked one of these models, and told you its performance was much worse for women than for men? This happens because our data is generated by humans, and humans live with societal biases. We need diverse teams to help collect diverse data, learn robust models and deploy fair recommendations. Then we can use AI as an advisor to promote health equity, instead of a tool which extends humans’ poor practices.
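One way to make the kind of gap described above visible is to report a model's performance separately for each subgroup rather than as a single overall number. The following is a minimal sketch, not a method from the interview: the function name `recall_by_group`, the group labels, and the records are all invented for illustration, and recall (the share of truly positive cases the model catches) is just one of several metrics that could be compared.

```python
from collections import defaultdict


def recall_by_group(records):
    """Compute recall per subgroup.

    records: iterable of (group, y_true, y_pred) tuples with 0/1 labels.
    Groups with no positive cases are omitted from the result.
    """
    true_pos = defaultdict(int)    # positives the model caught, per group
    actual_pos = defaultdict(int)  # all true positives, per group
    for group, y_true, y_pred in records:
        if y_true == 1:
            actual_pos[group] += 1
            if y_pred == 1:
                true_pos[group] += 1
    return {g: true_pos[g] / actual_pos[g] for g in actual_pos}


# Synthetic toy records, invented purely for illustration (not real patient data)
records = [
    ("women", 1, 1), ("women", 1, 0), ("women", 1, 0), ("women", 0, 0),
    ("men", 1, 1), ("men", 1, 1), ("men", 1, 0), ("men", 0, 0),
]

recalls = recall_by_group(records)
gap = max(recalls.values()) - min(recalls.values())
print(recalls, gap)
```

A large gap between groups does not by itself say why the model underperforms, but it flags where the data and the model need a closer look.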

What are some current biases in the organ transplant allocation system?

Over 6,000 people die annually in the US on transplant waiting lists. This has been partially attributed to systemic inefficiencies and bias. There are many potential issues here to tackle, including the lack of a consistent evaluation metric for organ quality, the inability of organ procurement operation centers to equitably obtain organs from those who want to donate them, and the efficiency of organ matching and scoring systems. Notably, checklists and risk scores are heavily used throughout medicine to make important care decisions, including in organ transplant, but these scores are also often biased or inefficient.

Can you describe your work with Jameel Clinic?

Through our efforts at the Jameel Clinic, we analyze the role that human decisions have in creating inequitable allocations during the organ allocation process, with a specific focus on the rules that create priority ordering of who gets an organ. For example, we looked at the lung transplant allocation process and found that even when you follow the process for ranking priority patients as described, the top candidates often aren’t the ones getting the lungs. We found that, on average, it was the 10th person on the sorted list who got the available lung. This is problematic because candidates with an estimated rank of 1 were 48% female and 15% Black, but those who actually received the lungs were 39% female and 10% Black. This means that the process needs to be modified to include those factors, and potentially account for them.

We also know that many lungs are thrown out in the lung transplant allocation process due to an assessment of “marginality”. But this isn’t a well-defined term, so different centers may draw the line differently on what counts as marginal. We want to leverage our recent data-driven method for creating optimal medical checklists from observed data to improve the assessments behind these organ decisions.
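The rank audit described in this answer (who sits at the top of the sorted list versus who actually receives the organ) can be sketched in a few lines. This is only an illustration under stated assumptions, not the lab's actual analysis: the function `audit_allocations`, the event format, and the `female` attribute are hypothetical, and the toy events are invented rather than drawn from any transplant registry.

```python
def audit_allocations(events, attribute):
    """Compare top-ranked candidates with actual recipients.

    events: list of (ranked_candidates, recipient_index) pairs, where
            ranked_candidates is a priority-sorted list of dicts and
            recipient_index is the 0-based position of the person who
            actually received the organ.
    attribute: key of a boolean field to compare (hypothetical, e.g. "female").
    """
    n = len(events)
    # 1-based rank of the person who actually received the organ
    recipient_ranks = [idx + 1 for _, idx in events]
    # Does the rank-1 candidate / the actual recipient have the attribute?
    top_flags = [int(ranked[0][attribute]) for ranked, _ in events]
    recipient_flags = [int(ranked[idx][attribute]) for ranked, idx in events]
    return {
        "mean_recipient_rank": sum(recipient_ranks) / n,
        "share_at_rank_1": sum(top_flags) / n,
        "share_among_recipients": sum(recipient_flags) / n,
    }


# Invented toy events for illustration only
events = [
    ([{"female": True}, {"female": False}, {"female": False}], 2),
    ([{"female": True}, {"female": True}, {"female": False}], 0),
    ([{"female": False}, {"female": False}, {"female": True}], 1),
    ([{"female": True}, {"female": False}], 1),
]

report = audit_allocations(events, "female")
print(report)
```

If the share of a group at rank 1 differs noticeably from its share among recipients, as in the figures quoted above, that difference is a prompt to examine the decisions made between ranking and allocation.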

How can other researchers apply your work?

Our analysis of organ procurement and allocation audits could be useful to any system allocating finite resources. Think about how we allocate hospital beds, make school choices, or distribute funding. There are choices that human decision makers may generally agree on, like “the patient with the worst condition should get the most immediate care”. But those general rules often depend very heavily on subjective decisions about what “worst condition” means to an individual. If we can use AI audits and suggestions to facilitate systemic agreement about how to deal with these difficult issues, decision makers can be more efficient and equitable.

In our work on dead-ends treatment learning, we ended up inverting the standard reinforcement learning paradigm to make it work better for clinical settings. Usually, researchers learn what actions or policies work best, and then recommend that clinicians take those actions or follow those treatment policies. In health, the data isn’t really set up to support those kinds of models, because we don’t have a well-explored set of random actions and their results for the patient. We wouldn’t want a doctor to work that way! Instead, we wanted to provide guidance to human decision makers on what treatments they should avoid in order to prevent harm to the patients. This makes more sense because we’re working off of actions and responses that have actually been tried before.

Here, we have a figurative roll-out of possible outcomes for a septic patient. Emojis roughly indicate the patient's condition with skulls denoting terminal states leading to death. Each black circle represents treatment (VP=vasopressor, IV=intravenous fluid), with state transitions from left to right. a) the patient state around the presumed onset of sepsis. b) a dead-end state: if reached, there is no hope of recovery regardless of future choices of treatment.
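The idea of a dead-end state, and of flagging treatments to avoid rather than treatments to take, can be illustrated on a tiny hand-written transition model. This sketch deliberately swaps in plain graph reachability on a known toy model; it is not the learning method used in the research, which has to work from observed patient data. The state names, the transitions, and the helper functions `find_dead_ends` and `actions_to_avoid` are all invented and carry no clinical meaning. It also assumes every trajectory eventually ends in recovery or death.

```python
def find_dead_ends(transitions, recovery="recovery"):
    """Return non-terminal states from which recovery is unreachable.

    transitions: {state: {action: set_of_possible_next_states}}
    """
    # Grow the set of states that can reach recovery by some action sequence
    can_recover = {recovery}
    changed = True
    while changed:
        changed = False
        for state, actions in transitions.items():
            if state in can_recover:
                continue
            if any(nxt in can_recover
                   for outcomes in actions.values() for nxt in outcomes):
                can_recover.add(state)
                changed = True
    return {s for s in transitions if s not in can_recover}


def actions_to_avoid(transitions, dead_ends, death="death"):
    """For each state that is not a dead-end, list actions that may lead
    directly into a dead-end state or death."""
    flagged = {}
    for state, actions in transitions.items():
        if state in dead_ends:
            continue
        risky = sorted(a for a, outcomes in actions.items()
                       if any(n in dead_ends or n == death for n in outcomes))
        if risky:
            flagged[state] = risky
    return flagged


# Invented toy model (VP = vasopressor, IV = intravenous fluid); not clinical guidance
transitions = {
    "onset": {"IV": {"stable"}, "VP": {"stable", "shock"}, "none": {"shock"}},
    "stable": {"IV": {"recovery"}, "VP": {"stable"}, "none": {"onset"}},
    "shock": {"IV": {"shock", "death"}, "VP": {"death"}, "none": {"death"}},
}

dead_ends = find_dead_ends(transitions)
avoid = actions_to_avoid(transitions, dead_ends)
print(dead_ends, avoid)
```

In this toy model, "shock" is a dead-end because no sequence of treatments leads from it to recovery, so the useful guidance is issued one step earlier: which actions at "onset" risk entering it.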