As AI expands into medicine, Northeastern study finds AI models influenced by medical bias
Under pressure to publish positive results, medical researchers sometimes spin their findings in published abstracts. But what happens when AI models are asked to interpret that data?
Humans can be easily influenced by language that is one-sided, especially in complex fields like medicine. But a new Khoury-led study shows that large language models, too, can be tricked by bias.
The team, led by PhD student Hye Sun Yun, investigated whether LLMs, the AI models trained on text data to generate human-like responses, are susceptible to “spin,” the tendency of researchers to present their findings in a positive light. By asking various LLMs to answer questions about treatment efficacy based on existing study abstracts, Yun and her team found that although the models can identify spin, they are still easily influenced by it, even more so than clinicians and medical researchers.
“Experts are more likely to say more positive things about an abstract with spin compared to those without,” Yun said. “We wanted to see whether this was true for LLMs and explore the dangers around that, especially when you use them to interpret results.”
With publishing pressures to answer to, researchers sometimes spin their findings in an effort to show an effective treatment or mitigation strategy. This can lead them to exaggerate results or pass over aspects like treatment side effects in their abstracts, Yun said. As more medical professionals use AI to summarize studies and research treatments, AI's susceptibility to medical spin could mislead them into believing a treatment is more effective than it actually is.
“You are incentivized to make it sound better,” Yun added. “The medical field is a high-stakes setting where a lot of clinicians’ decisions and even public health policies are determined by these randomized controlled trials.”
As LLMs are increasingly used to summarize and analyze medical data, the researchers hope the study, published in May, will make AI developers and users, including doctors, more aware of the models’ shortcomings.
For the four-month project, a six-person research team that included Khoury undergraduate (now recent alumna) Karen Zhang and Sy and Laurie Sternberg Interdisciplinary Associate Professor Byron Wallace used 30 existing abstracts. Each abstract had two versions: one exaggerated to make a treatment sound more effective, the other containing no embellishment. The researchers then fed both versions to the LLMs, asking the models to detect spin and rate the effectiveness of each treatment. The results showed that the models can usually detect the spin but are still influenced by it.
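A minimal sketch of what such an evaluation loop could look like is below. The `query_llm` helper, the prompt wording, and the rating scale are illustrative assumptions, not the study's actual protocol.

```python
# Illustrative sketch only: the prompts, rating scale, and query_llm helper
# are assumptions for demonstration, not the study's actual protocol.

def query_llm(prompt: str) -> str:
    """Placeholder for whatever chat-model API is being evaluated."""
    raise NotImplementedError("connect this to the model you want to test")

RATING_PROMPT = (
    "Read the following clinical trial abstract:\n\n{abstract}\n\n"
    "On a scale of 0 (not effective) to 10 (highly effective), how effective "
    "is the experimental treatment? Answer with a single number."
)

SPIN_PROMPT = (
    "Read the following clinical trial abstract:\n\n{abstract}\n\n"
    "Does the abstract contain spin, i.e. wording that presents the results "
    "more favorably than the reported data support? Answer yes or no."
)

def evaluate_pair(plain: str, spun: str) -> dict:
    """Compare the model's judgments on the spun and unspun versions of one abstract."""
    return {
        "rating_plain": query_llm(RATING_PROMPT.format(abstract=plain)),
        "rating_spun": query_llm(RATING_PROMPT.format(abstract=spun)),
        "spin_detected": query_llm(SPIN_PROMPT.format(abstract=spun)),
    }
```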

The researchers also tested whether bias could be reduced by telling models that an abstract had spin, or asking them to identify it. While this did reduce the likelihood of models exaggerating the benefits of certain treatments, it did not completely fix the problem, said Zhang.
“It provides the model with a bit more context and changes the outputs — even with a very simple prompting approach,” Yun added.
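A similarly hedged sketch of that prompt-level mitigation, reusing the hypothetical `query_llm` helper above; the warning text is an assumption for illustration, not the study's wording.

```python
# Sketch of the mitigation described above: flag possible spin in the prompt
# before asking for an effectiveness rating. Wording is illustrative only.

WARNED_RATING_PROMPT = (
    "The following clinical trial abstract may contain spin, i.e. language that "
    "exaggerates how effective the treatment is:\n\n{abstract}\n\n"
    "Focusing on the reported numerical results rather than the tone of the "
    "writing, rate the treatment's effectiveness from 0 (not effective) to 10 "
    "(highly effective). Answer with a single number."
)

def rate_with_warning(abstract: str) -> str:
    """Ask for a rating after explicitly warning the model about possible spin."""
    return query_llm(WARNED_RATING_PROMPT.format(abstract=abstract))
```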
The study also found that LLMs carry spin over into the simple, easy-to-read summaries they generate from abstracts, which could lead people who rely on those summaries to misinterpret jargon-heavy medical research.
“This is a phenomenon that can happen across all different types of models, regardless of what data they have been trained on,” Yun said, adding that the team worked with 22 different models, including some trained specifically on medical data, and had to devise unique prompts for each to compensate for differences in the models’ training.

While it’s not clear why LLMs are even more susceptible to spin than humans, Yun said it could be because of the models’ tendency to please the user. Some of the questions the researchers asked could be interpreted as a user trying to find evidence of a treatment’s benefits.
“When you think about it, the model is actually doing a great job in its task. The issue is the data that we’re providing is already biased, and we’re just asking the model to read the text and interpret it,” she said. “LLMs are sensitive to the style and tone of writing rather than focusing on the objective aspect, which is the numerical results. They are much like humans in this way.”
To further test the study’s conclusions, Zhang is working to add another 150 abstracts to the dataset. The first round used only oncology studies, but the team hopes to show that the phenomenon extends across medical specialties.
"Hopefully the additional data will give some new insights,” Zhang said.
Meanwhile, Yun is moving on to a related study about how asking LLMs leading questions can influence responses, even when the data remains the same.
“A lot of this ties back into my dissertation of looking at how we can build safer and more trustworthy LLM technologies for health information access,” she said.