Google’s AI for Medicine Shows Clinical Answers More Than 90% Accurate

(Bloomberg) -- One day in February 2022, two AI researchers at Alphabet Inc.’s Google found themselves engrossed in conversation about artificial intelligence and its potential for real applications in healthcare.

As Alan Karthikesalingam and Vivek Natarajan discussed adapting Google’s existing AI models to medical settings, their conversation stretched for hours and into dinner over dosas at a restaurant near the tech giant’s Mountain View headquarters. By the end of the evening, Natarajan had written a first draft of a document that described the possibilities for large language models in health care, including research directions and its challenges.

Their work kicked off one of the most intense research sprints that researchers say they have experienced in their time at Google. It culminated in the publication of Med-PaLM, an AI model that researchers say has the potential to revolutionize healthcare by allowing physicians to retrieve medical knowledge quickly to support their clinical decisions. Large language models are massive AI systems that typically ingest enormous volumes of digital text, but Karthikesalingam and Natarajan envisioned a system that would be trained on specialized medical knowledge.

Peer-reviewed research underpinning the AI model has been accepted by the scientific journal Nature, Google said Wednesday. This makes the company first to publish research detailing an AI model that answers medical questions in the journal, it said.

The paper contains some surprising results. When the model was posed medical questions, a pool of clinicians rated its responses to be 92.6% in line with the scientific consensus, just shy of the 92.9% score that real-life medical professionals received, according to a statement from Nature, though the clinicians’ evaluations of Med-PaLM weren’t based on it being deployed in hospital settings with real-life patient variables. The study also found that just 5.8% of the model’s responses could cause harm, besting the 6.5% rate achieved by clinicians.

Sarah West, managing director of AI Now Institute, a policy research center, said that while publishing in a scientific journal demonstrates some academic oversight of Google’s findings, it’s an insufficient standard for the AI system being ready to use in real health-care settings. “There's all kinds of information that you would want to know, in order to meaningfully evaluate a system before it's deployed into commercial use,” she said. “And you need to look at this system at the level of each hospital, if they're going to be doing any kind of customization of a system for a particular clinical setting.”

Without any other mandates for independent testing or evaluation, “we're stuck in a situation where we have to rely on the company's words that they have adequately evaluated” the AI systems before deployment, West added.

It’s still early days for Med-PaLM. The company has just in the past few months begun to open up the model to a select group of healthcare and life science organizations for testing, and the company says the model is still far from being ready for use in patient care. Google researchers who worked on the model say that in the future, Med-PaLM could have the potential to provide doctors an expert source to consult when encountering unfamiliar cases, help with the drudgery of clinical documentation and extend care to people who might otherwise not receive any form of health care at all.

“Can we catalyze the medical AI community to think seriously about the potential of foundation models for health care?” said Karan Singhal, a software engineer who worked on the project. “That was our guiding North Star.”

In March, Google announced Med-PaLM’s second iteration, which it said reached an 86.5% score when answering US medical licensing-style questions — an improvement over its earlier 67% score. The first generation of Med-PaLM was evaluated by nine clinicians from the UK, the US and India; 15 physicians assessed the second version, Google said.

Google and OpenAI, the Microsoft Corp.-backed startup, are locked in a fierce race in artificial intelligence, and the medical field is no exception. Medical systems have begun experimenting with OpenAI’s technology, the Wall Street Journal had reported. Google, for its part, has begun trying out Med-PaLM with the Mayo Clinic, according to the Journal.

Both Karthikesalingam and Natarajan had long dreamed of bringing AI to health care. Having begun his career as a physician, Karthikesalingam found himself longing for an AI model that could complement his work. Natarajan grew up in parts of India where for many people seeing a doctor was not feasible.

One of the team’s first researchers, Tao Tu, said said he was initially skeptical of the team’s ambitious timetable. “I had an initial call with Vivek, and Vivek said we plan to print out a paper in a month,” Tu said. “And I was like, how is that possible? I have been publishing a number of papers for many years. I know nothing will happen in such a short timescale.”

Yet the team pulled it off. After a five-week sprint that stretched over Thanksgiving and Christmas, which included 15-hour work days, the group had composed Med-PaLM, the first generation of the model, and announced it in December.

Researchers said the rapid advances in the technology were what motivated them to move so quickly.

Along the way, the team began to get a sense of the significance of what they were building. After some early tweaks, the model began achieving a 63% score on the medical licensing exam, clearing the threshold to pass. And in the early stages of the project, the model’s responses were easily distinguishable from the clinicians’ answers by Karthikesalingam, who is a practicing physician himself. But by the end of the process, he could no longer tell which was which, Singhal said.

AI algorithms are already used in health care settings for specific tasks, such as in medical imaging, or to help predict which hospitalized patients are most at-risk for sepsis. But generative AI models pose new risks, which Google itself acknowledges. The models might, for instance, deliver medical misinformation in a convincing fashion or integrate biases that could augment existing health disparities.

In order to mitigate these risks, the Med-PaLM researchers said they incorporated “adversarial testing” into their AI model. They curated a list of questions designed to elicit AI-generated answers with the potential for harm and bias, including a set of questions focused on sensitive medical topics like Covid-19 and mental health, as well as another set of questions on health equity. The latter focused on issues like racial biases in health care.

Google said Med-PaLM 2 gave answers that were more frequently rated as having a “low risk of harm” compared to its first model. But it also said there wasn’t a significant change in the model’s ability to avoid generating inaccurate or irrelevant information.

Shek Azizi, a senior research scientist at Google, said that during testing for Med-PaLM, when they asked the AI model to summarize a patient chart or respond with clinical information, they found Med-PaLM “may hallucinate and refer back to studies that are not basically there, or that weren’t provided.”

Large language models’ propensity for putting out convincing but wrong answers raises concerns about their use in “domains where truth and veracity are paramount and, in this case, life or death issues,” said Meredith Whittaker, president of the Signal Foundation, which supports private messaging, and a former Google manager. She is also concerned about the prospect of “deploying this technology in settings where the incentives are already calibrated to reduce the amount of care and the amount of money spent on care for people who are suffering.”

In a demonstration for Bloomberg reporters, Google showed off an experimental chatbot interface for Med-PaLM 2 in which a user could choose from a variety of medical issues to explore, including conditions like “incontinence,” “loss of balance,” and “acute pancreatitis.”

Selecting one of the conditions generated a description from the AI model along with evaluation results, with ratings for criteria like “reflects clinical and scientific consensus” and “correct recall of knowledge.” The interface also displayed a clinician’s real description of the issue to compare against the AI-generated answers.

In May, at the company’s annual I/O developers conference, Google announced that it was exploring capabilities for Med-PaLM 2 to draw information from both images and text, allowing testers to help interpret information from X-rays and mammograms to someday improve patient outcomes. “Please provide a report to summarize the following chest X-ray,” read one prompt from the experimental Med-PaLM 2 interface seen by Bloomberg.

Though it may not work as advertised in a real clinical setting, the AI’s response looked convincing and comprehensive. “The lung fields are clear without consolidation or edema, the mediastinum is otherwise unremarkable,” it said. “The cardiac silhouette is within normal limits for size, no effusion or pneumothorax is noted, no displaced fractures are evident.”

Comments

Plain text