Researchers from the National Institutes of Health (NIH) have demonstrated that a multimodal AI model can achieve high accuracy on a medical diagnostic quiz, yet struggles to describe medical images and explain the reasoning behind its answers. The findings suggest that chatbots may not be ready for prime time in medical diagnostics.
To evaluate AI’s potential in clinical settings, the research team tasked Generative Pre-trained Transformer 4 with Vision (GPT-4V) with answering 207 questions from the New England Journal of Medicine (NEJM) Image Challenge. This challenge, designed to help healthcare professionals test their diagnostic abilities, prompts users to select a diagnosis from multiple-choice options after reviewing clinical images and a text-based description of patient symptoms.
The researchers asked the AI to both answer the questions and provide a rationale for each answer, including a description of the image presented, a summary of current, relevant clinical knowledge, and step-by-step reasoning for how GPT-4V arrived at its answer.
Nine clinicians from various specialties were also tasked with answering the same questions, first in a closed-book environment with no access to external resources, then in an open-book setting where they could refer to external sources.
The research team then provided the clinicians with the correct answers and the AI’s responses, asking them to score GPT-4V’s ability to describe the images, summarize medical knowledge, and provide step-by-step reasoning.
The analysis revealed that both clinicians and the AI scored highly in choosing the correct diagnosis. In closed-book settings, the AI outperformed the clinicians, whereas humans outperformed the model in open-book settings.
Moreover, GPT-4V frequently made mistakes when explaining its reasoning and describing medical images, even in cases where it selected the correct answer.
Despite the study’s small sample size, the researchers noted that their findings highlight how multimodal AI could be used to provide clinical decision support.
“This technology has the potential to help clinicians augment their capabilities with data-driven insights that may lead to improved clinical decision-making,” said Zhiyong Lu, Ph.D., corresponding author of the study and senior investigator at NIH’s National Library of Medicine (NLM), in a press release. “Understanding the risks and limitations of this technology is essential to harnessing its potential in medicine.”
However, the research team emphasized that AI-based clinical decision support tools need to be carefully assessed.
“Integration of AI into healthcare holds great promise as a tool to help medical professionals diagnose patients faster, allowing them to start treatment sooner,” explained Stephen Sherry, Ph.D., NLM acting director. “However, as this study shows, AI is not advanced enough yet to replace human experience, which is crucial for accurate diagnosis.”