New test assesses real-world communication skills of AI doctors

Artificial intelligence tools such as ChatGPT have been touted for their promise to ease clinicians’ workload by triaging patients, taking their medical histories and even providing preliminary diagnoses.
These tools, called large language models, are already used by patients to make sense of their symptoms and medical test results.
But while these AI models perform impressively on standardized medical tests, how well do they fare in situations that more closely mimic the real world?
Not so great, according to the findings of a new study by researchers at Harvard Medical School and Stanford University.
For their analysis, published on January 2 in Nature Medicine, the researchers designed an evaluation framework – or test – called CRAFT-MD (Conversational Reasoning Assessment Framework for Testing in Medicine) and deployed it on four large language models to see how they performed in settings that closely mimicked real interactions with patients.
All four large language models performed well on medical exam-style questions, but their performance deteriorated when they were engaged in conversations that more closely mimicked real-world interactions.
According to the researchers, this gap highlights a twofold need: first, to create more realistic evaluations that better gauge the fitness of clinical AI models for real-world use and, second, to improve the ability of these tools to make diagnoses based on more realistic interactions before they are deployed in the clinic.
According to the research team, evaluation tools such as CRAFT-MD can not only assess AI models more accurately for real-world fitness but could also help optimize their performance in the clinic.
“Our work reveals a striking paradox: while these AI models excel at medical exams, they struggle to handle the basic back-and-forth of a doctor’s visit,” said the lead author of the study, Pranav Rajpurkar, assistant professor of biomedical informatics at Harvard Medical School.
“The dynamic nature of medical conversations – the need to ask the right questions at the right time, piece together scattered information, and reason through symptoms – poses unique challenges that go far beyond answering multiple-choice questions. When we move from standardized tests to these natural conversations, even the most sophisticated AI models show significant drops in diagnostic accuracy.”
A better test to check AI’s real-world performance
Currently, developers test the performance of AI models by asking them to answer multiple-choice medical questions, typically derived from the national exam for graduating medical students or from tests given to medical residents as part of their certification.
“This approach assumes that all relevant information is presented in a clear and concise manner, often with medical terminology or buzzwords that simplify the diagnostic process, but in the real world this process is much more complicated,” said Shreya Johri, study co-author and doctoral student in the Rajpurkar laboratory at Harvard Medical School.
“We need a testing framework that better reflects reality and is therefore better able to predict a model’s performance.”
CRAFT-MD was designed to be one such more realistic gauge.
To simulate real-world interactions, CRAFT-MD evaluates how well large language models can collect information about symptoms, medications, and family history and then make a diagnosis. An AI agent is used to pose as a patient, answering questions in a conversational, natural style.
Another AI agent grades the accuracy of the final diagnosis rendered by the large language model. Human experts then evaluate the outcome of each encounter for the model’s ability to gather relevant patient information, its diagnostic accuracy when presented with scattered information, and its adherence to prompts.
Researchers used CRAFT-MD to test four AI models, both proprietary or commercial and open-source ones, evaluating their performance on 2,000 clinical vignettes featuring conditions common in primary care and across 12 medical specialties.
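The three-role setup described above (a model under test, a simulated patient, and a grader) can be pictured as a simple loop. The following is a minimal sketch only, not the researchers’ implementation: every name in it (`Vignette`, `Encounter`, `run_encounter`, the `DIAGNOSIS:` prefix, and the toy agents) is a hypothetical stand-in, and the toy agents are plain rule-based functions used in place of real language models.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical names throughout; the actual CRAFT-MD code may look nothing like this.

@dataclass
class Vignette:
    facts: Dict[str, str]   # what the simulated patient knows, keyed by topic
    diagnosis: str          # reference diagnosis used for grading

@dataclass
class Encounter:
    transcript: List[Tuple[str, str]] = field(default_factory=list)
    final_diagnosis: str = ""
    correct: bool = False

def run_encounter(vignette: Vignette,
                  clinician: Callable[[List[Tuple[str, str]]], str],
                  patient: Callable[[Vignette, str], str],
                  grader: Callable[[str, str], bool],
                  max_turns: int = 10) -> Encounter:
    """Let the clinician model question the patient agent, then grade its diagnosis."""
    enc = Encounter()
    for _ in range(max_turns):
        message = clinician(enc.transcript)
        if message.startswith("DIAGNOSIS:"):      # assumed stop convention
            enc.final_diagnosis = message[len("DIAGNOSIS:"):].strip()
            break
        enc.transcript.append((message, patient(vignette, message)))
    enc.correct = grader(enc.final_diagnosis, vignette.diagnosis)
    return enc

# Toy stand-ins for the three agents (rule-based, no real model calls).
def toy_clinician(transcript):
    topics = ["symptoms", "medications", "family history"]
    if len(transcript) < len(topics):
        return f"Can you tell me about your {topics[len(transcript)]}?"
    return "DIAGNOSIS: migraine"

def toy_patient(vignette, question):
    for topic, detail in vignette.facts.items():
        if topic in question:
            return detail
    return "I'm not sure."

def toy_grader(predicted, reference):
    return predicted.strip().lower() == reference.strip().lower()

vignette = Vignette(
    facts={"symptoms": "throbbing one-sided headache with nausea",
           "medications": "none",
           "family history": "my mother has migraines"},
    diagnosis="Migraine",
)
encounter = run_encounter(vignette, toy_clinician, toy_patient, toy_grader)
print(len(encounter.transcript), encounter.final_diagnosis, encounter.correct)
```

The human expert review described in the article would sit on top of the stored transcripts and is not modeled here.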
All AI models showed limitations, particularly in their ability to conduct clinical conversations and reason based on information provided by patients. This, in turn, compromised their ability to take medical histories and make a proper diagnosis. For example, the models often struggled to ask the right questions to gather relevant patient history, missed critical information during history taking, and had difficulty synthesizing scattered information.
The models’ accuracy declined when they were presented with open-ended information rather than multiple-choice answers. They also performed worse when engaged in back-and-forth exchanges – as most real-world conversations are – than when given summarized conversations.
Recommendations for optimizing real-world AI performance
Based on these findings, the team offers a set of recommendations both to AI developers who design AI models and to regulators responsible for evaluating and approving these tools.
These include:
- Use open-ended conversational questions that more accurately reflect unstructured doctor-patient interactions in the design, training, and testing of AI tools
- Evaluate models for their ability to ask the right questions and extract the most essential information
- Design models capable of following multiple conversations and integrating information from them
- Design AI models capable of integrating textual (conversation notes) and non-textual (images, ECG) data
- Design more sophisticated AI agents that can interpret non-verbal cues such as facial expressions, tone and body language
Additionally, the evaluation should include both AI agents and human experts, the researchers recommend, because relying solely on human experts is labor-intensive and expensive. For example, CRAFT-MD outpaced human evaluators, processing 10,000 conversations in 48 to 72 hours, plus 15 to 16 hours of expert evaluation.
In contrast, human-based approaches would require extensive recruitment and approximately 500 hours for patient simulations (nearly three minutes per conversation) and approximately 650 hours for expert evaluations (nearly four minutes per conversation). Using AI evaluators as a first line has the added benefit of eliminating the risk of exposing real patients to unverified AI tools.
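The per-conversation figures follow directly from the totals quoted above. As a quick check, the short calculation below reproduces them from the article’s numbers; the variable and function names are illustrative only. Because the article’s totals are approximate, the results are too.

```python
# Figures quoted in the article; names are illustrative, not from the study.
CONVERSATIONS = 10_000
HUMAN_SIMULATION_HOURS = 500      # approximate, human patient simulations
HUMAN_EVALUATION_HOURS = 650      # approximate, human expert evaluations
CRAFT_MD_HOURS = (48 + 15, 72 + 16)  # processing plus expert review, low and high

def minutes_per_conversation(total_hours: float, conversations: int) -> float:
    """Convert a total in hours to an average in minutes per conversation."""
    return total_hours * 60 / conversations

simulation_minutes = minutes_per_conversation(HUMAN_SIMULATION_HOURS, CONVERSATIONS)
evaluation_minutes = minutes_per_conversation(HUMAN_EVALUATION_HOURS, CONVERSATIONS)
human_total_hours = HUMAN_SIMULATION_HOURS + HUMAN_EVALUATION_HOURS

print(f"Simulation: {simulation_minutes:.1f} min per conversation")
print(f"Evaluation: {evaluation_minutes:.1f} min per conversation")
print(f"Human-only total: {human_total_hours} hours")
print(f"CRAFT-MD total: {CRAFT_MD_HOURS[0]} to {CRAFT_MD_HOURS[1]} hours")
```

This yields about 3 and 3.9 minutes per conversation, roughly 1,150 hours for a human-only approach, and 63 to 88 hours for CRAFT-MD. The two totals are not strictly like for like, since the article does not say whether the 48 to 72 hours are elapsed time or labor.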
The researchers said they expect CRAFT-MD itself to also be updated and optimized periodically to incorporate improved patient-AI models.
“As a physician-scientist, I am interested in AI models that can augment clinical practice in an efficient and ethical manner,” said study co-senior author Roxana Daneshjou, assistant professor of biomedical data science and dermatology at Stanford University.
“CRAFT-MD creates a framework that more closely reflects real-world interactions and thus helps advance the field when it comes to testing the performance of AI models in healthcare.”
More information:
An evaluation framework for the clinical use of large language models in patient interaction tasks, Nature Medicine (2024). DOI: 10.1038/s41591-024-03328-5
Citation: New test assesses real-world communication skills of AI doctors (January 2, 2025), retrieved January 2, 2025 from -ai-doctors-real-world- communication.html
