Large Language Model AI Shows Promise but Requires Further Research for Clinical Integration
In a study led by Mass General Brigham, ChatGPT, a large language model (LLM) AI chatbot, was 72% accurate in overall clinical decision-making. The study, which spanned primary care and emergency settings, offers a comprehensive look at the potential and limitations of AI in healthcare.
HCN Medical Memo
The study underscores the potential of AI as a supplementary tool in clinical decision-making, particularly for final diagnoses. However, it also highlights areas where human expertise remains irreplaceable, such as differential diagnosis. As AI continues to evolve, the medical community must pursue further research and dialogue to ensure its responsible and effective integration into healthcare.
Key Points
- ChatGPT achieved a 72% accuracy rate in overall clinical decision-making, including diagnoses and care management.
- The study utilized 36 standardized, published clinical vignettes to test the AI’s capabilities.
- Marc Succi, MD, associate chair of innovation at Mass General Brigham, compared the AI’s performance to that of a medical school graduate.
- The AI performed consistently across both primary care and emergency settings, and showed no gender bias.
- ChatGPT was most accurate in making final diagnoses (77% accuracy) but struggled with differential diagnoses (60% accuracy).
“ChatGPT struggled with differential diagnosis, which is the meat and potatoes of medicine when a physician has to figure out what to do.”
– Marc Succi, MD, corresponding author and associate chair of innovation and commercialization and strategic innovation leader at Mass General Brigham and executive director of the MESH Incubator
Additional Points
- The study calls for more benchmark research and regulatory guidance before AI tools like ChatGPT can be integrated into clinical care.
- Mass General Brigham is currently evaluating other LLM solutions for clinical documentation and patient messaging.