SUMMARY
OBJECTIVE: The aim of this study was to compare the diagnostic accuracy and initial diagnostic test selection capabilities of large language models with those of an experienced emergency medicine specialist in simulated emergency department scenarios.
METHODS: A series of brief case presentations was created by an expert committee to reflect real-world emergency conditions. Each case presentation included the clinical history and physical examination findings but excluded laboratory and imaging data. The study compared the diagnostic accuracy and initial test selection performance of an emergency medicine specialist with those of three large language model versions: ChatGPT-4, ChatGPT-4o, and ChatGPT-3.5-mini. The accuracy of responses was assessed against predefined correct diagnoses and appropriate first-line tests. Statistical comparisons were performed with Cochran's Q test and the McNemar test.
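The statistical comparison described above can be carried out once each case is scored as correct (1) or incorrect (0) for every rater. The snippet below is a minimal sketch using the Cochran's Q and McNemar implementations in the statsmodels library; the scores are randomly generated placeholders, not the study data.

```python
# Minimal sketch of the analysis, assuming per-case binary correctness scores.
# The score matrix here is a random placeholder, not the study's results.
import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)

# Rows = cases, columns = raters (human expert, ChatGPT-4, ChatGPT-4o, ChatGPT-3.5-mini).
scores = rng.integers(0, 2, size=(100, 4))

# Cochran's Q: do the raters differ overall in the proportion of correct answers?
q_result = cochrans_q(scores)
print(f"Cochran's Q = {q_result.statistic:.2f}, p = {q_result.pvalue:.3f}")

# McNemar: pairwise comparison of the human expert (column 0) with one model
# (column 2), built from the 2x2 table of agreement/disagreement across cases.
expert, model = scores[:, 0], scores[:, 2]
table = np.array([
    [np.sum((expert == 1) & (model == 1)), np.sum((expert == 1) & (model == 0))],
    [np.sum((expert == 0) & (model == 1)), np.sum((expert == 0) & (model == 0))],
])
m_result = mcnemar(table, exact=True)
print(f"McNemar p = {m_result.pvalue:.3f}")
```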
RESULTS: The diagnostic accuracy rates were 92% for the human expert, 97% for ChatGPT-4, and 99% for both ChatGPT-4o and ChatGPT-3.5-mini (p=0.039 for ChatGPT-4o and ChatGPT-3.5-mini vs. the human expert). The accuracy of initial diagnostic test selection was 88% for the human expert, 80% for ChatGPT-4, 87% for ChatGPT-4o, and 89% for ChatGPT-3.5-mini (p>0.05 for all comparisons). The most frequent diagnostic errors involved cardiovascular (7/13) and gastrointestinal (4/13) cases.
CONCLUSIONS: Large language models demonstrated acceptable diagnostic accuracy, outperforming the human expert in diagnosis while performing comparably in selecting initial diagnostic tests. These findings suggest that artificial intelligence models could serve as valuable decision-support tools in emergency medicine. However, further research is needed to evaluate their performance in real-world clinical settings.
KEYWORDS:
Emergency medicine; Artificial intelligence; Diagnosis