Superhuman performance of a large language model on the reasoning tasks of a physician

Abstract

A seminal paper published by Ledley and Lusted in 1959 introduced complex clinical diagnostic reasoning cases as the gold standard for the evaluation of expert medical computing systems, a standard that has held ever since. Here, we report the results of a physician evaluation of a large language model (LLM) on challenging clinical cases against a baseline of hundreds of physicians. We conduct five experiments to measure clinical reasoning across differential diagnosis generation, display of diagnostic reasoning, triage differential diagnosis, probabilistic reasoning, and management reasoning, all adjudicated by physician experts with validated psychometrics. We then report a real-world study comparing human expert and AI second opinions in randomly selected patients in the emergency room of a major tertiary academic medical center in Boston, MA. We compare LLMs and board-certified physicians at three predefined diagnostic touchpoints: triage in the emergency room, initial evaluation by a physician, and admission to the hospital or intensive care unit. In all experiments (both vignettes and emergency room second opinions), the LLM displayed superhuman diagnostic and reasoning abilities, as well as continued improvement from prior generations of AI clinical decision support. Our study suggests that LLMs have achieved superhuman performance on general medical diagnostic and management reasoning, fulfilling the vision put forth by Ledley and Lusted and motivating the urgent need for prospective trials.

@article{brodeur2025_2412.10849,
  title={Superhuman performance of a large language model on the reasoning tasks of a physician},
  author={Peter G. Brodeur and Thomas A. Buckley and Zahir Kanjee and Ethan Goh and Evelyn Bin Ling and Priyank Jain and Stephanie Cabral and Raja-Elie Abdulnour and Adrian D. Haimovich and Jason A. Freed and Andrew Olson and Daniel J. Morgan and Jason Hom and Robert Gallo and Liam G. McCoy and Haadi Mombini and Christopher Lucas and Misha Fotoohi and Matthew Gwiazdon and Daniele Restifo and Daniel Restrepo and Eric Horvitz and Jonathan Chen and Arjun K. Manrai and Adam Rodman},
  journal={arXiv preprint arXiv:2412.10849},
  year={2025}
}