First, do NOHARM: towards clinically safe large language models

David Wu
Fateme Nateghi Haredasht
Saloni Kumar Maharaj
Priyank Jain
Jessica Tran
Matthew Gwiazdon
Arjun Rustagi
Jenelle Jindal
Jacob M. Koshy
Vinay Kadiyala
Anup Agarwal
Bassman Tappuni
Brianna French
Sirus Jesudasen
Christopher V. Cosgriff
Rebanta Chakraborty
Jillian Caldwell
Susan Ziolkowski
David J. Iberri
Robert Diep
Rahul S. Dalal
Kira L. Newman
Kristin Galetta
J. Carl Pallais
Nancy Wei
Kathleen M. Buchheit
David I. Hong
Ernest Y. Lee
Allen Shih
Vartan Pahalyants
Tamara B. Kaplan
Vishnu Ravi
Sarita Khemani
April S. Liang
Daniel Shirvani
Advait Patil
Nicholas Marshall
Kanav Chopra
Joel Koh
Adi Badhwar
Liam G. McCoy
David J. H. Wu
Yingjie Weng
Sumant Ranji
Kevin Schulman
Nigam H. Shah
Jason Hom
Arnold Milstein
Adam Rodman
Jonathan H. Chen
Ethan Goh
Main: 44 pages, 16 figures
Abstract

Large language models (LLMs) are routinely used by physicians and patients for medical advice, yet their clinical safety profiles remain poorly characterized. We present NOHARM (Numerous Options Harm Assessment for Risk in Medicine), a benchmark that uses 100 real primary-care-to-specialist consultation cases to measure the frequency and severity of harm from LLM-generated medical recommendations. NOHARM covers 10 specialties, with 12,747 expert annotations for 4,249 clinical management options. Across 31 LLMs, severe harm occurs in up to 22.2% (95% CI 21.6-22.8%) of cases, with harms of omission accounting for 76.6% (95% CI 76.4-76.8%) of errors. Safety performance is only moderately correlated (r = 0.61-0.64) with existing AI and medical knowledge benchmarks. The best models outperform generalist physicians on safety (mean difference 9.7%, 95% CI 7.0-12.5%), and a diverse multi-agent approach reduces harm compared to solo models (mean difference 8.0%, 95% CI 4.0-12.1%). Therefore, despite strong performance on existing evaluations, widely used AI models can produce severely harmful medical advice at nontrivial rates, underscoring clinical safety as a distinct performance dimension that requires explicit measurement.
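As a purely illustrative sketch (not code from the paper), a per-case severe-harm rate with a 95% confidence interval of the kind reported above could be obtained with a percentile bootstrap over cases. The severity scale (0 = no harm to 3 = severe harm), the data layout, and the function names below are all assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)

def severe_harm_rate(case_labels):
    # A case counts as harmed if any of its annotated options was rated severe (level 3).
    return float(np.mean([any(l == 3 for l in labels) for labels in case_labels]))

def bootstrap_ci(case_labels, n_boot=10_000, alpha=0.05):
    # Percentile bootstrap over cases: resample cases with replacement,
    # recompute the rate, and take the (alpha/2, 1 - alpha/2) percentiles.
    n = len(case_labels)
    stats = [
        severe_harm_rate([case_labels[i] for i in rng.integers(0, n, size=n)])
        for _ in range(n_boot)
    ]
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return severe_harm_rate(case_labels), lo, hi

# Toy data: 100 cases, each with a few options rated 0 (no harm) to 3 (severe).
cases = [rng.integers(0, 4, size=int(rng.integers(3, 8))).tolist() for _ in range(100)]
rate, lo, hi = bootstrap_ci(cases)
print(f"severe harm in {rate:.1%} of cases (95% CI {lo:.1%}-{hi:.1%})")

Resampling at the case level (rather than the option level) respects the clustering of options within cases, which is one common reason to bootstrap over cases in a benchmark of this shape.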
