arXiv:2507.03152
Expert-level validation of AI-generated medical text with scalable language models

3 July 2025
Asad Aali
Vasiliki Bikia
M. Varma
Nicole Chiou
Sophie Ostmeier
Arnav Singhvi
Magdalini Paschali
Ashwin Kumar
Andrew Johnston
Karimar Amador-Martinez
Eduardo Juan Perez Guerrero
Paola Naovi Cruz Rivera
S. Gatidis
Christian Bluethgen
Eduardo Reis
Eddy D. Zandee van Rilland
Poonam Hosamani
Kevin R Keet
Minjoung Go
Evelyn Bin Ling
David B. Larson
Curtis P. Langlotz
R. Daneshjou
Jason Hom
Sanmi Koyejo
Emily Alsentzer
Akshay Chaudhari
Communities: LM&MA, ELM
Links: arXiv (abs) · PDF · HTML · GitHub (12★)
Main: 15 pages · 7 figures · Bibliography: 3 pages · 7 tables · Appendix: 10 pages
Abstract

With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the "LM-as-judge" paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source 1) the codebase (this https URL), 2) MedVAL-Bench (this https URL), and 3) MedVAL-4B (this https URL), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
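Two pieces of the abstract lend themselves to a compact sketch: an evaluator LM judging whether a generated medical output is factually consistent with its source input, and F1-based alignment between the evaluator's labels and physician annotations. The Python sketch below is illustrative only and is not the released MedVAL codebase; query_evaluator_lm, judge, alignment_f1, and the four-level RISK_LEVELS list are hypothetical names, and only sklearn.metrics.f1_score is a real library call.

    # Minimal sketch of the "LM-as-judge" validation loop described in the abstract.
    # Not the MedVAL implementation: function and label names here are hypothetical.
    from sklearn.metrics import f1_score

    # Illustrative risk taxonomy; the paper's physician-defined taxonomy may differ.
    RISK_LEVELS = ["no_error", "minor", "moderate", "severe"]

    def query_evaluator_lm(prompt: str) -> str:
        """Placeholder for a call to an evaluator LM (proprietary API or a
        fine-tuned open-source model); wire this to your own backend."""
        raise NotImplementedError

    def judge(source_text: str, generated_text: str) -> str:
        """Ask the evaluator LM whether the generated output is factually
        consistent with the source input; return one of RISK_LEVELS."""
        prompt = (
            "You are validating AI-generated medical text.\n"
            f"Source input:\n{source_text}\n\n"
            f"Generated output:\n{generated_text}\n\n"
            f"Reply with exactly one risk level from {RISK_LEVELS}, judging only "
            "factual consistency with the source input."
        )
        label = query_evaluator_lm(prompt).strip().lower()
        # Conservative fallback: treat unparseable responses as highest risk.
        return label if label in RISK_LEVELS else "severe"

    def alignment_f1(physician_labels: list[str], evaluator_labels: list[str]) -> float:
        """Macro-averaged F1 between physician annotations and evaluator labels."""
        return f1_score(physician_labels, evaluator_labels,
                        labels=RISK_LEVELS, average="macro")

Mapping any unparseable evaluator response to the highest-risk label is a conservative default consistent with the risk-aware framing in the abstract, not a design choice documented by the paper.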
