It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education

13 March 2025
Shrutika Singh
Anton Alyakin
Daniel Alber
Jaden Stryker
Ai Phuong S Tong
Karl L. Sangwon
Nicolas K. Goff
Mathew de la Paz
Miguel Hernandez-Rovira
Ki Yun Park
Eric Leuthardt
E. Oermann
    AI4MH
    AI4Ed
    ELM
Abstract

The performance of Large Language Models (LLMs) on multiple-choice question (MCQ) benchmarks is frequently cited as proof of their medical capabilities. We hypothesized that LLM performance on medical MCQs may in part be illusory and driven by factors beyond medical content knowledge and reasoning capabilities. To assess this, we created a novel benchmark of free-response questions with paired MCQs (FreeMedQA). Using this benchmark, we evaluated three state-of-the-art LLMs (GPT-4o, GPT-3.5, and Llama-3-70B-instruct) and found an average absolute deterioration of 39.43% in performance on free-response questions relative to multiple-choice questions (p = 1.3 × 10⁻⁵), which was greater than the human performance decline of 22.29%. To isolate the role of the MCQ format in performance, we performed a masking study, iteratively masking out parts of the question stem. At 100% masking, the average LLM multiple-choice performance was 6.70% greater than random chance (p = 0.002), with one LLM (GPT-4o) obtaining an accuracy of 37.34%. Notably, for all LLMs, free-response performance at 100% masking was near zero. Our results highlight how medical MCQ benchmarks can overestimate the capabilities of LLMs in medicine and, more broadly, the potential for improving both human and machine assessments using LLM-evaluated free-response questions.
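The abstract does not specify how the question stem was masked, so the following is only a minimal illustrative sketch of the general idea, not the authors' procedure. It assumes word-level masking at random positions; the function names `mask_stem` and `chance_accuracy`, the `[MASK]` token, and the fixed seed are all hypothetical choices made for this example.

```python
import random


def mask_stem(stem: str, fraction: float, mask_token: str = "[MASK]", seed: int = 0) -> str:
    """Replace a fraction of the whitespace-delimited words in a question stem.

    Assumption: masking is word-level and positions are chosen at random.
    The source only states that parts of the stem were masked iteratively.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    words = stem.split()
    n_mask = round(fraction * len(words))
    rng = random.Random(seed)  # fixed seed so the same positions are masked on every run
    masked_positions = set(rng.sample(range(len(words)), n_mask))
    return " ".join(
        mask_token if i in masked_positions else word
        for i, word in enumerate(words)
    )


def chance_accuracy(n_options: int) -> float:
    """Expected accuracy of uniform random guessing over n_options answer choices."""
    if n_options < 1:
        raise ValueError("n_options must be at least 1")
    return 1.0 / n_options


if __name__ == "__main__":
    # Hypothetical stem, used only to show the masking levels.
    stem = "A patient presents with fever and a stiff neck"
    for fraction in (0.0, 0.5, 1.0):
        print(f"{fraction:.0%}: {mask_stem(stem, fraction)}")
```

At `fraction=1.0` every word of the stem is replaced, so a model sees only the answer options. Under this setup, multiple-choice accuracy above `chance_accuracy(n_options)` would have to come from the options themselves rather than from the clinical content of the question.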

@article{singh2025_2503.13508,
  title={It is Too Many Options: Pitfalls of Multiple-Choice Questions in Generative AI and Medical Education},
  author={Shrutika Singh and Anton Alyakin and Daniel Alexander Alber and Jaden Stryker and Ai Phuong S Tong and Karl Sangwon and Nicolas Goff and Mathew de la Paz and Miguel Hernandez-Rovira and Ki Yun Park and Eric Claude Leuthardt and Eric Karl Oermann},
  journal={arXiv preprint arXiv:2503.13508},
  year={2025}
}