
BEnchmarking LLMs for Ophthalmology (BELO) for Ophthalmological Knowledge and Reasoning

Sahana Srinivasan
Xuguang Ai
Thaddaeus Wai Soon Lo
Aidan Gilson
Minjie Zou
Ke Zou
Hyunjae Kim
Mingjia Yang
Krithi Pushpanathan
Samantha Yew
Wan Ting Loke
Jocelyn Goh
Yibing Chen
Yiming Kong
Emily Yuelei Fu
Michelle Ongyong Hui
Kristen Nwanyanwu
Amisha Dave
Kelvin Zhenghao Li
Chen-Hsin Sun
Mark Chia
Gabriel Dawei Yang
Wendy Meihua Wong
David Ziyou Chen
Dianbo Liu
Maxwell Singer
Fares Antaki
Lucian V Del Priore
Jost Jonas
Ron Adelman
Qingyu Chen
Yih-Chung Tham
Main: 47 pages, 11 figures, 9 tables
Abstract

Current benchmarks evaluating large language models (LLMs) in ophthalmology are limited in scope and disproportionately prioritize accuracy. We introduce BELO (BEnchmarking LLMs for Ophthalmology), a standardized and comprehensive evaluation benchmark developed through multiple rounds of expert checking by 13 ophthalmologists. BELO assesses ophthalmology-related clinical accuracy and reasoning quality. Using keyword matching and a fine-tuned PubMedBERT model, we curated ophthalmology-specific multiple-choice questions (MCQs) from diverse medical datasets (BCSC, MedMCQA, MedQA, BioASQ, and PubMedQA). Duplicate and substandard questions were systematically removed; ten ophthalmologists then refined the explanation of each MCQ's correct answer, and these explanations were further adjudicated by three senior ophthalmologists. To illustrate BELO's utility, we evaluated six LLMs (OpenAI o1, o3-mini, GPT-4o, DeepSeek-R1, Llama-3-8B, and Gemini 1.5 Pro) using accuracy, macro-F1, and five text-generation metrics (ROUGE-L, BERTScore, BARTScore, METEOR, and AlignScore). In a further human evaluation, two ophthalmologists qualitatively reviewed 50 randomly selected model outputs for accuracy, comprehensiveness, and completeness. BELO consists of 900 high-quality, expert-reviewed questions aggregated from five sources: BCSC (260), BioASQ (10), MedMCQA (572), MedQA (40), and PubMedQA (18). A public leaderboard has been established to promote transparent evaluation and reporting. Importantly, BELO will remain a hold-out, evaluation-only benchmark to ensure fair and reproducible comparisons of future models.
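
To make the curation step concrete, the following is a minimal Python sketch of how a keyword pass combined with a fine-tuned PubMedBERT classifier could filter ophthalmology-specific questions, as the abstract describes. The checkpoint path, keyword list, classifier label, sample questions, and the rule for combining the two stages are all illustrative assumptions, not the authors' released artifacts.

    # Hypothetical two-stage filter: cheap keyword matching, then a
    # PubMedBERT-based classifier (placeholder checkpoint, not BELO's).
    from transformers import pipeline

    OPHTHO_KEYWORDS = {"retina", "glaucoma", "cornea", "cataract",
                       "macula", "uveitis", "intraocular"}

    def keyword_hit(question: str) -> bool:
        # First pass: keep questions mentioning any ophthalmology keyword.
        text = question.lower()
        return any(kw in text for kw in OPHTHO_KEYWORDS)

    # Second pass: a binary text classifier fine-tuned from PubMedBERT.
    # The model path and the "OPHTHALMOLOGY" label are placeholders.
    classifier = pipeline(
        "text-classification",
        model="path/to/pubmedbert-ophthalmology-classifier",
    )

    def is_ophthalmology(question: str, threshold: float = 0.5) -> bool:
        result = classifier(question, truncation=True)[0]
        return result["label"] == "OPHTHALMOLOGY" and result["score"] >= threshold

    questions = [
        "Which condition presents with a mid-dilated, non-reactive pupil?",
        "What is the first-line therapy for H. pylori infection?",
    ]
    # Union of the two stages is one plausible combination rule;
    # the paper's exact criterion may differ.
    candidates = [q for q in questions if keyword_hit(q) or is_ophthalmology(q)]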
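
Similarly, here is a minimal sketch of the MCQ metrics named above (accuracy and macro-F1) and one of the text-generation metrics (ROUGE-L), computed with off-the-shelf libraries (scikit-learn and rouge-score); the answer letters and explanation strings are invented examples, not BELO data. BERTScore, BARTScore, METEOR, and AlignScore follow the same reference-vs-generation pattern with their respective packages.

    # Illustrative scoring of MCQ answers and free-text explanations.
    from sklearn.metrics import accuracy_score, f1_score
    from rouge_score import rouge_scorer

    # Hypothetical model outputs: predicted answer letters per question.
    gold_answers = ["A", "C", "B", "D"]
    pred_answers = ["A", "C", "D", "D"]

    print("accuracy:", accuracy_score(gold_answers, pred_answers))  # 0.75
    print("macro-F1:", f1_score(gold_answers, pred_answers, average="macro"))

    # ROUGE-L F-measure between reference and generated explanations.
    gold_expl = ["Acute angle-closure glaucoma presents with a mid-dilated pupil."]
    pred_expl = ["A mid-dilated, non-reactive pupil suggests acute angle closure."]

    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    for ref, hyp in zip(gold_expl, pred_expl):
        print("ROUGE-L:", scorer.score(ref, hyp)["rougeL"].fmeasure)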
