
LingLanMiDian: Systematic Evaluation of LLMs on TCM Knowledge and Clinical Reasoning

Rui Hua, Yu Wei, Zixin Shu, Kai Chang, Dengying Yan, Jianan Xia, Zeyu Liu, Hui Zhu, Shujie Song, Mingzhong Xiao, Xiaodong Li, Dongmei Jia, Zhuye Gao, Yanyan Meng, Naixuan Zhao, Yu Fu, Haibin Yu, Benman Yu, Yuanyuan Chen, Fei Dong, Zhizhou Meng, Pengcheng Yang, Songxue Zhao, Lijuan Pei, Yunhui Hu, Kan Ding, Jiayuan Duan, Wenmao Yin, Yang Gu, Runshun Zhang, Qiang Zhu, Jian Yu, Jiansheng Li, Baoyan Liu, Wenjia Wang, Xuezhong Zhou
Abstract

Large language models (LLMs) are advancing rapidly in medical NLP, yet Traditional Chinese Medicine (TCM), with its distinctive ontology, terminology, and reasoning patterns, requires domain-faithful evaluation. Existing TCM benchmarks are fragmented in coverage and scale and rely on non-unified or generation-heavy scoring, which hinders fair comparison. We present the LingLanMiDian (LingLan) benchmark, a large-scale, expert-curated, multi-task suite that unifies evaluation across knowledge recall, multi-hop reasoning, information extraction, and real-world clinical decision-making. LingLan introduces a consistent metric design, a synonym-tolerant protocol for clinical labels, a per-dataset 400-item Hard subset, and a reframing of diagnosis and treatment recommendation as single-choice decision recognition. We conduct comprehensive zero-shot evaluations of 14 leading open-source and proprietary LLMs, providing a unified perspective on their strengths and limitations in TCM commonsense knowledge understanding, reasoning, and clinical decision support; critically, evaluation on the Hard subset reveals a substantial gap between current models and human experts in TCM-specialized reasoning. By bridging fundamental knowledge and applied reasoning through standardized evaluation, LingLan establishes a unified, quantitative, and extensible foundation for advancing TCM LLMs and domain-specific medical AI research. All evaluation data and code are available at this https URL and this http URL.
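The synonym-tolerant protocol is only named in the abstract, not specified. As an illustration of the general idea, the following is a minimal sketch of how such scoring could work, assuming a curated dictionary that maps variant TCM labels to canonical forms; the dictionary entries, function names, and normalization steps here are hypothetical and not the paper's actual implementation.

```python
# Hypothetical sketch of synonym-tolerant label scoring (not the paper's code).
# A prediction counts as correct if it normalizes to the same canonical label
# as the gold answer under a curated synonym dictionary.

# Illustrative synonym dictionary: variant TCM label -> canonical label.
SYNONYMS = {
    "qi deficiency syndrome": "qi deficiency",
    "deficiency of qi": "qi deficiency",
    "blood stasis syndrome": "blood stasis",
}

def canonicalize(label: str) -> str:
    """Trim and lowercase a label, then map it to its canonical form if known."""
    key = label.strip().lower()
    return SYNONYMS.get(key, key)

def synonym_tolerant_match(prediction: str, gold: str) -> bool:
    """True if prediction and gold reduce to the same canonical label."""
    return canonicalize(prediction) == canonicalize(gold)

# Usage: a synonymous variant scores as correct; an unrelated label does not.
assert synonym_tolerant_match("Deficiency of Qi", "qi deficiency")
assert not synonym_tolerant_match("blood stasis", "qi deficiency")
```

The point of such a protocol is that exact string matching would unfairly penalize models that produce clinically equivalent but lexically different labels, which is common given TCM's terminological variation.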
