DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

19 December 2025

Janghoon Han

Heegyu Kim

Changho Lee

Dahm Lee

Min Hyung Park

Hosung Song

Stanley Jungkyu Choi

Moontae Lee

Honglak Lee

ALM

HILM

Main:7 Pages

10 Figures

Bibliography:6 Pages

16 Tables

Appendix:26 Pages

Abstract

As large language models advance, deep research systems capable of generating expert-level reports through multi-step reasoning and evidence-based synthesis are emerging. However, evaluating such reports remains challenging. Existing benchmarks often lack systematic evaluation criteria, rely heavily on LLM-based judges that may miss issues requiring expert judgment, and verify only a limited subset of explicitly cited statements rather than report-wide factual reliability. To address these limitations, we introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains, along with an expert-grounded evaluation taxonomy with seven dimensions and 25 subdimensions, operationalized into 101 fine-grained rubric items. To improve evaluation consistency, DEER provides task-specific Expert Evaluation Guidance to support LLM-based judging. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that verifies both cited and uncited claims and quantifies the quality and reliability of the supporting evidence. Experimental results show that DEER exhibits strong correlation with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
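
The hierarchy described above (rubric items grouped into subdimensions, which are in turn grouped into dimensions) can be illustrated with a small aggregation sketch. This is not the authors' scoring procedure: the abstract does not specify a score scale or aggregation rule, so the unweighted means, the function name `aggregate_rubric_scores`, and the placeholder dimension labels below are all assumptions made purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def aggregate_rubric_scores(item_scores):
    """Roll rubric-item scores up to per-dimension scores.

    item_scores: iterable of (dimension, subdimension, score) tuples.
    Assumption (not from the paper): items are averaged within each
    subdimension, then subdimension means are averaged per dimension.
    """
    # Collect item scores under their (dimension, subdimension) pair.
    by_subdimension = defaultdict(list)
    for dimension, subdimension, score in item_scores:
        by_subdimension[(dimension, subdimension)].append(score)

    # Average items within each subdimension, grouped by dimension.
    by_dimension = defaultdict(list)
    for (dimension, _subdimension), scores in by_subdimension.items():
        by_dimension[dimension].append(mean(scores))

    # Average the subdimension means to get one score per dimension.
    return {dim: mean(sub_means) for dim, sub_means in by_dimension.items()}


# Hypothetical judge outputs; labels and scores are placeholders.
example_scores = [
    ("dimension_a", "sub_a1", 4),
    ("dimension_a", "sub_a1", 2),
    ("dimension_a", "sub_a2", 5),
    ("dimension_b", "sub_b1", 1),
]
per_dimension = aggregate_rubric_scores(example_scores)
```

Averaging at the subdimension level first keeps a subdimension with many rubric items from dominating its dimension; a weighted scheme would be an equally plausible choice.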
