DEER: A Benchmark for Evaluating Deep Research Agents on Expert Report Generation

Janghoon Han
Heegyu Kim
Changho Lee
Dahm Lee
Min Hyung Park
Hosung Song
Stanley Jungkyu Choi
Moontae Lee
Honglak Lee
Main: 7 pages · Appendix: 26 pages · Bibliography: 6 pages · 10 figures · 16 tables
Abstract

As large language models advance, deep research systems capable of generating expert-level reports through multi-step reasoning and evidence-based synthesis are emerging. However, evaluating such reports remains challenging. Existing benchmarks often lack systematic evaluation criteria, rely heavily on LLM-based judges that may miss issues requiring expert judgment, and verify only a limited subset of explicitly cited statements rather than report-wide factual reliability. To address these limitations, we introduce DEER, a benchmark for evaluating expert-level deep research reports. DEER comprises 50 report-writing tasks spanning 13 domains, along with an expert-grounded evaluation taxonomy with seven dimensions and 25 subdimensions, operationalized into 101 fine-grained rubric items. To improve evaluation consistency, DEER provides task-specific Expert Evaluation Guidance to support LLM-based judging. Complementing rubric-based assessment, we propose a document-level fact-checking architecture that verifies both cited and uncited claims and quantifies the quality and reliability of the supporting evidence. Experimental results show that DEER exhibits strong correlation with human expert judgments and yields interpretable diagnostics of system strengths and weaknesses.
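
The abstract describes two evaluation stages without implementation details: rubric-based LLM judging conditioned on task-specific Expert Evaluation Guidance, and document-level fact-checking of both cited and uncited claims. The Python sketch below is only a rough illustration of how such a pipeline could be wired together; the names (`RubricItem`, `Claim`, `judge_report`, `verify_claims`) and the `llm`/`retrieve` callables are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class RubricItem:
    dimension: str     # one of the seven top-level dimensions (assumed structure)
    subdimension: str  # one of the 25 subdimensions
    criterion: str     # fine-grained check; the paper defines 101 such items

@dataclass
class Claim:
    text: str
    citation: Optional[str]  # None for uncited claims

def judge_report(report: str, rubric: list[RubricItem], guidance: str,
                 llm: Callable[[str], str]) -> dict[str, bool]:
    """Score a report against each rubric item with an LLM judge,
    conditioning every prompt on the task-specific evaluation guidance."""
    scores: dict[str, bool] = {}
    for item in rubric:
        prompt = (
            f"Expert Evaluation Guidance:\n{guidance}\n\n"
            f"Dimension: {item.dimension} / {item.subdimension}\n"
            f"Criterion: {item.criterion}\n\n"
            f"Report:\n{report}\n\n"
            "Does the report satisfy this criterion? Answer PASS or FAIL."
        )
        scores[item.criterion] = llm(prompt).strip().upper().startswith("PASS")
    return scores

def verify_claims(claims: list[Claim], retrieve: Callable[[str], str],
                  llm: Callable[[str], str]) -> float:
    """Document-level fact-checking sketch: verify cited and uncited claims
    against evidence and return the fraction judged supported."""
    supported = 0
    for claim in claims:
        # Use the cited source if present; otherwise retrieve candidate evidence.
        evidence = claim.citation or retrieve(claim.text)
        verdict = llm(
            f"Evidence:\n{evidence}\n\nClaim:\n{claim.text}\n\n"
            "Is the claim supported by the evidence? Answer YES or NO."
        )
        supported += verdict.strip().upper().startswith("YES")
    return supported / max(len(claims), 1)
```

This is a minimal sketch under the stated assumptions; the paper's actual judging prompts, evidence-quality scoring, and aggregation across the 101 rubric items are not specified in the abstract.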
