CausalT5K: Diagnosing and Informing Refusal for Trustworthy Causal Reasoning of Skepticism, Sycophancy, Detection-Correction, and Rung Collapse

9 February 2026

Longling Geng

Andy Ouyang

Theodore Wu

Daphne Barretto

Matthew John Hayes

Rachael Cooper

Yuqiao Zeng

Sameer Vijay

Gia Ancone

Ankit Rai

Matthew Wolfman

Patrick Flanagan

Edward Y. Chang



Main: 8 pages

5 figures

Bibliography: 1 page

21 tables

Appendix: 8 pages

Abstract

LLM failures in causal reasoning, including sycophancy, rung collapse, and miscalibrated refusal, are well documented, yet progress on remediation is slow because no benchmark enables systematic diagnosis. We introduce CausalT5K, a diagnostic benchmark of over 5,000 cases across 10 domains that tests three critical capabilities: (1) detecting rung collapse, where models answer interventional queries with associational evidence; (2) resisting sycophantic drift under adversarial pressure; and (3) generating Wise Refusals that specify the missing information when the evidence underdetermines the answer. Unlike synthetic benchmarks, CausalT5K embeds causal traps in realistic narratives and decomposes performance into Utility (sensitivity) and Safety (specificity), revealing failure modes that aggregate accuracy hides. Developed through a rigorous human-machine collaborative pipeline involving 40 domain experts, iterative cross-validation cycles, and composite verification via rule-based, LLM, and human scoring, CausalT5K implements Pearl's Ladder of Causation as research infrastructure. Preliminary experiments reveal a Four-Quadrant Control Landscape in which static audit policies universally fail, a finding that demonstrates CausalT5K's value for advancing trustworthy reasoning systems. Repository: this https URL
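The decomposition into Utility (sensitivity) and Safety (specificity) can be illustrated with a minimal sketch. The abstract does not specify how cases are labeled or scored, so everything here is an assumption made for illustration: the `Case` record, its fields `claim_is_valid` and `model_accepts`, and the choice to treat "valid causal claim" as the positive class are hypothetical and not taken from the CausalT5K repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    # Hypothetical gold label: is the causal claim actually supported?
    claim_is_valid: bool
    # Hypothetical model verdict on the same claim.
    model_accepts: bool


def utility_and_safety(cases):
    """Return (utility, safety), i.e. (sensitivity, specificity)."""
    tp = sum(c.claim_is_valid and c.model_accepts for c in cases)
    fn = sum(c.claim_is_valid and not c.model_accepts for c in cases)
    tn = sum(not c.claim_is_valid and not c.model_accepts for c in cases)
    fp = sum(not c.claim_is_valid and c.model_accepts for c in cases)
    # Sensitivity: share of valid claims the model accepts.
    utility = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of invalid claims the model rejects.
    safety = tn / (tn + fp) if (tn + fp) else float("nan")
    return utility, safety


def accuracy(cases):
    """Aggregate accuracy: share of cases where verdict matches the label."""
    return sum(c.claim_is_valid == c.model_accepts for c in cases) / len(cases)


# Toy data: a model that accepts every claim, on 8 valid and 2 invalid cases.
always_accept = [Case(True, True)] * 8 + [Case(False, True)] * 2

utility, safety = utility_and_safety(always_accept)
print(f"accuracy={accuracy(always_accept):.2f} "
      f"utility={utility:.2f} safety={safety:.2f}")
```

On this toy data the always-accepting model reaches 0.80 accuracy with a Utility of 1.00 and a Safety of 0.00, which is the kind of failure mode that a single aggregate score would conceal.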
