ATLAS: A High-Difficulty, Multidisciplinary Benchmark for Frontier Scientific Reasoning

18 November 2025
Hongwei Liu, J. Liu, Shudong Liu, Haodong Duan, Yuqiang Li, Mao Su, Xiaohong Liu, G. Zhai, Xinyu Fang, Qianhong Ma, Taolin Zhang, Zihan Ma, Yufeng Zhao, Peiheng Zhou, Linchen Xiao, Wenlong Zhang, Shijie Zhou, Xingjian Ma, S. Sun, J. Ge, Meng Li, Y. Liu, Jianxin Dong, Jiaying Li, H. Wu, H. Liang, Jintai Lin, Y Samuel Wang, J. Dong, Tong Zhu, Tianfan Fu, Conghui He, Qi Zhang, Songyang Zhang, Lei Bai, Kai Chen
Tags: LRM · ALM · ELM
arXiv (abs) · PDF · HTML · HuggingFace (14 upvotes) · GitHub (6311★)
Main: 21 pages · Bibliography: 4 pages · Appendix: 14 pages · 11 figures · 15 tables
Abstract

The rapid advancement of Large Language Models (LLMs) has led to performance saturation on many established benchmarks, calling into question their ability to distinguish frontier models. Concurrently, existing high-difficulty benchmarks often suffer from narrow disciplinary focus, oversimplified answer formats, and vulnerability to data contamination, creating a fidelity gap with real-world scientific inquiry. To address these challenges, we introduce ATLAS (AGI-Oriented Testbed for Logical Application in Science), a large-scale, high-difficulty, cross-disciplinary evaluation suite of approximately 800 original problems. Developed by domain experts (PhD level and above), ATLAS spans seven core scientific fields: mathematics, physics, chemistry, biology, computer science, earth science, and materials science. Its key features include: (1) High Originality and Contamination Resistance: all questions are newly created or substantially adapted to prevent test-data leakage; (2) Cross-Disciplinary Focus: problems are designed to assess models' ability to integrate knowledge and reason across scientific domains; (3) High-Fidelity Answers: complex, open-ended answers involving multi-step reasoning and LaTeX-formatted expressions are prioritized over simple multiple-choice questions; and (4) Rigorous Quality Control: a multi-stage process of expert peer review and adversarial testing ensures question difficulty, scientific value, and correctness. We also propose a robust evaluation paradigm that uses a panel of LLM judges for the automated, nuanced assessment of complex answers. Preliminary results on leading models demonstrate ATLAS's effectiveness in differentiating advanced scientific reasoning capabilities. We plan to develop ATLAS into a long-term, open, community-driven platform that provides a reliable "ruler" for progress toward Artificial General Intelligence.
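The evaluation paradigm described above, a panel of LLM judges voting on open-ended answers, can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the authors' released code: the prompt template, the `Judge` callable interface, and the strict-majority aggregation rule are hypothetical stand-ins for whatever ATLAS actually uses.

```python
# Minimal sketch of a majority-vote LLM-judge panel (illustrative only;
# not the ATLAS implementation). The prompt wording, Judge interface,
# and voting rule below are assumptions.
from typing import Callable, List

# Hypothetical grading prompt; the real ATLAS rubric is not shown here.
JUDGE_PROMPT = """You are grading a scientific answer.
Question: {question}
Reference answer (may contain LaTeX): {reference}
Candidate answer: {candidate}
Reply with exactly one word: CORRECT or INCORRECT."""

Judge = Callable[[str], str]  # maps a grading prompt to the judge's reply


def panel_verdict(question: str, reference: str, candidate: str,
                  judges: List[Judge]) -> bool:
    """Return True if a strict majority of the panel votes CORRECT."""
    prompt = JUDGE_PROMPT.format(
        question=question, reference=reference, candidate=candidate
    )
    votes = sum(judge(prompt).strip().upper() == "CORRECT" for judge in judges)
    return votes > len(judges) / 2


if __name__ == "__main__":
    # Stub judges stand in for real LLM API calls.
    stubs: List[Judge] = [lambda p: "CORRECT",
                          lambda p: "correct",
                          lambda p: "INCORRECT"]
    print(panel_verdict("Compute 2 + 2.", "4", "4", stubs))  # True (2 of 3)
```

An odd panel size avoids ties; in practice each `Judge` would wrap a different frontier model behind the same interface.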
