GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers

29 February 2024

Qintong Li

Leyang Cui

Xueliang Zhao

Lingpeng Kong

Wei Bi

LRM

Papers citing "GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers"

50 / 78 papers shown

JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation

304

20 Nov 2025

Numerical Sensitivity and Robustness: Exploring the Flaws of Mathematical Reasoning in Large Language Models

189

11 Nov 2025

Generalized-Scale Object Counting with Gradual Query Aggregation

300

11 Nov 2025

RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning

389

06 Nov 2025

Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR

Michalis Vazirgiannis

Guokan Shang

OffRL ReLM LRM

266

02 Nov 2025

OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning

221

20 Oct 2025

Evaluating LLM Reasoning Beyond Correctness and CoT

Soheil Abbasloo

LRM

204

20 Oct 2025

The Idola Tribus of AI: Large Language Models tend to perceive order where none exists

150

10 Oct 2025

MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics

155

10 Oct 2025

AdaSwitch: Balancing Exploration and Guidance in Knowledge Distillation via Adaptive Switching

145

09 Oct 2025

Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks

233

26 Aug 2025

ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks

374

23 Aug 2025

ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models

178

17 Aug 2025

MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy

179

07 Aug 2025

Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time

331

04 Aug 2025

Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration

310

31 Jul 2025

Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities

260

31 Jul 2025

League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models

Xiaofeng Wang

Baosheng Wang

ELM ALM

292

30 Jul 2025

TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards

Andreea Nica

Ivan Zakazov

Nicolas Mario Baldwin

Saibo Geng

Robert West

ReLM LRM

316

24 Jul 2025

WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training

313

23 Jul 2025

Towards Compute-Optimal Many-Shot In-Context Learning

279

22 Jul 2025

ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models

...

264

03 Jul 2025

Tuning without Peeking: Provable Generalization Bounds and Robust LLM Post-Training

345

02 Jul 2025

Position: Pause Recycling LoRAs and Prioritize Mechanisms to Uncover Limits and Effectiveness

300

16 Jun 2025

A Survey on Large Language Models for Mathematical Reasoning

...

381

10 Jun 2025

Toward Automated Robustness Evaluation of Mathematical Reasoning

337

05 Jun 2025

Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models

442

04 Jun 2025

STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

...

449

02 Jun 2025

The Role of Diversity in In-Context Learning for Large Language Models

Wenyang Xiao

Haoyu Zhao

Lingxiao Huang

452

26 May 2025

SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving

622

22 May 2025

A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP

352

22 May 2025

LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models

269

21 May 2025

DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning

465

21 May 2025

Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving

406

07 May 2025

Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging

Shi Jie Yu

Sehyun Choi

MoMe

396

23 Apr 2025

Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?

665

16 Apr 2025

Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors

440

07 Apr 2025

Exploring LLM Reasoning Through Controlled Prompt Variations

Giannis Chatziveroglou

Richard Yun

Maura Kelleher

AAML LRM

201

02 Apr 2025

QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?

543

28 Mar 2025

Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps

388

25 Mar 2025

Why Do Multi-Agent LLM Systems Fail?

...

816

288

17 Mar 2025

Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems

279

09 Mar 2025

Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models

Meghana Arakkal Rajeev

Sathwik Tejaswi Madhusudan

James Zou

Nazneen Rajani

AAML LRM

420

03 Mar 2025

Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?

...

374

26 Feb 2025

Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

Patomporn Payoungkhamdee

Pume Tuchinda

Jinheon Baek

Samuel Cahyawijaya

Can Udomcharoenchaikit

Potsawee Manakul

Peerat Limkonchotiwat

Ekapol Chuangsuwanich

Sarana Nutanong

LRM

357

25 Feb 2025

The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

259

20 Feb 2025

Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements

582

18 Feb 2025

MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations

...

764

10 Feb 2025

The Best Instruction-Tuning Data are Those That Fit

659

06 Feb 2025

Coarse-to-Fine Process Reward Modeling for Mathematical Reasoning

425

23 Jan 2025