GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers

29 February 2024
Qintong Li
Leyang Cui
Xueliang Zhao
Lingpeng Kong
Wei Bi
    LRM

Papers citing "GSM-Plus: A Comprehensive Benchmark for Evaluating the Robustness of LLMs as Mathematical Problem Solvers"

50 / 78 papers shown
JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation
Zhenyu Bi
Gaurav Srivastava
Yang Li
Meng Lu
Swastik Roy
Morteza Ziyadi
Xuan Wang
ELM
304
0
0
20 Nov 2025
Numerical Sensitivity and Robustness: Exploring the Flaws of Mathematical Reasoning in Large Language Models
Zhishen Sun
Guang Dai
Ivor Tsang
Haishan Ye
AAML LRM
189
1
0
11 Nov 2025
Generalized-Scale Object Counting with Gradual Query Aggregation
Jer Pelhan
A. Lukežič
Matej Kristan
ObjD
300
1
0
11 Nov 2025
RIDE: Difficulty Evolving Perturbation with Item Response Theory for Mathematical Reasoning
Xinyuan Li
Murong Xu
Wenbiao Tao
Hanlun Zhu
Yike Zhao
Jipeng Zhang
Yunshi Lan
AIMat LRM
389
0
0
06 Nov 2025
Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
Abdelaziz Bounhar
Hadi Abdine
Evan Dufraisse
Ahmad Chamma
Amr Mohamed
Dani Bouch
Michalis Vazirgiannis
Guokan Shang
OffRL ReLM LRM
266
1
0
02 Nov 2025
OPTAGENT: Optimizing Multi-Agent LLM Interactions Through Verbal Reinforcement Learning for Enhanced Reasoning
Zhenyu Bi
Meng Lu
Yang Li
Swastik Roy
Weijie Guan
Morteza Ziyadi
Xuan Wang
LLMAG LRM
221
2
0
20 Oct 2025
Evaluating LLM Reasoning Beyond Correctness and CoT
Soheil Abbasloo
LRM
204
0
0
20 Oct 2025
The Idola Tribus of AI: Large Language Models tend to perceive order where none exists
Shin-nosuke Ishikawa
Masato Todo
Taiki Ogihara
Hirotsugu Ohba
LRM
150
0
0
10 Oct 2025
MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics
Jiapeng Wang
Changxin Tian
Kunlong Chen
Ziqi Liu
Jiaxin Mao
Wayne Xin Zhao
Zhiqiang Zhang
Jun Zhou
155
1
0
10 Oct 2025
AdaSwitch: Balancing Exploration and Guidance in Knowledge Distillation via Adaptive Switching
Jingyu Peng
Xinjian Zhao
Hengyi Cai
Yuchen Li
Kai Zhang
Shuaiqiang Wang
D. Yin
Xiangyu Zhao
145
1
0
09 Oct 2025
Optimal Sparsity of Mixture-of-Experts Language Models for Reasoning Tasks
Taishi Nakamura
Satoki Ishikawa
Masaki Kawamura
Takumi Okamoto
Daisuke Nohara
Jun Suzuki
Rio Yokota
MoE LRM
233
0
0
26 Aug 2025
ObjexMT: Objective Extraction and Metacognitive Calibration for LLM-as-a-Judge under Multi-Turn Jailbreaks
H. Kim
Junwoo Ha
Sangyoon Yu
Haon Park
ELM
374
2
0
23 Aug 2025
ReaLM: Reflection-Enhanced Autonomous Reasoning with Small Language Models
Yuanfeng Xu
Zehui Dai
Jian Liang
Jiapeng Guan
Guangrun Wang
Liang Lin
Xiaohui Lv
LLMAG LRM
178
0
0
17 Aug 2025
MathSmith: Towards Extremely Hard Mathematical Reasoning by Forging Synthetic Problems with a Reinforced Policy
Shaoxiong Zhan
Yanlin Lai
Ziyu Lu
Dahua Lin
Ziqing Yang
Fei Tang
LRM
179
16
0
07 Aug 2025
Diagnosing Memorization in Chain-of-Thought Reasoning, One Token at a Time
Huihan Li
You Chen
Siyuan Wang
Yixin He
Ninareh Mehrabi
Rahul Gupta
Xiang Ren
LRM
331
4
0
04 Aug 2025
Beyond Passive Critical Thinking: Fostering Proactive Questioning to Enhance Human-AI Collaboration
Ante Wang
Yujie Lin
Jingyao Liu
Suhang Wu
Hao Liu
Xinyan Xiao
Jinsong Su
AIMat LRM
310
3
0
31 Jul 2025
Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities
Yunxiang Yan
Tomohiro Sawada
Kartik Goyal
ELM
260
0
0
31 Jul 2025
League of LLMs: A Benchmark-Free Paradigm for Mutual Evaluation of Large Language Models
Q. Guo
Wei Xie
Xiaofang Cai
Enze Wang
Shuoyoucheng Ma
Kai Chen
Xiaofeng Wang
Baosheng Wang
ELM ALM
292
0
0
30 Jul 2025
TRPrompt: Bootstrapping Query-Aware Prompt Optimization from Textual Rewards
Andreea Nica
Ivan Zakazov
Nicolas Mario Baldwin
Saibo Geng
Robert West
ReLM LRM
316
1
0
24 Jul 2025
WSM: Decay-Free Learning Rate Schedule via Checkpoint Merging for LLM Pre-training
Changxin Tian
Jiapeng Wang
Qian Zhao
Kunlong Chen
Jia-Ling Liu
Ziqi Liu
Jiaxin Mao
Wayne Xin Zhao
Zhiqiang Zhang
Jun Zhou
MoMe CLL
313
11
0
23 Jul 2025
Towards Compute-Optimal Many-Shot In-Context Learning
Shahriar Golchin
Yanfei Chen
Rujun Han
Manan Gandhi
Tianli Yu
Swaroop Mishra
Mihai Surdeanu
Rishabh Agarwal
Chen-Yu Lee
Tomas Pfister
279
0
0
22 Jul 2025
ReliableMath: Benchmark of Reliable Mathematical Reasoning on Large Language Models
Boyang Xue
Qi Zhu
Rui Wang
Sheng Wang
Hongru Wang
...
Fei Mi
Yasheng Wang
Lifeng Shang
Qun Liu
Kam-Fai Wong
LRM
264
3
0
03 Jul 2025
Tuning without Peeking: Provable Generalization Bounds and Robust LLM Post-Training
Ismail Labiad
Mathurin Videau
Matthieu Kowalski
Marc Schoenauer
Alessandro Leite
Julia Kempe
O. Teytaud
AAML
345
0
0
02 Jul 2025
Position: Pause Recycling LoRAs and Prioritize Mechanisms to Uncover Limits and Effectiveness
Mei-Yen Chen
Thi Thu Uyen Hoang
Michael Hahn
M. Sarfraz
MoMe
300
0
0
16 Jun 2025
A Survey on Large Language Models for Mathematical Reasoning
Peng-Yuan Wang
Tian-Shuo Liu
Chenyang Wang
Yi-Di Wang
Shu Yan
...
Xu-Hui Liu
Xin-Wei Chen
Jia-Cheng Xu
Ziniu Li
Yang Yu
LRM
381
36
0
10 Jun 2025
Toward Automated Robustness Evaluation of Mathematical Reasoning
Yutao Hou
Zeguan Xiao
Fei Yu
Yihan Jiang
Xuetao Wei
Hailiang Huang
Yun-Nung Chen
Guanhua Chen
AAML LRM
337
7
0
05 Jun 2025
Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models
Soumya Suvra Ghosal
Souradip Chakraborty
Avinash Reddy
Yifu Lu
Mengdi Wang
Dinesh Manocha
Furong Huang
Mohammad Ghavamzadeh
Amrit Singh Bedi
ReLM LRM
442
17
0
04 Jun 2025
STORM-BORN: A Challenging Mathematical Derivations Dataset Curated via a Human-in-the-Loop Multi-Agent Framework
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Wenhao Liu
Zhenyi Lu
Xinyu Hu
Jierui Zhang
Dailin Li
...
Pei Zhang
Chengbo Zhang
Yuxiang Ren
Xiaohong Huang
Yan Ma
OffRL
449
4
0
02 Jun 2025
The Role of Diversity in In-Context Learning for Large Language Models
Wenyang Xiao
Haoyu Zhao
Lingxiao Huang
452
5
0
26 May 2025
SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving
Yujie Hou
Ting Zhang
Mei Wang
Xuetao Ma
Hua Huang
ReLM LRM ELM
622
0
0
22 May 2025
A Japanese Language Model and Three New Evaluation Benchmarks for Pharmaceutical NLP
Issey Sukeda
Takuro Fujii
Kosei Buma
Shunsuke Sasaki
Shinnosuke Ono
ELM
352
3
0
22 May 2025
LFTF: Locating First and Then Fine-Tuning for Mitigating Gender Bias in Large Language Models
Zhanyue Qin
Yue Ding
Deyuan Liu
Qingbin Liu
Junxian Cai
Xi Chen
Zhiying Tu
Dianhui Chu
Cuiyun Gao
Dianbo Sui
269
2
0
21 May 2025
DEBATE, TRAIN, EVOLVE: Self Evolution of Language Model Reasoning
Gaurav Srivastava
Zhenyu Bi
Meng Lu
Xuan Wang
LLMAG LRM
465
6
0
21 May 2025
Beyond Theorem Proving: Formulation, Framework and Benchmark for Formal Problem-Solving
Zijun Chen
Xinhao Zheng
Renqiu Xia
Xingzhi Qi
Qinxiang Cao
Junchi Yan
AIMat
406
2
0
07 May 2025
Parameter-Efficient Checkpoint Merging via Metrics-Weighted Averaging
Shi Jie Yu
Sehyun Choi
MoMe
396
0
0
23 Apr 2025
Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?
Yiyou Sun
Georgia Zhou
Jian Shu
Dexun Li
Nouha Dziri
Dawn Song
ReLM ALM ELM LRM
665
20
1
16 Apr 2025
Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors
Fan Nie
Lan Feng
Haotian Ye
Weixin Liang
Pan Lu
Huaxiu Yao
Alexandre Alahi
James Zou
440
12
0
07 Apr 2025
Exploring LLM Reasoning Through Controlled Prompt Variations
Giannis Chatziveroglou
Richard Yun
Maura Kelleher
AAML LRM
201
13
0
02 Apr 2025
QuestBench: Can LLMs ask the right question to acquire information in reasoning tasks?
Belinda Z. Li
Been Kim
Zehao Wang
LRM
543
34
0
28 Mar 2025
Process or Result? Manipulated Ending Tokens Can Mislead Reasoning LLMs to Ignore the Correct Reasoning Steps
Yu Cui
Bryan Hooi
Yujun Cai
Yiwei Wang
LRM
388
9
0
25 Mar 2025
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri
Melissa Z. Pan
Shuyi Yang
Lakshya A Agrawal
Bhavya Chopra
...
Dan Klein
Kannan Ramchandran
Matei A. Zaharia
Joseph E. Gonzalez
Ion Stoica
LLMAG
816
288
0
17 Mar 2025
Beyond Black-Box Benchmarking: Observability, Analytics, and Optimization of Agentic Systems
Dany Moshkovich
Hadar Mulian
Sergey Zeltyn
Natti Eder
Inna Skarbovsky
Roy Abitbol
279
10
0
09 Mar 2025
Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models
Meghana Arakkal Rajeev
Rajkumar Ramamurthy
Prapti Trivedi
Vikas Yadav
Oluwanifemi Bamgbose
Sathwik Tejaswi Madhusudan
James Zou
Nazneen Rajani
AAML LRM
420
18
0
03 Mar 2025
Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones?
Yudi Zhang
Lu Wang
Meng Fang
Yali Du
Chenghua Huang
...
Qingwei Lin
Mykola Pechenizkiy
Dongmei Zhang
Saravan Rajmohan
Qi Zhang
ALM
374
11
0
26 Feb 2025
Towards Better Understanding of Program-of-Thought Reasoning in Cross-Lingual and Multilingual Environments
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Patomporn Payoungkhamdee
Pume Tuchinda
Jinheon Baek
Samuel Cahyawijaya
Can Udomcharoenchaikit
Potsawee Manakul
Peerat Limkonchotiwat
Ekapol Chuangsuwanich
Sarana Nutanong
LRM
357
5
0
25 Feb 2025
The Self-Improvement Paradox: Can Language Models Bootstrap Reasoning Capabilities without External Scaffolding?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Yutao Sun
Mingshuai Chen
Tiancheng Zhao
Ruochen Xu
Zilun Zhang
Yuxiang Cai
ReLM SyDa LRM
259
10
0
20 Feb 2025
Large Language Models Badly Generalize across Option Length, Problem Types, and Irrelevant Noun Replacements
Guangxiang Zhao
Saier Hu
Xiaoqi Jian
Jinzhu Wu
Yuhan Wu
Change Jia
Lin Sun
Xiangzheng Zhang
582
2
0
18 Feb 2025
MATH-Perturb: Benchmarking LLMs' Math Reasoning Abilities against Hard Perturbations
Kaixuan Huang
Jiacheng Guo
Zihao Li
X. Ji
Jiawei Ge
...
Yangsibo Huang
Chi Jin
Xinyun Chen
Chiyuan Zhang
Mengdi Wang
AAML LRM
764
72
0
10 Feb 2025
The Best Instruction-Tuning Data are Those That Fit
Dylan Zhang
Qirun Dai
Hao Peng
ALM
659
34
0
06 Feb 2025
Coarse-to-Fine Process Reward Modeling for Mathematical Reasoning
Yihan Hu
Sheng Ouyang
Jinman Zhao
Yong Liu
LRM
425
0
0
23 Jan 2025