Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2103.03874
Cited By
Measuring Mathematical Problem Solving With the MATH Dataset
5 March 2021
Dan Hendrycks
Collin Burns
Saurav Kadavath
Akul Arora
Steven Basart
Eric Tang
D. Song
Jacob Steinhardt
ReLM
FaML
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Measuring Mathematical Problem Solving With the MATH Dataset"
50 / 1,395 papers shown
Title
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
0
0
26 Apr 2025
PolyMath: Evaluating Mathematical Reasoning in Multilingual Contexts
Y. Wang
Pei Zhang
Jialong Tang
H. Wei
Baosong Yang
...
Y. Zhang
Fei Huang
Junyang Lin
Fei Huang
Jingren Zhou
LRM
50
0
0
25 Apr 2025
Evaluating Grounded Reasoning by Code-Assisted Large Language Models for Mathematics
Zena Al-Khalili
Nick Howell
Dietrich Klakow
LRM
24
0
0
24 Apr 2025
Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency
Zhikai Wang
Jiashuo Sun
W. Zhang
Zhiqiang Hu
Xin Li
F. Wang
Deli Zhao
VLM
LRM
75
0
0
24 Apr 2025
Exploring How LLMs Capture and Represent Domain-Specific Knowledge
Mirian Hipolito Garcia
Camille Couturier
Daniel Madrigal Diaz
Ankur Mallick
Anastasios Kyrillidis
Robert Sim
Victor Rühle
Saravan Rajmohan
25
0
0
23 Apr 2025
MAGIC: Near-Optimal Data Attribution for Deep Learning
Andrew Ilyas
Logan Engstrom
TDI
35
0
0
23 Apr 2025
Process Reward Models That Think
Muhammad Khalifa
Rishabh Agarwal
Lajanugen Logeswaran
Jaekyeom Kim
Hao Peng
Moontae Lee
Honglak Lee
Lu Wang
OffRL
ALM
LRM
44
1
0
23 Apr 2025
Lightweight Latent Verifiers for Efficient Meta-Generation Strategies
Bartosz Piotrowski
Witold Drzewakowski
Konrad Staniszewski
Piotr Miłoś
LRM
36
0
0
23 Apr 2025
AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Ivan Moshkov
Darragh Hanley
Ivan Sorokin
Shubham Toshniwal
Christof Henkel
Benedikt D. Schifferer
Wei Du
Igor Gitman
ReLM
LRM
40
1
0
23 Apr 2025
Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Case Study
Mohammad Khodadad
Ali Shiraee Kasmaee
Mahdi Astaraki
Nicholas Sherck
H. Mahyar
Soheila Samiee
LRM
75
0
0
23 Apr 2025
Param
Δ
Δ
Δ
for Direct Weight Mixing: Post-Train Large Language Model at Zero Cost
Sheng Cao
Mingrui Wu
Karthik Prasad
Yuandong Tian
Zechun Liu
MoMe
77
0
0
23 Apr 2025
Tina: Tiny Reasoning Models via LoRA
Shangshang Wang
Julian Asilis
Ömer Faruk Akgül
Enes Burak Bilgin
Ollie Liu
W. Neiswanger
OffRL
LRM
29
1
0
22 Apr 2025
Dynamic Early Exit in Reasoning Models
Chenxu Yang
Qingyi Si
Yongjie Duan
Zheliang Zhu
Chenyu Zhu
Zheng-Shen Lin
Li Cao
Weiping Wang
ReLM
LRM
30
0
0
22 Apr 2025
Instruction-Tuning Data Synthesis from Scratch via Web Reconstruction
Yuxin Jiang
Y. Wang
Chuhan Wu
Xinyi Dai
Yan Xu
...
Y. Wang
Xin Jiang
Lifeng Shang
R. Tang
W. Wang
29
0
0
22 Apr 2025
DianJin-R1: Evaluating and Enhancing Financial Reasoning in Large Language Models
Jie Zhu
Qian Chen
Huaixia Dou
Junhui Li
Lifan Guo
Feng-Xiang Chen
C. Zhang
LRM
27
0
0
22 Apr 2025
TTRL: Test-Time Reinforcement Learning
Yuxin Zuo
Kaiyan Zhang
Shang Qu
Li Sheng
Xuekai Zhu
Biqing Qi
Youbang Sun
Ganqu Cui
Ning Ding
Bowen Zhou
OffRL
68
1
0
22 Apr 2025
Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning
Jie Cheng
Ruixi Qiao
Lijun Li
Chao Guo
J. Z. Wang
Gang Xiong
Yisheng Lv
Fei-Yue Wang
LRM
80
0
0
21 Apr 2025
MARFT: Multi-Agent Reinforcement Fine-Tuning
Junwei Liao
Muning Wen
J. Wang
W. Zhang
OffRL
25
0
0
21 Apr 2025
Trillion 7B Technical Report
Sungjun Han
Juyoung Suk
Suyeong An
Hyungguk Kim
Kyuseok Kim
Wonsuk Yang
Seungtaek Choi
Jamin Shin
60
0
0
21 Apr 2025
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark
Jasper Götting
Pedro Medeiros
Jon G Sanders
Nathaniel Li
Long Phan
Karam Elabd
Lennart Justen
Dan Hendrycks
Seth Donoughe
ELM
49
2
0
21 Apr 2025
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Yilun Zhou
Austin Xu
Peifeng Wang
Caiming Xiong
Shafiq R. Joty
ELM
ALM
LRM
45
2
0
21 Apr 2025
EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
Yao Shi
Rongkeng Liang
Yong Xu
LLMAG
AI4Ed
ELM
62
0
0
21 Apr 2025
Learning to Reason under Off-Policy Guidance
Jianhao Yan
Yafu Li
Zican Hu
Zhi Wang
Ganqu Cui
Xiaoye Qu
Yu Cheng
Yue Zhang
OffRL
LRM
39
0
0
21 Apr 2025
Improving RL Exploration for LLM Reasoning through Retrospective Replay
Shihan Dou
Muling Wu
Jingwen Xu
Rui Zheng
Tao Gui
Qi Zhang
Xuanjing Huang
OffRL
LRM
22
0
0
19 Apr 2025
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model
Grace Byun
Jinho D. Choi
EGVM
33
0
0
18 Apr 2025
Prejudge-Before-Think: Enhancing Large Language Models at Test-Time by Process Prejudge Reasoning
J. T. Wang
Jin Jiang
Yang Liu
M. Zhang
Xunliang Cai
LRM
32
0
0
18 Apr 2025
Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
Yang Yue
Zhiqi Chen
Rui Lu
Andrew Zhao
Zhaokai Wang
Yang Yue
Shiji Song
Gao Huang
ReLM
LRM
42
9
0
18 Apr 2025
Think Deep, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods
Junlin Wang
Shang Zhu
Jon Saad-Falcon
Ben Athiwaratkun
Qingyang Wu
Jue Wang
S. Song
Ce Zhang
Bhuwan Dhingra
James Y. Zou
LRM
43
1
0
18 Apr 2025
Antidistillation Sampling
Yash Savani
Asher Trockman
Zhili Feng
Avi Schwarzschild
Alexander Robey
Marc Finzi
J. Zico Kolter
44
0
0
17 Apr 2025
ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition
Haidar Khan
H. A. Alyahya
Yazeed Alnumay
M Saiful Bari
B. Yener
ELM
LRM
52
0
0
17 Apr 2025
GeoSense: Evaluating Identification and Application of Geometric Principles in Multimodal Reasoning
Liangyu Xu
Yingxiu Zhao
J. Wang
Yingyao Wang
Bu Pi
...
Jihao Gu
X. Li
Xiaoyong Zhu
Jun Song
Bo Zheng
LRM
109
1
0
17 Apr 2025
THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models
Xiao Pu
Michael Stephen Saxon
Wenyue Hua
William Yang Wang
LRM
28
0
0
17 Apr 2025
Cost-of-Pass: An Economic Framework for Evaluating Language Models
Mehmet Hamza Erol
Batu El
Mirac Suzgun
Mert Yuksekgonul
J. Zou
ELM
33
0
0
17 Apr 2025
ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs
Yan Yang
Yixia Li
Hongru Wang
Xuetao Wei
Jianqiao Yu
Yun-Nung Chen
Guanhua Chen
MoMe
26
0
0
17 Apr 2025
FLIP Reasoning Challenge
Andreas Plesner
Turlan Kuzhagaliyev
Roger Wattenhofer
AAML
VLM
LRM
72
0
0
16 Apr 2025
d1: Scaling Reasoning in Diffusion Large Language Models via Reinforcement Learning
Siyan Zhao
Devaansh Gupta
Qinqing Zheng
Aditya Grover
DiffM
LRM
AI4CE
42
0
0
16 Apr 2025
Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation
Shizhan Cai
Liang Ding
Dacheng Tao
WaLM
52
0
0
16 Apr 2025
Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT?
Yiyou Sun
Georgia Zhou
H. Wang
D. Li
Nouha Dziri
Dawn Song
ReLM
ALM
ELM
LRM
72
0
1
16 Apr 2025
Nemotron-CrossThink: Scaling Self-Learning beyond Math Reasoning
Syeda Nahida Akter
Shrimai Prabhumoye
Matvei Novikov
Seungju Han
Ying Lin
...
Eric Nyberg
Yejin Choi
M. Patwary
M. Shoeybi
Bryan Catanzaro
ReLM
OffRL
LRM
79
0
1
15 Apr 2025
A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce
Wei Xiong
Jiarui Yao
Yuhui Xu
Bo Pang
Lei Wang
...
Junnan Li
Nan Jiang
Tong Zhang
Caiming Xiong
Hanze Dong
OffRL
LRM
36
2
0
15 Apr 2025
Efficient Reasoning Models: A Survey
Sicheng Feng
Gongfan Fang
Xinyin Ma
Xinchao Wang
ReLM
LRM
86
0
0
15 Apr 2025
Mathematical Capabilities of Large Language Models in Finnish Matriculation Examination
Mika Setälä
Pieta Sikström
Ville Heilala
T. Karkkainen
ELM
LRM
23
1
0
15 Apr 2025
Efficient Process Reward Model Training via Active Learning
Keyu Duan
Zichen Liu
Xin Mao
Tianyu Pang
Changyu Chen
Qiguang Chen
Michael Shieh
Longxu Dou
LRM
25
1
0
14 Apr 2025
RealSafe-R1: Safety-Aligned DeepSeek-R1 without Compromising Reasoning Capability
Y. Zhang
Zihao Zeng
Dongbai Li
Yao Huang
Zhijie Deng
Yinpeng Dong
LRM
24
4
0
14 Apr 2025
Weight Ensembling Improves Reasoning in Language Models
Xingyu Dang
Christina Baek
Kaiyue Wen
Zico Kolter
Aditi Raghunathan
MoMe
LRM
60
1
0
14 Apr 2025
xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
Ding Chen
Qingchen Yu
P. Wang
W. Zhang
Bo Tang
Feiyu Xiong
X. Li
Minchuan Yang
Z. Li
ALM
LRM
28
2
0
14 Apr 2025
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar
Shashank Nag
Jason Clemons
L. John
Poulami Das
26
0
0
14 Apr 2025
Two Heads are Better Than One: Test-time Scaling of Multi-agent Collaborative Reasoning
Can Jin
Hongwu Peng
Qixin Zhang
Yujin Tang
Dimitris N. Metaxas
Tong Che
LLMAG
LRM
84
2
0
14 Apr 2025
M1: Towards Scalable Test-Time Compute with Mamba Reasoning Models
Junxiong Wang
Wen-Ding Li
Daniele Paliotta
Daniel Ritter
Alexander M. Rush
Tri Dao
LRM
24
0
0
14 Apr 2025
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
Jinguo Zhu
Weiyun Wang
Zhe Chen
Z. Liu
Shenglong Ye
...
D. Lin
Yu Qiao
Jifeng Dai
Wenhai Wang
W. Wang
MLLM
VLM
66
7
1
14 Apr 2025
Previous
1
2
3
4
5
...
26
27
28
Next