arXiv:2405.00332
A Careful Examination of Large Language Model Performance on Grade School Arithmetic
1 May 2024
Hugh Zhang
Jeff Da
Dean Lee
Vaughn Robinson
Catherine Wu
Will Song
Tiffany Zhao
P. Raja
Dylan Slack
Qin Lyu
Sean Hendryx
Russell Kaplan
Michele Lunati
Summer Yue
ALM
LRM
ELM
Papers citing
"A Careful Examination of Large Language Model Performance on Grade School Arithmetic"
50 / 74 papers shown
Towards Contamination Resistant Benchmarks
Rahmatullah Musawi
Sheng Lu
27
0
0
13 May 2025
Cost-of-Pass: An Economic Framework for Evaluating Language Models
Mehmet Hamza Erol
Batu El
Mirac Suzgun
Mert Yuksekgonul
J. Zou
ELM
35
0
0
17 Apr 2025
HELIOS: Adaptive Model And Early-Exit Selection for Efficient LLM Inference Serving
Avinash Kumar
Shashank Nag
Jason Clemons
L. John
Poulami Das
26
0
0
14 Apr 2025
Short-Path Prompting in LLMs: Analyzing Reasoning Instability and Solutions for Robust Performance
Zuoli Tang
Junjie Ou
Kaiqin Hu
Chunwei Wu
Zhaoxin Huan
Chilin Fu
Xiaolu Zhang
Jun Zhou
Chenliang Li
ReLM
LRM
38
0
0
13 Apr 2025
Have Large Language Models Learned to Reason? A Characterization via 3-SAT Phase Transition
Rishi Hazra
Gabriele Venturato
Pedro Zuidberg Dos Martires
Luc de Raedt
ReLM
LRM
58
0
0
04 Apr 2025
Generative Evaluation of Complex Reasoning in Large Language Models
Haowei Lin
X. Wang
Ruilin Yan
Baizhou Huang
Haotian Ye
Jianhua Zhu
Zihao Wang
James Y. Zou
Jianzhu Ma
Yitao Liang
ReLM
ELM
LRM
124
0
0
03 Apr 2025
Do "New Snow Tablets" Contain Snow? Large Language Models Over-Rely on Names to Identify Ingredients of Chinese Drugs
Sifan Li
Yujun Cai
Bryan Hooi
Nanyun Peng
Y. Wang
24
0
0
03 Apr 2025
Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models
Irtaza Khalid
Amir Masoud Nourollah
Steven Schockaert
LRM
38
0
0
30 Mar 2025
MARS: A Multi-Agent Framework Incorporating Socratic Guidance for Automated Prompt Optimization
Jian Zhang
Z. Wang
Haiping Zhu
Jun Liu
Qika Lin
Erik Cambria
LLMAG
81
1
0
21 Mar 2025
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
Zhaofeng Wu
Michihiro Yasunaga
Andrew Cohen
Yoon Kim
Asli Celikyilmaz
Marjan Ghazvininejad
38
1
0
14 Mar 2025
Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models
Afrar Jahin
Arif Hassan Zidan
Yu Bao
Shizhe Liang
T. Liu
W. Zhang
LRM
61
1
0
13 Mar 2025
NeurIPS 2023 LLM Efficiency Fine-tuning Competition
Mark Saroufim
Yotam Perlitz
Leshem Choshen
Luca Antiga
Greg Bowyer
...
Ashvini Kumar Jindal
Pawan Kumar Rajpoot
Ankur Parikh
Joe Isaacson
Weiwei Yang
ELM
38
0
0
13 Mar 2025
Toward an Evaluation Science for Generative AI Systems
Laura Weidinger
Deb Raji
Hanna M. Wallach
Margaret Mitchell
Angelina Wang
Olawale Salaudeen
Rishi Bommasani
Sayash Kapoor
Deep Ganguli
Sanmi Koyejo
EGVM
ELM
65
4
0
07 Mar 2025
Are Large Vision Language Models Good Game Players?
Xinyu Wang
Bohan Zhuang
Qi Wu
MLLM
ELM
LRM
94
3
0
04 Mar 2025
EAIRA: Establishing a Methodology for Evaluating AI Models as Scientific Research Assistants
Franck Cappello
Sandeep Madireddy
Robert Underwood
N. Getty
Nicholas Chia
...
M. Rafique
Eliu A. Huerta
B. Li
Ian Foster
Rick L. Stevens
72
1
0
27 Feb 2025
Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer
Punit Singh Koura
Binh Tang
R. Subramanian
Aaditya K. Singh
...
Vedanuj Goswami
Sergey Edunov
Dieuwke Hupkes
Sanmi Koyejo
Sharan Narang
ALM
69
0
0
24 Feb 2025
Reasoning about Affordances: Causal and Compositional Reasoning in LLMs
Magnus F. Gjerde
Vanessa Cheung
David Lagnado
ReLM
LRM
50
0
0
23 Feb 2025
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs
Andreas Opedal
Haruki Shirakami
Bernhard Schölkopf
Abulhair Saparov
Mrinmaya Sachan
LRM
57
1
0
17 Feb 2025
A Survey of Theory of Mind in Large Language Models: Evaluations, Representations, and Safety Risks
Hieu Minh "Jord" Nguyen
LM&MA
LRM
49
0
0
10 Feb 2025
UGPhysics: A Comprehensive Benchmark for Undergraduate Physics Reasoning with Large Language Models
Xin Xu
Qiyun Xu
Tong Xiao
Tianhao Chen
Yuchen Yan
Jiaxin Zhang
Shizhe Diao
Can Yang
Yang Wang
ELM
LRM
AI4CE
100
2
0
01 Feb 2025
Spend Wisely: Maximizing Post-Training Gains in Iterative Synthetic Data Boostrapping
Pu Yang
Yunzhen Feng
Ziyuan Chen
Yuhang Wu
Zhuoyuan Li
DiffM
101
0
0
31 Jan 2025
Parameters vs FLOPs: Scaling Laws for Optimal Sparsity for Mixture-of-Experts Language Models
Samira Abnar
Harshay Shah
Dan Busbridge
Alaaeldin Mohamed Elnouby Ali
J. Susskind
Vimal Thilak
MoE
LRM
33
5
0
28 Jan 2025
Multi-Step Reasoning in Korean and the Emergent Mirage
Guijin Son
Hyunwoo Ko
Dasol Choi
LRM
ReLM
59
0
0
10 Jan 2025
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang
Yuchang Su
Yiming Liu
Xiaohan Wang
James Burgess
...
Josiah Aklilu
Alejandro Lozano
Anjiang Wei
Ludwig Schmidt
Serena Yeung-Levy
50
3
0
06 Jan 2025
Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap
Hyunwoo Ko
Guijin Son
Dasol Choi
RALM
LRM
78
7
0
05 Jan 2025
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
Xiaobao Wu
Liangming Pan
Yuxi Xie
Ruiwen Zhou
Shuai Zhao
Yubo Ma
Mingzhe Du
Rui Mao
Anh Tuan Luu
William Yang Wang
122
9
0
18 Dec 2024
Can Language Models Rival Mathematics Students? Evaluating Mathematical Reasoning through Textual Manipulation and Human Experiments
Andrii Nikolaiev
Yiannos Stathopoulos
Simone Teufel
LRM
74
0
0
16 Dec 2024
INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge
Angelika Romanou
Negar Foroutan
Anna Sotnikova
Zeming Chen
Sree Harsha Nelaturu
...
Mike Zhang
Imanol Schlag
Marzieh Fadaee
Sara Hooker
Antoine Bosselut
ELM
105
6
0
29 Nov 2024
Do Large Language Models Perform Latent Multi-Hop Reasoning without Exploiting Shortcuts?
Sohee Yang
Nora Kassner
E. Gribovskaya
Sebastian Riedel
Mor Geva
KELM
LRM
ReLM
78
4
0
25 Nov 2024
On Memorization of Large Language Models in Logical Reasoning
Chulin Xie
Yangsibo Huang
Chiyuan Zhang
Da Yu
Xinyun Chen
Bill Yuchen Lin
Bo Li
Badih Ghazi
Ravi Kumar
LRM
45
20
0
30 Oct 2024
WorldMedQA-V: a multilingual, multimodal medical examination dataset for multimodal language models evaluation
João Matos
Shan Chen
Siena Placino
Yingya Li
Juan Carlos Climent Pardo
...
Hugo J. W. L. Aerts
L. A. Celi
A. I. Wong
Danielle S. Bitterman
Jack Gallifant
26
0
0
16 Oct 2024
3DArticCyclists: Generating Synthetic Articulated 8D Pose-Controllable Cyclist Data for Computer Vision Applications
Eduardo R. Corral-Soto
Yang Liu
Tongtong Cao
Y. Ren
Liu Bingbing
44
0
0
14 Oct 2024
Benchmark Inflation: Revealing LLM Performance Gaps Using Retro-Holdouts
Jacob Haimes
Cenny Wenner
Kunvar Thaman
Vassil Tashev
Clement Neo
Esben Kran
Jason Schreiber
27
5
0
11 Oct 2024
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
Iman Mirzadeh
Keivan Alizadeh
Hooman Shahrokhi
Oncel Tuzel
Samy Bengio
Mehrdad Farajtabar
AIMat
LRM
58
127
0
07 Oct 2024
Not All LLM Reasoners Are Created Equal
Arian Hosseini
Alessandro Sordoni
Daniel Toyama
Aaron C. Courville
Rishabh Agarwal
LRM
39
11
0
02 Oct 2024
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
Shubham Toshniwal
Wei Du
Ivan Moshkov
Branislav Kisacanin
Alexan Ayrapetyan
Igor Gitman
LRM
18
49
0
02 Oct 2024
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Ezra Karger
Houtan Bastani
Chen Yueh-Han
Zachary Jacobs
Danny Halawi
Fred Zhang
P. Tetlock
33
6
0
30 Sep 2024
Scheherazade: Evaluating Chain-of-Thought Math Reasoning in LLMs with Chain-of-Problems
Stephen Miner
Yoshiki Takashima
Simeng Han
Ferhat Erata
Timos Antonopoulos
R. Piskac
Scott J. Shapiro
LRM
36
3
0
30 Sep 2024
Revisiting the Superficial Alignment Hypothesis
Mohit Raghavendra
Vaskar Nath
Sean Hendryx
LRM
23
0
0
27 Sep 2024
Small Language Models: Survey, Measurements, and Insights
Zhenyan Lu
Xiang Li
Dongqi Cai
Rongjie Yi
Fangming Liu
Xiwen Zhang
Nicholas D. Lane
Mengwei Xu
ObjD
LRM
51
36
0
24 Sep 2024
To CoT or not to CoT? Chain-of-thought helps mainly on math and symbolic reasoning
Zayne Sprague
Fangcong Yin
Juan Diego Rodriguez
Dongwei Jiang
Manya Wadhwa
Prasann Singhal
Xinyu Zhao
Xi Ye
Kyle Mahowald
Greg Durrett
ReLM
LRM
114
82
0
18 Sep 2024
Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
Jasper Dekoninck
Maximilian Baader
Martin Vechev
ALM
92
0
0
01 Sep 2024
Can Large Language Models Reason? A Characterization via 3-SAT
Rishi Hazra
Gabriele Venturato
Pedro Zuidberg Dos Martires
Luc de Raedt
ELM
ReLM
LRM
30
4
0
13 Aug 2024
A Perspective on Large Language Models, Intelligent Machines, and Knowledge Acquisition
V. Cherkassky
Eng Hock Lee
ELM
25
1
0
13 Aug 2024
Evaluating Language Model Math Reasoning via Grounding in Educational Curricula
L. Lucy
Tal August
Rose E. Wang
Luca Soldaini
Courtney Allison
Kyle Lo
ReLM
LRM
29
1
0
08 Aug 2024
Active Testing of Large Language Model via Multi-Stage Sampling
Yuheng Huang
Jiayang Song
Qiang Hu
Felix Juefei-Xu
Lei Ma
21
2
0
07 Aug 2024
AI-Assisted Generation of Difficult Math Questions
Vedant Shah
Dingli Yu
Kaifeng Lyu
Simon Park
Nan Rosemary Ke
...
Yoshua Bengio
Sanjeev Arora
Anirudh Goyal
38
15
0
30 Jul 2024
Questionable practices in machine learning
Gavin Leech
Juan J. Vazquez
Misha Yagudin
Niclas Kupper
Laurence Aitchison
42
3
0
17 Jul 2024
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Kaichen Zhang
Bo Li
Peiyuan Zhang
Fanyi Pu
Joshua Adrian Cahyono
...
Shuai Liu
Yuanhan Zhang
Jingkang Yang
Chunyuan Li
Ziwei Liu
91
74
0
17 Jul 2024
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist
Zihao Zhou
Shudong Liu
Maizhen Ning
Wei Liu
Jindong Wang
Derek F. Wong
Xiaowei Huang
Qiufeng Wang
Kaizhu Huang
ELM
LRM
61
23
0
11 Jul 2024