ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2407.08733
  4. Cited By
Is Your Model Really A Good Math Reasoner? Evaluating Mathematical
  Reasoning with Checklist

Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist

11 July 2024
Zihao Zhou
Shudong Liu
Maizhen Ning
Wei Liu
Jindong Wang
Derek F. Wong
Xiaowei Huang
Qiufeng Wang
Kaizhu Huang
    ELM
    LRM
ArXivPDFHTML

Papers citing "Is Your Model Really A Good Math Reasoner? Evaluating Mathematical Reasoning with Checklist"

10 / 10 papers shown
Title
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
60
0
0
05 May 2025
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Hamed Mahdavi
Alireza Hashemi
Majid Daliri
Pegah Mohammadipour
Alireza Farhadi
Samira Malek
Yekta Yazdanifard
Amir Khasahmadi
V. Honavar
ELM
LRM
35
1
0
01 Apr 2025
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
PRMBench: A Fine-grained and Challenging Benchmark for Process-Level Reward Models
Mingyang Song
Zhaochen Su
Xiaoye Qu
Jiawei Zhou
Yu-Xi Cheng
LRM
36
29
0
06 Jan 2025
UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
UTMath: Math Evaluation with Unit Test via Reasoning-to-Coding Thoughts
Bo Yang
Qingping Yang
Runtao Liu
Runtao Liu
LRM
ReLM
ELM
AIMat
51
1
0
11 Nov 2024
MathBench: Evaluating the Theory and Application Proficiency of LLMs
  with a Hierarchical Mathematics Benchmark
MathBench: Evaluating the Theory and Application Proficiency of LLMs with a Hierarchical Mathematics Benchmark
Hongwei Liu
Zilong Zheng
Yuxuan Qiao
Haodong Duan
Zhiwei Fei
Fengzhe Zhou
Wenwei Zhang
Songyang Zhang
Dahua Lin
Kai-xiang Chen
44
6
0
20 May 2024
Compression Represents Intelligence Linearly
Compression Represents Intelligence Linearly
Yuzhen Huang
Jinghan Zhang
Zifei Shan
Junxian He
29
24
0
15 Apr 2024
InternVL: Scaling up Vision Foundation Models and Aligning for Generic
  Visual-Linguistic Tasks
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
126
895
0
21 Dec 2023
Don't Make Your LLM an Evaluation Benchmark Cheater
Don't Make Your LLM an Evaluation Benchmark Cheater
Kun Zhou
Yutao Zhu
Zhipeng Chen
Wentong Chen
Wayne Xin Zhao
Xu Chen
Yankai Lin
Ji-Rong Wen
Jiawei Han
ELM
94
136
0
03 Nov 2023
Penguins Don't Fly: Reasoning about Generics through Instantiations and
  Exceptions
Penguins Don't Fly: Reasoning about Generics through Instantiations and Exceptions
Emily Allaway
Jena D. Hwang
Chandra Bhagavatula
Kathleen McKeown
Doug Downey
Yejin Choi
LRM
35
20
0
23 May 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,261
0
28 Jan 2022
1