A Careful Examination of Large Language Model Performance on Grade School Arithmetic

1 May 2024
Hugh Zhang
Jeff Da
Dean Lee
Vaughn Robinson
Catherine Wu
Will Song
Tiffany Zhao
P. Raja
Dylan Slack
Qin Lyu
Sean Hendryx
Russell Kaplan
Michele Lunati
Summer Yue
    ALM
    LRM
    ELM
arXiv: 2405.00332

Papers citing "A Careful Examination of Large Language Model Performance on Grade School Arithmetic"

24 / 74 papers shown
Title
A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations
Md Tahmid Rahman Laskar
Sawsan Alqahtani
M Saiful Bari
Mizanur Rahman
Mohammad Abdullah Matin Khan
...
Chee Wei Tan
Md. Rizwan Parvez
Enamul Hoque
Shafiq R. Joty
Jimmy Huang
ELM
ALM
27
27
0
04 Jul 2024
MathCAMPS: Fine-grained Synthesis of Mathematical Problems From Human Curricula
Shubhra Mishra
Gabriel Poesia
Belinda Mo
Noah D. Goodman
36
3
0
01 Jul 2024
Too Late to Train, Too Early To Use? A Study on Necessity and Viability of Low-Resource Bengali LLMs
Tamzeed Mahfuz
Satak Kumar Dey
Ruwad Naswan
Hasnaen Adil
Khondker Salman Sayeed
Haz Sameen Shahgir
29
0
0
29 Jun 2024
UniGen: A Unified Framework for Textual Dataset Generation Using Large Language Models
Siyuan Wu
Yue Huang
Chujie Gao
Dongping Chen
Qihui Zhang
...
Tianyi Zhou
Xiangliang Zhang
Jianfeng Gao
Chaowei Xiao
Lichao Sun
SyDa
33
22
0
27 Jun 2024
LiveBench: A Challenging, Contamination-Limited LLM Benchmark
Colin White
Samuel Dooley
Manley Roberts
Arka Pal
Ben Feuer
...
Tom Goldstein
Willie Neiswanger
Micah Goldblum
ELM
37
6
0
27 Jun 2024
PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models
Huixuan Zhang
Yun Lin
Xiaojun Wan
48
0
0
26 Jun 2024
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
Zhehao Zhang
Jiaao Chen
Diyi Yang
LRM
35
7
0
25 Jun 2024
Autonomous Agents for Collaborative Task under Information Asymmetry
Wei Liu
Chenxi Wang
Yifei Wang
Zihao Xie
Rennai Qiu
Yufan Dang
Zhuoyun Du
Weize Chen
Cheng Yang
Chen Qian
LLMAG
36
4
0
21 Jun 2024
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation
Qin Zhu
Qingyuan Cheng
Runyu Peng
Xiaonan Li
Tengxiao Liu
Ru Peng
Xipeng Qiu
Xuanjing Huang
38
6
0
20 Jun 2024
What Are the Odds? Language Models Are Capable of Probabilistic Reasoning
Akshay Paruchuri
Jake Garrison
Shun Liao
John Hernandez
Jacob Sunshine
Tim Althoff
Xin Liu
Daniel J. McDuff
LRM
36
7
0
18 Jun 2024
Language Models are Surprisingly Fragile to Drug Names in Biomedical Benchmarks
Jack Gallifant
Shan Chen
Pedro Moreira
Nikolaj Munch
Mingye Gao
Jackson Pond
Leo Anthony Celi
Hugo J. W. L. Aerts
Thomas Hartvigsen
Danielle S. Bitterman
39
9
0
17 Jun 2024
RepLiQA: A Question-Answering Dataset for Benchmarking LLMs on Unseen Reference Content
Joao Monteiro
Pierre-Andre Noel
Étienne Marcotte
Sai Rajeswar
Valentina Zantedeschi
David Vazquez
Nicolas Chapados
Christopher Pal
Perouz Taslakian
39
4
0
17 Jun 2024
Are Large Language Models True Healthcare Jacks-of-All-Trades? Benchmarking Across Health Professions Beyond Physician Exams
Zheheng Luo
Chenhan Yuan
Qianqian Xie
Sophia Ananiadou
ELM
AI4MH
LM&MA
41
0
0
17 Jun 2024
Exposing the Achilles' Heel: Evaluating LLMs Ability to Handle Mistakes in Mathematical Reasoning
Joykirat Singh
A. Nambi
Vibhav Vineet
LRM
32
5
0
16 Jun 2024
LINGOLY: A Benchmark of Olympiad-Level Linguistic Reasoning Puzzles in Low-Resource and Extinct Languages
Andrew M. Bean
Simi Hellsten
Harry Mayne
Jabez Magomere
Ethan A. Chi
Ryan A. Chi
Scott A. Hale
Hannah Rose Kirk
ELM
LRM
37
7
0
10 Jun 2024
DICE: Detecting In-distribution Contamination in LLM's Fine-tuning Phase for Math Reasoning
Shangqing Tu
Kejian Zhu
Yushi Bai
Zijun Yao
Lei Hou
Juanzi Li
42
4
0
06 Jun 2024
MixEval: Deriving Wisdom of the Crowd from LLM Benchmark Mixtures
Jinjie Ni
Fuzhao Xue
Xiang Yue
Yuntian Deng
Mahir Shah
Kabir Jain
Graham Neubig
Yang You
ELM
30
35
0
03 Jun 2024
Adaptive In-conversation Team Building for Language Model Agents
Linxin Song
Jiale Liu
Jieyu Zhang
Shaokun Zhang
Ao Luo
Shijian Wang
Qingyun Wu
Chi Wang
LLMAG
63
10
0
29 May 2024
On Fairness of Low-Rank Adaptation of Large Models
Zhoujie Ding
Ken Ziyu Liu
Pura Peetathawatchai
Berivan Isik
Sanmi Koyejo
38
4
0
27 May 2024
ConStat: Performance-Based Contamination Detection in Large Language Models
Jasper Dekoninck
Mark Niklas Müller
Martin Vechev
37
5
0
25 May 2024
Lessons from the Trenches on Reproducible Evaluation of Language Models
Stella Biderman
Hailey Schoelkopf
Lintang Sutawika
Leo Gao
J. Tow
...
Xiangru Tang
Kevin A. Wang
Genta Indra Winata
François Yvon
Andy Zou
ELM
ALM
130
53
3
23 May 2024
Benchmarking Benchmark Leakage in Large Language Models
Ruijie Xu
Zengzhi Wang
Run-Ze Fan
Pengfei Liu
56
42
0
29 Apr 2024
Min-K%++: Improved Baseline for Detecting Pre-Training Data from Large Language Models
Jingyang Zhang
Jingwei Sun
Eric C. Yeats
Ouyang Yang
Martin Kuo
Jianyi Zhang
Hao Frank Yang
Hai Li
37
41
0
03 Apr 2024
Conditional and Modal Reasoning in Large Language Models
Wesley H. Holliday
M. Mandelkern
Cedegao E. Zhang
LRM
24
5
0
30 Jan 2024