ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2406.04127
  4. Cited By
Are We Done with MMLU?

Are We Done with MMLU?

6 June 2024
Aryo Pradipta Gema
Joshua Ong Jun Leang
Giwon Hong
Alessio Devoto
Alberto Carlo Maria Mancino
Rohit Saxena
Xuanli He
Yu Zhao
Xiaotang Du
Mohammad Reza Ghasemi Madani
Claire Barale
R. McHardy
Joshua Harris
Jean Kaddour
Emile van Krieken
Pasquale Minervini
    ELM
ArXivPDFHTML

Papers citing "Are We Done with MMLU?"

24 / 24 papers shown
Title
LLMs Outperform Experts on Challenging Biology Benchmarks
LLMs Outperform Experts on Challenging Biology Benchmarks
Lennart Justen
ELM
20
0
0
09 May 2025
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks
Minghao Wu
Weixuan Wang
Sinuo Liu
Huifeng Yin
Xintong Wang
Yu Zhao
Chenyang Lyu
Longyue Wang
Weihua Luo
Kaifu Zhang
ELM
74
0
0
22 Apr 2025
What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks
What the HellaSwag? On the Validity of Common-Sense Reasoning Benchmarks
Pavel Chizhov
Mattia Nee
Pierre-Carl Langlais
Ivan P. Yamshchikov
ReLM
ELM
LRM
31
1
0
10 Apr 2025
HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs
HKCanto-Eval: A Benchmark for Evaluating Cantonese Language Understanding and Cultural Comprehension in LLMs
Tsz Chung Cheng
Chung Shing Cheng
Chaak Ming Lau
Eugene Tin-Ho Lam
Chun Yat Wong
Hoi On Yu
Cheuk Hei Chong
ELM
59
1
0
16 Mar 2025
Unveiling the Mathematical Reasoning in DeepSeek Models: A Comparative Study of Large Language Models
Afrar Jahin
Arif Hassan Zidan
Yu Bao
Shizhe Liang
T. Liu
W. Zhang
LRM
61
1
0
13 Mar 2025
Every FLOP Counts: Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs
Ling Team
B. Zeng
C. Huang
Chao Zhang
Changxin Tian
...
Zhaoxin Huan
Zujie Wen
Zhenhang Sun
Zhuoxuan Du
Z. He
MoE
ALM
100
2
0
07 Mar 2025
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Simin Chen
Yiming Chen
Zexin Li
Yifan Jiang
Zhongwei Wan
...
Dezhi Ran
Tianle Gu
H. Li
Tao Xie
Baishakhi Ray
41
2
0
23 Feb 2025
Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs?
Maybe I Should Not Answer That, but... Do LLMs Understand The Safety of Their Inputs?
Maciej Chrabąszcz
Filip Szatkowski
Bartosz Wójcik
Jan Dubiñski
Tomasz Trzciñski
47
0
0
22 Feb 2025
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial?
Wenzhe Li
Yong Lin
Mengzhou Xia
Chi Jin
MoE
74
2
0
02 Feb 2025
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta
Candace Ross
David Pantoja
R. Passonneau
Megan Ung
Adina Williams
43
1
0
26 Oct 2024
As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative
  Feedback Loss
As Simple as Fine-tuning: LLM Alignment via Bidirectional Negative Feedback Loss
Xin Mao
Feng-Lin Li
Huimin Xu
Wei Zhang
Wang Chen
A. Luu
24
1
0
07 Oct 2024
Aligning LLMs with Individual Preferences via Interaction
Aligning LLMs with Individual Preferences via Interaction
Shujin Wu
May Fung
Cheng Qian
Jeonghwan Kim
Dilek Z. Hakkani-Tür
Heng Ji
26
9
0
04 Oct 2024
Mitigating Selection Bias with Node Pruning and Auxiliary Options
Mitigating Selection Bias with Node Pruning and Auxiliary Options
Hyeong Kyu Choi
Weijie Xu
Chi Xue
Stephanie Eckman
Chandan K. Reddy
23
1
0
27 Sep 2024
Qwen2.5-Coder Technical Report
Qwen2.5-Coder Technical Report
Binyuan Hui
Jian Yang
Zeyu Cui
Jiaxi Yang
Dayiheng Liu
...
Fei Huang
Xingzhang Ren
Xuancheng Ren
Jingren Zhou
Junyang Lin
OSLM
70
195
0
18 Sep 2024
LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
Guijin Son
Hyunwoo Ko
Hoyoung Lee
Yewon Kim
Seunghyeok Hong
ALM
ELM
40
5
0
17 Sep 2024
Questionable practices in machine learning
Questionable practices in machine learning
Gavin Leech
Juan J. Vazquez
Misha Yagudin
Niclas Kupper
Laurence Aitchison
42
2
0
17 Jul 2024
The Base-Rate Effect on LLM Benchmark Performance: Disambiguating
  Test-Taking Strategies from Benchmark Performance
The Base-Rate Effect on LLM Benchmark Performance: Disambiguating Test-Taking Strategies from Benchmark Performance
Kyle Moore
Jesse Roberts
Thao Pham
Oseremhen Ewaleifoh
Doug Fisher
26
2
0
17 Jun 2024
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs
  with Nothing
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
Zhangchen Xu
Fengqing Jiang
Luyao Niu
Yuntian Deng
Radha Poovendran
Yejin Choi
Bill Yuchen Lin
SyDa
27
110
0
12 Jun 2024
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,261
0
28 Jan 2022
Toward Annotator Group Bias in Crowdsourcing
Toward Annotator Group Bias in Crowdsourcing
Haochen Liu
J. Thekinen
Sinem Mollaoglu
Da Tang
Ji Yang
Youlong Cheng
Hui Liu
Jiliang Tang
36
16
0
08 Oct 2021
Deduplicating Training Data Makes Language Models Better
Deduplicating Training Data Makes Language Models Better
Katherine Lee
Daphne Ippolito
A. Nystrom
Chiyuan Zhang
Douglas Eck
Chris Callison-Burch
Nicholas Carlini
SyDa
234
447
0
14 Jul 2021
Are We Modeling the Task or the Annotator? An Investigation of Annotator
  Bias in Natural Language Understanding Datasets
Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets
Mor Geva
Yoav Goldberg
Jonathan Berant
232
319
0
21 Aug 2019
Hypothesis Only Baselines in Natural Language Inference
Hypothesis Only Baselines in Natural Language Inference
Adam Poliak
Jason Naradowsky
Aparajita Haldar
Rachel Rudinger
Benjamin Van Durme
187
574
0
02 May 2018
ImageNet Large Scale Visual Recognition Challenge
ImageNet Large Scale Visual Recognition Challenge
Olga Russakovsky
Jia Deng
Hao Su
J. Krause
S. Satheesh
...
A. Karpathy
A. Khosla
Michael S. Bernstein
Alexander C. Berg
Li Fei-Fei
VLM
ObjD
279
39,083
0
01 Sep 2014
1