ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2311.01964
  4. Cited By
Don't Make Your LLM an Evaluation Benchmark Cheater

Don't Make Your LLM an Evaluation Benchmark Cheater

3 November 2023
Kun Zhou
Yutao Zhu
Zhipeng Chen
Wentong Chen
Wayne Xin Zhao
Xu Chen
Yankai Lin
Ji-Rong Wen
Jiawei Han
    ELM
ArXivPDFHTML

Papers citing "Don't Make Your LLM an Evaluation Benchmark Cheater"

15 / 15 papers shown
Title
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
LLM-Evaluation Tropes: Perspectives on the Validity of LLM-Evaluations
Laura Dietz
Oleg Zendel
P. Bailey
Charles L. A. Clarke
Ellese Cotterill
Jeff Dalton
Faegheh Hasibi
Mark Sanderson
Nick Craswell
ELM
35
0
0
27 Apr 2025
Large language models could be rote learners
Large language models could be rote learners
Yuyang Xu
Renjun Hu
Haochao Ying
J. Wu
Xing Shi
Wei Lin
ELM
39
0
0
11 Apr 2025
Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina
Take Caution in Using LLMs as Human Surrogates: Scylla Ex Machina
Yuan Gao
Dokyun Lee
Gordon Burtch
Sina Fazelpour
LRM
36
7
0
25 Oct 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
29
3
0
24 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples
Baiqi Li
Zhiqiu Lin
Wenxuan Peng
Jean de Dieu Nyandwi
Daniel Jiang
Zixian Ma
Simran Khanuja
Ranjay Krishna
Graham Neubig
Deva Ramanan
AAML
CoGe
VLM
34
20
0
18 Oct 2024
Detecting Training Data of Large Language Models via Expectation Maximization
Detecting Training Data of Large Language Models via Expectation Maximization
Gyuwan Kim
Yang Li
Evangelia Spiliopoulou
Jie Ma
Miguel Ballesteros
William Yang Wang
MIALM
80
3
2
10 Oct 2024
Fine-tuning can Help Detect Pretraining Data from Large Language Models
Fine-tuning can Help Detect Pretraining Data from Large Language Models
H. Zhang
Songxin Zhang
Bingyi Jing
Hongxin Wei
29
0
0
09 Oct 2024
Training on the Benchmark Is Not All You Need
Training on the Benchmark Is Not All You Need
Shiwen Ni
Xiangtao Kong
Chengming Li
Xiping Hu
Ruifeng Xu
Jia Zhu
Min Yang
34
5
0
03 Sep 2024
Bringing AI Participation Down to Scale: A Comment on Open AIs Democratic Inputs to AI Project
Bringing AI Participation Down to Scale: A Comment on Open AIs Democratic Inputs to AI Project
David Moats
Chandrima Ganguly
VLM
27
0
0
16 Jul 2024
Training on the Test Task Confounds Evaluation and Emergence
Training on the Test Task Confounds Evaluation and Emergence
Ricardo Dominguez-Olmedo
Florian E. Dorner
Moritz Hardt
ELM
42
6
1
10 Jul 2024
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Data Mixing Laws: Optimizing Data Mixtures by Predicting Language Modeling Performance
Jiasheng Ye
Peiju Liu
Tianxiang Sun
Yunhua Zhou
Jun Zhan
Xipeng Qiu
24
58
0
25 Mar 2024
Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model
Spurious Feature Eraser: Stabilizing Test-Time Adaptation for Vision-Language Foundation Model
Huan Ma
Yan Zhu
Changqing Zhang
Peilin Zhao
Baoyuan Wu
Long-Kai Huang
Qinghua Hu
Bing Wu
VLM
39
1
0
01 Mar 2024
Institutional Platform for Secure Self-Service Large Language Model Exploration
Institutional Platform for Secure Self-Service Large Language Model Exploration
V. Bumgardner
Mitchell A. Klusty
W. V. Logan
Samuel E. Armstrong
Caylin D. Hickey
Jeff Talbert
Caylin Hickey
Jeff Talbert
27
1
0
01 Feb 2024
Can we trust the evaluation on ChatGPT?
Can we trust the evaluation on ChatGPT?
Rachith Aiyappa
Jisun An
Haewoon Kwak
Yong-Yeol Ahn
ELM
ALM
LLMAG
AI4MH
LRM
94
76
0
22 Mar 2023
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
The Pile: An 800GB Dataset of Diverse Text for Language Modeling
Leo Gao
Stella Biderman
Sid Black
Laurence Golding
Travis Hoppe
...
Horace He
Anish Thite
Noa Nabeshima
Shawn Presser
Connor Leahy
AIMat
236
1,508
0
31 Dec 2020
1