Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2402.03927
Cited By
Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs
6 February 2024
Simone Balloccu
Patrícia Schmidtová
Mateusz Lango
Ondrej Dusek
SILM
ELM
PILM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Leak, Cheat, Repeat: Data Contamination and Evaluation Malpractices in Closed-Source LLMs"
46 / 46 papers shown
Title
Position: AI Competitions Provide the Gold Standard for Empirical Rigor in GenAI Evaluation
D. Sculley
Will Cukierski
Phil Culliton
Sohier Dane
Maggie Demkin
...
Addison Howard
Paul Mooney
Walter Reade
Megan Risdal
Nate Keating
26
0
0
01 May 2025
Using LLMs in Generating Design Rationale for Software Architecture Decisions
Xiyu Zhou
Ruiyin Li
Peng Liang
Beiqi Zhang
Mojtaba Shahin
Z. Li
Chen Yang
33
0
0
29 Apr 2025
SAGE
\texttt{SAGE}
SAGE
: A Generic Framework for LLM Safety Evaluation
Madhur Jindal
Hari Shrawgi
Parag Agrawal
Sandipan Dandapat
ELM
47
0
0
28 Apr 2025
Generative Induction of Dialogue Task Schemas with Streaming Refinement and Simulated Interactions
James D. Finch
Yasasvi Josyula
Jinho D. Choi
33
0
0
25 Apr 2025
Generative Evaluation of Complex Reasoning in Large Language Models
Haowei Lin
X. Wang
Ruilin Yan
Baizhou Huang
Haotian Ye
Jianhua Zhu
Zihao Wang
James Y. Zou
Jianzhu Ma
Yitao Liang
ReLM
ELM
LRM
76
0
0
03 Apr 2025
Leveraging Large Language Models for Building Interpretable Rule-Based Data-to-Text Systems
Jędrzej Warczyński
Mateusz Lango
Ondrej Dusek
31
0
0
28 Feb 2025
Integration of LLM Quality Assurance into an NLG System
Ching-Yi Chen
Johanna Heininger
Adela Schneider
Christian Eckard
Andreas Madsack
Robert Weißgraeber
31
0
0
28 Jan 2025
Open or Closed LLM for Lesser-Resourced Languages? Lessons from Greek
John Pavlopoulos
Juli Bakagianni
K. Pouli
M. Gavriilidou
45
0
0
22 Jan 2025
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
33
3
0
24 Oct 2024
Fine-tuning can Help Detect Pretraining Data from Large Language Models
H. Zhang
Songxin Zhang
Bingyi Jing
Hongxin Wei
31
0
0
09 Oct 2024
ForecastBench: A Dynamic Benchmark of AI Forecasting Capabilities
Ezra Karger
Houtan Bastani
Chen Yueh-Han
Zachary Jacobs
Danny Halawi
Fred Zhang
P. Tetlock
24
6
0
30 Sep 2024
Leveraging Open-Source Large Language Models for Native Language Identification
Yee Man Ng
Ilia Markov
17
0
0
15 Sep 2024
WinoPron: Revisiting English Winogender Schemas for Consistency, Coverage, and Grammatical Case
Vagrant Gautam
Julius Steuer
Eileen Bingert
Ray Johns
Anne Lauscher
Dietrich Klakow
33
3
0
09 Sep 2024
Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers
Manuel Mondal
Ljiljana Dolamic
Gérôme Bovet
Philippe Cudré-Mauroux
Julien Audiffren
16
2
0
21 Jun 2024
SEC-QA: A Systematic Evaluation Corpus for Financial QA
Viet Dac Lai
Michael Krumdick
Charles Lovering
Varshini Reddy
Craig W. Schmidt
Chris Tanner
33
3
0
20 Jun 2024
Automating Easy Read Text Segmentation
Jesús Calleja
Thierry Etchegoyhen
David Ponce
20
1
0
17 Jun 2024
MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models
Tianle Gu
Zeyang Zhou
Kexin Huang
Dandan Liang
Yixu Wang
...
Keqing Wang
Yujiu Yang
Yan Teng
Yu Qiao
Yingchun Wang
ELM
28
9
0
11 Jun 2024
AGB-DE: A Corpus for the Automated Legal Assessment of Clauses in German Consumer Contracts
Daniel Braun
Florian Matthes
AILaw
ELM
18
3
0
10 Jun 2024
Benchmark Data Contamination of Large Language Models: A Survey
Cheng Xu
Shuhao Guan
Derek Greene
Mohand-Tahar Kechadi
ELM
ALM
27
38
0
06 Jun 2024
Large Language Models Can Better Understand Knowledge Graphs Than We Thought
Xinbang Dai
Yuncheng Hua
Tongtong Wu
Yang Sheng
Qiu Ji
Guilin Qi
63
0
0
18 Feb 2024
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation
Zhongshen Zeng
Pengguang Chen
Shu Liu
Haiyun Jiang
Jiaya Jia
ReLM
ELM
LRM
19
18
0
28 Dec 2023
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
Xingyao Wang
Zihan Wang
Jiateng Liu
Yangyi Chen
Lifan Yuan
Hao Peng
Heng Ji
LRM
120
137
0
19 Sep 2023
Can Large Language Models Understand Real-World Complex Instructions?
Qi He
Jie Zeng
Wenhao Huang
Lina Chen
Jin Xiao
...
Shisong Chen
Yikai Zhang
Zhouhong Gu
Jiaqing Liang
Yanghua Xiao
ALM
LRM
ELM
84
50
0
17 Sep 2023
RecurrentGPT: Interactive Generation of (Arbitrarily) Long Text
Wangchunshu Zhou
Yuchen Eleanor Jiang
Peng Cui
Tiannan Wang
Zhenxin Xiao
Yifan Hou
Ryan Cotterell
Mrinmaya Sachan
RALM
LLMAG
73
58
0
22 May 2023
Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts
Jian Xie
Kai Zhang
Jiangjie Chen
Renze Lou
Yu-Chuan Su
RALM
198
150
0
22 May 2023
Chain-of-Knowledge: Grounding Large Language Models via Dynamic Knowledge Adapting over Heterogeneous Sources
Xingxuan Li
Ruochen Zhao
Yew Ken Chia
Bosheng Ding
Shafiq R. Joty
Soujanya Poria
Lidong Bing
HILM
BDL
LRM
77
85
0
22 May 2023
Faithful Question Answering with Monte-Carlo Planning
Ruixin Hong
Hongming Zhang
Honghui Zhao
Dong Yu
Changshui Zhang
ReLM
LRM
46
12
0
04 May 2023
Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang
Hung-yi Lee
ALM
LM&MA
201
559
0
03 May 2023
Don't Stop Pretraining? Make Prompt-based Fine-tuning Powerful Learner
Zhengxiang Shi
Aldo Lipani
VLM
CLL
19
16
0
02 May 2023
Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation
Jiawei Liu
Chun Xia
Yuyao Wang
Lingming Zhang
ELM
ALM
163
388
0
02 May 2023
We're Afraid Language Models Aren't Modeling Ambiguity
Alisa Liu
Zhaofeng Wu
Julian Michael
Alane Suhr
Peter West
Alexander Koller
Swabha Swayamdipta
Noah A. Smith
Yejin Choi
51
87
0
27 Apr 2023
Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
Jingfeng Yang
Hongye Jin
Ruixiang Tang
Xiaotian Han
Qizhang Feng
Haoming Jiang
Bing Yin
Xia Hu
LM&MA
123
593
0
26 Apr 2023
ChatGPT-Crawler: Find out if ChatGPT really knows what it's talking about
Aman Rangapur
Haoran Wang
AI4MH
23
3
0
06 Apr 2023
MGTBench: Benchmarking Machine-Generated Text Detection
Xinlei He
Xinyue Shen
Z. Chen
Michael Backes
Yang Zhang
DeLMO
48
99
0
26 Mar 2023
GPT is becoming a Turing machine: Here are some ways to program it
A. Jojic
Zhen Wang
Nebojsa Jojic
LRM
24
17
0
25 Mar 2023
Towards Making the Most of ChatGPT for Machine Translation
Keqin Peng
Liang Ding
Qihuang Zhong
Li Shen
Xuebo Liu
Min Zhang
Y. Ouyang
Dacheng Tao
LRM
79
132
0
24 Mar 2023
Is ChatGPT A Good Keyphrase Generator? A Preliminary Study
M. Song
Haiyun Jiang
Shuming Shi
Songfang Yao
Shilong Lu
Yi Feng
Huafeng Liu
L. Jing
47
26
0
23 Mar 2023
Can we trust the evaluation on ChatGPT?
Rachith Aiyappa
Jisun An
Haewoon Kwak
Yong-Yeol Ahn
ELM
ALM
LLMAG
AI4MH
LRM
101
76
0
22 Mar 2023
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck
Varun Chandrasekaran
Ronen Eldan
J. Gehrke
Eric Horvitz
...
Scott M. Lundberg
Harsha Nori
Hamid Palangi
Marco Tulio Ribeiro
Yi Zhang
ELM
AI4MH
AI4CE
ALM
197
2,953
0
22 Mar 2023
Can ChatGPT Replace Traditional KBQA Models? An In-depth Analysis of the Question Answering Performance of the GPT LLM Family
Yiming Tan
Dehai Min
Y. Li
Wenbo Li
Nan Hu
Yongrui Chen
Guilin Qi
AI4MH
ELM
47
51
0
14 Mar 2023
A comprehensive evaluation of ChatGPT's zero-shot Text-to-SQL capability
Aiwei Liu
Xuming Hu
Lijie Wen
Philip S. Yu
LMTD
AI4MH
56
143
0
12 Mar 2023
Do large language models resemble humans in language use?
Zhenguang G. Cai
Xufeng Duan
David A. Haslett
Shuqi Wang
M. Pickering
ALM
67
37
0
10 Mar 2023
Will Affective Computing Emerge from Foundation Models and General AI? A First Evaluation on ChatGPT
Mostafa M. Amin
Erik Cambria
Björn W. Schuller
AI4MH
46
70
0
03 Mar 2023
Language Generation Models Can Cause Harm: So What Can We Do About It? An Actionable Survey
Sachin Kumar
Vidhisha Balachandran
Lucille Njoo
Antonios Anastasopoulos
Yulia Tsvetkov
ELM
58
59
0
14 Oct 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Yao Lu
Max Bartolo
Alastair Moore
Sebastian Riedel
Pontus Stenetorp
AILaw
LRM
274
882
0
18 Apr 2021
1