ResearchTrend.AI
Rethinking Benchmark and Contamination for Language Models with Rephrased Samples

8 November 2023
Shuo Yang
Wei-Lin Chiang
Lianmin Zheng
Joseph E. Gonzalez
Ion Stoica
    ALM

Papers citing "Rethinking Benchmark and Contamination for Language Models with Rephrased Samples"

50 / 84 papers shown
Title
OSVBench: Benchmarking LLMs on Specification Generation Tasks for Operating System Verification
Shangyu Li
Juyong Jiang
Tiancheng Zhao
Jiasi Shen
41
0
0
29 Apr 2025
Private Federated Learning using Preference-Optimized Synthetic Data
Charlie Hou
Mei-Yu Wang
Yige Zhu
Daniel Lazar
Giulia Fanti
FedML
Presented at ResearchTrend Connect | FedML on 07 May 2025
54
0
0
23 Apr 2025
AIMO-2 Winning Solution: Building State-of-the-Art Mathematical Reasoning Models with OpenMathReasoning dataset
Ivan Moshkov
Darragh Hanley
Ivan Sorokin
Shubham Toshniwal
Christof Henkel
Benedikt D. Schifferer
Wei Du
Igor Gitman
ReLM
LRM
40
1
0
23 Apr 2025
ZeroSumEval: Scaling LLM Evaluation with Inter-Model Competition
Haidar Khan
H. A. Alyahya
Yazeed Alnumay
M Saiful Bari
B. Yener
ELM
LRM
47
0
0
17 Apr 2025
PuzzleBench: A Fully Dynamic Evaluation Framework for Large Multimodal Models on Puzzle Solving
Zeyu Zhang
Z. Chen
Zicheng Zhang
Yuze Sun
Yuan Tian
Ziheng Jia
Chunyi Li
Xiaohong Liu
Xiongkuo Min
Guangtao Zhai
MLLM
36
0
0
15 Apr 2025
OpenCodeReasoning: Advancing Data Distillation for Competitive Coding
Wasi Uddin Ahmad
Sean Narenthiran
Somshubra Majumdar
Aleksander Ficek
Siddhartha Jain
Jocelyn Huang
Vahid Noroozi
Boris Ginsburg
LRM
50
2
0
02 Apr 2025
WinoWhat: A Parallel Corpus of Paraphrased WinoGrande Sentences with Common Sense Categorization
I. Gevers
Victor De Marez
Luna De Bruyne
Walter Daelemans
37
0
0
31 Mar 2025
The Emperor's New Clothes in Benchmarking? A Rigorous Examination of Mitigation Strategies for LLM Benchmark Data Contamination
Yifan Sun
Han Wang
Dongbai Li
Gang Wang
Huan Zhang
AAML
53
0
0
20 Mar 2025
ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
H. A. Alyahya
Haidar Khan
Yazeed Alnumay
M Saiful Bari
B. Yener
LRM
58
0
0
10 Mar 2025
Dynamic-KGQA: A Scalable Framework for Generating Adaptive Question Answering Datasets
Preetam Prabhu Srikar Dammu
Himanshu Naidu
Chirag Shah
42
0
0
06 Mar 2025
Recent Advances in Large Langauge Model Benchmarks against Data Contamination: From Static to Dynamic Evaluation
Simin Chen
Yiming Chen
Zexin Li
Yifan Jiang
Zhongwei Wan
...
Dezhi Ran
Tianle Gu
H. Li
Tao Xie
Baishakhi Ray
41
2
0
23 Feb 2025
Multilingual Language Model Pretraining using Machine-translated Data
Jiayi Wang
Yao Lu
Maurice Weber
Max Ryabinin
David Ifeoluwa Adelani
Yihong Chen
Raphael Tang
Pontus Stenetorp
LRM
73
2
0
20 Feb 2025
Batayan: A Filipino NLP benchmark for evaluating Large Language Models
Jann Railey Montalan
Jimson Paulo Layacan
David Demitri Africa
Richell Isaiah Flores
Michael T. Lopez II
Theresa Denise Magsajo
Anjanette Cayabyab
William-Chandra Tjhi
34
0
0
19 Feb 2025
None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
Eva Sánchez Salido
Julio Gonzalo
Guillermo Marco
ELM
58
2
0
18 Feb 2025
Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Nurit Cohen-Inger
Yehonatan Elisha
Bracha Shapira
L. Rokach
Seffi Cohen
ELM
89
0
0
11 Feb 2025
Improving Your Model Ranking on Chatbot Arena by Vote Rigging
Rui Min
Tianyu Pang
Chao Du
Qian Liu
Minhao Cheng
Min-Bin Lin
AAML
57
2
0
29 Jan 2025
AntiLeak-Bench: Preventing Data Contamination by Automatically Constructing Benchmarks with Updated Real-World Knowledge
Xiaobao Wu
Liangming Pan
Yuxi Xie
Ruiwen Zhou
Shuai Zhao
Yubo Ma
Mingzhe Du
Rui Mao
Anh Tuan Luu
William Yang Wang
90
9
0
18 Dec 2024
Text2Cypher: Bridging Natural Language and Graph Databases
Makbule Gulcin Ozsoy
Leila Messallem
Jon Besga
Gianandrea Minneci
67
4
0
13 Dec 2024
BetterBench: Assessing AI Benchmarks, Uncovering Issues, and Establishing Best Practices
Anka Reuel
Amelia F. Hardy
Chandler Smith
Max Lamparth
Malcolm Hardy
Mykel J. Kochenderfer
ELM
62
16
0
20 Nov 2024
Evaluation data contamination in LLMs: how do we measure it and (when) does it matter?
Aaditya K. Singh
Muhammed Yusuf Kocyigit
Andrew Poulton
David Esiobu
Maria Lomeli
Gergely Szilvasy
Dieuwke Hupkes
20
7
0
06 Nov 2024
On Memorization of Large Language Models in Logical Reasoning
Chulin Xie
Yangsibo Huang
Chiyuan Zhang
Da Yu
Xinyun Chen
Bill Yuchen Lin
Bo Li
Badih Ghazi
Ravi Kumar
LRM
45
20
0
30 Oct 2024
Does Data Contamination Detection Work (Well) for LLMs? A Survey and Evaluation on Detection Assumptions
Yujuan Fu
Özlem Uzuner
Meliha Yetisgen
Fei Xia
52
3
0
24 Oct 2024
Contamination Report for Multilingual Benchmarks
Sanchit Ahuja
Varun Gumma
Sunayana Sitaram
16
0
0
21 Oct 2024
CAP: Data Contamination Detection via Consistency Amplification
Yi Zhao
Jing Li
Linyi Yang
24
1
0
19 Oct 2024
WILT: A Multi-Turn, Memorization-Robust Inductive Logic Benchmark for LLMs
Eryk Banatt
Jonathan Cheng
Skanda Vaidyanath
Tiffany Hwu
LRM
27
1
0
14 Oct 2024
Toward General Instruction-Following Alignment for Retrieval-Augmented Generation
Guanting Dong
Xiaoshuai Song
Y. X. Zhu
Runqi Qiao
Zhicheng Dou
Ji-Rong Wen
3DV
46
4
0
12 Oct 2024
Language model developers should report train-test overlap
Andy K. Zhang
Kevin Klyman
Yifan Mai
Yoav Levine
Yian Zhang
Rishi Bommasani
Percy Liang
VLM
ELM
21
8
0
10 Oct 2024
How Much Can We Forget about Data Contamination?
Sebastian Bordt
Suraj Srinivas
Valentyn Boreiko
U. V. Luxburg
41
1
0
04 Oct 2024
CulturalBench: a Robust, Diverse and Challenging Benchmark on Measuring the (Lack of) Cultural Knowledge of LLMs
Yu Ying Chiu
Liwei Jiang
Bill Yuchen Lin
Chan Young Park
Shuyue Stella Li
...
Mehar Bhatia
Maria Antoniak
Yulia Tsvetkov
Vered Shwartz
Yejin Choi
ELM
ALM
45
18
0
03 Oct 2024
OpenMathInstruct-2: Accelerating AI for Math with Massive Open-Source Instruction Data
Shubham Toshniwal
Wei Du
Ivan Moshkov
Branislav Kisacanin
Alexan Ayrapetyan
Igor Gitman
LRM
18
48
0
02 Oct 2024
Beyond Prompts: Dynamic Conversational Benchmarking of Large Language Models
David Castillo-Bolado
Joseph Davidson
Finlay Gray
Marek Rosa
24
2
0
30 Sep 2024
Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination
Eva Sánchez Salido
Roser Morante
Julio Gonzalo
Guillermo Marco
Jorge Carrillo-de-Albornoz
...
Enrique Amigó
Andrés Fernández
Alejandro Benito-Santos
Adrián Ghajari Espinosa
Victor Fresno
ELM
39
0
0
19 Sep 2024
Constructing Domain-Specific Evaluation Sets for LLM-as-a-judge
Ravi Raju
Swayambhoo Jain
Bo Li
Jonathan Li
Urmish Thakker
ALM
ELM
42
11
0
16 Aug 2024
Validation Requirements for AI-based Intervention-Evaluation in Aging and Longevity Research and Practice
G. Fuellen
Anton Y Kulaga
Sebastian Lobentanzer
Maximilian Unfried
Roberto Avelar
Daniel Palmer
Brian K. Kennedy
21
1
0
11 Aug 2024
StructEval: Deepen and Broaden Large Language Model Assessment via Structured Evaluation
Boxi Cao
Mengjie Ren
Hongyu Lin
Xianpei Han
Feng Zhang
Junfeng Zhan
Le Sun
ELM
21
3
0
06 Aug 2024
Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models
Somshubra Majumdar
Vahid Noroozi
Sean Narenthiran
Aleksander Ficek
Wasi Uddin Ahmad
Jocelyn Huang
Jagadeesh Balam
Boris Ginsburg
SyDa
45
2
0
29 Jul 2024
Questionable practices in machine learning
Gavin Leech
Juan J. Vazquez
Misha Yagudin
Niclas Kupper
Laurence Aitchison
42
2
0
17 Jul 2024
LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models
Kaichen Zhang
Bo Li
Peiyuan Zhang
Fanyi Pu
Joshua Adrian Cahyono
...
Shuai Liu
Yuanhan Zhang
Jingkang Yang
Chunyuan Li
Ziwei Liu
85
73
0
17 Jul 2024
Training on the Test Task Confounds Evaluation and Emergence
Ricardo Dominguez-Olmedo
Florian E. Dorner
Moritz Hardt
ELM
53
6
1
10 Jul 2024
On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards
Zhimin Zhao
A. A. Bangash
F. Côgo
Bram Adams
Ahmed E. Hassan
52
0
0
04 Jul 2024
PaCoST: Paired Confidence Significance Testing for Benchmark Contamination Detection in Large Language Models
Huixuan Zhang
Yun Lin
Xiaojun Wan
38
0
0
26 Jun 2024
VarBench: Robust Language Model Benchmarking Through Dynamic Variable Perturbation
Kun Qian
Shunji Wan
Claudia Tang
Youzhi Wang
Xuanming Zhang
Maximillian Chen
Zhou Yu
AAML
35
8
0
25 Jun 2024
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
Zhehao Zhang
Jiaao Chen
Diyi Yang
LRM
32
7
0
25 Jun 2024
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
Jiale Cheng
Yida Lu
Xiaotao Gu
Pei Ke
Xiao-Yang Liu
Yuxiao Dong
Hongning Wang
Jie Tang
Minlie Huang
30
4
0
24 Jun 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
65
125
0
22 Jun 2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts
Varun Gumma
Aditya Yadavalli
Vivek Seshadri
Manohar Swaminathan
Sunayana Sitaram
ELM
32
8
0
21 Jun 2024
Unveiling the Spectrum of Data Contamination in Language Models: A Survey from Detection to Remediation
Chunyuan Deng
Yilun Zhao
Yuzhao Heng
Yitong Li
Jiannan Cao
Xiangru Tang
Arman Cohan
27
13
0
20 Jun 2024
Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation
Qin Zhu
Qingyuan Cheng
Runyu Peng
Xiaonan Li
Tengxiao Liu
Ru Peng
Xipeng Qiu
Xuanjing Huang
23
6
0
20 Jun 2024
Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models
Guanting Dong
K. Lu
Chengpeng Li
Tingyu Xia
Bowen Yu
Chang Zhou
Jingren Zhou
SyDa
ALM
LRM
47
13
0
19 Jun 2024
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
Team GLM:
Aohan Zeng
Bin Xu
Bowen Wang
...
Zhaoyu Wang
Zhen Yang
Zhengxiao Du
Zhenyu Hou
Zihan Wang
ALM
62
473
0
18 Jun 2024