
TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs

1 March 2024
Tanmay Rajore, Nishanth Chandran, Sunayana Sitaram, Divya Gupta, Rahul Sharma, Kashish Mittal, Manohar Swaminathan

Papers citing "TRUCE: Private Benchmarking to Prevent Contamination and Improve Comparative Evaluation of LLMs"

11 of 11 citing papers shown, newest first:

1. TLUE: A Tibetan Language Understanding Evaluation Benchmark · ELM · 15 Mar 2025
   Fan Gao, Cheng Huang, Nyima Tashi, Xiangxiang Wang, Thupten Tsering, ..., Gadeng Luosang, Rinchen Dongrub, Dorje Tashi, Xiao Feng, Yongbin Yu

2. LSHBloom: Memory-efficient, Extreme-scale Document Deduplication · 06 Nov 2024
   A. Khan, Robert Underwood, Carlo Siebenschuh, Y. Babuji, Aswathy Ajith, Kyle Hippe, Ozan Gokdemir, Alexander Brace, Kyle Chard, Ian T. Foster

3. Bilingual Evaluation of Language Models on General Knowledge in University Entrance Exams with Minimal Contamination · ELM · 19 Sep 2024
   Eva Sánchez Salido, Roser Morante, Julio Gonzalo, Guillermo Marco, Jorge Carrillo-de-Albornoz, ..., Enrique Amigó, Andrés Fernández, Alejandro Benito-Santos, Adrián Ghajari Espinosa, Victor Fresno

4. Benchmark Data Contamination of Large Language Models: A Survey · ELM, ALM · 06 Jun 2024
   Cheng Xu, Shuhao Guan, Derek Greene, Mohand-Tahar Kechadi

5. Task Contamination: Language Models May Not Be Few-Shot Anymore · 26 Dec 2023
   Changmao Li, Jeffrey Flanigan

6. Can Large Language Models Be an Alternative to Human Evaluations? · ALM, LM&MA · 03 May 2023
   Cheng-Han Chiang, Hung-yi Lee

7. CrypTFlow2: Practical 2-Party Secure Inference · 13 Oct 2020
   Deevashwer Rathee, Mayank Rathee, Nishant Kumar, Nishanth Chandran, Divya Gupta, Aseem Rastogi, Rahul Sharma

8. MLQA: Evaluating Cross-lingual Extractive Question Answering · ELM · 16 Oct 2019
   Patrick Lewis, Barlas Oğuz, Ruty Rinott, Sebastian Riedel, Holger Schwenk

9. Are We Modeling the Task or the Annotator? An Investigation of Annotator Bias in Natural Language Understanding Datasets · 21 Aug 2019
   Mor Geva, Yoav Goldberg, Jonathan Berant

10. Hypothesis Only Baselines in Natural Language Inference · 02 May 2018
    Adam Poliak, Jason Naradowsky, Aparajita Haldar, Rachel Rudinger, Benjamin Van Durme

11. ImageNet Large Scale Visual Recognition Challenge · VLM, ObjD · 01 Sep 2014
    Olga Russakovsky, Jia Deng, Hao Su, J. Krause, S. Satheesh, ..., A. Karpathy, A. Khosla, Michael S. Bernstein, Alexander C. Berg, Li Fei-Fei