When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards
arXiv: 2402.01781
1 February 2024
Norah A. Alzahrani, H. A. Alyahya, Sultan Yazeed Alnumay, Muhtasim Tahmid, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Ladhak, Saleh Soltan, Nathan Scales, Marie-Anne Lachaux, Samuel R. Bowman, Haidar Khan (ELM)
Papers citing "When Benchmarks are Targets: Revealing the Sensitivity of Large Language Model Leaderboards" (14 of 14 papers shown)
Understanding the Skill Gap in Recurrent Language Models: The Role of the Gather-and-Aggregate Mechanism
Aviv Bick, Eric P. Xing, Albert Gu (RALM)
22 Apr 2025
ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition
H. A. Alyahya, Haidar Khan, Yazeed Alnumay, M Saiful Bari, B. Yener (LRM)
10 Mar 2025
Investigating Non-Transitivity in LLM-as-a-Judge
Yi Xu, Laura Ruis, Tim Rocktaschel, Robert Kirk
19 Feb 2025
Improving Model Evaluation using SMART Filtering of Benchmark Datasets
Vipul Gupta, Candace Ross, David Pantoja, R. Passonneau, Megan Ung, Adina Williams
26 Oct 2024
Position: LLM Unlearning Benchmarks are Weak Measures of Progress
Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Maurya, Zhiwei Steven Wu, Virginia Smith (MU)
03 Oct 2024
On the Workflows and Smells of Leaderboard Operations (LBOps): An Exploratory Study of Foundation Model Leaderboards
Zhimin Zhao, A. A. Bangash, F. Côgo, Bram Adams, Ahmed E. Hassan
04 Jul 2024
Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers
Manuel Mondal, Ljiljana Dolamic, Gérôme Bovet, Philippe Cudré-Mauroux, Julien Audiffren
21 Jun 2024
Examining the robustness of LLM evaluation to the distributional assumptions of benchmarks
Melissa Ailem, Katerina Marazopoulou, Charlotte Siska, James Bono
25 Apr 2024
Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy
P. Schoenegger, Indre Tuminauskaite, Peter S. Park, Rafael Valdece Sousa Bastos, P. Tetlock
29 Feb 2024
Adaptive Chameleon or Stubborn Sloth: Revealing the Behavior of Large Language Models in Knowledge Conflicts
Jian Xie, Kai Zhang, Jiangjie Chen, Renze Lou, Yu-Chuan Su (RALM)
22 May 2023
Leveraging Large Language Models for Multiple Choice Question Answering
Joshua Robinson, Christopher Rytting, David Wingate (ELM)
22 Oct 2022
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh, Albert Webson, Colin Raffel, Stephen H. Bach, Lintang Sutawika, ..., T. Bers, Stella Biderman, Leo Gao, Thomas Wolf, Alexander M. Rush (LRM)
15 Oct 2021
Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity
Yao Lu, Max Bartolo, Alastair Moore, Sebastian Riedel, Pontus Stenetorp (AILaw, LRM)
18 Apr 2021
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman (ELM)
20 Apr 2018