Title
What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns Michael A. Hedderich Anyi Wang Raoyuan Zhao Florian Eichin Barbara Plank 30 0 0 22 Apr 2025
Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark Jasper Götting Pedro Medeiros Jon G Sanders Nathaniel Li Long Phan Karam Elabd Lennart Justen Dan Hendrycks Seth Donoughe ELM 49 2 0 21 Apr 2025
Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora Alex Warstadt Aaron Mueller Leshem Choshen E. Wilcox Chengxu Zhuang ... Rafael Mosquera Bhargavi Paranjape Adina Williams Tal Linzen Ryan Cotterell 38 107 0 10 Apr 2025
ZeroSumEval: An Extensible Framework For Scaling LLM Evaluation with Inter-Model Competition H. A. Alyahya Haidar Khan Yazeed Alnumay M Saiful Bari B. Yener LRM 65 1 0 10 Mar 2025
Toward an Evaluation Science for Generative AI Systems Laura Weidinger Deb Raji Hanna M. Wallach Margaret Mitchell Angelina Wang Olawale Salaudeen Rishi Bommasani Sayash Kapoor Deep Ganguli Sanmi Koyejo EGVM ELM 65 4 0 07 Mar 2025
Position: Model Collapse Does Not Mean What You Think Rylan Schaeffer Joshua Kazdan Alvan Caleb Arulandu Sanmi Koyejo 60 0 0 05 Mar 2025
Varco Arena: A Tournament Approach to Reference-Free Benchmarking Large Language Models Seonil Son Ju-Min Oh Heegon Jin Cheolhun Jang Jeongbeom Jeong Kuntae Kim 44 0 0 20 Feb 2025
MathGAP: Out-of-Distribution Evaluation on Problems with Arbitrarily Complex Proofs Andreas Opedal Haruki Shirakami Bernhard Schölkopf Abulhair Saparov Mrinmaya Sachan LRM 57 1 0 17 Feb 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models Yulei Qin Yuncheng Yang Pengcheng Guo Gang Li Hang Shao Yuchen Shi Zihan Xu Yun Gu Ke Li Xing Sun ALM 90 12 0 31 Dec 2024
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games Davide Paglieri Bartłomiej Cupiał Samuel Coward Ulyana Piterbarg Maciej Wolczyk ... Lerrel Pinto Rob Fergus Jakob Foerster Jack Parker-Holder Tim Rocktaschel LLMAG LRM 108 10 0 20 Nov 2024
Improving Model Evaluation using SMART Filtering of Benchmark Datasets Vipul Gupta Candace Ross David Pantoja R. Passonneau Megan Ung Adina Williams 70 1 0 26 Oct 2024
Enabling Scalable Evaluation of Bias Patterns in Medical LLMs Hamed Fayyaz Raphael Poulain Rahmatollah Beheshti 32 1 0 18 Oct 2024
NaturalBench: Evaluating Vision-Language Models on Natural Adversarial Samples Baiqi Li Zhiqiu Lin Wenxuan Peng Jean de Dieu Nyandwi Daniel Jiang Zixian Ma Simran Khanuja Ranjay Krishna Graham Neubig Deva Ramanan AAML CoGe VLM 69 21 0 18 Oct 2024
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems Nandan Thakur Suleman Kazi Ge Luo Jimmy J. Lin Amin Ahmad VLM RALM 28 7 0 17 Oct 2024
SYNTHEVAL: Hybrid Behavioral Testing of NLP Models with Synthetic CheckLists Raoyuan Zhao Abdullatif Köksal Yihong Liu Leonie Weissweiler Anna Korhonen Hinrich Schütze SyDa 36 1 0 30 Aug 2024
K-Sort Arena: Efficient and Reliable Benchmarking for Generative Models via K-wise Human Preferences Zhikai Li Xuewen Liu Dongrong Fu Jianquan Li Qingyi Gu Kurt Keutzer Zhen Dong EGVM VGen DiffM 83 1 0 26 Aug 2024
Extrinsic Evaluation of Cultural Competence in Large Language Models Shaily Bhatt Fernando Diaz ELM EGVM 47 4 0 17 Jun 2024
HelloFresh: LLM Evaluations on Streams of Real-World Human Editorial Actions across X Community Notes and Wikipedia edits Tim Franzmeyer Aleksandar Shtedritski Samuel Albanie Philip H. S. Torr João F. Henriques Jakob N. Foerster 27 1 0 05 Jun 2024
What is it for a Machine Learning Model to Have a Capability? Jacqueline Harding Nathaniel Sharadin ELM 36 3 0 14 May 2024
Adversarial Nibbler: An Open Red-Teaming Method for Identifying Diverse Harms in Text-to-Image Generation Jessica Quaye Alicia Parrish Oana Inel Charvi Rastogi Hannah Rose Kirk ... Nathan Clement Rafael Mosquera Juan Ciro Vijay Janapa Reddi Lora Aroyo 31 7 0 14 Feb 2024
How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation Yoo Yeon Sung Ishani Mondal Jordan L. Boyd-Graber 28 0 0 20 Jan 2024
Deep Emotions Across Languages: A Novel Approach for Sentiment Propagation in Multilingual WordNets Jan Kocoñ 22 5 0 07 Dec 2023
Latent Feature-based Data Splits to Improve Generalisation Evaluation: A Hate Speech Detection Case Study Maike Zufle Verna Dankers Ivan Titov 31 0 0 16 Nov 2023
Toward Stronger Textual Attack Detectors Pierre Colombo Marine Picot Nathan Noiry Guillaume Staerman Pablo Piantanida 38 5 0 21 Oct 2023
Anchor Points: Benchmarking Models with Much Fewer Examples Rajan Vivek Kawin Ethayarajh Diyi Yang Douwe Kiela ALM 29 21 0 14 Sep 2023
No Train No Gain: Revisiting Efficient Training Algorithms For Transformer-based Language Models Jean Kaddour Oscar Key Piotr Nawrot Pasquale Minervini Matt J. Kusner 17 41 0 12 Jul 2023
Which Spurious Correlations Impact Reasoning in NLI Models? A Visual Interactive Diagnosis through Data-Constrained Counterfactuals Robin Shing Moon Chan Afra Amini Mennatallah El-Assady LRM AAML 29 2 0 21 Jun 2023
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena Lianmin Zheng Wei-Lin Chiang Ying Sheng Siyuan Zhuang Zhanghao Wu ... Dacheng Li Eric P. Xing Haotong Zhang Joseph E. Gonzalez Ion Stoica ALM OSLM ELM 16 3,781 0 09 Jun 2023
From Adversarial Arms Race to Model-centric Evaluation: Motivating a Unified Automatic Robustness Evaluation Framework Yangyi Chen Hongcheng Gao Ganqu Cui Lifan Yuan Dehan Kong ... Longtao Huang H. Xue Zhiyuan Liu Maosong Sun Heng Ji AAML ELM 25 6 0 29 May 2023
Collaborative Development of NLP models Fereshte Khani Marco Tulio Ribeiro 25 2 0 20 May 2023
What's the Meaning of Superhuman Performance in Today's NLU? Simone Tedeschi Johan Bos T. Declerck Jan Hajic Daniel Hershcovich ... Simon Krek Steven Schockaert Rico Sennrich Ekaterina Shutova Roberto Navigli ELM LM&MA VLM ReLM LRM 34 26 0 15 May 2023
LINGO : Visually Debiasing Natural Language Instructions to Support Task Diversity Anjana Arunkumar Shubham Sharma Rakhi Agrawal Sriramakrishnan Chandrasekaran Chris Bryan 26 0 0 12 Apr 2023
Angler: Helping Machine Translation Practitioners Prioritize Model Improvements Samantha Robertson Zijie J. Wang Dominik Moritz Mary Beth Kery Fred Hohman 30 15 0 12 Apr 2023
The Vector Grounding Problem Dimitri Coelho Mollo Raphael Milliere 36 26 0 04 Apr 2023
SemEval-2023 Task 10: Explainable Detection of Online Sexism Hannah Rose Kirk Wenjie Yin Bertie Vidgen Paul Röttger 10 117 0 07 Mar 2023
Auditing large language models: a three-layered approach Jakob Mokander Jonas Schuett Hannah Rose Kirk Luciano Floridi AILaw MLAU 42 194 0 16 Feb 2023
Backdoor Learning for NLP: Recent Advances, Challenges, and Future Research Directions Marwan Omar SILM AAML 25 20 0 14 Feb 2023
Evaluation for Change Rishi Bommasani ELM 38 0 0 20 Dec 2022
Evaluating Human-Language Model Interaction Mina Lee Megha Srivastava Amelia Hardy John Thickstun Esin Durmus ... Hancheng Cao Tony Lee Rishi Bommasani Michael S. Bernstein Percy Liang LM&MA ALM 46 99 0 19 Dec 2022
Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor Or Honovich Thomas Scialom Omer Levy Timo Schick ALM 31 359 0 19 Dec 2022
CiteBench: A benchmark for Scientific Citation Text Generation Martin Funkquist Ilia Kuznetsov Yufang Hou Iryna Gurevych 15 16 0 19 Dec 2022
Graph Learning Indexer: A Contributor-Friendly and Metadata-Rich Platform for Graph Learning Benchmarks Jiaqi Ma Xingjian Zhang Hezheng Fan Jin Huang Tianyue Li Tinghong Li Yiwen Tu Chen Zhu Qiaozhu Mei 35 5 0 08 Dec 2022
Validating Large Language Models with ReLM Michael Kuchnik Virginia Smith George Amvrosiadis 21 27 0 21 Nov 2022
GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-distribution Generalization Perspective Linyi Yang Shuibai Zhang Libo Qin Yafu Li Yidong Wang Hanmeng Liu Jindong Wang Xingxu Xie Yue Zhang ELM 39 79 0 15 Nov 2022
NaturalAdversaries: Can Naturalistic Adversaries Be as Effective as Artificial Adversaries? Saadia Gabriel Hamid Palangi Yejin Choi AAML 37 1 0 08 Nov 2022
TCAB: A Large-Scale Text Classification Attack Benchmark Kalyani Asthana Zhouhang Xie Wencong You Adam Noack Jonathan Brophy Sameer Singh Daniel Lowd 24 3 0 21 Oct 2022
Data-Efficient Strategies for Expanding Hate Speech Detection into Under-Resourced Languages Paul Röttger Debora Nozza Federico Bianchi Dirk Hovy 23 10 0 20 Oct 2022
Understanding Practices, Challenges, and Opportunities for User-Engaged Algorithm Auditing in Industry Practice Wesley Hanwen Deng B. Guo Alicia DeVrio Hong Shen Motahhare Eslami Kenneth Holstein MLAU 17 58 0 07 Oct 2022
InferES : A Natural Language Inference Corpus for Spanish Featuring Negation-Based Contrastive and Adversarial Examples Venelin Kovatchev Mariona Taulé 25 4 0 06 Oct 2022
State-of-the-art generalisation research in NLP: A taxonomy and review Dieuwke Hupkes Mario Giulianelli Verna Dankers Mikel Artetxe Yanai Elazar ... Leila Khalatbari Maria Ryskina Rita Frieske Ryan Cotterell Zhijing Jin 114 93 0 06 Oct 2022