arXiv:2406.18403
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
26 June 2024
A. Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, E. Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, A. Testoni
[ALM, ELM]
Papers citing "LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks" (47 papers)
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen (08 May 2025) [LLMAG]

To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay
Soumik Dey, Hansi Wu, Binbin Li (07 May 2025)

Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali (06 May 2025) [LLMSV]

Process Reward Models That Think
Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang (23 Apr 2025) [OffRL, ALM, LRM]

What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich, Anyi Wang, Raoyuan Zhao, Florian Eichin, Barbara Plank (22 Apr 2025)

An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Andrea Piergentili, Beatrice Savoldi, Matteo Negri, L. Bentivogli (16 Apr 2025) [ELM]

Large Language Models as Span Annotators
Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondrej Dusek, Simone Balloccu (11 Apr 2025) [ALM]

HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li, Hanchen Li, Chenhao Tan (09 Apr 2025) [ALM, ELM]

Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications
Hongliu Cao, Ilias Driouich, Robin Singh, Eoin Thomas (01 Apr 2025) [ELM]

Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories
Y. Zhang, Qimeng Liu, Qiuchi Li, Peng Zhang, Jing Qin (28 Mar 2025) [AAML]

SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu (24 Mar 2025) [ALM]

PinLanding: Content-First Keyword Landing Page Generation via Multi-Modal AI for Web-Scale Discovery
Faye Zhang, Jasmine Wan, Qianyu Cheng, Jinfeng Rao (01 Mar 2025)

Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?
Bo Wang, Yiqiao Li, Jianlong Zhou, Fang Chen (28 Feb 2025) [XAI, ELM]

All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark
Davide Testa, Giovanni Bonetta, Raffaella Bernardi, Alessandro Bondielli, Alessandro Lenci, Alessio Miaschi, Lucia Passaro, Bernardo Magnini (24 Feb 2025) [VGen, LRM]

Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer, Punit Singh Koura, Binh Tang, R. Subramanian, Aaditya K. Singh, ..., Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang (24 Feb 2025) [ALM]

Natural Language Generation from Visual Sequences: Challenges and Future Directions
Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle (18 Feb 2025) [EGVM]

Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models
Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Bykov Mikhail, E. Burnaev (18 Feb 2025) [MQ]

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, L. Rokach, Seffi Cohen (11 Feb 2025) [ELM]

Aligning Black-box Language Models with Human Judgments
Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy (07 Feb 2025) [ALM]

CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
Zongxi Li, Y. Li, Haoran Xie, S. J. Qin (03 Feb 2025)

Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge
Aparna Elangovan, Jongwoo Ko, Lei Xu, Mahsa Elyasi, Ling Liu, S. Bodapati, Dan Roth (28 Jan 2025)

Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson (28 Jan 2025) [ALM]

ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Thibaut Thonet, Jos Rozen, Laurent Besacier (20 Jan 2025) [RALM]

JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai (12 Dec 2024) [ALM, ELM]

Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu, Zhengxing Chen, Aston Zhang, L Tan, Chenguang Zhu, ..., Suchin Gururangan, Chao-Yue Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou (25 Nov 2024) [LRM, ALM]

FactLens: Benchmarking Fine-Grained Fact Verification
Kushan Mitra, Dan Zhang, Sajjadur Rahman, Estevam R. Hruschka (08 Nov 2024) [HILM]

Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada, Claire Stevenson, Lonneke van der Plas (04 Nov 2024) [LM&MA, LRM]

SleepCoT: A Lightweight Personalized Sleep Health Model via Chain-of-Thought Distillation
Huimin Zheng, Xiaofeng Xing, Xiangmin Xu (22 Oct 2024) [VLM]

Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless, Nikolas Vitsakis, Zeerak Talat, James Garforth, Bjorn Ross, Arno Onken, Atoosa Kasirzadeh, Alexandra Birch (17 Oct 2024)

Red and blue language: Word choices in the Trump & Harris 2024 presidential debate
Philipp Wicke, M. Bolognesi (17 Oct 2024)

Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt (17 Oct 2024) [ELM, ALM]

Black-box Uncertainty Quantification Method for LLM-as-a-Judge
Nico Wagner, Michael Desmond, Rahul Nair, Zahra Ashktorab, Elizabeth M. Daly, Qian Pan, Martin Santillan Cooper, James M. Johnson, Werner Geyer (15 Oct 2024) [ELM, UQCV]

In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions
Alireza Shamshiri, Kyeong Rok Ryu, June Young Park (15 Oct 2024) [LLMAG]

Agent-as-a-Judge: Evaluate Agents with Agents
Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, ..., Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber (14 Oct 2024) [ELM]

Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework
Zhengwei Yang, Yuke Li, Qiang Sun, Basura Fernando, Heng-Chiao Huang, Zheng Wang (14 Oct 2024)

Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences
Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, Werner Geyer (01 Oct 2024) [ELM]

A Survey on Complex Tasks for Goal-Directed Interactive Agents
Mareike Hartmann, Alexander Koller (27 Sep 2024) [LM&Ro, LLMAG]

From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan, D. Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth (06 Sep 2024) [ELM]

Summarizing long regulatory documents with a multi-step pipeline
Mika Sie, Ruby Beek, Michiel Bots, S. Brinkkemper, Albert Gatt (19 Aug 2024) [AILaw, ELM]

Can LLMs Replace Manual Annotation of Software Engineering Artifacts?
Toufique Ahmed, Premkumar Devanbu, Christoph Treude, Michael Pradel (10 Aug 2024)

Self-Taught Evaluators
Tianlu Wang, Ilia Kulikov, O. Yu. Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li (05 Aug 2024) [ALM, LRM]

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes (18 Jun 2024) [ELM, ALM]

The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation
Maja Pavlovic, Massimo Poesio (02 May 2024)

OLMo: Accelerating the Science of Language Models
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Michael Kinney, ..., Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna Hajishirzi (01 Feb 2024) [OSLM]

Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang, Hung-yi Lee (03 May 2023) [ALM, LM&MA]

e-SNLI: Natural Language Inference with Natural Language Explanations
Oana-Maria Camburu, Tim Rocktaschel, Thomas Lukasiewicz, Phil Blunsom (04 Dec 2018) [LRM]

Teaching Machines to Read and Comprehend
Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, L. Espeholt, W. Kay, Mustafa Suleyman, Phil Blunsom (10 Jun 2015)