arXiv:2406.18403
LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks
26 June 2024
A. Bavaresco, Raffaella Bernardi, Leonardo Bertolazzi, Desmond Elliott, Raquel Fernández, Albert Gatt, E. Ghaleb, Mario Giulianelli, Michael Hanna, Alexander Koller, André F. T. Martins, Philipp Mondorf, Vera Neplenbroek, Sandro Pezzelle, Barbara Plank, David Schlangen, Alessandro Suglia, Aditya K Surikuchi, Ece Takmaz, A. Testoni
[ALM, ELM]
Papers citing "LLMs instead of Human Judges? A Large Scale Empirical Study across 20 NLP Evaluation Tasks" (47 papers)
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen (08 May 2025) [LLMAG]

To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay
Soumik Dey, Hansi Wu, Binbin Li (07 May 2025)

Patterns and Mechanisms of Contrastive Activation Engineering
Yixiong Hao, Ayush Panda, Stepan Shabalin, Sheikh Abdur Raheem Ali (06 May 2025) [LLMSV]

Process Reward Models That Think
Muhammad Khalifa, Rishabh Agarwal, Lajanugen Logeswaran, Jaekyeom Kim, Hao Peng, Moontae Lee, Honglak Lee, Lu Wang (23 Apr 2025) [OffRL, ALM, LRM]

What's the Difference? Supporting Users in Identifying the Effects of Prompt and Model Changes Through Token Patterns
Michael A. Hedderich, Anyi Wang, Raoyuan Zhao, Florian Eichin, Barbara Plank (22 Apr 2025)

An LLM-as-a-judge Approach for Scalable Gender-Neutral Translation Evaluation
Andrea Piergentili, Beatrice Savoldi, Matteo Negri, L. Bentivogli (16 Apr 2025) [ELM]

Large Language Models as Span Annotators
Zdeněk Kasner, Vilém Zouhar, Patrícia Schmidtová, Ivan Kartáč, Kristýna Onderková, Ondřej Plátek, Dimitra Gkatzia, Saad Mahamood, Ondrej Dusek, Simone Balloccu (11 Apr 2025) [ALM]

HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li, Hanchen Li, Chenhao Tan (09 Apr 2025) [ALM, ELM]

Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications
Hongliu Cao, Ilias Driouich, Robin Singh, Eoin Thomas (01 Apr 2025) [ELM]

Beyond Single-Sentence Prompts: Upgrading Value Alignment Benchmarks with Dialogues and Stories
Y. Zhang, Qimeng Liu, Qiuchi Li, Peng Zhang, Jing Qin (28 Mar 2025) [AAML]

SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma, Dora Zhao, Xinran Zhao, Chenglei Si, Chenyang Yang, Ryan Louie, Ehud Reiter, Diyi Yang, Tongshuang Wu (24 Mar 2025) [ALM]

PinLanding: Content-First Keyword Landing Page Generation via Multi-Modal AI for Web-Scale Discovery
Faye Zhang, Jasmine Wan, Qianyu Cheng, Jinfeng Rao (01 Mar 2025)

Can LLM Assist in the Evaluation of the Quality of Machine Learning Explanations?
Bo Wang, Yiqiao Li, Jianlong Zhou, Fang Chen (28 Feb 2025) [XAI, ELM]

All-in-one: Understanding and Generation in Multimodal Reasoning with the MAIA Benchmark
Davide Testa, Giovanni Bonetta, Raffaella Bernardi, Alessandro Bondielli, Alessandro Lenci, Alessio Miaschi, Lucia Passaro, Bernardo Magnini (24 Feb 2025) [VGen, LRM]

Correlating and Predicting Human Evaluations of Language Models from Natural Language Processing Benchmarks
Rylan Schaeffer, Punit Singh Koura, Binh Tang, R. Subramanian, Aaditya K. Singh, ..., Vedanuj Goswami, Sergey Edunov, Dieuwke Hupkes, Sanmi Koyejo, Sharan Narang (24 Feb 2025) [ALM]

Natural Language Generation from Visual Sequences: Challenges and Future Directions
Aditya K Surikuchi, Raquel Fernández, Sandro Pezzelle (18 Feb 2025) [EGVM]

Investigating the Impact of Quantization Methods on the Safety and Reliability of Large Language Models
Artyom Kharinaev, Viktor Moskvoretskii, Egor Shvetsov, Kseniia Studenikina, Bykov Mikhail, E. Burnaev (18 Feb 2025) [MQ]

Forget What You Know about LLMs Evaluations - LLMs are Like a Chameleon
Nurit Cohen-Inger, Yehonatan Elisha, Bracha Shapira, L. Rokach, Seffi Cohen (11 Feb 2025) [ELM]

Aligning Black-box Language Models with Human Judgments
Gerrit J. J. van den Burg, Gen Suzuki, Wei Liu, Murat Sensoy (07 Feb 2025) [ALM]

CondAmbigQA: A Benchmark and Dataset for Conditional Ambiguous Question Answering
Zongxi Li, Y. Li, Haoran Xie, S. J. Qin (03 Feb 2025)

Beyond correlation: The Impact of Human Uncertainty in Measuring the Effectiveness of Automatic Evaluation and LLM-as-a-Judge
Aparna Elangovan, Jongwoo Ko, Lei Xu, Mahsa Elyasi, Ling Liu, S. Bodapati, Dan Roth (28 Jan 2025)

Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
Benjamin Feuer, Micah Goldblum, Teresa Datta, Sanjana Nambiar, Raz Besaleli, Samuel Dooley, Max Cembalest, John P. Dickerson (28 Jan 2025) [ALM]

ELITR-Bench: A Meeting Assistant Benchmark for Long-Context Language Models
Thibaut Thonet, Jos Rozen, Laurent Besacier (20 Jan 2025) [RALM]

JuStRank: Benchmarking LLM Judges for System Ranking
Ariel Gera, Odellia Boni, Yotam Perlitz, Roy Bar-Haim, Lilach Eden, Asaf Yehudai (12 Dec 2024) [ALM, ELM]

Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu, Zhengxing Chen, Aston Zhang, L Tan, Chenguang Zhu, ..., Suchin Gururangan, Chao-Yue Zhang, Melanie Kambadur, Dhruv Mahajan, Rui Hou (25 Nov 2024) [LRM, ALM]

FactLens: Benchmarking Fine-Grained Fact Verification
Kushan Mitra, Dan Zhang, Sajjadur Rahman, Estevam R. Hruschka (08 Nov 2024) [HILM]

Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada, Claire Stevenson, Lonneke van der Plas (04 Nov 2024) [LM&MA, LRM]

SleepCoT: A Lightweight Personalized Sleep Health Model via Chain-of-Thought Distillation
Huimin Zheng, Xiaofeng Xing, Xiangmin Xu (22 Oct 2024) [VLM]

Ethics Whitepaper: Whitepaper on Ethical Research into Large Language Models
Eddie L. Ungless, Nikolas Vitsakis, Zeerak Talat, James Garforth, Bjorn Ross, Arno Onken, Atoosa Kasirzadeh, Alexandra Birch (17 Oct 2024)

Red and blue language: Word choices in the Trump & Harris 2024 presidential debate
Philipp Wicke, M. Bolognesi (17 Oct 2024)

Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner, Vivian Y. Nastl, Moritz Hardt (17 Oct 2024) [ELM, ALM]

Black-box Uncertainty Quantification Method for LLM-as-a-Judge
Nico Wagner, Michael Desmond, Rahul Nair, Zahra Ashktorab, Elizabeth M. Daly, Qian Pan, Martin Santillan Cooper, James M. Johnson, Werner Geyer (15 Oct 2024) [ELM, UQCV]

In-Context Learning for Long-Context Sentiment Analysis on Infrastructure Project Opinions
Alireza Shamshiri, Kyeong Rok Ryu, June Young Park (15 Oct 2024) [LLMAG]

Agent-as-a-Judge: Evaluate Agents with Agents
Mingchen Zhuge, Changsheng Zhao, Dylan R. Ashley, Wenyi Wang, Dmitrii Khizbullin, ..., Raghuraman Krishnamoorthi, Yuandong Tian, Yangyang Shi, Vikas Chandra, Jürgen Schmidhuber (14 Oct 2024) [ELM]

Cross-Modal Few-Shot Learning: a Generative Transfer Learning Framework
Zhengwei Yang, Yuke Li, Qiang Sun, Basura Fernando, Heng-Chiao Huang, Zheng Wang (14 Oct 2024)

Aligning Human and LLM Judgments: Insights from EvalAssist on Task-Specific Evaluations and AI-assisted Assessment Strategy Preferences
Zahra Ashktorab, Michael Desmond, Qian Pan, James M. Johnson, Martin Santillan Cooper, Elizabeth M. Daly, Rahul Nair, Tejaswini Pedapati, Swapnaja Achintalwar, Werner Geyer (01 Oct 2024) [ELM]

A Survey on Complex Tasks for Goal-Directed Interactive Agents
Mareike Hartmann, Alexander Koller (27 Sep 2024) [LM&Ro, LLMAG]

From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan, D. Zhu, Matthias Aßenmacher, Xiaoyu Shen, Benjamin Roth (06 Sep 2024) [ELM]

Summarizing long regulatory documents with a multi-step pipeline
Mika Sie, Ruby Beek, Michiel Bots, S. Brinkkemper, Albert Gatt (19 Aug 2024) [AILaw, ELM]

Can LLMs Replace Manual Annotation of Software Engineering Artifacts?
Toufique Ahmed, Premkumar Devanbu, Christoph Treude, Michael Pradel (10 Aug 2024)

Self-Taught Evaluators
Tianlu Wang, Ilia Kulikov, O. Yu. Golovneva, Ping Yu, Weizhe Yuan, Jane Dwivedi-Yu, Richard Yuanzhe Pang, Maryam Fazel-Zarandi, Jason Weston, Xian Li (05 Aug 2024) [ALM, LRM]

Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
Aman Singh Thakur, Kartik Choudhary, Venkat Srinik Ramayapally, Sankaran Vaidyanathan, Dieuwke Hupkes (18 Jun 2024) [ELM, ALM]

The Effectiveness of LLMs as Annotators: A Comparative Overview and Empirical Analysis of Direct Representation
Maja Pavlovic, Massimo Poesio (02 May 2024)

OLMo: Accelerating the Science of Language Models
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Michael Kinney, ..., Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, Hanna Hajishirzi (01 Feb 2024) [OSLM]

Can Large Language Models Be an Alternative to Human Evaluations?
Cheng-Han Chiang, Hung-yi Lee (03 May 2023) [ALM, LM&MA]

e-SNLI: Natural Language Inference with Natural Language Explanations
Oana-Maria Camburu, Tim Rocktaschel, Thomas Lukasiewicz, Phil Blunsom (04 Dec 2018) [LRM]

Teaching Machines to Read and Comprehend
Karl Moritz Hermann, Tomás Kociský, Edward Grefenstette, L. Espeholt, W. Kay, Mustafa Suleyman, Phil Blunsom (10 Jun 2015)