Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2404.13076
Cited By
LLM Evaluators Recognize and Favor Their Own Generations
15 April 2024
Arjun Panickssery
Samuel R. Bowman
Shi Feng
Re-assign community
ArXiv
PDF
HTML
Papers citing
"LLM Evaluators Recognize and Favor Their Own Generations"
50 / 110 papers shown
Title
Efficient Aspect-Based Summarization of Climate Change Reports with Small Language Models
Iacopo Ghinassi
Leonardo Catalano
Tommaso Colella
54
1
0
21 Nov 2024
Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations
Chaitanya Malaviya
Joseph Chee Chang
Dan Roth
Mohit Iyyer
Mark Yatskar
Kyle Lo
ELM
40
4
0
11 Nov 2024
Benchmarking LLMs' Judgments with No Gold Standard
Shengwei Xu
Yuxuan Lu
Grant Schoenebeck
Yuqing Kong
29
1
0
11 Nov 2024
Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada
Claire Stevenson
Lonneke van der Plas
LM&MA
LRM
30
3
0
04 Nov 2024
DISCERN: Decoding Systematic Errors in Natural Language for Text Classifiers
Rakesh R Menon
Shashank Srivastava
21
0
0
29 Oct 2024
ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
Kimihiro Hasegawa
Wiradee Imrattanatrai
Zhi-Qi Cheng
Masaki Asada
Susan Holm
Yuran Wang
Ken Fukuda
Teruko Mitamura
16
0
0
29 Oct 2024
LLMs are Biased Evaluators But Not Biased for Retrieval Augmented Generation
Yen-Shan Chen
Jing Jin
Peng-Ting Kuo
Chao-Wei Huang
Yun-Nung (Vivian) Chen
15
0
0
28 Oct 2024
Looking Inward: Language Models Can Learn About Themselves by Introspection
Felix J Binder
James Chua
Tomek Korbak
Henry Sleight
John Hughes
Robert Long
Ethan Perez
Miles Turpin
Owain Evans
KELM
AIFin
LRM
30
8
0
17 Oct 2024
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
Nandan Thakur
Suleman Kazi
Ge Luo
Jimmy J. Lin
Amin Ahmad
VLM
RALM
21
6
0
17 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELM
ALM
33
5
0
17 Oct 2024
ReIFE: Re-evaluating Instruction-Following Evaluation
Yixin Liu
Kejian Shi
Alexander R. Fabbri
Yilun Zhao
Peifeng Wang
Chien-Sheng Wu
Shafiq Joty
Arman Cohan
14
6
0
09 Oct 2024
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
Thomas Palmeira Ferraz
Kartik Mehta
Yu-Hsiang Lin
Haw-Shiuan Chang
Shereen Oraby
Sijia Liu
Vivek Subramanian
Tagyoung Chung
Mohit Bansal
Nanyun Peng
38
7
0
09 Oct 2024
Generating bilingual example sentences with large language models as lexicography assistants
Raphael Merx
Ekaterina Vylomova
Kemal Kurniawan
18
2
0
04 Oct 2024
A Critical Look at Meta-evaluating Summarisation Evaluation Metrics
Xiang Dai
Sarvnaz Karimi
Biaoyan Fang
15
0
0
29 Sep 2024
CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells
Atharva Naik
Marcus Alenius
Daniel Fried
Carolyn Rose
19
0
0
29 Sep 2024
Training Language Models to Win Debates with Self-Play Improves Judge Accuracy
Samuel Arnesen
David Rein
Julian Michael
ELM
17
0
0
25 Sep 2024
Direct Judgement Preference Optimization
Peifeng Wang
Austin Xu
Yilun Zhou
Caiming Xiong
Shafiq Joty
ELM
37
11
0
23 Sep 2024
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev
LLMAG
48
3
0
10 Sep 2024
Report Cards: Qualitative Evaluation of Language Models Using Natural Language Summaries
Blair Yang
Fuyang Cui
Keiran Paster
Jimmy Ba
Pashootan Vaezipoor
Silviu Pitis
Michael Ruogu Zhang
13
1
0
01 Sep 2024
Non-instructional Fine-tuning: Enabling Instruction-Following Capabilities in Pre-trained Language Models without Instruction-Following Data
Juncheng Xie
Shensian Syu
Hung-yi Lee
ALM
15
1
0
27 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
Mei Han
ALM
ELM
47
22
0
23 Aug 2024
Summarizing long regulatory documents with a multi-step pipeline
Mika Sie
Ruby Beek
Michiel Bots
S. Brinkkemper
Albert Gatt
AILaw
ELM
19
1
0
19 Aug 2024
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss
Christian Poelitz
Ian Drosos
Vu Le
Nick McKenna
Carina Negreanu
Chris Parnin
Advait Sarkar
ELM
ALM
19
3
0
16 Aug 2024
ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models
Faris Hijazi
Somayah Alharbi
Abdulaziz AlHussein
Harethah Shairah
Reem Alzahrani
Hebah Alshamlan
Omar Knio
G. Turkiyyah
AILaw
ELM
33
2
0
15 Aug 2024
Active Testing of Large Language Model via Multi-Stage Sampling
Yuheng Huang
Jiayang Song
Qiang Hu
Felix Juefei-Xu
Lei Ma
16
2
0
07 Aug 2024
Trust or Escalate: LLM Judges with Provable Guarantees for Human Agreement
Jaehun Jung
Faeze Brahman
Yejin Choi
ALM
32
11
0
25 Jul 2024
OMoS-QA: A Dataset for Cross-Lingual Extractive Question Answering in a German Migration Context
Steffen Kleinle
Jakob Prange
Annemarie Friedrich
RALM
31
0
0
22 Jul 2024
Understanding Reference Policies in Direct Preference Optimization
Yixin Liu
Pengfei Liu
Arman Cohan
26
7
0
18 Jul 2024
Weak-to-Strong Reasoning
Yuqing Yang
Yan Ma
Pengfei Liu
LRM
25
13
0
18 Jul 2024
Self-Recognition in Language Models
Tim R. Davidson
Viacheslav Surkov
V. Veselovsky
Giuseppe Russo
Robert West
Çağlar Gülçehre
PILM
216
2
0
09 Jul 2024
AI AI Bias: Large Language Models Favor Their Own Generated Content
Walter Laurito
Benjamin Davis
Peli Grietzer
T. Gavenčiak
Ada Böhm
Jan Kulveit
16
3
0
09 Jul 2024
On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton
Noah Y. Siegel
János Kramár
Jonah Brown-Cohen
Samuel Albanie
...
Rishabh Agarwal
David Lindner
Yunhao Tang
Noah D. Goodman
Rohin Shah
ELM
29
28
0
05 Jul 2024
Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks
Adrian Rebmann
Fabian David Schmidt
Goran Glavaš
Han van der Aa
LRM
17
5
0
02 Jul 2024
Compare without Despair: Reliable Preference Evaluation with Generation Separability
Sayan Ghosh
Tejas Srinivasan
Swabha Swayamdipta
24
2
0
02 Jul 2024
LLM See, LLM Do: Guiding Data Generation to Target Non-Differentiable Objectives
Luísa Shimabucoro
Sebastian Ruder
Julia Kreutzer
Marzieh Fadaee
Sara Hooker
SyDa
21
4
0
01 Jul 2024
AutoDetect: Towards a Unified Framework for Automated Weakness Detection in Large Language Models
Jiale Cheng
Yida Lu
Xiaotao Gu
Pei Ke
Xiao-Yang Liu
Yuxiao Dong
Hongning Wang
Jie Tang
Minlie Huang
28
4
0
24 Jun 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
65
125
0
22 Jun 2024
A SMART Mnemonic Sounds like "Glue Tonic": Mixing LLMs with Student Feedback to Make Mnemonic Learning Stick
Nishant Balepur
Matthew Shu
Alexander Hoyle
Alison Robey
Shi Feng
Seraphina Goldfarb-Tarrant
Jordan Boyd-Graber
38
0
0
21 Jun 2024
PARIKSHA : A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts
Varun Gumma
Aditya Yadavalli
Vivek Seshadri
Manohar Swaminathan
Sunayana Sitaram
ELM
30
8
0
21 Jun 2024
Adversaries Can Misuse Combinations of Safe Models
Erik Jones
Anca Dragan
Jacob Steinhardt
40
6
0
20 Jun 2024
PostMark: A Robust Blackbox Watermark for Large Language Models
Yapei Chang
Kalpesh Krishna
Amir Houmansadr
John Wieting
Mohit Iyyer
16
5
0
20 Jun 2024
Finding Blind Spots in Evaluator LLMs with Interpretable Checklists
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Sshubam Verma
Mitesh Khapra
34
11
0
19 Jun 2024
Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba
Ruiqi He
Yushu He
Longju Bai
Jiarui Liu
Zhenjie Sun
Zenghao Tang
He Wang
Hanchen Xia
Naihao Deng
25
0
0
18 Jun 2024
Fairer Preferences Elicit Improved Human-Aligned Large Language Model Judgments
Han Zhou
Xingchen Wan
Yinhong Liu
Nigel Collier
Ivan Vulić
Anna Korhonen
ALM
21
9
0
17 Jun 2024
CRAG -- Comprehensive RAG Benchmark
Xiao Yang
Kai Sun
Hao Xin
Yushi Sun
Nikita Bhalla
...
Nirav Shah
Rakesh Wanga
Anuj Kumar
Wen-tau Yih
Xin Luna Dong
18
22
0
07 Jun 2024
UltraMedical: Building Specialized Generalists in Biomedicine
Kaiyan Zhang
Sihang Zeng
Ermo Hua
Ning Ding
Zhang-Ren Chen
...
Xuekai Zhu
Xingtai Lv
Hu Jinfang
Zhiyuan Liu
Bowen Zhou
LM&MA
39
19
0
06 Jun 2024
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai
Yuchen Zhang
Shichang Zhang
Fan Yin
Difan Zou
Yisong Yue
Ziniu Hu
21
0
0
04 Jun 2024
Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis
Timo Kaufmann
Eyke Hüllermeier
Samuel Albanie
Robert Mullins
SyDa
41
8
0
02 Jun 2024
LACIE: Listener-Aware Finetuning for Confidence Calibration in Large Language Models
Elias Stengel-Eskin
Peter Hase
Mohit Bansal
27
4
0
31 May 2024
Can Automatic Metrics Assess High-Quality Translations?
Sweta Agrawal
António Farinhas
Ricardo Rei
André F. T. Martins
16
8
0
28 May 2024
Previous
1
2
3
Next