2404.13076
LLM Evaluators Recognize and Favor Their Own Generations
15 April 2024
Arjun Panickssery
Samuel R. Bowman
Shi Feng
Papers citing
"LLM Evaluators Recognize and Favor Their Own Generations"
50 / 154 papers shown
Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey
Xiaoou Liu
Tiejin Chen
Longchao Da
Chacha Chen
Zhen Lin
Hua Wei
HILM
489
40
0
20 Mar 2025
OpeNLGauge: An Explainable Metric for NLG Evaluation with Open-Weights LLMs
Ivan Kartáč
Mateusz Lango
Ondrej Dusek
ELM
364
5
0
14 Mar 2025
Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
José P. Pombal
Nuno M. Guerreiro
Ricardo Rei
André F. T. Martins
377
5
0
11 Mar 2025
Language Models Fail to Introspect About Their Knowledge of Language
Siyuan Song
Jennifer Hu
Kyle Mahowald
LRM
KELM
HILM
ELM
408
11
0
10 Mar 2025
SwiLTra-Bench: The Swiss Legal Translation Benchmark
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Joel Niklaus
Jakob Merane
Luka Nenadic
Sina Ahmadi
Yingqiang Gao
...
Matthew Guillod
Robin Mamié
Daniel Brunner
Julio Pereyra
Niko Grupen
AILaw
ELM
305
7
0
03 Mar 2025
Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
Hongyi Cai
Jie Li
Mohammad Mahdinur Rahman
Wenzhen Dong
398
0
0
26 Feb 2025
Single- vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources
Joachim De Baer
A. Seza Doğruöz
T. Demeester
Chris Develder
273
1
0
25 Feb 2025
Automatic Input Rewriting Improves Translation with Large Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Dayeon Ki
Marine Carpuat
325
2
0
23 Feb 2025
CLIPPER: Compression enables long-context synthetic data generation
Chau Minh Pham
Yapei Chang
Mohit Iyyer
SyDa
437
1
0
20 Feb 2025
RLTHF: Targeted Human Feedback for LLM Alignment
Yifei Xu
Tusher Chakraborty
Emre Kıcıman
Bibek Aryal
Eduardo Rodrigues
...
Rafael Padilha
Leonardo Nunes
Shobana Balakrishnan
Songwu Lu
Ranveer Chandra
465
4
0
19 Feb 2025
Elucidating Mechanisms of Demographic Bias in LLMs for Healthcare
Hiba Ahsan
Arnab Sen Sharma
Silvio Amir
David Bau
Byron C. Wallace
374
3
0
18 Feb 2025
AI Alignment at Your Discretion
Conference on Fairness, Accountability and Transparency (FAccT), 2025
Maarten Buyl
Hadi Khalaf
C. M. Verdun
Lucas Monteiro Paes
Caio Vieira Machado
Flavio du Pin Calmon
311
10
0
10 Feb 2025
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Hongxin Li
Jingfan Chen
Jingran Su
Yuntao Chen
Qing Li
Rundong Wang
985
8
0
04 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
Jiawei Han
Wei Wei
Wei Wang
Huan Liu
592
68
0
03 Feb 2025
Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models
Hao Li
Cor-Paul Bezemer
Ahmed E. Hassan
314
9
0
08 Jan 2025
Exploring and Controlling Diversity in LLM-Agent Conversation
Kuanchao Chu
Yi-Pei Chen
Hideki Nakayama
LLMAG
500
8
0
30 Dec 2024
LLM-based relevance assessment still can't replace human relevance assessment
International Workshop on Evaluating Information Access (EIA), 2024
Charles L. A. Clarke
Laura Dietz
ELM
197
36
0
22 Dec 2024
Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Aneta Zugecova
Dominik Macko
Ivan Srba
Robert Moro
Jakub Kopal
Katarina Marcincinova
Matus Mesarcik
414
15
0
18 Dec 2024
QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
International Conference on Computational Linguistics (COLING), 2024
Mohammad Aflah Khan
Neemesh Yadav
Sarah Masud
Md. Shad Akhtar
369
0
0
16 Dec 2024
JuStRank: Benchmarking LLM Judges for System Ranking
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Ariel Gera
Odellia Boni
Yotam Perlitz
Roy Bar-Haim
Lilach Eden
Asaf Yehudai
ALM
ELM
469
13
0
12 Dec 2024
VL-RewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Computer Vision and Pattern Recognition (CVPR), 2024
Lei Li
Y. X. Wei
Zhihui Xie
Xuqing Yang
Yifan Song
...
Tianyu Liu
Sujian Li
Bill Yuchen Lin
Dianbo Sui
Qiang Liu
VLM
CoGe
533
62
0
26 Nov 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
1.1K
287
0
25 Nov 2024
Benchmarking LLMs' Judgments with No Gold Standard
International Conference on Learning Representations (ICLR), 2024
Shengwei Xu
Yuxuan Lu
Grant Schoenebeck
Yuqing Kong
194
11
0
11 Nov 2024
Evaluating Creative Short Story Generation in Humans and Large Language Models
Mete Ismayilzada
Claire Stevenson
Lonneke van der Plas
LM&MA
LRM
530
13
0
04 Nov 2024
ProMQA: Question Answering Dataset for Multimodal Procedural Activity Understanding
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Kimihiro Hasegawa
Wiradee Imrattanatrai
Zhi-Qi Cheng
Masaki Asada
Susan Holm
Yuran Wang
Ken Fukuda
Teruko Mitamura
249
7
0
29 Oct 2024
BQA: Body Language Question Answering Dataset for Video Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Shintaro Ozaki
Kazuki Hayashi
Miyu Oba
Yusuke Sakai
Hidetaka Kamigaito
Taro Watanabe
423
3
0
17 Oct 2024
Limits to scalable evaluation at the frontier: LLM as Judge won't beat twice the data
International Conference on Learning Representations (ICLR), 2024
Florian E. Dorner
Vivian Y. Nastl
Moritz Hardt
ELM
ALM
405
22
0
17 Oct 2024
Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland
Luca Rolshoven
Vishvaksenan Rasiah
Srinanda Brügger Bose
Sarah Hostettler
Lara Burkhalter
Matthias Sturmer
Joel Niklaus
ELM
AILaw
273
4
0
17 Oct 2024
MIRAGE-Bench: Automatic Multilingual Benchmark Arena for Retrieval-Augmented Generation Systems
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Nandan Thakur
Suleman Kazi
Ge Luo
Jimmy J. Lin
Amin Ahmad
VLM
RALM
463
13
0
17 Oct 2024
LASeR: Learning to Adaptively Select Reward Models with Multi-Armed Bandits
Duy Nguyen
Archiki Prasad
Elias Stengel-Eskin
Joey Tianyi Zhou
439
5
0
02 Oct 2024
CRScore: Grounding Automated Evaluation of Code Review Comments in Code Claims and Smells
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Atharva Naik
Marcus Alenius
Daniel Fried
Carolyn Rose
326
4
0
29 Sep 2024
Direct Judgement Preference Optimization
Peifeng Wang
Austin Xu
Yilun Zhou
Caiming Xiong
Shafiq Joty
ELM
367
22
0
23 Sep 2024
From Lists to Emojis: How Format Bias Affects Model Alignment
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Xuanchang Zhang
Wei Xiong
Lichang Chen
Wanrong Zhu
Heng Huang
Tong Zhang
ALM
434
21
0
18 Sep 2024
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev
LLMAG
505
5
0
10 Sep 2024
IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering
Neural Information Processing Systems (NeurIPS), 2024
Ruosen Li
Barry Wang
Ruochen Li
Xinya Du
ELM
243
14
0
24 Aug 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM
ELM
494
60
0
23 Aug 2024
AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling
ACM Transactions on Software Engineering and Methodology (TOSEM), 2024
Yuheng Huang
Qiang Hu
Felix Juefei Xu
Lei Ma
348
7
0
07 Aug 2024
Self-Recognition in Language Models
Tim R. Davidson
Viacheslav Surkov
V. Veselovsky
Giuseppe Russo
Robert West
Çağlar Gülçehre
PILM
527
9
0
09 Jul 2024
AI-AI Bias: large language models favor communications generated by large language models
Walter Laurito
Benjamin Davis
Peli Grietzer
T. Gavenčiak
Ada Böhm
Jan Kulveit
209
5
0
09 Jul 2024
On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton
Noah Y. Siegel
János Kramár
Jonah Brown-Cohen
Samuel Albanie
...
Rishabh Agarwal
David Lindner
Yunhao Tang
Noah D. Goodman
Rohin Shah
ELM
307
62
0
05 Jul 2024
Evaluating the Ability of LLMs to Solve Semantics-Aware Process Mining Tasks
Adrian Rebmann
Fabian David Schmidt
Goran Glavaš
Han van der Aa
LRM
181
21
0
02 Jul 2024
Compare without Despair: Reliable Preference Evaluation with Generation Separability
Sayan Ghosh
Tejas Srinivasan
Swabha Swayamdipta
290
3
0
02 Jul 2024
Free-text Rationale Generation under Readability Level Control
Yi-Sheng Hsu
Nils Feldhus
Sherzod Hakimov
466
4
0
01 Jul 2024
BigCodeBench: Benchmarking Code Generation with Diverse Function Calls and Complex Instructions
Terry Yue Zhuo
Minh Chien Vu
Jenny Chim
Han Hu
Wenhao Yu
...
David Lo
Daniel Fried
Xiaoning Du
H. D. Vries
Leandro von Werra
603
371
0
22 Jun 2024
PARIKSHA: A Large-Scale Investigation of Human-LLM Evaluator Agreement on Multilingual and Multi-Cultural Data
Ishaan Watts
Varun Gumma
Aditya Yadavalli
Vivek Seshadri
Manohar Swaminathan
Sunayana Sitaram
ELM
275
24
0
21 Jun 2024
Chumor 1.0: A Truly Funny and Challenging Chinese Humor Understanding Dataset from Ruo Zhi Ba
Ruiqi He
Yushu He
Longju Bai
Jiarui Liu
Zhenjie Sun
Zenghao Tang
He Wang
Hanchen Xia
Naihao Deng
166
3
0
18 Jun 2024
DCA-Bench: A Benchmark for Dataset Curation Agents
Benhao Huang
Yingzhuo Yu
Jin Huang
Xingjian Zhang
Jiaqi Ma
363
3
0
11 Jun 2024
CRAG -- Comprehensive RAG Benchmark
Neural Information Processing Systems (NeurIPS), 2024
Xiao Yang
Kai Sun
Hao Xin
Yushi Sun
Nikita Bhalla
...
Nirav Shah
Rakesh Wanga
Anuj Kumar
Xin Luna Dong
327
79
0
07 Jun 2024
Self-Control of LLM Behaviors by Compressing Suffix Gradient into Prefix Controller
Min Cai
Yuchen Zhang
Shichang Zhang
Fan Yin
Difan Zou
Yisong Yue
Ziniu Hu
314
3
0
04 Jun 2024
Inverse Constitutional AI: Compressing Preferences into Principles
Arduin Findeis
Timo Kaufmann
Eyke Hüllermeier
Samuel Albanie
Robert Mullins
SyDa
287
23
0
02 Jun 2024