2404.13076
LLM Evaluators Recognize and Favor Their Own Generations
15 April 2024
Arjun Panickssery
Samuel R. Bowman
Shi Feng
Papers citing "LLM Evaluators Recognize and Favor Their Own Generations"
50 / 110 papers shown
Frame In, Frame Out: Do LLMs Generate More Biased News Headlines than Humans?
Valeria Pastorino
N. Moosavi
33
0
0
08 May 2025
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li
Daniel Khashabi
50
0
0
05 May 2025
LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
Joy Lim Jia Yin
Daniel Zhang-Li
Jifan Yu
H. Li
Shangqing Tu
...
Zhiyuan Liu
Huiqin Liu
Lei Hou
Juanzi Li
Bin Xu
17
0
0
04 May 2025
TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments
Sichang Tu
Abigail Powers
S. Doogan
Jinho D. Choi
31
0
0
30 Apr 2025
On the Potential of Large Language Models to Solve Semantics-Aware Process Mining Tasks
Adrian Rebmann
Fabian David Schmidt
Goran Glavaš
Han van der Aa
LRM
21
0
0
29 Apr 2025
Towards Automated Scoping of AI for Social Good Projects
Jacob Emmerson
Rayid Ghani
Zheyuan Ryan Shi
41
0
0
28 Apr 2025
FinNLI: Novel Dataset for Multi-Genre Financial Natural Language Inference Benchmarking
Jabez Magomere
Elena Kochkina
Samuel Mensah
Simerjot Kaur
Charese Smiley
15
0
0
22 Apr 2025
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents
Sid Black
Asa Cooper Stickland
Jake Pencharz
Oliver Sourbut
Michael Schmatz
Jay Bailey
Ollie Matthews
Ben Millwood
Alex Remedios
Alan Cooney
ELM
52
0
0
21 Apr 2025
LoRe: Personalizing LLMs via Low-Rank Reward Modeling
Avinandan Bose
Zhihan Xiong
Yuejie Chi
Simon S. Du
Lin Xiao
Maryam Fazel
26
0
0
20 Apr 2025
Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
Regan Bolton
Mohammadreza Sheikhfathollahi
Simon Parkinson
Dan Basher
Howard Parkinson
24
0
0
18 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
29
0
0
16 Apr 2025
Cancer-Myth: Evaluating AI Chatbot on Patient Questions with False Presuppositions
Wang Zhu
Tianqi Chen
Ching Ying Lin
Jade Law
Mazen Jizzini
Jorge J. Nieva
Ruishan Liu
Robin Jia
20
0
0
15 Apr 2025
Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification
Joseph Liu
Yoonsoo Nam
Xinyue Cui
Swabha Swayamdipta
49
0
0
13 Apr 2025
AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
Tuhin Chakrabarty
Philippe Laban
C. Wu
32
1
0
10 Apr 2025
HalluciNot: Hallucination Detection Through Context and Common Knowledge Verification
Bibek Paudel
Alexander Lyzhov
Preetam Joshi
Puneet Anand
HILM
41
0
0
09 Apr 2025
Societal Impacts Research Requires Benchmarks for Creative Composition Tasks
Judy Hanwen Shen
Carlos Guestrin
31
0
0
09 Apr 2025
Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles
Zihao Xu
Junchen Ding
Yiling Lou
Kun Zhang
Dong Gong
Yuekang Li
ELM
LRM
26
0
0
09 Apr 2025
FinGrAct: A Framework for FINe-GRrained Evaluation of ACTionability in Explainable Automatic Fact-Checking
Islam Eldifrawi
Shengrui Wang
Amine Trabelsi
24
0
0
07 Apr 2025
Cognitive Debiasing Large Language Models for Decision-Making
Yougang Lyu
Shijie Ren
Yue Feng
Zihan Wang
Z. Chen
Z. Z. Ren
Maarten de Rijke
31
0
0
05 Apr 2025
Verification of Autonomous Neural Car Control with KeYmaera X
Enguerrand Prebet
Samuel Teuber
André Platzer
29
0
0
04 Apr 2025
Do LLM Evaluators Prefer Themselves for a Reason?
Wei-Lin Chen
Zhepei Wei
Xinyu Zhu
Shi Feng
Yu Meng
ELM
LRM
42
0
0
04 Apr 2025
An Illusion of Progress? Assessing the Current State of Web Agents
Tianci Xue
Weijian Qi
Tianneng Shi
Chan Hee Song
Boyu Gou
D. Song
Huan Sun
Yu Su
LLMAG
ELM
79
4
1
02 Apr 2025
KOFFVQA: An Objectively Evaluated Free-form VQA Benchmark for Large Vision-Language Models in the Korean Language
Yoonshik Kim
Jaeyoon Jung
35
0
0
31 Mar 2025
Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus
Claas Beger
Carl-Leander Henneking
32
0
0
29 Mar 2025
Evaluating book summaries from internal knowledge in Large Language Models: a cross-model and semantic consistency approach
Javier Coronado-Blázquez
HILM
ELM
59
0
0
27 Mar 2025
Writing as a testbed for open ended agents
Sian Gooding
Lucia Lopez-Rivilla
Edward Grefenstette
LLMAG
78
1
0
25 Mar 2025
4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding
Wenxuan Zhu
Bing Li
Cheng Zheng
Jinjie Mai
Jun-Cheng Chen
...
Abdullah Hamdi
Sara Rojas Martinez
Chia-Wen Lin
Mohamed Elhoseiny
Bernard Ghanem
VLM
48
0
0
22 Mar 2025
Enhancing Product Search Interfaces with Sketch-Guided Diffusion and Language Agents
Edward Sun
DiffM
23
0
0
21 Mar 2025
Uncertainty Quantification and Confidence Calibration in Large Language Models: A Survey
Xiaoou Liu
Tiejin Chen
Longchao Da
Chacha Chen
Zhen Lin
Hua Wei
HILM
59
3
0
20 Mar 2025
Safety Aware Task Planning via Large Language Models in Robotics
A. Khan
Michael Andrev
Muhammad Ali Murtaza
Sergio Aguilera
Rui Zhang
Jie Ding
Seth Hutchinson
Ali Anwar
LLMAG
45
3
0
19 Mar 2025
Exploiting Instruction-Following Retrievers for Malicious Information Retrieval
Parishad BehnamGhader
Nicholas Meade
Siva Reddy
53
0
0
11 Mar 2025
Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
José P. Pombal
Nuno M. Guerreiro
Ricardo Rei
André F. T. Martins
55
0
0
11 Mar 2025
Language Models Fail to Introspect About Their Knowledge of Language
Siyuan Song
Jennifer Hu
Kyle Mahowald
LRM
KELM
HILM
ELM
66
2
0
10 Mar 2025
SwiLTra-Bench: The Swiss Legal Translation Benchmark
Joel Niklaus
Jakob Merane
Luka Nenadic
Sina Ahmadi
Yingqiang Gao
...
Matthew Guillod
Robin Mamié
Daniel Brunner
Julio Pereyra
Niko Grupen
AILaw
ELM
72
0
0
03 Mar 2025
À la recherche du sens perdu: your favourite LLM might have more to say than you can understand
K. O. T. Erziev
28
0
0
28 Feb 2025
Low-Confidence Gold: Refining Low-Confidence Samples for Efficient Instruction Tuning
Hongyi Cal
Jie Li
Wenzhen Dong
59
0
0
26 Feb 2025
Single- vs. Dual-Prompt Dialogue Generation with LLMs for Job Interviews in Human Resources
Joachim De Baer
A. Seza Doğruöz
T. Demeester
Chris Develder
33
0
0
25 Feb 2025
RLTHF: Targeted Human Feedback for LLM Alignment
Yifei Xu
Tusher Chakraborty
Emre Kıcıman
Bibek Aryal
Eduardo Rodrigues
...
Rafael Padilha
Leonardo Nunes
Shobana Balakrishnan
Songwu Lu
Ranveer Chandra
91
1
0
24 Feb 2025
Automatic Input Rewriting Improves Translation with Large Language Models
Dayeon Ki
Marine Carpuat
36
0
0
23 Feb 2025
CLIPPER: Compression enables long-context synthetic data generation
Chau Minh Pham
Yapei Chang
Mohit Iyyer
SyDa
70
1
0
21 Feb 2025
AI Alignment at Your Discretion
Maarten Buyl
Hadi Khalaf
C. M. Verdun
Lucas Monteiro Paes
Caio Vieira Machado
Flavio du Pin Calmon
31
0
0
10 Feb 2025
AutoGUI: Scaling GUI Grounding with Automatic Functionality Annotations from LLMs
Hongxin Li
Jingfan Chen
Jingran Su
Yuntao Chen
Qing Li
Zhaoxiang Zhang
55
0
0
04 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
J. Han
X. Zhang
Wei Wang
Huan Liu
65
11
0
03 Feb 2025
Style Outweighs Substance: Failure Modes of LLM Judges in Alignment Benchmarking
Benjamin Feuer
Micah Goldblum
Teresa Datta
Sanjana Nambiar
Raz Besaleli
Samuel Dooley
Max Cembalest
John P. Dickerson
ALM
28
6
0
28 Jan 2025
Software Engineering and Foundation Models: Insights from Industry Blogs Using a Jury of Foundation Models
Hao Li
C. Bezemer
Ahmed E. Hassan
37
1
0
08 Jan 2025
Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation
Aneta Zugecova
Dominik Macko
Ivan Srba
Robert Moro
Jakub Kopal
Katarina Marcincinova
Matus Mesarcik
67
1
0
18 Dec 2024
QUENCH: Measuring the gap between Indic and Non-Indic Contextual General Reasoning in LLMs
Mohammad Aflah Khan
Neemesh Yadav
Sarah Masud
Md. Shad Akhtar
66
0
0
16 Dec 2024
The Superalignment of Superhuman Intelligence with Large Language Models
Minlie Huang
Yingkang Wang
Shiyao Cui
Pei Ke
J. Tang
103
1
0
15 Dec 2024
VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Lei Li
Y. X. Wei
Zhihui Xie
Xuqing Yang
Yifan Song
...
Tianyu Liu
Sujian Li
Bill Yuchen Lin
Lingpeng Kong
Q. Liu
CoGe
VLM
107
24
0
26 Nov 2024
SAGEval: The frontiers of Satisfactory Agent based NLG Evaluation for reference-free open-ended text
Reshmi Ghosh
Tianyi Yao
Lizzy Chen
Sadid Hasan
Tianwei Chen
Dario Bernal
Huitian Jiao
H M Sajjad Hossain
ELM
67
0
0
25 Nov 2024