LLM Evaluators Recognize and Favor Their Own Generations

15 April 2024
Arjun Panickssery
Samuel R. Bowman
Shi Feng

Papers citing "LLM Evaluators Recognize and Favor Their Own Generations"

50 / 155 papers shown
The AI in the Mirror: LLM Self-Recognition in an Iterated Public Goods Game
Olivia Long
Carter Teplica
162
1
0
25 Aug 2025
Hermes 4 Technical Report
Ryan Teknium
Roger Jin
Jai Suphavadeeprasit
Dakota Mahan
Jeffrey Quesnelle
Joe Li
Chen Guang
Shannon Sands
Karan Malhotra
128
1
0
25 Aug 2025
LLMs that Understand Processes: Instruction-tuning for Semantics-Aware Process Mining
Vira Pyrih
Adrian Rebmann
Han van der Aa
160
0
0
22 Aug 2025
STREAM (ChemBio): A Standard for Transparently Reporting Evaluations in AI Model Reports
Tegan McCaslin
Jide Alaga
Samira Nedungadi
Seth Donoughe
Tom Reed
Rishi Bommasani
Chris Painter
Luca Righetti
267
3
0
13 Aug 2025
MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction
Shuo Tang
Jian Xu
Jiadong Zhang
Y. Chen
Qizhao Jin
Lingdong Shen
Chenglin Liu
Shiming Xiang
213
0
0
09 Aug 2025
Play Favorites: A Statistical Method to Measure Self-Bias in LLM-as-a-Judge
Evangelia Spiliopoulou
Riccardo Fogliato
Hanna Burnsky
Tamer Soliman
Jie Ma
Graham Horwood
Miguel Ballesteros
159
9
0
08 Aug 2025
Learning an Efficient Multi-Turn Dialogue Evaluator from Multiple LLM Judges
Yuqi Tang
Kehua Feng
Yunfeng Wang
Zhiwen Chen
Chengfei Lv
Gang Yu
Qiang Zhang
Keyan Ding
Huajun Chen
ELM
204
0
0
01 Aug 2025
Cascaded Information Disclosure for Generalized Evaluation of Problem Solving Capabilities
Yunxiang Yan
Tomohiro Sawada
Kartik Goyal
ELM
191
0
0
31 Jul 2025
Counterfactual Evaluation for Blind Attack Detection in LLM-based Evaluation Systems
Lijia Liu
Takumi Kondo
Kyohei Atarashi
Koh Takeuchi
Jiyi Li
Shigeru Saito
H. Kashima
142
0
0
31 Jul 2025
AraTable: Benchmarking LLMs' Reasoning and Understanding of Arabic Tabular Data
Rana Alshaikh
Israa Alghanmi
Shelan Jeawak
LMTD
LRM
149
2
0
24 Jul 2025
Marcel: A Lightweight and Open-Source Conversational Agent for University Student Support
Jan Trienes
Anastasiia Derzhanskaia
Roland Schwarzkopf
Markus Mühling
Jorg Schlotterer
Christin Seifert
129
0
0
18 Jul 2025
GEMeX-RMCoT: An Enhanced Med-VQA Dataset for Region-Aware Multimodal Chain-of-Thought Reasoning
Bo Liu
Xiangyu Zhao
Along He
Yidi Chen
Huazhu Fu
Xiao-Ming Wu
MedIm
LRM
215
0
0
22 Jun 2025
InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems
Kexin Huang
Qian Tu
Liwei Fan
Chenchen Yang
Dong Zhang
Shimin Li
Zhaoye Fei
Qinyuan Cheng
Xipeng Qiu
231
8
0
19 Jun 2025
Correlated Errors in Large Language Models
Elliot Kim
Avi Garg
Kenny Peng
Nikhil Garg
236
4
0
09 Jun 2025
Right Is Not Enough: The Pitfalls of Outcome Supervision in Training LLMs for Math Reasoning
Jiaxing Guo
Wenjie Yang
Shengzhong Zhang
Tongshan Xu
Lun Du
Da Zheng
Zengfeng Huang
LRM
231
6
0
07 Jun 2025
BioMol-MQA: A Multi-Modal Question Answering Dataset For LLM Reasoning Over Bio-Molecular Interactions
Saptarshi Sengupta
Shuhua Yang
Paul Kwong Yu
Fali Wang
Suhang Wang
210
1
0
06 Jun 2025
Explainer-guided Targeted Adversarial Attacks against Binary Code Similarity Detection Models
Mingjie Chen
Tiancheng Zhu
Mingxue Zhang
Yiling He
Minghao Lin
Penghui Li
Kui Ren
AAML
202
8
0
05 Jun 2025
RedDebate: Safer Responses through Multi-Agent Red Teaming Debates
Ali Asad
Stephen Obadinma
Radin Shayanfar
Xiaodan Zhu
AAML
LLMAG
265
3
0
04 Jun 2025
Toward Reliable VLM: A Fine-Grained Benchmark and Framework for Exposure, Bias, and Inference in Korean Street Views
Xiaonan Wang
Bo Shao
Hansaem Kim
202
0
0
03 Jun 2025
Contextual Integrity in LLMs via Reasoning and Reinforcement Learning
Guangchen Lan
Huseyin A. Inan
Sahar Abdelnabi
Janardhan Kulkarni
Lukas Wutschitz
Reza Shokri
Christopher G. Brinton
Robert Sim
196
9
0
29 May 2025
LASER: Stratified Selective Sampling for Instruction Tuning with Dedicated Scoring Strategy
Paramita Mirza
Lucas Weber
Fabian Küch
287
0
0
28 May 2025
Silencer: From Discovery to Mitigation of Self-Bias in LLM-as-Benchmark-Generator
Peiwen Yuan
Yiwei Li
Shaoxiong Feng
Xinglin Wang
Y. Zhang
Jiayi Shi
Chuyi Tan
Boyuan Pan
Yao Hu
Kan Li
223
3
0
27 May 2025
Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement
Keheliya Gallaba
Ali Arabat
Dayi Lin
Mohammed Sayagh
Ahmed E. Hassan
AI4CE
283
3
0
27 May 2025
Assistant-Guided Mitigation of Teacher Preference Bias in LLM-as-a-Judge
Zhuo Liu
Moxin Li
Xun Deng
Qifan Wang
Fuli Feng
ELM
415
2
0
25 May 2025
Large Language Models for Predictive Analysis: How Far Are They?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Qin Chen
Yuanyi Ren
Xiaojun Ma
Yuyang Shi
272
3
0
22 May 2025
Walk&Retrieve: Simple Yet Effective Zero-shot Retrieval-Augmented Generation via Knowledge Graph Walks
Martin Böckling
Heiko Paulheim
Andreea Iana
RALM
318
1
0
22 May 2025
InfoDeepSeek: Benchmarking Agentic Information Seeking for Retrieval-Augmented Generation
Yunjia Xi
Jianghao Lin
Menghui Zhu
Yongzhao Xiao
Zhuoying Ou
...
Weiwen Liu
Yasheng Wang
Ruiming Tang
Weinan Zhang
Yong Yu
356
7
0
21 May 2025
LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Yu Fan
Jingwei Ni
Jakob Merane
Etienne Salimbeni
Yoan Hermstrüwer
...
Mrinmaya Sachan
Alexander Stremitzer
Christoph Engel
Elliott Ash
Joel Niklaus
AILaw
ELM
547
12
0
19 May 2025
R3: Robust Rubric-Agnostic Reward Models
David Anugraha
Zilu Tang
Lester James V. Miranda
Hanyang Zhao
Mohammad Rifqi Farhansyah
Garry Kuwanto
Derry Wijaya
Genta Indra Winata
634
13
0
19 May 2025
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li
Daniel Khashabi
333
2
0
05 May 2025
LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
Joy Lim Jia Yin
Daniel Zhang-Li
Jifan Yu
Haoyang Li
Shangqing Tu
...
Zhiyuan Liu
Yisi Zhan
Lei Hou
Juanzi Li
Bin Xu
190
2
0
04 May 2025
TRUST: An LLM-Based Dialogue System for Trauma Understanding and Structured Assessments
Sichang Tu
Abigail Powers
S. Doogan
Jinho D. Choi
249
4
0
30 Apr 2025
RepliBench: Evaluating the Autonomous Replication Capabilities of Language Model Agents
Sid Black
Asa Cooper Stickland
Jake Pencharz
Oliver Sourbut
Michael Schmatz
Jay Bailey
Ollie Matthews
Ben Millwood
Alex Remedios
Alan Cooney
ELM
1.0K
6
0
21 Apr 2025
LoRe: Personalizing LLMs via Low-Rank Reward Modeling
Avinandan Bose
Zhihan Xiong
Yuejie Chi
Simon S. Du
Lin Xiao
Maryam Fazel
293
9
0
20 Apr 2025
Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
Regan Bolton
Mohammadreza Sheikhfathollahi
Simon Parkinson
Dan Basher
Howard Parkinson
239
2
0
18 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
349
13
0
16 Apr 2025
Cancer-Myth: Evaluating Large Language Models on Patient Questions with False Presuppositions
Peng Guo
Tianqi Chen
Ching Ying Lin
Jade Law
Mazen Jizzini
Jorge J. Nieva
Ruishan Liu
Robin Jia
381
1
0
15 Apr 2025
NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Vladislav Mikhailov
Tita Ranveig Enstad
David Samuel
Hans Christian Farsethås
Andrey Kutuzov
Erik Velldal
Lilja Øvrelid
ELM
400
5
0
10 Apr 2025
AI-Slop to AI-Polish? Aligning Language Models through Edit-Based Writing Rewards and Test-time Computation
Tuhin Chakrabarty
Philippe Laban
Chien-Sheng Wu
472
14
0
10 Apr 2025
Societal Impacts Research Requires Benchmarks for Creative Composition Tasks
Judy Hanwen Shen
Carlos Guestrin
614
2
0
09 Apr 2025
Socrates or Smartypants: Testing Logic Reasoning Capabilities of Large Language Models with Logic Programming-based Test Oracles
Zihao Xu
Junchen Ding
Yiling Lou
Kun Zhang
Dong Gong
Yuekang Li
ELM
LRM
369
1
0
09 Apr 2025
Self-Adaptive Cognitive Debiasing for Large Language Models in Decision-Making
Yougang Lyu
Shijie Ren
Yue Feng
Zihan Wang
Zhongfu Chen
Zhaochun Ren
Maarten de Rijke
747
1
0
05 Apr 2025
Verification of Autonomous Neural Car Control with KeYmaera X
International Conference on Abstract State Machines, Alloy, B, TLA, VDM, and Z (ABZ), 2025
Enguerrand Prebet
Samuel Teuber
André Platzer
264
20
0
04 Apr 2025
Do LLM Evaluators Prefer Themselves for a Reason?
Wei-Lin Chen
Zhepei Wei
Xinyu Zhu
Shi Feng
Yu Meng
ELM
LRM
359
22
0
04 Apr 2025
An Illusion of Progress? Assessing the Current State of Web Agents
Tianci Xue
Weijian Qi
Tianneng Shi
Chan Hee Song
Boyu Gou
Basel Alomair
Huan Sun
Eric Fosler-Lussier
LLMAG
ELM
878
49
1
02 Apr 2025
Rubrik's Cube: Testing a New Rubric for Evaluating Explanations on the CUBE dataset
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Diana Galván-Sosa
Gabrielle Gaudeau
Pride Kavumba
Yunmeng Li
Hongyi Gu
Zheng Yuan
Keisuke Sakaguchi
P. Buttery
LRM
392
3
0
31 Mar 2025
Citegeist: Automated Generation of Related Work Analysis on the arXiv Corpus
Claas Beger
Carl-Leander Henneking
242
1
0
29 Mar 2025
Evaluating book summaries from internal knowledge in Large Language Models: a cross-model and semantic consistency approach
Javier Coronado-Blázquez
HILM
ELM
283
0
0
27 Mar 2025
3MDBench: Medical Multimodal Multi-agent Dialogue Benchmark
Ivan Sviridov
Amina Miftakhova
Artemiy Tereshchenko
Galina Zubkova
Pavel Blinov
Andrey Savchenko
LM&MA
349
5
0
26 Mar 2025
Enhancing Product Search Interfaces with Sketch-Guided Diffusion and Language Agents
The Web Conference (WWW), 2025
Edward Sun
DiffM
255
0
0
21 Mar 2025