Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2411.16594
Cited By
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
25 November 2024
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
Zhen Tan
Amrita Bhattacharjee
Yuxuan Jiang
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
Re-assign community
ArXiv
PDF
HTML
Papers citing
"From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge"
46 / 46 papers shown
Title
SymPlanner: Deliberate Planning in Language Models with Symbolic Representation
Siheng Xiong
Jieyu Zhou
Zhangding Liu
Yusen Su
LLMAG
LM&Ro
40
0
0
02 May 2025
Which Agent Causes Task Failures and When? On Automated Failure Attribution of LLM Multi-Agent Systems
Shaokun Zhang
Ming Yin
Jieyu Zhang
J. H. Liu
Zhiguang Han
...
Beibin Li
Chi Wang
H. Wang
Y. Chen
Qingyun Wu
47
0
0
30 Apr 2025
Agree to Disagree? A Meta-Evaluation of LLM Misgendering
Arjun Subramonian
Vagrant Gautam
Preethi Seshadri
Dietrich Klakow
Kai-Wei Chang
Yizhou Sun
14
1
0
23 Apr 2025
Efficient MAP Estimation of LLM Judgment Performance with Prior Transfer
Huaizhi Qu
Inyoung Choi
Zhen Tan
Song Wang
Sukwon Yun
Qi Long
Faizan Siddiqui
Kwonjoon Lee
Tianlong Chen
37
0
0
17 Apr 2025
Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
Xiaotian Zhang
Ruizhe Chen
Yang Feng
Zuozhu Liu
35
0
0
17 Apr 2025
Benchmarking LLM-based Relevance Judgment Methods
Negar Arabzadeh
Charles L. A. Clarke
28
0
0
17 Apr 2025
A Human-AI Comparative Analysis of Prompt Sensitivity in LLM-Based Relevance Judgment
Negar Arabzadeh
Charles L. A. Clarke
19
1
0
16 Apr 2025
Heimdall: test-time scaling on the generative verification
Wenlei Shi
Xing Jin
LRM
18
0
0
14 Apr 2025
Deep Reasoning Translation via Reinforcement Learning
Jiaan Wang
Fandong Meng
Jie Zhou
OffRL
LRM
27
0
0
14 Apr 2025
QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
Zongxian Yang
Jiayu Qian
Z. Huang
Kay Chen Tan
LM&MA
LRM
23
0
0
13 Apr 2025
Meta-Evaluating Local LLMs: Rethinking Performance Metrics for Serious Games
Andrés Isaza-Giraldo
Paulo Bala
Lucas Pereira
19
0
0
13 Apr 2025
NorEval: A Norwegian Language Understanding and Generation Evaluation Benchmark
Vladislav Mikhailov
Tita Ranveig Enstad
David Samuel
Hans Christian Farsethås
Andrey Kutuzov
Erik Velldal
Lilja Øvrelid
ELM
40
0
0
10 Apr 2025
HypoEval: Hypothesis-Guided Evaluation for Natural Language Generation
Mingxuan Li
Hanchen Li
Chenhao Tan
ALM
ELM
31
0
0
09 Apr 2025
Beyond Accuracy: The Role of Calibration in Self-Improving Large Language Models
Liangjie Huang
Dawei Li
Huan Liu
Lu Cheng
LRM
29
0
0
03 Apr 2025
A Survey of Scaling in Large Language Model Reasoning
Zihan Chen
Song Wang
Zhen Tan
Xingbo Fu
Zhenyu Lei
Peng Wang
Huan Liu
Cong Shen
Jundong Li
LRM
84
0
0
02 Apr 2025
Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications
Hongliu Cao
Ilias Driouich
Robin Singh
Eoin Thomas
ELM
29
0
0
01 Apr 2025
Brains vs. Bytes: Evaluating LLM Proficiency in Olympiad Mathematics
Hamed Mahdavi
Alireza Hashemi
Majid Daliri
Pegah Mohammadipour
Alireza Farhadi
Samira Malek
Yekta Yazdanifard
Amir Khasahmadi
V. Honavar
ELM
LRM
35
1
0
01 Apr 2025
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
José P. Pombal
Nuno M. Guerreiro
Ricardo Rei
André F. T. Martins
ALM
61
0
0
01 Apr 2025
Do We Truly Need So Many Samples? Multi-LLM Repeated Sampling Efficiently Scales Test-Time Compute
Jianhao Chen
Zishuo Xun
Bocheng Zhou
Han Qi
Qiaosheng Zhang
...
Wei Hu
Yuzhong Qu
W. Ouyang
Wanli Ouyang
Shuyue Hu
74
0
0
01 Apr 2025
A Multi-Model Adaptation of Speculative Decoding for Classification
Somnath Roy
Padharthi Sreekar
Srivatsa Narasimha
Anubhav Anand
31
0
0
23 Mar 2025
Debiasing Multimodal Large Language Models via Noise-Aware Preference Optimization
Zefeng Zhang
Hengzhu Tang
Jiawei Sheng
Zhenyu Zhang
Yiming Ren
Zhenyang Li
Dawei Yin
Duohe Ma
Tingwen Liu
36
0
0
23 Mar 2025
Tuning LLMs by RAG Principles: Towards LLM-native Memory
Jiale Wei
Shuchi Wu
Ruochen Liu
Xiang Ying
Jingbo Shang
Fangbo Tao
RALM
57
0
0
20 Mar 2025
From Chaos to Order: The Atomic Reasoner Framework for Fine-grained Reasoning in Large Language Models
Jinyi Liu
Yan Zheng
Rong Cheng
Qiyu Wu
Wei Guo
...
Hebin Liang
Yifu Yuan
Hangyu Mao
Fuzheng Zhang
Jianye Hao
LRM
AI4CE
39
1
0
20 Mar 2025
D3: Diversity, Difficulty, and Dependability-Aware Data Selection for Sample-Efficient LLM Instruction Tuning
Jia Zhang
Chen-Xi Zhang
Yao Liu
Yi-Xuan Jin
Xiao-Wen Yang
Bo Zheng
Y. Liu
Lan-Zhe Guo
36
2
0
14 Mar 2025
DAFE: LLM-Based Evaluation Through Dynamic Arbitration for Free-Form Question-Answering
Sher Badshah
Hassan Sajjad
54
1
0
11 Mar 2025
DeepReview: Improving LLM-based Paper Review with Human-like Deep Thinking Process
Minjun Zhu
Yixuan Weng
Linyi Yang
Yue Zhang
ALM
LRM
52
1
0
11 Mar 2025
Quantifying the Robustness of Retrieval-Augmented Language Models Against Spurious Features in Grounding Data
Shiping Yang
Jie Wu
Wenbiao Ding
Ning Wu
Shining Liang
Ming Gong
Hengyuan Zhang
Dongmei Zhang
AAML
57
1
0
07 Mar 2025
No Free Labels: Limitations of LLM-as-a-Judge Without Human Grounding
Michael Krumdick
Charles Lovering
Varshini Reddy
Seth Ebner
Chris Tanner
ALM
ELM
41
2
0
07 Mar 2025
Graph-Augmented Reasoning: Evolving Step-by-Step Knowledge Graph Retrieval for LLM Reasoning
Wenjie Wu
Yongcheng Jing
Yingjie Wang
Wenbin Hu
Dacheng Tao
RALM
LRM
54
2
0
03 Mar 2025
Evaluating Personalized Tool-Augmented LLMs from the Perspectives of Personalization and Proactivity
Yupu Hao
Pengfei Cao
Zhuoran Jin
Huanxuan Liao
Yubo Chen
Kang Liu
Jun Zhao
LLMAG
50
1
0
02 Mar 2025
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Kaishuai Xu
Tiezheng YU
Wenjun Hou
Yi Cheng
Liangyou Li
Xin Jiang
Lifeng Shang
Q. Liu
Wenjie Li
ELM
61
0
0
26 Feb 2025
ImageChain: Advancing Sequential Image-to-Text Reasoning in Multimodal Large Language Models
Danae Sánchez Villegas
Ingo Ziegler
Desmond Elliott
LRM
43
1
0
26 Feb 2025
M-MAD: Multidimensional Multi-Agent Debate for Advanced Machine Translation Evaluation
Zhaopeng Feng
Jiayuan Su
Jiamei Zheng
Jiahan Ren
Yan Zhang
Jian Wu
Hongwei Wang
Zuozhu Liu
ELM
191
0
0
21 Feb 2025
BPO: Towards Balanced Preference Optimization between Knowledge Breadth and Depth in Alignment
Sizhe Wang
Yongqi Tong
Hengyuan Zhang
Dawei Li
Xin Zhang
Tianlong Chen
85
5
0
21 Feb 2025
RefineCoder: Iterative Improving of Large Language Models via Adaptive Critique Refinement for Code Generation
C. Zhou
Xinyu Zhang
Dandan Song
Xiancai Chen
Wanli Gu
Huipeng Ma
Yuhang Tian
M. Zhang
Linmei Hu
63
1
0
13 Feb 2025
Towards Internet-Scale Training For Agents
Brandon Trabucco
Gunnar A. Sigurdsson
Robinson Piramuthu
Ruslan Salakhutdinov
ALM
84
2
0
10 Feb 2025
Self-Supervised Prompt Optimization
Jinyu Xiang
Jiayi Zhang
Zhaoyang Yu
Fengwei Teng
Jinhao Tu
Xinbing Liang
Sirui Hong
Chenglin Wu
Yuyu Luo
OffRL
LRM
49
5
0
07 Feb 2025
Preference Leakage: A Contamination Problem in LLM-as-a-judge
Dawei Li
Renliang Sun
Yue Huang
Ming Zhong
Bohan Jiang
J. Han
X. Zhang
Wei Wang
Huan Liu
62
11
0
03 Feb 2025
Visual Large Language Models for Generalized and Specialized Applications
Yifan Li
Zhixin Lai
Wentao Bao
Zhen Tan
Anh Dao
Kewei Sui
Jiayi Shen
Dong Liu
Huan Liu
Yu Kong
VLM
83
10
0
06 Jan 2025
Towards Omni-RAG: Comprehensive Retrieval-Augmented Generation for Large Language Models in Medical Applications
Zhe Chen
Yusheng Liao
Shuyang Jiang
Pingjie Wang
Y. Guo
Y. Wang
Yu Wang
34
3
0
05 Jan 2025
An Investigation into Value Misalignment in LLM-Generated Texts for Cultural Heritage
Fan Bu
Zheng Wang
Siyi Wang
Ziyao Liu
25
0
0
03 Jan 2025
Outcome-Refining Process Supervision for Code Generation
Zhuohao Yu
Weizheng Gu
Yidong Wang
Zhengran Zeng
Jindong Wang
Wei Ye
Shikun Zhang
LRM
79
4
0
19 Dec 2024
RAG-RewardBench: Benchmarking Reward Models in Retrieval Augmented Generation for Preference Alignment
Zhuoran Jin
Hongbang Yuan
Tianyi Men
Pengfei Cao
Yubo Chen
Kang-Jun Liu
Jun Zhao
ALM
82
7
0
18 Dec 2024
Assessing the Impact of Conspiracy Theories Using Large Language Models
Bohan Jiang
Dawei Li
Zhen Tan
Xinyi Zhou
Ashwin Rao
Kristina Lerman
H. Bernard
Huan Liu
69
2
0
09 Dec 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
Mei Han
ALM
ELM
42
22
0
23 Aug 2024
How Good is ChatGPT in Giving Advice on Your Visualization Design?
Nam Wook Kim
Grace Myers
Benjamin Bach
15
19
0
14 Oct 2023
1