© 2025 ResearchTrend.AI, All rights reserved.

PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations
arXiv:2307.02762
3 January 2025
Ruosen Li, Teerth Patel, Xinya Du
[LLMAG, ALM]

Papers citing "PRD: Peer Rank and Discussion Improve Large Language Model based Evaluations"

50 of 73 citing papers shown.

Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao, Shibo Hong, X. Li, Jiahao Ying, Yubo Ma, ..., Juanzi Li, Aixin Sun, Xuanjing Huang, Tat-Seng Chua, Yu Jiang
26 Apr 2025 [ALM, ELM]

Multi-Agent LLM Judge: automatic personalized LLM judge design for evaluating natural language generation applications
Hongliu Cao, Ilias Driouich, Robin Singh, Eoin Thomas
01 Apr 2025 [ELM]

MAMM-Refine: A Recipe for Improving Faithfulness in Generation with Multi-Agent Collaboration
David Wan, Justin Chih-Yao Chen, Elias Stengel-Eskin, Mohit Bansal
19 Mar 2025 [LLMAG, LRM]

UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
Qihui Zhang, Munan Ning, Zheyuan Liu, Yanbo Wang, Jiayi Ye, Yue Huang, Shuo Yang, Xiao Chen, Y. Song, Li Yuan
19 Mar 2025 [LRM]

Safer or Luckier? LLMs as Safety Evaluators Are Not Robust to Artifacts
Hongyu Chen, Seraphina Goldfarb-Tarrant
12 Mar 2025

PiCO: Peer Review in LLMs based on the Consistency Optimization
Kun-Peng Ning, Shuo Yang, Yu-Yang Liu, Jia-Yu Yao, Zhen-Hui Liu, Yu Wang, Ming Pang, Li Yuan
24 Feb 2025 [ALM]

S^3cMath: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners
Yuchen Yan, Jin Jiang, Yang Liu, Yixin Cao, Xin Xu, M. Zhang, Xunliang Cai, Jian Shao
21 Feb 2025 [ReLM, LRM, KELM]

Evaluation of Large Language Models via Coupled Token Generation
N. C. Benz, Stratis Tsirtsis, Eleni Straitouri, Ivi Chatzi, Ander Artola Velasco, Suhas Thejaswi, Manuel Gomez Rodriguez
03 Feb 2025

From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li, Bohan Jiang, Liangjie Huang, Alimohammad Beigi, Chengshuai Zhao, ..., Canyu Chen, Tianhao Wu, Kai Shu, Lu Cheng, Huan Liu
25 Nov 2024 [ELM, AILaw]

Bayesian Calibration of Win Rate Estimation with LLM Evaluators
Yicheng Gao, G. Xu, Zhe Wang, Arman Cohan
07 Nov 2024

SMoA: Improving Multi-agent Large Language Models with Sparse Mixture-of-Agents
Dawei Li, Zhen Tan, Peijia Qian, Yifan Li, Kumar Satvik Chaudhary, Lijie Hu, Jiayi Shen
05 Nov 2024

CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
Haitao Li, Junjie Chen, Qingyao Ai, Zhumin Chu, Yujia Zhou, Qian Dong, Yiqun Liu
20 Oct 2024

SkillAggregation: Reference-free LLM-Dependent Aggregation
Guangzhi Sun, Anmol Kagrecha, Potsawee Manakul, Phil Woodland, Mark J. F. Gales
14 Oct 2024

Multi-Facet Counterfactual Learning for Content Quality Evaluation
Jiasheng Zheng, Hongyu Lin, Boxi Cao, M. Liao, Y. Lu, Xianpei Han, Le Sun
10 Oct 2024

ReIFE: Re-evaluating Instruction-Following Evaluation
Yixin Liu, Kejian Shi, Alexander R. Fabbri, Yilun Zhao, Peifeng Wang, Chien-Sheng Wu, Shafiq Joty, Arman Cohan
09 Oct 2024

AIME: AI System Optimization via Multiple LLM Evaluators
Bhrij Patel, Souradip Chakraborty, Wesley A. Suttle, Mengdi Wang, Amrit Singh Bedi, Dinesh Manocha
04 Oct 2024

Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge
Jiayi Ye, Yanbo Wang, Yue Huang, Dongping Chen, Qihui Zhang, ..., Werner Geyer, Chao Huang, Pin-Yu Chen, Nitesh V. Chawla, Xiangliang Zhang
03 Oct 2024 [ELM]

Role-RL: Online Long-Context Processing with Role Reinforcement Learning for Distinct LLMs in Their Optimal Roles
Lewei He, Tianyu Shi, Pengran Huang, Bingzhi Chen, Qianglong Chen, Jiahui Pan
26 Sep 2024 [OffRL]

LINKAGE: Listwise Ranking among Varied-Quality References for Non-Factoid QA Evaluation via LLMs
Sihui Yang, Keping Bi, Wanqing Cui, Jiafeng Guo, Xueqi Cheng
23 Sep 2024

Towards a Unified View of Preference Learning for Large Language Models: A Survey
Bofei Gao, Feifan Song, Yibo Miao, Zefan Cai, Z. Yang, ..., Houfeng Wang, Zhifang Sui, Peiyi Wang, Baobao Chang
04 Sep 2024

LongGenBench: Benchmarking Long-Form Generation in Long Context LLMs
Yuhao Wu, Ming Shan Hee, Zhiqing Hu, Roy Ka-Wei Lee
03 Sep 2024 [RALM]

Poor-Supervised Evaluation for SuperLLM via Mutual Consistency
Peiwen Yuan, Shaoxiong Feng, Yiwei Li, Xinglin Wang, Boyuan Pan, Heda Wang, Yao Hu, Kan Li
25 Aug 2024

DHP Benchmark: Are LLMs Good NLG Evaluators?
Yicheng Wang, Jiayi Yuan, Yu-Neng Chuang, Zhuoer Wang, Yingchi Liu, Mark Cusick, Param Kulkarni, Zhengping Ji, Yasser Ibrahim, Xia Hu
25 Aug 2024 [LM&MA, ELM]

IQA-EVAL: Automatic Evaluation of Human-Model Interactive Question Answering
Ruosen Li, Barry Wang, Ruochen Li, Xinya Du
24 Aug 2024 [ELM]

Estimating Contribution Quality in Online Deliberations Using a Large Language Model
Lodewijk Gelauff, Mohak Goyal, Bhargav Dindukurthi, Ashish Goel, Alice Siu
21 Aug 2024

CLAVE: An Adaptive Framework for Evaluating Values of LLM Generated Responses
Jing Yao, Xiaoyuan Yi, Xing Xie
15 Jul 2024 [ELM, ALM]

Scalability of Bayesian Network Structure Elicitation with Large Language Models: a Novel Methodology and Comparative Analysis
Nikolay Babakov, Ehud Reiter, Alberto Bugarin
12 Jul 2024

On scalable oversight with weak LLMs judging strong LLMs
Zachary Kenton, Noah Y. Siegel, János Kramár, Jonah Brown-Cohen, Samuel Albanie, ..., Rishabh Agarwal, David Lindner, Yunhao Tang, Noah D. Goodman, Rohin Shah
05 Jul 2024 [ELM]

Spontaneous Reward Hacking in Iterative Self-Refinement
Jane Pan, He He, Samuel R. Bowman, Shi Feng
05 Jul 2024

Debate-to-Write: A Persona-Driven Multi-Agent Framework for Diverse Argument Generation
Zhe Hu, Hou Pong Chan, Jing Li, Yu Yin
28 Jun 2024 [LLMAG]

PrExMe! Large Scale Prompt Exploration of Open Source LLMs for Machine Translation and Summarization Evaluation
Christoph Leiter, Steffen Eger
26 Jun 2024

AgentReview: Exploring Peer Review Dynamics with LLM Agents
Yiqiao Jin, Qinlin Zhao, Yiyang Wang, Hao Chen, Kaijie Zhu, Yijia Xiao, Jindong Wang
18 Jun 2024 [LLMAG]

Language Model Council: Democratically Benchmarking Foundation Models on Highly Subjective Tasks
Justin Zhao, Flor Miriam Plaza del Arco, A. C. Curry, Amanda Cercas Curry
12 Jun 2024 [ELM, ALM]

When Can LLMs Actually Correct Their Own Mistakes? A Critical Survey of Self-Correction of LLMs
Ryo Kamoi, Yusen Zhang, Nan Zhang, Jiawei Han, Rui Zhang
03 Jun 2024 [LRM]

Auto Arena of LLMs: Automating LLM Evaluations with Agent Peer-battles and Committee Discussions
Ruochen Zhao, Wenxuan Zhang, Yew Ken Chia, Deli Zhao, Lidong Bing
30 May 2024

Language Models can Evaluate Themselves via Probability Discrepancy
Tingyu Xia, Bowen Yu, Yuan Wu, Yi-Ju Chang, Chang Zhou
17 May 2024 [ELM]

Agent Design Pattern Catalogue: A Collection of Architectural Patterns for Foundation Model based Agents
Yue Liu, Sin Kit Lo, Qinghua Lu, Liming Zhu, Dehai Zhao, Xiwei Xu, Stefan Harrer, Jon Whittle
16 May 2024 [LLMAG, AI4CE]

HumanRankEval: Automatic Evaluation of LMs as Conversational Assistants
Milan Gritta, Gerasimos Lampouras, Ignacio Iacobacci
15 May 2024 [ALM]

Replacing Judges with Juries: Evaluating LLM Generations with a Panel of Diverse Models
Pat Verga, Sebastian Hofstatter, Sophia Althammer, Yixuan Su, Aleksandra Piktus, Arkady Arkhangorodsky, Minjie Xu, Naomi White, Patrick Lewis
29 Apr 2024 [ALM, ELM]

FreeEval: A Modular Framework for Trustworthy and Efficient Evaluation of Large Language Models
Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Zhengran Zeng, Wei Ye, Jindong Wang, Yue Zhang, Shikun Zhang
09 Apr 2024

Evaluating LLMs at Detecting Errors in LLM Responses
Ryo Kamoi, Sarkar Snigdha Sarathi Das, Renze Lou, Jihyun Janice Ahn, Yilun Zhao, ..., Salika Dave, Shaobo Qin, Arman Cohan, Wenpeng Yin, Rui Zhang
04 Apr 2024

Debatrix: Multi-dimensional Debate Judge with Iterative Chronological Analysis Based on LLM
Jingcong Liang, Rong Ye, Meng Han, Ruofei Lai, Xinyu Zhang, Xuanjing Huang, Zhongyu Wei
12 Mar 2024

Tapilot-Crossing: Benchmarking and Evolving LLMs Towards Interactive Data Analysis Agents
Jinyang Li, Nan Huo, Yan Gao, Jiayi Shi, Yingxiu Zhao, Ge Qu, Yurong Wu, Chenhao Ma, Jian-Guang Lou, Reynold Cheng
08 Mar 2024 [LLMAG]

Beyond Natural Language: LLMs Leveraging Alternative Formats for Enhanced Reasoning and Communication
Weize Chen, Chenfei Yuan, Jiarui Yuan, Yusheng Su, Cheng Qian, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun
28 Feb 2024

Prediction-Powered Ranking of Large Language Models
Ivi Chatzi, Eleni Straitouri, Suhas Thejaswi, Manuel Gomez Rodriguez
27 Feb 2024 [ALM]

KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models
Zhuohao Yu, Chang Gao, Wenjin Yao, Yidong Wang, Wei Ye, Jindong Wang, Xing Xie, Yue Zhang, Shikun Zhang
23 Feb 2024

OMGEval: An Open Multilingual Generative Evaluation Benchmark for Large Language Models
Yang Janet Liu, Meng Xu, Shuo Wang, Liner Yang, Haoyu Wang, ..., Cunliang Kong, Yun-Nung Chen, Yang Liu, Maosong Sun, Erhong Yang
21 Feb 2024 [ELM, LRM]

Large Language Models for Data Annotation: A Survey
Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu
21 Feb 2024 [SyDa]

Evolving AI Collectives to Enhance Human Diversity and Enable Self-Regulation
Shiyang Lai, Yujin Potter, Junsol Kim, Richard Zhuang, Dawn Song, James Evans
19 Feb 2024

Confidence Matters: Revisiting Intrinsic Self-Correction Capabilities of Large Language Models
Loka Li, Zhenhao Chen, Guan-Hong Chen, Yixuan Zhang, Yusheng Su, Eric P. Xing, Kun Zhang
19 Feb 2024 [LRM]