Prometheus: Inducing Fine-grained Evaluation Capability in Language Models

12 October 2023
Seungone Kim
Jamin Shin
Yejin Cho
Joel Jang
Shayne Longpre
Hwaran Lee
Sangdoo Yun
Seongjin Shin
Sungdong Kim
James Thorne
Minjoon Seo
    ALM
    LM&MA
    ELM

Papers citing "Prometheus: Inducing Fine-grained Evaluation Capability in Language Models"

50 / 168 papers shown
VISCO: Benchmarking Fine-Grained Critique and Correction Towards Self-Improvement in Visual Reasoning
Xueqing Wu
Yuheng Ding
Bingxuan Li
Pan Lu
Da Yin
Kai-Wei Chang
Nanyun Peng
LRM
100
3
0
03 Dec 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
108
63
0
25 Nov 2024
Self-Generated Critiques Boost Reward Modeling for Language Models
Yue Yu
Zhengxing Chen
Aston Zhang
L. Tan
Chenguang Zhu
...
Suchin Gururangan
Chao-Yue Zhang
Melanie Kambadur
Dhruv Mahajan
Rui Hou
LRM
ALM
87
14
0
25 Nov 2024
From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set
M. Finkelstein
Dan Deutsch
Parker Riley
Juraj Juraska
Geza Kovacs
Markus Freitag
74
0
0
23 Nov 2024
Bayesian Calibration of Win Rate Estimation with LLM Evaluators
Yicheng Gao
G. Xu
Zhe Wang
Arman Cohan
31
6
0
07 Nov 2024
DELIFT: Data Efficient Language model Instruction Fine Tuning
Ishika Agarwal
Krishnateja Killamsetty
Lucian Popa
Marina Danilevsky
ALM
VLM
46
2
0
07 Nov 2024
VERITAS: A Unified Approach to Reliability Evaluation
Rajkumar Ramamurthy
Meghana Arakkal Rajeev
Oliver Molenschot
James Y. Zou
Nazneen Rajani
HILM
33
1
0
05 Nov 2024
Evaluating Cost-Accuracy Trade-offs in Multimodal Search Relevance Judgements
Silvia Terragni
Hoang Cuong
Joachim Daiber
Pallavi Gudipati
Pablo N. Mendes
20
0
0
25 Oct 2024
Improving Model Factuality with Fine-grained Critique-based Evaluator
Yiqing Xie
Wenxuan Zhou
Pradyot Prakash
Di Jin
Yuning Mao
...
Sinong Wang
Han Fang
Carolyn Rose
Daniel Fried
Hejia Zhang
HILM
33
5
0
24 Oct 2024
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
Guijin Son
Dongkeun Yoon
Juyoung Suk
Javier Aula-Blasco
Mano Aslan
Vu Trong Kim
Shayekh Bin Islam
Jaume Prats-Cristià
Lucía Tormo-Bañuelos
Seungone Kim
ELM
LRM
25
8
0
23 Oct 2024
CalibraEval: Calibrating Prediction Distribution to Mitigate Selection Bias in LLMs-as-Judges
Haitao Li
Junjie Chen
Qingyao Ai
Zhumin Chu
Yujia Zhou
Qian Dong
Yiqun Liu
41
8
0
20 Oct 2024
HEALTH-PARIKSHA: Assessing RAG Models for Health Chatbots in Real-World Multilingual Settings
Varun Gumma
Anandhita Raghunath
Mohit Jain
Sunayana Sitaram
LM&MA
32
1
0
17 Oct 2024
Cross-Lingual Auto Evaluation for Assessing Multilingual LLMs
Sumanth Doddapaneni
Mohammed Safi Ur Rahman Khan
Dilip Venkatesh
Raj Dabre
Anoop Kunchukuttan
Mitesh M. Khapra
ELM
35
1
0
17 Oct 2024
MCQG-SRefine: Multiple Choice Question Generation and Evaluation with Iterative Self-Critique, Correction, and Comparison Feedback
Zonghai Yao
Aditya Parashar
Huixue Zhou
Won Seok Jang
Feiyun Ouyang
Zhichao Yang
Hong-ye Yu
ELM
44
2
0
17 Oct 2024
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Sijun Tan
Siyuan Zhuang
Kyle Montgomery
William Y. Tang
Alejandro Cuadron
Chenguang Wang
Raluca A. Popa
Ion Stoica
ELM
ALM
51
36
0
16 Oct 2024
4-LEGS: 4D Language Embedded Gaussian Splatting
Gal Fiebelman
Tamir Cohen
Ayellet Morgenstern
Peter Hedman
Hadar Averbuch-Elor
3DGS
36
1
0
14 Oct 2024
EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs
Yijie Li
Yuan Sun
ELM
26
0
0
13 Oct 2024
Beyond Exact Match: Semantically Reassessing Event Extraction by Large Language Models
Yi-Fan Lu
Xian-Ling Mao
Tian Lan
Heyan Huang
Xiaoyan Gao
45
0
0
12 Oct 2024
Multi-Facet Counterfactual Learning for Content Quality Evaluation
Jiasheng Zheng
Hongyu Lin
Boxi Cao
M. Liao
Y. Lu
Xianpei Han
Le Sun
23
0
0
10 Oct 2024
ReIFE: Re-evaluating Instruction-Following Evaluation
Yixin Liu
Kejian Shi
Alexander R. Fabbri
Yilun Zhao
Peifeng Wang
Chien-Sheng Wu
Shafiq Joty
Arman Cohan
22
6
0
09 Oct 2024
Uncovering Factor Level Preferences to Improve Human-Model Alignment
Juhyun Oh
Eunsu Kim
Jiseon Kim
Wenda Xu
Inha Cha
William Yang Wang
Alice H. Oh
21
0
0
09 Oct 2024
LLM Self-Correction with DeCRIM: Decompose, Critique, and Refine for Enhanced Following of Instructions with Multiple Constraints
Thomas Palmeira Ferraz
Kartik Mehta
Yu-Hsiang Lin
Haw-Shiuan Chang
Shereen Oraby
Sijia Liu
Vivek Subramanian
Tagyoung Chung
Mohit Bansal
Nanyun Peng
48
7
0
09 Oct 2024
Cheating Automatic LLM Benchmarks: Null Models Achieve High Win Rates
Xiaosen Zheng
Tianyu Pang
Chao Du
Qian Liu
Jing Jiang
Min-Bin Lin
33
8
0
09 Oct 2024
Mitigating the Impact of Reference Quality on Evaluation of Summarization Systems with Reference-Free Metrics
Théo Gigant
Camille Guinaudeau
Marc Decombas
Frédéric Dufaux
40
1
0
08 Oct 2024
Self-rationalization improves LLM as a fine-grained judge
Prapti Trivedi
Aditya Gulati
Oliver Molenschot
Meghana Arakkal Rajeev
Rajkumar Ramamurthy
Keith Stevens
Tanveesh Singh Chaudhery
Jahnavi Jambholkar
James Y. Zou
Nazneen Rajani
LRM
25
5
0
07 Oct 2024
Rationale-Aware Answer Verification by Pairwise Self-Evaluation
Akira Kawabata
Saku Sugawara
LRM
31
2
0
07 Oct 2024
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang
Yufei Wang
Tiezheng Yu
Yuxin Jiang
Chuhan Wu
...
Xin Jiang
Lifeng Shang
Ruiming Tang
Fuyuan Lyu
Chen Ma
26
4
0
07 Oct 2024
Learning Code Preference via Synthetic Evolution
Jiawei Liu
Thanh Nguyen
Mingyue Shang
Hantian Ding
Xiaopeng Li
Yu Yu
Varun Kumar
Zijian Wang
SyDa
ALM
AAML
26
3
0
04 Oct 2024
Better Instruction-Following Through Minimum Bayes Risk
Ian Wu
Patrick Fernandes
Amanda Bertsch
Seungone Kim
Sina Pakazad
Graham Neubig
48
9
0
03 Oct 2024
Generative Reward Models
Dakota Mahan
Duy Phung
Rafael Rafailov
Chase Blagden
Nathan Lile
Louis Castricato
Jan-Philipp Fränken
Chelsea Finn
Alon Albalak
VLM
SyDa
OffRL
27
25
0
02 Oct 2024
Speculative Coreset Selection for Task-Specific Fine-tuning
Xiaoyu Zhang
Juan Zhai
Shiqing Ma
Chao Shen
Tianlin Li
Weipeng Jiang
Yang Liu
30
1
0
02 Oct 2024
Robust LLM safeguarding via refusal feature adversarial training
L. Yu
Virginie Do
Karen Hambardzumyan
Nicola Cancedda
AAML
56
9
0
30 Sep 2024
Post-hoc Reward Calibration: A Case Study on Length Bias
Zeyu Huang
Zihan Qiu
Zili Wang
Edoardo M. Ponti
Ivan Titov
38
5
0
25 Sep 2024
FLEX: Expert-level False-Less EXecution Metric for Reliable Text-to-SQL Benchmark
Heegyu Kim
Taeyang Jeon
Seunghwan Choi
Seungtaek Choi
Hyunsouk Cho
39
0
0
24 Sep 2024
Direct Judgement Preference Optimization
Peifeng Wang
Austin Xu
Yilun Zhou
Caiming Xiong
Shafiq Joty
ELM
37
12
0
23 Sep 2024
Aligning Language Models Using Follow-up Likelihood as Reward Signal
Chen Zhang
Dading Chong
Feng Jiang
Chengguang Tang
Anningzhe Gao
Guohua Tang
Haizhou Li
ALM
29
2
0
20 Sep 2024
LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
Guijin Son
Hyunwoo Ko
Hoyoung Lee
Yewon Kim
Seunghyeok Hong
ALM
ELM
46
5
0
17 Sep 2024
Quantile Regression for Distributional Reward Models in RLHF
Nicolai Dorka
32
15
0
16 Sep 2024
AI-LieDar: Examine the Trade-off Between Utility and Truthfulness in LLM Agents
Zhe Su
Xuhui Zhou
Sanketh Rangreji
Anubha Kabra
Julia Mendelsohn
Faeze Brahman
Maarten Sap
LLMAG
95
2
0
13 Sep 2024
Your Weak LLM is Secretly a Strong Teacher for Alignment
Leitian Tao
Yixuan Li
84
5
0
13 Sep 2024
GroUSE: A Benchmark to Evaluate Evaluators in Grounded Question Answering
Sacha Muller
António Loison
Bilel Omrani
Gautier Viaud
RALM
ELM
31
1
0
10 Sep 2024
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan
D. Zhu
Matthias Aßenmacher
Xiaoyu Shen
Benjamin Roth
ELM
45
4
0
06 Sep 2024
Towards a Unified View of Preference Learning for Large Language Models: A Survey
Bofei Gao
Feifan Song
Yibo Miao
Zefan Cai
Z. Yang
...
Houfeng Wang
Zhifang Sui
Peiyi Wang
Baobao Chang
41
11
0
04 Sep 2024
What Makes a Good Story and How Can We Measure It? A Comprehensive Survey of Story Evaluation
Dingyi Yang
Qin Jin
36
5
0
26 Aug 2024
Critique-out-Loud Reward Models
Zachary Ankner
Mansheej Paul
Brandon Cui
Jonathan D. Chang
Prithviraj Ammanabrolu
ALM
LRM
25
26
0
21 Aug 2024
BEYOND DIALOGUE: A Profile-Dialogue Alignment Framework Towards General Role-Playing Language Model
Yeyong Yu
Runsheng Yu
Haojie Wei
Zhanqiu Zhang
Quan Qian
ALM
16
2
0
20 Aug 2024
Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss
Christian Poelitz
Ian Drosos
Vu Le
Nick McKenna
Carina Negreanu
Chris Parnin
Advait Sarkar
ELM
ALM
35
12
0
16 Aug 2024
ArabLegalEval: A Multitask Benchmark for Assessing Arabic Legal Knowledge in Large Language Models
Faris Hijazi
Somayah Alharbi
Abdulaziz AlHussein
Harethah Shairah
Reem Alzahrani
Hebah Alshamlan
Omar Knio
G. Turkiyyah
AILaw
ELM
41
2
0
15 Aug 2024
Decoding Biases: Automated Methods and LLM Judges for Gender Bias Detection in Language Models
Shachi H. Kumar
Saurav Sahay
Sahisnu Mazumder
Eda Okur
R. Manuvinakurike
Nicole Beckage
Hsuan Su
Hung-yi Lee
L. Nachman
ELM
26
15
0
07 Aug 2024
Self-Taught Evaluators
Tianlu Wang
Ilia Kulikov
O. Yu. Golovneva
Ping Yu
Weizhe Yuan
Jane Dwivedi-Yu
Richard Yuanzhe Pang
Maryam Fazel-Zarandi
Jason Weston
Xian Li
ALM
LRM
25
22
0
05 Aug 2024