ResearchTrend.AI
PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization

8 June 2023
Yidong Wang
Zhuohao Yu
Zhengran Zeng
Linyi Yang
Cunxiang Wang
Hao Chen
Chaoya Jiang
Rui Xie
Jindong Wang
Xingxu Xie
Wei Ye
Shi-Bo Zhang
Yue Zhang
    ALM
    ELM

Papers citing "PandaLM: An Automatic Evaluation Benchmark for LLM Instruction Tuning Optimization"

48 / 48 papers shown
am-ELO: A Stable Framework for Arena-based LLM Evaluation
Zirui Liu
Jiatong Li
Yan Zhuang
Q. Liu
Shuanghong Shen
Jie Ouyang
Mingyue Cheng
Shijin Wang
32
0
0
06 May 2025
LecEval: An Automated Metric for Multimodal Knowledge Acquisition in Multimedia Learning
Joy Lim Jia Yin
Daniel Zhang-Li
Jifan Yu
H. Li
Shangqing Tu
...
Zhiyuan Liu
Huiqin Liu
Lei Hou
Juanzi Li
Bin Xu
24
0
0
04 May 2025
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Bang Zhang
Ruotian Ma
Qingxuan Jiang
Peisong Wang
Jiaqi Chen
...
Fanghua Ye
Jian Li
Yifan Yang
Zhaopeng Tu
Xiaolong Li
LLMAG
ELM
ALM
102
25
1
01 May 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
0
0
26 Apr 2025
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Yilun Zhou
Austin Xu
Peifeng Wang
Caiming Xiong
Shafiq R. Joty
ELM
ALM
LRM
45
2
0
21 Apr 2025
SPHERE: An Evaluation Card for Human-AI Systems
Qianou Ma
Dora Zhao
Xinran Zhao
Chenglei Si
Chenyang Yang
Ryan Louie
Ehud Reiter
Diyi Yang
Tongshuang Wu
ALM
50
0
0
24 Mar 2025
Learning to Align Multi-Faceted Evaluation: A Unified and Robust Framework
Kaishuai Xu
Tiezheng Yu
Wenjun Hou
Yi Cheng
Liangyou Li
Xin Jiang
Lifeng Shang
Q. Liu
Wenjie Li
ELM
66
0
0
26 Feb 2025
PiCO: Peer Review in LLMs based on the Consistency Optimization
Kun-Peng Ning
Shuo Yang
Yu-Yang Liu
Jia-Yu Yao
Zhen-Hui Liu
Yu Wang
Ming Pang
Li Yuan
ALM
69
8
0
24 Feb 2025
Rate, Explain and Cite (REC): Enhanced Explanation and Attribution in Automatic Evaluation by Large Language Models
Aliyah R. Hsu
James Zhu
Zhichao Wang
Bin Bi
Shubham Mehrotra
...
Sougata Chaudhuri
Regunathan Radhakrishnan
S. Asur
Claire Na Cheng
Bin Yu
ALM
LRM
67
0
0
20 Feb 2025
Savaal: Scalable Concept-Driven Question Generation to Enhance Human Learning
Kimia Noorbakhsh
Joseph Chandler
Pantea Karimi
M. Alizadeh
H. Balakrishnan
LRM
44
1
0
18 Feb 2025
NOTA: Multimodal Music Notation Understanding for Visual Large Language Model
Mingni Tang
Jiajia Li
Lu Yang
Zhiqiang Zhang
Jinghao Tian
Z. Li
L. Zhang
P. Wang
51
0
0
17 Feb 2025
Uncertainty-Aware Step-wise Verification with Generative Reward Models
Zihuiwen Ye
L. Melo
Younesse Kaddar
Phil Blunsom
S. Kamath S
Yarin Gal
LRM
44
0
0
16 Feb 2025
Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models
Yulei Qin
Yuncheng Yang
Pengcheng Guo
Gang Li
Hang Shao
Yuchen Shi
Zihan Xu
Yun Gu
Ke Li
Xing Sun
ALM
88
11
0
31 Dec 2024
From Generation to Judgment: Opportunities and Challenges of LLM-as-a-judge
Dawei Li
Bohan Jiang
Liangjie Huang
Alimohammad Beigi
Chengshuai Zhao
...
Canyu Chen
Tianhao Wu
Kai Shu
Lu Cheng
Huan Liu
ELM
AILaw
108
63
0
25 Nov 2024
JudgeBench: A Benchmark for Evaluating LLM-based Judges
Sijun Tan
Siyuan Zhuang
Kyle Montgomery
William Y. Tang
Alejandro Cuadron
Chenguang Wang
Raluca A. Popa
Ion Stoica
ELM
ALM
51
36
0
16 Oct 2024
Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs
Wanying Wang
Zeyu Ma
Pengfei Liu
Mingang Chen
LLMAG
45
1
0
15 Oct 2024
4-LEGS: 4D Language Embedded Gaussian Splatting
Gal Fiebelman
Tamir Cohen
Ayellet Morgenstern
Peter Hedman
Hadar Averbuch-Elor
3DGS
41
1
0
14 Oct 2024
EasyJudge: an Easy-to-use Tool for Comprehensive Response Evaluation of LLMs
Yijie Li
Yuan Sun
ELM
26
0
0
13 Oct 2024
RevisEval: Improving LLM-as-a-Judge via Response-Adapted References
Qiyuan Zhang
Yufei Wang
Tiezheng Yu
Yuxin Jiang
Chuhan Wu
...
Xin Jiang
Lifeng Shang
Ruiming Tang
Fuyuan Lyu
Chen Ma
26
4
0
07 Oct 2024
Bridging Context Gaps: Leveraging Coreference Resolution for Long Contextual Understanding
Yanming Liu
Xinyue Peng
Jiannan Cao
Shi Bo
Yanxin Shen
Tianyu Du
Sheng Cheng
Xun Wang
Jianwei Yin
Xuhong Zhang
63
9
0
02 Oct 2024
From Calculation to Adjudication: Examining LLM judges on Mathematical Reasoning Tasks
Andreas Stephan
D. Zhu
Matthias Aßenmacher
Xiaoyu Shen
Benjamin Roth
ELM
45
4
0
06 Sep 2024
Systematic Evaluation of LLM-as-a-Judge in LLM Alignment Tasks: Explainable Metrics and Diverse Prompt Templates
Hui Wei
Shenghua He
Tian Xia
Andy H. Wong
Jingyang Lin
Mei Han
ALM
ELM
59
23
0
23 Aug 2024
Threshold Filtering Packing for Supervised Fine-Tuning: Training Related Samples within Packs
Jiancheng Dong
Lei Jiang
Wei Jin
Lu Cheng
36
1
0
18 Aug 2024
OffsetBias: Leveraging Debiased Data for Tuning Evaluators
Junsoo Park
Seungyeon Jwa
Meiying Ren
Daeyoung Kim
Sanghyuk Choi
ALM
34
30
0
09 Jul 2024
CMMaTH: A Chinese Multi-modal Math Skill Evaluation Benchmark for Foundation Models
Zhong-Zhi Li
Ming-Liang Zhang
Fei Yin
Zhi-Long Ji
Jin-Feng Bai
Zhen-Ru Pan
Fan-Hu Zeng
Jian Xu
Jia-Xin Zhang
Cheng-Lin Liu
ELM
28
10
0
28 Jun 2024
Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions
Zhe Hu
Tuo Liang
Jing Li
Yiren Lu
Yunlai Zhou
Yiran Qiao
Jing Ma
Yu Yin
36
4
0
29 May 2024
Fennec: Fine-grained Language Model Evaluation and Correction Extended through Branching and Bridging
Xiaobo Liang
Haoke Zhang
Helan Hu
Juntao Li
Jun Xu
Min Zhang
ALM
38
2
0
20 May 2024
NovelQA: Benchmarking Question Answering on Documents Exceeding 200K Tokens
Cunxiang Wang
Ruoxi Ning
Boqi Pan
Tonghui Wu
Qipeng Guo
...
Guangsheng Bao
Xiangkun Hu
Zheng Zhang
Qian Wang
Yue Zhang
RALM
77
3
0
18 Mar 2024
On the Essence and Prospect: An Investigation of Alignment Approaches for Big Models
Xinpeng Wang
Shitong Duan
Xiaoyuan Yi
Jing Yao
Shanlin Zhou
Zhihua Wei
Peng Zhang
Dongkuan Xu
Maosong Sun
Xing Xie
OffRL
33
16
0
07 Mar 2024
Natural Language Reinforcement Learning
Xidong Feng
Ziyu Wan
Mengyue Yang
Ziyan Wang
Girish A. Koushiks
Yali Du
Ying Wen
Jun Wang
OffRL
35
3
0
11 Feb 2024
A General Framework for Learning from Weak Supervision
Hao Chen
Jindong Wang
Lei Feng
Xiang Li
Yidong Wang
Xing Xie
Masashi Sugiyama
Rita Singh
Bhiksha Raj
19
2
0
02 Feb 2024
LLM-based NLG Evaluation: Current Status and Challenges
Mingqi Gao
Xinyu Hu
Jie Ruan
Xiao Pu
Xiaojun Wan
ELM
LM&MA
53
29
0
02 Feb 2024
PRE: A Peer Review Based Large Language Model Evaluator
Zhumin Chu
Qingyao Ai
Yiteng Tu
Haitao Li
Yiqun Liu
LRM
ALM
28
21
0
28 Jan 2024
The Critique of Critique
Shichao Sun
Junlong Li
Weizhe Yuan
Ruifeng Yuan
Wenjie Li
Pengfei Liu
ELM
32
0
0
09 Jan 2024
Inherent limitations of LLMs regarding spatial information
He Yan
Xinyao Hu
Xiangpeng Wan
Chengyu Huang
Kai Zou
Shiqi Xu
LRM
28
2
0
05 Dec 2023
CoachLM: Automatic Instruction Revisions Improve the Data Quality in LLM Instruction Tuning
Yilun Liu
Shimin Tao
Xiaofeng Zhao
Ming Zhu
Wenbing Ma
...
Min Zhang
Hongxia Ma
Li Zhang
Hao-Yu Yang
Yanfei Jiang
26
10
0
22 Nov 2023
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
Lianghui Zhu
Xinggang Wang
Xinlong Wang
ELM
ALM
54
106
0
26 Oct 2023
Instruction Tuning with Human Curriculum
Bruce W. Lee
Hyunsoo Cho
Kang Min Yoo
35
3
0
14 Oct 2023
Generative Judge for Evaluating Alignment
Junlong Li
Shichao Sun
Weizhe Yuan
Run-Ze Fan
Hai Zhao
Pengfei Liu
ELM
ALM
17
76
0
09 Oct 2023
Foundation Metrics for Evaluating Effectiveness of Healthcare Conversations Powered by Generative AI
Mahyar Abbasian
Elahe Khatibi
Iman Azimi
David Oniani
Zahra Shakeri Hossein Abad
...
Bryant Lin
Olivier Gevaert
Li-Jia Li
Ramesh C. Jain
Amir M. Rahmani
LM&MA
ELM
AI4MH
23
65
0
21 Sep 2023
Beyond Static Datasets: A Deep Interaction Approach to LLM Evaluation
Jiatong Li
Rui Li
Qi Liu
21
14
0
08 Sep 2023
LLM-Mini-CEX: Automatic Evaluation of Large Language Model for Diagnostic Conversation
Xiaoming Shi
J. Xu
Jinru Ding
Jiali Pang
Sichen Liu
...
Lu Lu
Haihong Yang
Mingtao Hu
Tong Ruan
Shaoting Zhang
LM&MA
ELM
21
12
0
15 Aug 2023
CValues: Measuring the Values of Chinese Large Language Models from Safety to Responsibility
Guohai Xu
Jiayi Liu
Mingshi Yan
Haotian Xu
Jinghui Si
...
Rong Zhang
Ji Zhang
Chao Peng
Feiyan Huang
Jingren Zhou
ALM
ELM
26
72
0
19 Jul 2023
Instruction Tuning with GPT-4
Baolin Peng
Chunyuan Li
Pengcheng He
Michel Galley
Jianfeng Gao
SyDa
ALM
LM&MA
157
579
0
06 Apr 2023
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng
Xiao Liu
Zhengxiao Du
Zihan Wang
Hanyu Lai
...
Jidong Zhai
Wenguang Chen
Peng-Zhen Zhang
Yuxiao Dong
Jie Tang
BDL
LRM
245
1,071
0
05 Oct 2022
FreeMatch: Self-adaptive Thresholding for Semi-supervised Learning
Yidong Wang
Hao Chen
Qiang Heng
Wenxin Hou
Yue Fan
...
Marios Savvides
T. Shinozaki
Bhiksha Raj
Bernt Schiele
Xing Xie
182
256
0
15 May 2022
ZeRO-Offload: Democratizing Billion-Scale Model Training
Jie Ren
Samyam Rajbhandari
Reza Yazdani Aminabadi
Olatunji Ruwase
Shuangyang Yang
Minjia Zhang
Dong Li
Yuxiong He
MoE
160
413
0
18 Jan 2021
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
Alex Jinpeng Wang
Amanpreet Singh
Julian Michael
Felix Hill
Omer Levy
Samuel R. Bowman
ELM
294
6,943
0
20 Apr 2018