MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues

22 February 2024
Ge Bai
Jie Liu
Xingyuan Bu
Yancheng He
Jiaheng Liu
Zhanhui Zhou
Zhuoran Lin
Wenbo Su
Tiezheng Ge
Bo Zheng
Wanli Ouyang
    ELM
    LM&MA

Papers citing "MT-Bench-101: A Fine-Grained Benchmark for Evaluating Large Language Models in Multi-Turn Dialogues"

24 / 24 papers shown
TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande
Varun Gangal
Hersh Mehta
Jitin Krishnan
Anand Kannappan
Rebecca Qian
16
0
0
13 May 2025
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban
Hiroaki Hayashi
Yingbo Zhou
Jennifer Neville
27
0
0
09 May 2025
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
0
0
26 Apr 2025
DREAM: Disentangling Risks to Enhance Safety Alignment in Multimodal Large Language Models
J. Liu
Hangyu Guo
Ranjie Duan
Xingyuan Bu
Yancheng He
...
Yingshui Tan
Yanan Wu
Jihao Gu
Y. Li
J. Zhu
MLLM
67
0
0
25 Apr 2025
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan
Yufei Xu
Zhengyin Du
Xuesong Yao
Z. Wang
Xiaowen Guo
Jiecao Chen
ReLM
ELM
LRM
90
3
0
01 Apr 2025
Why Do Multi-Agent LLM Systems Fail?
Mert Cemri
Melissa Z. Pan
Shuyi Yang
Lakshya A Agrawal
Bhavya Chopra
...
Dan Klein
Kannan Ramchandran
Matei A. Zaharia
Joseph E. Gonzalez
Ion Stoica
LLMAG
Presented at ResearchTrend Connect | LLMAG on 23 Apr 2025
120
5
0
17 Mar 2025
Can Language Models Follow Multiple Turns of Entangled Instructions?
Chi Han
ELM
LRM
43
1
0
17 Mar 2025
CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation
Peiding Wang
L. Zhang
Fang Liu
Lin Shi
Minxiao Li
Bo Shen
An Fu
ELM
LRM
54
0
0
05 Mar 2025
Synthesizing Post-Training Data for LLMs through Multi-Agent Simulation
Shuo Tang
Xianghe Pang
Zexi Liu
Bohan Tang
Rui Ye
Xiaowen Dong
Y. Wang
Yanfeng Wang
S. Chen
SyDa
LLMAG
124
3
0
21 Feb 2025
Multi-Step Alignment as Markov Games: An Optimistic Online Gradient Descent Approach with Convergence Guarantees
Yongtao Wu
Luca Viano
Yihang Chen
Zhenyu Zhu
Kimon Antonakopoulos
Quanquan Gu
V. Cevher
49
0
0
18 Feb 2025
InfoQuest: Evaluating Multi-Turn Dialogue Agents for Open-Ended Conversations with Hidden Context
Bryan L. M. de Oliveira
Luana G. B. Martins
Bruno Brandão
L. Melo
ELM
90
1
0
17 Feb 2025
Model Tampering Attacks Enable More Rigorous Evaluations of LLM Capabilities
Zora Che
Stephen Casper
Robert Kirk
Anirudh Satheesh
Stewart Slocum
...
Zikui Cai
Bilal Chughtai
Y. Gal
Furong Huang
Dylan Hadfield-Menell
MU
AAML
ELM
80
2
0
03 Feb 2025
CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants
Lize Alberts
Benjamin Ellis
Andrei Lupu
Jakob Foerster
ELM
34
0
0
28 Oct 2024
Revisiting Benchmark and Assessment: An Agent-based Exploratory Dynamic Evaluation Framework for LLMs
Wanying Wang
Zeyu Ma
Pengfei Liu
Mingang Chen
LLMAG
45
1
0
15 Oct 2024
PingPong: A Benchmark for Role-Playing Language Models with User Emulation and Multi-Model Evaluation
Ilya Gusev
LLMAG
52
3
0
10 Sep 2024
GraCoRe: Benchmarking Graph Comprehension and Complex Reasoning in Large Language Models
Zike Yuan
Ming Liu
Hui Wang
Bing Qin
LRM
ELM
44
2
0
03 Jul 2024
UniCoder: Scaling Code Large Language Model via Universal Code
Tao Sun
Linzheng Chai
Jian Yang
Yuwei Yin
Hongcheng Guo
Jiaheng Liu
Bing Wang
Liqun Yang
Zhoujun Li
OffRL
LRM
60
16
0
24 Jun 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim
Juyoung Suk
Ji Yong Cho
Shayne Longpre
Chaeeun Kim
...
Sean Welleck
Graham Neubig
Moontae Lee
Kyungjae Lee
Minjoon Seo
ELM
ALM
LM&MA
90
28
0
09 Jun 2024
Weak-to-Strong Search: Align Large Language Models via Searching over Small Language Models
Zhanhui Zhou
Zhixuan Liu
Jie Liu
Zhichen Dong
Chao Yang
Yu Qiao
ALM
36
20
0
29 May 2024
Task-Agnostic Detector for Insertion-Based Backdoor Attacks
Weimin Lyu
Xiao Lin
Songzhu Zheng
Lu Pang
Haibin Ling
Susmit Jha
Chao Chen
43
25
0
25 Mar 2024
ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
Yanan Wu
Jie Liu
Xingyuan Bu
Jiaheng Liu
Zhanhui Zhou
...
Haibin Chen
Tiezheng Ge
Wanli Ouyang
Wenbo Su
Bo Zheng
LRM
27
6
0
22 Feb 2024
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
Xingyao Wang
Zihan Wang
Jiateng Liu
Yangyi Chen
Lifan Yuan
Hao Peng
Heng Ji
LRM
125
137
0
19 Sep 2023
OWL: A Large Language Model for IT Operations
Hongcheng Guo
Jian Yang
Jiaheng Liu
Liqun Yang
Linzheng Chai
...
Tieqiao Zheng
Liangfan Zheng
Bo-Wen Zhang
Ke Xu
Zhoujun Li
VLM
63
40
0
17 Sep 2023
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022