ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2306.05685
  4. Cited By
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

9 June 2023
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
Yonghao Zhuang
Zi Lin
Zhuohan Li
Dacheng Li
Eric P. Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
    ALM
    OSLM
    ELM
ArXivPDFHTML

Papers citing "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"

50 / 2,880 papers shown
Title
Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation
Advancing Embodied Agent Security: From Safety Benchmarks to Input Moderation
Ning Wang
Zihan Yan
W. Li
Chuan Ma
H. Chen
Tao Xiang
AAML
35
0
0
22 Apr 2025
Trillion 7B Technical Report
Trillion 7B Technical Report
Sungjun Han
Juyoung Suk
Suyeong An
Hyungguk Kim
Kyuseok Kim
Wonsuk Yang
Seungtaek Choi
Jamin Shin
116
1
0
21 Apr 2025
EvalAgent: Discovering Implicit Evaluation Criteria from the Web
EvalAgent: Discovering Implicit Evaluation Criteria from the Web
Manya Wadhwa
Zayne Sprague
Chaitanya Malaviya
Philippe Laban
Junyi Jessy Li
Greg Durrett
34
0
0
21 Apr 2025
Synergistic Weak-Strong Collaboration by Aligning Preferences
Synergistic Weak-Strong Collaboration by Aligning Preferences
Yizhu Jiao
Xuchao Zhang
Zhaoyang Wang
Yubo Ma
Zhun Deng
Rujia Wang
Chetan Bansal
Saravan Rajmohan
Jiawei Han
Huaxiu Yao
133
0
0
21 Apr 2025
Establishing Reliability Metrics for Reward Models in Large Language Models
Establishing Reliability Metrics for Reward Models in Large Language Models
Yizhou Chen
Yawen Liu
Xuesi Wang
Qingtao Yu
Guangda Huzhang
Anxiang Zeng
Han Yu
Zhiming Zhou
30
0
0
21 Apr 2025
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
Yilun Zhou
Austin Xu
Peifeng Wang
Caiming Xiong
Shafiq R. Joty
ELM
ALM
LRM
53
2
0
21 Apr 2025
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models
The Great Nugget Recall: Automating Fact Extraction and RAG Evaluation with Large Language Models
Ronak Pradeep
Nandan Thakur
Shivani Upadhyay
Daniel Fernando Campos
Nick Craswell
Jimmy Lin
31
0
0
21 Apr 2025
Natural Fingerprints of Large Language Models
Natural Fingerprints of Large Language Models
Teppei Suzuki
Ryokan Ri
Sho Takase
30
0
0
21 Apr 2025
EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
EducationQ: Evaluating LLMs' Teaching Capabilities Through Multi-Agent Dialogue Framework
Yao Shi
Rongkeng Liang
Yong Xu
LLMAG
AI4Ed
ELM
67
0
0
21 Apr 2025
A Hierarchical Framework for Measuring Scientific Paper Innovation via Large Language Models
A Hierarchical Framework for Measuring Scientific Paper Innovation via Large Language Models
Hongming Tan
Shaoxiong Zhan
Fengwei Jia
Hai-Tao Zheng
Wai Kin Victor Chan
29
0
0
20 Apr 2025
Learning from Reasoning Failures via Synthetic Data Generation
Learning from Reasoning Failures via Synthetic Data Generation
Gabriela Ben-Melech Stan
Estelle Aflalo
Avinash Madasu
Vasudev Lal
Phillip Howard
SyDa
LRM
49
0
0
20 Apr 2025
PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines
Reya Vir
Shreya Shankar
Harrison Chase
Will Fu-Hinthorn
Aditya G. Parameswaran
AI4TS
32
0
0
20 Apr 2025
Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation
Pairwise or Pointwise? Evaluating Feedback Protocols for Bias in LLM-Based Evaluation
Tuhina Tripathi
Manya Wadhwa
Greg Durrett
S. Niekum
32
0
0
20 Apr 2025
Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Improving the Serving Performance of Multi-LoRA Large Language Models via Efficient LoRA and KV Cache Management
Hang Zhang
Jiuchen Shi
Yixiao Wang
Quan Chen
Yizhou Shan
Minyi Guo
33
0
0
19 Apr 2025
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space
Yicheng Chen
Yining Li
Kai Hu
Zerun Ma
Haochen Ye
Kai Chen
34
0
0
18 Apr 2025
Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
Multi-Stage Retrieval for Operational Technology Cybersecurity Compliance Using Large Language Models: A Railway Casestudy
Regan Bolton
Mohammadreza Sheikhfathollahi
Simon Parkinson
Dan Basher
Howard Parkinson
36
0
0
18 Apr 2025
CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation
CodeVisionary: An Agent-based Framework for Evaluating Large Language Models in Code Generation
Xinchen Wang
Pengfei Gao
Chao Peng
Ruida Hu
Cuiyun Gao
ELM
33
0
0
18 Apr 2025
Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results
Revisiting Uncertainty Quantification Evaluation in Language Models: Spurious Interactions with Response Length Bias Results
Andrea Santilli
Adam Goliñski
Michael Kirchhof
Federico Danieli
Arno Blaas
Miao Xiong
Luca Zappella
Sinead Williamson
23
0
0
18 Apr 2025
From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
From Large to Super-Tiny: End-to-End Optimization for Cost-Efficient LLMs
Jiliang Ni
Jiachen Pu
Zhongyi Yang
Kun Zhou
Hui Wang
Xiaoliang Xiao
Dakui Wang
Xin Li
Jingfeng Luo
Conggang Hu
37
0
0
18 Apr 2025
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model
D-GEN: Automatic Distractor Generation and Evaluation for Reliable Assessment of Generative Model
Grace Byun
Jinho D. Choi
EGVM
46
0
0
18 Apr 2025
MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks
MEQA: A Meta-Evaluation Framework for Question & Answer LLM Benchmarks
Jaime Raldua Veuthey
Zainab Ali Majid
Suhas Hariharan
Jacob Haimes
ELM
31
0
0
18 Apr 2025
Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering
Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering
Grace Byun
S. Lee
Nayoung Choi
Jinho D. Choi
32
0
0
18 Apr 2025
Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
Persona-judge: Personalized Alignment of Large Language Models via Token-level Self-judgment
Xiaotian Zhang
Ruizhe Chen
Yang Feng
Zuozhu Liu
40
0
0
17 Apr 2025
Efficient MAP Estimation of LLM Judgment Performance with Prior Transfer
Efficient MAP Estimation of LLM Judgment Performance with Prior Transfer
Huaizhi Qu
Inyoung Choi
Zhen Tan
Song Wang
Sukwon Yun
Qi Long
Faizan Siddiqui
Kwonjoon Lee
Tianlong Chen
43
0
0
17 Apr 2025
Benchmarking LLM-based Relevance Judgment Methods
Benchmarking LLM-based Relevance Judgment Methods
Negar Arabzadeh
Charles L. A. Clarke
35
0
0
17 Apr 2025
Could Thinking Multilingually Empower LLM Reasoning?
Could Thinking Multilingually Empower LLM Reasoning?
Changjiang Gao
Xu Huang
Wenhao Zhu
Shujian Huang
Lei Li
Fei Yuan
LRM
27
0
0
16 Apr 2025
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
LLM-as-a-Judge: Reassessing the Performance of LLMs in Extractive QA
Xanh Ho
Jiahao Huang
Florian Boudin
Akiko Aizawa
ELM
36
0
0
16 Apr 2025
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Self-alignment of Large Video Language Models with Refined Regularized Preference Optimization
Pritam Sarkar
Ali Etemad
31
0
0
16 Apr 2025
A Dual-Space Framework for General Knowledge Distillation of Large Language Models
A Dual-Space Framework for General Knowledge Distillation of Large Language Models
Xuzhi Zhang
Songming Zhang
Yunlong Liang
Fandong Meng
Yufeng Chen
Jinan Xu
Jie Zhou
26
0
0
15 Apr 2025
Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails
Bypassing Prompt Injection and Jailbreak Detection in LLM Guardrails
William Hackett
Lewis Birch
Stefan Trawicki
N. Suri
Peter Garraghan
32
2
0
15 Apr 2025
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Divyansh Garg
Shaun VanWeelden
Diego Caples
Andis Draguns
Nikil Ravi
...
Youngchul Joo
Jindong Gu
Charles London
Christian Schroeder de Witt
S. Motwani
44
1
0
15 Apr 2025
CHARM: Calibrating Reward Models With Chatbot Arena Scores
CHARM: Calibrating Reward Models With Chatbot Arena Scores
Xiao Zhu
Chenmien Tan
Pinzhen Chen
Rico Sennrich
Yanlin Zhang
Hanxu Hu
ALM
24
0
0
14 Apr 2025
The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
The Jailbreak Tax: How Useful are Your Jailbreak Outputs?
Kristina Nikolić
Luze Sun
Jie Zhang
F. Tramèr
25
0
0
14 Apr 2025
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models
Localized Cultural Knowledge is Conserved and Controllable in Large Language Models
V. Veselovsky
Berke Argin
Benedikt Stroebl
Chris Wendler
Robert West
James Evans
Thomas L. Griffiths
Arvind Narayanan
57
0
0
14 Apr 2025
LLM-driven Constrained Copy Generation through Iterative Refinement
LLM-driven Constrained Copy Generation through Iterative Refinement
Varun Vasudevan
Faezeh Akhavizadegan
Abhinav Prakash
Yokila Arora
Jason H. D. Cho
Tanya Mendiratta
Sushant Kumar
Kannan Achan
34
0
0
14 Apr 2025
Enhancing LLM-based Recommendation through Semantic-Aligned Collaborative Knowledge
Enhancing LLM-based Recommendation through Semantic-Aligned Collaborative Knowledge
Zihan Wang
Jinghao Lin
Xiaocui Yang
Yongkang Liu
Shi Feng
Daling Wang
Yuhang Zhang
21
0
0
14 Apr 2025
SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging
SilVar-Med: A Speech-Driven Visual Language Model for Explainable Abnormality Detection in Medical Imaging
Tan-Hanh Pham
Chris Ngo
Trong-Duong Bui
Minh Luu Quang
Tan-Huong Pham
Truong Son-Hy
29
1
0
14 Apr 2025
DICE: A Framework for Dimensional and Contextual Evaluation of Language Models
DICE: A Framework for Dimensional and Contextual Evaluation of Language Models
Aryan Shrivastava
Paula Akemi Aoyagui
29
0
0
14 Apr 2025
S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
S1-Bench: A Simple Benchmark for Evaluating System 1 Thinking Capability of Large Reasoning Models
Wenyuan Zhang
Shuaiyi Nie
Xinghua Zhang
Zefeng Zhang
Tingwen Liu
ELM
LRM
46
2
0
14 Apr 2025
Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning
Reasoning Court: Combining Reasoning, Action, and Judgment for Multi-Hop Reasoning
Jingtian Wu
Claire Cardie
LRM
29
0
0
14 Apr 2025
QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
QM-ToT: A Medical Tree of Thoughts Reasoning Framework for Quantized Model
Zongxian Yang
Jiayu Qian
Z. Huang
Kay Chen Tan
LM&MA
LRM
31
0
0
13 Apr 2025
BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning
BabyVLM: Data-Efficient Pretraining of VLMs Inspired by Infant Learning
Shengao Wang
Arjun Chandra
Aoming Liu
Venkatesh Saligrama
Boqing Gong
MLLM
VLM
47
0
0
13 Apr 2025
Evolved Hierarchical Masking for Self-Supervised Learning
Evolved Hierarchical Masking for Self-Supervised Learning
Zhanzhou Feng
Shiliang Zhang
49
0
0
12 Apr 2025
SynthTRIPs: A Knowledge-Grounded Framework for Benchmark Query Generation for Personalized Tourism Recommenders
SynthTRIPs: A Knowledge-Grounded Framework for Benchmark Query Generation for Personalized Tourism Recommenders
Ashmi Banerjee
Adithi Satish
Fitri Nur Aisyah
Wolfgang Wörndl
Yashar Deldjoo
AI4TS
33
0
0
12 Apr 2025
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
SIFT-50M: A Large-Scale Multilingual Dataset for Speech Instruction Fine-Tuning
Prabhat Pandey
R. Swaminathan
K V Vijay Girish
Arunasish Sen
Jian Xie
Grant P. Strimel
Andreas Schwarz
140
0
0
12 Apr 2025
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
A Comprehensive Survey of Reward Models: Taxonomy, Applications, Challenges, and Future
Jialun Zhong
Wei Shen
Yanzeng Li
Songyang Gao
Hua Lu
Yicheng Chen
Yang Zhang
Wei Zhou
Jinjie Gu
Lei Zou
LRM
45
2
0
12 Apr 2025
SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
SpecEE: Accelerating Large Language Model Inference with Speculative Early Exiting
Jiaming Xu
Jiayi Pan
Yongkang Zhou
Siming Chen
J. Li
Yaoxiu Lian
Junyi Wu
Guohao Dai
LRM
37
0
0
11 Apr 2025
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering
VLMT: Vision-Language Multimodal Transformer for Multimodal Multi-hop Question Answering
Qi Zhi Lim
C. Lee
K. Lim
Kalaiarasi Sonai Muthu Anbananthen
31
0
0
11 Apr 2025
On The Landscape of Spoken Language Models: A Comprehensive Survey
On The Landscape of Spoken Language Models: A Comprehensive Survey
Siddhant Arora
Kai-Wei Chang
Chung-Ming Chien
Yifan Peng
Haibin Wu
Yossi Adi
Emmanuel Dupoux
Hung-yi Lee
Karen Livescu
Shinji Watanabe
52
2
0
11 Apr 2025
SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
SAEs Can\textit{Can}Can Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs
Aashiq Muhamed
Jacopo Bonato
Mona Diab
Virginia Smith
MU
66
0
0
11 Apr 2025
Previous
123456...565758
Next