Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

9 June 2023
Lianmin Zheng
Wei-Lin Chiang
Ying Sheng
Siyuan Zhuang
Zhanghao Wu
Yonghao Zhuang
Zi Lin
Zhuohan Li
Dacheng Li
Eric P. Xing
Haotong Zhang
Joseph E. Gonzalez
Ion Stoica
    ALM
    OSLM
    ELM

Papers citing "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"

50 / 2,880 papers shown
HelpSteer3-Preference: Open Human-Annotated Preference Data across Diverse Tasks and Languages
Zhilin Wang
Jiaqi Zeng
Olivier Delalleau
Hoo-Chang Shin
Felipe Soares
Alexander Bukharin
Ellie Evans
Yi Dong
Oleksii Kuchaiev
17
0
0
16 May 2025
RanDeS: Randomized Delta Superposition for Multi-Model Compression
Hangyu Zhou
Aaron Gokaslan
Volodymyr Kuleshov
Bharath Hariharan
MoMe
22
0
0
16 May 2025
Finetune-RAG: Fine-Tuning Language Models to Resist Hallucination in Retrieval-Augmented Generation
Zhan Peng Lee
Andre Lin
Calvin Tan
RALM
HILM
25
0
0
16 May 2025
CAMEO: Collection of Multilingual Emotional Speech Corpora
Iwona Christop
Maciej Czajka
19
0
0
16 May 2025
Review-Instruct: A Review-Driven Multi-Turn Conversations Generation Method for Large Language Models
Jian Wu
Cong Wang
TianHuang Su
Jun Yang
Haozhi Lin
...
Steve Yang
BinQing Pan
Zehan Li
Ni Yang
ZhenYu Yang
ALM
14
0
0
16 May 2025
GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
Lingxiao Diao
Xinyue Xu
Wanxuan Sun
Cheng Yang
Zhuosheng Zhang
LLMAG
ALM
ELM
7
0
0
16 May 2025
Real-Time Verification of Embodied Reasoning for Generative Skill Acquisition
Bo Yue
Shuqi Guo
Kaiyu Hu
Chujiao Wang
Benyou Wang
Kui Jia
Guiliang Liu
LRM
17
0
0
16 May 2025
BLEUBERI: BLEU is a surprisingly effective reward for instruction following
Yapei Chang
Yekyung Kim
Michael Krumdick
Amir Zadeh
Chuan Li
Chris Tanner
Mohit Iyyer
ALM
17
0
0
16 May 2025
Towards Better Evaluation for Generated Patent Claims
Lekang Jiang
Pascal A Scherz
Stephan Goetz
ELM
25
0
0
16 May 2025
PIG: Privacy Jailbreak Attack on LLMs via Gradient-based Iterative In-Context Optimization
Yidan Wang
Yanan Cao
Yubing Ren
Fang Fang
Zheng-Shen Lin
Binxing Fang
PILM
44
0
0
15 May 2025
WorldPM: Scaling Human Preference Modeling
Binghui Wang
Runji Lin
K. Lu
L. Yu
Z. Zhang
...
Xuanjing Huang
Yu-Gang Jiang
Bowen Yu
J. Zhou
Junyang Lin
24
0
0
15 May 2025
J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning
Chenxi Whitehouse
Tianlu Wang
Ping Yu
Xian Li
Jason Weston
Ilia Kulikov
Swarnadeep Saha
ALM
ELM
LRM
19
0
0
15 May 2025
On the Evaluation of Engineering Artificial General Intelligence
Sandeep Neema
Susmit Jha
Adam Nagel
Ethan Lew
Chandrasekar Sureshkumar
Aleksa Gordic
Chase Shimmin
Hieu Nguygen
Paul Eremenko
ELM
19
0
0
15 May 2025
XRAG: Cross-lingual Retrieval-Augmented Generation
Wei Liu
Sony Trenous
Leonardo F. R. Ribeiro
Bill Byrne
Felix Hieber
RALM
26
0
0
15 May 2025
Pre-Act: Multi-Step Planning and Reasoning Improves Acting in LLM Agents
Mrinal Rawat
Ambuje Gupta
Rushil Goomer
Alessandro Di Bari
Neha Gupta
Roberto Pieraccini
LLMAG
LRM
23
0
0
15 May 2025
A Multimodal Multi-Agent Framework for Radiology Report Generation
Ziruo Yi
Ting Xiao
Mark V. Albert
MedIm
26
0
0
14 May 2025
Seeing Beyond the Scene: Enhancing Vision-Language Models with Interactional Reasoning
Dayong Liang
Changmeng Zheng
Zhiyuan Wen
Yi Cai
Xiao Wei
Qing Li
LRM
31
0
0
14 May 2025
WorldView-Bench: A Benchmark for Evaluating Global Cultural Perspectives in Large Language Models
Abdullah Mushtaq
Imran Taj
Rafay Naeem
Ibrahim Ghaznavi
Junaid Qadir
26
0
0
14 May 2025
Automated Meta Prompt Engineering for Alignment with the Theory of Mind
Aaron Baughman
Rahul Agarwal
Eduardo Morales
Gozde Akay
36
0
0
13 May 2025
LCES: Zero-shot Automated Essay Scoring via Pairwise Comparisons Using Large Language Models
Takumi Shibata
Yuichi Miyamura
29
0
0
13 May 2025
From Seeing to Doing: Bridging Reasoning and Decision for Robotic Manipulation
Yifu Yuan
Haiqin Cui
Yibin Chen
Zibin Dong
Fei Ni
Longxin Kou
Jinyi Liu
Pengyi Li
Yan Zheng
Jianye Hao
31
0
0
13 May 2025
On the Account Security Risks Posed by Password Strength Meters
Ming Xu
Weili Han
Jitao Yu
Xiaozhong Liu
Xiaotian Zhang
Yun Lin
J. Dong
29
0
0
13 May 2025
TRAIL: Trace Reasoning and Agentic Issue Localization
Darshan Deshpande
Varun Gangal
Hersh Mehta
Jitin Krishnan
Anand Kannappan
Rebecca Qian
25
0
0
13 May 2025
SAS-Bench: A Fine-Grained Benchmark for Evaluating Short Answer Scoring with Large Language Models
Peichao Lai
Kaipeng Zhang
Yi Lin
L. Zhang
Feiyang Ye
...
Yanwei Xu
Conghui He
Yixuan Wang
Wentao Zhang
Bin Cui
ELM
LRM
47
0
0
12 May 2025
Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
Xiaokun Wang
Chris
Jiangbo Pei
Wei Shen
Yi Peng
...
Ai Jian
Tianyidan Xie
Xuchen Song
Yang Liu
Yahui Zhou
OffRL
LRM
28
0
0
12 May 2025
A Case Study Investigating the Role of Generative AI in Quality Evaluations of Epics in Agile Software Development
Werner Geyer
Jessica He
Daita Sarkar
Michelle Brachman
Chris Hammond
Jennifer Heins
Zahra Ashktorab
Carlos Rosemberg
Charlie Hill
28
0
0
12 May 2025
How well do LLMs reason over tabular data, really?
Cornelius Wolff
Madelon Hulsebos
LMTD
ELM
LRM
45
0
0
12 May 2025
SpecRouter: Adaptive Routing for Multi-Level Speculative Decoding in Large Language Models
Hang Wu
Jianian Zhu
Yongqian Li
Haojie Wang
Biao Hou
Jidong Zhai
40
0
0
12 May 2025
PLHF: Prompt Optimization with Few-Shot Human Feedback
Chun-Pai Yang
Kan Zheng
Shou-De Lin
21
0
0
11 May 2025
Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety
Zihan Guan
Mengxuan Hu
Ronghang Zhu
Sheng Li
Anil Vullikanti
AAML
31
0
0
11 May 2025
QoS-Efficient Serving of Multiple Mixture-of-Expert LLMs Using Partial Runtime Reconfiguration
HamidReza Imani
Jiaxin Peng
Peiman Mohseni
Abdolah Amirany
Tarek A. El-Ghazawi
MoE
31
0
0
10 May 2025
xGen-small Technical Report
Erik Nijkamp
Bo Pang
Egor Pakhomov
Akash Gokul
Jin Qu
Silvio Savarese
Yingbo Zhou
Caiming Xiong
LLMAG
58
0
0
10 May 2025
The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization
Jae-Won Chung
Jiachen Liu
Jeff J. Ma
Ruofan Wu
Oh Jun Kweon
Yuxuan Xia
Zhiyu Wu
Mosharaf Chowdhury
28
0
0
09 May 2025
Stability in Single-Peaked Strategic Resource Selection Games
Henri Zeiler
32
3
0
09 May 2025
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban
Hiroaki Hayashi
Yingbo Zhou
Jennifer Neville
42
1
0
09 May 2025
Healthy LLMs? Benchmarking LLM Knowledge of UK Government Public Health Information
Joshua Harris
Fan Grayson
Felix Feldman
Timothy Laurence
Toby Nonnenmacher
...
Leo Loman
Selina Patel
Thomas Finnie
Samuel Collins
Michael Borowitz
AI4MH
LM&MA
ELM
49
0
0
09 May 2025
clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
Chalamalasetti Kranti
Sherzod Hakimov
David Schlangen
LLMAG
49
0
0
08 May 2025
Scaling Laws for Speculative Decoding
Siyuan Yan
Mo Zhu
Guo-qing Jiang
Jianfei Wang
Jiaxing Chen
...
Xiang Liao
Xiao Cui
Chen Zhang
Zhuoran Song
Ran Zhu
LRM
48
0
0
08 May 2025
LiteLMGuard: Seamless and Lightweight On-Device Prompt Filtering for Safeguarding Small Language Models against Quantization-induced Risks and Vulnerabilities
Kalyan Nakka
Jimmy Dani
Ausmit Mondal
Nitesh Saxena
AAML
30
0
0
08 May 2025
To Judge or not to Judge: Using LLM Judgements for Advertiser Keyphrase Relevance at eBay
Soumik Dey
Hansi Wu
Binbin Li
45
0
0
07 May 2025
OBLIVIATE: Robust and Practical Machine Unlearning for Large Language Models
Xiaoyu Xu
Minxin Du
Qingqing Ye
Haibo Hu
MU
57
0
0
07 May 2025
LLAMAPIE: Proactive In-Ear Conversation Assistants
Tuochao Chen
Nicholas Batchelder
Alisa Liu
Noah A. Smith
Shyamnath Gollakota
128
0
0
07 May 2025
A Hashgraph-Inspired Consensus Mechanism for Reliable Multi-Model Reasoning
Kolawole E. Ogunsina
Morayo A. Ogunsina
41
0
0
06 May 2025
Bielik 11B v2 Technical Report
Krzysztof Ociepa
Łukasz Flis
Krzysztof Wróbel
Adrian Gwoździej
Remigiusz Kinas
34
0
0
05 May 2025
RM-R1: Reward Modeling as Reasoning
Xiusi Chen
Gaotang Li
Zehua Wang
Bowen Jin
Cheng Qian
...
Y. Zhang
D. Zhang
Tong Zhang
Hanghang Tong
Heng Ji
ReLM
OffRL
LRM
165
1
0
05 May 2025
Sailing AI by the Stars: A Survey of Learning from Rewards in Post-Training and Test-Time Scaling of Large Language Models
Xiaobao Wu
LRM
72
1
0
05 May 2025
SIMPLEMIX: Frustratingly Simple Mixing of Off- and On-policy Data in Language Model Preference Learning
Tianjian Li
Daniel Khashabi
55
0
0
05 May 2025
Improving Model Alignment Through Collective Intelligence of Open-Source LLMS
Junlin Wang
Roy Xie
Shang Zhu
Jue Wang
Ben Athiwaratkun
Bhuwan Dhingra
Shuaiwen Leon Song
Ce Zhang
James Zou
ALM
31
0
0
05 May 2025
Think on your Feet: Adaptive Thinking via Reinforcement Learning for Social Agents
Minzheng Wang
Y. Li
Haozhao Wang
Xinghua Zhang
Nan Xu
Bingli Wu
Fei Huang
Haiyang Yu
Wenji Mao
LLMAG
LRM
43
1
0
04 May 2025
R-Bench: Graduate-level Multi-disciplinary Benchmarks for LLM & MLLM Complex Reasoning Evaluation
Meng-Hao Guo
Jiajun Xu
Yi Zhang
Jiaxi Song
Haoyang Peng
...
Yongming Rao
Houwen Peng
Han Hu
Gordon Wetzstein
Shi-Min Hu
ELM
LRM
57
0
0
04 May 2025