AgentBench: Evaluating LLMs as Agents


7 August 2023
Xiao Liu
Hao Yu
Hanchen Zhang
Yifan Xu
Xuanyu Lei
Hanyu Lai
Yu Gu
Hangliang Ding
Kai Men
Kejuan Yang
Shudan Zhang
Xiang Deng
Aohan Zeng
Zhengxiao Du
Chenhui Zhang
Sheng Shen
S. Shen
Yu-Chuan Su
Huan Sun
Minlie Huang
Yuxiao Dong
Jie Tang
    ELM
    LLMAG
arXiv: 2308.03688

Papers citing "AgentBench: Evaluating LLMs as Agents"

50 / 56 papers shown
Toward Generalizable Evaluation in the LLM Era: A Survey Beyond Benchmarks
Yixin Cao
Shibo Hong
X. Li
Jiahao Ying
Yubo Ma
...
Juanzi Li
Aixin Sun
Xuanjing Huang
Tat-Seng Chua
Yu Jiang
ALM
ELM
84
0
0
26 Apr 2025
Enhancing LLM-Based Agents via Global Planning and Hierarchical Execution
Junjie Chen
H. Li
Jingli Yang
Y. Liu
Qingyao Ai
LLMAG
80
0
0
23 Apr 2025
A Survey of AI Agent Protocols
Y. Yang
Huacan Chai
Y. Song
S. Qi
Muning Wen
...
Gaowei Chang
W. Liu
Ying Wen
Yong Yu
W. Zhang
LLMAG
59
1
0
23 Apr 2025
Safe Screening Rules for Group OWL Models
Runxue Bao
Quanchao Lu
Yanfu Zhang
34
0
0
04 Apr 2025
Recitation over Reasoning: How Cutting-Edge Language Models Can Fail on Elementary School-Level Reasoning Problems?
Kai Yan
Yufei Xu
Zhengyin Du
Xuesong Yao
Z. Wang
Xiaowen Guo
Jiecao Chen
ReLM
ELM
LRM
87
3
0
01 Apr 2025
Measuring AI Ability to Complete Long Tasks
Thomas Kwa
Ben West
Joel Becker
Amy Deng
Katharyn Garcia
...
Lucas Jun Koba Sato
H. Wijk
Daniel M. Ziegler
Elizabeth Barnes
Lawrence Chan
ELM
72
6
0
18 Mar 2025
FACT-AUDIT: An Adaptive Multi-Agent Framework for Dynamic Fact-Checking Evaluation of Large Language Models
Hongzhan Lin
Yang Deng
Yuxuan Gu
Wenxuan Zhang
Jing Ma
See-Kiong Ng
Tat-Seng Chua
LLMAG
KELM
HILM
63
0
0
25 Feb 2025
EDGE: Efficient Data Selection for LLM Agents via Guideline Effectiveness
Yunxiao Zhang
Guanming Xiong
Haochen Li
Wen Zhao
LLMAG
59
0
0
18 Feb 2025
AgentStudio: A Toolkit for Building General Virtual Agents
Longtao Zheng
Zhiyuan Huang
Zhenghai Xue
Xinrun Wang
Bo An
Shuicheng Yan
75
14
0
17 Feb 2025
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He
Rui Mao
Qika Lin
Yucheng Ruan
Xiang Lan
Mengling Feng
Erik Cambria
LM&MA
AILaw
79
148
0
28 Jan 2025
WebRL: Training LLM Web Agents via Self-Evolving Online Curriculum Reinforcement Learning
Zehan Qi
Xiao-Chang Liu
Iat Long Iong
Hanyu Lai
X. Sun
...
Shuntian Yao
Tianjie Zhang
Wei Xu
J. Tang
Yuxiao Dong
93
14
0
28 Jan 2025
Automated Generation of Challenging Multiple-Choice Questions for Vision Language Model Evaluation
Yuhui Zhang
Yuchang Su
Yiming Liu
Xiaohan Wang
James Burgess
...
Josiah Aklilu
Alejandro Lozano
Anjiang Wei
Ludwig Schmidt
Serena Yeung-Levy
50
3
0
06 Jan 2025
Exposing Limitations of Language Model Agents in Sequential-Task Compositions on the Web
Hiroki Furuta
Yutaka Matsuo
Aleksandra Faust
Izzeddin Gur
CLL
78
13
0
03 Jan 2025
Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots
H. Zhang
Xiaoman Pan
Hongwei Wang
Kaixin Ma
W. Yu
Dong Yu
LLMAG
52
3
0
03 Jan 2025
Beyond Text: Implementing Multimodal Large Language Model-Powered Multi-Agent Systems Using a No-Code Platform
Cheonsu Jeong
70
0
0
01 Jan 2025
SPaR: Self-Play with Tree-Search Refinement to Improve Instruction-Following in Large Language Models
Jiale Cheng
Xiao-Chang Liu
C. Wang
Xiaotao Gu
Y. Lu
Dan Zhang
Yuxiao Dong
J. Tang
Hongning Wang
Minlie Huang
LRM
117
3
0
16 Dec 2024
OBI-Bench: Can LMMs Aid in Study of Ancient Script on Oracle Bones?
Z. Chen
Tingzhu Chen
Wenjun Zhang
Guangtao Zhai
82
3
0
02 Dec 2024
BALROG: Benchmarking Agentic LLM and VLM Reasoning On Games
Davide Paglieri
Bartłomiej Cupiał
Samuel Coward
Ulyana Piterbarg
Maciej Wolczyk
...
Lerrel Pinto
Rob Fergus
Jakob Foerster
Jack Parker-Holder
Tim Rocktaschel
LLMAG
LRM
101
10
0
20 Nov 2024
CURATe: Benchmarking Personalised Alignment of Conversational AI Assistants
Lize Alberts
Benjamin Ellis
Andrei Lupu
Jakob Foerster
ELM
34
0
0
28 Oct 2024
FALCON: Feedback-driven Adaptive Long/short-term memory reinforced Coding Optimization system
Zeyuan Li
Yangfan He
Lewei He
Jianhui Wang
Tianyu Shi
Bin Lei
Yuchen Li
Qiuwu Chen
ALM
48
5
0
28 Oct 2024
AutoBench-V: Can Large Vision-Language Models Benchmark Themselves?
Han Bao
Yue Huang
Yanbo Wang
Jiayi Ye
Xiangqi Wang
Xiuying Chen
Mohamed Elhoseiny
Xiangliang Zhang
47
7
0
28 Oct 2024
Beyond Browsing: API-Based Web Agents
Yueqi Song
Frank F. Xu
Shuyan Zhou
Graham Neubig
39
13
0
21 Oct 2024
SPA-Bench: A Comprehensive Benchmark for SmartPhone Agent Evaluation
Jingxuan Chen
Derek Yuen
Bin Xie
Y. Yang
Gongwei Chen
...
Liqiang Nie
Yasheng Wang
Jianye Hao
Jun Wang
Kun Shao
LLMAG
31
5
0
19 Oct 2024
ProcBench: Benchmark for Multi-Step Reasoning and Following Procedure
Ippei Fujisawa
Sensho Nobe
Hiroki Seto
Rina Onda
Yoshiaki Uchida
Hiroki Ikoma
Pei-Chun Chien
Ryota Kanai
LRM
34
3
0
04 Oct 2024
ToolGen: Unified Tool Retrieval and Calling via Generation
Renxi Wang
Xudong Han
Lei Ji
Shu Wang
Timothy Baldwin
Haonan Li
LLMAG
58
6
0
04 Oct 2024
Self-evolving Agents with reflective and memory-augmented abilities
Xuechen Liang
Yangfan He
Yinghui Xia
Xinyuan Song
Jianhui Wang
...
Keqin Li
Jiaqi Chen
Jinsong Yang
Siyuan Chen
Tianyu Shi
LLMAG
KELM
CLL
33
2
0
01 Sep 2024
AgentGen: Enhancing Planning Abilities for Large Language Model based Agent via Environment and Task Generation
Mengkang Hu
Yixiao Wang
Can Xu
Lingfeng Sun
Chensheng Peng
T. Hannagan
Nicola Poerio
Saravan Rajmohan
LM&Ro
LLMAG
58
14
0
01 Aug 2024
Breaking Agents: Compromising Autonomous LLM Agents Through Malfunction Amplification
Boyang Zhang
Yicong Tan
Yun Shen
Ahmed Salem
Michael Backes
Savvas Zannettou
Yang Zhang
LLMAG
AAML
38
12
0
30 Jul 2024
VRSD: Rethinking Similarity and Diversity for Retrieval in Large Language Models
Hang Gao
Yongfeng Zhang
41
2
0
05 Jul 2024
AI Agents That Matter
Sayash Kapoor
Benedikt Stroebl
Zachary S. Siegel
Nitya Nadgir
Arvind Narayanan
38
32
0
01 Jul 2024
Advancing Tool-Augmented Large Language Models: Integrating Insights from Errors in Inference Trees
Sijia Chen
Yibo Wang
Yi-Feng Wu
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
Lijun Zhang
LLMAG
LRM
46
10
0
11 Jun 2024
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
Seungone Kim
Juyoung Suk
Ji Yong Cho
Shayne Longpre
Chaeeun Kim
...
Sean Welleck
Graham Neubig
Moontae Lee
Kyungjae Lee
Minjoon Seo
ELM
ALM
LM&MA
90
28
0
09 Jun 2024
Learning to Discuss Strategically: A Case Study on One Night Ultimate Werewolf
Xuanfa Jin
Ziyan Wang
Yali Du
Meng Fang
Haifeng Zhang
Jun Wang
OffRL
LLMAG
40
5
0
30 May 2024
Reinforcing Language Agents via Policy Optimization with Action Decomposition
Muning Wen
Ziyu Wan
Weinan Zhang
Jun Wang
Ying Wen
33
7
0
23 May 2024
A Survey on Self-Evolution of Large Language Models
Zhengwei Tao
Ting-En Lin
Xiancai Chen
Hangyu Li
Yuchuan Wu
Yongbin Li
Zhi Jin
Fei Huang
Dacheng Tao
Jingren Zhou
LRM
LM&Ro
43
21
0
22 Apr 2024
MMInA: Benchmarking Multihop Multimodal Internet Agents
Ziniu Zhang
Shulin Tian
Liangyu Chen
Ziwei Liu
LLMAG
LM&Ro
27
13
0
15 Apr 2024
CMAT: A Multi-Agent Collaboration Tuning Framework for Enhancing Small Language Models
Xuechen Liang
Meiling Tao
Yinghui Xia
Yiting Xie
Jun Wang
JingSong Yang
LLMAG
16
12
0
02 Apr 2024
Tur[k]ingBench: A Challenge Benchmark for Web Agents
Kevin Xu
Yeganeh Kordi
Kate Sanders
Yizhong Wang
Adam Byerly
Jingyu Zhang
Benjamin Van Durme
Daniel Khashabi
LLMAG
60
6
0
18 Mar 2024
Towards Unified Alignment Between Agents, Humans, and Environment
Zonghan Yang
An Liu
Zijun Liu
Kai Liu
Fangzhou Xiong
...
Zhenhe Zhang
Fuwen Luo
Zhicheng Guo
Peng Li
Yang Liu
24
4
0
12 Feb 2024
How Proficient Are Large Language Models in Formal Languages? An In-Depth Insight for Knowledge Base Question Answering
Jinxi Liu
S. Cao
Jiaxin Shi
Tingjian Zhang
Lunyiu Nie
Linmei Hu
Lei Hou
Juanzi Li
ELM
10
3
0
11 Jan 2024
MemGPT: Towards LLMs as Operating Systems
Charles Packer
Sarah Wooders
Kevin Lin
Vivian Fang
Shishir G. Patil
Ion Stoica
Joseph E. Gonzalez
RALM
13
126
0
12 Oct 2023
Generative Agents: Interactive Simulacra of Human Behavior
J. Park
Joseph C. O'Brien
Carrie J. Cai
Meredith Ringel Morris
Percy Liang
Michael S. Bernstein
LM&Ro
AI4CE
209
1,701
0
07 Apr 2023
ReAct: Synergizing Reasoning and Acting in Language Models
Shunyu Yao
Jeffrey Zhao
Dian Yu
Nan Du
Izhak Shafran
Karthik Narasimhan
Yuan Cao
LLMAG
ReLM
LRM
208
2,413
0
06 Oct 2022
GLM-130B: An Open Bilingual Pre-trained Model
Aohan Zeng
Xiao Liu
Zhengxiao Du
Zihan Wang
Hanyu Lai
...
Jidong Zhai
Wenguang Chen
Peng-Zhen Zhang
Yuxiao Dong
Jie Tang
BDL
LRM
240
1,070
0
05 Oct 2022
Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned
Deep Ganguli
Liane Lovitt
John Kernion
Amanda Askell
Yuntao Bai
...
Nicholas Joseph
Sam McCandlish
C. Olah
Jared Kaplan
Jack Clark
218
441
0
23 Aug 2022
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang
Jason W. Wei
Dale Schuurmans
Quoc Le
Ed H. Chi
Sharan Narang
Aakanksha Chowdhery
Denny Zhou
ReLM
BDL
LRM
AI4CE
297
3,163
0
21 Mar 2022
Training language models to follow instructions with human feedback
Long Ouyang
Jeff Wu
Xu Jiang
Diogo Almeida
Carroll L. Wainwright
...
Amanda Askell
Peter Welinder
Paul Christiano
Jan Leike
Ryan J. Lowe
OSLM
ALM
301
11,730
0
04 Mar 2022
A Systematic Evaluation of Large Language Models of Code
Frank F. Xu
Uri Alon
Graham Neubig
Vincent J. Hellendoorn
ELM
ALM
193
624
0
26 Feb 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason W. Wei
Xuezhi Wang
Dale Schuurmans
Maarten Bosma
Brian Ichter
F. Xia
Ed H. Chi
Quoc Le
Denny Zhou
LM&Ro
LRM
AI4CE
ReLM
315
8,261
0
28 Jan 2022
Multitask Prompted Training Enables Zero-Shot Task Generalization
Victor Sanh
Albert Webson
Colin Raffel
Stephen H. Bach
Lintang Sutawika
...
T. Bers
Stella Biderman
Leo Gao
Thomas Wolf
Alexander M. Rush
LRM
203
1,651
0
15 Oct 2021