AgentBench: Evaluating LLMs as Agents

7 August 2023
Xiao Liu
Hao Yu
Hanchen Zhang
Yifan Xu
Xuanyu Lei
Hanyu Lai
Yu Gu
Hangliang Ding
Kai Men
Kejuan Yang
Shudan Zhang
Xiang Deng
Aohan Zeng
Zhengxiao Du
Chenhui Zhang
Sheng Shen
Yu-Chuan Su
Huan Sun
Minlie Huang
Yuxiao Dong
Jie Tang
Abstract

Large Language Models (LLMs) are becoming increasingly smart and autonomous, targeting real-world pragmatic missions beyond traditional NLP tasks. As a result, there is an urgent need to evaluate LLMs as agents on challenging tasks in interactive environments. We present AgentBench, a multi-dimensional, evolving benchmark that currently consists of 8 distinct environments to assess the reasoning and decision-making abilities of LLMs-as-Agents in a multi-turn, open-ended generation setting. Our extensive tests over 27 API-based and open-source (OSS) LLMs show that, while top commercial LLMs exhibit a strong ability to act as agents in complex environments, there is a significant performance disparity between them and their OSS competitors. We identify the typical reasons for failures in environments and LLMs, showing that poor long-term reasoning, decision-making, and instruction-following abilities are the main obstacles to developing usable LLM agents. Training on code and high-quality multi-turn alignment data could improve agent performance. Datasets, environments, and an integrated evaluation package for AgentBench are released at \url{https://github.com/THUDM/AgentBench}.
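To make the multi-turn evaluation setting concrete, the sketch below shows a minimal agent-environment loop in Python: an environment exposes reset/step, an agent maps the dialogue so far to its next action, and the loop records the reward after a bounded number of turns. The environment, agent, and interface names here are illustrative placeholders, not the released AgentBench evaluation package; see the repository linked above for the actual API.

# Minimal sketch of a multi-turn agent evaluation loop in the spirit of
# AgentBench's setting. All names and interfaces are hypothetical placeholders,
# not the released AgentBench API.
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class GuessNumberEnv:
    """Toy interactive environment: the agent must guess a hidden integer."""
    target: int = 42
    max_turns: int = 8
    history: List[str] = field(default_factory=list)

    def reset(self) -> str:
        self.history.clear()
        return "Guess an integer between 1 and 100."

    def step(self, action: str) -> Tuple[str, bool, float]:
        """Return (observation, done, reward) for one agent action."""
        try:
            guess = int(action.strip())
        except ValueError:
            return "Please reply with a single integer.", False, 0.0
        if guess == self.target:
            return "Correct!", True, 1.0
        hint = "higher" if guess < self.target else "lower"
        return f"Wrong, try {hint}.", False, 0.0


def evaluate(agent: Callable[[List[str]], str], env: GuessNumberEnv) -> float:
    """Run one multi-turn episode and return the final reward."""
    obs = env.reset()
    dialogue = [obs]
    for _ in range(env.max_turns):
        action = agent(dialogue)          # an LLM would be queried here
        dialogue.append(action)
        obs, done, reward = env.step(action)
        dialogue.append(obs)
        if done:
            return reward
    return 0.0


if __name__ == "__main__":
    # A scripted binary-search "agent" stands in for an LLM call.
    lo, hi = 1, 100

    def binary_search_agent(dialogue: List[str]) -> str:
        global lo, hi
        if len(dialogue) > 1 and "higher" in dialogue[-1]:
            lo = int(dialogue[-2]) + 1
        elif len(dialogue) > 1 and "lower" in dialogue[-1]:
            hi = int(dialogue[-2]) - 1
        return str((lo + hi) // 2)

    print("reward:", evaluate(binary_search_agent, GuessNumberEnv()))

In AgentBench itself, the scripted agent would be replaced by an LLM call and the toy environment by one of the eight task environments; the loop structure above only illustrates the multi-turn, open-ended interaction being evaluated.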
