ResearchTrend.AI
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

18 December 2024
Frank F. Xu
Yufan Song
Boxuan Li
Yuxuan Tang
Kritanjali Jain
Mengxue Bao
Zora Zhiruo Wang
Xuhui Zhou
Zhitong Guo
Murong Cao
Mingyang Yang
Hao Yang Lu
Amaad Martin
Zhe Su
Leander Maben
Raj Mehta
Wayne Chi
Lawrence Jang
Yiqing Xie
Shuyan Zhou
Graham Neubig
    LLMAG
arXiv: 2412.14161

Papers citing "TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks"

16 / 16 papers shown
Towards Multi-Agent Reasoning Systems for Collaborative Expertise Delegation: An Exploratory Design Study
Baixuan Xu
Chunyang Li
Weiqi Wang
Wei Fan
Tianshi Zheng
H. Shi
Tao Fan
Yangqiu Song
Qiang Yang
14
0
0
12 May 2025
Benchmarking Vision, Language, & Action Models in Procedurally Generated, Open Ended Action Environments
Pranav Guruprasad
Yangyue Wang
Sudipta Chowdhury
Harshvardhan Sikka
LM&Ro
VLM
50
0
0
08 May 2025
PLANET: A Collection of Benchmarks for Evaluating LLMs' Planning Capabilities
Haoming Li
Zhaoliang Chen
Jonathan Zhang
Fei Liu
LLMAG
31
0
0
21 Apr 2025
DoomArena: A framework for Testing AI Agents Against Evolving Security Threats
Léo Boisvert
Mihir Bansal
Chandra Kiran Reddy Evuru
Gabriel Huang
Abhay Puri
...
Quentin Cappart
Jason Stanley
Alexandre Lacoste
Alexandre Drouin
Krishnamurthy Dvijotham
30
0
0
18 Apr 2025
REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Divyansh Garg
Shaun VanWeelden
Diego Caples
Andis Draguns
Nikil Ravi
...
Youngchul Joo
Jindong Gu
Charles London
Christian Schroeder de Witt
S. Motwani
39
1
0
15 Apr 2025
Inducing Programmatic Skills for Agentic Tasks
Zora Zhiruo Wang
Apurva Gandhi
Graham Neubig
Daniel Fried
LLMAG
35
0
0
09 Apr 2025
IMPersona: Evaluating Individual Level LM Impersonation
Quan Shi
Carlos E. Jimenez
Stephen Dong
Brian Seo
Caden Yao
Adam Kelch
Karthik Narasimhan
21
0
0
06 Apr 2025
SWI: Speaking with Intent in Large Language Models
Yuwei Yin
EunJeong Hwang
Giuseppe Carenini
LRM
44
0
0
27 Mar 2025
Survey on Evaluation of LLM-based Agents
Asaf Yehudai
Lilach Eden
Alan Li
Guy Uziel
Yilun Zhao
Roy Bar-Haim
Arman Cohan
Michal Shmueli-Scheuer
LLMAG
ELM
Presented at ResearchTrend Connect | LLMAG on 07 May 2025
93
5
0
20 Mar 2025
Measuring AI Ability to Complete Long Tasks
Thomas Kwa
Ben West
Joel Becker
Amy Deng
Katharyn Garcia
...
Lucas Jun Koba Sato
H. Wijk
Daniel M. Ziegler
Elizabeth Barnes
Lawrence Chan
ELM
67
6
0
18 Mar 2025
Measuring temporal effects of agent knowledge by date-controlled tool use
R. Xian
Qiming Cui
Stefan Bauer
Reza Abbasi-Asl
KELM
50
0
0
06 Mar 2025
Agentic AI Needs a Systems Theory
Erik Miehling
K. Ramamurthy
Kush R. Varshney
Matthew D Riemer
Djallel Bouneffouf
...
P. Sattigeri
Dennis L. Wei
Ambrish Rawat
Jasmina Gajcin
Werner Geyer
55
1
0
28 Feb 2025
Programming with Pixels: Computer-Use Meets Software Engineering
Pranjal Aggarwal
Sean Welleck
31
0
0
24 Feb 2025
HARBOR: Exploring Persona Dynamics in Multi-Agent Competition
Kenan Jiang
Li Xiong
Fei Liu
42
0
0
17 Feb 2025
The AI Agent Index
Stephen Casper
Luke Bailey
Rosco Hunter
Carson Ezell
Emma Cabalé
...
Phillip J. K. Christoffersen
A. Pinar Ozisik
Rakshit Trivedi
Dylan Hadfield-Menell
Noam Kolt
60
4
0
03 Feb 2025
The BrowserGym Ecosystem for Web Agent Research
Thibault Le Sellier De Chezelles
Maxime Gasse
Alexandre Lacoste
Alexandre Drouin
Massimo Caccia
...
Siva Reddy
Quentin Cappart
Graham Neubig
Ruslan Salakhutdinov
Nicolas Chapados
LLMAG
94
9
0
06 Dec 2024