ResearchTrend.AI
ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

8 August 2024
Jiarui Lu
Thomas Holleis
Yizhe Zhang
Bernhard Aumayer
Feng Nan
Felix Bai
Shuang Ma
Shen Ma
Mengyu Li
Guoli Yin
Zirui Wang
Ruoming Pang
    LLMAG
    ELM
arXiv: 2408.04682

Papers citing "ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities"

21 / 21 papers shown
From Glue-Code to Protocols: A Critical Analysis of A2A and MCP Integration for Scalable Agent Systems
Qiaomu Li
Ying Xie
24
0
0
06 May 2025
When2Call: When (not) to Call Tools
Hayley Ross
Ameya Sunil Mahabaleshwarkar
Yoshi Suhara
92
0
0
26 Apr 2025
MARFT: Multi-Agent Reinforcement Fine-Tuning
Junwei Liao
Muning Wen
J. Wang
W. Zhang
OffRL
23
0
0
21 Apr 2025
FamilyTool: A Multi-hop Personalized Tool Use Benchmark
Yuxin Wang
Yiran Guo
Y. Zheng
Zhangyue Yin
S. Chen
Jie Yang
Jiajun Chen
Xuanjing Huang
Xipeng Qiu
24
0
0
09 Apr 2025
APIGen-MT: Agentic Pipeline for Multi-Turn Data Generation via Simulated Agent-Human Interplay
Akshara Prabhakar
Z. Liu
Weiran Yao
Jianguo Zhang
Ming Zhu
...
Juan Carlos Niebles
Shelby Heinecke
H. Wang
S.
Caiming Xiong
VGen
74
1
0
04 Apr 2025
Multi-Mission Tool Bench: Assessing the Robustness of LLM based Agents through Related and Dynamic Missions
Peijie Yu
Yifan Yang
J. Li
Zelong Zhang
Haorui Wang
Xiao Feng
Feng Zhang
LLMAG
97
0
0
03 Apr 2025
DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models
S. Jung
Donghun Lee
Shinbok Lee
Gaeun Seo
Daniel Lee
Byeongil Ko
Junrae Cho
Kihyun Kim
EungGyun Kim
M. Shin
36
0
0
02 Apr 2025
On the Robustness of Agentic Function Calling
Ella Rabinovich
Ateret Anaby-Tavor
LLMAG
47
0
0
01 Apr 2025
Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions
Xinyi Hou
Yanjie Zhao
Shenao Wang
Haoyu Wang
53
10
0
30 Mar 2025
Survey on Evaluation of LLM-based Agents
Asaf Yehudai
Lilach Eden
Alan Li
Guy Uziel
Yilun Zhao
Roy Bar-Haim
Arman Cohan
Michal Shmueli-Scheuer
LLMAG
ELM
Presented at ResearchTrend Connect | LLMAG on 07 May 2025
93
5
0
20 Mar 2025
Magnet: Multi-turn Tool-use Data Synthesis and Distillation via Graph Translation
Fan Yin
Zifeng Wang
I-Hung Hsu
Jun Yan
Ke Jiang
...
L. Le
Kai-Wei Chang
Chen-Yu Lee
Hamid Palangi
Tomas Pfister
49
2
0
10 Mar 2025
HammerBench: Fine-Grained Function-Calling Evaluation in Real Mobile Device Scenarios
Jun Wang
Jiamu Zhou
Muning Wen
Xiaoyun Mo
H. Zhang
...
Cheng Jin
Xihuai Wang
Weinan Zhang
Qiuying Peng
J. Wang
LLMAG
87
0
0
21 Dec 2024
Towards Effective GenAI Multi-Agent Collaboration: Design and Evaluation for Enterprise Applications
Raphael Shu
Nilaksh Das
Michelle Yuan
Monica Sunkara
Yi Zhang
LLMAG
66
2
0
06 Dec 2024
CATP-LLM: Empowering Large Language Models for Cost-Aware Tool Planning
Duo Wu
J. Wang
Yuan Meng
Yanning Zhang
Le Sun
Zhi Wang
93
0
0
25 Nov 2024
CRMArena: Understanding the Capacity of LLM Agents to Perform Professional CRM Tasks in Realistic Environments
Kung-Hsiang Huang
Akshara Prabhakar
Sidharth Dhawan
Yixin Mao
Huan Wang
Silvio Savarese
Caiming Xiong
Philippe Laban
C. Wu
28
7
0
04 Nov 2024
Library Learning Doesn't: The Curious Case of the Single-Use "Library"
Ian Berlot-Attwell
Frank Rudzicz
Xujie Si
37
1
0
26 Oct 2024
Toolshed: Scale Tool-Equipped Agents with Advanced RAG-Tool Fusion and Tool Knowledge Bases
Elias Lumer
Vamse Kumar Subbiah
James A. Burke
Pradeep Honaganahalli Basavaraju
Austin Huber
31
0
0
18 Oct 2024
Learning Evolving Tools for Large Language Models
Guoxin Chen
Zhong Zhang
Xin Cong
Fangda Guo
Yesai Wu
Yankai Lin
Wenzheng Feng
Yasheng Wang
KELM
52
1
0
09 Oct 2024
SEAL: Suite for Evaluating API-use of LLMs
Woojeong Kim
Ashish Jagmohan
Aditya Vempaty
ELM
ALM
LLMAG
30
0
0
23 Sep 2024
Tool Learning with Large Language Models: A Survey
Changle Qu
Sunhao Dai
Xiaochi Wei
Hengyi Cai
Shuaiqiang Wang
Dawei Yin
Jun Xu
Jirong Wen
LLMAG
31
77
0
28 May 2024
Tur[k]ingBench: A Challenge Benchmark for Web Agents
Kevin Xu
Yeganeh Kordi
Kate Sanders
Yizhong Wang
Adam Byerly
Jingyu Zhang
Benjamin Van Durme
Daniel Khashabi
LLMAG
56
6
0
18 Mar 2024