SpecTool: A Benchmark for Characterizing Errors in Tool-Use LLMs

20 November 2024
Shirley Kokane
Ming Zhu
Tulika Awalgaonkar
Jianguo Zhang
Thai Hoang
Akshara Prabhakar
Zuxin Liu
Tian Lan
Liangwei Yang
Juntao Tan
Rithesh Murthy
Weiran Yao
Z. Liu
Juan Carlos Niebles
Huan Wang
Shelby Heinecke
Caiming Xiong
Silvio Savarese
arXiv:2411.13547
Abstract

Evaluating the output of Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically report only a success rate, without any explanation of the failure cases. To solve this problem, we introduce SpecTool, a new benchmark for identifying error patterns in LLM output on tool-use tasks. Our benchmark dataset comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using SpecTool, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use the analysis and insights from SpecTool to guide their error mitigation strategies.
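The abstract does not enumerate the seven error patterns, so the sketch below is purely illustrative: a minimal Python checker that compares a model's JSON tool call against a ground-truth specification and flags a few generic tool-use error types. The category names (wrong_tool, missing_argument, and so on) are hypothetical examples of tool-use errors, not SpecTool's taxonomy.

import json

def classify_tool_call_errors(model_output: str, expected: dict) -> list[str]:
    """Compare a model's JSON tool call against a ground-truth spec and
    return a list of detected error labels (hypothetical categories)."""
    errors = []
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return ["malformed_output"]          # output is not valid JSON

    if call.get("tool") != expected["tool"]:
        errors.append("wrong_tool")          # called a different tool
    args = call.get("arguments", {})
    for name in expected["arguments"]:
        if name not in args:
            errors.append(f"missing_argument:{name}")
    for name, value in args.items():
        if name not in expected["arguments"]:
            errors.append(f"hallucinated_argument:{name}")
        elif value != expected["arguments"][name]:
            errors.append(f"wrong_argument_value:{name}")
    return errors

# Example: the model picks the right tool but drops one argument.
expected = {"tool": "get_weather", "arguments": {"city": "Paris", "unit": "celsius"}}
output = '{"tool": "get_weather", "arguments": {"city": "Paris"}}'
print(classify_tool_call_errors(output, expected))
# ['missing_argument:unit']

A labeled checker of this kind yields per-category error counts rather than a single success rate, which is the kind of diagnostic breakdown the abstract argues existing benchmarks lack.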
