RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
16 January 2024
Junjie Ye
Yilong Wu
Songyang Gao
Jessica Fan
Sixian Li
Guanyu Li
Xiaoran Fan
Tao Gui
Xuanjing Huang
    AAML
ArXiv (abs) · PDF · HTML · GitHub (15★)

Papers citing "RoTBench: A Multi-Level Benchmark for Evaluating the Robustness of Large Language Models in Tool Learning"

24 / 24 papers shown
Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation
Ke Zhang
X. Zhao
Ce Zheng
Jiahong Ning
DanDan Zhu
Wenqi Zhang
Chen Sun
Toshiharu Sugawara
LM&Ro
384
0
0
26 Nov 2025
Don't Adapt Small Language Models for Tools; Adapt Tool Schemas to the Models
Jonggeun Lee
Woojung Song
Jongwook Han
Haesung Pyun
Yohan Jo
CLL
106
0
0
08 Oct 2025
OR-Toolformer: Modeling and Solving Operations Research Problems with Tool Augmented Large Language Models
Jianzhang Zhang
Jialong Zhou
Chuang Liu
LLMAG SyDa
61
0
0
24 Sep 2025
Analyzing the Effects of Supervised Fine-Tuning on Model Knowledge from Token and Parameter Levels
Junjie Ye
Yuming Yang
Yang Nan
Shuo Li
Qi Zhang
Tao Gui
Xuanjing Huang
Liang Luo
Zhongchao Shi
Jianping Fan
84
1
0
20 Sep 2025
SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs
Hongfei Xia
Hongru Wang
Zeming Liu
Qian Yu
Yuhang Guo
Haifeng Wang
ELM
90
1
0
09 Sep 2025
Feedback-Driven Tool-Use Improvements in Large Language Models via Automated Build Environments
Junjie Ye
C. Jiang
Zhengyin Du
Yufei Xu
Xuesong Yao
...
Xiaoran Fan
Qi Zhang
Tao Gui
Xuanjing Huang
Jiecao Chen
KELM OffRL
128
4
0
12 Aug 2025
CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios
Shiting Huang
Zhen Fang
Zehui Chen
Siyu Yuan
Junjie Ye
Y. Zeng
Lin Yen-Chen
Qi Mao
Feng Zhao
LLMAG KELM
165
1
0
11 Jun 2025
Trustworthy Medical Question Answering: An Evaluation-Centric Survey
Yinuo Wang
Robert E. Mercer
Frank Rudzicz
Sudipta Singha Roy
Pengjie Ren
Zhumin Chen
Xindi Wang
ELM
196
2
0
04 Jun 2025
RRTL: Red Teaming Reasoning Large Language Models in Tool Learning
Yifei Liu
Yu Cui
Haibin Zhang
LRM
228
1
0
21 May 2025
ToolSpectrum : Towards Personalized Tool Utilization for Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Zihao Cheng
Hongru Wang
Zeming Liu
Yuhang Guo
Yuanfang Guo
Yunhong Wang
Haifeng Wang
255
3
0
19 May 2025
FamilyTool: A Multi-hop Personalized Tool Use Benchmark
Yuxin Wang
Yiran Guo
Y. Zheng
Zhangyue Yin
Tian Jin
Jie Yang
Jiajun Chen
Yuan Li
Qi Zhang
Xipeng Qiu
284
0
0
09 Apr 2025
Select Me! When You Need a Tool: A Black-box Text Attack on Tool Selection
Liuji Chen
Hao Gao
Jinghao Zhang
Sihan Yang
Shu Wu
Liang Wang
AAML
200
1
0
07 Apr 2025
On the Robustness of Agentic Function Calling
Ella Rabinovich
Ateret Anaby-Tavor
LLMAG
192
7
0
01 Apr 2025
reWordBench: Benchmarking and Improving the Robustness of Reward Models with Transformed Inputs
Zhaofeng Wu
Michihiro Yasunaga
Andrew Cohen
Yoon Kim
Asli Celikyilmaz
Marjan Ghazvininejad
279
10
0
14 Mar 2025
GenTool: Enhancing Tool Generalization in Language Models through Zero-to-One and Weak-to-Strong Simulation
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Jie He
Jennifer Neville
Mengting Wan
Longqi Yang
Hui Liu
Xiaofeng Xu
Xia Song
Jeff Z. Pan
Pei Zhou
LLMAG SyDa
170
4
0
26 Feb 2025
PEToolLLM: Towards Personalized Tool Learning in Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Qiancheng Xu
Yunshui Li
Heming Xia
Fan Liu
Min Yang
Wenjie Li
304
0
0
26 Feb 2025
ToolHop: A Query-Driven Benchmark for Evaluating Large Language Models in Multi-Hop Tool Use
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Junjie Ye
Zhengyin Du
Xuesong Yao
Weijian Lin
Yufei Xu
...
Siyu Yuan
Tao Gui
Qi Zhang
Jiecao Chen
359
9
0
05 Jan 2025
Speech-Copilot: Leveraging Large Language Models for Speech Processing via Task Decomposition, Modularization, and Program Generation
Chun-Yi Kuan
Chih-Kai Yang
Wei-Ping Huang
Ke-Han Lu
Hung-yi Lee
255
17
0
13 Jul 2024
ElecBench: a Power Dispatch Evaluation Benchmark for Large Language Models
Xiyuan Zhou
Huan Zhao
Yuheng Cheng
Yuji Cao
Gaoqi Liang
Guolong Liu
Wenxuan Liu
Yan Xu
Junhua Zhao
ELM
216
19
0
07 Jul 2024
What Affects the Stability of Tool Learning? An Empirical Study on the Robustness of Tool Learning Frameworks
Chengrui Huang
Zhengliang Shi
Yuntao Wen
Xiuying Chen
Peng Han
Shen Gao
Shuo Shang
163
2
0
03 Jul 2024
Can Tool-augmented Large Language Models be Aware of Incomplete Conditions?
Seungbin Yang
Yujin Baek
Taehee Kim
Jaegul Choo
393
4
0
18 Jun 2024
Tool Learning with Large Language Models: A Survey
Changle Qu
Sunhao Dai
Xiaochi Wei
Hengyi Cai
Shuaiqiang Wang
D. Yin
Jun Xu
Jirong Wen
LLMAG
268
203
0
28 May 2024
ToolSword: Unveiling Safety Issues of Large Language Models in Tool Learning Across Three Stages
Junjie Ye
Sixian Li
Guanyu Li
Jessica Fan
Songyang Gao
Yilong Wu
Tao Gui
Xuanjing Huang
LLMAG
338
48
0
16 Feb 2024
Interpreting User Requests in the Context of Natural Language Standing Instructions
Nikita Moghe
Patrick Xia
Jacob Andreas
J. Eisner
Benjamin Van Durme
Harsh Jhamtani
207
5
0
16 Nov 2023