2505.16700
MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models
22 May 2025
Xuanqi Gao, Siyi Xie, Juan Zhai, Shiqing Ma, Chao Shen
ELM
Papers citing "MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models"
Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol
Niklas Jobs, Luis Miguel Vieira da Silva, Jayanth Somashekaraiah, Maximilian Weigand, David Kube, Felix Gehlhoff
LLMAG, LM&Ro
03 Dec 2025
A Longitudinal Measurement of Privacy Policy Evolution for Large Language Models
Zhen Tao, Shidong Pan, Zhenchang Xing, Emily Black, Talia B. Gillis, Chunyang Chen
AILaw
24 Nov 2025
MCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs?
Jiayi Fu, Qiyao Sun
08 Nov 2025
OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents
Hongrui Jia, Jitong Liao, X. Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, Fei Huang
28 Oct 2025
MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools
Wenhao Wang, Peizhi Niu, Zhao Xu, Zhaoyu Chen, Jian Du, ..., Xianghe Pang, Keduan Huang, Y. Wang, Qiang Yan, Siheng Chen
28 Oct 2025
MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai
22 Oct 2025
TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Reza Esfandiarpoor, Vishwas Suryanarayanan, Stephen H. Bach, Vishal Chowdhary, Anthony Aue
LLMAG
22 Oct 2025
InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
Yaxin Du, Y. Zhang, Xiyuan Yang, Yifan Zhou, Cheng-Yu Wang, ..., Menglan Chen, Shuo Tang, Z. Li, Feiyu Xiong, Siheng Chen
02 Oct 2025
TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments
Zhangchen Xu, Adriana Meza Soria, Shawn Tan, Anurag Roy, Ashish Sunil Agrawal, Radha Poovendran, Rameswar Panda
01 Oct 2025
MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use
Zijian Wu, Xiangyan Liu, Xinyuan Zhang, L. Chen, Fanqing Meng, ..., Zirui Wang, Jinjie Ni, Y. Yang, Arvin Xu, Michael Shieh
28 Sep 2025
IoT-MCP: Bridging LLMs and IoT Systems Through Model Context Protocol
Ningyuan Yang, Guanliang Lyu, Mingchen Ma, Yiyi Lu, Yiming Li, Zhihui Gao, Hancheng Ye, Jianyi Zhang, Tingjun Chen, Yiran Chen
25 Sep 2025
ARE: Scaling Up Agent Environments and Evaluations
Pierre Andrews, Amine Benhalloum, Gerard Moreno-Torres Bertran, Matteo Bettini, Amar Budhiraja, ..., Andrey Rusakov, Thomas Scialom, Vladislav Vorotilov, Mengjue Wang, Ian Yu
LLMAG
21 Sep 2025
MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools
Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, Zhendong Mao
ELM
10 Sep 2025
Servant, Stalker, Predator: How An Honest, Helpful, And Harmless (3H) Agent Unlocks Adversarial Skills
David Noever
27 Aug 2025
MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use
Fei Lei, Yibo Yang, Wenxiu Sun, Dahua Lin
LLMAG
22 Aug 2025
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries
Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, ..., Song Wang, Sathish Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song
LLMAG
21 Aug 2025
MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers
Ziyang Luo, Zhiqi Shen, Wenzhuo Yang, Zirui Zhao, Prathyusha Jwalapuram, Amrita Saha, Doyen Sahoo, Silvio Savarese, Caiming Xiong, Junnan Li
ELM
20 Aug 2025
Agentic DraCor and the Art of Docstring Engineering: Evaluating MCP-empowered LLM Usage of the DraCor API
Peer Trilcke, Ingo Börner, Henny Sluyter-Gäthje, Daniil Skorinkin, Frank Fischer, Carsten Milling
19 Aug 2025
MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark
Shiqing Fan, Xichen Ding, Liang Zhang, Linjian Mo
LLMAG
11 Aug 2025
Routine: A Structural Planning Framework for LLM Agent System in Enterprise
Guancheng Zeng, Xueyi Chen, Jiawang Hu, Shaohua Qi, Yaxuan Mao, ..., Wenqiang Han, Linyan Huang, Gang Li, Jingjing Mo, Haowen Hu
19 Jul 2025
MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models
Pei Wang, Yanan Wu, Zekun Wang, Qingbin Liu, Xiaoshuai Song, ..., Ge Zhang, Hangyu Guo, Rundong Wang, Yuchi Xu, Bo Zheng
ELM
15 Oct 2024
AutoFeedback: An LLM-based Framework for Efficient and Accurate API Request Generation
Huanxi Liu, Jiaqi Liao, Dawei Feng, Kele Xu, Huaimin Wang
09 Oct 2024
GAIA: a benchmark for General AI Assistants
Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom
AI4MH, ALM, ELM, RALM
21 Nov 2023
MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback
International Conference on Learning Representations (ICLR), 2024
Xingyao Wang, Zihan Wang, Jiateng Liu, Yangyi Chen, Lifan Yuan, Hao Peng, Heng Ji
LRM
19 Sep 2023
AgentBench: Evaluating LLMs as Agents
International Conference on Learning Representations (ICLR), 2024
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, ..., Yu Su, Huan Sun, Shiyu Huang, Yuxiao Dong, Jie Tang
ELM, LLMAG
07 Aug 2023
Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models
Cheng-Yu Hsieh, Sibei Chen, Chun-Liang Li, Yasuhisa Fujii, Alexander Ratner, Chen-Yu Lee, Ranjay Krishna, Tomas Pfister
LLMAG, SyDa
01 Aug 2023
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
International Conference on Learning Representations (ICLR), 2024
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, ..., Jie Zhou, Mark B. Gerstein, Dahai Li, Zhiyuan Liu, Maosong Sun
CLL, ALM, LLMAG, ELM, LM&MA
31 Jul 2023
ToolQA: A Dataset for LLM Question Answering with External Tools
Neural Information Processing Systems (NeurIPS), 2023
Yuchen Zhuang, Yue Yu, Kuan-Chieh Wang, Haotian Sun, Chao Zhang
ELM, LLMAG
23 Jun 2023
RestGPT: Connecting Large Language Models with Real-World RESTful APIs
Yifan Song, Weimin Xiong, Dawei Zhu, Wenhao Wu, Han Qian, ..., Cheng Li, Ke Wang, Rong Yao, Ye Tian, Sujian Li
RALM, LLMAG, CLL, LM&MA
11 Jun 2023
ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, Le Sun
CLL, SyDa
08 Jun 2023
On the Tool Manipulation Capability of Open-source Large Language Models
Qiantong Xu, Fenglu Hong, Yangqiu Song, Changran Hu, Zheng Chen, Jian Zhang
LLMAG
25 May 2023
Gorilla: Large Language Model Connected with Massive APIs
Neural Information Processing Systems (NeurIPS), 2023
Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez
ELM, CLL, ALM, SyDa
24 May 2023
Interactive Natural Language Processing
Zekun Wang, Ge Zhang, Kexin Yang, Ning Shi, Wangchunshu Zhou, ..., Wenhu Chen, Ke Xu, Dayiheng Liu, Yi-Ting Guo, Jie Fu
KELM
22 May 2023
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models
Neural Information Processing Systems (NeurIPS), 2023
Pan Lu, Baolin Peng, Hao Cheng, Michel Galley, Kai-Wei Chang, Ying Nian Wu, Song-Chun Zhu, Jianfeng Gao
KELM, MLLM, LRM
19 Apr 2023
API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, Yongbin Li
ELM, RALM, CLL
14 Apr 2023
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, Nan Duan
ALM, ELM
13 Apr 2023
TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs
Intelligent Computing (IC), 2023
Yaobo Liang, Chenfei Wu, Ting Song, Wenshan Wu, Yan Xia, ..., Shaoguang Mao, Yuntao Wang, Linjun Shou, Ming Gong, Nan Duan
LLMAG, CLL
29 Mar 2023
Toolformer: Language Models Can Teach Themselves to Use Tools
Neural Information Processing Systems (NeurIPS), 2023
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom
SyDa, RALM
09 Feb 2023
TALM: Tool Augmented Language Models
Aaron T Parisi, Yao-Min Zhao, Noah Fiedel
KELM, RALM, LLMAG
24 May 2022
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Neural Information Processing Systems (NeurIPS), 2022
Jason W. Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, F. Xia, Ed H. Chi, Quoc Le, Denny Zhou
LM&Ro, LRM, AI4CE, ReLM
28 Jan 2022
Training Verifiers to Solve Math Word Problems
K. Cobbe, V. Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, ..., Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, John Schulman
ReLM, OffRL, LRM
27 Oct 2021
Program Synthesis with Large Language Models
Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, ..., Ellen Jiang, Carrie J. Cai, Michael Terry, Quoc V. Le, Charles Sutton
ELM, AIMat, ReCod, ALM
16 Aug 2021
Measuring Mathematical Problem Solving With the MATH Dataset
Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, Jacob Steinhardt
ReLM, FaML
05 Mar 2021
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, Matthew J. Hausknecht
LM&Ro, LLMAG
08 Oct 2020
Measuring Massive Multitask Language Understanding
International Conference on Learning Representations (ICLR), 2021
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, Jacob Steinhardt
ELM, RALM
07 Sep 2020
Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference
Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2021
Timo Schick, Hinrich Schütze
21 Jan 2020