MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

22 May 2025

Papers citing "MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models"

Benchmark for Planning and Control with Large Language Model Agents: Blocksworld with Model Context Protocol

Niklas Jobs

Luis Miguel Vieira da Silva

Jayanth Somashekaraiah

192

03 Dec 2025

A Longitudinal Measurement of Privacy Policy Evolution for Large Language Models

24 Nov 2025

MCP-RiskCue: Can LLM Infer Risk Information From MCP Server System Logs?

Jiayi Fu

Qiyao Sun

136

08 Nov 2025

OSWorld-MCP: Benchmarking MCP Tool Invocation In Computer-Use Agents

173

28 Oct 2025

MCP-Flow: Facilitating LLM Agents to Master Real-World, Diverse and Scaling MCP Tools

...

170

28 Oct 2025

MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration

151

22 Oct 2025

TheMCPCompany: Creating General-purpose Agents with Task-specific Tools

Reza Esfandiarpoor

Vishwas Suryanarayanan

207

22 Oct 2025

InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents

...

172

02 Oct 2025

TOUCAN: Synthesizing 1.5M Tool-Agentic Data from Real-World MCP Environments

116

01 Oct 2025

MCPMark: A Benchmark for Stress-Testing Realistic and Comprehensive MCP Use

...

116

28 Sep 2025

IoT-MCP: Bridging LLMs and IoT Systems Through Model Context Protocol

102

25 Sep 2025

ARE: Scaling Up Agent Environments and Evaluations

Pierre Andrews

Amine Benhalloum

Gerard Moreno-Torres Bertran

...

385

21 Sep 2025

MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

150

10 Sep 2025

Servant, Stalker, Predator: How An Honest, Helpful, And Harmless (3H) Agent Unlocks Adversarial Skills

David Noever

137

27 Aug 2025

MCPVerse: An Expansive, Real-World Benchmark for Agentic Tool Use

168

22 Aug 2025

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

...

182

21 Aug 2025

MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers

Prathyusha Jwalapuram

208

20 Aug 2025

Agentic DraCor and the Art of Docstring Engineering: Evaluating MCP-empowered LLM Usage of the DraCor API

19 Aug 2025

MCPToolBench++: A Large Scale AI Agent Model Context Protocol MCP Tool Use Benchmark

11 Aug 2025

Routine: A Structural Planning Framework for LLM Agent System in Enterprise

...

195

19 Jul 2025

MTU-Bench: A Multi-granularity Tool-Use Benchmark for Large Language Models

Pei Wang

Yanan Wu

Zekun Wang

...

Bo Zheng

257

15 Oct 2024

AutoFeedback: An LLM-based Framework for Efficient and Accurate API Request Generation

Kele Xu

878

09 Oct 2024

GAIA: a benchmark for General AI Assistants

Grégoire Mialon

439

442

21 Nov 2023

MINT: Evaluating LLMs in Multi-turn Interaction with Tools and Language Feedback

International Conference on Learning Representations (ICLR), 2023

Hao Peng

Heng Ji

LRM

471

253

19 Sep 2023

AgentBench: Evaluating LLMs as Agents

International Conference on Learning Representations (ICLR), 2023

...

532

494

07 Aug 2023

Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models

Cheng-Yu Hsieh

Sibei Chen

Chun-Liang Li

Yasuhisa Fujii

Alexander Ratner

Chen-Yu Lee

Ranjay Krishna

Tomas Pfister

LLMAG SyDa

288

01 Aug 2023

ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

International Conference on Learning Representations (ICLR), 2023

...

Jie Zhou

Mark B. Gerstein

Dahai Li

Zhiyuan Liu

Maosong Sun

CLL ALM LLMAG ELM LM&MA

593

1,109

31 Jul 2023

ToolQA: A Dataset for LLM Question Answering with External Tools

Neural Information Processing Systems (NeurIPS), 2023

325

342

23 Jun 2023

RestGPT: Connecting Large Language Models with Real-World RESTful APIs

...

Sujian Li

304

113

11 Jun 2023

ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases

Xianpei Han

285

282

08 Jun 2023

On the Tool Manipulation Capability of Open-source Large Language Models

256

25 May 2023

Gorilla: Large Language Model Connected with Massive APIs

Neural Information Processing Systems (NeurIPS), 2023

Tianjun Zhang

387

871

24 May 2023

Interactive Natural Language Processing

Ge Zhang

...

Dayiheng Liu

142

22 May 2023

Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models

Neural Information Processing Systems (NeurIPS), 2023

380

412

19 Apr 2023

API-Bank: A Comprehensive Benchmark for Tool-Augmented LLMs

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Feifan Song

Zhoujun Li

Fei Huang

Yongbin Li

ELM RALM CLL

303

296

14 Apr 2023

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

378

721

13 Apr 2023

TaskMatrix.AI: Completing Tasks by Connecting Foundation Models with Millions of APIs

Intelligent Computing (IC), 2023

...

264

239

29 Mar 2023

Toolformer: Language Models Can Teach Themselves to Use Tools

Neural Information Processing Systems (NeurIPS), 2023

Luke Zettlemoyer

414

2,656

09 Feb 2023

TALM: Tool Augmented Language Models

278

182

24 May 2022

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Neural Information Processing Systems (NeurIPS), 2022

2.3K

14,608

28 Jan 2022

Training Verifiers to Solve Math Word Problems

...

1.1K

6,810

27 Oct 2021

Program Synthesis with Large Language Models

Henryk Michalewski

...

419

2,869

16 Aug 2021

Measuring Mathematical Problem Solving With the MATH Dataset

904

3,932

05 Mar 2021

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

Matthew J. Hausknecht

LM&Ro LLMAG

415

635

08 Oct 2020

Measuring Massive Multitask Language Understanding

International Conference on Learning Representations (ICLR), 2020

2.3K

6,566

07 Sep 2020

Exploiting Cloze Questions for Few Shot Text Classification and Natural Language Inference

Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2020

Timo Schick

Hinrich Schütze

1.1K

1,754

21 Jan 2020