ETOM: A Five-Level Benchmark for Evaluating Tool Orchestration within the MCP Ecosystem

22 October 2025

Main: 11 pages, Bibliography: 2 pages, Appendix: 24 pages; 17 figures, 7 tables

Abstract

We introduce ETOM, a five-level benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents within a hierarchical Model Context Protocol (MCP) ecosystem. Existing benchmarks often assess tools in isolation, overlooking challenges such as functional overlap and cross-server orchestration, which can lead to overly optimistic evaluations. ETOM addresses these gaps by constructing ground truth through "equal function sets", enabling objective metrics such as F1 score and reducing reliance on LLM-as-a-judge evaluation. Its five-level curriculum systematically tests agent capabilities, from single-tool orchestration to complex cross-server planning, as well as robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. ETOM provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents.
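
The abstract does not spell out how an F1 score is computed against "equal function sets", so the following is only a minimal sketch of one plausible reading, not the paper's actual metric. It assumes that each ground-truth step is a set of functionally interchangeable tools, that these sets are disjoint, and that a predicted tool counts as correct if it falls in a set that has not yet been matched. The function name `f1_against_equal_sets` and the tool names are hypothetical.

```python
from typing import Iterable, Sequence


def f1_against_equal_sets(
    predicted: Sequence[str],
    ground_truth: Sequence[Iterable[str]],
) -> float:
    """F1 of predicted tool calls against groups of interchangeable tools.

    Assumption (not from the paper): each ground-truth group is satisfied
    by at most one predicted tool, and groups do not share tools.
    """
    remaining = [set(group) for group in ground_truth]
    true_positives = 0
    for tool in predicted:
        # Match the tool to the first still-unmatched group that contains it.
        for i, group in enumerate(remaining):
            if tool in group:
                true_positives += 1
                del remaining[i]
                break
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(ground_truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: two required steps, the first with two equivalent tools.
truth = [{"weather.get_forecast", "meteo.forecast"}, {"maps.geocode"}]
score = f1_against_equal_sets(["meteo.forecast", "calendar.create_event"], truth)
print(score)  # one of two predictions is correct, one of two steps is covered
```

Under this reading, an agent is not penalized for picking `meteo.forecast` over `weather.get_forecast`, which is the kind of functional overlap the abstract says isolated-tool benchmarks overlook, and the score needs no LLM judge.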
