What Do Agents Learn from Trajectory-SFT: Semantics or Interfaces?

Weizheng Gu
Chengze Li
Zhuohao Yu
Mengyuan Sun
Zhibang Yang
Wei Wang
Hongrui Jia
Shikun Zhang
Wei Ye
Main: 9 pages · 5 figures · 30 tables · Bibliography: 1 page · Appendix: 21 pages
Abstract

Large language models are increasingly evaluated as interactive agents, yet standard agent benchmarks conflate two qualitatively distinct sources of success: semantic tool use and memorization of interface-specific interaction patterns. Because both mechanisms can yield identical task success on the original interface, benchmark scores alone are not identifiable evidence of environment-invariant capability. We propose PIPE, a protocol-level evaluation augmentation that diagnoses interface reliance by minimally rewriting environment interfaces while preserving task semantics and execution behavior. Across 16 environments from AgentBench and AgentGym and a range of open-source and API-based agents, PIPE reveals that trajectory-SFT substantially amplifies interface shortcutting: trained agents degrade sharply under minimal interface rewrites, while non-trajectory-trained models remain largely stable. We further introduce Interface Reliance (IR), a counterbalanced alias-based metric that quantifies preference for training-time interfaces, and show that interface shortcutting exhibits environment-dependent, non-monotonic training dynamics that remain invisible under standard evaluation. Our code is available at this https URL.
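To make the abstract's two ideas concrete, here is a minimal sketch of what an interface-preserving alias rewrite and a counterbalanced Interface Reliance score could look like. The concrete rewrite rules and the exact IR formula are not specified in the abstract, so the names `ALIAS_MAP`, `rewrite_interface`, and `interface_reliance` below, and the toy success rates, are hypothetical illustrations under stated assumptions rather than the authors' implementation.

```python
# Sketch of a PIPE-style interface alias and an IR-style score.
# Everything below is an assumption-based illustration: the real PIPE
# rewrites and IR definition come from the paper, not from this snippet.

from dataclasses import dataclass

# Hypothetical alias map: each training-time tool name is replaced by a
# semantically identical but lexically different counterpart, so only the
# interface surface changes, not the task or execution behavior.
ALIAS_MAP = {
    "open_file": "load_document",
    "run_query": "execute_sql",
    "click": "select_element",
}


def rewrite_interface(tool_call: str, alias_map: dict[str, str]) -> str:
    """Rename the tool in a call string while keeping its arguments intact."""
    name, _, args = tool_call.partition("(")
    return alias_map.get(name, name) + "(" + args


@dataclass
class EvalResult:
    success_original: float  # success rate on the training-time interface
    success_aliased: float   # success rate on the minimally rewritten interface


def interface_reliance(fwd: EvalResult, rev: EvalResult) -> float:
    """One plausible counterbalanced IR score (an assumption, not the paper's
    formula): average the success drop when moving away from the interface
    seen at training time over both alias directions, so intrinsic difficulty
    differences between the two naming schemes cancel out."""
    drop_fwd = fwd.success_original - fwd.success_aliased
    drop_rev = rev.success_original - rev.success_aliased
    return 0.5 * (drop_fwd + drop_rev)


if __name__ == "__main__":
    # Toy numbers for a trajectory-SFT agent that collapses under aliasing.
    fwd = EvalResult(success_original=0.72, success_aliased=0.41)
    rev = EvalResult(success_original=0.70, success_aliased=0.44)
    print(rewrite_interface("open_file(path='a.txt')", ALIAS_MAP))
    print(f"IR = {interface_reliance(fwd, rev):.2f}")  # large IR suggests interface shortcutting
```

A high IR under this kind of counterbalanced comparison would indicate that the agent's success depends on the specific interface strings seen during trajectory-SFT rather than on environment-invariant tool-use ability, which is the distinction the paper's protocol is designed to expose.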
