11

TRIP-Bench: A Benchmark for Long-Horizon Interactive Agents in Real-World Scenarios

Yuanzhe Shen
Zisu Huang
Zhengyuan Wang
Muzhao Tian
Zhengkang Guo
Chenyang Zhang
Shuaiyu Zhou
Zengjie Hu
Dailin Li
Jingwen Xu
Kaimin Wang
Wenhao Liu
Tianlong Li
Fengpeng Yue
Feng Hong
Cao Liu
Ke Zeng
Main:8 Pages
6 Figures
Bibliography:2 Pages
7 Tables
Appendix:30 Pages
Abstract

As LLM-based agents are deployed in increasingly complex real-world settings, existing benchmarks underrepresent key challenges such as enforcing global constraints, coordinating multi-tool reasoning, and adapting to evolving user behavior over long, multi-turn interactions. To bridge this gap, we introduce \textbf{TRIP-Bench}, a long-horizon benchmark grounded in realistic travel-planning scenarios. TRIP-Bench leverages real-world data, offers 18 curated tools and 40+ travel requirements, and supports automated evaluation. It includes splits of varying difficulty; the hard split emphasizes long and ambiguous interactions, style shifts, feasibility changes, and iterative version revision. Dialogues span up to 15 user turns, can involve 150+ tool calls, and may exceed 200k tokens of context. Experiments show that even advanced models achieve at most 50\% success on the easy split, with performance dropping below 10\% on hard subsets. We further propose \textbf{GTPO}, an online multi-turn reinforcement learning method with specialized reward normalization and reward differencing. Applied to Qwen2.5-32B-Instruct, GTPO improves constraint satisfaction and interaction robustness, outperforming Gemini-3-Pro in our evaluation. We expect TRIP-Bench to advance practical long-horizon interactive agents, and GTPO to provide an effective online RL recipe for robust long-horizon training.

View on arXiv
Comments on this paper