Agent-Fence: Mapping Security Vulnerabilities Across Deep Research Agents
- AI4CE
Large language models are increasingly deployed as *deep agents* that plan, maintain persistent state, and invoke external tools, shifting safety failures from unsafe text to unsafe *trajectories*. We introduce **AgentFence**, an architecture-centric security evaluation that defines 14 trust-boundary attack classes spanning planning, memory, retrieval, tool use, and delegation, and detects failures via *trace-auditable conversation breaks*: unauthorized or unsafe tool use, wrong-principal actions, state/objective integrity violations, and attack-linked deviations. Holding the base model fixed, we evaluate eight agent archetypes under persistent multi-turn interaction and observe substantial architectural variation in mean security break rate (MSBR), from the lowest rate (LangGraph) to the highest (AutoGPT). The highest-risk classes are operational: Denial-of-Wallet, Authorization Confusion, Retrieval Poisoning, and Planning Manipulation, while prompt-centric classes remain comparatively low under standard settings. Breaks are dominated by boundary violations (SIV 31%, WPA 27%, UTI+UTA 24%, ATD 18%), and authorization confusion correlates with both objective and tool hijacking. AgentFence reframes agent security around what matters operationally: whether an agent stays within its goal and authority envelope over time.
View on arXiv
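
To make the headline metric concrete, here is a minimal sketch of how MSBR and per-class break rates might be computed from audited episode traces. The `Episode` record, its field names, and the per-episode definition of MSBR are assumptions for illustration; the paper may instead average per-step or per-class rates, and this is not AgentFence's actual API.

```python
from dataclasses import dataclass

# Hypothetical trace record: one multi-turn episode, audited for
# trace-auditable breaks. Field names are illustrative, not AgentFence's.
@dataclass
class Episode:
    attack_class: str  # e.g. "denial_of_wallet", "authorization_confusion"
    breaks: list[str]  # audited break types: "SIV", "WPA", "UTI", "UTA", "ATD"

def msbr(episodes: list[Episode]) -> float:
    """Mean security break rate: fraction of episodes with >= 1 audited break.

    One plausible reading of MSBR (assumption); the paper may define it
    differently, e.g. per step rather than per episode.
    """
    if not episodes:
        return 0.0
    broken = sum(1 for e in episodes if e.breaks)
    return broken / len(episodes)

def per_class_rates(episodes: list[Episode]) -> dict[str, float]:
    """Break rate per attack class, mirroring the per-class comparisons above."""
    by_class: dict[str, list[Episode]] = {}
    for e in episodes:
        by_class.setdefault(e.attack_class, []).append(e)
    return {cls: msbr(eps) for cls, eps in by_class.items()}

if __name__ == "__main__":
    eps = [
        Episode("denial_of_wallet", ["SIV", "UTA"]),
        Episode("denial_of_wallet", []),
        Episode("prompt_injection", []),
    ]
    print(f"MSBR: {msbr(eps):.2f}")  # 0.33
    print(per_class_rates(eps))      # {'denial_of_wallet': 0.5, 'prompt_injection': 0.0}
```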