
CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment

Nanda Rani
Kimberly Milner
Minghao Shao
Meet Udeshi
Haoran Xi
Venkata Sai Charan Putrevu
Saksham Aggarwal
Sandeep K. Shukla
Prashanth Krishnamurthy
Farshad Khorrami
Muhammad Shafique
Ramesh Karri
Main: 8 pages; Bibliography: 2 pages; Appendix: 8 pages; 17 figures; 15 tables
Abstract

Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate without guaranteed success. Existing LLM-based offensive agent evaluations rely on closed-world settings with predefined goals and binary success criteria. To address this gap, we introduce CyberExplorer, an evaluation suite with two core components: (1) an open-environment benchmark built on a virtual machine hosting 40 vulnerable web services derived from real-world CTF challenges, where agents autonomously perform reconnaissance, target selection, and exploitation without prior knowledge of vulnerability locations; and (2) a reactive multi-agent framework supporting dynamic exploration without predefined plans. CyberExplorer enables fine-grained evaluation beyond flag recovery, capturing interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals, thereby bridging the gap between benchmarks and realistic multi-target attack scenarios.
