CyberExplorer: Benchmarking LLM Offensive Security Capabilities in a Real-World Attacking Simulation Environment

8 February 2026

Nanda Rani

Kimberly Milner

Minghao Shao

Meet Udeshi

Haoran Xi

Venkata Sai Charan Putrevu

Saksham Aggarwal

Sandeep K. Shukla

Prashanth Krishnamurthy

Farshad Khorrami

Muhammad Shafique

Ramesh Karri

Main:8 Pages

17 Figures

Bibliography:2 Pages

15 Tables

Appendix:8 Pages

Abstract

Real-world offensive security operations are inherently open-ended: attackers explore unknown attack surfaces, revise hypotheses under uncertainty, and operate without guaranteed success. Existing LLM-based offensive agent evaluations rely on closed-world settings with predefined goals and binary success criteria. To address this gap, we introduce CyberExplorer, an evaluation suite with two core components: (1) an open-environment benchmark built on a virtual machine hosting 40 vulnerable web services derived from real-world CTF challenges, where agents autonomously perform reconnaissance, target selection, and exploitation without prior knowledge of vulnerability locations; and (2) a reactive multi-agent framework supporting dynamic exploration without predefined plans. CyberExplorer enables fine-grained evaluation beyond flag recovery, capturing interaction dynamics, coordination behavior, failure modes, and vulnerability discovery signals, bridging the gap between benchmarks and realistic multi-target attack scenarios.
