Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design

14 April 2025
Andreas Happe
Jürgen Cito
Abstract

Large Language Models (LLMs) have emerged as a powerful approach for driving offensive penetration-testing tooling. This paper analyzes the methodology and benchmarking practices used for evaluating LLM-driven attacks, focusing on offensive uses of LLMs in cybersecurity. We review 16 research papers detailing 15 prototypes and their respective testbeds. We detail our findings and provide actionable recommendations for future research, emphasizing the importance of extending existing testbeds, creating baselines, and including comprehensive metrics and qualitative analysis. We also note the distinction between security research and practice, suggesting that CTF-based challenges may not fully represent real-world penetration-testing scenarios.

View on arXiv
@article{happe2025_2504.10112,
  title={Benchmarking Practices in LLM-driven Offensive Security: Testbeds, Metrics, and Experiment Design},
  author={Andreas Happe and Jürgen Cito},
  journal={arXiv preprint arXiv:2504.10112},
  year={2025}
}