
DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments

31 May 2025
Chiyu Zhang
Marc-Alexandre Cote
Michael Albada
Anush Sankaran
Jack W. Stokes
Tong Wang
Amir H. Abdi
William Blum
Muhammad Abdul-Mageed
Communities: LLMAG, AAML, ELM
Main: 9 pages · 2 figures · 3 tables · Bibliography: 5 pages · Appendix: 2 pages
Abstract

Large language model (LLM) agents have shown impressive capabilities in human language comprehension and reasoning, yet their potential in cybersecurity remains underexplored. We introduce DefenderBench, a practical, open-source toolkit for evaluating language agents across offense, defense, and cybersecurity knowledge-based tasks. DefenderBench includes environments for network intrusion, malicious content detection, code vulnerability analysis, and cybersecurity knowledge assessment. It is intentionally designed to be affordable and easily accessible for researchers while providing fair and rigorous assessment. We benchmark several state-of-the-art (SoTA) and popular LLMs, including both open- and closed-weight models, using a standardized agentic framework. Our results show that Claude-3.7-sonnet performs best with a DefenderBench score of 81.65, followed by Claude-3.7-sonnet-think with 78.40, while the best open-weight model, Llama 3.3 70B, is not far behind with a DefenderBench score of 71.81. DefenderBench's modular design allows seamless integration of custom LLMs and tasks, promoting reproducibility and fair comparisons. An anonymized version of DefenderBench is available at this https URL.
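The abstract's claim that custom LLMs and tasks integrate seamlessly into a standardized agentic framework suggests an environment/agent interface along the following lines. This is a minimal Python sketch only, not DefenderBench's actual API: the names CyberEnv, LLMAgent, and run_episode are illustrative assumptions introduced here to show how a modular benchmark can decouple tasks from models.

# Hypothetical sketch: DefenderBench's real API may differ. This only
# illustrates the modular environment/agent split the abstract describes.
from abc import ABC, abstractmethod


class CyberEnv(ABC):
    """Illustrative text-based task environment (name is an assumption)."""

    @abstractmethod
    def reset(self) -> str:
        """Return the task's initial observation/instructions."""

    @abstractmethod
    def step(self, action: str) -> tuple[str, float, bool]:
        """Apply one agent action; return (observation, reward, done)."""


class LLMAgent(ABC):
    """Illustrative wrapper around any open- or closed-weight model."""

    @abstractmethod
    def act(self, observation: str) -> str:
        """Map the latest observation to the next action string."""


def run_episode(env: CyberEnv, agent: LLMAgent, max_steps: int = 30) -> float:
    """Drive one agent/environment episode and return the total reward."""
    obs, total, done = env.reset(), 0.0, False
    for _ in range(max_steps):
        obs, reward, done = env.step(agent.act(obs))
        total += reward
        if done:
            break
    return total

Under this kind of interface, each of the paper's task families (network intrusion, malicious content detection, code vulnerability analysis, knowledge assessment) would be one CyberEnv subclass, and swapping in a new model means implementing only LLMAgent.act, which is what makes fair, reproducible comparisons across models straightforward.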

@article{zhang2025_2506.00739,
  title={DefenderBench: A Toolkit for Evaluating Language Agents in Cybersecurity Environments},
  author={Chiyu Zhang and Marc-Alexandre Cote and Michael Albada and Anush Sankaran and Jack W. Stokes and Tong Wang and Amir Abdi and William Blum and Muhammad Abdul-Mageed},
  journal={arXiv preprint arXiv:2506.00739},
  year={2025}
}