Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training

Large Language Models (LLMs) have shown remarkable advancements in specialized fields such as finance, law, and medicine. In cybersecurity, however, open-source datasets remain scarce, and high-quality pretraining corpora are especially lacking, even though much research indicates that LLMs acquire their knowledge during pretraining. To address this, we present a comprehensive suite of datasets covering all major training stages, including pretraining, instruction fine-tuning, and reasoning distillation with cybersecurity-specific self-reflection data. Extensive ablation studies demonstrate their effectiveness on public cybersecurity benchmarks. In particular, continual pre-training on our dataset yields a 15.88% improvement in the aggregate score, while reasoning distillation leads to a 10% gain on a security certification benchmark (CISSP). We will release all datasets and trained cybersecurity LLMs under the ODC-BY and MIT licenses to encourage further research in the community. For access to all datasets and model weights, please refer to this https URL.
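
The continual pre-training step described above is a standard causal-language-modeling run over the cybersecurity corpus. The sketch below illustrates how such a run might look with the Hugging Face Trainer; the base model and dataset identifiers are placeholders, not the released Primus artifacts, so substitute the paths from the authors' release.

```python
# Minimal sketch of continual pre-training on a cybersecurity corpus.
# NOTE: "your-org/cybersecurity-pretrain" and the base model are placeholders
# (assumptions), not the actual Primus dataset or model names.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "meta-llama/Llama-3.1-8B"          # any open base LLM
corpus_id = "your-org/cybersecurity-pretrain"   # placeholder dataset path

tokenizer = AutoTokenizer.from_pretrained(base_model)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token   # some tokenizers lack a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

raw = load_dataset(corpus_id, split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names)

# Causal LM objective (mlm=False): the same next-token loss used in pretraining.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

args = TrainingArguments(
    output_dir="cybersecurity-cpt",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    logging_steps=50,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
).train()
```

The same pipeline extends to the later stages: instruction fine-tuning swaps the raw corpus for prompt-response pairs, and reasoning distillation trains on teacher-generated chains of thought with self-reflection traces; the exact data formats are defined by the released datasets.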
@article{yu2025_2502.11191,
  title={Primus: A Pioneering Collection of Open-Source Datasets for Cybersecurity LLM Training},
  author={Yao-Ching Yu and Tsun-Han Chiang and Cheng-Wei Tsai and Chien-Ming Huang and Wen-Kwang Tsao},
  journal={arXiv preprint arXiv:2502.11191},
  year={2025}
}