Evaluating LLaMA 3.2 for Software Vulnerability Detection
Deep Learning (DL) has emerged as a powerful tool for vulnerability detection, often outperforming traditional solutions. However, developing effective DL models requires large amounts of real-world data, which can be difficult to obtain in sufficient quantities. To address this challenge, the DiverseVul dataset was curated as the largest dataset of vulnerable and non-vulnerable C/C++ functions extracted exclusively from real-world projects. Its goal is to provide high-quality, large-scale samples for training DL models. However, while applying pre-processing techniques during our study, we identified several inconsistencies in the raw dataset, highlighting the need for a refined version. In this work, we present a refined version of the DiverseVul dataset, which is used to fine-tune a large language model, LLaMA 3.2, for vulnerability detection. Experimental results show that the pre-processing techniques improved performance, with the model achieving an F1-Score of 66% in software vulnerability detection, a competitive result compared to our baseline, which achieved an F1-Score of 47%.
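The abstract does not include implementation details, but a fine-tuning pipeline of this kind typically resembles the following minimal sketch, assuming Hugging Face Transformers, PEFT (LoRA), and a JSONL export of the refined dataset with func/target fields; the checkpoint name, field names, file paths, hyperparameters, and LoRA configuration are illustrative assumptions, not the authors' setup.

# Hypothetical sketch (not the authors' code): binary vulnerability
# classification of C/C++ functions by fine-tuning a LLaMA 3.2 checkpoint.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL = "meta-llama/Llama-3.2-1B"  # assumed checkpoint; gated on Hugging Face

# Assumed JSONL export of the refined dataset with "func" (source code) and
# "target" (0 = non-vulnerable, 1 = vulnerable) fields.
data = load_dataset("json", data_files={"train": "train.jsonl", "test": "test.jsonl"})

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token  # LLaMA tokenizers ship no pad token

def tokenize(batch):
    enc = tokenizer(batch["func"], truncation=True, max_length=1024)
    enc["labels"] = batch["target"]
    return enc

data = data.map(tokenize, batched=True, remove_columns=data["train"].column_names)

model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# LoRA adapters keep the fine-tuning tractable; rank and target modules are illustrative.
model = get_peft_model(
    model,
    LoraConfig(task_type="SEQ_CLS", r=8, lora_alpha=16, lora_dropout=0.05,
               target_modules=["q_proj", "v_proj"]),
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"f1": f1_score(labels, logits.argmax(axis=-1))}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="llama32-vuln", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=2e-4),
    train_dataset=data["train"],
    eval_dataset=data["test"],
    tokenizer=tokenizer,  # default collator pads batches dynamically
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())  # reports F1 on the held-out split

LoRA is used here only to keep the example runnable on a single GPU; the paper's actual fine-tuning strategy, pre-processing steps, and hyperparameters may differ.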
@article{gonçalves2025_2503.07770,
  title={Evaluating LLaMA 3.2 for Software Vulnerability Detection},
  author={José Gonçalves and Miguel Silva and Bernardo Cabral and Tiago Dias and Eva Maia and Isabel Praça and Ricardo Severino and Luís Lino Ferreira},
  journal={arXiv preprint arXiv:2503.07770},
  year={2025}
}