Evaluating Large Language Models for Line-Level Vulnerability Localization
Recently, Automated Vulnerability Localization (AVL) has attracted growing attention, aiming to facilitate diagnosis by pinpointing the specific lines of code responsible for vulnerabilities. Large Language Models (LLMs) have shown potential in various domains, yet their effectiveness in line-level vulnerability localization remains underexplored.In this work, we present the first comprehensive empirical evaluation of LLMs for AVL. Our study examines 19 leading LLMs suitable for code analysis, including ChatGPT and multiple open-source models, spanning encoder-only, encoder-decoder, and decoder-only architectures, with model sizes from 60M to 70B parameters. We evaluate three paradigms including few-shot prompting, discriminative fine-tuning, and generative fine-tuning with and without Low-Rank Adaptation (LoRA), on both a BigVul-derived dataset for C/C++ and a smart contract vulnerability dataset.}Our results show that discriminative fine-tuning achieves substantial performance gains over existing learning-based AVL methods when sufficient training data is available. In low-data settings, prompting advanced LLMs such as ChatGPT proves more effective. We also identify challenges related to input length and unidirectional context during fine-tuning, and propose two remedial strategies: a sliding window approach and right-forward embedding, both of which yield significant improvements. Moreover, we provide the first assessment of LLM generalizability in AVL, showing that certain models can transfer effectively across Common Weakness Enumerations (CWEs) and projects. However, performance degrades notably for newly discovered vulnerabilities containing unfamiliar lexical or structural patterns, underscoring the need for continual adaptation.
View on arXiv