On Benchmarking Code LLMs for Android Malware Analysis

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities in various code intelligence tasks. However, their effectiveness for Android malware analysis remains underexplored. Decompiled Android code poses unique challenges for analysis, primarily due to its large volume of functions and the frequent absence of meaningful function names. This paper presents Cama, a benchmarking framework designed to systematically evaluate the effectiveness of Code LLMs in Android malware analysis tasks. Cama specifies structured model outputs (comprising function summaries, refined function names, and maliciousness scores) to support key malware analysis tasks, including malicious function identification and malware purpose summarization. Building on these outputs, it integrates three domain-specific evaluation metrics (consistency, fidelity, and semantic relevance), enabling rigorous stability and effectiveness assessment and cross-model comparison. We construct a benchmark dataset consisting of 118 Android malware samples, encompassing over 7.5 million distinct functions, and use Cama to evaluate four popular open-source models. Our experiments provide insights into how Code LLMs interpret decompiled code and quantify their sensitivity to function renaming, highlighting both the potential and current limitations of Code LLMs in malware analysis tasks.
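The structured per-function output described above can be pictured as a small record type. The sketch below is a hypothetical illustration (the field names, score range, and threshold are assumptions, not taken from the paper) of how a function summary, refined name, and maliciousness score might be combined to support malicious function identification:

```python
from dataclasses import dataclass

# Hypothetical sketch of a structured per-function output for a framework
# like Cama. Field names, score range, and threshold are illustrative only.
@dataclass
class FunctionAnalysis:
    original_name: str    # name from the decompiled APK (often meaningless)
    refined_name: str     # descriptive function name proposed by the LLM
    summary: str          # one-sentence function summary from the LLM
    maliciousness: float  # assumed score in [0.0, 1.0]

    def is_suspicious(self, threshold: float = 0.5) -> bool:
        """Flag this function for malicious-function identification."""
        return self.maliciousness >= threshold


# Example: a decompiled function the model believes exfiltrates SMS data.
fn = FunctionAnalysis(
    original_name="a.b.c",
    refined_name="sendSmsToRemoteServer",
    summary="Reads inbox SMS messages and POSTs them to a hardcoded URL.",
    maliciousness=0.92,
)
print(fn.is_suspicious())  # → True
```

Aggregating such records over all functions in a sample would then support the malware purpose summarization task the abstract mentions.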

@article{he2025_2504.00694,
  title={On Benchmarking Code LLMs for Android Malware Analysis},
  author={Yiling He and Hongyu She and Xingzhi Qian and Xinran Zheng and Zhuo Chen and Zhan Qin and Lorenzo Cavallaro},
  journal={arXiv preprint arXiv:2504.00694},
  year={2025}
}