
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Yizhi Li
Shark Liu
Xianzhen Luo
Yuyu Luo
Changzai Pan
Ensheng Shi
Yingshui Tan
Renshuai Tao
Jiajun Wu
Xianjie Wu
Zhenhe Wu
Daoguang Zan
Chenchen Zhang
Wei Zhang
He Zhu
Terry Yue Zhuo
Kerui Cao
Xianfu Cheng
Jun Dong
Shengjie Fang
Zhiwei Fei
Xiangyuan Guan
Qipeng Guo
Zhiguang Han
Joseph James
Tianqi Luo
Renyuan Li
Yuhang Li
Yiming Liang
Congnan Liu
Jiaheng Liu
Qian Liu
Ruitong Liu
Tyler Loakman
Xiangxin Meng
Chuang Peng
Tianhao Peng
Jiajun Shi
Mingjie Tang
Boyang Wang
Haowen Wang
Yunli Wang
Fanglin Xu
Zihan Xu
Fei Yuan
Ge Zhang
Jiayi Zhang
Xinhao Zhang
Wangchunshu Zhou
Hualei Zhu
King Zhu
Brown Dai
Aishan Liu
Zhoujun Li
Chenghua Lin
Tianyu Liu
Chao Peng
Kai Shen
Libo Qin
Shuangyong Song
Zizheng Zhan
Jiajun Zhang
Jie Zhang
Zhaoxiang Zhang
Bo Zheng
Main: 20 pages, 10 figures, 7 tables; Bibliography: 9 pages
Abstract

Large language models (LLMs) have fundamentally transformed automated software development by enabling the direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, with success rates on benchmarks like HumanEval improving from single digits to over 95%. In this work, we provide a comprehensive synthesis of and practical guide to code LLMs, supported by a series of analytic and probing experiments, systematically examining the complete model life cycle from data curation to post-training, covering advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the coding capabilities of general-purpose LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically assessing their techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), spanning code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Finally, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling laws, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
