
From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Yizhi Li
Shark Liu
Xianzhen Luo
Yuyu Luo
Changzai Pan
Ensheng Shi
Yingshui Tan
Renshuai Tao
Jiajun Wu
Xianjie Wu
Zhenhe Wu
Daoguang Zan
Chenchen Zhang
Wei Zhang
He Zhu
Terry Yue Zhuo
Kerui Cao
Xianfu Cheng
Jun Dong
Shengjie Fang
Zhiwei Fei
Xiangyuan Guan
Qipeng Guo
Zhiguang Han
Joseph James
Tianqi Luo
Renyuan Li
Yuhang Li
Yiming Liang
Congnan Liu
Jiaheng Liu
Qian Liu
Ruitong Liu
Tyler Loakman
Xiangxin Meng
Chuang Peng
Tianhao Peng
Jiajun Shi
Mingjie Tang
Boyang Wang
Haowen Wang
Yunli Wang
Fanglin Xu
Zihan Xu
Fei Yuan
Ge Zhang
Jiayi Zhang
Xinhao Zhang
Wangchunshu Zhou
Hualei Zhu
King Zhu
Brown Dai
Aishan Liu
Zhoujun Li
Chenghua Lin
Tianyu Liu
Chao Peng
Kai Shen
Libo Qin
Shuangyong Song
Zizheng Zhan
Jiajun Zhang
Jie Zhang
Zhaoxiang Zhang
Bo Zheng
Main: 20 pages, 10 figures, 7 tables; Bibliography: 9 pages
Abstract

Large language models (LLMs) have fundamentally transformed automated software development by enabling the direct translation of natural language descriptions into functional code, driving commercial adoption through tools like GitHub Copilot (Microsoft), Cursor (Anysphere), Trae (ByteDance), and Claude Code (Anthropic). The field has evolved dramatically from rule-based systems to Transformer-based architectures, with success rates on benchmarks like HumanEval improving from single digits to over 95%. In this work, we provide a comprehensive synthesis of and practical guide to code LLMs, supported by a series of analytic and probing experiments, systematically examining the complete model life cycle from data curation to post-training, covering advanced prompting paradigms, code pre-training, supervised fine-tuning, reinforcement learning, and autonomous coding agents. We analyze the coding capabilities of general-purpose LLMs (GPT-4, Claude, LLaMA) and code-specialized LLMs (StarCoder, Code LLaMA, DeepSeek-Coder, and QwenCoder), critically assessing their techniques, design decisions, and trade-offs. Further, we articulate the research-practice gap between academic research (e.g., benchmarks and tasks) and real-world deployment (e.g., software-related code tasks), spanning code correctness, security, contextual awareness of large codebases, and integration with development workflows, and map promising research directions to practical needs. Finally, we conduct a series of experiments to provide a comprehensive analysis of code pre-training, supervised fine-tuning, and reinforcement learning, covering scaling laws, framework selection, hyperparameter sensitivity, model architectures, and dataset comparisons.
