Can Vision-Language Models Handle Long-Context Code? An Empirical Study on Visual Compression

31 January 2026

Jianping Zhong

Guochang Li

Chen Zhi

Junxiao Han

Zhen Qin

Xinkui Zhao

Nan Wang

Shuiguang Deng

Jianwei Yin

VLM

ArXiv (abs)PDF HTML

Main:18 Pages

3 Figures

Bibliography:3 Pages

12 Tables

Abstract

Large Language Models (LLMs) struggle with long-context code due to window limitations. Existing textual code compression methods mitigate this via selective filtering but often disrupt dependency closure, causing semantic fragmentation. To address this, we introduce LongCodeOCR, a visual compression framework that renders code into compressed two-dimensional image sequences for Vision-Language Models (VLMs). By preserving a global view, this approach avoids the dependency breakage inherent in filtering. We systematically evaluate LongCodeOCR against the state-of-the-art LongCodeZip across four benchmarks spanning code summarization, code question answering, and code completion.Our results demonstrate that visual code compression serves as a viable alternative for tasks requiring global understanding. At comparable compression ratios ( $\sim$ 1.7 $\times$ ), LongCodeOCR improves CompScore on Long Module Summarization by 36.85 points over LongCodeZip. At a 1M-token context length with Glyph (a specialized 9B VLM), LongCodeOCR maintains higher accuracy than LongCodeZip while operating at about 4 $\times$ higher compression. Moreover, compared with LongCodeZip, LongCodeOCR drastically reduces compression-stage overhead (reducing latency from $\sim$ 4.3 hours to $\sim$ 1 minute at 1M tokens). Finally, our results characterize a fundamental coverage--fidelity trade-off: visual code compression retains broader context coverage to support global dependencies, yet faces fidelity bottlenecks on exactness-critical tasks; by contrast, textual code compression preserves symbol-level precision while sacrificing structural coverage.

View on arXiv

Comments on this paper