We propose OmniCaptioner, a versatile visual captioning framework for generating fine-grained textual descriptions across a wide variety of visual domains. Unlike prior methods limited to specific image types (e.g., natural images or geometric visuals), our framework provides a unified solution for captioning natural images, visual text (e.g., posters, UIs, textbooks), and structured visuals (e.g., documents, tables, charts). By converting low-level pixel information into semantically rich textual representations, our framework bridges the gap between visual and textual modalities. Our results highlight three key advantages: (i) Enhanced Visual Reasoning with LLMs, where long-context captions of visual modalities empower LLMs, particularly the DeepSeek-R1 series, to reason effectively in multimodal scenarios; (ii) Improved Image Generation, where detailed captions improve tasks such as text-to-image generation and image transformation; and (iii) Efficient Supervised Fine-Tuning (SFT), which enables faster convergence with less data. We believe the versatility and adaptability of OmniCaptioner can offer a new perspective for bridging the gap between the language and visual modalities.
@article{lu2025_2504.07089,
  title={OmniCaptioner: One Captioner to Rule Them All},
  author={Yiting Lu and Jiakang Yuan and Zhen Li and Shitian Zhao and Qi Qin and Xinyue Li and Le Zhuo and Licheng Wen and Dongyang Liu and Yuewen Cao and Xiangchao Yan and Xin Li and Tianshuo Peng and Shufei Zhang and Botian Shi and Tao Chen and Zhibo Chen and Lei Bai and Bo Zhang and Peng Gao},
  journal={arXiv preprint arXiv:2504.07089},
  year={2025}
}