DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model
Recent advances in vision-language models (VLMs) have enabled end-to-end document parsing and understanding, achieving strong performance on diverse optical character recognition (OCR) tasks. However, VLMs are prone to generating words that do not exist in the input image due to over-reliance on language priors. By contrast, traditional OCR models, whose architectures are tailored for specific recognition tasks, often achieve stronger fine-grained visual perception with fewer hallucinations, but they typically lack the contextual semantic understanding and reasoning capabilities needed in more challenging cases. To bridge this gap, we propose DianJin-OCR-R1, a reasoning-enhanced recognition framework that trains VLMs in a reasoning-and-tool interleaved paradigm. Our DianJin-OCR-R1 model first recognizes the content of the input image using its own OCR capabilities, then calls expert models to obtain additional results as references. It is then guided to "look again" at the image and compare its own recognized content with the reference results to find errors or omissions. Finally, it integrates all available evidence to generate a more accurate output. This design teaches the model to implicitly re-focus on the visual input and to effectively leverage the outputs of expert models for better performance. We evaluate DianJin-OCR-R1 on ReST and OmniDocBench, where it consistently outperforms both its non-reasoning counterparts and the expert models, demonstrating the effectiveness of our method.
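The interleaved pipeline described above can be sketched in a few lines. Everything here is an illustrative assumption: the function names, the stub recognizers, and the per-token majority vote (the actual model reconciles results through guided reasoning over the image, not voting).

```python
# Hypothetical sketch of the DianJin-OCR-R1 inference loop from the abstract.
# All names and the voting heuristic are illustrative assumptions, not the
# paper's actual prompting or tool interface.
from collections import Counter
from typing import Callable, List


def vlm_recognize(image: str) -> str:
    # Stand-in for the VLM's own OCR pass; here it hallucinates '1' for 'l'.
    return "Tota1 due: $42"


def expert_ocr_a(image: str) -> str:
    # Stub expert model with stronger fine-grained perception.
    return "Total due: $42"


def expert_ocr_b(image: str) -> str:
    return "Total due: $42"


def look_again_and_merge(image: str, own: str, references: List[str]) -> str:
    # Toy "look again" step: per-token majority vote between the model's own
    # output and the expert references to surface errors or omissions.
    candidates = [own] + references
    tokens = [c.split() for c in candidates]
    merged = [Counter(column).most_common(1)[0][0] for column in zip(*tokens)]
    return " ".join(merged)


def dianjin_ocr_r1(image: str, experts: List[Callable[[str], str]]) -> str:
    own = vlm_recognize(image)                     # step 1: own OCR pass
    refs = [expert(image) for expert in experts]   # step 2: call expert tools
    return look_again_and_merge(image, own, refs)  # steps 3-4: compare, merge


print(dianjin_ocr_r1("receipt.png", [expert_ocr_a, expert_ocr_b]))
# → Total due: $42
```

The vote corrects the hallucinated "Tota1" because both expert references agree on "Total"; in the real system this reconciliation is learned, with the model re-examining the image rather than trusting majority agreement.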