
Towards Faithful Reasoning in Comics for Small MLLMs

Chengcheng Feng
Haojie Yin
Yucheng Jin
Kaizhu Huang
Main: 8 pages · 6 figures · Bibliography: 3 pages · 26 tables · Appendix: 22 pages
Abstract

Comic understanding presents a significant challenge for Multimodal Large Language Models (MLLMs), as the intended meaning of a comic often emerges from the joint interpretation of visual, textual, and social cues. This naturally motivates Chain-of-Thought (CoT) prompting, since explicit intermediate reasoning appears promising for integrating such heterogeneous signals. However, existing CoT methods are poorly matched to this structure: they tend to force interpretation into a single reasoning path before multiple cues have been jointly considered, often degrading performance, especially for small MLLMs. Our key idea is to explicitly preserve multi-cue interpretation during supervision construction, rather than collapsing comic understanding into a single reasoning chain. To this end, we propose a two-stage framework for faithful comic reasoning in small MLLMs. First, we introduce MoCoT, a modular supervision construction framework that preserves multi-cue interpretation and turns it into more faithful supervision. Second, we propose VERA, a structured reward mechanism that converts such supervision into faithful reasoning behavior by aligning optimization with both reasoning faithfulness and answer correctness. Extensive experiments on five benchmarks spanning comic understanding and broader humor-centric and abstract visual reasoning tasks demonstrate that our framework achieves strong results in the ≤4B regime, surpasses several 7B baselines, improves four small MLLMs by an average of 12.1% as a plug-in, and consistently enhances reasoning faithfulness while preserving inference efficiency.
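The abstract describes VERA only at a high level. As a rough illustration, the sketch below shows what a structured reward combining reasoning faithfulness with answer correctness could look like. All names (RolloutSample, structured_reward), the equal weighting, and the cue-overlap heuristic for faithfulness are assumptions made for illustration, not the paper's actual implementation.

# Hypothetical sketch of a structured reward in the spirit of VERA:
# a scalar reward that mixes a reasoning-faithfulness score with
# answer correctness. Names, weights, and scoring heuristics are
# illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass


@dataclass
class RolloutSample:
    reasoning: str    # model's intermediate reasoning trace
    answer: str       # model's final answer
    gold_answer: str  # reference answer
    cues: list[str]   # reference cues (visual/textual/social) the trace should ground


def correctness_reward(sample: RolloutSample) -> float:
    """1.0 if the final answer matches the reference, else 0.0."""
    return float(sample.answer.strip().lower() == sample.gold_answer.strip().lower())


def faithfulness_reward(sample: RolloutSample) -> float:
    """Fraction of reference cues actually mentioned in the reasoning trace.
    A crude stand-in for a learned or rule-based faithfulness verifier."""
    if not sample.cues:
        return 0.0
    trace = sample.reasoning.lower()
    hits = sum(cue.lower() in trace for cue in sample.cues)
    return hits / len(sample.cues)


def structured_reward(sample: RolloutSample,
                      w_faithful: float = 0.5,
                      w_correct: float = 0.5) -> float:
    """Weighted combination used as the scalar reward during RL fine-tuning."""
    return (w_faithful * faithfulness_reward(sample)
            + w_correct * correctness_reward(sample))


if __name__ == "__main__":
    sample = RolloutSample(
        reasoning=("The caption contradicts the drawing: the sign says 'quiet' "
                   "while the crowd is cheering, which is the joke."),
        answer="irony",
        gold_answer="irony",
        cues=["caption", "sign", "crowd"],
    )
    print(structured_reward(sample))  # 0.5 * 1.0 + 0.5 * 1.0 = 1.0

The design point this sketch tries to capture is that a rollout earning the correctness term alone is not enough: the reward only saturates when the reasoning trace also grounds the reference cues, which is what pushes optimization toward faithful rather than shortcut reasoning.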
