Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision
- MLLM
- LRM
Main: 10 Pages
9 Figures
Bibliography: 4 Pages
7 Tables
Appendix: 4 Pages
Abstract
Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, because such reasoning often requires interpreting transitions between visual states. Existing methods struggle with this, either because they have limited capacity for modeling visual state transitions or because fragmented architectures produce incoherent visual trajectories.
