Uni-CoT: Towards Unified Chain-of-Thought Reasoning Across Text and Vision

Main: 10 Pages
9 Figures
Bibliography: 4 Pages
7 Tables
Appendix: 4 Pages
Abstract

Chain-of-Thought (CoT) reasoning has been widely adopted to enhance Large Language Models (LLMs) by decomposing complex tasks into simpler, sequential subtasks. However, extending CoT to vision-language reasoning tasks remains challenging, as it often requires interpreting transitions of visual states to support reasoning. Existing methods often struggle with this, either because they have limited capacity to model visual state transitions or because fragmented architectures produce incoherent visual trajectories.