Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning

8 March 2025
Yanjun Chen
Yirong Sun
Xinghao Chen
Jian Wang
Xiaoyu Shen
Wenjie Li
Wei Zhang
Abstract

Chain-of-Thought (CoT) reasoning has proven effective in natural language tasks but remains underexplored in multimodal alignment. This study investigates its integration into 3D vision-language learning by embedding structured reasoning into alignment training. We introduce the 3D-CoT Benchmark, a dataset with hierarchical CoT annotations covering shape recognition, functional inference, and causal reasoning. Through controlled experiments, we compare CoT-structured and standard textual annotations across large reasoning models (LRMs) and large language models (LLMs). Our evaluation employs a dual-layer framework assessing both intermediate reasoning and final inference quality. Extensive experiments demonstrate that CoT significantly improves 3D semantic grounding, with LRMs leveraging CoT more effectively than LLMs. Furthermore, we highlight that annotation structure influences performance: explicit reasoning markers aid LLMs, while unmarked CoT better aligns with LRM inference patterns. Our analyses suggest that CoT is crucial for enhancing multimodal reasoning, with implications beyond 3D tasks. The dataset will be publicly available at this https URL.
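The paper's actual annotation schema is not reproduced here, but the distinction the abstract draws between explicitly marked and unmarked hierarchical CoT annotations can be illustrated with a minimal Python sketch. All field names and marker tags below are assumptions for illustration, not the benchmark's real format:

```python
from dataclasses import dataclass


@dataclass
class CoTAnnotation:
    """A hypothetical 3D-CoT record with the three hierarchical stages
    named in the abstract: shape recognition, functional inference,
    and causal reasoning."""
    shape_recognition: str     # what the object is
    functional_inference: str  # what the object is for
    causal_reasoning: str      # why its form supports that function

    def to_marked(self) -> str:
        # Explicit reasoning markers (the style the abstract reports
        # helps LLMs); the <stepN> tags are invented for this sketch.
        return (f"<step1> {self.shape_recognition} </step1> "
                f"<step2> {self.functional_inference} </step2> "
                f"<step3> {self.causal_reasoning} </step3>")

    def to_unmarked(self) -> str:
        # Free-flowing CoT without markers (the style the abstract
        # reports aligns better with LRM inference patterns).
        return " ".join([self.shape_recognition,
                         self.functional_inference,
                         self.causal_reasoning])


ann = CoTAnnotation(
    shape_recognition="This object is a chair with four legs and a backrest.",
    functional_inference="It is designed for a single person to sit on.",
    causal_reasoning="The backrest and legs jointly provide support and stability.",
)
print(ann.to_marked())
print(ann.to_unmarked())
```

The two serializations carry identical content; only the presence of structural markers differs, which is the variable the controlled experiments manipulate.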

@article{chen2025_2503.06232,
  title={Integrating Chain-of-Thought for Multimodal Alignment: A Study on 3D Vision-Language Learning},
  author={Yanjun Chen and Yirong Sun and Xinghao Chen and Jian Wang and Xiaoyu Shen and Wenjie Li and Wei Zhang},
  journal={arXiv preprint arXiv:2503.06232},
  year={2025}
}