Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement

24 May 2024
Xiyao Wang, Jiuhai Chen, Zhaoyang Wang, Yuhang Zhou, Yiyang Zhou, Huaxiu Yao, Tianyi Zhou, Tom Goldstein, Parminder Bhatia, Furong Huang, Cao Xiao
Abstract

Large vision-language models (LVLMs) have achieved impressive results in visual question-answering and reasoning tasks through vision instruction tuning on specific datasets. However, there remains significant room for improvement in aligning visual and language modalities. Existing methods often depend on external models or data, leading to uncontrollable and unstable alignment results. In this paper, we propose SIMA, a self-improvement framework that enhances visual and language modality alignment without external dependencies. SIMA leverages existing vision instruction tuning datasets to self-generate responses, incorporating an in-context self-critic mechanism that constructs preference pairs for tuning. Crucially, our approach allows LVLMs to act as critics by designing effective critic prompts, eliminating the need for additional fine-tuning with external instruction data. We introduce three novel visual metrics within the self-critic process to guide judgment, significantly improving the accuracy of self-criticism. Through extensive experiments across 14 hallucination and comprehensive benchmarks, we demonstrate that SIMA significantly improves LVLMs' performance and outperforms previous approaches, achieving superior modality alignment.
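The abstract describes a generate-then-critique loop: the LVLM samples candidate responses to prompts from its existing vision instruction data, critiques them in context with a critic prompt, and the resulting preferred/rejected pairs drive preference tuning. The sketch below only illustrates that loop under assumptions; the names (PreferencePair, self_improve, the generate/critic/tune callables) and the DPO-style tuning step are hypothetical and are not taken from the paper.

```python
# Hedged sketch of a SIMA-style self-improvement loop, NOT the authors' code.
# All identifiers here are illustrative; the paper's actual critic prompts,
# visual metrics, and tuning recipe are defined in the full text.
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class PreferencePair:
    image: str      # image path or identifier
    prompt: str     # vision-instruction question
    chosen: str     # response the self-critic preferred
    rejected: str   # response the self-critic rejected


def self_improve(
    dataset: List[Tuple[str, str]],                   # (image, prompt) pairs from existing instruction data
    generate: Callable[[str, str, int], List[str]],   # LVLM sampling k candidate responses
    critic: Callable[[str, str, List[str]], int],     # same LVLM ranking candidates via a critic prompt
    tune: Callable[[List[PreferencePair]], None],     # preference-optimization step (e.g., DPO-style, assumed)
    k: int = 2,
) -> List[PreferencePair]:
    """Self-generate responses, self-critique them, and tune on the resulting pairs."""
    pairs: List[PreferencePair] = []
    for image, prompt in dataset:
        candidates = generate(image, prompt, k)       # step 1: sample k candidate responses
        best = critic(image, prompt, candidates)      # step 2: in-context self-critic picks the best index
        for i, candidate in enumerate(candidates):
            if i != best:
                pairs.append(PreferencePair(image, prompt, candidates[best], candidate))
    tune(pairs)                                       # step 3: preference tuning on (chosen, rejected) pairs
    return pairs
```

In this reading, no external model or dataset enters the loop: the same LVLM plays generator and critic, which is the "self-improvement" the abstract emphasizes.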

@article{wang2024_2405.15973,
  title={Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement},
  author={Xiyao Wang and Jiuhai Chen and Zhaoyang Wang and Yuhang Zhou and Yiyang Zhou and Huaxiu Yao and Tianyi Zhou and Tom Goldstein and Parminder Bhatia and Furong Huang and Cao Xiao},
  journal={arXiv preprint arXiv:2405.15973},
  year={2024}
}