PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training

9 March 2025
Cong Chen
Mingyu Liu
Chenchen Jing
Yizhou Zhou
Fengyun Rao
Hao Chen
Bo Zhang
Chunhua Shen
    MLLM
    AAML
    VLM
Abstract

This paper addresses the challenge of hallucinations in Multimodal Large Language Models (MLLMs), particularly for dense image captioning tasks. We first identify the lack of a metric that finely measures caption quality at the concept level, and introduce HalFscore, a novel metric built upon a language graph that evaluates both the accuracy and completeness of dense captions at a granular level. We further identify the root cause of hallucination as the model's over-reliance on its language prior. To address this, we propose PerturboLLaVA, which reduces this reliance by incorporating adversarially perturbed text during training. This method enhances the model's focus on visual inputs, effectively reducing hallucinations and producing accurate, image-grounded descriptions without incurring additional computational overhead. PerturboLLaVA significantly improves the fidelity of generated captions, outperforming existing approaches in handling multimodal hallucinations and achieving improved performance across general multimodal benchmarks.
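To make the idea of a concept-level caption metric concrete, here is a minimal, hypothetical sketch: it assumes captions have already been parsed into sets of concepts, and scores precision (penalizing hallucinated concepts) against recall (penalizing omissions) via an F-score. The actual HalFscore matches nodes on a language graph and is considerably more involved; the function name and inputs below are illustrative, not the paper's implementation.

```python
def concept_f_score(predicted, reference):
    """Toy concept-level F-score. Precision measures accuracy (no
    hallucinated concepts); recall measures completeness (no missing
    concepts). Illustrative only -- not the paper's HalFscore."""
    predicted, reference = set(predicted), set(reference)
    if not predicted or not reference:
        return 0.0
    matched = predicted & reference
    precision = len(matched) / len(predicted)
    recall = len(matched) / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A caption that hallucinates "umbrella" and omits "leash"
# scores 2/3 on both precision and recall.
pred = {"dog", "park", "umbrella"}
ref = {"dog", "park", "leash"}
print(round(concept_f_score(pred, ref), 3))  # 0.667
```

A single F-score of this kind captures the paper's framing that dense-caption quality has two failure modes: describing things not in the image, and failing to describe things that are.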

@article{chen2025_2503.06486,
  title={PerturboLLaVA: Reducing Multimodal Hallucinations with Perturbative Visual Training},
  author={Cong Chen and Mingyu Liu and Chenchen Jing and Yizhou Zhou and Fengyun Rao and Hao Chen and Bo Zhang and Chunhua Shen},
  journal={arXiv preprint arXiv:2503.06486},
  year={2025}
}