Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure

14 April 2025
Théo Gigant
Camille Guinaudeau
Frédéric Dufaux
Abstract

Vision-Language Models (VLMs) can process visual and textual information in multiple formats: text, images, interleaved text and images, or even hour-long videos. In this work, we conduct fine-grained quantitative and qualitative analyses of automatic summarization of multimodal presentations using VLMs with various representations as input. From these experiments, we suggest cost-effective strategies for generating summaries from text-heavy multimodal documents under different input-length budgets using VLMs. We show that slides extracted from the video stream are a more beneficial input than the raw video, and that a structured representation built from interleaved slides and transcript provides the best performance. Finally, we reflect on the nature of cross-modal interactions in multimodal presentations and share suggestions to improve the capabilities of VLMs to understand documents of this nature.
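The paper does not publish code here, but the "interleaved slides and transcript" representation described in the abstract is straightforward to illustrate. The sketch below is a minimal, hypothetical Python example, not the authors' implementation: it interleaves extracted slide images with their time-aligned transcript segments into a single structured multimodal prompt, using the common `{"type": "text" | "image_url", ...}` content-part format. All names (`SlideSegment`, `build_interleaved_prompt`, the file paths) are illustrative assumptions; adapt the output to the message schema of whichever VLM you use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SlideSegment:
    """One presentation segment: a slide extracted from the video stream
    plus the transcript spoken while that slide was on screen.
    Field names are illustrative, not from the paper."""
    slide_image: str   # path or URL of the extracted slide image
    transcript: str    # ASR transcript aligned to this slide


def build_interleaved_prompt(segments: List[SlideSegment]) -> list:
    """Interleave slides and transcript into one structured prompt.

    Returns a list of content parts in the widely used
    {"type": "text" | "image_url", ...} format; pass it as the user
    message content of an OpenAI-compatible chat call, or convert it
    to your VLM's schema.
    """
    parts = [{
        "type": "text",
        "text": ("Summarize the following presentation. Slides and the "
                 "corresponding transcript segments are interleaved."),
    }]
    for i, seg in enumerate(segments, start=1):
        parts.append({"type": "text", "text": f"[Slide {i}]"})
        parts.append({"type": "image_url",
                      "image_url": {"url": seg.slide_image}})
        parts.append({"type": "text",
                      "text": f"[Transcript {i}] {seg.transcript}"})
    return parts


# Usage with two toy segments (hypothetical data):
segments = [
    SlideSegment("slide_01.png",
                 "Today we study summarization of multimodal presentations."),
    SlideSegment("slide_02.png",
                 "We compare raw video, slides, and interleaved inputs."),
]
prompt_parts = build_interleaved_prompt(segments)
```

Compared with feeding the raw video, this kind of structured input keeps the text-heavy content compact, which is what makes it attractive under tight input-length budgets.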

@article{gigant2025_2504.10049,
  title={Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure},
  author={Théo Gigant and Camille Guinaudeau and Frédéric Dufaux},
  journal={arXiv preprint arXiv:2504.10049},
  year={2025}
}