ResearchTrend.AI

ImageSet2Text: Describing Sets of Images through Text

25 March 2025
Piera Riccio
Francesco Galati
Kajetan Schweighofer
Noa Garcia
Nuria Oliver
Communities: VLM, CoGe
Abstract

We introduce ImageSet2Text, a novel approach that leverages vision-language foundation models to automatically create natural language descriptions of image sets. Inspired by concept bottleneck models (CBMs) and based on visual-question answering (VQA) chains, ImageSet2Text iteratively extracts key concepts from image subsets, encodes them into a structured graph, and refines insights using an external knowledge graph and CLIP-based validation. This iterative process enhances interpretability and enables accurate and detailed set-level summarization. Through extensive experiments, we evaluate ImageSet2Text's descriptions on accuracy, completeness, readability and overall quality, benchmarking it against existing vision-language models and introducing new datasets for large-scale group image captioning.
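The iterative loop the abstract describes — extract concepts from image subsets via VQA, record them in a structured graph, and keep only those that pass CLIP-based validation over the full set — can be illustrated with a minimal, hypothetical sketch. The `vqa_answer` and `clip_score` stubs, the question list, and the acceptance threshold below are placeholders for illustration only; the paper's actual models, prompts, and graph schema are not specified here.

```python
def vqa_answer(image, question):
    """Stub VQA model: maps a question to a concept string (placeholder)."""
    return {"q_subject": "dog", "q_setting": "outdoor"}.get(question, "unknown")

def clip_score(images, concept):
    """Stub CLIP validation: fraction of the set matching a concept (placeholder)."""
    return 0.9 if concept in ("dog", "outdoor") else 0.1

def imageset2text(images, questions, threshold=0.5):
    """Iteratively extract concepts from an image subset via VQA,
    validate each against the whole set, and keep validated concepts
    in a simple question->concepts graph before summarizing."""
    graph = {}
    for question in questions:
        # Query the VQA model on a subset of the image set.
        subset = images[: max(1, len(images) // 2)]
        concepts = {vqa_answer(image, question) for image in subset}
        for concept in concepts:
            # CLIP-based validation over the full set filters noisy concepts.
            if clip_score(images, concept) >= threshold:
                graph.setdefault(question, set()).add(concept)
    # Flatten the concept graph into a natural-language summary.
    parts = [f"{q}: {', '.join(sorted(cs))}" for q, cs in sorted(graph.items())]
    return "; ".join(parts)

print(imageset2text([f"img_{i}" for i in range(4)], ["q_subject", "q_setting"]))
```

With the stubs above, the validated graph contains one concept per question, and the summary names both; a real implementation would replace the stubs with a vision-language model and CLIP similarity scoring.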

View on arXiv
@article{riccio2025_2503.19361,
  title={ImageSet2Text: Describing Sets of Images through Text},
  author={Piera Riccio and Francesco Galati and Kajetan Schweighofer and Noa Garcia and Nuria Oliver},
  journal={arXiv preprint arXiv:2503.19361},
  year={2025}
}