Organizing Unstructured Image Collections using Natural Language

Organizing unstructured visual data into semantic clusters is a key challenge in computer vision. Traditional deep clustering approaches focus on a single partition of the data, while multiple clustering (MC) methods address this limitation by uncovering distinct clustering solutions. The rise of large language models (LLMs) and multimodal LLMs has enhanced MC by allowing users to define clustering criteria in text. However, expecting users to manually define such criteria for large datasets before understanding the data is impractical. In this work, we introduce the task of Open-ended Semantic Multiple Clustering, which aims to automatically discover clustering criteria from large, unstructured image collections, uncovering interpretable substructures without requiring human input. Our framework, X-Cluster: eXploratory Clustering, uses text as a proxy to concurrently reason over large image collections, discover partitioning criteria expressed in natural language, and reveal semantic substructures. To evaluate X-Cluster, we introduce the COCO-4c and Food-4c benchmarks, each containing four grouping criteria and ground-truth annotations. We apply X-Cluster to various real-world applications, such as discovering biases and analyzing social media image popularity, demonstrating its utility as a practical tool for organizing large unstructured image collections and revealing novel insights. We will open-source our code and benchmarks for reproducibility and future research.
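As a rough illustration of what "multiple clustering" means in practice, the toy sketch below partitions one small collection under two criteria at once. It is not the X-Cluster pipeline: the image IDs, the criteria names ("activity", "location"), and the per-image text labels are all hypothetical and are assumed to have been produced already by some upstream captioning or LLM step. The helper `partition_by_criterion` is likewise an illustrative name.

```python
from collections import defaultdict

# Hypothetical text labels per image, one label per clustering criterion.
# In a real system these would come from a model; here they are hand-written.
labels = {
    "img_001": {"activity": "surfing", "location": "beach"},
    "img_002": {"activity": "cooking", "location": "kitchen"},
    "img_003": {"activity": "surfing", "location": "beach"},
    "img_004": {"activity": "eating", "location": "kitchen"},
}


def partition_by_criterion(labels):
    """Return one partition of the image IDs per criterion.

    Output shape: {criterion: {cluster_label: [image_id, ...]}}.
    """
    partitions = defaultdict(lambda: defaultdict(list))
    for image_id, per_criterion in labels.items():
        for criterion, label in per_criterion.items():
            partitions[criterion][label].append(image_id)
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {c: dict(groups) for c, groups in partitions.items()}


partitions = partition_by_criterion(labels)
for criterion, groups in partitions.items():
    print(criterion, groups)
```

The same four images end up in three clusters under "activity" but only two under "location", which is the sense in which a single collection admits several valid partitions.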
@article{liu2025_2410.05217,
  title={Organizing Unstructured Image Collections using Natural Language},
  author={Mingxuan Liu and Zhun Zhong and Jun Li and Gianni Franchi and Subhankar Roy and Elisa Ricci},
  journal={arXiv preprint arXiv:2410.05217},
  year={2025}
}