Multimodal representations that enable cross-modal retrieval are widely used. However, they often lack interpretability, making it difficult to explain the retrieved results. Existing solutions that learn sparse disentangled representations are typically guided by the text tokens in the data, which makes the dimensionality of the resulting embeddings very high. We propose an approach that generates lower-dimensional, fixed-size embeddings that are not only disentangled but also offer better control for retrieval tasks. We demonstrate their utility using challenging exclusion queries over the MSCOCO and Conceptual Captions benchmarks. Our experiments show that our approach outperforms traditional dense models such as CLIP, BLIP, and VISTA (gains up to 11% in AP@10), as well as sparse disentangled models like VDR (gains up to 21% in AP@10). We also present qualitative results to further underline the interpretability of disentangled representations.
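To make the setting concrete, below is a minimal illustrative sketch, not the authors' implementation: it assumes sparse, disentangled embeddings in which individual dimensions align with interpretable concepts, so an exclusion query such as "X but not Y" can be answered by rewarding overlap with the included concept and penalizing overlap with the excluded one. It also includes one common variant of the AP@10 metric mentioned in the abstract. All names, dimensions, and data here are hypothetical.

```python
import numpy as np

def exclusion_score(doc_emb, include_emb, exclude_emb, alpha=1.0):
    """Score a document for a query like "dog, not leash": reward overlap with
    the included concept vector, penalize overlap with the excluded one.
    (Hypothetical scoring rule for illustration only.)"""
    return float(doc_emb @ include_emb - alpha * (doc_emb @ exclude_emb))

def average_precision_at_k(ranked_relevance, k=10):
    """One common variant of AP@k: mean of the precision values at each
    relevant hit within the top-k ranked results."""
    hits, precisions = 0, []
    for i, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            precisions.append(hits / i)
    return sum(precisions) / max(hits, 1)

# Toy usage with random sparse vectors (sizes and sparsity are arbitrary).
rng = np.random.default_rng(0)
docs = rng.random((5, 32)) * (rng.random((5, 32)) > 0.8)  # sparse "document" embeddings
inc = rng.random(32) * (rng.random(32) > 0.8)             # "include" concept vector
exc = rng.random(32) * (rng.random(32) > 0.8)             # "exclude" concept vector
ranking = sorted(range(5), key=lambda d: exclusion_score(docs[d], inc, exc), reverse=True)
print(ranking)
```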
@article{j2025_2504.03184,
  title={Answering Multimodal Exclusion Queries with Lightweight Sparse Disentangled Representations},
  author={Prachi J and Sumit Bhatia and Srikanta Bedathur},
  journal={arXiv preprint arXiv:2504.03184},
  year={2025}
}