Set-Theoretic Compositionality of Sentence Embeddings

Abstract

Sentence encoders play a pivotal role in various NLP tasks; hence, an accurate evaluation of their compositional properties is paramount. However, existing evaluation methods predominantly focus on goal-task-specific performance. This leaves a significant gap in understanding how well sentence embeddings demonstrate fundamental compositional properties in a task-independent context. Leveraging classical set theory, we address this gap by proposing six criteria based on three core "set-like" compositions/operations: TextOverlap, TextDifference, and TextUnion. We systematically evaluate 7 classical and 9 Large Language Model (LLM)-based sentence encoders to assess their alignment with these criteria. Our findings show that SBERT consistently demonstrates set-like compositional properties, surpassing even the latest LLMs. Additionally, we introduce a new dataset of ~192K samples designed to facilitate future benchmarking efforts on set-like compositionality of sentence embeddings.
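To make the idea concrete, here is a minimal sketch of what checking one set-like compositional property might look like. This is not the paper's actual evaluation protocol: the toy embeddings, the averaging composition function, and the cosine-based score are all illustrative assumptions standing in for a real sentence encoder and the paper's six criteria.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy stand-ins for encoder outputs (a real setup would call a
# sentence encoder on text_a, text_b, and their union text).
rng = np.random.default_rng(0)
emb_a = rng.normal(size=8)       # hypothetical embedding of sentence A
emb_b = rng.normal(size=8)       # hypothetical embedding of sentence B
emb_union = rng.normal(size=8)   # hypothetical embedding of "A union B" text

# One illustrative TextUnion-style check: the embedding of the combined
# text should resemble a composition (here, the mean) of the parts more
# than it resembles either part alone.
composed = (emb_a + emb_b) / 2
union_score = cosine(emb_union, composed)
part_score = max(cosine(emb_union, emb_a), cosine(emb_union, emb_b))
satisfies_criterion = union_score >= part_score
```

An encoder would then be scored by how often such criteria hold across many sentence pairs; the actual criteria and composition functions in the paper may differ.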

@article{bansal2025_2502.20975,
  title={Set-Theoretic Compositionality of Sentence Embeddings},
  author={Naman Bansal and Yash Mahajan and Sanjeev Sinha and Santu Karmaker},
  journal={arXiv preprint arXiv:2502.20975},
  year={2025}
}