Learning meaningful representations of complex objects that can be seen through multiple ($k\geq 3$) views or modalities is a core task in machine learning. Existing methods use losses originally intended for paired views, and extend them to $k$ views, either by instantiating $\binom{k}{2}$ loss-pairs, or by using reduced embeddings, following a \textit{one vs. average-of-rest} strategy. We propose the multi-marginal matching gap (M3G), a loss that borrows tools from multi-marginal optimal transport (MM-OT) theory to simultaneously incorporate all $k$ views. Given a batch of $n$ points, each seen as a $k$-tuple of views subsequently transformed into $k$ embeddings, our loss contrasts the cost of matching these $n$ ground-truth $k$-tuples with the MM-OT polymatching cost, which seeks $n$ optimally arranged $k$-tuples chosen within these $n\times k$ vectors. While the exponential complexity $O(n^k)$ of the MM-OT problem may seem daunting, we show in experiments that a suitable generalization of the Sinkhorn algorithm for that problem can scale to, e.g., $k=3\sim 6$ views using mini-batches of size $64\sim 128$. Our experiments demonstrate improved performance over multiview extensions of pairwise losses, for both self-supervised and multimodal tasks.
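The abstract only sketches the loss at a high level; the snippet below is a rough, non-authoritative illustration of the idea, not the authors' implementation. It assumes uniform marginals, a tuple cost given by the sum of pairwise squared distances between views, and a naive log-domain multi-marginal Sinkhorn solver that materializes the full $n^k$ cost tensor (feasible only for small $n$ and $k$). All function names, the regularization strength, and the use of the dual objective as the MM-OT cost are assumptions made for this sketch.

```python
# Minimal sketch of an M3G-style loss (assumptions noted above; not the paper's code).
import jax
import jax.numpy as jnp
from jax.scipy.special import logsumexp


def pairwise_tuple_cost(embs):
    """Cost tensor C of shape (n,)*k; C[i1,...,ik] sums squared distances
    between every pair of views taken at indices i1,...,ik (an assumed cost)."""
    k, n = len(embs), embs[0].shape[0]
    C = jnp.zeros((n,) * k)
    for a in range(k):
        for b in range(a + 1, k):
            d2 = jnp.sum((embs[a][:, None, :] - embs[b][None, :, :]) ** 2, axis=-1)
            shape = [1] * k
            shape[a], shape[b] = n, n
            C = C + d2.reshape(shape)
    return C


def mm_sinkhorn_cost(C, epsilon=0.1, n_iter=100):
    """Entropic multi-marginal OT cost with uniform marginals, via coordinate
    ascent on the k dual potentials (a direct generalization of Sinkhorn)."""
    k, n = C.ndim, C.shape[0]
    log_mu = -jnp.log(n) * jnp.ones(n)            # uniform marginal, in log space
    potentials = [jnp.zeros(n) for _ in range(k)]

    def broadcast(v, axis):
        shape = [1] * k
        shape[axis] = n
        return v.reshape(shape)

    for _ in range(n_iter):
        for a in range(k):
            # log of the marginalized kernel seen by potential a
            T = -C / epsilon
            for b in range(k):
                T = T + broadcast(log_mu, b)
                if b != a:
                    T = T + broadcast(potentials[b] / epsilon, b)
            other_axes = tuple(b for b in range(k) if b != a)
            potentials[a] = epsilon * (log_mu - logsumexp(T, axis=other_axes))
    # dual objective; at convergence it matches the entropic MM-OT cost
    return sum(jnp.sum(jnp.exp(log_mu) * f) for f in potentials)


def m3g_loss(embs, epsilon=0.1):
    """Matching gap: cost of the ground-truth tuples minus the MM-OT cost."""
    C = pairwise_tuple_cost(embs)
    n, k = embs[0].shape[0], len(embs)
    idx = jnp.arange(n)
    ground_truth = jnp.mean(C[tuple(idx for _ in range(k))])  # diagonal k-tuples
    return ground_truth - mm_sinkhorn_cost(C, epsilon)


# toy usage: n=8 points, k=3 views, 16-dimensional embeddings
key = jax.random.PRNGKey(0)
views = [jax.random.normal(k_, (8, 16)) for k_ in jax.random.split(key, 3)]
print(m3g_loss(views))
```

Minimizing such a gap pushes the ground-truth $k$-tuples to already be the optimal polymatching of the batch; a practical implementation would avoid building the $n^k$ tensor explicitly and would backpropagate through (or around) the Sinkhorn iterations.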