
Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Main: 6 pages
Appendix: 5 pages
Bibliography: 2 pages
Figures: 3
Tables: 9
Abstract

We find that current text embedding models produce outputs with a consistent bias: each embedding vector $e$ can be decomposed as $\tilde{e} + \mu$, where $\mu$ is nearly identical across all sentences. We propose a plug-and-play, training-free, and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization yields consistent, statistically significant improvements for existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by $9.7\sigma$ on retrieval tasks, $3.1\sigma$ on classification tasks, and $0.8\sigma$ on the remaining task types. Renormalization has two variants: directly subtracting $\mu$ from $e$, or subtracting the projection of $e$ onto $\mu$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.
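As a rough illustration of the two variants described in the abstract, here is a minimal NumPy sketch. The estimator for $\mu$ (a corpus mean over embeddings) and all function names are assumptions for illustration; the abstract does not specify how $\mu$ is obtained, so this is a sketch of the idea rather than the paper's exact procedure.

```python
import numpy as np

def estimate_mu(embeddings: np.ndarray) -> np.ndarray:
    """One plausible estimator for the shared bias vector mu:
    the mean of a batch of sentence embeddings, (n, d) -> (d,)."""
    return embeddings.mean(axis=0)

def renormalize_subtract(E: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Variant 1: subtract mu directly from every embedding."""
    return E - mu

def renormalize_project(E: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Variant 2: remove only the component of each embedding that
    lies along the mu direction (the variant the paper predicts,
    and finds, to perform better)."""
    mu_hat = mu / np.linalg.norm(mu)       # unit vector along mu
    coeffs = E @ mu_hat                    # (n,) projection lengths
    return E - np.outer(coeffs, mu_hat)    # subtract each projection

# Hypothetical usage: estimate mu on a corpus of embeddings, apply the
# correction, then re-normalize to unit length before cosine scoring.
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 768)) + 0.5     # toy embeddings with a shared offset
mu = estimate_mu(E)
E_corrected = renormalize_project(E, mu)
E_corrected /= np.linalg.norm(E_corrected, axis=1, keepdims=True)
```

Note the design difference between the variants: direct subtraction shifts every embedding by the full vector $\mu$, while the projection variant removes only the component along the $\mu$ direction, leaving the orthogonal structure of each embedding untouched.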
