
Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

Main: 6 pages
Appendix: 5 pages
Bibliography: 2 pages
Figures: 3
Tables: 9
Abstract

We find that current text embedding models produce outputs with a consistent bias: each embedding vector $e$ can be decomposed as $\tilde{e} + \mu$, where $\mu$ is nearly identical across all sentences. We propose a plug-and-play, training-free, and lightweight solution called Renormalization. Through extensive experiments, we show that renormalization yields consistent, statistically significant improvements for existing models on the Massive Multilingual Text Embedding Benchmark (MMTEB). In particular, across 38 models, renormalization improves performance by $9.7\sigma$ on retrieval tasks, $3.1\sigma$ on classification tasks, and $0.8\sigma$ on the remaining task types. Renormalization has two variants: directly subtracting $\mu$ from $e$, or subtracting the projection of $e$ onto $\mu$. We theoretically predict that the latter performs better, and our experiments confirm this prediction.
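As a rough illustration of the two variants described in the abstract, here is a minimal NumPy sketch. The estimator for $\mu$ (a corpus mean over embeddings) and all function names are assumptions for illustration; the abstract does not specify how $\mu$ is obtained, so this is a sketch of the idea rather than the paper's exact procedure.

```python
import numpy as np

def estimate_mu(embeddings: np.ndarray) -> np.ndarray:
    """One plausible estimator for the shared bias vector mu:
    the mean of a batch of sentence embeddings, (n, d) -> (d,)."""
    return embeddings.mean(axis=0)

def renormalize_subtract(E: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Variant 1: subtract mu directly from every embedding."""
    return E - mu

def renormalize_project(E: np.ndarray, mu: np.ndarray) -> np.ndarray:
    """Variant 2: remove only the component of each embedding that
    lies along the mu direction (the variant the paper predicts,
    and finds, to perform better)."""
    mu_hat = mu / np.linalg.norm(mu)       # unit vector along mu
    coeffs = E @ mu_hat                    # (n,) projection lengths
    return E - np.outer(coeffs, mu_hat)    # subtract each projection

# Hypothetical usage: estimate mu on a corpus of embeddings, apply the
# correction, then re-normalize to unit length before cosine scoring.
rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 768)) + 0.5     # toy embeddings with a shared offset
mu = estimate_mu(E)
E_corrected = renormalize_project(E, mu)
E_corrected /= np.linalg.norm(E_corrected, axis=1, keepdims=True)
```

Note the design difference between the variants: direct subtraction shifts every embedding by the full vector $\mu$, while the projection variant removes only the component along the $\mu$ direction, leaving the orthogonal structure of each embedding untouched.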
