On the Emergence of Linear Analogies in Word Embeddings

24 May 2025
Daniel J. Korchinski
Dhruva Karkada
Yasaman Bahri
Matthieu Wyart
Main: 9 pages · 11 figures · 1 table · Bibliography: 3 pages · Appendix: 8 pages
Abstract

Models such as Word2Vec and GloVe construct word embeddings based on the co-occurrence probability P(i, j) of words i and j in text corpora. The resulting vectors W_i not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, W_king − W_man + W_woman ≈ W_queen -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix M(i, j) = P(i, j) / (P(i) P(j)); (ii) strengthens and then saturates as more eigenvectors of M(i, j), which control the dimension of the embeddings, are included; (iii) is enhanced when using log M(i, j) rather than M(i, j); and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure and naturally accounts for properties (i)-(iv). It can be viewed as giving fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and the analogy benchmark introduced by Mikolov et al.
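The mechanism the abstract describes can be illustrated with a toy sketch: assign each word a vector of ±1 binary attributes, assume log M(i, j) is proportional to the attribute overlap (an illustrative kernel, not necessarily the paper's exact model), build embeddings from the top eigenvectors of log M, and check the king − man + woman analogy. The attribute assignments and interaction strength `beta` below are hypothetical choices for the example.

```python
import numpy as np

# Toy attribute-based model: each word is a vector of binary (+1/-1)
# semantic attributes. Attributes here: (royal, male).
words = ["king", "queen", "man", "woman"]
attrs = np.array([
    [+1, +1],   # king:  royal, male
    [+1, -1],   # queen: royal, female
    [-1, +1],   # man:   common, male
    [-1, -1],   # woman: common, female
], dtype=float)

beta = 0.3  # assumed per-attribute interaction strength (hypothetical)
# Assume log M(i, j) = beta * <a_i, a_j>, so log M has rank = #attributes.
logM = beta * attrs @ attrs.T

# Embeddings from the top eigenvectors of the symmetric matrix log M.
eigvals, eigvecs = np.linalg.eigh(logM)
order = np.argsort(eigvals)[::-1]          # sort eigenvalues descending
d = 2                                      # embedding dimension
W = eigvecs[:, order[:d]] * np.sqrt(eigvals[order[:d]])

idx = {w: k for k, w in enumerate(words)}
analogy = W[idx["king"]] - W[idx["man"]] + W[idx["woman"]]
# In this noiseless linear model the analogy holds exactly, because the
# embedding rows are a linear image of the attribute rows.
print(np.allclose(analogy, W[idx["queen"]]))  # True
```

The point of the sketch is property (i): linear analogies are already present in the span of the top eigenvectors, since any linear relation among attribute vectors (king − man + woman = queen holds exactly in ±1 attribute space) is preserved by the linear map into the eigenbasis.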

@article{korchinski2025_2505.18651,
  title={On the Emergence of Linear Analogies in Word Embeddings},
  author={Daniel J. Korchinski and Dhruva Karkada and Yasaman Bahri and Matthieu Wyart},
  journal={arXiv preprint arXiv:2505.18651},
  year={2025}
}