Token embeddings violate the manifold hypothesis
Main: 10 pages · 14 figures · 11 tables · Bibliography: 5 pages · Appendix: 15 pages
Abstract
To fully understand the behavior of a large language model (LLM), we must understand its input space. If this input space differs from our assumptions, our understanding of and conclusions about the LLM are likely flawed, regardless of its architecture. Here, we elucidate the structure of token embeddings, the input domain of LLMs, both empirically and theoretically. We present a generalized and statistically testable model in which the neighborhood of each token splits into well-defined signal and noise dimensions.
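The idea of a neighborhood splitting into signal and noise dimensions can be illustrated with local PCA. The sketch below is a hypothetical toy construction, not the paper's method: it builds synthetic "embeddings" with a known low-dimensional signal subspace plus isotropic noise, then estimates the local signal dimension around one point by counting how many principal components of the nearest-neighbor offsets are needed to explain most of the variance. The function name, data, and threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: points whose neighborhoods have a few
# high-variance (signal) directions plus small isotropic noise, mimicking
# the signal/noise split described in the abstract.
n_tokens, dim, signal_dim = 500, 64, 3
basis = rng.normal(size=(signal_dim, dim))          # signal subspace
coords = rng.normal(size=(n_tokens, signal_dim))    # signal coordinates
embeddings = coords @ basis + 0.05 * rng.normal(size=(n_tokens, dim))

def local_signal_dimension(emb, idx, k=50, var_threshold=0.9):
    """Estimate the signal dimension around point `idx` via local PCA:
    eigendecompose the covariance of offsets to the k nearest neighbors
    and count components needed to explain `var_threshold` of the variance."""
    dists = np.linalg.norm(emb - emb[idx], axis=1)
    neighbors = emb[np.argsort(dists)[1 : k + 1]]   # skip the point itself
    offsets = neighbors - neighbors.mean(axis=0)
    eigvals = np.linalg.eigvalsh(offsets.T @ offsets / k)[::-1]  # descending
    ratios = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratios, var_threshold) + 1)

est = local_signal_dimension(embeddings, idx=0)
print(est)  # should recover a dimension close to signal_dim = 3
```

Under this toy model, the estimator recovers the planted signal dimension; on real token embeddings, the abstract's claim is that such a split exists and is statistically testable, not that this particular estimator is the test used.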
