Token embeddings violate the manifold hypothesis

Comments: 10-page main text, 14 figures, 11 tables, 5-page bibliography, 15-page appendix
Abstract

To fully understand the behavior of a large language model (LLM), we must understand its input space. If this input space differs from our assumptions, our understanding of and conclusions about the LLM are likely flawed, regardless of its architecture. Here, we elucidate the structure of token embeddings, the input domain of LLMs, both empirically and theoretically. We present a generalized, statistically testable model in which the neighborhood of each token splits into well-defined signal and noise dimensions.
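As an illustration only, and not the paper's stated statistical test, one generic way to probe such a local signal/noise split is to run PCA on each token's k-nearest-neighbor neighborhood and see how many directions carry most of the local variance. The sketch below uses a synthetic embedding matrix and a hypothetical variance threshold; every name and parameter in it is an assumption, not something taken from the paper.

```python
# Hedged sketch: local PCA on a token's k nearest neighbors, splitting the
# local spectrum into "signal" and "noise" directions. This is NOT the
# paper's method; the embeddings and the 90% variance threshold are
# illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for a token embedding matrix (vocab_size x embed_dim).
vocab_size, embed_dim = 1000, 64
embeddings = rng.normal(size=(vocab_size, embed_dim))

def local_signal_noise_split(emb, token_idx, k=50, var_threshold=0.9):
    """Count how many local PCA directions explain `var_threshold` of the
    variance in the k-nearest-neighbor neighborhood of one token."""
    dists = np.linalg.norm(emb - emb[token_idx], axis=1)
    neighbors = emb[np.argsort(dists)[1:k + 1]]  # exclude the token itself
    centered = neighbors - neighbors.mean(axis=0)
    # Singular values of the centered neighborhood give the local spectrum.
    sing_vals = np.linalg.svd(centered, compute_uv=False)
    explained = np.cumsum(sing_vals ** 2) / np.sum(sing_vals ** 2)
    n_signal = int(np.searchsorted(explained, var_threshold) + 1)
    return n_signal, emb.shape[1] - n_signal

n_signal, n_noise = local_signal_noise_split(embeddings, token_idx=0)
print(f"signal dims: {n_signal}, noise dims: {n_noise}")
```

On real token embeddings one would load the model's embedding matrix instead of the synthetic one; under the abstract's claim, the signal/noise split would vary across tokens rather than matching a single global manifold dimension.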
