Token embeddings violate the manifold hypothesis
Main: 10 pages · 14 figures · 11 tables · Bibliography: 5 pages · Appendix: 15 pages
Abstract
To fully understand the behavior of a large language model (LLM), we must understand its input space. If this input space differs from our assumptions, our understanding of and conclusions about the LLM are likely flawed, regardless of its architecture. Here, we elucidate the structure of token embeddings, the input domain of LLMs, both empirically and theoretically. We present a generalized and statistically testable model in which the neighborhood of each token splits into well-defined signal and noise dimensions.
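The idea of a neighborhood splitting into signal and noise dimensions can be illustrated with local PCA. The sketch below is a hypothetical toy construction, not the paper's method: it builds synthetic "embeddings" with a known low-dimensional signal subspace plus isotropic noise, then estimates the local signal dimension around one point by counting how many principal components of the nearest-neighbor offsets are needed to explain most of the variance. The function name, data, and threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: points whose neighborhoods have a few
# high-variance (signal) directions plus small isotropic noise, mimicking
# the signal/noise split described in the abstract.
n_tokens, dim, signal_dim = 500, 64, 3
basis = rng.normal(size=(signal_dim, dim))          # signal subspace
coords = rng.normal(size=(n_tokens, signal_dim))    # signal coordinates
embeddings = coords @ basis + 0.05 * rng.normal(size=(n_tokens, dim))

def local_signal_dimension(emb, idx, k=50, var_threshold=0.9):
    """Estimate the signal dimension around point `idx` via local PCA:
    eigendecompose the covariance of offsets to the k nearest neighbors
    and count components needed to explain `var_threshold` of the variance."""
    dists = np.linalg.norm(emb - emb[idx], axis=1)
    neighbors = emb[np.argsort(dists)[1 : k + 1]]   # skip the point itself
    offsets = neighbors - neighbors.mean(axis=0)
    eigvals = np.linalg.eigvalsh(offsets.T @ offsets / k)[::-1]  # descending
    ratios = np.cumsum(eigvals) / eigvals.sum()
    return int(np.searchsorted(ratios, var_threshold) + 1)

est = local_signal_dimension(embeddings, idx=0)
print(est)  # should recover a dimension close to signal_dim = 3
```

Under this toy model, the estimator recovers the planted signal dimension; on real token embeddings, the abstract's claim is that such a split exists and is statistically testable, not that this particular estimator is the test used.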
