
Exploiting Latent Linearity in LLMs Improves Explainable Molecular Representation Learning

10 pages (main) + 3 pages (bibliography), 5 figures, 5 tables
Abstract

Large language models (LLMs) have demonstrated broad utility across molecular domains, spanning drug discovery and materials design. Analyzing LLMs' latent representations is crucial for elucidating their underlying mechanisms, improving explainability, and ultimately advancing downstream performance. We propose MoleX, a simple yet effective framework that decomposes molecular embeddings within LLM representations into a concept-aligned space for explainable molecular representation learning. We further show that these high-dimensional embeddings admit a linear mapping onto chemically consistent concepts. Our analysis suggests that the uncovered linearity aligns with established chemical principles, indicating a mechanistically explainable latent structure in LLM representations for scientific applications. When applied to downstream tasks, this latent linearity improves both predictive and explanatory performance. Extensive experiments demonstrate that MoleX outperforms existing approaches in accuracy, explainability, and efficiency, achieving CPU inference on large-scale datasets 300 times faster with 100,000 times fewer parameters than LLMs.
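The abstract's central claim, that high-dimensional LLM embeddings admit a linear mapping onto chemical concepts, can be illustrated with a linear-probe experiment. The sketch below is not MoleX itself: the data is synthetic, and the embedding dimension, concept count, and ridge penalty are illustrative assumptions. It only shows how one would test linear decodability of concepts from embeddings via regularized least squares.

```python
# Hedged sketch: testing whether embeddings are linearly decodable into
# concepts, in the spirit of the paper's latent-linearity analysis.
# All data is synthetic; dimensions and the ridge penalty are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_mols, embed_dim, n_concepts = 500, 256, 8

# Synthetic "LLM embeddings": concept scores mixed linearly plus noise.
concepts = rng.normal(size=(n_mols, n_concepts))     # ground-truth concept scores
mixing = rng.normal(size=(n_concepts, embed_dim))    # unknown linear structure
embeddings = concepts @ mixing + 0.1 * rng.normal(size=(n_mols, embed_dim))

# Fit a ridge-regularized linear probe: embeddings -> concepts.
lam = 1e-2
A = embeddings.T @ embeddings + lam * np.eye(embed_dim)
W = np.linalg.solve(A, embeddings.T @ concepts)      # (embed_dim, n_concepts)

# R^2 per concept: how much of each concept a purely linear map recovers.
pred = embeddings @ W
ss_res = ((concepts - pred) ** 2).sum(axis=0)
ss_tot = ((concepts - concepts.mean(axis=0)) ** 2).sum(axis=0)
r2 = 1 - ss_res / ss_tot
print(r2.round(3))  # near 1.0 here, since the synthetic data is linear by construction
```

High R² across concepts would indicate a linear latent structure; for real embeddings one would fit the probe on held-out molecules and compare against a nonlinear baseline before drawing that conclusion.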
