Concept Bottleneck Large Language Models

We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs is competitive with, and at times outperforms, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models. Our code is available atthis https URL.
View on arXiv@article{sun2025_2412.07992, title={ Concept Bottleneck Large Language Models }, author={ Chung-En Sun and Tuomas Oikarinen and Berk Ustun and Tsui-Wei Weng }, journal={arXiv preprint arXiv:2412.07992}, year={ 2025 } }