Concept Bottleneck Large Language Models

Abstract

We introduce Concept Bottleneck Large Language Models (CB-LLMs), a novel framework for building inherently interpretable Large Language Models (LLMs). In contrast to traditional black-box LLMs that rely on limited post-hoc interpretations, CB-LLMs integrate intrinsic interpretability directly into the LLMs -- allowing accurate explanations with scalability and transparency. We build CB-LLMs for two essential NLP tasks: text classification and text generation. In text classification, CB-LLMs are competitive with, and at times outperform, traditional black-box models while providing explicit and interpretable reasoning. For the more challenging task of text generation, interpretable neurons in CB-LLMs enable precise concept detection, controlled generation, and safer outputs. The embedded interpretability empowers users to transparently identify harmful content, steer model behavior, and unlearn undesired concepts -- significantly enhancing the safety, reliability, and trustworthiness of LLMs, which are critical capabilities notably absent in existing models. Our code is available at this https URL.
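The core idea of a concept bottleneck, as the abstract describes it, is to route a model's prediction through a small layer of human-interpretable concept neurons, so that the output is a transparent function of named concepts. The following is a minimal NumPy sketch of that structure for classification; all shapes, weights, and concept names here are hypothetical illustrations, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

d_embed, n_concepts, n_classes = 8, 3, 2

# Hypothetical named concepts -- in a concept bottleneck model, each
# bottleneck neuron corresponds to one human-interpretable concept.
concept_names = ["positive sentiment", "negative sentiment", "neutral tone"]

# Concept bottleneck layer: projects a backbone embedding onto concept scores.
W_concept = rng.normal(size=(n_concepts, d_embed))

# Interpretable final head: the prediction depends ONLY on concept scores,
# so each class logit is a transparent weighted sum of named concepts.
W_cls = rng.normal(size=(n_classes, n_concepts))

def predict(embedding):
    concept_scores = W_concept @ embedding  # interpretable neuron activations
    logits = W_cls @ concept_scores         # transparent linear decision
    return concept_scores, logits

embedding = rng.normal(size=d_embed)
scores, logits = predict(embedding)

# Each prediction can be explained by inspecting the concept activations;
# an undesired concept could be "unlearned" by zeroing its column in W_cls.
explanation = dict(zip(concept_names, scores))
```

Because the final layer sees only the concept activations, inspecting (or zeroing) those activations directly explains (or steers) the output, which is the mechanism behind the detection, steering, and unlearning capabilities the abstract claims.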

@article{sun2025_2412.07992,
  title={Concept Bottleneck Large Language Models},
  author={Chung-En Sun and Tuomas Oikarinen and Berk Ustun and Tsui-Wei Weng},
  journal={arXiv preprint arXiv:2412.07992},
  year={2025}
}