Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

Neural Information Processing Systems (NeurIPS), 2024

11 January 2024

Quentin Delfosse

Sebastian Sztwiertnia

M. Rothermel

Wolfgang Stammer

Kristian Kersting

Abstract

Reward sparsity, difficult credit assignment, and misalignment are only a few of the many issues that make it difficult, if not impossible, for deep reinforcement learning (RL) agents to learn optimal policies. Unfortunately, the black-box nature of deep networks impedes the inclusion of domain experts who could interpret the model and correct wrong behavior. To this end, we introduce Successive Concept Bottleneck Agents (SCoBots), which make the whole decision pipeline transparent by integrating consecutive concept bottleneck layers. SCoBots make use not only of relevant object properties but also of relational concepts. Our experimental results provide strong evidence that SCoBots allow domain experts to efficiently understand and regularize the agents' behavior, resulting in potentially better human-aligned RL. In this way, SCoBots enabled us to identify a misalignment problem in the simplest and most iconic video game, Pong, and resolve it.
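
To make the idea of consecutive concept bottlenecks more concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that object properties (here just names and x/y positions) have already been extracted from a frame, derives pairwise relational concepts from them, and selects an action with a hand-written, inspectable rule. All names (GameObject, relational_concepts, policy, act), the concept naming scheme, the tolerance value, and the optional allowed filter (standing in for an expert pruning concepts) are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass(frozen=True)
class GameObject:
    """First bottleneck (assumed given): an object with readable properties."""
    name: str
    x: float
    y: float


def relational_concepts(
    objects: List[GameObject],
    allowed: Optional[Set[str]] = None,
) -> Dict[str, float]:
    """Second bottleneck: named relations between every pair of objects.

    `allowed` is a hypothetical hook that lets a domain expert keep only
    selected concepts, so the policy cannot rely on the others.
    """
    by_name = {obj.name: obj for obj in objects}
    names = sorted(by_name)
    concepts: Dict[str, float] = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            concepts[f"dx({a},{b})"] = by_name[b].x - by_name[a].x
            concepts[f"dy({a},{b})"] = by_name[b].y - by_name[a].y
    if allowed is not None:
        concepts = {k: v for k, v in concepts.items() if k in allowed}
    return concepts


def policy(concepts: Dict[str, float], tolerance: float = 2.0) -> str:
    """Interpretable action selection from relational concepts only.

    Assumes screen coordinates where y grows downward, so a positive
    dy(ball,player) means the player is below the ball.
    """
    dy = concepts.get("dy(ball,player)", 0.0)
    if dy > tolerance:
        return "UP"
    if dy < -tolerance:
        return "DOWN"
    return "NOOP"


def act(objects: List[GameObject], allowed: Optional[Set[str]] = None) -> str:
    """Chain the bottlenecks: objects -> relations -> action."""
    return policy(relational_concepts(objects, allowed))


# Usage with made-up Pong-like positions.
frame_objects = [
    GameObject("player", 140.0, 100.0),
    GameObject("ball", 80.0, 60.0),
    GameObject("enemy", 16.0, 60.0),
]
print(act(frame_objects))  # UP
```

Because every intermediate value is a named concept, a reader can inspect which relations drive a decision and restrict them through allowed. In a learned agent the hand-written rule would be replaced by a trained, interpretable action selector over the same concepts.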
