Learning Interpretable Models for Black-Box Agents
This paper develops a new approach for learning a STRIPS-like model of a stationary black-box autonomous agent that can plan and act. In this approach, the user may ask the agent a series of questions, which the agent answers truthfully. Our main contribution is an algorithm that generates an interrogation policy in the form of a contingent sequence of questions to be posed to the agent; the answers to these questions are used to learn a minimal class of agent models that are functionally indistinguishable from the agent itself. The approach requires only a minimal query-answering capability from the agent. Empirical evaluation shows that, despite the intractable space of possible models, our approach can learn interpretable models for a class of black-box autonomous agents in a scalable manner.
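The sketch below illustrates, under assumed interfaces, the kind of query-driven pruning the abstract describes: each candidate model is reduced to the answers it would predict, and models that contradict the agent's truthful answers are discarded until no available question can tell the survivors apart. This is not the paper's algorithm; all names and data structures here (`Model`, `distinguishing_query`, the `applicable(...)` query strings) are hypothetical.

```python
# Minimal sketch of query-driven model pruning (hypothetical interfaces,
# not the paper's implementation). A "model" is reduced to its
# query-response behaviour: a dict mapping each query to a predicted answer.

from typing import Dict, List, Optional

Query = str
Model = Dict[Query, bool]  # query -> answer the model predicts

def distinguishing_query(candidates: List[Model]) -> Optional[Query]:
    """Return a query on which the candidates disagree, or None if the
    remaining candidates are functionally indistinguishable."""
    for q in candidates[0]:  # assumes a non-empty candidate set
        if len({m[q] for m in candidates}) > 1:
            return q
    return None

def learn_indistinguishable_class(candidates: List[Model], agent: Model) -> List[Model]:
    """Pose distinguishing queries until the survivors all agree.

    Because the agent answers truthfully, any model inconsistent with an
    observed answer can be discarded. The next query depends on the answers
    so far, so the loop traces out a contingent sequence of questions.
    """
    while (q := distinguishing_query(candidates)) is not None:
        answer = agent[q]  # truthful answer from the black-box agent
        candidates = [m for m in candidates if m[q] == answer]
    return candidates

# Usage: three hypothetical models over two queries. The agent's true
# behaviour matches model A, so B is pruned; A and C agree on every query
# and survive as the minimal functionally indistinguishable class.
A = {"applicable(pick, s1)": True,  "applicable(move, s2)": False}
B = {"applicable(pick, s1)": False, "applicable(move, s2)": False}
C = {"applicable(pick, s1)": True,  "applicable(move, s2)": False}
print(learn_indistinguishable_class([A, B, C], agent=A))
```

In the paper's setting the questions concern plan outcomes and action applicability rather than lookups in a table, but the greedy loop above captures the core idea: each answer halves (or otherwise shrinks) the space of STRIPS-like models consistent with the agent's behaviour.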