Pathologies of Neural Models Make Interpretations Difficult
- AAMLFAtt
Model interpretability is a crucial problem for neural networks. Existing interpretation methods highlight salient input features, often determining each feature's importance from the model's gradient. We instead remove the least influential words from language inputs, one at a time. This exposes pathological model behavior on language tasks: models produce high-confidence predictions for reduced inputs, even when humans find them nonsensical. We examine the reasons for this behavior and suggest methods of mitigation. Our results have implications for gradient-based interpretation methods, showing that word importance derived from a model's gradient often does not align with the importance humans perceive. We propose a simple entropy regularization technique that mitigates these issues without affecting performance on clean examples.
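A minimal sketch of the ideas in the abstract: gradient-based word importance, iterative input reduction, and an entropy-regularized training loss. This is not the authors' code; the toy bag-of-embeddings classifier, helper names, and the `beta` coefficient are illustrative assumptions.

```python
# Sketch only: gradient-times-embedding saliency, input reduction, and an
# entropy bonus on reduced inputs. Model and hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ToyClassifier(nn.Module):
    """Bag-of-embeddings classifier standing in for a real NLP model."""

    def __init__(self, vocab_size=1000, dim=64, num_classes=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward_from_embeddings(self, emb):
        # emb: (seq_len, dim) -> logits: (num_classes,)
        return self.classifier(emb.mean(dim=0))

    def forward(self, token_ids):
        return self.forward_from_embeddings(self.embedding(token_ids))


def word_importance(model, token_ids):
    """Gradient-times-embedding saliency score for each input token."""
    emb = model.embedding(token_ids).detach().requires_grad_(True)
    logits = model.forward_from_embeddings(emb)
    logits[int(logits.argmax())].backward()
    # One scalar per token: |grad . embedding| summed over the hidden dim.
    return (emb.grad * emb).sum(dim=-1).abs()


def input_reduction(model, token_ids):
    """Iteratively drop the least important token while the prediction holds."""
    with torch.no_grad():
        original_pred = model(token_ids).argmax()
    while token_ids.numel() > 1:
        idx = int(word_importance(model, token_ids).argmin())
        candidate = torch.cat([token_ids[:idx], token_ids[idx + 1:]])
        with torch.no_grad():
            if model(candidate).argmax() != original_pred:
                break  # removing this word would change the prediction
        token_ids = candidate
    return token_ids


def regularized_loss(model, clean_ids, reduced_ids, label, beta=0.1):
    """Cross-entropy on the clean input plus an entropy bonus on the reduced one."""
    ce = F.cross_entropy(model(clean_ids).unsqueeze(0), label.unsqueeze(0))
    probs = F.softmax(model(reduced_ids), dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    # Minimizing this loss maximizes output entropy on the reduced input.
    return ce - beta * entropy
```

The reduction loop keeps shrinking the input as long as the model's prediction is unchanged, which is how nonsensical yet high-confidence inputs surface; the regularized loss then pushes the model toward uniform (high-entropy) outputs on such reduced examples while leaving the clean-example objective intact.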