Tapping into the Black Box: Uncovering Aligned Representations in Pretrained Neural Networks
- FAtt
In ReLU networks, gradients of output units can be seen as their input-level representations, since they correspond to the units' pullbacks through the active subnetwork. However, gradients of deeper models are notoriously misaligned, contributing significantly to their black-box nature. We claim that this is because active subnetworks are inherently noisy due to ReLU's hard gating. To tackle that noise, we propose soft-gating in the backward pass only. The resulting input-level vector field (called the "excitation pullback") exhibits remarkable perceptual alignment, revealing high-resolution input- and target-specific features that "just make sense", thereby establishing a compelling novel explanation method. Furthermore, we speculate that excitation pullbacks directionally approximate the gradients of a simpler model, linear in the network's path space, that is learned implicitly during optimization and largely determines the network's decision; this argues for the faithfulness of the produced explanations and their overall significance.
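To make the backward-only soft-gating idea concrete, below is a minimal PyTorch sketch: the forward pass keeps the standard ReLU, while the backward pass replaces the hard 0/1 gate with a smooth gate on the pre-activation. The sigmoid gate, the `temperature` parameter, and the toy two-layer network are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class SoftGateReLU(torch.autograd.Function):
    """Standard ReLU in the forward pass; in the backward pass, the hard
    (x > 0) gate is replaced by a smooth gate on the pre-activation.
    The sigmoid gate and temperature are illustrative choices."""

    @staticmethod
    def forward(ctx, x, temperature=1.0):
        ctx.save_for_backward(x)
        ctx.temperature = temperature
        return x.clamp(min=0)  # ordinary ReLU forward

    @staticmethod
    def backward(ctx, grad_output):
        (x,) = ctx.saved_tensors
        soft_gate = torch.sigmoid(x / ctx.temperature)  # soft gate instead of (x > 0)
        return grad_output * soft_gate, None  # no gradient for the temperature argument


class TwoLayerNet(nn.Module):
    """Toy MLP whose hidden ReLU uses the soft-gated backward pass."""

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(784, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        return self.fc2(SoftGateReLU.apply(self.fc1(x)))


def excitation_pullback(model, x, target):
    """Backpropagate the target logit through the soft gates to obtain an
    input-level attribution map (sketch of an 'excitation pullback')."""
    x = x.clone().requires_grad_(True)
    logits = model(x)
    logits[0, target].backward()
    return x.grad


if __name__ == "__main__":
    net = TwoLayerNet()
    pullback = excitation_pullback(net, torch.randn(1, 784), target=3)
    print(pullback.shape)  # torch.Size([1, 784])
```

For a convolutional classifier the same idea applies: keep the forward pass untouched and only soften the ReLU gates during backpropagation of the chosen output unit, then visualize the resulting input gradient as a saliency map.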
View on arXiv