Using Learning Dynamics to Explore the Role of Implicit Regularization in Adversarial Examples
Recent work (Ilyas et al., 2019) suggests that adversarial examples are features, not bugs. If adversarial perturbations are indeed useful but non-robust features, what is their origin? To answer this question, we performed a novel analysis of the learning dynamics of adversarial perturbations, in both the pixel and frequency domains, and a systematic steganography experiment to explore the implicit bias induced by different model parametrizations. We find that: (1) adversarial examples are not present at initialization but instead emerge during training; (2) the frequency-based nature of common adversarial perturbations in natural images depends critically on an implicit bias towards L1-sparsity in the frequency domain; (3) the origin of this bias is the locality and translation invariance of convolutional filters; and (4) it also requires the existence of useful frequency-based features in the datasets. We propose a simple theoretical explanation for these findings, providing a clear and minimalist target for theorists in future work. Looking forward, our work shows that analyzing the learning dynamics of perturbations can provide useful insights for understanding the origin of adversarial sensitivities and developing robust solutions.
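The abstract does not include code; as a rough illustration of the kind of frequency-domain analysis it describes, the sketch below generates a one-step FGSM perturbation for a toy CNN and inspects the magnitude of its 2D Fourier spectrum. The model, data, attack, and epsilon are placeholders (FGSM stands in for whatever attack the authors use), and in the paper's setting this measurement would be repeated over the course of training to track the learning dynamics.

```python
# Minimal sketch (not the authors' code): compute an FGSM perturbation for a
# toy CNN and examine how its energy is distributed in the frequency domain.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallCNN(nn.Module):
    """Toy convolutional classifier standing in for the models studied."""
    def __init__(self, num_classes: int = 10):
        super().__init__()
        self.conv1 = nn.Conv2d(1, 16, 3, padding=1)
        self.conv2 = nn.Conv2d(16, 32, 3, padding=1)
        self.fc = nn.Linear(32 * 7 * 7, num_classes)

    def forward(self, x):
        x = F.max_pool2d(F.relu(self.conv1(x)), 2)
        x = F.max_pool2d(F.relu(self.conv2(x)), 2)
        return self.fc(x.flatten(1))

def fgsm_perturbation(model, x, y, eps=0.03):
    """One-step FGSM perturbation: eps times the sign of the input gradient."""
    x = x.clone().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    return eps * x.grad.sign()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = SmallCNN()              # untrained here; the paper tracks this
    x = torch.rand(8, 1, 28, 28)    # quantity as training progresses
    y = torch.randint(0, 10, (8,))
    delta = fgsm_perturbation(model, x, y)

    # 2D FFT of the perturbation; a large peak-to-mean ratio in the average
    # magnitude spectrum indicates energy concentrated in a few frequencies.
    spectrum = torch.fft.fftshift(torch.fft.fft2(delta), dim=(-2, -1))
    avg_magnitude = spectrum.abs().mean(dim=(0, 1))
    print("peak / mean spectral magnitude:",
          (avg_magnitude.max() / avg_magnitude.mean()).item())
```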