
Efficient Dynamic Structured Sparse Training with Learned Shuffles

Main: 8 pages · 6 figures · 12 tables · Bibliography: 4 pages · Appendix: 8 pages
Abstract

Structured sparsity accelerates training and inference on modern GPUs, yet it still trails unstructured dynamic sparse training (DST) in accuracy. The shortfall stems from a loss of expressivity: whereas a dense layer can realize every possible mask obtained by choosing any $w$ active weights out of $n$, a fixed block or N:M layout explores only a subset of those possibilities. We propose to close this gap by learning, for each layer, a single permutation matrix jointly with the structured weight matrix. Applying the idea to three canonical structures -- block, N:M, and diagonal -- we show that permutation-augmented DST (PA-DST) matches unstructured baselines (RigL, SET) at 90--95\% sparsity on ImageNet-1K (ViT-B/16) and WikiText-103 (GPT-2), yet trains up to $1.21\times$ and infers up to $2.9\times$ faster. The results position structure + learned permutation as a sweet spot between accuracy and efficiency.
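
To make the core idea concrete, below is a minimal PyTorch-style sketch of a linear layer that composes an N:M magnitude mask with a learned, relaxed permutation of the input dimensions. The names (`PermutedNMLinear`, `nm_mask`) and the Sinkhorn-softmax relaxation are illustrative assumptions, not the paper's stated parameterization; the abstract only specifies that one permutation matrix per layer is learned jointly with the structured weights.

```python
# Hypothetical sketch of a permutation-augmented N:M sparse linear layer.
# The exact relaxation and training schedule used in PA-DST are not given
# in the abstract; this only illustrates the composition W_masked @ P.
import torch
import torch.nn as nn


def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m along the input dim."""
    out_f, in_f = weight.shape
    groups = weight.abs().reshape(out_f, in_f // m, m)
    idx = groups.topk(n, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, idx, 1.0)
    return mask.reshape(out_f, in_f)


class PermutedNMLinear(nn.Module):
    """y = (W ⊙ M) P x: an N:M-masked weight composed with a learned column permutation."""

    def __init__(self, in_features: int, out_features: int, n: int = 2, m: int = 4):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        self.n, self.m = n, m
        # Relaxed permutation: square logits pushed toward a permutation matrix
        # via Sinkhorn normalization (one of several possible parameterizations).
        self.perm_logits = nn.Parameter(torch.zeros(in_features, in_features))

    def soft_permutation(self, iters: int = 5, tau: float = 0.1) -> torch.Tensor:
        p = (self.perm_logits / tau).softmax(dim=-1)
        for _ in range(iters):  # alternate column/row normalization
            p = p / p.sum(dim=0, keepdim=True)
            p = p / p.sum(dim=1, keepdim=True)
        return p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        p = self.soft_permutation()
        masked_w = self.weight * nm_mask(self.weight, self.n, self.m)
        return x @ p.t() @ masked_w.t()  # permute inputs, then apply the sparse weight


layer = PermutedNMLinear(16, 8)
print(layer(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```

At deployment the relaxed permutation would presumably be rounded to a hard permutation, which reduces to a cheap index shuffle and leaves the structured (GPU-friendly) weight layout untouched.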
