Simplifying and Understanding State Space Models with Diagonal Linear RNNs
Sequence models based on linear state spaces (SSMs) have recently emerged as a promising choice of architecture for modeling long-range dependencies across various modalities. However, they invariably rely on discretization of a continuous state space, which complicates their presentation and understanding. In this work, we dispose of the discretization step and propose a model based on vanilla Diagonal Linear RNNs (DLR). We empirically show that DLR is as performant as previously proposed SSMs in the presence of strong supervision, despite being conceptually much simpler. Moreover, we characterize the expressivity of SSMs (including DLR) and attention-based models via a suite of synthetic sequence-to-sequence tasks involving interactions over tens of thousands of tokens, ranging from simple operations, such as shifting an input sequence, to detecting co-dependent visual features over long spatial ranges in flattened images. We find that while SSMs report near-perfect performance on tasks that can be modeled with a few convolutional kernels, they struggle on tasks requiring many such kernels, and especially when the desired sequence manipulation is context-dependent. For example, DLR learns to perfectly shift a 0.5M-long input by an arbitrary number of positions but fails when the shift size depends on context. Despite these limitations, DLR reaches high performance on two higher-order reasoning tasks, ListOpsSubTrees and PathfinderSegmentation-256, with input lengths 8K and 65K respectively, and gives encouraging performance on PathfinderSegmentation-512 with input length 262K, for which attention is not a viable choice.
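To make the model family in the abstract concrete, the sketch below implements a single-channel diagonal linear RNN and checks that it matches its equivalent convolutional view. The parameter names (lam, B, C), the random initialization, and the scalar-input setup are illustrative assumptions for this sketch, not the paper's exact DLR parameterization or training recipe.

# A minimal sketch (not the authors' exact parameterization) of a diagonal
# linear RNN on a scalar input channel:
#     x_t = lam * x_{t-1} + B * u_t,    y_t = Re(C . x_t)
# with a learned complex diagonal lam, which makes the layer equivalent to a
# causal convolution with kernel K_t = sum_n C_n * lam_n**t * B_n.
import numpy as np

def dlr_recurrence(u, lam, B, C):
    """Run the diagonal recurrence over a length-L input u; return real outputs."""
    x = np.zeros(len(lam), dtype=np.complex128)
    y = np.empty(len(u))
    for t, u_t in enumerate(u):
        x = lam * x + B * u_t            # elementwise: the state matrix is diagonal
        y[t] = (C * x).sum().real        # linear readout, keep the real part
    return y

def dlr_kernel(L, lam, B, C):
    """Equivalent convolution kernel K_t = sum_n C_n * lam_n**t * B_n for t < L."""
    t = np.arange(L)[:, None]                            # [L, 1]
    return ((lam[None, :] ** t) * (B * C)).sum(-1).real  # [L]

# The recurrent and convolutional views agree on random (hypothetical) parameters.
rng = np.random.default_rng(0)
N, L = 16, 64
lam = np.exp(-rng.uniform(0.01, 0.1, N) + 2j * np.pi * rng.uniform(size=N))  # stable: |lam| < 1
B = rng.normal(size=N) + 1j * rng.normal(size=N)
C = rng.normal(size=N) + 1j * rng.normal(size=N)
u = rng.normal(size=L)
y_recurrent = dlr_recurrence(u, lam, B, C)
y_convolved = np.convolve(u, dlr_kernel(L, lam, B, C))[:L]
assert np.allclose(y_recurrent, y_convolved)

The equivalence to a fixed convolution kernel is what the abstract's distinction rests on: tasks expressible with a few such kernels are easy for SSM-style layers, while context-dependent manipulations are not, since the kernel does not depend on the input.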
View on arXiv