In this paper we start with a simple question: how is it possible that humans can recognize different movements over the skin given only prior visual experience of them? Or, more generally, what is a representation of spatial sequences that is invariant to scale, rotation, and translation across different modalities? To answer, we rethink the mathematical representation of spatial sequences, argue against the minimum description length principle, and focus on the movements of attention. We advance the idea that spatial sequences must be represented on different levels of abstraction; this adds redundancy but is necessary for recognition and generalization. To address the open question of how these abstractions are formed, we propose two hypotheses: the first invites exploring selectionist learning instead of fitting parameters in some model; the second proposes finding new data structures, not neural network architectures, to efficiently store and operate over redundant features to be further selected. Movements of attention are central to human cognition, and these lessons should be applied to new, better learning algorithms.
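The abstract does not specify an encoding, but as a minimal sketch of the invariances in question, a 2D point sequence can be re-encoded as turning angles and segment-length ratios: differencing removes translation, relative angles remove rotation, and length ratios remove scale. The function name `invariant_encoding` and the NumPy implementation below are illustrative assumptions, not the representation proposed in the paper.

```python
import numpy as np

def invariant_encoding(points):
    """Encode a 2D point sequence (N >= 3 points) so the result is
    unchanged by translation, rotation, and uniform scaling.
    Hypothetical sketch, not the paper's method."""
    pts = np.asarray(points, dtype=float)
    vecs = np.diff(pts, axis=0)                      # differences drop translation
    lengths = np.linalg.norm(vecs, axis=1)
    angles = np.arctan2(vecs[:, 1], vecs[:, 0])
    # Relative turning angles are rotation-invariant.
    turning = np.mod(np.diff(angles) + np.pi, 2 * np.pi) - np.pi
    # Ratios of consecutive segment lengths are scale-invariant.
    ratios = lengths[1:] / lengths[:-1]
    return turning, ratios

# A triangle path and a scaled, rotated, translated copy encode identically.
base = np.array([[0, 0], [1, 0], [1, 1], [0, 0]])
theta = 0.7
rot = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
moved = 2.5 * base @ rot.T + np.array([3.0, -1.0])
t1, r1 = invariant_encoding(base)
t2, r2 = invariant_encoding(moved)
assert np.allclose(t1, t2) and np.allclose(r1, r2)
```

Such an encoding is modality-agnostic in the sense the question asks about: the same code results whether the sequence was traced visually or over the skin, as long as the relative geometry of the path is preserved.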