On minimizers and convolutional filters: a partial justification for the effectiveness of CNNs in categorical sequence analysis

9 November 2021

Abstract

Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how those filters can be used to classify the sequence. In this manuscript, we demonstrate through a careful mathematical analysis of hash function properties that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In additional empirical experiments, we find that this property manifests as decreased density in repetitive regions. This provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
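The mechanical part of this correspondence can be illustrated with a minimal sketch. It is not the manuscript's analysis or code. It assumes a DNA alphabet, one-hot encoding, a single Gaussian-initialized filter of width k, and max-pooling with stride 1 over w adjacent filter outputs. All function names are hypothetical. Under these assumptions, the position kept by max-pooling in each window coincides with the minimizer chosen when k-mers are ordered by their negated filter response. The sketch does not attempt to reproduce the Hamming-distance property or the density experiments.

```python
import random

ALPHABET = "ACGT"  # assumption: DNA; any categorical alphabet would work


def kmers(seq, k):
    """All overlapping k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizers(seq, k, w, order):
    """Pick, in every window of w consecutive k-mers, the k-mer that is
    smallest under `order` (ties broken by leftmost position)."""
    km = kmers(seq, k)
    picked = set()
    for start in range(len(km) - w + 1):
        best = min(range(start, start + w), key=lambda i: (order(km[i]), i))
        picked.add((best, km[best]))
    return sorted(picked)


def gaussian_filter(k, seed=0):
    """One convolutional filter of width k with i.i.d. N(0, 1) weights,
    stored as one weight per (position, letter)."""
    rng = random.Random(seed)
    return [{c: rng.gauss(0.0, 1.0) for c in ALPHABET} for _ in range(k)]


def filter_score(filt, kmer):
    """Dot product of the filter with the one-hot encoding of the k-mer."""
    return sum(filt[j][c] for j, c in enumerate(kmer))


def maxpool_selection(seq, k, w, filt):
    """Convolve the filter over seq, then keep the argmax position in each
    window of w adjacent outputs (max-pooling with stride 1)."""
    km = kmers(seq, k)
    scores = [filter_score(filt, x) for x in km]
    picked = set()
    for start in range(len(scores) - w + 1):
        best = max(range(start, start + w), key=lambda i: (scores[i], -i))
        picked.add((best, km[best]))
    return sorted(picked)


if __name__ == "__main__":
    seq = "ACGTACGTTTGACCAGT"
    filt = gaussian_filter(k=3, seed=1)
    by_pooling = maxpool_selection(seq, 3, 4, filt)
    by_ordering = minimizers(seq, 3, 4, lambda x: -filter_score(filt, x))
    print(by_pooling == by_ordering)
```

A conventional minimizer scheme would pass a hash function as `order`; here the ordering is induced by the random filter instead, which is the sense in which the two selections agree.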
