A Deep Variational Convolutional Neural Network for Robust Speech Recognition in the Waveform Domain

We investigate the potential of probabilistic neural networks for learning robust waveform-based acoustic models. To that end, we consider a deep convolutional network that first decomposes speech into frequency sub-bands via an adaptive parametric convolutional block whose filters are specified by cosine modulations of compactly supported windows. The network then employs standard non-parametric 1D convolutions to extract the most relevant spectro-temporal patterns while gradually compressing the structured high-dimensional representation generated by the parametric block. We rely on a probabilistic parametrization of the proposed architecture and learn the model using stochastic variational inference. This requires evaluating an analytically intractable integral that defines the Kullback-Leibler divergence term responsible for regularization, for which we propose an effective approximation based on Gauss-Hermite quadrature. Our empirical results demonstrate superior performance of the proposed approach over relevant waveform-based baselines and indicate that it could lead to improved robustness. Moreover, the approach outperforms a recently proposed deep convolutional network for learning robust acoustic models with standard filterbank features.
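
To make the quadrature step concrete, the following is a minimal sketch of how Gauss-Hermite quadrature can approximate a Kullback-Leibler term of the form E_q[log q(w) - log p(w)] for a one-dimensional Gaussian variational posterior q. The abstract does not state the prior or posterior families used in the paper, so the Gaussian posterior, the function names (`gauss_hermite_expectation`, `gaussian_log_pdf`, `kl_gauss_hermite`), and the number of quadrature points are all assumptions made for illustration only.

```python
import numpy as np


def gauss_hermite_expectation(f, mu, sigma, n_points=20):
    """Approximate E[f(w)] for w ~ N(mu, sigma^2) with Gauss-Hermite quadrature.

    Uses the change of variables w = mu + sqrt(2) * sigma * x, which maps the
    Gaussian expectation onto the weight function exp(-x^2).
    """
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    points = mu + np.sqrt(2.0) * sigma * nodes
    return float(np.sum(weights * f(points)) / np.sqrt(np.pi))


def gaussian_log_pdf(w, mu, sigma):
    """Log density of N(mu, sigma^2); used here only as an example prior."""
    return -0.5 * np.log(2.0 * np.pi * sigma**2) - (w - mu) ** 2 / (2.0 * sigma**2)


def kl_gauss_hermite(mu_q, sigma_q, log_prior, n_points=20):
    """KL(q || p) for q = N(mu_q, sigma_q^2) and an arbitrary log-prior.

    The negative entropy of q is known in closed form; only the cross term
    E_q[log p(w)] is approximated numerically (assumed setup, not the paper's).
    """
    neg_entropy = -0.5 * np.log(2.0 * np.pi * np.e * sigma_q**2)
    cross_term = gauss_hermite_expectation(log_prior, mu_q, sigma_q, n_points)
    return float(neg_entropy - cross_term)


# Example: a standard normal prior, where the result can be checked against
# the closed-form KL divergence between two Gaussians.
kl_estimate = kl_gauss_hermite(0.5, 0.8, lambda w: gaussian_log_pdf(w, 0.0, 1.0))
```

With a Gaussian prior the integrand is a quadratic polynomial, so the quadrature is exact up to floating-point error; this makes it a convenient sanity check before swapping in a log-prior for which no closed form exists.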