
Although deep-learning-based speech enhancement methods have demonstrated good performance in adverse acoustic environments, their performance is strongly affected by the distance between the speech source and the microphones, since speech signals attenuate quickly as they propagate through air. In this paper, we propose \textit{deep ad-hoc beamforming} to address the far-field speech processing problem. It comprises two novel components. First, it combines \textit{ad-hoc microphone arrays} with deep-learning-based multichannel speech enhancement, where an ad-hoc microphone array is a set of randomly distributed microphones collaborating with each other. This combination significantly reduces the probability of far-field acoustic conditions occurring. Second, it introduces a new problem, \textit{channel selection}, to deep-learning-based multichannel speech enhancement, and groups the microphones near the speech source into a local microphone array via a channel selection algorithm. The channel selection algorithm first predicts the quality of the received speech signal of each channel with a deep neural network. Then, it groups the microphones that have high predicted speech quality and strong cross-channel signal correlation into a local microphone array. We developed several channel selection algorithms, ranging from simple one-best channel selection to machine-learning-based channel selection. We conducted extensive experiments in scenarios where the locations of the speech sources are far-field, random, and blind to the microphones. Results show that our method outperforms representative deep-learning-based speech enhancement methods by a large margin in reverberant environments with both diffuse noise and point-source noise.
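To make the channel selection step concrete, the following is a minimal sketch, not the paper's implementation, of how microphones might be grouped by predicted quality and cross-channel correlation. The function names, thresholds, and the \texttt{predict\_quality} callable (standing in for the DNN-based quality predictor) are hypothetical placeholders.

\begin{verbatim}
import numpy as np

def peak_normalized_xcorr(x, y):
    """Peak of the normalized cross-correlation between two channels."""
    x = (x - x.mean()) / (x.std() + 1e-8)
    y = (y - y.mean()) / (y.std() + 1e-8)
    corr = np.correlate(x, y, mode="full") / len(x)
    return np.max(np.abs(corr))

def select_local_array(signals, predict_quality,
                       q_thresh=0.6, c_thresh=0.3):
    """Group channels with high predicted quality and strong correlation
    with the one-best channel into a local microphone array.

    signals: list of 1-D numpy arrays, one per microphone channel.
    predict_quality: callable mapping a signal to a quality score,
                     e.g., a DNN-based estimate of speech quality.
    """
    qualities = np.array([predict_quality(s) for s in signals])
    best = int(np.argmax(qualities))          # one-best channel selection
    selected = [best]
    for ch, s in enumerate(signals):
        if ch == best:
            continue
        if (qualities[ch] >= q_thresh and
                peak_normalized_xcorr(signals[best], s) >= c_thresh):
            selected.append(ch)
    return selected
\end{verbatim}

The selected channels would then be passed to the multichannel speech enhancement front end; the thresholds above are illustrative and would in practice be tuned or replaced by a learned grouping rule.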