Points2Sound: From mono to binaural audio using 3D point cloud scenes
For immersive applications, the generation of binaural sound that matches its visual counterpart is crucial to bring meaningful experiences to people in a virtual environment. Recent works have shown that neural networks can synthesize binaural audio from mono audio using 2D visual information as guidance. Extending this approach to guide the audio with 3D visual information while operating in the waveform domain may allow for a more accurate auralization of a virtual audio scene. In this paper, we present Points2Sound, a multi-modal deep learning model that generates a binaural version of mono audio using 3D point cloud scenes. Specifically, Points2Sound consists of a vision network with 3D sparse convolutions, which extracts visual features from the point cloud scene to condition an audio network, operating in the waveform domain, that synthesizes the binaural version. Experimental results indicate that 3D visual information can successfully guide multi-modal deep learning models for the task of binaural synthesis. In addition, we investigate different loss functions and 3D point cloud attributes, showing that directly predicting the full binaural signal and using RGB-depth features increase the performance of our proposed model.
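To illustrate the general idea of conditioning a waveform-domain audio network on point cloud features, the following is a minimal, hypothetical PyTorch sketch. It is not the authors' implementation: it replaces the paper's 3D sparse convolutions with a simplified PointNet-style encoder and uses FiLM-style feature modulation; the module names (`PointCloudEncoder`, `MonoToBinaural`) and all layer sizes are illustrative assumptions.

```python
# Hypothetical sketch: condition a 1D waveform network on a point cloud feature.
# Not the Points2Sound architecture; sparse 3D convolutions are replaced by a
# simple PointNet-style encoder, and conditioning is done with FiLM modulation.
import torch
import torch.nn as nn


class PointCloudEncoder(nn.Module):
    """Maps a point cloud (B, N, 6) with xyz + color/depth attributes to a global feature."""
    def __init__(self, in_dim=6, feat_dim=128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, feat_dim),
        )

    def forward(self, points):                  # points: (B, N, in_dim)
        per_point = self.mlp(points)            # (B, N, feat_dim)
        return per_point.max(dim=1).values      # global max-pool -> (B, feat_dim)


class MonoToBinaural(nn.Module):
    """1D convolutional network on the raw waveform, modulated by the visual feature."""
    def __init__(self, feat_dim=128, channels=64):
        super().__init__()
        self.inp = nn.Conv1d(1, channels, kernel_size=15, padding=7)
        self.film = nn.Linear(feat_dim, 2 * channels)      # per-channel scale and shift
        self.body = nn.Sequential(
            nn.Conv1d(channels, channels, 15, padding=7), nn.ReLU(),
            nn.Conv1d(channels, channels, 15, padding=7), nn.ReLU(),
        )
        self.out = nn.Conv1d(channels, 2, kernel_size=1)   # 2 output channels: left/right

    def forward(self, mono, visual_feat):        # mono: (B, 1, T)
        x = self.inp(mono)
        scale, shift = self.film(visual_feat).chunk(2, dim=-1)
        x = x * scale.unsqueeze(-1) + shift.unsqueeze(-1)  # condition on the 3D scene
        return self.out(self.body(x))            # (B, 2, T) binaural estimate


if __name__ == "__main__":
    pc = torch.randn(4, 1024, 6)        # batch of point clouds: xyz + rgb-depth attributes
    mono = torch.randn(4, 1, 16000)     # 1 s of mono audio at 16 kHz
    feat = PointCloudEncoder()(pc)
    binaural = MonoToBinaural()(mono, feat)
    print(binaural.shape)               # torch.Size([4, 2, 16000])
```

The sketch only shows the data flow described in the abstract: a visual encoder produces a scene embedding that modulates an audio network mapping a mono waveform to a two-channel binaural waveform.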