We propose a multimodal deep learning framework that can transfer knowledge obtained from a single-modal neural network to a network with a different modality. For instance, we show that we can leverage speech data to fine-tune a network trained for video recognition, given an initial audio-video parallel dataset with shared semantics. Our approach learns analogy-preserving embeddings between the abstract representations learned by each network, allowing for semantics-level transfer or reconstruction of data across modalities. Our method is therefore especially useful when labeled data is scarcer in one modality than in the others. While we mainly focus on transfer learning for audio-visual recognition as an application of our approach, our framework is flexible and can work with any multimodal dataset. In this work-in-progress report, we present our preliminary results on the AV-Letters dataset.
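To make the idea of aligning abstract representations across modalities concrete, the following is a minimal sketch, not the paper's actual method or loss: it assumes hypothetical frozen pretrained encoders (audio_encoder, video_encoder), a learned bridge network (EmbeddingBridge), and a simple L2 alignment objective on paired examples; the input dimensions and names are placeholders standing in for AV-Letters style features.

```python
import torch
import torch.nn as nn

class EmbeddingBridge(nn.Module):
    """Maps one modality's representation into the other's embedding space."""
    def __init__(self, src_dim, tgt_dim):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(src_dim, tgt_dim),
            nn.ReLU(),
            nn.Linear(tgt_dim, tgt_dim),
        )

    def forward(self, h_src):
        return self.proj(h_src)

def alignment_loss(z_audio, z_video):
    # Pull paired audio/video embeddings together (L2 distance); an
    # analogy-preserving or contrastive objective could be substituted here.
    return ((z_audio - z_video) ** 2).sum(dim=1).mean()

# Hypothetical pretrained single-modal encoders producing fixed-size features.
audio_encoder = nn.Sequential(nn.Linear(40, 128), nn.ReLU(), nn.Linear(128, 64))
video_encoder = nn.Sequential(nn.Linear(1200, 256), nn.ReLU(), nn.Linear(256, 64))

bridge = EmbeddingBridge(src_dim=64, tgt_dim=64)
optimizer = torch.optim.Adam(bridge.parameters(), lr=1e-3)

# One toy training step on a random "parallel" batch (stand-in for paired data).
audio_batch = torch.randn(8, 40)    # e.g. MFCC-style audio features
video_batch = torch.randn(8, 1200)  # e.g. flattened lip-region video frames

with torch.no_grad():               # keep the pretrained encoders frozen
    h_audio = audio_encoder(audio_batch)
    h_video = video_encoder(video_batch)

loss = alignment_loss(bridge(h_audio), h_video)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"alignment loss: {loss.item():.4f}")
```

Once such a bridge is trained on the parallel data, representations from the data-rich modality can, in principle, be projected into the other modality's embedding space to fine-tune or supervise the data-scarce network.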