SoundNet: Learning Sound Representations from Unlabeled Video

27 October 2016

Y. Aytar

Carl Vondrick

Antonio Torralba

SSL

Papers citing "SoundNet: Learning Sound Representations from Unlabeled Video"

50 / 120 papers shown

Geometry-Aware Multi-Task Learning for Binaural Audio Generation from Video Rishabh Garg Ruohan Gao Kristen Grauman 13 27 0 21 Nov 2021
Wav2CLIP: Learning Robust Audio Representations From CLIP Ho-Hsiang Wu Prem Seetharaman Kundan Kumar J. P. Bello CLIP VLM 31 267 0 21 Oct 2021
The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning Haider Al-Tahan Y. Mohsenzadeh SSL AI4TS 19 0 0 13 Oct 2021
Attention is All You Need? Good Embeddings with Statistics are enough:Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or .... Prateek Verma AI4TS 21 2 0 07 Oct 2021
Understanding and Improving Usability of Data Dashboards for Simplified Privacy Control of Voice Assistant Data (Extended Version) Vandit Sharma Mainack Mondal 9 3 0 06 Oct 2021
Parsing Birdsong with Deep Audio Embeddings Irina Tolkova Brian Chu Marcel Hedman Stefan Kahl Holger Klinck 28 10 0 20 Aug 2021
LIGA-Stereo: Learning LiDAR Geometry Aware Representations for Stereo-based 3D Detector Xiaoyang Guo Shaoshuai Shi Xiaogang Wang Hongsheng Li 3DPC 23 106 0 18 Aug 2021
Cross-modal Spectrum Transformation Network For Acoustic Scene classification Yang Liu A. Neophytou Sunando Sengupta Eric Sommerlade 11 9 0 13 Aug 2021
DarkGAN: Exploiting Knowledge Distillation for Comprehensible Audio Synthesis with GANs J. Nistal Stefan Lattner G. Richard 19 8 0 03 Aug 2021
Attention Bottlenecks for Multimodal Fusion Arsha Nagrani Shan Yang Anurag Arnab A. Jansen Cordelia Schmid Chen Sun 25 539 0 30 Jun 2021
Unsupervised Sound Localization via Iterative Contrastive Learning Yan-Bo Lin Hung-Yu Tseng Hsin-Ying Lee Yen-Yu Lin Ming-Hsuan Yang SSL 19 34 0 01 Apr 2021
Slow-Fast Auditory Streams For Audio Recognition Evangelos Kazakos Arsha Nagrani Andrew Zisserman Dima Damen 8 66 0 05 Mar 2021
Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss Naoki Makishima Mana Ihori Akihiko Takashima Tomohiro Tanaka Shota Orihashi Ryo Masumura 13 8 0 02 Mar 2021
There is More than Meets the Eye: Self-Supervised Multi-Object Detection and Tracking with Sound by Distilling Multimodal Knowledge Francisco Rivera Valverde Juana Valeria Hurtado Abhinav Valada 26 72 0 01 Mar 2021
Environment Transfer for Distributed Systems Chunheng Jiang Jae-wook Ahn N. Desai 26 1 0 06 Jan 2021
Learning to dance: A graph convolutional adversarial network to generate realistic dance motions from audio João P. Ferreira Thiago M. Coutinho Thiago L. Gomes J. F. Neto Rafael Azevedo Renato Martins Erickson R. Nascimento GAN 28 68 0 25 Nov 2020
Learning Representations from Audio-Visual Spatial Alignment Pedro Morgado Yi Li Nuno Vasconcelos SSL 11 121 0 03 Nov 2020
Listening to Sounds of Silence for Speech Denoising Ruilin Xu Rundi Wu Y. Ishiwaka Carl Vondrick Changxi Zheng 15 32 0 22 Oct 2020
Learning Video Representations from Textual Web Supervision Jonathan C. Stroud Zhichao Lu Chen Sun Jia Deng Rahul Sukthankar Cordelia Schmid David A. Ross SSL 19 48 0 29 Jul 2020
Learning Modality Interaction for Temporal Sentence Localization and Event Captioning in Videos Shaoxiang Chen Wenhao Jiang Wei Liu Yu-Gang Jiang 19 101 0 28 Jul 2020
Rethinking CNN Models for Audio Classification Kamalesh Palanisamy Dipika Singhania Angela Yao SSL 17 144 0 22 Jul 2020
Multiple Sound Sources Localization from Coarse to Fine Rui Qian Di Hu Heinrich Dinkel Mengyue Wu N. Xu Weiyao Lin 23 153 0 13 Jul 2020
Self-Supervised MultiModal Versatile Networks Jean-Baptiste Alayrac Adrià Recasens R. Schneider Relja Arandjelović Jason Ramapuram J. Fauw Lucas Smaira Sander Dieleman Andrew Zisserman SSL 40 371 0 29 Jun 2020
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos Andrew Rouditchenko Angie Boggust David F. Harwath Brian Chen D. Joshi ... Rogerio Feris Brian Kingsbury M. Picheny Antonio Torralba James R. Glass SSL 22 141 0 16 Jun 2020
Visually Guided Sound Source Separation using Cascaded Opponent Filter Network Lingyu Zhu Esa Rahtu 14 23 0 04 Jun 2020
High-Fidelity Audio Generation and Representation Learning with Guided Adversarial Autoencoder Kazi Nazmul Haque R. Rana Björn W Schuller DRL 24 12 0 01 Jun 2020
Multimodal Target Speech Separation with Voice and Face References Leyuan Qu C. Weber S. Wermter CVBM 19 19 0 17 May 2020
Cross-modal Speaker Verification and Recognition: A Multilingual Perspective M. S. Saeed Shah Nawaz Pietro Morerio Arif Mahmood I. Gallo Muhammad Haroon Yousaf Alessio Del Bue CVBM 19 25 0 28 Apr 2020
Conditioned Source Separation for Music Instrument Performances Olga Slizovskaia G. Haro E. Gómez 22 38 0 08 Apr 2020
Disentangled Speech Embeddings using Cross-modal Self-supervision Arsha Nagrani Joon Son Chung Samuel Albanie Andrew Zisserman SSL 11 88 0 20 Feb 2020
Audiovisual SlowFast Networks for Video Recognition Fanyi Xiao Yong Jae Lee Kristen Grauman Jitendra Malik Christoph Feichtenhofer 192 205 0 23 Jan 2020
STAViS: Spatio-Temporal AudioVisual Saliency Network A. Tsiami Petros Koutras Petros Maragos 16 73 0 09 Jan 2020
Listen to Look: Action Recognition by Previewing Audio Ruohan Gao Tae-Hyun Oh Kristen Grauman Lorenzo Torresani VLM 27 251 0 10 Dec 2019
Self-Supervised Learning by Cross-Modal Audio-Video Clustering Humam Alwassel D. Mahajan Bruno Korbar Lorenzo Torresani Bernard Ghanem Du Tran SSL 17 428 0 28 Nov 2019
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications Arda Senocak Tae-Hyun Oh Junsik Kim Ming-Hsuan Yang In So Kweon SSL 17 52 0 20 Nov 2019
Deep Long Audio Inpainting Ya-Liang Chang Kuan-Ying Lee Po-Yu Wu Hung-yi Lee Winston H. Hsu 17 33 0 15 Nov 2019
DEPA: Self-Supervised Audio Embedding for Depression Detection Pingyue Zhang Mengyue Wu Heinrich Dinkel Kai Yu 11 51 0 29 Oct 2019
Vision-Infused Deep Audio Inpainting Hang Zhou Ziwei Liu Lingfeng Guo Ping Luo Dahua Lin 21 88 0 24 Oct 2019
Contrastive Representation Distillation Yonglong Tian Dilip Krishnan Phillip Isola 24 1,029 0 23 Oct 2019
Coordinated Joint Multimodal Embeddings for Generalized Audio-Visual Zeroshot Classification and Retrieval of Videos Kranti K. Parida Neeraj Matiyali T. Guha Gaurav Sharma VLM 14 41 0 19 Oct 2019
Watch, Listen and Tell: Multi-modal Weakly Supervised Dense Event Captioning Tanzila Rahman Bicheng Xu Leonid Sigal 17 77 0 22 Sep 2019
Multimodal Deep Models for Predicting Affective Responses Evoked by Movies Ha Thi Phuong Thao Dorien Herremans Gemma Roig 21 16 0 16 Sep 2019
Recursive Visual Sound Separation Using Minus-Plus Net Xudong Xu Bo Dai Dahua Lin 13 91 0 30 Aug 2019
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition Evangelos Kazakos Arsha Nagrani Andrew Zisserman Dima Damen EgoV 14 0 0 22 Aug 2019
Multi-task Self-Supervised Learning for Human Activity Detection Aaqib Saeed T. Ozcelebi J. Lukkien SSL 6 268 0 27 Jul 2019
Adaptive Regularization via Residual Smoothing in Deep Learning Optimization Jung-Kyun Cho Junseok Kwon Byung-Woo Hong 21 1 0 23 Jul 2019
Bag-of-Audio-Words based on Autoencoder Codebook for Continuous Emotion Prediction Mohammed Senoussaoui P. Cardinal Alessandro Lameiras Koerich 14 2 0 06 Jul 2019
Learning Video Representations using Contrastive Bidirectional Transformer Chen Sun Fabien Baradel Kevin Patrick Murphy Cordelia Schmid SSL ViT 8 133 0 13 Jun 2019
Learning Individual Styles of Conversational Gesture Shiry Ginosar Amir Bar Gefen Kohavi Caroline Chan Andrew Owens Jitendra Malik SLR 12 326 0 10 Jun 2019
End-to-End Environmental Sound Classification using a 1D Convolutional Neural Network Sajjad Abdoli P. Cardinal Alessandro Lameiras Koerich 34 269 0 18 Apr 2019