Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

20 December 2017

Andrew Owens

Jiajun Wu

Josh H. McDermott

William T. Freeman

Antonio Torralba

SSL


Papers citing "Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning"

39 / 39 papers shown

Audio-Visual Generalized Zero-Shot Learning using Pre-Trained Large Multi-Modal Models David Kurzendörfer Otniel-Bogdan Mercea A. Sophia Koepke Zeynep Akata VLM CLIP 26 2 0 09 Apr 2024
Mix and Localize: Localizing Sound Sources in Mixtures Xixi Hu Ziyang Chen Andrew Owens 23 51 0 28 Nov 2022
Contrastive Audio-Visual Masked Autoencoder Yuan Gong Andrew Rouditchenko Alexander H. Liu David F. Harwath Leonid Karlinsky Hilde Kuehne James R. Glass 24 119 0 02 Oct 2022
TVLT: Textless Vision-Language Transformer Zineng Tang Jaemin Cho Yixin Nie Mohit Bansal VLM 49 28 0 28 Sep 2022
A Closer Look at Weakly-Supervised Audio-Visual Source Localization Shentong Mo Pedro Morgado 79 64 0 30 Aug 2022
Semi-Supervised and Unsupervised Deep Visual Learning: A Survey Yanbei Chen Massimiliano Mancini Xiatian Zhu Zeynep Akata 30 113 0 24 Aug 2022
Temporal and cross-modal attention for audio-visual zero-shot learning Otniel-Bogdan Mercea Thomas Hummel A. Sophia Koepke Zeynep Akata 30 25 0 20 Jul 2022
Finding Fallen Objects Via Asynchronous Audio-Visual Integration Chuang Gan Yi Gu Siyuan Zhou Jeremy Schwartz S. Alter James Traer Dan Gutfreund J. Tenenbaum Josh H. McDermott Antonio Torralba 40 19 0 07 Jul 2022
Sound Localization by Self-Supervised Time Delay Estimation Ziyang Chen David Fouhey Andrew Owens SSL 19 19 0 26 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound Yan-Bo Lin Jie Lei Mohit Bansal Gedas Bertasius 31 39 0 06 Apr 2022
Audio Self-supervised Learning: A Survey Shuo Liu Adria Mallol-Ragolta Emilia Parada-Cabaleiro Kun Qian Xingshuo Jing Alexander Kathan Bin Hu Bjoern W. Schuller SSL 26 106 0 02 Mar 2022
Keyword localisation in untranscribed speech using visually grounded speech models Kayode Olaleye Dan Oneaţă Herman Kamper 19 7 0 02 Feb 2022
Video Transformers: A Survey Javier Selva A. S. Johansen Sergio Escalera Kamal Nasrollahi T. Moeslund Albert Clapés ViT 20 103 0 16 Jan 2022
The Impact of Spatiotemporal Augmentations on Self-Supervised Audiovisual Representation Learning Haider Al-Tahan Y. Mohsenzadeh SSL AI4TS 24 0 0 13 Oct 2021
Attention is All You Need? Good Embeddings with Statistics are enough:Large Scale Audio Understanding without Transformers/ Convolutions/ BERTs/ Mixers/ Attention/ RNNs or .... Prateek Verma AI4TS 24 2 0 07 Oct 2021
LiRA: Learning Visual Speech Representations from Audio through Self-supervision Pingchuan Ma Rodrigo Mira Stavros Petridis Björn W. Schuller M. Pantic SSL 16 53 0 16 Jun 2021
Unsupervised Sound Localization via Iterative Contrastive Learning Yan-Bo Lin Hung-Yu Tseng Hsin-Ying Lee Yen-Yu Lin Ming-Hsuan Yang SSL 19 34 0 01 Apr 2021
Listening to Sounds of Silence for Speech Denoising Ruilin Xu Rundi Wu Y. Ishiwaka Carl Vondrick Changxi Zheng 15 32 0 22 Oct 2020
Self-Supervised Learning of Audio-Visual Objects from Video Triantafyllos Afouras Andrew Owens Joon Son Chung Andrew Zisserman SSL 17 250 0 10 Aug 2020
Learning Video Representations from Textual Web Supervision Jonathan C. Stroud Zhichao Lu Chen Sun Jia Deng Rahul Sukthankar Cordelia Schmid David A. Ross SSL 21 48 0 29 Jul 2020
Multiple Sound Sources Localization from Coarse to Fine Rui Qian Di Hu Heinrich Dinkel Mengyue Wu N. Xu Weiyao Lin 23 153 0 13 Jul 2020
Visually Guided Sound Source Separation using Cascaded Opponent Filter Network Lingyu Zhu Esa Rahtu 14 23 0 04 Jun 2020
S3VAE: Self-Supervised Sequential VAE for Representation Disentanglement and Data Generation Yizhe Zhu Martin Renqiang Min Asim Kadav H. Graf CoGe DRL 6 95 0 23 May 2020
Self-Supervised Learning by Cross-Modal Audio-Video Clustering Humam Alwassel D. Mahajan Bruno Korbar Lorenzo Torresani Bernard Ghanem Du Tran SSL 20 428 0 28 Nov 2019
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications Arda Senocak Tae-Hyun Oh Junsik Kim Ming-Hsuan Yang In So Kweon SSL 17 52 0 20 Nov 2019
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition Evangelos Kazakos Arsha Nagrani Andrew Zisserman Dima Damen EgoV 16 0 0 22 Aug 2019
Learning Video Representations using Contrastive Bidirectional Transformer Chen Sun Fabien Baradel Kevin Patrick Murphy Cordelia Schmid SSL ViT 13 133 0 13 Jun 2019
A Simple Baseline for Audio-Visual Scene-Aware Dialog Idan Schwartz A. Schwing Tamir Hazan 19 69 0 11 Apr 2019
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild Samuel Albanie Arsha Nagrani Andrea Vedaldi Andrew Zisserman CVBM 22 270 0 16 Aug 2018
Unsupervised learning of foreground object detection Ioana Croitoru Simion-Vlad Bogolin Marius Leordeanu OCL 16 48 0 14 Aug 2018
Audio-Visual Scene Analysis with Self-Supervised Multisensory Features Andrew Owens Alexei A. Efros SSL 14 743 0 10 Apr 2018
Objects that Sound Relja Arandjelović Andrew Zisserman ObjD VOS 21 528 0 18 Dec 2017
Interpreting Deep Visual Representations via Network Dissection Bolei Zhou David Bau A. Oliva Antonio Torralba FAtt MILM 29 323 0 15 Nov 2017
Unsupervised Representation Learning by Sorting Sequences Hsin-Ying Lee Jia-Bin Huang Maneesh Kumar Singh Ming-Hsuan Yang SSL DRL 17 533 0 03 Aug 2017
Active Decision Boundary Annotation with Deep Generative Models Miriam W. Huijser J. C. V. Gemert 12 46 0 20 Mar 2017
Colorization as a Proxy Task for Visual Understanding Gustav Larsson Michael Maire Gregory Shakhnarovich SSL 19 493 0 11 Mar 2017
Learning Features by Watching Objects Move Deepak Pathak Ross B. Girshick Piotr Dollár Trevor Darrell Bharath Hariharan SSL VOS OCL 13 521 0 19 Dec 2016
Split-Brain Autoencoders: Unsupervised Learning by Cross-Channel Prediction Richard Y. Zhang Phillip Isola Alexei A. Efros SSL DRL 17 665 0 29 Nov 2016
Generating Videos with Scene Dynamics Carl Vondrick Hamed Pirsiavash Antonio Torralba GAN VGen 69 1,460 0 08 Sep 2016