Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input

4 April 2018

Antonio Torralba

Papers citing "Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input"

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment Edson Araujo Andrew Rouditchenko Yuan Gong Saurabhchand Bhati Samuel Thomas Brian Kingsbury Leonid Karlinsky Rogerio Feris James Glass 34 0 0 02 May 2025
Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations Minoh Jeong Min Namgung Zae Myung Kim Dongyeop Kang Yao-Yi Chiang Alfred Hero 25 0 0 02 Oct 2024
Measuring Sound Symbolism in Audio-visual Models Wei-Cheng Tseng Yi-Jen Shih David Harwath Raymond Mooney 32 0 0 18 Sep 2024
Cross-Lingual Transfer Learning for Speech Translation Rao Ma Yassir Fathullah Mengjie Qian Siyuan Tang Mark J. F. Gales Kate Knill 20 1 0 01 Jul 2024
A model of early word acquisition based on realistic-scale audiovisual naming events Khazar Khorrami Okko Rasanen NAI 40 0 0 07 Jun 2024
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT Cheol Jun Cho Abdelrahman Mohamed Shang-Wen Li Alan W. Black Gopala K. Anumanchipalli 29 8 0 16 Oct 2023
Visually grounded few-shot word acquisition with fewer shots Leanne Nortje Benjamin van Niekerk Herman Kamper 18 1 0 25 May 2023
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model Puyuan Peng Shang-Wen Li Okko Rasanen Abdel-rahman Mohamed David F. Harwath SSL VLM 26 7 0 19 May 2023
Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples H. Ryu Arda Senocak In So Kweon Joon Son Chung VLM 21 8 0 30 Mar 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions Brian Chen Nina Shvetsova Andrew Rouditchenko D. Kondermann Samuel Thomas Shih-Fu Chang Rogerio Feris James R. Glass Hilde Kuehne 32 7 0 29 Mar 2023
Using Multiple Instance Learning to Build Multimodal Representations Peiqi Wang W. Wells Seth Berkowitz Steven Horng Polina Golland SSL 24 6 0 11 Dec 2022
Mix and Localize: Localizing Sound Sources in Mixtures Xixi Hu Ziyang Chen Andrew Owens 23 51 0 28 Nov 2022
Towards visually prompted keyword localisation for zero-resource spoken languages Leanne Nortje Herman Kamper 11 6 0 12 Oct 2022
TVLT: Textless Vision-Language Transformer Zineng Tang Jaemin Cho Yixin Nie Mohit Bansal VLM 51 28 0 28 Sep 2022
Self-Supervised Speech Representation Learning: A Review Abdel-rahman Mohamed Hung-yi Lee Lasse Borgholt Jakob Drachmann Havtorn Joakim Edin ... Shang-Wen Li Karen Livescu Lars Maaløe Tara N. Sainath Shinji Watanabe SSL AI4TS 128 349 0 21 May 2022
Weakly-Supervised Action Detection Guided by Audio Narration Keren Ye Adriana Kovashka 22 0 0 12 May 2022
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations Dan Oneaţă H. Cucu 19 19 0 27 Apr 2022
Audio Self-supervised Learning: A Survey Shuo Liu Adria Mallol-Ragolta Emilia Parada-Cabeleiro Kun Qian Xingshuo Jing Alexander Kathan Bin Hu Bjoern W. Schuller SSL 35 106 0 02 Mar 2022
Keyword localisation in untranscribed speech using visually grounded speech models Kayode Olaleye Dan Oneaţă Herman Kamper 19 7 0 02 Feb 2022
Self-Supervised Moving Vehicle Detection from Audio-Visual Cues Jannik Zürn Wolfram Burgard SSL 26 8 0 30 Jan 2022
Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech Gaoussou Youssouf Kebe Luke E. Richards Edward Raff Francis Ferraro Cynthia Matuszek SSL 18 5 0 27 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval Nina Shvetsova Brian Chen Andrew Rouditchenko Samuel Thomas Brian Kingsbury Rogerio Feris David F. Harwath James R. Glass Hilde Kuehne ViT 28 129 0 08 Dec 2021
Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning Yizhen Zhang Minkyu Choi Kuan Han Zhongming Liu VLM 15 15 0 13 Nov 2021
Voice-assisted Image Labelling for Endoscopic Ultrasound Classification using Neural Networks E. Bonmati Yipeng Hu A. Grimwood G. Johnson G. Goodchild ... K. Gurusamy Brian P. Davidson Matthew J. Clarkson Stephen P. Pereira D. Barratt 19 15 0 12 Oct 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning Mandela Patrick Yuki M. Asano Bernie Huang Ishan Misra Florian Metze Joao Henriques Andrea Vedaldi AI4TS 18 33 0 18 Mar 2021
Multimodal Representation Learning via Maximization of Local Mutual Information Ruizhi Liao Daniel Moyer Miriam Cha Keegan Quigley Seth Berkowitz Steven Horng Polina Golland W. Wells SSL 13 41 0 08 Mar 2021
Deep Learning and the Global Workspace Theory R. V. Rullen Ryota Kanai 37 65 0 04 Dec 2020
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds Efthymios Tzinis Scott Wisdom A. Jansen Shawn Hershey Tal Remez D. Ellis J. Hershey 26 68 0 02 Nov 2020
Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems Yinghui Huang H. Kuo Samuel Thomas Zvi Kons Kartik Audhkhasi Brian Kingsbury R. Hoory M. Picheny VLM 6 63 0 08 Oct 2020
Self-Supervised Learning of Audio-Visual Objects from Video Triantafyllos Afouras Andrew Owens Joon Son Chung Andrew Zisserman SSL 17 250 0 10 Aug 2020
Multi-modal Transformer for Video Retrieval Valentin Gabeur Chen Sun Alahari Karteek Cordelia Schmid ViT 415 595 0 21 Jul 2020
Self-Supervised MultiModal Versatile Networks Jean-Baptiste Alayrac Adrià Recasens R. Schneider Relja Arandjelović Jason Ramapuram J. Fauw Lucas Smaira Sander Dieleman Andrew Zisserman SSL 40 371 0 29 Jun 2020
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos Andrew Rouditchenko Angie Boggust David F. Harwath Brian Chen D. Joshi ... Rogerio Feris Brian Kingsbury M. Picheny Antonio Torralba James R. Glass SSL 22 141 0 16 Jun 2020
Experience Grounds Language Yonatan Bisk Ari Holtzman Jesse Thomason Jacob Andreas Yoshua Bengio ... Angeliki Lazaridou Jonathan May Aleksandr Nisnevich Nicolas Pinto Joseph P. Turian 19 350 0 21 Apr 2020
Deep daxes: Mutual exclusivity arises through both learning biases and pragmatic strategies in neural networks Kristina Gulordava T. Brochhagen Gemma Boleda 11 3 0 08 Apr 2020
Direct Speech-to-image Translation Jiguo Li Xinfeng Zhang Chuanmin Jia Jizheng Xu Li Zhang Y. Wang Siwei Ma Wen Gao 28 29 0 07 Apr 2020
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications Arda Senocak Tae-Hyun Oh Junsik Kim Ming-Hsuan Yang In So Kweon SSL 27 52 0 20 Nov 2019
MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible Marcely Zanon Boito William N. Havard Mahault Garnerin Éric Le Ferrand Laurent Besacier 22 46 0 30 Jul 2019
Semantic speech retrieval with a visually grounded model of untranscribed speech Herman Kamper Gregory Shakhnarovich Karen Livescu 21 53 0 05 Oct 2017