Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1804.01452
Cited By
Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input
4 April 2018
David F. Harwath
Adrià Recasens
Dídac Surís
Galen Chuang
Antonio Torralba
James R. Glass
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input"
39 / 39 papers shown
Title
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
34
0
0
02 May 2025
Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations
Minoh Jeong
Min Namgung
Zae Myung Kim
Dongyeop Kang
Yao-Yi Chiang
Alfred Hero
25
0
0
02 Oct 2024
Measuring Sound Symbolism in Audio-visual Models
Wei-Cheng Tseng
Yi-Jen Shih
David Harwath
Raymond Mooney
32
0
0
18 Sep 2024
Cross-Lingual Transfer Learning for Speech Translation
Rao Ma
Yassir Fathullah
Mengjie Qian
Siyuan Tang
Mark J. F. Gales
Kate Knill
20
1
0
01 Jul 2024
A model of early word acquisition based on realistic-scale audiovisual naming events
Khazar Khorrami
Okko Rasanen
NAI
40
0
0
07 Jun 2024
SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
Cheol Jun Cho
Abdelrahman Mohamed
Shang-Wen Li
Alan W. Black
Gopala K. Anumanchipalli
29
8
0
16 Oct 2023
Visually grounded few-shot word acquisition with fewer shots
Leanne Nortje
Benjamin van Niekerk
Herman Kamper
18
1
0
25 May 2023
Syllable Discovery and Cross-Lingual Generalization in a Visually Grounded, Self-Supervised Speech Model
Puyuan Peng
Shang-Wen Li
Okko Rasanen
Abdel-rahman Mohamed
David F. Harwath
SSL
VLM
26
7
0
19 May 2023
Hindi as a Second Language: Improving Visually Grounded Speech with Semantically Similar Samples
H. Ryu
Arda Senocak
In So Kweon
Joon Son Chung
VLM
21
8
0
30 Mar 2023
What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
D. Kondermann
Samuel Thomas
Shih-Fu Chang
Rogerio Feris
James R. Glass
Hilde Kuehne
32
7
0
29 Mar 2023
Using Multiple Instance Learning to Build Multimodal Representations
Peiqi Wang
W. Wells
Seth Berkowitz
Steven Horng
Polina Golland
SSL
24
6
0
11 Dec 2022
Mix and Localize: Localizing Sound Sources in Mixtures
Xixi Hu
Ziyang Chen
Andrew Owens
23
51
0
28 Nov 2022
Towards visually prompted keyword localisation for zero-resource spoken languages
Leanne Nortje
Herman Kamper
11
6
0
12 Oct 2022
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Mohit Bansal
VLM
51
28
0
28 Sep 2022
Self-Supervised Speech Representation Learning: A Review
Abdel-rahman Mohamed
Hung-yi Lee
Lasse Borgholt
Jakob Drachmann Havtorn
Joakim Edin
...
Shang-Wen Li
Karen Livescu
Lars Maaløe
Tara N. Sainath
Shinji Watanabe
SSL
AI4TS
128
349
0
21 May 2022
Weakly-Supervised Action Detection Guided by Audio Narration
Keren Ye
Adriana Kovashka
22
0
0
12 May 2022
Improving Multimodal Speech Recognition by Data Augmentation and Speech Representations
Dan Oneaţă
H. Cucu
19
19
0
27 Apr 2022
Audio Self-supervised Learning: A Survey
Shuo Liu
Adria Mallol-Ragolta
Emilia Parada-Cabeleiro
Kun Qian
Xingshuo Jing
Alexander Kathan
Bin Hu
Bjoern W. Schuller
SSL
35
106
0
02 Mar 2022
Keyword localisation in untranscribed speech using visually grounded speech models
Kayode Olaleye
Dan Oneaţă
Herman Kamper
19
7
0
02 Feb 2022
Self-Supervised Moving Vehicle Detection from Audio-Visual Cues
Jannik Zürn
Wolfram Burgard
SSL
26
8
0
30 Jan 2022
Bridging the Gap: Using Deep Acoustic Representations to Learn Grounded Language from Percepts and Raw Speech
Gaoussou Youssouf Kebe
Luke E. Richards
Edward Raff
Francis Ferraro
Cynthia Matuszek
SSL
18
5
0
27 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David F. Harwath
James R. Glass
Hilde Kuehne
ViT
28
129
0
08 Dec 2021
Explainable Semantic Space by Grounding Language to Vision with Cross-Modal Contrastive Learning
Yizhen Zhang
Minkyu Choi
Kuan Han
Zhongming Liu
VLM
15
15
0
13 Nov 2021
Voice-assisted Image Labelling for Endoscopic Ultrasound Classification using Neural Networks
E. Bonmati
Yipeng Hu
A. Grimwood
G. Johnson
G. Goodchild
...
K. Gurusamy
Brian P. Davidson
Matthew J. Clarkson
Stephen P. Pereira
D. Barratt
19
15
0
12 Oct 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
18
33
0
18 Mar 2021
Multimodal Representation Learning via Maximization of Local Mutual Information
Ruizhi Liao
Daniel Moyer
Miriam Cha
Keegan Quigley
Seth Berkowitz
Steven Horng
Polina Golland
W. Wells
SSL
13
41
0
08 Mar 2021
Deep Learning and the Global Workspace Theory
R. V. Rullen
Ryota Kanai
37
65
0
04 Dec 2020
Into the Wild with AudioScope: Unsupervised Audio-Visual Separation of On-Screen Sounds
Efthymios Tzinis
Scott Wisdom
A. Jansen
Shawn Hershey
Tal Remez
D. Ellis
J. Hershey
26
68
0
02 Nov 2020
Leveraging Unpaired Text Data for Training End-to-End Speech-to-Intent Systems
Yinghui Huang
H. Kuo
Samuel Thomas
Zvi Kons
Kartik Audhkhasi
Brian Kingsbury
R. Hoory
M. Picheny
VLM
6
63
0
08 Oct 2020
Self-Supervised Learning of Audio-Visual Objects from Video
Triantafyllos Afouras
Andrew Owens
Joon Son Chung
Andrew Zisserman
SSL
17
250
0
10 Aug 2020
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
415
595
0
21 Jul 2020
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
40
371
0
29 Jun 2020
AVLnet: Learning Audio-Visual Language Representations from Instructional Videos
Andrew Rouditchenko
Angie Boggust
David F. Harwath
Brian Chen
D. Joshi
...
Rogerio Feris
Brian Kingsbury
M. Picheny
Antonio Torralba
James R. Glass
SSL
22
141
0
16 Jun 2020
Experience Grounds Language
Yonatan Bisk
Ari Holtzman
Jesse Thomason
Jacob Andreas
Yoshua Bengio
...
Angeliki Lazaridou
Jonathan May
Aleksandr Nisnevich
Nicolas Pinto
Joseph P. Turian
19
350
0
21 Apr 2020
Deep daxes: Mutual exclusivity arises through both learning biases and pragmatic strategies in neural networks
Kristina Gulordava
T. Brochhagen
Gemma Boleda
11
3
0
08 Apr 2020
Direct Speech-to-image Translation
Jiguo Li
Xinfeng Zhang
Chuanmin Jia
Jizheng Xu
Li Zhang
Y. Wang
Siwei Ma
Wen Gao
28
29
0
07 Apr 2020
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
Arda Senocak
Tae-Hyun Oh
Junsik Kim
Ming-Hsuan Yang
In So Kweon
SSL
27
52
0
20 Nov 2019
MaSS: A Large and Clean Multilingual Corpus of Sentence-aligned Spoken Utterances Extracted from the Bible
Marcely Zanon Boito
William N. Havard
Mahault Garnerin
Éric Le Ferrand
Laurent Besacier
22
46
0
30 Jul 2019
Semantic speech retrieval with a visually grounded model of untranscribed speech
Herman Kamper
Gregory Shakhnarovich
Karen Livescu
21
53
0
05 Oct 2017
1