1706.00932
See, Hear, and Read: Deep Aligned Representations
3 June 2017
Y. Aytar
Carl Vondrick
Antonio Torralba
VLM
AI4TS
Papers citing "See, Hear, and Read: Deep Aligned Representations"
36 / 36 papers shown
Title
Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations
Minoh Jeong
Min Namgung
Zae Myung Kim
Dongyeop Kang
Yao-Yi Chiang
Alfred Hero
35
0
0
02 Oct 2024
InteRACT: Transformer Models for Human Intent Prediction Conditioned on Robot Actions
Kushal Kedia
Atiksh Bhardwaj
Prithwish Dan
Sanjiban Choudhury
34
9
0
21 Nov 2023
Effective Multimodal Reinforcement Learning with Modality Alignment and Importance Enhancement
Jinming Ma
Feng Wu
Yingfeng Chen
Xianpeng Ji
Yu-qiong Ding
OffRL
33
4
0
18 Feb 2023
Robust Sound-Guided Image Manipulation
Seung Hyun Lee
Gyeongrok Oh
Wonmin Byeon
Sang Ho Yoon
Jinkyu Kim
Sangpil Kim
DiffM
26
7
0
30 Aug 2022
Is an Object-Centric Video Representation Beneficial for Transfer?
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
ViT
37
27
0
20 Jul 2022
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa
Yogesh S Rawat
M. Shah
SSL
41
132
0
18 Jun 2022
CyCLIP: Cyclic Contrastive Language-Image Pretraining
Shashank Goel
Hritik Bansal
S. Bhatia
Ryan A. Rossi
Vishwa Vinay
Aditya Grover
CLIP
VLM
186
134
0
28 May 2022
Cross Modal Retrieval with Querybank Normalisation
Simion-Vlad Bogolin
Ioana Croitoru
Hailin Jin
Yang Liu
Samuel Albanie
29
84
0
23 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David Harwath
James R. Glass
Hilde Kuehne
ViT
34
129
0
08 Dec 2021
Sound-Guided Semantic Image Manipulation
Seung Hyun Lee
Wonseok Roh
Wonmin Byeon
Sang Ho Yoon
Chanyoung Kim
Jinkyu Kim
Sangpil Kim
DiffM
37
43
0
30 Nov 2021
Learning to Cut by Watching Movies
Alejandro Pardo
Fabian Caba Heilbron
Juan Carlos León Alcázar
Ali K. Thabet
Guohao Li
VGen
58
20
0
09 Aug 2021
Audio Retrieval with Natural Language Queries
Andreea-Maria Oncescu
A. Sophia Koepke
João F. Henriques
Zeynep Akata
Samuel Albanie
21
77
0
05 May 2021
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen
Andrew Rouditchenko
Kevin Duarte
Hilde Kuehne
Samuel Thomas
...
Rogerio Feris
David Harwath
James R. Glass
M. Picheny
Shih-Fu Chang
SSL
36
89
0
26 Apr 2021
Towards a Collective Agenda on AI for Earth Science Data Analysis
D. Tuia
R. Roscher
Jan Dirk Wegner
Nathan Jacobs
Xiaoxiang Zhu
Gustau Camps-Valls
AI4CE
44
68
0
11 Apr 2021
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
31
169
0
01 Nov 2020
New Ideas and Trends in Deep Multimodal Content Understanding: A Review
Wei Chen
Weiping Wang
Li Liu
M. Lew
VLM
120
31
0
16 Oct 2020
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
Ying Cheng
Ruize Wang
Zhihao Pan
Rui Feng
Yuejie Zhang
SSL
36
106
0
13 Aug 2020
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
40
372
0
29 Jun 2020
Disentangled Speech Embeddings using Cross-modal Self-supervision
Arsha Nagrani
Joon Son Chung
Samuel Albanie
Andrew Zisserman
SSL
21
88
0
20 Feb 2020
Deep Audio-Visual Learning: A Survey
Hao Zhu
Mandi Luo
Rui Wang
A. Zheng
Ran He
31
156
0
14 Jan 2020
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
Arda Senocak
Tae-Hyun Oh
Junsik Kim
Ming-Hsuan Yang
In So Kweon
SSL
33
52
0
20 Nov 2019
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Evangelos Kazakos
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
16
332
0
22 Aug 2019
Audio-Visual Model Distillation Using Acoustic Images
Andrés F. Pérez
Valentina Sanguineti
Pietro Morerio
Vittorio Murino
VLM
15
27
0
16 Apr 2019
2.5D Visual Sound
Ruohan Gao
Kristen Grauman
VGen
27
130
0
11 Dec 2018
Uncertainty aware audiovisual activity recognition using deep Bayesian variational inference
Mahesh Subedar
R. Krishnan
P. López-Meyer
Omesh Tickoo
Jonathan Huang
BDL
EDL
UQCV
29
0
0
27 Nov 2018
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
Samuel Albanie
Arsha Nagrani
Andrea Vedaldi
Andrew Zisserman
CVBM
30
270
0
16 Aug 2018
Playing hard exploration games by watching YouTube
Y. Aytar
Tobias Pfaff
David Budden
T. Paine
Ziyun Wang
Nando de Freitas
35
269
0
29 May 2018
Unifying and Merging Well-trained Deep Neural Networks for Inference Stage
Yi-Min Chou
Yi-Ming Chan
Jia-Hong Lee
Chih-Yi Chiu
Chu-Song Chen
MoMe
35
34
0
14 May 2018
Weakly-supervised Visual Instrument-playing Action Detection in Videos
Jen-Yu Liu
Yi-Hsuan Yang
Shyh-Kang Jeng
21
13
0
05 May 2018
Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Sanjeel Parekh
S. Essid
A. Ozerov
Ngoc Q. K. Duong
P. Pérez
G. Richard
SSL
16
19
0
19 Apr 2018
Zero-Shot Object Detection
Ankan Bansal
Karan Sikka
Gaurav Sharma
Rama Chellappa
Ajay Divakaran
VLM
ObjD
46
359
0
12 Apr 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
Antoine Miech
Ivan Laptev
Josef Sivic
22
233
0
07 Apr 2018
Cross-modal Embeddings for Video and Audio Retrieval
Dídac Surís
A. Duarte
Amaia Salvador
Jordi Torres
Xavier Giró-i-Nieto
SSL
21
69
0
07 Jan 2018
Objects that Sound
Relja Arandjelović
Andrew Zisserman
ObjD
VOS
44
528
0
18 Dec 2017
Convolutional Neural Networks for Sentence Classification
Yoon Kim
AILaw
VLM
312
13,377
0
25 Aug 2014
A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics
Yunchao Gong
Qifa Ke
Michael Isard
Svetlana Lazebnik
3DV
78
584
0
18 Dec 2012