See, Hear, and Read: Deep Aligned Representations

3 June 2017
Y. Aytar
Carl Vondrick
Antonio Torralba
    VLM
    AI4TS

Papers citing "See, Hear, and Read: Deep Aligned Representations"

36 / 36 papers shown
Anchors Aweigh! Sail for Optimal Unified Multi-Modal Representations
Minoh Jeong
Min Namgung
Zae Myung Kim
Dongyeop Kang
Yao-Yi Chiang
Alfred Hero
35
0
0
02 Oct 2024
InteRACT: Transformer Models for Human Intent Prediction Conditioned on Robot Actions
Kushal Kedia
Atiksh Bhardwaj
Prithwish Dan
Sanjiban Choudhury
34
9
0
21 Nov 2023
Effective Multimodal Reinforcement Learning with Modality Alignment and Importance Enhancement
Jinming Ma
Feng Wu
Yingfeng Chen
Xianpeng Ji
Yu-qiong Ding
OffRL
33
4
0
18 Feb 2023
Robust Sound-Guided Image Manipulation
Seung Hyun Lee
Gyeongrok Oh
Wonmin Byeon
Sang Ho Yoon
Jinkyu Kim
Sangpil Kim
DiffM
26
7
0
30 Aug 2022
Is an Object-Centric Video Representation Beneficial for Transfer?
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
ViT
37
27
0
20 Jul 2022
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa
Yogesh S Rawat
M. Shah
SSL
41
132
0
18 Jun 2022
CyCLIP: Cyclic Contrastive Language-Image Pretraining
Shashank Goel
Hritik Bansal
S. Bhatia
Ryan A. Rossi
Vishwa Vinay
Aditya Grover
CLIP
VLM
186
134
0
28 May 2022
Cross Modal Retrieval with Querybank Normalisation
Simion-Vlad Bogolin
Ioana Croitoru
Hailin Jin
Yang Liu
Samuel Albanie
29
84
0
23 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David Harwath
James R. Glass
Hilde Kuehne
ViT
34
129
0
08 Dec 2021
Sound-Guided Semantic Image Manipulation
Seung Hyun Lee
Wonseok Roh
Wonmin Byeon
Sang Ho Yoon
Chanyoung Kim
Jinkyu Kim
Sangpil Kim
DiffM
37
43
0
30 Nov 2021
Learning to Cut by Watching Movies
Alejandro Pardo
Fabian Caba Heilbron
Juan Carlos León Alcázar
Ali K. Thabet
Guohao Li
VGen
58
20
0
09 Aug 2021
Audio Retrieval with Natural Language Queries
Andreea-Maria Oncescu
A. Sophia Koepke
João F. Henriques
Zeynep Akata
Samuel Albanie
21
77
0
05 May 2021
Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos
Brian Chen
Andrew Rouditchenko
Kevin Duarte
Hilde Kuehne
Samuel Thomas
...
Rogerio Feris
David Harwath
James R. Glass
M. Picheny
Shih-Fu Chang
SSL
36
89
0
26 Apr 2021
Towards a Collective Agenda on AI for Earth Science Data Analysis
D. Tuia
R. Roscher
Jan Dirk Wegner
Nathan Jacobs
Xiaoxiang Zhu
Gustau Camps-Valls
AI4CE
44
68
0
11 Apr 2021
COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning
Simon Ging
Mohammadreza Zolfaghari
Hamed Pirsiavash
Thomas Brox
ViT
CLIP
31
169
0
01 Nov 2020
New Ideas and Trends in Deep Multimodal Content Understanding: A Review
Wei Chen
Weiping Wang
Li Liu
M. Lew
VLM
120
31
0
16 Oct 2020
Look, Listen, and Attend: Co-Attention Network for Self-Supervised Audio-Visual Representation Learning
Ying Cheng
Ruize Wang
Zhihao Pan
Rui Feng
Yuejie Zhang
SSL
36
106
0
13 Aug 2020
Self-Supervised MultiModal Versatile Networks
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
40
372
0
29 Jun 2020
Disentangled Speech Embeddings using Cross-modal Self-supervision
Arsha Nagrani
Joon Son Chung
Samuel Albanie
Andrew Zisserman
SSL
21
88
0
20 Feb 2020
Deep Audio-Visual Learning: A Survey
Hao Zhu
Mandi Luo
Rui Wang
A. Zheng
Ran He
31
156
0
14 Jan 2020
Learning to Localize Sound Sources in Visual Scenes: Analysis and Applications
Arda Senocak
Tae-Hyun Oh
Junsik Kim
Ming-Hsuan Yang
In So Kweon
SSL
33
52
0
20 Nov 2019
EPIC-Fusion: Audio-Visual Temporal Binding for Egocentric Action Recognition
Evangelos Kazakos
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
16
332
0
22 Aug 2019
Audio-Visual Model Distillation Using Acoustic Images
Andrés F. Pérez
Valentina Sanguineti
Pietro Morerio
Vittorio Murino
VLM
15
27
0
16 Apr 2019
2.5D Visual Sound
Ruohan Gao
Kristen Grauman
VGen
27
130
0
11 Dec 2018
Uncertainty aware audiovisual activity recognition using deep Bayesian variational inference
Mahesh Subedar
R. Krishnan
P. López-Meyer
Omesh Tickoo
Jonathan Huang
BDL
EDL
UQCV
29
0
0
27 Nov 2018
Emotion Recognition in Speech using Cross-Modal Transfer in the Wild
Samuel Albanie
Arsha Nagrani
Andrea Vedaldi
Andrew Zisserman
CVBM
30
270
0
16 Aug 2018
Playing hard exploration games by watching YouTube
Y. Aytar
Tobias Pfaff
David Budden
T. Paine
Ziyun Wang
Nando de Freitas
35
269
0
29 May 2018
Unifying and Merging Well-trained Deep Neural Networks for Inference Stage
Yi-Min Chou
Yi-Ming Chan
Jia-Hong Lee
Chih-Yi Chiu
Chu-Song Chen
MoMe
35
34
0
14 May 2018
Weakly-supervised Visual Instrument-playing Action Detection in Videos
Jen-Yu Liu
Yi-Hsuan Yang
Shyh-Kang Jeng
21
13
0
05 May 2018
Weakly Supervised Representation Learning for Unsynchronized Audio-Visual Events
Sanjeel Parekh
S. Essid
A. Ozerov
Ngoc Q. K. Duong
P. Pérez
G. Richard
SSL
16
19
0
19 Apr 2018
Zero-Shot Object Detection
Ankan Bansal
Karan Sikka
Gaurav Sharma
Rama Chellappa
Ajay Divakaran
VLM
ObjD
46
359
0
12 Apr 2018
Learning a Text-Video Embedding from Incomplete and Heterogeneous Data
Antoine Miech
Ivan Laptev
Josef Sivic
22
233
0
07 Apr 2018
Cross-modal Embeddings for Video and Audio Retrieval
Dídac Surís
A. Duarte
Amaia Salvador
Jordi Torres
Xavier Giró-i-Nieto
SSL
21
69
0
07 Jan 2018
Objects that Sound
Relja Arandjelović
Andrew Zisserman
ObjD
VOS
44
528
0
18 Dec 2017
Convolutional Neural Networks for Sentence Classification
Yoon Kim
AILaw
VLM
312
13,377
0
25 Aug 2014
A Multi-View Embedding Space for Modeling Internet Images, Tags, and their Semantics
Yunchao Gong
Qifa Ke
Michael Isard
Svetlana Lazebnik
3DV
78
584
0
18 Dec 2012