Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2006.16228
Cited By
Self-Supervised MultiModal Versatile Networks
29 June 2020
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
SSL
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Self-Supervised MultiModal Versatile Networks"
50 / 266 papers shown
Title
Intra-agent speech permits zero-shot task acquisition
Chen Yan
Federico Carnevale
Petko Georgiev
Adam Santoro
Aurelia Guy
Alistair Muldal
Chia-Chun Hung
Josh Abramson
Timothy Lillicrap
Greg Wayne
LM&Ro
26
9
0
07 Jun 2022
Beyond Just Vision: A Review on Self-Supervised Representation Learning on Multimodal and Temporal Data
Shohreh Deldari
Hao Xue
Aaqib Saeed
Jiayuan He
Daniel V. Smith
Flora D. Salim
AI4TS
17
37
0
06 Jun 2022
3D-Augmented Contrastive Knowledge Distillation for Image-based Object Pose Estimation
Zhidan Liu
Zhen Xing
Xiangdong Zhou
Yijiang Chen
G. Zhou
3DH
8
3
0
02 Jun 2022
Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners
Zhenhailong Wang
Manling Li
Ruochen Xu
Luowei Zhou
Jie Lei
...
Chenguang Zhu
Derek Hoiem
Shih-Fu Chang
Mohit Bansal
Heng Ji
MLLM
VLM
159
134
0
22 May 2022
Multimodal Conversational AI: A Survey of Datasets and Approaches
Anirudh S. Sundar
Larry Heck
30
29
0
13 May 2022
Scene Consistency Representation Learning for Video Scene Segmentation
Haoqian Wu
Keyu Chen
Yanan Luo
Ruizhi Qiao
Bo Ren
Haozhe Liu
Weicheng Xie
Linlin Shen
SSL
12
16
0
11 May 2022
TransRank: Self-supervised Video Representation Learning via Ranking-based Transformation Recognition
Haodong Duan
Nanxuan Zhao
Kai-xiang Chen
Dahua Lin
ViT
AI4TS
28
19
0
04 May 2022
i-Code: An Integrative and Composable Multimodal Learning Framework
Ziyi Yang
Yuwei Fang
Chenguang Zhu
Reid Pryzant
Dongdong Chen
...
Bin Xiao
Yuanxun Lu
Takuya Yoshioka
Michael Zeng
Xuedong Huang
29
45
0
03 May 2022
On Negative Sampling for Audio-Visual Contrastive Learning from Movies
Mahdi M. Kalayeh
Shervin Ardeshir
Lingyi Liu
Nagendra Kamath
Ashok Chandrashekar
SSL
14
3
0
29 Apr 2022
Flamingo: a Visual Language Model for Few-Shot Learning
Jean-Baptiste Alayrac
Jeff Donahue
Pauline Luc
Antoine Miech
Iain Barr
...
Mikolaj Binkowski
Ricardo Barreira
Oriol Vinyals
Andrew Zisserman
Karen Simonyan
MLLM
VLM
15
2,308
0
29 Apr 2022
Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast
Boqing Zhu
Kele Xu
Changjian Wang
Zheng Qin
Tao Sun
Huaimin Wang
Yuxing Peng
SSL
20
17
0
28 Apr 2022
MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval
Yuying Ge
Yixiao Ge
Xihui Liu
Alex Jinpeng Wang
Jianping Wu
Ying Shan
Xiaohu Qie
Ping Luo
VLM
6
43
0
26 Apr 2022
A Survey of Video-based Action Quality Assessment
Shunli Wang
Dingkang Yang
Peng Zhai
Qing Yu
Tao Suo
Zhan Sun
Ka Li
Lihua Zhang
13
16
0
20 Apr 2022
Frequency Selective Augmentation for Video Representation Learning
Jinhyung Kim
Taeoh Kim
Minho Shim
Dongyoon Han
Dongyoon Wee
Junmo Kim
AI4TS
17
3
0
08 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Mohit Bansal
Gedas Bertasius
18
39
0
06 Apr 2022
MultiMAE: Multi-modal Multi-task Masked Autoencoders
Roman Bachmann
David Mizrahi
Andrei Atanov
Amir Zamir
14
262
0
04 Apr 2022
X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval
S. Gorti
Noël Vouitsis
Junwei Ma
Keyvan Golestan
M. Volkovs
Animesh Garg
Guangwei Yu
12
148
0
28 Mar 2022
Single-Stream Multi-Level Alignment for Vision-Language Pretraining
Zaid Khan
B. Vijaykumar
Xiang Yu
S. Schulter
Manmohan Chandraker
Y. Fu
CLIP
VLM
14
16
0
27 Mar 2022
Versatile Multi-Modal Pre-Training for Human-Centric Perception
Fangzhou Hong
Liang Pan
Zhongang Cai
Ziwei Liu
VLM
17
24
0
25 Mar 2022
VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training
Zhan Tong
Yibing Song
Jue Wang
Limin Wang
ViT
20
768
0
23 Mar 2022
Look for the Change: Learning Object States and State-Modifying Actions from Untrimmed Web Videos
Tomávs Souvcek
Jean-Baptiste Alayrac
Antoine Miech
Ivan Laptev
Josef Sivic
8
32
0
22 Mar 2022
Drive&Segment: Unsupervised Semantic Segmentation of Urban Scenes via Cross-modal Distillation
Antonín Vobecký
David Hurych
Oriane Siméoni
Spyros Gidaris
Andrei Bursuc
Patrick Pérez
Josef Sivic
3DPC
17
20
0
21 Mar 2022
A Study on Robustness to Perturbations for Representations of Environmental Sound
Sangeeta Srivastava
Ho-Hsiang Wu
Joao Rulff
Magdalena Fuentes
M. Cartwright
Claudio Silva
Anish Arora
J. P. Bello
15
3
0
20 Mar 2022
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Saghir Alfasly
Jian Lu
C. Xu
Yuru Zou
16
18
0
06 Mar 2022
COMPASS: Contrastive Multimodal Pretraining for Autonomous Systems
Shuang Ma
Sai H. Vemprala
Wenshan Wang
Jayesh K. Gupta
Yale Song
Daniel J. McDuff
Ashish Kapoor
SSL
8
9
0
20 Feb 2022
Misinformation Detection in Social Media Video Posts
Kehan Wang
David M. Chan
Seth Z. Zhao
John F. Canny
A. Zakhor
12
7
0
15 Feb 2022
data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language
Alexei Baevski
Wei-Ning Hsu
Qiantong Xu
Arun Babu
Jiatao Gu
Michael Auli
SSL
VLM
ViT
19
823
0
07 Feb 2022
Keyword localisation in untranscribed speech using visually grounded speech models
Kayode Olaleye
Dan Oneaţă
Herman Kamper
11
7
0
02 Feb 2022
From data to functa: Your data point is a function and you can treat it like one
Emilien Dupont
Hyunjik Kim
S. M. Ali Eslami
Danilo Jimenez Rezende
Dan Rosenbaum
TDI
3DPC
157
136
0
28 Jan 2022
Learning To Recognize Procedural Activities with Distant Supervision
Xudong Lin
Fabio Petroni
Gedas Bertasius
Marcus Rohrbach
Shih-Fu Chang
Lorenzo Torresani
14
82
0
26 Jan 2022
Omnivore: A Single Model for Many Visual Modalities
Rohit Girdhar
Mannat Singh
Nikhil Ravi
L. V. D. van der Maaten
Armand Joulin
Ishan Misra
209
222
0
20 Jan 2022
TriCoLo: Trimodal Contrastive Loss for Text to Shape Retrieval
Yue Ruan
Han-Hung Lee
Yiming Zhang
Ke Zhang
Angel X. Chang
10
12
0
19 Jan 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
20
101
0
16 Jan 2022
Tailor Versatile Multi-modal Learning for Multi-label Emotion Recognition
Yi Zhang
Mingyuan Chen
Jundong Shen
Chongjun Wang
10
58
0
15 Jan 2022
Bridging Video-text Retrieval with Multiple Choice Questions
Yuying Ge
Yixiao Ge
Xihui Liu
Dian Li
Ying Shan
Xiaohu Qie
Ping Luo
BDL
8
108
0
13 Jan 2022
Multi-Query Video Retrieval
Zeyu Wang
Yu Wu
Karthik Narasimhan
Olga Russakovsky
20
17
0
10 Jan 2022
MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound
Rowan Zellers
Jiasen Lu
Ximing Lu
Youngjae Yu
Yanpeng Zhao
Mohammadreza Salehi
Aditya Kusupati
Jack Hessel
Ali Farhadi
Yejin Choi
9
175
0
07 Jan 2022
Progressive Video Summarization via Multimodal Self-supervised Learning
Haopeng Li
Qiuhong Ke
Mingming Gong
Tom Drummond
AI4TS
23
17
0
07 Jan 2022
Fine-grained Multi-Modal Self-Supervised Learning
Duo Wang
S. Karout
SSL
17
7
0
22 Dec 2021
Class-aware Sounding Objects Localization via Audiovisual Correspondence
Di Hu
Yake Wei
Rui Qian
Weiyao Lin
Ruihua Song
Ji-Rong Wen
13
41
0
22 Dec 2021
Connecting the Dots between Audio and Text without Parallel Data through Visual Knowledge Transfer
Yanpeng Zhao
Jack Hessel
Youngjae Yu
Ximing Lu
Rowan Zellers
Yejin Choi
9
27
0
16 Dec 2021
Multimodal neural networks better explain multivoxel patterns in the hippocampus
Bhavin Choksi
Milad Mozafari
Rufin VanRullen
Leila Reddy
11
12
0
11 Dec 2021
Contextualized Spatio-Temporal Contrastive Learning with Self-Supervision
Liangzhe Yuan
Rui Qian
Yin Cui
Boqing Gong
Florian Schroff
Ming-Hsuan Yang
Hartwig Adam
Ting Liu
AI4TS
14
15
0
09 Dec 2021
Exploring Temporal Granularity in Self-Supervised Video Representation Learning
Rui Qian
Yeqing Li
Liangzhe Yuan
Boqing Gong
Ting Liu
Matthew A. Brown
Serge J. Belongie
Ming-Hsuan Yang
Hartwig Adam
Yin Cui
AI4TS
33
6
0
08 Dec 2021
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David F. Harwath
James R. Glass
Hilde Kuehne
ViT
15
127
0
08 Dec 2021
Creating Multimodal Interactive Agents with Imitation and Self-Supervised Learning
DeepMind Interactive Agents Team Josh Abramson
Josh Abramson
Arun Ahuja
Arthur Brussee
Federico Carnevale
...
Tamara von Glehn
Greg Wayne
Nathaniel Wong
Chen Yan
Rui Zhu
LM&Ro
24
46
0
07 Dec 2021
Routing with Self-Attention for Multimodal Capsule Networks
Kevin Duarte
Brian Chen
Nina Shvetsova
Andrew Rouditchenko
Samuel Thomas
Alexander H. Liu
David F. Harwath
James R. Glass
Hilde Kuehne
M. Shah
SSL
21
5
0
01 Dec 2021
Sound-Guided Semantic Image Manipulation
Seung Hyun Lee
Wonseok Roh
Wonmin Byeon
Sang Ho Yoon
Chanyoung Kim
Jinkyu Kim
Sangpil Kim
DiffM
8
43
0
30 Nov 2021
ContIG: Self-supervised Multimodal Contrastive Learning for Medical Imaging with Genetics
Aiham Taleb
Matthias Kirchler
Remo Monti
C. Lippert
SSL
MedIm
20
54
0
26 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
23
73
0
25 Nov 2021
Previous
1
2
3
4
5
6
Next