ResearchTrend.AI
arXiv: 2006.16228
Self-Supervised MultiModal Versatile Networks


29 June 2020
Jean-Baptiste Alayrac
Adrià Recasens
R. Schneider
Relja Arandjelović
Jason Ramapuram
J. Fauw
Lucas Smaira
Sander Dieleman
Andrew Zisserman
    SSL

Papers citing "Self-Supervised MultiModal Versatile Networks"

50 / 266 papers shown
Audio-Visual Contrastive Learning with Temporal Self-Supervision
Simon Jenni
Alexander Black
John Collomosse
SSL
10
11
0
15 Feb 2023
Zorro: the masked multimodal transformer
Adrià Recasens
Jason Lin
João Carreira
Drew Jaegle
Luyu Wang
...
Pauline Luc
Antoine Miech
Lucas Smaira
Ross Hemsley
Andrew Zisserman
24
20
0
23 Jan 2023
A Survey on Self-supervised Learning: Algorithms, Applications, and Future Trends
Jie Gui
Tuo Chen
Jing Zhang
Qiong Cao
Zhe Sun
Haoran Luo
Dacheng Tao
6
117
0
13 Jan 2023
What You Say Is What You Show: Visual Narration Detection in Instructional Videos
Kumar Ashutosh
Rohit Girdhar
Lorenzo Torresani
Kristen Grauman
6
4
0
05 Jan 2023
Test of Time: Instilling Video-Language Models with a Sense of Time
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
62
36
0
05 Jan 2023
Audiovisual Masked Autoencoders
Mariana-Iuliana Georgescu
Eduardo Fonseca
Radu Tudor Ionescu
Mario Lucic
Cordelia Schmid
Anurag Arnab
SSL
14
43
0
09 Dec 2022
Tencent AVS: A Holistic Ads Video Dataset for Multi-modal Scene Segmentation
Jie Jiang
Zhimin Li
Jiangfeng Xiong
Rongwei Quan
Qinglin Lu
Wei Liu
6
2
0
09 Dec 2022
Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors
Zhentao Yu
Zixin Yin
Deyu Zhou
Duomin Wang
Finn Wong
Baoyuan Wang
DiffM
17
35
0
07 Dec 2022
FakeOut: Leveraging Out-of-domain Self-supervision for Multi-modal Video Deepfake Detection
Gil Knafo
Ohad Fried
11
1
0
01 Dec 2022
XKD: Cross-modal Knowledge Distillation with Domain Alignment for Video Representation Learning
Pritam Sarkar
Ali Etemad
4
20
0
25 Nov 2022
Towards Good Practices for Missing Modality Robust Action Recognition
Sangmin Woo
Sumin Lee
Yeonju Park
Muhammad Adi Nugroho
Changick Kim
19
42
0
25 Nov 2022
Multi-Task Learning of Object State Changes from Uncurated Videos
Tomáš Souček
Jean-Baptiste Alayrac
Antoine Miech
Ivan Laptev
Josef Sivic
21
11
0
24 Nov 2022
LISA: Localized Image Stylization with Audio via Implicit Neural Representation
Seung Hyun Lee
Chanyoung Kim
Wonmin Byeon
Sang Ho Yoon
Jinkyu Kim
Sangpil Kim
22
3
0
21 Nov 2022
Video Background Music Generation: Dataset, Method and Evaluation
Le Zhuo
Zhaokai Wang
Baisen Wang
Yue Liao
Chenxi Bao
Stanley Peng
Miao Lu
Xiaobo Li
Fei Fang
Si Liu
VGen
6
27
0
21 Nov 2022
Versatile Diffusion: Text, Images and Variations All in One Diffusion Model
Xingqian Xu
Zhangyang Wang
Eric Zhang
Kai Wang
Humphrey Shi
DiffM
14
124
0
15 Nov 2022
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Junru Wu
Yi Liang
Feng Han
Hassan Akbari
Zhangyang Wang
Cong Yu
18
5
0
03 Nov 2022
On the Role of Visual Context in Enriching Music Representations
Kleanthis Avramidis
Shanti Stewart
Shrikanth Narayanan
10
3
0
28 Oct 2022
ULN: Towards Underspecified Vision-and-Language Navigation
Weixi Feng
Tsu-jui Fu
Yujie Lu
William Yang Wang
18
4
0
18 Oct 2022
A Human-ML Collaboration Framework for Improving Video Content Reviews
Meghana Deodhar
Xiao Ma
Yixin Cai
Alex Koes
Alex Beutel
Jilin Chen
15
3
0
18 Oct 2022
Spatiotemporal Classification with limited labels using Constrained Clustering for large datasets
Praveen Ravirathinam
Rahul Ghosh
Ke Wang
Keyang Xuan
A. Khandelwal
H. Dugan
Paul C. Hanson
Vipin Kumar
15
1
0
14 Oct 2022
ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval
A. Fragomeni
Michael Wray
Dima Damen
CLIP
ViT
17
1
0
09 Oct 2022
Self-supervised Video Representation Learning with Motion-Aware Masked Autoencoders
Haosen Yang
Deng Huang
Bin Wen
Jiannan Wu
H. Yao
Yi-Xin Jiang
Xiatian Zhu
Zehuan Yuan
16
19
0
09 Oct 2022
Mind Reader: Reconstructing complex images from brain activities
Sikun Lin
Thomas C. Sprague
Ambuj K. Singh
DiffM
105
86
0
30 Sep 2022
Learning State-Aware Visual Representations from Audible Interactions
Himangi Mittal
Pedro Morgado
Unnat Jain
Abhinav Gupta
55
20
0
27 Sep 2022
OmniVL:One Foundation Model for Image-Language and Video-Language Tasks
Junke Wang
Dongdong Chen
Zuxuan Wu
Chong Luo
Luowei Zhou
Yucheng Zhao
Yujia Xie
Ce Liu
Yu-Gang Jiang
Lu Yuan
MLLM
VLM
22
148
0
15 Sep 2022
Distribution Aware Metrics for Conditional Natural Language Generation
David M. Chan
Yiming Ni
David A. Ross
Sudheendra Vijayanarasimhan
Austin Myers
John F. Canny
29
4
0
15 Sep 2022
Self-supervised multimodal neuroimaging yields predictive representations for a spectrum of Alzheimer's phenotypes
A. Fedorov
Eloy P. T. Geenjaar
Lei Wu
Tristan Sylvain
T. DeRamus
Margaux Luck
Maria B. Misiura
R. Devon Hjelm
Sergey Plis
Vince D. Calhoun
14
2
0
07 Sep 2022
Robust Sound-Guided Image Manipulation
Seung Hyun Lee
Gyeongrok Oh
Wonmin Byeon
Sang Ho Yoon
Jinkyu Kim
Sangpil Kim
DiffM
13
7
0
30 Aug 2022
Contrastive Audio-Language Learning for Music
Ilaria Manco
Emmanouil Benetos
Elio Quinton
Gyorgy Fazekas
12
44
0
25 Aug 2022
Bidirectional Contrastive Split Learning for Visual Question Answering
Yuwei Sun
H. Ochiai
11
2
0
24 Aug 2022
Modality Mixer for Multi-modal Action Recognition
Sumin Lee
Sangmin Woo
Yeonju Park
Muhammad Adi Nugroho
Changick Kim
9
10
0
24 Aug 2022
Semi-Supervised and Unsupervised Deep Visual Learning: A Survey
Yanbei Chen
Massimiliano Mancini
Xiatian Zhu
Zeynep Akata
22
112
0
24 Aug 2022
CrossA11y: Identifying Video Accessibility Issues via Cross-modal Grounding
Xingyu Bruce Liu
Ruolin Wang
Dingzeyu Li
Xiang 'Anthony' Chen
Amy Pavel
13
17
0
23 Aug 2022
ModSelect: Automatic Modality Selection for Synthetic-to-Real Domain Generalization
Zdravko Marinov
Alina Roitberg
David Schneider
Rainer Stiefelhagen
9
4
0
19 Aug 2022
How does the degree of novelty impacts semi-supervised representation learning for novel class retrieval?
Q. Leroy
Olivier Buisson
Alexis Joly
SSL
14
0
0
17 Aug 2022
GPPF: A General Perception Pre-training Framework via Sparsely Activated Multi-Task Learning
Benyuan Sun
Jinqiao Dai
Zihao Liang
Cong Liu
Yi Yang
Bo Bai
MoE
16
4
0
03 Aug 2022
COCOA: Cross Modality Contrastive Learning for Sensor Data
Shohreh Deldari
Hao Xue
Aaqib Saeed
Daniel V. Smith
Flora D. Salim
SSL
26
38
0
31 Jul 2022
GOCA: Guided Online Cluster Assignment for Self-Supervised Video Representation Learning
Huseyin Coskun
Alireza Zareian
Joshua L. Moore
F. Tombari
Chen Wang
SSL
32
3
0
20 Jul 2022
Is an Object-Centric Video Representation Beneficial for Transfer?
Chuhan Zhang
Ankush Gupta
Andrew Zisserman
ViT
13
26
0
20 Jul 2022
LAVA: Language Audio Vision Alignment for Contrastive Video Pre-Training
Sumanth Gurram
An Fang
David M. Chan
John F. Canny
VLM
AI4TS
17
1
0
16 Jul 2022
SVGraph: Learning Semantic Graphs from Instructional Videos
Madeline Chantry Schiappa
Y. S. Rawat
6
4
0
16 Jul 2022
Visually-aware Acoustic Event Detection using Heterogeneous Graphs
A. Shirian
Krishna Somandepalli
Victor Sanchez
T. Guha
11
3
0
16 Jul 2022
Learning Music-Dance Representations through Explicit-Implicit Rhythm Synchronization
Jiashuo Yu
Junfu Pu
Ying Cheng
Rui Feng
Ying Shan
9
4
0
07 Jul 2022
A survey of multimodal deep generative models
Masahiro Suzuki
Y. Matsuo
SyDa
DRL
43
75
0
05 Jul 2022
Multimodal Frame-Scoring Transformer for Video Summarization
Jeiyoon Park
Kiho Kwoun
Chanhee Lee
Heuiseok Lim
ViT
17
5
0
05 Jul 2022
Semantic Role Aware Correlation Transformer for Text to Video Retrieval
Burak Satar
Hongyuan Zhu
Xavier Bresson
J. Lim
ViT
4
8
0
26 Jun 2022
RoME: Role-aware Mixture-of-Expert Transformer for Text-to-Video Retrieval
Burak Satar
Hongyuan Zhu
Hanwang Zhang
J. Lim
18
10
0
26 Jun 2022
Self-Supervised Learning for Videos: A Survey
Madeline Chantry Schiappa
Y. S. Rawat
M. Shah
SSL
11
130
0
18 Jun 2022
MineDojo: Building Open-Ended Embodied Agents with Internet-Scale Knowledge
Linxi Fan
Guanzhi Wang
Yunfan Jiang
Ajay Mandlekar
Yuncong Yang
Haoyi Zhu
Andrew Tang
De-An Huang
Yuke Zhu
Anima Anandkumar
LM&Ro
12
343
0
17 Jun 2022
It's Time for Artistic Correspondence in Music and Video
Dídac Surís
Carl Vondrick
Bryan C. Russell
Justin Salamon
9
37
0
14 Jun 2022