Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2012.04124
Cited By
Parameter Efficient Multimodal Transformers for Video Representation Learning
8 December 2020
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Parameter Efficient Multimodal Transformers for Video Representation Learning"
18 / 18 papers shown
Title
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Akam Rahimi
Triantafyllos Afouras
Andrew Zisserman
40
28
0
02 Jan 2025
MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition
Peihao Xiang
Chaohao Lin
Kaida Wu
Ou Bai
32
3
0
28 Apr 2024
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIP
VLM
19
48
0
26 Jul 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
41
525
0
13 Jun 2022
VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation
Yuxing Chen
Renshu Gu
Ouhan Huang
Gangyong Jia
3DH
30
11
0
25 May 2022
Are Multimodal Transformers Robust to Missing Modality?
Mengmeng Ma
Jian Ren
Long Zhao
Davide Testuggine
Xi Peng
ViT
26
146
0
12 Apr 2022
Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness
Giulio Lovisotto
Nicole Finnie
Mauricio Muñoz
Chaithanya Kumar Mummadi
J. H. Metzen
AAML
ViT
17
32
0
25 Mar 2022
Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs
Jingfei Xia
Mingchen Zhuge
Tiantian Geng
Shun Fan
Yuantai Wei
Zhenyu He
Feng Zheng
21
14
0
08 Mar 2022
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Saghir Alfasly
Jian Lu
C. Xu
Yuru Zou
34
18
0
06 Mar 2022
Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition
Zitian Zhang
Jie M. Zhang
Jian-Shu Zhang
Ming Wu
Xin Fang
Lirong Dai
SSL
22
10
0
15 Feb 2022
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
20
102
0
16 Jan 2022
SWAT: Spatial Structure Within and Among Tokens
Kumara Kahatapitiya
Michael S. Ryoo
23
6
0
26 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
33
73
0
25 Nov 2021
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
Tanzila Rahman
Mengyu Yang
Leonid Sigal
ViT
21
8
0
26 Oct 2021
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM
ViT
64
44
0
21 Sep 2021
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
Jiawei Chen
C. Ho
ViT
24
76
0
20 Aug 2021
Attention Bottlenecks for Multimodal Fusion
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
25
539
0
30 Jun 2021
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
250
927
0
24 Sep 2019
1