arXiv:2012.04124
Parameter Efficient Multimodal Transformers for Video Representation Learning
8 December 2020
Sangho Lee
Youngjae Yu
Gunhee Kim
Thomas Breuel
Jan Kautz
Yale Song
ViT
Papers citing
"Parameter Efficient Multimodal Transformers for Video Representation Learning"
50 / 53 papers shown
LCMF: Lightweight Cross-Modality Mambaformer for Embodied Robotics VQA
Zeyi Kang
Liang He
Yanxin Zhang
Zuheng Ming
Kaixing Zhao
234
0
0
23 Sep 2025
Social-MAE: A Transformer-Based Multimodal Autoencoder for Face and Voice
IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2024
Hugo Bohy
M. Tran
Kevin El Haddad
Thierry Dutoit
M. Soleymani
176
2
0
24 Aug 2025
Learning Long-Range Action Representation by Two-Stream Mamba Pyramid Network for Figure Skating Assessment
Fengshun Wang
Qiurui Wang
Peilin Zhao
153
1
0
22 Aug 2025
Reading to Listen at the Cocktail Party: Multi-Modal Speech Separation
Computer Vision and Pattern Recognition (CVPR), 2022
Akam Rahimi
Triantafyllos Afouras
Andrew Zisserman
412
34
0
02 Jan 2025
Human Action Recognition (HAR) Using Skeleton-based Spatial Temporal Relative Transformer Network: ST-RTR
Faisal Mehmood
Enqing Chen
Touqeer Abbas
Samah M. Alzanin
319
1
0
31 Oct 2024
SAVE: Segment Audio-Visual Easy way using Segment Anything Model
Khanh-Binh Nguyen
Chae Jung Park
VLM
VOS
432
4
0
02 Jul 2024
From CNNs to Transformers in Multimodal Human Action Recognition: A Survey
Muhammad Bilal Shaikh
Syed Mohammed Shamsul Islam
Douglas Chai
Naveed Akhtar
418
36
0
22 May 2024
MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition
Peihao Xiang
Chaohao Lin
Kaida Wu
Ou Bai
250
8
0
28 Apr 2024
Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction
Computer Vision and Pattern Recognition (CVPR), 2024
Jianping Jiang
Xinyu Zhou
Bingxuan Wang
Xiaoming Deng
Chao Xu
Boxin Shi
358
15
0
12 Mar 2024
HiCMAE: Hierarchical Contrastive Masked Autoencoder for Self-Supervised Audio-Visual Emotion Recognition
Information Fusion (Inf. Fusion), 2024
Guoying Zhao
Zheng Lian
Yinan Han
Jianhua Tao
331
77
0
11 Jan 2024
PELA: Learning Parameter-Efficient Models with Low-Rank Approximation
Computer Vision and Pattern Recognition (CVPR), 2023
Yangyang Guo
Guangzhi Wang
Mohan S. Kankanhalli
258
12
0
16 Oct 2023
STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment
International Conference on Machine Learning (ICML), 2023
Jaewoo Lee
Jaehong Yoon
Wonjae Kim
Yunji Kim
Sung Ju Hwang
CLL
360
2
0
12 Oct 2023
Speed Co-Augmentation for Unsupervised Audio-Visual Pre-training
Jiangliu Wang
Jianbo Jiao
Yibing Song
Stephen James
Zhan Tong
Chongjian Ge
Pieter Abbeel
Yunhui Liu
160
0
0
25 Sep 2023
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2023
Yuan Tseng
Layne Berry
Yi-Ting Chen
I-Hsiang Chiu
Hsuan-Hao Lin
...
Yu Tsao
Shinji Watanabe
Abdel-rahman Mohamed
Chi-Luen Feng
Hung-yi Lee
VLM
SSL
364
23
0
19 Sep 2023
Compressing Vision Transformers for Low-Resource Visual Learning
Eric Youn
J. SaiMitheran
Sanjana Prabhu
Siyuan Chen
ViT
218
5
0
05 Sep 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
IEEE International Conference on Computer Vision (ICCV), 2023
Shraman Pramanick
Yale Song
Sayan Nag
Kevin Qinghong Lin
Hardik Shah
Mike Zheng Shou
Ramalingam Chellappa
Pengchuan Zhang
VLM
415
144
0
11 Jul 2023
Factorized Contrastive Learning: Going Beyond Multi-view Redundancy
Neural Information Processing Systems (NeurIPS), 2023
Paul Pu Liang
Zihao Deng
Martin Q. Ma
James Zou
Louis-Philippe Morency
Ruslan Salakhutdinov
SSL
342
99
0
08 Jun 2023
Object Detection with Transformers: A Review
Italian National Conference on Sensors (INS), 2023
Tahira Shehzadi
K. Hashmi
D. Stricker
Muhammad Zeshan Afzal
ViT
MU
467
70
0
07 Jun 2023
Annotation-free Audio-Visual Segmentation
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023
Jinxian Liu
Yu Wang
Chen Ju
Chaofan Ma
Ya Zhang
Weidi Xie
VOS
VLM
455
52
0
18 May 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Muhammad Usama
Junaid Qadir
501
76
0
21 Mar 2023
Deep Visual Forced Alignment: Learning to Align Transcription with Talking Face Video
AAAI Conference on Artificial Intelligence (AAAI), 2023
Minsu Kim
Chae Won Kim
Y. Ro
CVBM
DiffM
200
4
0
27 Feb 2023
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Computer Vision and Pattern Recognition (CVPR), 2022
Yan-Bo Lin
Yi-Lin Sung
Jie Lei
Joey Tianyi Zhou
Gedas Bertasius
420
116
0
15 Dec 2022
Multimodal Transformer for Parallel Concatenated Variational Autoencoders
Stephen D. Liang
J. Mendel
ViT
295
6
0
28 Oct 2022
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
European Conference on Computer Vision (ECCV), 2022
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIP
VLM
281
57
0
26 Jul 2022
Multimodal Learning with Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Peng Xu
Xiatian Zhu
David Clifton
ViT
648
934
0
13 Jun 2022
VTP: Volumetric Transformer for Multi-view Multi-person 3D Pose Estimation
Yuxing Chen
Renshu Gu
Ouhan Huang
Gangyong Jia
3DH
256
13
0
25 May 2022
Are Multimodal Transformers Robust to Missing Modality?
Computer Vision and Pattern Recognition (CVPR), 2022
Mengmeng Ma
Jian Ren
Long Zhao
Davide Testuggine
Xi Peng
ViT
345
233
0
12 Apr 2022
Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness
Computer Vision and Pattern Recognition (CVPR), 2022
Giulio Lovisotto
Nicole Finnie
Mauricio Muñoz
Chaithanya Kumar Mummadi
J. H. Metzen
AAML
ViT
197
51
0
25 Mar 2022
Skating-Mixer: Long-Term Sport Audio-Visual Modeling with MLPs
AAAI Conference on Artificial Intelligence (AAAI), 2022
Jingfei Xia
Mingchen Zhuge
Tiantian Geng
Shun Fan
Yuantai Wei
Zhenyu He
Feng Zheng
470
34
0
08 Mar 2022
Learnable Irrelevant Modality Dropout for Multimodal Action Recognition on Modality-Specific Annotated Videos
Computer Vision and Pattern Recognition (CVPR), 2022
Saghir Alfasly
Jian Lu
C. Xu
Yuru Zou
383
29
0
06 Mar 2022
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Paul Pu Liang
Yiwei Lyu
Xiang Fan
Jeffrey Tsaw
Yudong Liu
Shentong Mo
Dani Yogatama
Louis-Philippe Morency
Ruslan Salakhutdinov
291
47
0
02 Mar 2022
Learning Contextually Fused Audio-visual Representations for Audio-visual Speech Recognition
IEEE International Conference on Image Processing (ICIP), 2022
Zitian Zhang
Jie Zhang
Jian-Shu Zhang
Ming Wu
Xin Fang
Lirong Dai
SSL
314
12
0
15 Feb 2022
ACORT: A Compact Object Relation Transformer for Parameter Efficient Image Captioning
Neurocomputing (Neurocomputing), 2022
J. Tan
Y. Tan
C. Chan
Joon Huang Chuah
VLM
ViT
258
23
0
11 Feb 2022
A Pre-trained Audio-Visual Transformer for Emotion Recognition
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2022
Minh Tran
M. Soleymani
194
38
0
23 Jan 2022
Video Transformers: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
550
152
0
16 Jan 2022
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
International Conference on Learning Representations (ICLR), 2022
Bowen Shi
Wei-Ning Hsu
Kushal Lakhotia
Abdel-rahman Mohamed
SSL
438
441
0
05 Jan 2022
Audio-Visual Synchronisation in the wild
Honglie Chen
Weidi Xie
Triantafyllos Afouras
Arsha Nagrani
Andrea Vedaldi
Andrew Zisserman
257
51
0
08 Dec 2021
SWAT: Spatial Structure Within and Among Tokens
International Joint Conference on Artificial Intelligence (IJCAI), 2021
Kumara Kahatapitiya
Michael S. Ryoo
296
9
0
26 Nov 2021
PolyViT: Co-training Vision Transformers on Images, Videos and Audio
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
ViT
217
85
0
25 Nov 2021
With a Little Help from my Temporal Context: Multimodal Egocentric Action Recognition
British Machine Vision Conference (BMVC), 2021
Evangelos Kazakos
Jaesung Huh
Arsha Nagrani
Andrew Zisserman
Dima Damen
EgoV
328
56
0
01 Nov 2021
TriBERT: Full-body Human-centric Audio-visual Representation Learning for Visual Sound Separation
Tanzila Rahman
Mengyu Yang
Leonid Sigal
ViT
159
8
0
26 Oct 2021
Pairwise Emotional Relationship Recognition in Drama Videos: Dataset and Benchmark
ACM Multimedia (ACM MM), 2021
Xun Gao
Yin Zhao
Jie Zhang
Longjun Cai
169
9
0
23 Sep 2021
Survey: Transformer based Video-Language Pre-training
Ludan Ruan
Qin Jin
VLM
ViT
293
51
0
21 Sep 2021
Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions
D. Curto
Albert Clapés
Javier Selva
Sorina Smeureanu
Julio C. S. Jacques Junior
...
G. Guilera
D. Leiva
T. Moeslund
Sergio Escalera
Cristina Palmero
180
40
0
20 Sep 2021
Multilingual Molecular Representation Learning via Contrastive Pre-training
Zhihui Guo
P. Sharma
Andy Martinez
Liang Du
Robin Abraham
291
37
0
18 Sep 2021
MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition
Jiawei Chen
C. Ho
ViT
306
112
0
20 Aug 2021
Attention Bottlenecks for Multimodal Fusion
Neural Information Processing Systems (NeurIPS), 2021
Arsha Nagrani
Shan Yang
Anurag Arnab
A. Jansen
Cordelia Schmid
Chen Sun
665
742
0
30 Jun 2021
Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers
Neural Information Processing Systems (NeurIPS), 2021
Mandela Patrick
Dylan Campbell
Yuki M. Asano
Ishan Misra
Florian Metze
Christoph Feichtenhofer
Andrea Vedaldi
João F. Henriques
367
347
0
09 Jun 2021
Attention mechanisms and deep learning for machine vision: A survey of the state of the art
A. M. Hafiz
S. A. Parah
R. A. Bhat
258
59
0
03 Jun 2021
Space-Time Crop & Attend: Improving Cross-modal Video Representation Learning
IEEE International Conference on Computer Vision (ICCV), 2021
Mandela Patrick
Yuki M. Asano
Bernie Huang
Ishan Misra
Florian Metze
Joao Henriques
Andrea Vedaldi
AI4TS
320
36
0
18 Mar 2021