MAViL: Masked Audio-Video Learners

15 December 2022
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer

Papers citing "MAViL: Masked Audio-Video Learners"

22 / 22 papers shown
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
27
0
0
02 May 2025
DGFM: Full Body Dance Generation Driven by Music Foundation Models
Xinran Liu
Zhenhua Feng
Diptesh Kanojia
Wenwu Wang
DiffM
62
1
0
27 Feb 2025
The Sound of Water: Inferring Physical Properties from Pouring Liquids
Piyush Bagad
Makarand Tapaswi
Cees G. M. Snoek
Andrew Zisserman
37
0
0
18 Nov 2024
An Eye for an Ear: Zero-shot Audio Description Leveraging an Image Captioner using Audiovisual Distribution Alignment
Hugo Malard
Michel Olvera
Stéphane Lathuilière
S. Essid
VLM
25
0
0
08 Oct 2024
Masked Image Modeling: A Survey
Vlad Hondru
Florinel-Alin Croitoru
Shervin Minaee
Radu Tudor Ionescu
N. Sebe
59
6
0
13 Aug 2024
Sequential Contrastive Audio-Visual Learning
Ioannis Tsiamas
Santiago Pascual
Chunghsin Yeh
Joan Serra
26
2
0
08 Jul 2024
MultiMAE-DER: Multimodal Masked Autoencoder for Dynamic Emotion Recognition
Peihao Xiang
Chaohao Lin
Kaida Wu
Ou Bai
22
3
0
28 Apr 2024
OmniSat: Self-Supervised Modality Fusion for Earth Observation
Guillaume Astruc
Nicolas Gonthier
Clement Mallet
Loic Landrieu
23
23
0
12 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
30
5
0
28 Mar 2024
OmniVec: Learning robust representations with cross modal sharing
Siddharth Srivastava
Gaurav Sharma
SSL
16
64
0
07 Nov 2023
Semantic Proximity Alignment: Towards Human Perception-consistent Audio Tagging by Aligning with Label Text Description
Youbin Jeon
Yanzhen Ren
VLM
17
0
0
28 Sep 2023
AudioFormer: Audio Transformer learns audio feature representations from discrete acoustic codes
Zhaohui Li
Haitao Wang
Xinghua Jiang
24
1
0
14 Aug 2023
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Mohit Bansal
VLM
44
28
0
28 Sep 2022
Exploring Target Representations for Masked Autoencoders
Xingbin Liu
Jinghao Zhou
Tao Kong
Xianming Lin
Rongrong Ji
76
49
0
08 Sep 2022
Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation
Yixuan Wei
Han Hu
Zhenda Xie
Zheng-Wei Zhang
Yue Cao
Jianmin Bao
Dong Chen
B. Guo
CLIP
78
123
0
27 May 2022
Self-Supervised Speech Representation Learning: A Review
Abdel-rahman Mohamed
Hung-yi Lee
Lasse Borgholt
Jakob Drachmann Havtorn
Joakim Edin
...
Shang-Wen Li
Karen Livescu
Lars Maaløe
Tara N. Sainath
Shinji Watanabe
SSL
AI4TS
124
339
0
21 May 2022
HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection
Ke Chen
Xingjian Du
Bilei Zhu
Zejun Ma
Taylor Berg-Kirkpatrick
Shlomo Dubnov
ViT
114
262
0
02 Feb 2022
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
258
7,337
0
11 Nov 2021
Emerging Properties in Self-Supervised Vision Transformers
Mathilde Caron
Hugo Touvron
Ishan Misra
Hervé Jégou
Julien Mairal
Piotr Bojanowski
Armand Joulin
283
5,723
0
29 Apr 2021
Multimodal Self-Supervised Learning of General Audio Representations
Luyu Wang
Pauline Luc
Adrià Recasens
Jean-Baptiste Alayrac
Aaron van den Oord
SSL
70
41
0
26 Apr 2021
Zero-Shot Text-to-Image Generation
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
253
4,735
0
24 Feb 2021
Audiovisual SlowFast Networks for Video Recognition
Fanyi Xiao
Yong Jae Lee
Kristen Grauman
Jitendra Malik
Christoph Feichtenhofer
192
204
0
23 Jan 2020