Siamese Vision Transformers are Scalable Audio-visual Learners


28 March 2024
Yan-Bo Lin
Gedas Bertasius

Papers citing "Siamese Vision Transformers are Scalable Audio-visual Learners"

12 / 12 papers shown
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
Edson Araujo
Andrew Rouditchenko
Yuan Gong
Saurabhchand Bhati
Samuel Thomas
Brian Kingsbury
Leonid Karlinsky
Rogerio Feris
James Glass
27
0
0
02 May 2025
Audio-visual Event Localization on Portrait Mode Short Videos
Wuyang Liu
Yi Chai
Yongpeng Yan
Yanzhen Ren
16
0
0
09 Apr 2025
Adaptive Perception for Unified Visual Multi-modal Object Tracking
Xiantao Hu
Bineng Zhong
Qihua Liang
Zhiyi Mo
Liangtao Shi
Ying Tai
Jian Yang
33
1
0
10 Feb 2025
Images that Sound: Composing Images and Sounds on a Single Canvas
Ziyang Chen
Daniel Geng
Andrew Owens
DiffM
38
8
0
20 May 2024
TVLT: Textless Vision-Language Transformer
Zineng Tang
Jaemin Cho
Yixin Nie
Mohit Bansal
VLM
35
28
0
28 Sep 2022
Exploring Target Representations for Masked Autoencoders
Xingbin Liu
Jinghao Zhou
Tao Kong
Xianming Lin
Rongrong Ji
67
49
0
08 Sep 2022
Omnivore: A Single Model for Many Visual Modalities
Rohit Girdhar
Mannat Singh
Nikhil Ravi
Laurens van der Maaten
Armand Joulin
Ishan Misra
209
222
0
20 Jan 2022
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
258
7,337
0
11 Nov 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
231
573
0
22 Apr 2021
Distilling Audio-Visual Knowledge by Compositional Contrastive Learning
Yanbei Chen
Yongqin Xian
A. Sophia Koepke
Ying Shan
Zeynep Akata
59
79
0
22 Apr 2021
PSLA: Improving Audio Tagging with Pretraining, Sampling, Labeling, and Aggregation
Yuan Gong
Yu-An Chung
James R. Glass
VLM
97
120
0
02 Feb 2021
ACAV100M: Automatic Curation of Large-Scale Datasets for Audio-Visual Video Representation Learning
Sangho Lee
Jiwan Chung
Youngjae Yu
Gunhee Kim
Thomas Breuel
Gal Chechik
Yale Song
69
45
0
26 Jan 2021