2104.11178
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
22 April 2021
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
Papers citing
"VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text"
8 / 8 papers shown
Uncertainty-Weighted Image-Event Multimodal Fusion for Video Anomaly Detection
SungHeon Jeong
Jihong Park
Mohsen Imani
24
0
0
05 May 2025
Learning Streaming Video Representation via Multitask Training
Yibin Yan
Jilan Xu
Shangzhe Di
Yikun Liu
Yudi Shi
Qirui Chen
Zeqian Li
Yifei Huang
Weidi Xie
CLL
64
123
0
28 Apr 2025
Lexicon3D: Probing Visual Foundation Models for Complex 3D Scene Understanding
Yunze Man
Shuhong Zheng
Zhipeng Bao
M. Hebert
Liang-Yan Gui
Yu-xiong Wang
28
2
0
05 Sep 2024
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling
Zeyue Tian
Zhaoyang Liu
Ruibin Yuan
Jiahao Pan
Xiaoqiang Huang
Xu Tan
Qifeng Chen
Y. Guo
VGen
72
2
0
06 Jun 2024
Transformadores: Fundamentos teoricos y Aplicaciones
J. D. L. Torre
26
0
0
18 Feb 2023
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
267
1,486
0
09 Feb 2021
Graph-Based Global Reasoning Networks
Yunpeng Chen
Marcus Rohrbach
Zhicheng Yan
Shuicheng Yan
Jiashi Feng
Yannis Kalantidis
GNN
NAI
239
432
0
30 Nov 2018
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
218
29,632
0
16 Jan 2013