Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2209.14156
Cited By
TVLT: Textless Vision-Language Transformer
28 September 2022
Zineng Tang
Jaemin Cho
Yixin Nie
Mohit Bansal
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"TVLT: Textless Vision-Language Transformer"
23 / 23 papers shown
Title
Multimodal Sentiment Analysis based on Video and Audio Inputs
Antonio Fernandez
Suzan Awinat
56
0
0
12 Dec 2024
A Theoretical Survey on Foundation Models
Shi Fu
Yuzhu Chen
Yingjie Wang
Dacheng Tao
18
0
0
15 Oct 2024
Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo
Haofan Wang
Huaxia Li
Xu Tang
27
1
0
12 May 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
27
5
0
28 Mar 2024
Comment-aided Video-Language Alignment via Contrastive Pre-training for Short-form Video Humor Detection
Yang Liu
Tongfei Shen
Dong Zhang
Qingying Sun
Shoushan Li
Guodong Zhou
11
4
0
14 Feb 2024
CREMA: Generalizable and Efficient Video-Language Reasoning via Multimodal Modular Fusion
Shoubin Yu
Jaehong Yoon
Mohit Bansal
62
4
0
08 Feb 2024
Modality Plug-and-Play: Elastic Modality Adaptation in Multimodal LLMs for Embodied AI
Kai Huang
Boyuan Yang
Wei Gao
12
1
0
13 Dec 2023
BirdSAT: Cross-View Contrastive Masked Autoencoders for Bird Species Classification and Mapping
S. Sastry
Subash Khanal
A. Dhakal
Di Huang
Nathan Jacobs
36
9
0
29 Oct 2023
MOSEL: Inference Serving Using Dynamic Modality Selection
Bodun Hu
Le Xu
Jeongyoon Moon
N. Yadwadkar
Aditya Akella
11
3
0
27 Oct 2023
STELLA: Continual Audio-Video Pre-training with Spatio-Temporal Localized Alignment
Jaewoo Lee
Jaehong Yoon
Wonjae Kim
Yunji Kim
Sung Ju Hwang
CLL
14
1
0
12 Oct 2023
Latent Wander: an Alternative Interface for Interactive and Serendipitous Discovery of Large AV Archives
Yuchen Yang
Linyida Zhang
8
2
0
09 Oct 2023
VidChapters-7M: Video Chapters at Scale
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
11
26
0
25 Sep 2023
A Unified Transformer-based Network for multimodal Emotion Recognition
Kamran Ali
Charles E. Hughes
8
1
0
27 Aug 2023
Exploring the Role of Audio in Video Captioning
Yuhan Shen
Linjie Yang
Longyin Wen
Haichao Yu
Ehsan Elhamifar
Heng Wang
10
2
0
21 Jun 2023
Any-to-Any Generation via Composable Diffusion
Zineng Tang
Ziyi Yang
Chenguang Zhu
Michael Zeng
Mohit Bansal
VGen
DiffM
15
169
0
19 May 2023
VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
Sihan Chen
Xingjian He
Longteng Guo
Xinxin Zhu
Weining Wang
Jinhui Tang
Jinhui Tang
VLM
21
99
0
17 Apr 2023
MAViL: Masked Audio-Video Learners
Po-Yao (Bernie) Huang
Vasu Sharma
Hu Xu
Chaitanya K. Ryali
Haoqi Fan
Yanghao Li
Shang-Wen Li
Gargi Ghosh
Jitendra Malik
Christoph Feichtenhofer
11
50
0
15 Dec 2022
CLIPPO: Image-and-Language Understanding from Pixels Only
Michael Tschannen
Basil Mustafa
N. Houlsby
CLIP
VLM
11
47
0
15 Dec 2022
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Yan-Bo Lin
Yi-Lin Sung
Jie Lei
Mohit Bansal
Gedas Bertasius
13
69
0
15 Dec 2022
Masked Autoencoders Are Scalable Vision Learners
Kaiming He
Xinlei Chen
Saining Xie
Yanghao Li
Piotr Dollár
Ross B. Girshick
ViT
TPM
258
7,337
0
11 Nov 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
231
573
0
22 Apr 2021
Generative Spoken Language Modeling from Raw Audio
Kushal Lakhotia
Evgeny Kharitonov
Wei-Ning Hsu
Yossi Adi
Adam Polyak
...
Tu Nguyen
Jade Copet
Alexei Baevski
A. Mohamed
Emmanuel Dupoux
AuLLM
174
336
0
01 Feb 2021
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
250
922
0
24 Sep 2019
1