Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2112.04446
Cited By
Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval
8 December 2021
Nina Shvetsova
Brian Chen
Andrew Rouditchenko
Samuel Thomas
Brian Kingsbury
Rogerio Feris
David F. Harwath
James R. Glass
Hilde Kuehne
ViT
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Everything at Once -- Multi-modal Fusion Transformer for Video Retrieval"
15 / 15 papers shown
Title
HierSum: A Global and Local Attention Mechanism for Video Summarization
Apoorva Beedu
Irfan Essa
34
0
0
25 Apr 2025
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
27
5
0
28 Mar 2024
Write What You Want: Applying Text-to-video Retrieval to Audiovisual Archives
Yuchen Yang
VGen
11
7
0
09 Oct 2023
Day2Dark: Pseudo-Supervised Activity Recognition beyond Silent Daylight
Yunhua Zhang
Hazel Doughty
Cees G. M. Snoek
VLM
26
0
0
05 Dec 2022
C2KD: Cross-Lingual Cross-Modal Knowledge Distillation for Multilingual Text-Video Retrieval
Andrew Rouditchenko
Yung-Sung Chuang
Nina Shvetsova
Samuel Thomas
Rogerio Feris
Brian Kingsbury
Leonid Karlinsky
David F. Harwath
Hilde Kuehne
James R. Glass
VLM
18
4
0
07 Oct 2022
UAVM: Towards Unifying Audio and Visual Models
Yuan Gong
Alexander H. Liu
Andrew Rouditchenko
James R. Glass
17
20
0
29 Jul 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
41
518
0
13 Jun 2022
Contrastive language and vision learning of general fashion concepts
P. Chia
Giuseppe Attanasio
Federico Bianchi
Silvia Terragni
A. Magalhães
Diogo Gonçalves
C. Greco
Jacopo Tagliabue
CLIP
13
42
0
08 Apr 2022
ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound
Yan-Bo Lin
Jie Lei
Mohit Bansal
Gedas Bertasius
23
39
0
06 Apr 2022
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
231
573
0
22 Apr 2021
T2VLAD: Global-Local Sequence Alignment for Text-Video Retrieval
Xiaohan Wang
Linchao Zhu
Yi Yang
138
166
0
20 Apr 2021
CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval
Huaishao Luo
Lei Ji
Ming Zhong
Yang Chen
Wen Lei
Nan Duan
Tianrui Li
CLIP
VLM
303
771
0
18 Apr 2021
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
398
532
0
21 Jul 2020
Audiovisual SlowFast Networks for Video Recognition
Fanyi Xiao
Yong Jae Lee
Kristen Grauman
Jitendra Malik
Christoph Feichtenhofer
192
204
0
23 Jan 2020
Efficient Estimation of Word Representations in Vector Space
Tomáš Mikolov
Kai Chen
G. Corrado
J. Dean
3DV
228
29,632
0
16 Jan 2013
1