ResearchTrend.AI
PolyViT: Co-training Vision Transformers on Images, Videos and Audio

25 November 2021
Valerii Likhosherstov
Anurag Arnab
K. Choromanski
Mario Lucic
Yi Tay
Adrian Weller
Mostafa Dehghani
    ViT

Papers citing "PolyViT: Co-training Vision Transformers on Images, Videos and Audio"

50 / 50 papers shown
AMA-SAM: Adversarial Multi-Domain Alignment of Segment Anything Model for High-Fidelity Histology Nuclei Segmentation
Jiahe Qian
Yaoyu Fang
Jinkui Hao
Bo Zhou
MedIm
27
0
0
27 Mar 2025
Sensitive Image Classification by Vision Transformers
Hanxian He
Campbell Wilson
Thanh Thi Nguyen
Janis Dalins
ViT
66
0
0
21 Dec 2024
UNQA: Unified No-Reference Quality Assessment for Audio, Image, Video, and Audio-Visual Content
Y. Cao
Xiongkuo Min
Yixuan Gao
Wei Sun
Weisi Lin
Guangtao Zhai
31
2
0
29 Jul 2024
Computer vision tasks for intelligent aerospace missions: An overview
Huilin Chen
Qiyu Sun
Fangfei Li
Yang Tang
27
0
0
09 Jul 2024
UDON: Universal Dynamic Online distillatioN for generic image representations
Nikolaos-Antonios Ypsilantis
Kaifeng Chen
André Araujo
Ondřej Chum
22
1
0
12 Jun 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
27
5
0
28 Mar 2024
A Versatile Framework for Multi-scene Person Re-identification
Wei-Shi Zheng
Junkai Yan
Yi-Xing Peng
VLM
27
5
0
17 Mar 2024
Zoom-shot: Fast and Efficient Unsupervised Zero-Shot Transfer of CLIP to Vision Encoders with Multimodal Loss
Jordan Shipard
Arnold Wiliem
Kien Nguyen Thanh
Wei Xiang
Clinton Fookes
VLM
CLIP
38
2
0
22 Jan 2024
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
G. Loaiza-Ganem
M. Volkovs
29
1
0
15 Dec 2023
TMT-VIS: Taxonomy-aware Multi-dataset Joint Training for Video Instance Segmentation
Rongkun Zheng
Lu Qi
Xi Chen
Yi Wang
Kun Wang
Yu Qiao
Hengshuang Zhao
14
2
0
11 Dec 2023
Audio-Visual LLM for Video Understanding
Fangxun Shu
Lei Zhang
Hao Jiang
Cihang Xie
VLM
MLLM
17
36
0
11 Dec 2023
ViT-Lens: Towards Omni-modal Representations
Weixian Lei
Yixiao Ge
Kun Yi
Jianfeng Zhang
Difei Gao
Dylan Sun
Yuying Ge
Ying Shan
Mike Zheng Shou
18
18
0
27 Nov 2023
Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving
Thomas E. Huang
Yifan Liu
Luc Van Gool
Fisher Yu
11
5
0
08 Sep 2023
Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations
Nikolaos-Antonios Ypsilantis
Kaifeng Chen
Bingyi Cao
Mário Lipovský
Pelin Dogan-Schönberger
Grzegorz Makosa
Boris Bluntschli
Mojtaba Seyedhosseini
Ondrej Chum
André Araujo
SSL
8
13
0
04 Sep 2023
Joint learning of images and videos with a single Vision Transformer
Shuki Shimizu
Toru Tamaki
ViT
11
0
0
21 Aug 2023
ViT-Lens: Initiating Omni-Modal Exploration through 3D Insights
Weixian Lei
Yixiao Ge
Jianfeng Zhang
Dylan Sun
Kun Yi
Ying Shan
Mike Zheng Shou
19
1
0
20 Aug 2023
Does Visual Pretraining Help End-to-End Reasoning?
Chen Sun
Calvin Luo
Xingyi Zhou
Anurag Arnab
Cordelia Schmid
OCL
LRM
ViT
25
3
0
17 Jul 2023
DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model
Xiuye Gu
Yin Cui
Jonathan Huang
Abdullah M. Rashwan
X. Yang
...
Golnaz Ghiasi
Weicheng Kuo
Huizhong Chen
Liang-Chieh Chen
David A. Ross
ISeg
19
26
0
02 Jun 2023
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Hassan Akbari
Dan Kondratyuk
Yin Cui
Rachel Hornung
H. Wang
Hartwig Adam
VLM
MoE
12
11
0
10 May 2023
ImageBind: One Embedding Space To Bind Them All
Rohit Girdhar
Alaaeldin El-Nouby
Zhuang Liu
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
VLM
16
817
0
09 May 2023
Modality-invariant Visual Odometry for Embodied Vision
Marius Memmel
Roman Bachmann
Amir Zamir
48
8
0
29 Apr 2023
BenchMD: A Benchmark for Unified Learning on Medical Images and Sensors
Kathryn Wantlin
Chenwei Wu
Shih-Cheng Huang
Oishi Banerjee
Farah Z. Dadabhoy
...
A. Adamson
Laura Heacock
G. Tison
Alex Tamkin
Pranav Rajpurkar
SSL
OOD
27
1
0
17 Apr 2023
ViC-MAE: Self-Supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders
J. Hernandez
Ruben Villegas
Vicente Ordonez
SSL
21
2
0
21 Mar 2023
Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving
Xiwen Liang
Minzhe Niu
Jianhua Han
Hang Xu
Chunjing Xu
Xiaodan Liang
VLM
13
13
0
03 Mar 2023
CoMAE: Single Model Hybrid Pre-training on Small-Scale RGB-D Datasets
Jiang Yang
Sheng Guo
Gangshan Wu
Limin Wang
VLM
15
6
0
13 Feb 2023
Adaptive Computation with Elastic Input Sequence
Fuzhao Xue
Valerii Likhosherstov
Anurag Arnab
N. Houlsby
Mostafa Dehghani
Yang You
19
18
0
30 Jan 2023
CLIPPO: Image-and-Language Understanding from Pixels Only
Michael Tschannen
Basil Mustafa
N. Houlsby
CLIP
VLM
11
47
0
15 Dec 2022
Vision Transformers are Parameter-Efficient Audio-Visual Learners
Yan-Bo Lin
Yi-Lin Sung
Jie Lei
Mohit Bansal
Gedas Bertasius
13
69
0
15 Dec 2022
TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities
Zhe Zhao
Yudong Li
Cheng-An Hou
Jing-xin Zhao
Rong Tian
...
Xingwu Sun
Zhanhui Kang
Xiaoyong Du
Linlin Shen
Kimmo Yan
VLM
21
16
0
13 Dec 2022
Audiovisual Masked Autoencoders
Mariana-Iuliana Georgescu
Eduardo Fonseca
Radu Tudor Ionescu
Mario Lucic
Cordelia Schmid
Anurag Arnab
SSL
16
43
0
09 Dec 2022
Deep Architectures for Content Moderation and Movie Content Rating
Fatih Çagatay Akyön
A. Temizel
20
4
0
08 Dec 2022
Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning
A. Piergiovanni
Weicheng Kuo
A. Angelova
ViT
21
54
0
06 Dec 2022
Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization
Junru Wu
Yi Liang
Feng Han
Hassan Akbari
Zhangyang Wang
Cong Yu
18
5
0
03 Nov 2022
Play It Back: Iterative Attention for Audio Recognition
Alexandros Stergiou
Dima Damen
16
4
0
20 Oct 2022
Multi-dataset Training of Transformers for Robust Action Recognition
Junwei Liang
Enwei Zhang
Jun Zhang
Chunhua Shen
ViT
29
11
0
26 Sep 2022
Learning Model Predictive Controllers with Real-Time Attention for Real-World Navigation
Xuesu Xiao
Tingnan Zhang
K. Choromanski
Edward J. Lee
Anthony G. Francis
...
Leila Takayama
Roy Frostig
Jie Tan
Carolina Parada
Vikas Sindhwani
63
54
0
22 Sep 2022
Effective Adaptation in Multi-Task Co-Training for Unified Autonomous Driving
Xiwen Liang
Yangxin Wu
Jianhua Han
Hang Xu
Chunjing Xu
Xiaodan Liang
14
30
0
19 Sep 2022
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Paul Pu Liang
Amir Zadeh
Louis-Philippe Morency
10
59
0
07 Sep 2022
UAVM: Towards Unifying Audio and Visual Models
Yuan Gong
Alexander H. Liu
Andrew Rouditchenko
James R. Glass
17
20
0
29 Jul 2022
OmniMAE: Single Model Masked Pretraining on Images and Videos
Rohit Girdhar
Alaaeldin El-Nouby
Mannat Singh
Kalyan Vasudev Alwala
Armand Joulin
Ishan Misra
ViT
17
95
0
16 Jun 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
26
518
0
13 Jun 2022
Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts
Basil Mustafa
C. Riquelme
J. Puigcerver
Rodolphe Jenatton
N. Houlsby
VLM
MoE
26
183
0
06 Jun 2022
NL-FCOS: Improving FCOS through Non-Local Modules for Object Detection
Lukas Pavez
Jose M. Saavedra Rondo
ObjD
21
0
0
29 Mar 2022
X-Learner: Learning Cross Sources and Tasks for Universal Visual Representation
Yinan He
Gengshi Huang
Siyu Chen
Jianing Teng
Wang Kun
Zhen-fei Yin
Lu Sheng
Ziwei Liu
Yu Qiao
Jing Shao
VLM
SSL
ViT
17
5
0
16 Mar 2022
High-Modality Multimodal Transformer: Quantifying Modality & Interaction Heterogeneity for High-Modality Representation Learning
Paul Pu Liang
Yiwei Lyu
Xiang Fan
Jeffrey Tsaw
Yudong Liu
Shentong Mo
Dani Yogatama
Louis-Philippe Morency
Ruslan Salakhutdinov
9
29
0
02 Mar 2022
HiP: Hierarchical Perceiver
João Carreira
Skanda Koppula
Daniel Zoran
Adrià Recasens
Catalin Ionescu
...
M. Botvinick
Oriol Vinyals
Karen Simonyan
Andrew Zisserman
Andrew Jaegle
VLM
15
14
0
22 Feb 2022
SCENIC: A JAX Library for Computer Vision Research and Beyond
Mostafa Dehghani
A. Gritsenko
Anurag Arnab
Matthias Minderer
Yi Tay
41
67
0
18 Oct 2021
Exploring the Limits of Large Scale Pre-training
Samira Abnar
Mostafa Dehghani
Behnam Neyshabur
Hanie Sedghi
AI4CE
29
114
0
05 Oct 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
231
573
0
22 Apr 2021
Is Space-Time Attention All You Need for Video Understanding?
Gedas Bertasius
Heng Wang
Lorenzo Torresani
ViT
278
1,939
0
09 Feb 2021