ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2103.04037
  4. Cited By
Perspectives and Prospects on Transformer Architecture for Cross-Modal
  Tasks with Language and Vision

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

6 March 2021
Andrew Shin
Masato Ishii
T. Narihira
ArXivPDFHTML

Papers citing "Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision"

11 / 11 papers shown
Title
Video Transformers: A Survey
Video Transformers: A Survey
Javier Selva
A. S. Johansen
Sergio Escalera
Kamal Nasrollahi
T. Moeslund
Albert Clapés
ViT
20
102
0
16 Jan 2022
Core Challenges in Embodied Vision-Language Planning
Core Challenges in Embodied Vision-Language Planning
Jonathan M Francis
Nariaki Kitamura
Felix Labelle
Xiaopeng Lu
Ingrid Navarro
Jean Oh
LM&Ro
39
45
0
26 Jun 2021
VidTr: Video Transformer Without Convolutions
VidTr: Video Transformer Without Convolutions
Yanyi Zhang
Xinyu Li
Chunhui Liu
Bing Shuai
Yi Zhu
Biagio Brattoli
Hao Chen
I. Marsic
Joseph Tighe
ViT
136
193
0
23 Apr 2021
VATT: Transformers for Multimodal Self-Supervised Learning from Raw
  Video, Audio and Text
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
240
577
0
22 Apr 2021
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction
  without Convolutions
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
Wenhai Wang
Enze Xie
Xiang Li
Deng-Ping Fan
Kaitao Song
Ding Liang
Tong Lu
Ping Luo
Ling Shao
ViT
263
3,622
0
24 Feb 2021
Zero-Shot Text-to-Image Generation
Zero-Shot Text-to-Image Generation
Aditya A. Ramesh
Mikhail Pavlov
Gabriel Goh
Scott Gray
Chelsea Voss
Alec Radford
Mark Chen
Ilya Sutskever
VLM
253
4,764
0
24 Feb 2021
Transformers in Vision: A Survey
Transformers in Vision: A Survey
Salman Khan
Muzammal Naseer
Munawar Hayat
Syed Waqas Zamir
F. Khan
M. Shah
ViT
225
2,427
0
04 Jan 2021
Multi-modal Transformer for Video Retrieval
Multi-modal Transformer for Video Retrieval
Valentin Gabeur
Chen Sun
Alahari Karteek
Cordelia Schmid
ViT
410
595
0
21 Jul 2020
Unified Vision-Language Pre-Training for Image Captioning and VQA
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
250
926
0
24 Sep 2019
A Style-Based Generator Architecture for Generative Adversarial Networks
A Style-Based Generator Architecture for Generative Adversarial Networks
Tero Karras
S. Laine
Timo Aila
262
10,344
0
12 Dec 2018
Multimodal Compact Bilinear Pooling for Visual Question Answering and
  Visual Grounding
Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding
Akira Fukui
Dong Huk Park
Daylen Yang
Anna Rohrbach
Trevor Darrell
Marcus Rohrbach
144
1,464
0
06 Jun 2016
1