ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2208.10442
  4. Cited By
Image as a Foreign Language: BEiT Pretraining for All Vision and
  Vision-Language Tasks

Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

22 August 2022
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
Qiang Liu
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
    MLLM
    VLM
    ViT
ArXivPDFHTML

Papers citing "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks"

50 / 458 papers shown
Title
CoT-MoTE: Exploring ConTextual Masked Auto-Encoder Pre-training with
  Mixture-of-Textual-Experts for Passage Retrieval
CoT-MoTE: Exploring ConTextual Masked Auto-Encoder Pre-training with Mixture-of-Textual-Experts for Passage Retrieval
Guangyuan Ma
Xing Wu
Peng Wang
Songlin Hu
MoE
RALM
13
5
0
20 Apr 2023
Enhancing Textbooks with Visuals from the Web for Improved Learning
Enhancing Textbooks with Visuals from the Web for Improved Learning
Janvijay Singh
Vilém Zouhar
Mrinmaya Sachan
11
3
0
18 Apr 2023
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic
  Segmentation
MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation
Jie Guo
Qimeng Wang
Yan Gao
Xiaolong Jiang
Xu Tang
Yao Hu
Baochang Zhang
VLM
16
11
0
14 Apr 2023
Modeling Dense Multimodal Interactions Between Biological Pathways and
  Histology for Survival Prediction
Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction
Guillaume Jaume
Anurag J. Vaidya
Richard J. Chen
Drew F. K. Williamson
Paul Pu Liang
Faisal Mahmood
33
41
0
13 Apr 2023
On the Opportunities and Challenges of Foundation Models for Geospatial
  Artificial Intelligence
On the Opportunities and Challenges of Foundation Models for Geospatial Artificial Intelligence
Gengchen Mai
Weiming Huang
Jin Sun
Suhang Song
Deepak Mishra
...
Yingjie Hu
Chris Cundy
Ziyuan Li
Rui Zhu
Ni Lao
AI4CE
22
118
0
13 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal
  representations
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
19
4
0
11 Apr 2023
Training Large Language Models Efficiently with Sparsity and Dataflow
Training Large Language Models Efficiently with Sparsity and Dataflow
V. Srinivasan
Darshan Gandhi
Urmish Thakker
R. Prabhakar
MoE
28
6
0
11 Apr 2023
A Billion-scale Foundation Model for Remote Sensing Images
A Billion-scale Foundation Model for Remote Sensing Images
Keumgang Cha
Junghoon Seo
Taekyung Lee
30
63
0
11 Apr 2023
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Improving Image Recognition by Retrieving from Web-Scale Image-Text Data
Ahmet Iscen
Alireza Fathi
Cordelia Schmid
VLM
3DV
15
25
0
11 Apr 2023
On Robustness in Multimodal Learning
On Robustness in Multimodal Learning
Brandon McKinzie
Joseph Cheng
Vaishaal Shankar
Yinfei Yang
Jonathon Shlens
Alexander Toshev
30
2
0
10 Apr 2023
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Video ChatCaptioner: Towards Enriched Spatiotemporal Descriptions
Jun Chen
Deyao Zhu
Kilichbek Haydarov
Xiang Li
Mohamed Elhoseiny
23
37
0
09 Apr 2023
Token Boosting for Robust Self-Supervised Visual Transformer
  Pre-training
Token Boosting for Robust Self-Supervised Visual Transformer Pre-training
Tianjiao Li
Lin Geng Foo
Ping Hu
Xindi Shang
Hossein Rahmani
Zehuan Yuan
J. Liu
32
7
0
09 Apr 2023
Attention: Marginal Probability is All You Need?
Attention: Marginal Probability is All You Need?
Ryan Singh
Christopher L. Buckley
31
2
0
07 Apr 2023
On Efficient Training of Large-Scale Deep Learning Models: A Literature
  Review
On Efficient Training of Large-Scale Deep Learning Models: A Literature Review
Li Shen
Yan Sun
Zhiyuan Yu
Liang Ding
Xinmei Tian
Dacheng Tao
VLM
24
39
0
07 Apr 2023
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior
  Refinement
Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement
Xiang-yu Zhu
Renrui Zhang
Bowei He
A-Long Zhou
Dong Wang
Bingyan Zhao
Peng Gao
VLM
27
79
0
03 Apr 2023
Self-Supervised Multimodal Learning: A Survey
Self-Supervised Multimodal Learning: A Survey
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
19
43
0
31 Mar 2023
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
A Study of Autoregressive Decoders for Multi-Tasking in Computer Vision
Lucas Beyer
Bo Wan
Gagan Madan
Filip Pavetić
Andreas Steiner
...
Emanuele Bugliarello
Xiao Wang
Qihang Yu
Liang-Chieh Chen
Xiaohua Zhai
49
8
0
30 Mar 2023
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
Weicheng Kuo
A. Piergiovanni
Dahun Kim
Xiyang Luo
Benjamin Caine
...
Luowei Zhou
Andrew M. Dai
Zhifeng Chen
Claire Cui
A. Angelova
MLLM
VLM
23
23
0
29 Mar 2023
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init
  Attention
LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention
Renrui Zhang
Jiaming Han
Chris Liu
Peng Gao
Aojun Zhou
Xiangfei Hu
Shilin Yan
Pan Lu
Hongsheng Li
Yu Qiao
MLLM
33
737
0
28 Mar 2023
Revisiting Multimodal Representation in Contrastive Learning: From Patch
  and Token Embeddings to Finite Discrete Tokens
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Yuxiao Chen
Jianbo Yuan
Yu Tian
Shijie Geng
Xinyu Li
Ding Zhou
Dimitris N. Metaxas
Hongxia Yang
14
33
0
27 Mar 2023
Equivariant Similarity for Vision-Language Foundation Models
Equivariant Similarity for Vision-Language Foundation Models
Tan Wang
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Zhengyuan Yang
Hanwang Zhang
Zicheng Liu
Lijuan Wang
CoGe
38
44
0
25 Mar 2023
IFSeg: Image-free Semantic Segmentation via Vision-Language Model
IFSeg: Image-free Semantic Segmentation via Vision-Language Model
Sukmin Yun
S. Park
Paul Hongsuck Seo
Jinwoo Shin
VLM
MLLM
49
13
0
25 Mar 2023
Accelerating Vision-Language Pretraining with Free Language Modeling
Accelerating Vision-Language Pretraining with Free Language Modeling
Teng Wang
Yixiao Ge
Feng Zheng
Ran Cheng
Ying Shan
Xiaohu Qie
Ping Luo
VLM
MLLM
89
9
0
24 Mar 2023
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You
Mandy Guo
Zhecan Wang
Kai-Wei Chang
Jason Baldridge
Jiahui Yu
DiffM
37
12
0
23 Mar 2023
MAGVLT: Masked Generative Vision-and-Language Transformer
MAGVLT: Masked Generative Vision-and-Language Transformer
Sungwoong Kim
DaeJin Jo
Donghoon Lee
Jongmin Kim
VLM
28
11
0
21 Mar 2023
eP-ALM: Efficient Perceptual Augmentation of Language Models
eP-ALM: Efficient Perceptual Augmentation of Language Models
Mustafa Shukor
Corentin Dancette
Matthieu Cord
MLLM
VLM
24
29
0
20 Mar 2023
EVA-02: A Visual Representation for Neon Genesis
EVA-02: A Visual Representation for Neon Genesis
Yuxin Fang
Quan-Sen Sun
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
VLM
ViT
CLIP
38
258
0
20 Mar 2023
MXM-CLR: A Unified Framework for Contrastive Learning of Multifold
  Cross-Modal Representations
MXM-CLR: A Unified Framework for Contrastive Learning of Multifold Cross-Modal Representations
Ye Wang
Bo‐Shu Jiang
C. Zou
Rui Ma
22
5
0
20 Mar 2023
IRGen: Generative Modeling for Image Retrieval
IRGen: Generative Modeling for Image Retrieval
Yidan Zhang
Ting Zhang
Dong Chen
Yujing Wang
Qi Chen
...
Qi Zhang
Fan Yang
Mao Yang
Q. Liao
B. Guo
3DV
VLM
33
14
0
17 Mar 2023
Investigating the Role of Attribute Context in Vision-Language Models
  for Object Recognition and Detection
Investigating the Role of Attribute Context in Vision-Language Models for Object Recognition and Detection
Kyle Buettner
Adriana Kovashka
20
0
0
17 Mar 2023
Dual-path Adaptation from Image to Video Transformers
Dual-path Adaptation from Image to Video Transformers
Jungin Park
Jiyoung Lee
K. Sohn
ViT
19
37
0
17 Mar 2023
Cross-Modal Causal Intervention for Medical Report Generation
Cross-Modal Causal Intervention for Medical Report Generation
Weixing Chen
Yang Liu
Ce Wang
Jiarui Zhu
Shen Zhao
Guanbin Li
Cheng-Lin Liu
Liang Lin
21
5
0
16 Mar 2023
Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness
  with Dataset Reinforcement
Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement
Fartash Faghri
Hadi Pouransari
Sachin Mehta
Mehrdad Farajtabar
Ali Farhadi
Mohammad Rastegari
Oncel Tuzel
30
9
0
15 Mar 2023
Align and Attend: Multimodal Summarization with Dual Contrastive Losses
Align and Attend: Multimodal Summarization with Dual Contrastive Losses
Bo He
Jun Wang
Jielin Qiu
Trung Bui
Abhinav Shrivastava
Zhaowen Wang
17
65
0
13 Mar 2023
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical
  Documents
PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents
Weixiong Lin
Ziheng Zhao
Xiaoman Zhang
Chaoyi Wu
Ya-Qin Zhang
Yanfeng Wang
Weidi Xie
LM&MA
VLM
MedIm
12
142
0
13 Mar 2023
Scaling Vision-Language Models with Sparse Mixture of Experts
Scaling Vision-Language Models with Sparse Mixture of Experts
Sheng Shen
Z. Yao
Chunyuan Li
Trevor Darrell
Kurt Keutzer
Yuxiong He
VLM
MoE
11
62
0
13 Mar 2023
ViM: Vision Middleware for Unified Downstream Transferring
ViM: Vision Middleware for Unified Downstream Transferring
Yutong Feng
Biao Gong
Jianwen Jiang
Yiliang Lv
Yujun Shen
Deli Zhao
Jingren Zhou
19
1
0
13 Mar 2023
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched
  Visual Descriptions
ChatGPT Asks, BLIP-2 Answers: Automatic Questioning Towards Enriched Visual Descriptions
Deyao Zhu
Jun Chen
Kilichbek Haydarov
Xiaoqian Shen
Wenxuan Zhang
Mohamed Elhoseiny
MLLM
21
96
0
12 Mar 2023
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale
Fan Bao
Shen Nie
Kaiwen Xue
Chongxuan Li
Shiliang Pu
Yaole Wang
Gang Yue
Yue Cao
Hang Su
Jun Zhu
DiffM
199
148
0
12 Mar 2023
Understanding and Constructing Latent Modality Structures in Multi-modal
  Representation Learning
Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning
Qian Jiang
Changyou Chen
Han Zhao
Liqun Chen
Q. Ping
S. D. Tran
Yi Xu
Belinda Zeng
Trishul M. Chilimbi
41
38
0
10 Mar 2023
MuLTI: Efficient Video-and-Language Understanding with Text-Guided
  MultiWay-Sampler and Multiple Choice Modeling
MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling
Jiaqi Xu
Bo Liu
Yunkuo Chen
Mengli Cheng
Xing Shi
40
1
0
10 Mar 2023
HumanBench: Towards General Human-centric Perception with Projector
  Assisted Pretraining
HumanBench: Towards General Human-centric Perception with Projector Assisted Pretraining
Shixiang Tang
Cheng Chen
Qingsong Xie
Meilin Chen
Yizhou Wang
...
Feng Zhu
Haiyang Yang
Li Yi
Rui Zhao
Wanli Ouyang
VLM
14
35
0
10 Mar 2023
Tag2Text: Guiding Vision-Language Model via Image Tagging
Tag2Text: Guiding Vision-Language Model via Image Tagging
Xinyu Huang
Youcai Zhang
Jinyu Ma
Weiwei Tian
Rui Feng
Yuejie Zhang
Yaqian Li
Yandong Guo
Lei Zhang
CLIP
MLLM
VLM
3DV
61
74
0
10 Mar 2023
Exploiting the Textual Potential from Vision-Language Pre-training for
  Text-based Person Search
Exploiting the Textual Potential from Vision-Language Pre-training for Text-based Person Search
Guanshuo Wang
Fufu Yu
Jianing Li
Qiong Jia
Shouhong Ding
19
18
0
08 Mar 2023
Bootstrap The Original Latent: Learning a Private Model from a Black-box
  Model
Bootstrap The Original Latent: Learning a Private Model from a Black-box Model
Shuai Wang
Daoan Zhang
Jiang Zhang
Weiwei Zhang
Ruizhen Li
FedML
30
0
0
07 Mar 2023
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware
  Attention
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
Shijie Geng
Jianbo Yuan
Yu Tian
Yuxiao Chen
Yongfeng Zhang
CLIP
VLM
41
44
0
06 Mar 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion
  Tasks
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Xiaoping Han
Xiatian Zhu
Licheng Yu
Li Zhang
Yi-Zhe Song
Tao Xiang
VLM
16
38
0
04 Mar 2023
DejaVu: Conditional Regenerative Learning to Enhance Dense Prediction
DejaVu: Conditional Regenerative Learning to Enhance Dense Prediction
Shubhankar Borse
Debasmit Das
Hyojin Park
H. Cai
Risheek Garrepalli
Fatih Porikli
28
9
0
02 Mar 2023
Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves
Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves
Sora Takashima
Ryo Hayamizu
Nakamasa Inoue
Hirokatsu Kataoka
Rio Yokota
60
18
0
02 Mar 2023
RAMM: Retrieval-augmented Biomedical Visual Question Answering with
  Multi-modal Pre-training
RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training
Zheng Yuan
Qiao Jin
Chuanqi Tan
Zhengyun Zhao
Hongyi Yuan
Fei Huang
Songfang Huang
44
27
0
01 Mar 2023
Previous
123...106789
Next