2208.10442
Cited By
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
22 August 2022
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
Qiang Liu
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
Papers citing
"Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks"
50 / 458 papers shown
Title
Global and Local Semantic Completion Learning for Vision-Language Pre-training
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
10
2
0
12 Jun 2023
Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!
Zaid Khan
B. Vijaykumar
S. Schulter
Xiang Yu
Y. Fu
Manmohan Chandraker
VLM
MLLM
22
17
0
06 Jun 2023
Industrial Anomaly Detection and Localization Using Weakly-Supervised Residual Transformers
Hanxi Li
Jing Wu
Lin Yuanbo Wu
Hao Chen
Deyin Liu
Mingwen Wang
Peng Wang
ViT
34
4
0
06 Jun 2023
Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding
Hang Zhang
Xin Li
Lidong Bing
MLLM
47
948
0
05 Jun 2023
Benchmarking Robustness of Adaptation Methods on Pre-trained Vision-Language Models
Shuo Chen
Jindong Gu
Zhen Han
Yunpu Ma
Philip H. S. Torr
Volker Tresp
VPVLM
VLM
27
17
0
03 Jun 2023
UniDiff: Advancing Vision-Language Models with Generative and Discriminative Learning
Xiao Dong
Runhu Huang
Xiaoyong Wei
Zequn Jie
Jianxing Yu
Jian Yin
Xiaodan Liang
VLM
DiffM
26
1
0
01 Jun 2023
On Masked Pre-training and the Marginal Likelihood
Pablo Moreno-Muñoz
Pol G. Recasens
Søren Hauberg
SSL
25
5
0
01 Jun 2023
PV2TEA: Patching Visual Modality to Textual-Established Information Extraction
Hejie Cui
Rongmei Lin
Nasser Zalmout
Chenwei Zhang
Jingbo Shang
Carl Yang
Xian Li
VLM
8
2
0
01 Jun 2023
ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning
Xiao Xu
Bei Li
Chenfei Wu
Shao-Yen Tseng
Anahita Bhiwandiwalla
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
AIFin
VLM
21
2
0
31 May 2023
Learning without Forgetting for Vision-Language Models
Da-Wei Zhou
Yuanhan Zhang
Jingyi Ning
De-Chuan Zhan
Ziwei Liu
VLM
CLL
69
37
0
30 May 2023
PaLI-X: On Scaling up a Multilingual Vision and Language Model
Xi Chen
Josip Djolonga
Piotr Padlewski
Basil Mustafa
Soravit Changpinyo
...
Mojtaba Seyedhosseini
A. Angelova
Xiaohua Zhai
N. Houlsby
Radu Soricut
VLM
44
187
0
29 May 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
J. Liu
30
96
0
29 May 2023
Emergent Modularity in Pre-trained Transformers
Zhengyan Zhang
Zhiyuan Zeng
Yankai Lin
Chaojun Xiao
Xiaozhi Wang
Xu Han
Zhiyuan Liu
Ruobing Xie
Maosong Sun
Jie Zhou
MoE
37
23
0
28 May 2023
Toward Understanding Generative Data Augmentation
Chenyu Zheng
Guoqiang Wu
Chongxuan Li
21
24
0
27 May 2023
Benchmarking Diverse-Modal Entity Linking with Generative Models
Sijia Wang
A. Li
He Zhu
Shenmin Zhang
Chung-Wei Hang
...
William Wang
Zhiguo Wang
Vittorio Castelli
Bing Xiang
Patrick K. L. Ng
VLM
33
8
0
27 May 2023
Matrix Information Theory for Self-Supervised Learning
Yifan Zhang
Zhi-Hao Tan
Jingqin Yang
Weiran Huang
Yang Yuan
SSL
40
16
0
27 May 2023
Zero-shot Visual Question Answering with Language Model Feedback
Yifan Du
Junyi Li
Tianyi Tang
Wayne Xin Zhao
Ji-Rong Wen
16
13
0
26 May 2023
Detect Any Shadow: Segment Anything for Video Shadow Detection
Yonghui Wang
Wen-gang Zhou
Yunyao Mao
Houqiang Li
VLM
8
22
0
26 May 2023
LANISTR: Multimodal Learning from Structured and Unstructured Data
Sayna Ebrahimi
Sercan Ö. Arik
Yihe Dong
Tomas Pfister
20
4
0
26 May 2023
VioLA: Unified Codec Language Models for Speech Recognition, Synthesis, and Translation
Tianrui Wang
Long Zhou
Zi-Hua Zhang
Yu-Huan Wu
Shujie Liu
Yashesh Gaur
Zhuo Chen
Jinyu Li
Furu Wei
32
100
0
25 May 2023
ChatCAD+: Towards a Universal and Reliable Interactive CAD using LLMs
Zihao Zhao
Sheng Wang
Jinchen Gu
Yitao Zhu
Lanzhuju Mei
Zixu Zhuang
Zhiming Cui
Qian Wang
Dinggang Shen
LM&MA
29
36
0
25 May 2023
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models
Geewook Kim
Hodong Lee
D. Kim
Haeji Jung
S. Park
Yoon Kim
Sangdoo Yun
Taeho Kil
Bado Lee
Seunghyun Park
VLM
32
4
0
24 May 2023
Training Transitive and Commutative Multimodal Transformers with LoReTTa
Manuel Tran
Yashin Dicente Cid
Amal Lahiani
Fabian J. Theis
Tingying Peng
Eldad Klaiman
13
2
0
23 May 2023
S-CLIP: Semi-supervised Vision-Language Learning using Few Specialist Captions
Sangwoo Mo
Minkyu Kim
Kyungmin Lee
Jinwoo Shin
VLM
CLIP
36
21
0
23 May 2023
Perception Test: A Diagnostic Benchmark for Multimodal Video Models
Viorica Pătrăucean
Lucas Smaira
Ankush Gupta
Adrià Recasens Continente
L. Markeeva
...
Y. Aytar
Simon Osindero
Dima Damen
Andrew Zisserman
João Carreira
VLM
130
139
0
23 May 2023
Know Your Self-supervised Learning: A Survey on Image-based Generative and Discriminative Training
Utku Ozbulak
Hyun Jung Lee
Beril Boga
Esla Timothy Anzaku
Ho-min Park
Arnout Van Messem
W. D. Neve
J. Vankerschaver
DiffM
24
36
0
23 May 2023
VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending
Xingjian He
Sihan Chen
Fan Ma
Zhicheng Huang
Xiaojie Jin
Zikang Liu
Dongmei Fu
Yi Yang
J. Liu
Jiashi Feng
VLM
CLIP
18
17
0
22 May 2023
Album Storytelling with Iterative Story-aware Captioning and Large Language Models
Munan Ning
Yujia Xie
Dongdong Chen
Zeyin Song
Lu Yuan
Yonghong Tian
QiXiang Ye
Liuliang Yuan
19
8
0
22 May 2023
Has It All Been Solved? Open NLP Research Questions Not Solved by Large Language Models
Oana Ignat
Zhijing Jin
Artem Abzaliev
Laura Biester
Santiago Castro
...
Verónica Pérez-Rosas
Siqi Shen
Zekun Wang
Winston Wu
Rada Mihalcea
LRM
24
6
0
21 May 2023
i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data
Ziyi Yang
Mahmoud Khademi
Yichong Xu
Reid Pryzant
Yuwei Fang
...
Yu Shi
Lu Yuan
Takuya Yoshioka
Michael Zeng
Xuedong Huang
17
2
0
21 May 2023
Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner
Zikang Liu
Sihan Chen
Longteng Guo
Handong Li
Xingjian He
J. Liu
13
1
0
19 May 2023
VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
Wen Wang
Zhe Chen
Xiaokang Chen
Jiannan Wu
Xizhou Zhu
...
Ping Luo
Tong Lu
Jie Zhou
Yu Qiao
Jifeng Dai
MLLM
VLM
33
454
0
18 May 2023
Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature
Ana Claudia Akemi Matsuki de Faria
Felype de Castro Bastos
Jose Victor Nogueira Alves da Silva
Vitor Lopes Fabris
Valeska Uchôa
Décio Gonçalves de Aguiar Neto
C. F. G. Santos
25
22
0
18 May 2023
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Zhenhailong Wang
Ansel Blume
Sha Li
Genglin Liu
Jaemin Cho
Zineng Tang
Mohit Bansal
Heng Ji
KELM
VGen
17
26
0
18 May 2023
Sequence-to-Sequence Pre-training with Unified Modality Masking for Visual Document Understanding
ShuWei Feng
Tianyang Zhan
Zhanming Jie
Trung Quoc Luong
Xiaoran Jin
13
1
0
16 May 2023
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
Linhui Xiao
Xiaoshan Yang
Fang Peng
Ming Yan
Yaowei Wang
Changsheng Xu
ObjD
VLM
29
30
0
15 May 2023
Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
Haixin Wang
Xinlong Yang
Jianlong Chang
Di Jin
Jinan Sun
Shikun Zhang
Xiao Luo
Qi Tian
22
22
0
15 May 2023
Simple Token-Level Confidence Improves Caption Correctness
Suzanne Petryk
Spencer Whitehead
Joseph E. Gonzalez
Trevor Darrell
Anna Rohrbach
Marcus Rohrbach
18
7
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Mohit Bansal
36
129
0
11 May 2023
Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception
Hassan Akbari
Dan Kondratyuk
Yin Cui
Rachel Hornung
H. Wang
Hartwig Adam
VLM
MoE
20
11
0
10 May 2023
Multi-Prompt with Depth Partitioned Cross-Modal Learning
Yingjie Tian
Yiqi Wang
Xianda Guo
Zheng Hua Zhu
Long Chen
VLM
18
0
0
10 May 2023
Visual Tuning
Bruce X. B. Yu
Jianlong Chang
Haixin Wang
Lin Liu
Shijie Wang
...
Lingxi Xie
Haojie Li
Zhouchen Lin
Qi Tian
Chang Wen Chen
VLM
39
38
0
10 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
12
21
0
06 May 2023
LMEye: An Interactive Perception Network for Large Language Models
Yunxin Li
Baotian Hu
Xinyu Chen
Lin Ma
Yong-mei Xu
M. Zhang
MLLM
VLM
25
24
0
05 May 2023
Multimodal Neural Databases
Giovanni Trappolini
Andrea Santilli
Emanuele Rodolà
A. Halevy
Fabrizio Silvestri
38
10
0
02 May 2023
VPGTrans: Transfer Visual Prompt Generator across LLMs
Ao Zhang
Hao Fei
Yuan Yao
Wei Ji
Li Li
Zhiyuan Liu
Tat-Seng Chua
MLLM
VLM
27
85
0
02 May 2023
ArK: Augmented Reality with Knowledge Interactive Emergent Ability
Qiuyuan Huang
J. Park
Abhinav Gupta
Paul N. Bennett
Ran Gong
...
Baolin Peng
O. Mohammed
C. Pal
Yejin Choi
Jianfeng Gao
73
6
0
01 May 2023
Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond
Jingfeng Yang
Hongye Jin
Ruixiang Tang
Xiaotian Han
Qizhang Feng
Haoming Jiang
Bing Yin
Xia Hu
LM&MA
125
614
0
26 Apr 2023
A Strong and Reproducible Object Detector with Only Public Datasets
Tianhe Ren
Jianwei Yang
Siyi Liu
Ailing Zeng
Feng Li
Hao Zhang
Hongyang Li
Zhaoyang Zeng
Lei Zhang
ObjD
28
11
0
25 Apr 2023
A Cookbook of Self-Supervised Learning
Randall Balestriero
Mark Ibrahim
Vlad Sobal
Ari S. Morcos
Shashank Shekhar
...
Pierre Fernandez
Amir Bar
Hamed Pirsiavash
Yann LeCun
Micah Goldblum
SyDa
FedML
SSL
31
272
0
24 Apr 2023