2208.10442
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks
22 August 2022
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
Qiang Liu
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
MLLM
VLM
ViT
Papers citing "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks"
50 / 458 papers shown
Title
VIDiff: Translating Videos via Multi-Modal Instructions with Diffusion Models
Zhen Xing
Qi Dai
Zihao Zhang
Hui Zhang
Hang-Rui Hu
Zuxuan Wu
Yu-Gang Jiang
VGen
33
17
0
30 Nov 2023
ShapeGPT: 3D Shape Generation with A Unified Multi-modal Language Model
Fukun Yin
Xin Chen
C. Zhang
Biao Jiang
Zibo Zhao
Jiayuan Fan
Gang Yu
Taihao Li
Tao Chen
21
19
0
29 Nov 2023
Elucidating and Overcoming the Challenges of Label Noise in Supervised Contrastive Learning
Zijun Long
George Killick
Lipeng Zhuang
R. McCreadie
Gerardo Aragon Camarasa
Paul Henderson
20
5
0
25 Nov 2023
Robot Learning in the Era of Foundation Models: A Survey
Xuan Xiao
Jiahang Liu
Zhipeng Wang
Yanmin Zhou
Yong Qi
Qian Cheng
Bin He
Shuo Jiang
AI4CE
LM&Ro
16
26
0
24 Nov 2023
Invisible Relevance Bias: Text-Image Retrieval Models Prefer AI-Generated Images
Shicheng Xu
Danyang Hou
Liang Pang
Jingcheng Deng
Jun Xu
Huawei Shen
Xueqi Cheng
16
8
0
23 Nov 2023
De-fine: Decomposing and Refining Visual Programs with Auto-Feedback
Minghe Gao
Juncheng Li
Hao Fei
Liang Pang
Wei Ji
Guoming Wang
Wenqiao Zhang
Siliang Tang
Yueting Zhuang
21
8
0
21 Nov 2023
Deep Tensor Network
Yifan Zhang
16
0
0
18 Nov 2023
Leveraging Foundation Models to Improve Lightweight Clients in Federated Learning
Xidong Wu
Wan-Yi Lin
Devin Willmott
Filipe Condessa
Yufei Huang
Zhenzhen Li
Madan Ravi Ganesh
FedML
21
4
0
14 Nov 2023
Mirasol3B: A Multimodal Autoregressive model for time-aligned and contextual modalities
A. Piergiovanni
Isaac Noble
Dahun Kim
Michael S. Ryoo
Victor Gomes
A. Angelova
33
19
0
09 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
35
36
0
01 Nov 2023
Language Guided Visual Question Answering: Elevate Your Multimodal Language Model Using Knowledge-Enriched Prompts
Deepanway Ghosal
Navonil Majumder
Roy Ka-Wei Lee
Rada Mihalcea
Soujanya Poria
30
7
0
31 Oct 2023
Res-Tuning: A Flexible and Efficient Tuning Paradigm via Unbinding Tuner from Backbone
Zeyinzi Jiang
Chaojie Mao
Ziyuan Huang
Ao Ma
Yiliang Lv
Yujun Shen
Deli Zhao
Jingren Zhou
22
15
0
30 Oct 2023
Intra-Modal Proxy Learning for Zero-Shot Visual Categorization with CLIP
Qi Qian
Yuanhong Xu
Juhua Hu
VLM
CLIP
21
16
0
30 Oct 2023
Generating Context-Aware Natural Answers for Questions in 3D Scenes
Mohammed Munzer Dwedari
Matthias Niessner
Dave Zhenyu Chen
22
1
0
30 Oct 2023
Entity Embeddings : Perspectives Towards an Omni-Modality Era for Large Language Models
Eren Unlu
Unver Ciftci
28
0
0
27 Oct 2023
Text Augmented Spatial-aware Zero-shot Referring Image Segmentation
Yuchen Suo
Linchao Zhu
Yi Yang
21
12
0
27 Oct 2023
Modality-Agnostic Self-Supervised Learning with Meta-Learned Masked Auto-Encoder
Huiwon Jang
Jihoon Tack
Daewon Choi
Jongheon Jeong
Jinwoo Shin
11
2
0
25 Oct 2023
Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs
Jiarui Zhang
Mahyar Khayatkhoei
P. Chhikara
Filip Ilievski
24
2
0
24 Oct 2023
Non-Intrusive Adaptation: Input-Centric Parameter-efficient Fine-Tuning for Versatile Multimodal Modeling
Yaqing Wang
Jialin Wu
T. Dabral
Jiageng Zhang
Geoff Brown
...
Frederick Liu
Yi Liang
Bo Pang
Michael Bendersky
Radu Soricut
VLM
15
14
0
18 Oct 2023
Image Clustering with External Guidance
Yunfan Li
Peng Hu
Dezhong Peng
Jiancheng Lv
Jianping Fan
Xi Peng
15
10
0
18 Oct 2023
RoboLLM: Robotic Vision Tasks Grounded on Multimodal Large Language Models
Zijun Long
George Killick
R. McCreadie
Gerardo Aragon Camarasa
VLM
22
11
0
16 Oct 2023
Mastering Robot Manipulation with Multimodal Prompts through Pretraining and Multi-task Fine-tuning
Jiachen Li
Qiaozi Gao
Michael Johnston
Xiaofeng Gao
Xuehai He
Suhaila Shakiah
Hangjie Shi
R. Ghanadan
William Yang Wang
LM&Ro
19
12
0
14 Oct 2023
PaLI-3 Vision Language Models: Smaller, Faster, Stronger
Xi Chen
Xiao Wang
Lucas Beyer
Alexander Kolesnikov
Jialin Wu
...
Keran Rong
Tianli Yu
Daniel Keysers
Xiao-Qi Zhai
Radu Soricut
MLLM
VLM
30
92
0
13 Oct 2023
EasyGen: Easing Multimodal Generation with BiDiffuser and LLMs
Xiangyu Zhao
Bo Liu
Qijiong Liu
Guangyuan Shi
Xiao-Ming Wu
VLM
DiffM
21
7
0
13 Oct 2023
IMITATE: Clinical Prior Guided Hierarchical Vision-Language Pre-training
Che Liu
Sibo Cheng
Miaojing Shi
Anand Shah
Wenjia Bai
Rossella Arcucci
22
26
0
11 Oct 2023
Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen
Shuai Zhang
Boran Han
Jiaya Jia
13
2
0
08 Oct 2023
Assessing Large Language Models on Climate Information
Jannis Bulian
Mike S. Schäfer
Afra Amini
Heidi Lam
Massimiliano Ciaramita
...
Michelle Chen Huebscher
Christian Buck
Niels G. Mede
Markus Leippold
Nadine Strauss
ELM
12
19
0
04 Oct 2023
Large Scale Masked Autoencoding for Reducing Label Requirements on SAR Data
Matt Allen
Francisco Dorr
Joseph A. Gallego-Mejia
Laura Martínez-Ferrer
Anna Jungbluth
F. Kalaitzis
Raúl Ramos-Pollán
17
9
0
02 Oct 2023
Beyond Task Performance: Evaluating and Reducing the Flaws of Large Multimodal Models with In-Context Learning
Mustafa Shukor
Alexandre Ramé
Corentin Dancette
Matthieu Cord
LRM
MLLM
33
20
0
01 Oct 2023
Self-Supervised Open-Ended Classification with Small Visual Language Models
Mohammad Mahdi Derakhshani
Ivona Najdenkoska
Cees G. M. Snoek
M. Worring
Yuki M. Asano
VLM
14
0
0
30 Sep 2023
InstructCV: Instruction-Tuned Text-to-Image Diffusion Models as Vision Generalists
Yulu Gan
Sungwoo Park
Alexander Schubert
Anthony Philippakis
Ahmed Alaa
VLM
17
22
0
30 Sep 2023
DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention
Z. Yao
Xiaoxia Wu
Conglong Li
Minjia Zhang
Heyang Qi
Olatunji Ruwase
A. A. Awan
Samyam Rajbhandari
Yuxiong He
26
11
0
25 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai Le-Duc
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
21
5
0
23 Sep 2023
TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance
Kan Wu
Houwen Peng
Zhenghong Zhou
Bin Xiao
Mengchen Liu
...
Xi Chen
Xinggang Wang
Hongyang Chao
Han Hu
VLM
OODD
26
53
0
21 Sep 2023
Sentence Attention Blocks for Answer Grounding
Seyedalireza Khoshsirat
Chandra Kambhamettu
31
7
0
20 Sep 2023
Empowering Visually Impaired Individuals: A Novel Use of Apple Live Photos and Android Motion Photos
Seyedalireza Khoshsirat
Chandra Kambhamettu
23
9
0
14 Sep 2023
InstructDiffusion: A Generalist Modeling Interface for Vision Tasks
Zigang Geng
Binxin Yang
Tiankai Hang
Chen Li
Shuyang Gu
...
Jianmin Bao
Zheng-Wei Zhang
Han Hu
Dongdong Chen
Baining Guo
DiffM
VLM
38
92
0
07 Sep 2023
NICE: CVPR 2023 Challenge on Zero-shot Image Captioning
Taehoon Kim
Pyunghwan Ahn
Sangyun Kim
Sihaeng Lee
Mark A Marsden
...
Yujin Wang
Yimu Wang
Tiancheng Gu
Xingchang Lv
Mingmao Sun
VLM
12
4
0
05 Sep 2023
MultiWay-Adapater: Adapting large-scale multi-modal models for scalable image-text retrieval
Zijun Long
George Killick
R. McCreadie
Gerardo Aragon Camarasa
14
2
0
04 Sep 2023
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
Zhiyin Shao
Xinyu Zhang
Changxing Ding
Jian Wang
Jingdong Wang
22
17
0
04 Sep 2023
BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning
Yi Zhang
Ce Zhang
Zihan Liao
Yushun Tang
Zhihai He
BDL
VLM
16
10
0
03 Sep 2023
MAGMA: Music Aligned Generative Motion Autodecoder
Sohan Anisetty
Amit Raj
James Hays
24
0
0
03 Sep 2023
RevColV2: Exploring Disentangled Representations in Masked Image Modeling
Qi Han
Yuxuan Cai
Xiangyu Zhang
25
7
0
02 Sep 2023
ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation
Weihan Wang
Z. Yang
Bin Xu
Juanzi Li
Yankui Sun
VLM
20
8
0
31 Aug 2023
InstaTune: Instantaneous Neural Architecture Search During Fine-Tuning
S. N. Sridhar
Souvik Kundu
Sairam Sundaresan
Maciej Szankin
Anthony Sarah
14
3
0
29 Aug 2023
When hard negative sampling meets supervised contrastive learning
Zijun Long
George Killick
R. McCreadie
Gerardo Aragon Camarasa
Zaiqiao Meng
SSL
18
3
0
28 Aug 2023
Computation-efficient Deep Learning for Computer Vision: A Survey
Yulin Wang
Yizeng Han
Chaofei Wang
Shiji Song
Qi Tian
Gao Huang
VLM
26
20
0
27 Aug 2023
MLLM-DataEngine: An Iterative Refinement Approach for MLLM
Zhiyuan Zhao
Linke Ouyang
Bin Wang
Siyuan Huang
Pan Zhang
Xiao-wen Dong
Jiaqi Wang
Conghui He
MLLM
21
5
0
25 Aug 2023
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai
Shuai Bai
Shusheng Yang
Shijie Wang
Sinan Tan
Peng Wang
Junyang Lin
Chang Zhou
Jingren Zhou
MLLM
VLM
ObjD
43
790
0
24 Aug 2023
DLIP: Distilling Language-Image Pre-training
Huafeng Kuang
Jie Wu
Xiawu Zheng
Ming Li
Xuefeng Xiao
Rui Wang
Min Zheng
Rongrong Ji
VLM
36
4
0
24 Aug 2023