ResearchTrend.AI
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

22 August 2022
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
Qiang Liu
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
    MLLM
    VLM
    ViT

Papers citing "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks"

50 / 458 papers shown
Dataset Quantization
Daquan Zhou
Kaixin Wang
Jianyang Gu
Xiang Peng
Dongze Lian
Yifan Zhang
Yang You
Jiashi Feng
DD
21
37
0
21 Aug 2023
VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control
Zi-Yuan Hu
Yanyang Li
M. Lyu
Liwei Wang
VLM
19
15
0
18 Aug 2023
RLIPv2: Fast Scaling of Relational Language-Image Pre-training
Hangjie Yuan
Shiwei Zhang
Xiang Wang
Samuel Albanie
Yining Pan
Tao Feng
Jianwen Jiang
Dong Ni
Yingya Zhang
Deli Zhao
VLM
14
37
0
18 Aug 2023
GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text
Peng Liu
Yiming Ren
Jun Tao
Zhixiang Ren
AI4CE
25
78
0
14 Aug 2023
Temporally-Adaptive Models for Efficient Video Understanding
Ziyuan Huang
Shiwei Zhang
Liang Pan
Zhiwu Qing
Yingya Zhang
Ziwei Liu
Marcelo H. Ang
25
9
0
10 Aug 2023
Beyond Semantics: Learning a Behavior Augmented Relevance Model with Self-supervised Learning
Ze-jie Chen
Wei-Neng Chen
Jia Xu
Zhongyi Liu
Wei Zhang
RALM
21
4
0
10 Aug 2023
MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation
Kaixin Cai
Pengzhen Ren
Yi Zhu
Hang Xu
Jian-zhuo Liu
Changlin Li
Guangrun Wang
Xiaodan Liang
VLM
22
14
0
09 Aug 2023
3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment
Ziyu Zhu
Xiaojian Ma
Yixin Chen
Zhidong Deng
Siyuan Huang
Qing Li
LM&Ro
26
100
0
08 Aug 2023
Distributionally Robust Classification on a Data Budget
Ben Feuer
Ameya Joshi
Minh Pham
C. Hegde
OOD
22
2
0
07 Aug 2023
ConvFormer: Revisiting Transformer for Sequential User Modeling
Hao Wang
Jianxun Lian
M. Wu
Haoxuan Li
Jiajun Fan
Wanyue Xu
Chaozhuo Li
Xing Xie
17
3
0
05 Aug 2023
The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World
Weiyun Wang
Min Shi
Qingyun Li
Wen Wang
Zhenhang Huang
...
Zhiguo Cao
Yushi Chen
Tong Lu
Jifeng Dai
Yu Qiao
LRM
MLLM
33
83
0
03 Aug 2023
Multimodal Adaptation of CLIP for Few-Shot Action Recognition
Jiazheng Xing
Mengmeng Wang
Xiaojun Hou
Guangwen Dai
Jingdong Wang
Yong-Jin Liu
VLM
15
0
0
03 Aug 2023
UnIVAL: Unified Model for Image, Video, Audio and Language Tasks
Mustafa Shukor
Corentin Dancette
Alexandre Ramé
Matthieu Cord
MoMe
MLLM
36
42
0
30 Jul 2023
UniBriVL: Robust Universal Representation and Generation of Audio Driven Diffusion Models
Sen Fang
Bowen Gao
Yangjian Wu
T. Teoh
DiffM
18
1
0
29 Jul 2023
Cross-Modal Concept Learning and Inference for Vision-Language Models
Yi Zhang
Ce Zhang
Yushun Tang
Z. He
VLM
MLLM
CLIP
23
15
0
28 Jul 2023
BARTPhoBEiT: Pre-trained Sequence-to-Sequence and Image Transformers Models for Vietnamese Visual Question Answering
Khiem Vinh Tran
Kiet Van Nguyen
N. Nguyen
ViT
20
2
0
28 Jul 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
18
117
0
25 Jul 2023
MIMONet: Multi-Input Multi-Output On-Device Deep Learning
Zexin Li
Xiaoxi He
Yufei Li
Shahab Nikkhoo
Wei Yang
Lothar Thiele
Cong Liu
33
5
0
22 Jul 2023
UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models
Xin Li
Sima Behpour
T. Doan
Wenbin He
Liangke Gou
Liu Ren
VLM
16
3
0
20 Jul 2023
Meta-Transformer: A Unified Framework for Multimodal Learning
Yiyuan Zhang
Kaixiong Gong
Kaipeng Zhang
Hongsheng Li
Yu Qiao
Wanli Ouyang
Xiangyu Yue
19
136
0
20 Jul 2023
DVPT: Dynamic Visual Prompt Tuning of Large Pre-trained Models for Medical Image Analysis
Along He
Kai Wang
Zhihong Wang
Tao Li
H. Fu
MedIm
25
2
0
19 Jul 2023
Mining of Single-Class by Active Learning for Semantic Segmentation
Hugues Lambert
E. Slade
CLL
VLM
11
0
0
18 Jul 2023
Deficiency-Aware Masked Transformer for Video Inpainting
Yongsheng Yu
Hengrui Fan
Libo Zhang
VGen
19
9
0
17 Jul 2023
PiTL: Cross-modal Retrieval with Weakly-supervised Vision-language Pre-training via Prompting
Zixin Guo
T. Wang
Selen Pehlivan
Abduljalil Radman
Jorma T. Laaksonen
VLM
25
2
0
14 Jul 2023
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai
Fang Shao
Qingkun Su
Zilong Dong
Siyu Zhu
167
1
0
14 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
19
25
0
13 Jul 2023
mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs
Gregor Geigle
Abhay Jain
Radu Timofte
Goran Glavaš
VLM
MLLM
13
29
0
13 Jul 2023
Multimodal Molecular Pretraining via Modality Blending
Qiying Yu
Yudi Zhang
Yuyan Ni
Shi Feng
Yanyan Lan
Hao Zhou
Jingjing Liu
21
12
0
12 Jul 2023
EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone
Shraman Pramanick
Yale Song
Sayan Nag
Kevin Qinghong Lin
Hardik Shah
Mike Zheng Shou
Ramalingam Chellappa
Pengchuan Zhang
VLM
31
86
0
11 Jul 2023
All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment
Chunhui Zhang
Xin Sun
Li Liu
Yiqian Yang
Qiong Liu
Xiaoping Zhou
Yanfeng Wang
33
15
0
07 Jul 2023
VideoGLUE: Video General Understanding Evaluation of Foundation Models
Liangzhe Yuan
N. B. Gundavarapu
Long Zhao
Hao Zhou
Yin Cui
...
Florian Schroff
Hartwig Adam
Ming Yang
Ting Liu
Boqing Gong
ELM
32
9
0
06 Jul 2023
Distilling Large Vision-Language Model with Out-of-Distribution Generalizability
Xuanlin Li
Yunhao Fang
Minghua Liu
Z. Ling
Z. Tu
Haoran Su
VLM
28
23
0
06 Jul 2023
Benchmarking Zero-Shot Recognition with Vision-Language Models: Challenges on Granularity and Specificity
Zhenlin Xu
Yi Zhu
Tiffany Deng
Abhay Mittal
Yanbei Chen
Manchen Wang
Paolo Favaro
Joseph Tighe
Davide Modolo
VLM
CoGe
11
7
0
28 Jun 2023
When Foundation Model Meets Federated Learning: Motivations, Challenges, and Future Directions
Weiming Zhuang
Chen Chen
Lingjuan Lyu
C. L. P. Chen
Yaochu Jin
Lingjuan Lyu
AIFin
AI4CE
86
85
0
27 Jun 2023
MotionGPT: Human Motion as a Foreign Language
Biao Jiang
Xin Chen
Wen Liu
Jingyi Yu
Gang Yu
Tao Chen
MLLM
9
266
0
26 Jun 2023
SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality
Cheng-Yu Hsieh
Jieyu Zhang
Zixian Ma
Aniruddha Kembhavi
Ranjay Krishna
CoGe
35
114
0
26 Jun 2023
Large Sequence Models for Sequential Decision-Making: A Survey
Muning Wen
Runji Lin
Hanjing Wang
Yaodong Yang
Ying Wen
Luo Mai
J. Wang
Haifeng Zhang
Weinan Zhang
LM&Ro
LRM
29
34
0
24 Jun 2023
Generative Multimodal Entity Linking
Senbao Shi
Zhenran Xu
Baotian Hu
M. Zhang
MLLM
VLM
19
5
0
22 Jun 2023
OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents
Hugo Laurençon
Lucile Saulnier
Léo Tronchon
Stas Bekman
Amanpreet Singh
...
Siddharth Karamcheti
Alexander M. Rush
Douwe Kiela
Matthieu Cord
Victor Sanh
25
227
0
21 Jun 2023
RS5M and GeoRSCLIP: A Large Scale Vision-Language Dataset and A Large Vision-Language Model for Remote Sensing
Zilun Zhang
Tiancheng Zhao
Yulong Guo
Jianwei Yin
DiffM
VLM
27
52
0
20 Jun 2023
LabelBench: A Comprehensive Framework for Benchmarking Adaptive Label-Efficient Learning
Jifan Zhang
Yifang Chen
Gregory H. Canal
Stephen Mussmann
Arnav M. Das
...
Yinglun Zhu
Jeffrey Bilmes
S. Du
Kevin G. Jamieson
Robert D. Nowak
VLM
33
10
0
16 Jun 2023
Parameter-efficient is not sufficient: Exploring Parameter, Memory, and Time Efficient Adapter Tuning for Dense Predictions
Dongshuo Yin
Xueting Han
Bin Li
Hao Feng
Jinghua Bai
VPVLM
26
16
0
16 Jun 2023
Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering
Rabiul Awal
Le Zhang
Aishwarya Agrawal
LRM
38
12
0
16 Jun 2023
Transferring Knowledge for Food Image Segmentation using Transformers and Convolutions
Grant Sinha
Krishna Parmar
Hilda Azimi
Chi-en Amy Tai
Yuhao Chen
A. Wong
Pengcheng Xi
ViT
26
4
0
15 Jun 2023
Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration
Chenyang Lyu
Minghao Wu
Longyue Wang
Xinting Huang
Bingshuai Liu
Zefeng Du
Shuming Shi
Zhaopeng Tu
MLLM
AuLLM
29
160
0
15 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
J. Liu
VLM
CLIP
22
8
0
15 Jun 2023
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Aishwarya Agrawal
CoGe
VLM
31
9
0
15 Jun 2023
ZeroForge: Feedforward Text-to-Shape Without 3D Supervision
Kelly O. Marshall
Minh Pham
Ameya Joshi
Anushrut Jignasu
Aditya Balu
Adarsh Krishnamurthy
A. Hegde
CLIP
18
3
0
14 Jun 2023
Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images
Ming Y. Lu
Bowen Chen
Andrew Zhang
Drew F. K. Williamson
Richard J. Chen
Tong Ding
L. Le
Yung-Sung Chuang
Faisal Mahmood
VLM
MedIm
25
96
0
13 Jun 2023
A Survey of Vision-Language Pre-training from the Lens of Multimodal Machine Translation
Jeremy Gwinnup
Kevin Duh
VLM
12
3
0
12 Jun 2023