ResearchTrend.AI
arXiv: 2208.10442
Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks

22 August 2022
Wenhui Wang
Hangbo Bao
Li Dong
Johan Bjorck
Zhiliang Peng
Qiang Liu
Kriti Aggarwal
O. Mohammed
Saksham Singhal
Subhojit Som
Furu Wei
    MLLM
    VLM
    ViT

Papers citing "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks"

50 / 458 papers shown
On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
Agneet Chatterjee
Tejas Gokhale
Chitta Baral
Yezhou Yang
VLM
25
2
0
12 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
42
25
0
10 Apr 2024
Monocular 3D lane detection for Autonomous Driving: Recent Achievements, Challenges, and Outlooks
Fulong Ma
Weiqing Qi
Guoyang Zhao
Linwei Zheng
Sheng Wang
Yuxuan Liu
Ming-Yu Liu
68
9
0
10 Apr 2024
CoMat: Aligning Text-to-Image Diffusion Model with Image-to-Text Concept Matching
Dongzhi Jiang
Guanglu Song
Xiaoshi Wu
Renrui Zhang
Dazhong Shen
Zhuofan Zong
Yu Liu
Hongsheng Li
VLM
30
20
0
04 Apr 2024
m3P: Towards Multimodal Multilingual Translation with Multimodal Prompt
Jian Yang
Hongcheng Guo
Yuwei Yin
Jiaqi Bai
Bing Wang
Jiaheng Liu
Xinnian Liang
Linzheng Chai
Liqun Yang
Zhoujun Li
33
9
0
26 Mar 2024
VidLA: Video-Language Alignment at Scale
Mamshad Nayeem Rizve
Fan Fei
Jayakrishnan Unnikrishnan
Son Tran
Benjamin Z. Yao
Belinda Zeng
Mubarak Shah
Trishul M. Chilimbi
VLM
AI4TS
43
4
0
21 Mar 2024
Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering
Bowen Jiang
Zhijun Zhuang
Shreyas S. Shivakumar
Dan Roth
Camillo J. Taylor
LLMAG
34
2
0
21 Mar 2024
Knowledge Condensation and Reasoning for Knowledge-based VQA
Dongze Hao
Jian Jia
Longteng Guo
Qunbo Wang
Te Yang
...
Yanhua Cheng
Bo Wang
Quan Chen
Han Li
Jing Liu
29
0
0
15 Mar 2024
OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning
Lingyi Hong
Shilin Yan
Renrui Zhang
Wanyun Li
Xinyu Zhou
...
Kaixun Jiang
Yiting Chen
Jinglun Li
Zhaoyu Chen
Wenqiang Zhang
VLM
32
37
0
14 Mar 2024
Borrowing Treasures from Neighbors: In-Context Learning for Multimodal Learning with Missing Modalities and Data Scarcity
Zhuo Zhi
Ziquan Liu
M. Elbadawi
Adam Daneshmend
Mine Orlu
Abdul Basit
Andreas Demosthenous
Miguel R. D. Rodrigues
24
2
0
14 Mar 2024
GiT: Towards Generalist Vision Transformer through Universal Language Interface
Haiyang Wang
Hao Tang
Li Jiang
Shaoshuai Shi
Muhammad Ferjad Naeem
Hongsheng Li
Bernt Schiele
Liwei Wang
VLM
30
10
0
14 Mar 2024
Understanding and Mitigating Human-Labelling Errors in Supervised Contrastive Learning
Zijun Long
Lipeng Zhuang
George Killick
R. McCreadie
Gerardo Aragon Camarasa
Paul Henderson
NoLa
20
1
0
10 Mar 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM
CLIP
62
12
0
05 Mar 2024
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
Haowei Liu
Yaya Shi
Haiyang Xu
Chunfen Yuan
Qinghao Ye
...
Mingshi Yan
Ji Zhang
Fei Huang
Bing Li
Weiming Hu
VLM
27
0
0
01 Mar 2024
CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition
Feng Lu
Xiangyuan Lan
Lijun Zhang
Dongmei Jiang
Yaowei Wang
Chun Yuan
42
29
0
29 Feb 2024
Self-Guided Masked Autoencoders for Domain-Agnostic Self-Supervised Learning
Johnathan Xie
Yoonho Lee
Annie S. Chen
Chelsea Finn
25
3
0
22 Feb 2024
CLCE: An Approach to Refining Cross-Entropy and Contrastive Learning for Optimized Learning Fusion
Zijun Long
George Killick
Lipeng Zhuang
Gerardo Aragon Camarasa
Zaiqiao Meng
R. McCreadie
VLM
37
2
0
22 Feb 2024
Towards Seamless Adaptation of Pre-trained Models for Visual Place Recognition
Feng Lu
Lijun Zhang
Xiangyuan Lan
Shuting Dong
Yaowei Wang
Chun Yuan
40
28
0
22 Feb 2024
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
Akash Ghosh
Arkadeep Acharya
Sriparna Saha
Vinija Jain
Aman Chadha
VLM
41
25
0
20 Feb 2024
MobileVLM V2: Faster and Stronger Baseline for Vision Language Model
Xiangxiang Chu
Limeng Qiao
Xinyu Zhang
Shuang Xu
Fei Wei
...
Xiaofei Sun
Yiming Hu
Xinyang Lin
Bo-Wen Zhang
Chunhua Shen
VLM
MLLM
22
94
0
06 Feb 2024
MoE-LLaVA: Mixture of Experts for Large Vision-Language Models
Bin Lin
Zhenyu Tang
Yang Ye
Jiaxi Cui
Bin Zhu
...
Jinfa Huang
Junwu Zhang
Yatian Pang
Munan Ning
Li-ming Yuan
VLM
MLLM
MoE
33
152
0
29 Jan 2024
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Qingpei Guo
Furong Xu
Hanxiao Zhang
Wang Ren
Ziping Ma
Lin Ju
Jian Wang
Jingdong Chen
Ming Yang
VLM
MLLM
25
2
0
29 Jan 2024
GeoDecoder: Empowering Multimodal Map Understanding
Feng Qi
Mian Dai
Zixian Zheng
Chao Wang
20
1
0
26 Jan 2024
Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities
Yiyuan Zhang
Xiaohan Ding
Kaixiong Gong
Yixiao Ge
Ying Shan
Xiangyu Yue
ViT
16
7
0
25 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL
LRM
37
175
0
24 Jan 2024
The Curious Case of Nonverbal Abstract Reasoning with Multi-Modal Large Language Models
Kian Ahrabian
Zhivar Sourati
Kexuan Sun
Jiarui Zhang
Yifan Jiang
Fred Morstatter
Jay Pujara
LRM
26
9
0
22 Jan 2024
DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval
Xiangpeng Yang
Linchao Zhu
Xiaohan Wang
Yi Yang
VLM
21
23
0
19 Jan 2024
Seek for Incantations: Towards Accurate Text-to-Image Diffusion Synthesis through Prompt Engineering
Chang Yu
Junran Peng
Xiangyu Zhu
Zhaoxiang Zhang
Qi Tian
Zhen Lei
DiffM
22
4
0
12 Jan 2024
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma
Furong Xu
Jian Liu
Ming Yang
Qingpei Guo
VLM
34
3
0
04 Jan 2024
Few-shot Adaptation of Multi-modal Foundation Models: A Survey
Fan Liu
Tianshu Zhang
Wenwen Dai
Wenwen Cai
Xiaocong Zhou
Delong Chen
VLM
OffRL
20
22
0
03 Jan 2024
Masked Modeling for Self-supervised Representation Learning on Vision and Beyond
Siyuan Li
Luyuan Zhang
Zedong Wang
Di Wu
Lirong Wu
...
Jun-Xiong Xia
Cheng Tan
Yang Liu
Baigui Sun
Stan Z. Li
SSL
29
13
0
31 Dec 2023
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu
Yi-Xin Jiang
Bin Yan
Huchuan Lu
Zehuan Yuan
Ping Luo
VOS
27
17
0
25 Dec 2023
GeomVerse: A Systematic Evaluation of Large Models for Geometric Reasoning
Mehran Kazemi
Hamidreza Alvari
Ankit Anand
Jialin Wu
Xi Chen
Radu Soricut
LRM
ReLM
20
53
0
19 Dec 2023
Mask Grounding for Referring Image Segmentation
Yong Xien Chng
Henry Zheng
Yizeng Han
Xuchong Qiu
Gao Huang
ISeg
ObjD
22
15
0
19 Dec 2023
HEAR: Hearing Enhanced Audio Response for Video-grounded Dialogue
Sunjae Yoon
Dahyun Kim
Eunseop Yoon
Hee Suk Yoon
Junyeong Kim
C. Yoo
37
6
0
15 Dec 2023
VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation
Jinguo Zhu
Xiaohan Ding
Yixiao Ge
Yuying Ge
Sijie Zhao
Hengshuang Zhao
Xiaohua Wang
Ying Shan
ViT
VLM
11
32
0
14 Dec 2023
Pixel Aligned Language Models
Jiarui Xu
Xingyi Zhou
Shen Yan
Xiuye Gu
Anurag Arnab
Chen Sun
Xiaolong Wang
Cordelia Schmid
MLLM
VLM
43
14
0
14 Dec 2023
General Object Foundation Model for Images and Videos at Scale
Junfeng Wu
Yi-Xin Jiang
Qihao Liu
Zehuan Yuan
Xiang Bai
Song Bai
VOS
VLM
25
38
0
14 Dec 2023
ViLA: Efficient Video-Language Alignment for Video Question Answering
Xijun Wang
Junbang Liang
Chun-Kai Wang
Kenan Deng
Yu Lou
Ming-Chyuan Lin
Shan Yang
24
13
0
13 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedIm
LM&MA
26
20
0
13 Dec 2023
ProxyDet: Synthesizing Proxy Novel Classes via Classwise Mixup for Open-Vocabulary Object Detection
Joonhyun Jeong
Geondo Park
Jayeon Yoo
Hyungsik Jung
Heesu Kim
VLM
ObjD
35
10
0
12 Dec 2023
4M: Massively Multimodal Masked Modeling
David Mizrahi
Roman Bachmann
Oğuzhan Fatih Kar
Teresa Yeo
Mingfei Gao
Afshin Dehghan
Amir Zamir
MLLM
39
62
0
11 Dec 2023
Transformer as Linear Expansion of Learngene
Shiyu Xia
Miaosen Zhang
Xu Yang
Ruiming Chen
Haokun Chen
Xin Geng
33
6
0
09 Dec 2023
User-Aware Prefix-Tuning is a Good Learner for Personalized Image Captioning
Xuan Wang
Guanhong Wang
Wenhao Chai
Jiayu Zhou
Gaoang Wang
27
4
0
08 Dec 2023
SYNC-CLIP: Synthetic Data Make CLIP Generalize Better in Data-Limited Scenarios
Mushui Liu
Weijie He
Ziqian Lu
Yunlong Yu
VLM
22
1
0
06 Dec 2023
Towards More Unified In-context Visual Understanding
Dianmo Sheng
Dongdong Chen
Zhentao Tan
Qiankun Liu
Qi Chu
Jianmin Bao
Tao Gong
Bin Liu
Shengwei Xu
Nenghai Yu
MLLM
VLM
24
10
0
05 Dec 2023
Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey
Shengchao Chen
Guodong Long
Jing Jiang
Dikai Liu
Chengqi Zhang
SyDa
AI4CE
26
23
0
05 Dec 2023
Rejuvenating image-GPT as Strong Visual Representation Learners
Sucheng Ren
Zeyu Wang
Hongru Zhu
Junfei Xiao
Alan L. Yuille
Cihang Xie
VLM
42
7
0
04 Dec 2023
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
Jialin Wu
Xia Hu
Yaqing Wang
Bo Pang
Radu Soricut
MoE
9
14
0
01 Dec 2023
VIoTGPT: Learning to Schedule Vision Tools towards Intelligent Video Internet of Things
Yaoyao Zhong
Mengshi Qi
Rui Wang
Yuhan Qiu
Yang Zhang
Huadong Ma
13
2
0
01 Dec 2023