CLIPPO: Image-and-Language Understanding from Pixels Only

Computer Vision and Pattern Recognition (CVPR), 2022
15 December 2022
Michael Tschannen
Basil Mustafa
N. Houlsby
CLIP VLM

Papers citing "CLIPPO: Image-and-Language Understanding from Pixels Only"

44 / 44 papers shown
See the Text: From Tokenization to Visual Reading
Ling Xing
Alex Jinpeng Wang
Rui Yan
Hongyu Qu
Zechao Li
VLM
191
6
0
27 Mar 2026
Turning Adversaries into Allies: Reversing Typographic Attacks for Multimodal E-Commerce Product Retrieval
Janet Jenq
Hongda Shen
151
0
0
07 Nov 2025
Exploring a Unified Vision-Centric Contrastive Alternatives on Multi-Modal Web Documents
Yiqi Lin
Alex Jinpeng Wang
Linjie Li
Zhengyuan Yang
Mike Zheng Shou
175
1
0
21 Oct 2025
Self-Supervised Cross-Modal Learning for Image-to-Point Cloud Registration
Xingmei Wang
Xiaoyu Hu
Chengkai Huang
Ziyan Zeng
Guohao Nie
Quan Z. Sheng
L. Yao
3DPC
154
1
0
19 Sep 2025
Dual Prompt Learning for Adapting Vision-Language Models to Downstream Image-Text Retrieval
Y. Wang
Tao Wang
Chenwei Tang
Caiyang Yu
Zhengqing Zang
Mengmi Zhang
Shudong Huang
Jiancheng Lv
VLM
140
0
0
06 Aug 2025
Parameter-Efficient Single Collaborative Branch for Recommendation
ACM Conference on Recommender Systems (RecSys), 2025
Marta Moscati
Shah Nawaz
Markus Schedl
BDL
229
1
0
05 Aug 2025
EvoVLMA: Evolutionary Vision-Language Model Adaptation
Kun Ding
Ying Wang
Shiming Xiang
VLM
196
1
0
03 Aug 2025
MLLMs are Deeply Affected by Modality Bias
Xu Zheng
Chenfei Liao
Yuqian Fu
Kaiyu Lei
Yuanhuiyi Lyu
...
Yu Jiang
Andrii Zadaianchuk
Dacheng Tao
Luc Van Gool
Xuming Hu
386
23
0
24 May 2025
VoQA: Visual-only Question Answering
Jianing An
Luyang Jiang
Jie Luo
Wenjun Wu
Lei Huang
LRM
427
1
0
20 May 2025
Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning
IEEE Access, 2025
Sangyeon Cho
Jangyeong Jeon
Mingi Kim
Junyeong Kim
CLIP VLM
619
2
0
30 Apr 2025
Overcoming Vocabulary Constraints with Pixel-level Fallback
Jonas F. Lotz
Hendra Setiawan
Stephan Peitz
Yova Kementchedjhieva
373
4
0
02 Apr 2025
Beyond Words: Advancing Long-Text Image Generation via Multimodal Autoregressive Models
Alex Jinpeng Wang
Linjie Li
Zhiyong Yang
Lijuan Wang
Min Li
DiffM
322
2
0
26 Mar 2025
DGTRSD & DGTRS-CLIP: A Dual-Granularity Remote Sensing Image-Text Dataset and Vision Language Foundation Model for Alignment
IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE J-STARS), 2025
Weizhi Chen
Yupeng Deng
Jin Wei
Jingbo Chen
Jiansheng Chen
Yuman Feng
Zhihao Xi
Diyou Liu
Kai Li
Yu Meng
VLM
397
2
0
25 Mar 2025
DiffCLIP: Differential Attention Meets CLIP
Hasan Hammoud
Guohao Li
VLM
259
2
0
09 Mar 2025
Vision-centric Token Compression in Large Language Model
Ling Xing
Alex Jinpeng Wang
Rui Yan
Xiangbo Shu
Jinhui Tang
VLM
783
12
0
02 Feb 2025
Audio-Language Models for Audio-Centric Tasks: A Systematic Survey
Yi Su
Jisheng Bai
Qisheng Xu
Kele Xu
Yong Dou
LM&MA AuLLM
444
15
0
25 Jan 2025
Uni-Mlip: Unified Self-supervision for Medical Vision Language Pre-training
British Machine Vision Conference (BMVC), 2024
Ameera Bawazir
Kebin Wu
Wenbin Li
CLIP
340
1
0
20 Nov 2024
GlocalCLIP: Object-agnostic Global-Local Prompt Learning for Zero-shot Anomaly Detection
Jiyul Ham
Yonggon Jung
Jun-Geol Baek
VLM
468
5
0
09 Nov 2024
LogogramNLP: Comparing Visual and Textual Representations of Ancient Logographic Writing Systems for NLP
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Danlu Chen
Freda Shi
Aditi Agarwal
Jacobo Myerston
Taylor Berg-Kirkpatrick
230
3
0
08 Aug 2024
Chameleon: Images Are What You Need For Multimodal Learning Robust To Missing Modalities
Muhammad Irzam Liaqat
Shah Nawaz
Muhammad Zaigham Zaheer
M. S. Saeed
Hassan Sajjad
Tom De Schepper
Karthik Nandakumar
Muhammad Haris Khan
362
1
0
23 Jul 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
Alex Jinpeng Wang
Linjie Li
Yiqi Lin
Min Li
Lijuan Wang
Mike Zheng Shou
VLM
328
13
0
04 Jun 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
360
12
0
28 Mar 2024
Improving Medical Multi-modal Contrastive Learning with Expert Annotations
European Conference on Computer Vision (ECCV), 2024
Yogesh Kumar
Pekka Marttinen
MedIm VLM
483
29
0
15 Mar 2024
A$^{3}$lign-DFER: Pioneering Comprehensive Dynamic Affective Alignment for Dynamic Facial Expression Recognition with CLIP
Zeng Tao
Yan Wang
Junxiong Lin
Haoran Wang
Xinji Mai
...
Ziheng Zhou
Shaoqi Yan
Qing Zhao
Liyuan Han
Wenqiang Zhang
292
18
0
07 Mar 2024
Improving Language Understanding from Screenshots
Tianyu Gao
Zirui Wang
Adithya Bhaskar
Danqi Chen
VLM
245
14
0
21 Feb 2024
Pixel Sentence Representation Learning
Chenghao Xiao
Zhuoxu Huang
Danlu Chen
G. Hudson
Yi Zhou
Haoran Duan
Chenghua Lin
Jie Fu
Jungong Han
Noura Al Moubayed
SSL
286
7
0
13 Feb 2024
Text-Driven Traffic Anomaly Detection with Temporal High-Frequency Modeling in Driving Videos
Rongqin Liang
Yuanman Li
Jiantao Zhou
Xia Li
418
19
0
07 Jan 2024
Parrot Captions Teach CLIP to Spot Text
Yiqi Lin
Conghui He
Alex Jinpeng Wang
Sijin Yu
Weijia Li
Mike Zheng Shou
409
14
0
21 Dec 2023
ZeroI2V: Zero-Cost Adaptation of Pre-trained Transformers from Image to Video
European Conference on Computer Vision (ECCV), 2023
Xinhao Li
Yuhan Zhu
Limin Wang
VLM
356
20
0
02 Oct 2023
LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models
IEEE International Conference on Computer Vision (ICCV), 2023
Cheng Shi
Sibei Yang
VLM
372
35
0
03 Sep 2023
Unsupervised Camouflaged Object Segmentation as Domain Adaptation
Yi Zhang
Chengyi Wu
256
7
0
08 Aug 2023
Exploring Multimodal Approaches for Alzheimer's Disease Detection Using Patient Speech Transcript and Audio Data
Hongmin Cai
Xiaoke Huang
Zheng Liu
Wenxiong Liao
Haixing Dai
...
Dajiang Zhu
Hui Ren
Shijie Zhao
Tianming Liu
Xiang Li
218
24
0
05 Jul 2023
LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding
Yanzhe Zhang
Ruiyi Zhang
Jiuxiang Gu
Nedim Lipka
Diyi Yang
Tongfei Sun
VLM MLLM
386
303
0
29 Jun 2023
Prompt Ensemble Self-training for Open-Vocabulary Domain Adaptation
Jiaxing Huang
Jingyi Zhang
Han Qiu
Sheng Jin
Shijian Lu
VPVLM VLM
406
3
0
29 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Neural Information Processing Systems (NeurIPS), 2023
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM CLIP
971
93
0
13 Jun 2023
On the Generalization of Multi-modal Contrastive Learning
International Conference on Machine Learning (ICML), 2023
Tao Gui
Yifei Wang
Yisen Wang
227
32
0
07 Jun 2023
Learning without Forgetting for Vision-Language Models
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Da-Wei Zhou
Yuanhan Zhang
Jingyi Ning
De-Chuan Zhan
Ziwei Liu
VLM CLL
505
90
0
30 May 2023
OneCAD: One Classifier for All image Datasets using multimodal learning
S. Wadekar
Eugenio Culurciello
395
0
0
11 May 2023
VicTR: Video-conditioned Text Representations for Activity Recognition
Computer Vision and Pattern Recognition (CVPR), 2023
Kumara Kahatapitiya
Anurag Arnab
Arsha Nagrani
Michael S. Ryoo
383
39
0
05 Apr 2023
Vision-Language Models for Vision Tasks: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Jingyi Zhang
Jiaxing Huang
Sheng Jin
Shijian Lu
VLM
782
1,229
0
03 Apr 2023
Self-Supervised Multimodal Learning: A Survey
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023
Yongshuo Zong
Oisin Mac Aodha
Timothy M. Hospedales
SSL
445
107
0
31 Mar 2023
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
International Conference on Learning Representations (ICLR), 2023
Haoxuan You
Mandy Guo
Zhecan Wang
Kai-Wei Chang
Jason Baldridge
Jiahui Yu
DiffM
275
15
0
23 Mar 2023
Revisiting Class-Incremental Learning with Pre-Trained Models: Generalizability and Adaptivity are All You Need
International Journal of Computer Vision (IJCV), 2023
Da-Wei Zhou
Han-Jia Ye
De-Chuan Zhan
Ziwei Liu
CLL
291
211
0
13 Mar 2023
Training Vision-Language Transformers from Captions
Liangke Gui
Yingshan Chang
Qiuyuan Huang
Subhojit Som
Alexander G. Hauptmann
Jianfeng Gao
Yonatan Bisk
VLM ViT
497
11
0
19 May 2022