2303.15389
Cited By
EVA-CLIP: Improved Training Techniques for CLIP at Scale
27 March 2023
Quan-Sen Sun
Yuxin Fang
Ledell Yu Wu
Xinlong Wang
Yue Cao
CLIP
VLM
Papers citing "EVA-CLIP: Improved Training Techniques for CLIP at Scale"
50 / 357 papers shown
Title
Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization
Seongjae Kang
Dong Bok Lee
Hyungjoon Jang
Sung Ju Hwang
VLM
35
0
0
12 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
34
0
0
08 May 2025
FG-CLIP: Fine-Grained Visual and Textual Alignment
Chunyu Xie
Bin Wang
Fanjing Kong
Jincheng Li
Dawei Liang
Gengshen Zhang
Dawei Leng
Yuhui Yin
CLIP
VLM
42
0
0
08 May 2025
DeCLIP: Decoupled Learning for Open-Vocabulary Dense Perception
Junjie Wang
Bin Chen
Yulin Li
Bin Kang
Y. Chen
Zhuotao Tian
VLM
38
0
0
07 May 2025
Seeing the Abstract: Translating the Abstract Language for Vision Language Models
Davide Talon
Federico Girella
Ziyue Liu
Marco Cristani
Yiming Wang
VLM
44
0
0
06 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
X. Zhang
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
57
0
0
05 May 2025
Rethinking Visual Layer Selection in Multimodal LLMs
H. Chen
Junyan Lin
Xinhao Chen
Yue Fan
Xin Jin
Hui Su
Jianfeng Dong
Jinlan Fu
Xiaoyu Shen
VLM
93
0
0
30 Apr 2025
Platonic Grounding for Efficient Multimodal Language Models
Moulik Choraria
Xinbo Wu
Akhil Bhimaraju
Nitesh Sekhar
Yue Wu
Xu Zhang
Prateek Singhal
L. Varshney
54
0
0
27 Apr 2025
TimeSoccer: An End-to-End Multimodal Large Language Model for Soccer Commentary Generation
Ling You
Wenxuan Huang
Xinni Xie
Xiangyi Wei
Bangyan Li
Shaohui Lin
Yang Li
Changbo Wang
VGen
54
0
0
24 Apr 2025
Generative Multimodal Pretraining with Discrete Diffusion Timestep Tokens
Kaihang Pan
Wang Lin
Zhongqi Yue
Tenglong Ao
Liyu Jia
Wei Zhao
Juncheng Billy Li
Siliang Tang
Hanwang Zhang
39
1
0
20 Apr 2025
Bayesian Principles Improve Prompt Learning In Vision-Language Models
Mingyu Kim
Jongwoo Ko
Mijung Park
VLM
38
0
0
19 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
103
0
0
17 Apr 2025
Multimodal LLM Augmented Reasoning for Interpretable Visual Perception Analysis
Shravan Chaudhari
Trilokya Akula
Yoon Kim
Tom Blake
LRM
40
0
0
16 Apr 2025
QAVA: Query-Agnostic Visual Attack to Large Vision-Language Models
Yudong Zhang
Ruobing Xie
Jiansheng Chen
X. Sun
Zhanhui Kang
Yu Wang
AAML
29
0
0
15 Apr 2025
MMKB-RAG: A Multi-Modal Knowledge-Based Retrieval-Augmented Generation Framework
Zihan Ling
Zhiyao Guo
Yixuan Huang
Yi An
Shuai Xiao
Jinsong Lan
Xiaoyong Zhu
Bo Zheng
RALM
VLM
53
0
0
14 Apr 2025
Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models?
Yanbo Wang
Jiyang Guan
Jian Liang
Ran He
41
0
0
14 Apr 2025
MIEB: Massive Image Embedding Benchmark
Chenghao Xiao
Isaac Chung
Imene Kerboua
Jamie Stirling
Xin Zhang
Márton Kardos
Roman Solomatin
Noura Al Moubayed
K. Enevoldsen
Niklas Muennighoff
VLM
35
0
0
14 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
66
0
0
11 Apr 2025
Benchmarking Image Embeddings for E-Commerce: Evaluating Off-the Shelf Foundation Models, Fine-Tuning Strategies and Practical Trade-offs
Urszula Czerwinska
Cenk Bircanoglu
Jeremy Chamoux
33
0
0
10 Apr 2025
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen
Peng Liu
J. Li
Chunxin Fang
Yibo Ma
...
Zilun Zhang
Kangjia Zhao
Qianqian Zhang
Ruochen Xu
Tiancheng Zhao
VLM
LRM
71
0
0
10 Apr 2025
SARLANG-1M: A Benchmark for Vision-Language Modeling in SAR Image Understanding
Yimin Wei
Aoran Xiao
Yexian Ren
Yuting Zhu
Hongruixuan Chen
J. Xia
Naoto Yokoya
VLM
66
0
0
04 Apr 2025
Simultaneous Learning of Optimal Transports for Training All-to-All Flow-Based Condition Transfer Model
Kotaro Ikeda
Masanori Koyama
Jinzhe Zhang
Kohei Hayashi
Kenji Fukumizu
OT
63
0
0
04 Apr 2025
Refining CLIP's Spatial Awareness: A Visual-Centric Perspective
Congpei Qiu
Yanhao Wu
Wei Ke
Xiuxiu Bai
Tong Zhang
VLM
44
0
0
03 Apr 2025
UniViTAR: Unified Vision Transformer with Native Resolution
Limeng Qiao
Yiyang Gan
Bairui Wang
Jie Qin
Shuang Xu
Siqi Yang
Lin Ma
50
0
0
02 Apr 2025
Scaling Language-Free Visual Representation Learning
David Fan
Shengbang Tong
Jiachen Zhu
Koustuv Sinha
Zhuang Liu
...
Michael G. Rabbat
Nicolas Ballas
Yann LeCun
Amir Bar
Saining Xie
CLIP
VLM
56
2
0
01 Apr 2025
CIBR: Cross-modal Information Bottleneck Regularization for Robust CLIP Generalization
Yingrui Ji
Xi Xiao
Gaofei Chen
Hao Xu
Chenrui Ma
Lijing Zhu
Aokun Liang
Jiansheng Chen
VLM
48
0
0
31 Mar 2025
CA^2ST: Cross-Attention in Audio, Space, and Time for Holistic Video Recognition
Jongseo Lee
Joohyun Chang
Dongho Lee
Jinwoo Choi
51
0
0
30 Mar 2025
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Ziyue Huang
Hongxi Yan
Qiqi Zhan
Shuai Yang
Mingming Zhang
Chenkai Zhang
Yiming Lei
Zeming Liu
Qingjie Liu
Y. Wang
42
0
0
28 Mar 2025
An Empirical Study of Validating Synthetic Data for Text-Based Person Retrieval
Min Cao
Ziyin Zeng
YuXin Lu
Mang Ye
Dong Yi
Jinqiao Wang
SyDa
52
0
0
28 Mar 2025
DIFFER: Disentangling Identity Features via Semantic Cues for Clothes-Changing Person Re-ID
Xin Liang
Yogesh S Rawat
83
0
0
28 Mar 2025
Fwd2Bot: LVLM Visual Token Compression with Double Forward Bottleneck
Adrian Bulat
Yassine Ouali
Georgios Tzimiropoulos
62
0
0
27 Mar 2025
LangBridge: Interpreting Image as a Combination of Language Embeddings
Jiaqi Liao
Yuwei Niu
Fanqing Meng
Hao Li
Changyao Tian
...
Dianqi Li
X. Zhu
Li Yuan
Jifeng Dai
Yu Cheng
MLLM
72
0
0
25 Mar 2025
Breaking the Encoder Barrier for Seamless Video-Language Understanding
Handong Li
Yiyuan Zhang
Longteng Guo
Xiangyu Yue
Jing Liu
VLM
72
0
0
24 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Gensheng Pei
Tao Chen
Yujia Wang
Xinhao Cai
Xiangbo Shu
Tianfei Zhou
Yazhou Yao
VLM
48
1
0
21 Mar 2025
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
H. Wang
Kai Hu
Liangcai Gao
82
0
0
20 Mar 2025
REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models
Jie M. Zhang
Zheng Yuan
Z. Wang
Bei Yan
Sibo Wang
Xiangkui Cao
Zonghui Guo
Shiguang Shan
Xilin Chen
ELM
36
0
0
20 Mar 2025
EditID: Training-Free Editable ID Customization for Text-to-Image Generation
Guandong Li
Zhaobin Chu
DiffM
55
0
0
16 Mar 2025
ProAPO: Progressively Automatic Prompt Optimization for Visual Classification
Xiangyan Qu
Gaopeng Gou
Jiamin Zhuang
Jing Yu
Kun Song
Qihao Wang
Yili Li
Gang Xiong
VLM
75
0
0
13 Mar 2025
HierarQ: Task-Aware Hierarchical Q-Former for Enhanced Video Understanding
Shehreen Azad
Vibhav Vineet
Y. S. Rawat
VLM
63
1
0
11 Mar 2025
SemHiTok: A Unified Image Tokenizer via Semantic-Guided Hierarchical Codebook for Multimodal Understanding and Generation
Z. Chen
Chunwei Wang
Xiuwei Chen
Hang Xu
J. Han
Xiandan Liang
VLM
69
1
0
09 Mar 2025
Language-Assisted Feature Transformation for Anomaly Detection
EungGu Yun
Heonjin Ha
Yeongwoo Nam
Bryan Dongik Lee
61
0
0
03 Mar 2025
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat
Zhipeng Huang
Shaobin Zhuang
Canmiao Fu
Binxin Yang
Ying Zhang
Chong Sun
Zhizheng Zhang
Yali Wang
Chen Li
Zheng-Jun Zha
DiffM
69
1
0
03 Mar 2025
ABC: Achieving Better Control of Multimodal Embeddings using VLMs
Benjamin Schneider
Florian Kerschbaum
Wenhu Chen
55
0
0
01 Mar 2025
Towards High-performance Spiking Transformers from ANN to SNN Conversion
Zihan Huang
Xinyu Shi
Zecheng Hao
Tong Bu
Jianhao Ding
Zhaofei Yu
Tiejun Huang
28
7
0
28 Feb 2025
Picking the Cream of the Crop: Visual-Centric Data Selection with Collaborative Agents
Zhenyu Liu
Yunxin Li
Baotian Hu
Wenhan Luo
Yaowei Wang
Min-Ling Zhang
60
0
0
27 Feb 2025
Improving Adversarial Transferability in MLLMs via Dynamic Vision-Language Alignment Attack
Chenhe Gu
Jindong Gu
Andong Hua
Yao Qin
AAML
42
0
0
27 Feb 2025
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Chenyang Zhao
Kun Wang
J. H. Hsiao
Antoni B. Chan
CLIP
66
0
0
26 Feb 2025
UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting
Haoyuan Li
Yanpeng Zhou
Tao Tang
Jifei Song
Yihan Zeng
Michael C. Kampffmeyer
Hang Xu
Xiaodan Liang
3DGS
57
1
0
25 Feb 2025
CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification
Mingkun Zhang
Keping Bi
Wei Chen
J. Guo
Xueqi Cheng
BDL
VLM
47
1
0
25 Feb 2025
Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review
Ufaq Khan
Umair Nawaz
A. Qayyum
Shazad Ashraf
Muhammad Bilal
Junaid Qadir
76
0
0
24 Feb 2025