Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2303.15389
Cited By
EVA-CLIP: Improved Training Techniques for CLIP at Scale
27 March 2023
Quan-Sen Sun
Yuxin Fang
Ledell Yu Wu
Xinlong Wang
Yue Cao
CLIP
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"EVA-CLIP: Improved Training Techniques for CLIP at Scale"
50 / 357 papers shown
Title
Directional Gradient Projection for Robust Fine-Tuning of Foundation Models
Chengyue Huang
Junjiao Tian
Brisa Maneechotesuwan
Shivang Chopra
Z. Kira
49
0
0
21 Feb 2025
InsightVision: A Comprehensive, Multi-Level Chinese-based Benchmark for Evaluating Implicit Visual Semantics in Large Vision Language Models
Xiaofei Yin
Y. Hong
Ya Guo
Yi Tu
Weiqiang Wang
Gongshen Liu
Huijia Zhu
VLM
58
0
0
19 Feb 2025
Black Sheep in the Herd: Playing with Spuriously Correlated Attributes for Vision-Language Recognition
Xinyu Tian
Shu Zou
Zhaoyuan Yang
Mengqi He
Jing Zhang
VLM
43
0
0
19 Feb 2025
Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval
Ze Liu
Zhengyang Liang
Junjie Zhou
Zheng Liu
Defu Lian
OffRL
58
0
0
17 Feb 2025
VRoPE: Rotary Position Embedding for Video Large Language Models
Zikang Liu
Longteng Guo
Yepeng Tang
Junxian Cai
Kai Ma
Xi Chen
J. Liu
49
0
0
17 Feb 2025
GAIA: A Global, Multi-modal, Multi-scale Vision-Language Dataset for Remote Sensing Image Analysis
Angelos Zavras
Dimitrios Michail
Xiao Xiang Zhu
Begum Demir
Ioannis Papoutsis
VLM
81
0
0
13 Feb 2025
Learning Human Skill Generators at Key-Step Levels
Yilu Wu
Chenhui Zhu
Shuai Wang
Hanlin Wang
Jing Wang
Zhaoxiang Zhang
Limin Wang
VGen
112
0
0
12 Feb 2025
Intrinsic Bias is Predicted by Pretraining Data and Correlates with Downstream Performance in Vision-Language Encoders
Kshitish Ghate
Isaac Slaughter
Kyra Wilson
Mona Diab
Aylin Caliskan
76
0
0
11 Feb 2025
CBVLM: Training-free Explainable Concept-based Large Vision Language Models for Medical Image Classification
Cristiano Patrício
Isabel Rio-Torto
J. S. Cardoso
Luís F. Teixeira
João C. Neves
VLM
129
0
0
21 Jan 2025
VideoAuteur: Towards Long Narrative Video Generation
Junfei Xiao
Feng Cheng
Lu Qi
Liangke Gui
Jiepeng Cen
Zhibei Ma
Alan L. Yuille
Lu Jiang
VGen
54
2
0
10 Jan 2025
Advancements in Visual Language Models for Remote Sensing: Datasets, Capabilities, and Enhancement Techniques
Lijie Tao
H. Zhang
Haizhao Jing
Yu Liu
Kelu Yao
Guoting Wei
Xizhe Xue
33
0
0
03 Jan 2025
YOLO-UniOW: Efficient Universal Open-World Object Detection
Lihao Liu
Juexiao Feng
Hui Chen
Ao Wang
Lin Song
J. Han
Guiguang Ding
ObjD
VLM
33
2
0
31 Dec 2024
Towards Visual Grounding: A Survey
Linhui Xiao
Xiaoshan Yang
X. Lan
Yaowei Wang
Changsheng Xu
ObjD
46
3
0
31 Dec 2024
How Panel Layouts Define Manga: Insights from Visual Ablation Experiments
Siyuan Feng
Teruya Yoshinaga
Katsuhiko Hayashi
Koki Washio
Hidetaka Kamigaito
28
0
0
26 Dec 2024
Retention Score: Quantifying Jailbreak Risks for Vision Language Models
Zaitang Li
Pin-Yu Chen
Tsung-Yi Ho
AAML
28
0
0
23 Dec 2024
HyperCLIP: Adapting Vision-Language models with Hypernetworks
Victor Akinwande
Mohammad Sadegh Norouzzadeh
Devin Willmott
Anna Bair
Madan Ravi Ganesh
J. Zico Kolter
CLIP
VLM
84
0
0
21 Dec 2024
DINOv2 Meets Text: A Unified Framework for Image- and Pixel-Level Vision-Language Alignment
Cijo Jose
Théo Moutakanni
Dahyun Kang
Federico Baldassarre
Timothée Darcet
...
Maxime Oquab
Oriane Siméoni
Huy V. Vo
Patrick Labatut
Piotr Bojanowski
CLIP
VLM
88
6
0
20 Dec 2024
Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation
J. Zhang
Li Zhang
Shijian Li
VLM
71
0
0
18 Dec 2024
Unlocking the Potential of Weakly Labeled Data: A Co-Evolutionary Learning Framework for Abnormality Detection and Report Generation
Jinghan Sun
Dong-mei Wei
Zhe Xu
Donghuan Lu
Hong Liu
Hong Wang
Sotirios A. Tsaftaris
Steven G. McDonagh
Yefeng Zheng
Liansheng Wang
MedIm
92
0
0
18 Dec 2024
DINO-Foresight
\texttt{DINO-Foresight}
DINO-Foresight
: Looking into the Future with DINO
Efstathios Karypidis
Ioannis Kakogeorgiou
Spyros Gidaris
N. Komodakis
AI4CE
79
1
0
16 Dec 2024
Gramian Multimodal Representation Learning and Alignment
Giordano Cicchetti
Eleonora Grassucci
Luigi Sigillo
Danilo Comminiello
78
0
0
16 Dec 2024
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
Xinshuai Song
Weixing Chen
Y. Liu
Weikai Chen
Guanbin Li
Liang Lin
117
3
0
12 Dec 2024
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Andreas Koukounas
Georgios Mastrapas
Bo Wang
Mohammad Kalim Akram
Sedigheh Eslami
Michael Gunther
Isabelle Mohr
Saba Sturua
Scott Martens
Nan Wang
VLM
97
6
0
11 Dec 2024
A Stitch in Time Saves Nine: Small VLM is a Precise Guidance for Accelerating Large VLMs
Wangbo Zhao
Yizeng Han
Jiasheng Tang
Z. Li
Yibing Song
K. Wang
Zhangyang Wang
Yang You
77
7
0
04 Dec 2024
Video LLMs for Temporal Reasoning in Long Videos
Fawad Javed Fateh
Umer Ahmed
Hamza Khan
M. Zia
Quoc-Huy Tran
VLM
81
0
0
04 Dec 2024
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan
Hanqin Liu
Yao Huang
Xiaoqi Wang
Caixin Kang
Hang Su
Yinpeng Dong
Xingxing Wei
VGen
88
0
0
04 Dec 2024
VideoLLM Knows When to Speak: Enhancing Time-Sensitive Video Comprehension with Video-Text Duet Interaction Format
Yueqian Wang
Xiaojun Meng
Y. Wang
Jianxin Liang
Jiansheng Wei
Huishuai Zhang
Dongyan Zhao
VGen
75
8
0
27 Nov 2024
NEMO: Can Multimodal LLMs Identify Attribute-Modified Objects?
Jiaxuan Li
Junwen Mo
MinhDuc Vo
Akihiro Sugimoto
Hideki Nakayama
79
0
0
26 Nov 2024
Beyond Sight: Towards Cognitive Alignment in LVLM via Enriched Visual Knowledge
Yaqi Zhao
Yuanyang Yin
Lin Li
Mingan Lin
Victor Shea-Jay Huang
Siwei Chen
Weipeng Chen
Baoqun Yin
Zenan Zhou
Wentao Zhang
67
0
0
25 Nov 2024
A Study on Unsupervised Domain Adaptation for Semantic Segmentation in the Era of Vision-Language Models
Manuel Schwonberg
Claus Werner
Hanno Gottschalk
Carsten Meyer
VLM
85
0
0
25 Nov 2024
ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration
Haozhan Shen
Kangjia Zhao
Tiancheng Zhao
Ruochen Xu
Zilun Zhang
Mingwei Zhu
Jianwei Yin
84
4
0
25 Nov 2024
DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving
Xianda Guo
Ruijun Zhang
Yiqun Duan
Yuhang He
Chenming Zhang
Shuai Liu
Long Chen
LRM
66
11
0
20 Nov 2024
Anatomy-Guided Radiology Report Generation with Pathology-Aware Regional Prompts
Yijian Gao
D. C. Marshall
Xiaodan Xing
Junzhi Ning
G. Papanastasiou
G. Yang
M. Komorowski
MedIm
19
0
0
16 Nov 2024
Thinking Before Looking: Improving Multimodal LLM Reasoning via Mitigating Visual Hallucination
Haojie Zheng
Tianyang Xu
Hanchi Sun
Shu Pu
Ruoxi Chen
Lichao Sun
MLLM
LRM
64
8
0
15 Nov 2024
Harnessing Vision Foundation Models for High-Performance, Training-Free Open Vocabulary Segmentation
Yuheng Shi
Minjing Dong
Chang Xu
VLM
27
1
0
14 Nov 2024
Silver medal Solution for Image Matching Challenge 2024
Yian Wang
3DV
3DPC
31
0
0
04 Nov 2024
Towards Unifying Understanding and Generation in the Era of Vision Foundation Models: A Survey from the Autoregression Perspective
Shenghao Xie
Wenqiang Zu
Mingyang Zhao
Duo Su
Shilong Liu
Ruohua Shi
Guoqi Li
Shanghang Zhang
Lei Ma
LRM
40
3
0
29 Oct 2024
Improving Generalization in Visual Reasoning via Self-Ensemble
Tien-Huy Nguyen
Quang-Khai Tran
Anh-Tuan Quang-Hoang
VLM
LRM
47
5
0
28 Oct 2024
Multi-path Exploration and Feedback Adjustment for Text-to-Image Person Retrieval
Bin Kang
Bin Chen
J. T. Wang
Yong Xu
19
0
0
26 Oct 2024
GiVE: Guiding Visual Encoder to Perceive Overlooked Information
Junjie Li
Jianghong Ma
Xiaofeng Zhang
Yuhang Li
Jianyang Shi
23
0
0
26 Oct 2024
Parameter-Efficient Fine-Tuning in Large Models: A Survey of Methodologies
L. Wang
Sheng Chen
Linnan Jiang
Shu Pan
Runze Cai
Sen Yang
Fei Yang
44
3
0
24 Oct 2024
PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers in a resource-limited Context
Maximilian Augustin
Syed Shakib Sarwar
Mostafa Elhoushi
Sai Qian Zhang
Yuecheng Li
B. D. Salvo
20
0
0
23 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
62
22
0
21 Oct 2024
Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly
Junsheng Zhou
Yu-Shen Liu
Zhizhong Han
ViT
32
9
0
21 Oct 2024
Visual Motif Identification: Elaboration of a Curated Comparative Dataset and Classification Methods
Adam Phillips
Daniel Grandes Rodriguez
Miriam Sánchez-Manzano
Alan Salvadó
Manuel Garin
G. Haro
C. Ballester
17
0
0
21 Oct 2024
TIPS: Text-Image Pretraining with Spatial awareness
Kevis-Kokitsi Maninis
Kaifeng Chen
Soham Ghosh
Arjun Karpur
Koert Chen
...
Jan Dlabal
Dan Gnanapragasam
Mojtaba Seyedhosseini
Howard Zhou
Andre Araujo
VLM
30
3
0
21 Oct 2024
Open-vocabulary vs. Closed-set: Best Practice for Few-shot Object Detection Considering Text Describability
Yusuke Hosoya
Masanori Suganuma
Takayuki Okatani
ObjD
16
0
0
20 Oct 2024
Zero-shot Action Localization via the Confidence of Large Vision-Language Models
Josiah Aklilu
Xiaohan Wang
Serena Yeung-Levy
54
1
0
18 Oct 2024
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Chengyue Wu
Xiaokang Chen
Z. F. Wu
Yiyang Ma
Xingchao Liu
...
Wen Liu
Zhenda Xie
Xingkai Yu
Chong Ruan
Ping Luo
AI4TS
52
70
0
17 Oct 2024
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding
Yue Cao
Yangzhou Liu
Zhe Chen
Guangchen Shi
Wenhai Wang
Danhuai Zhao
Tong Lu
41
5
0
15 Oct 2024
Previous
1
2
3
4
5
6
7
8
Next