Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2211.07636
Cited By
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
14 November 2022
Yuxin Fang
Wen Wang
Binhui Xie
Quan-Sen Sun
Ledell Yu Wu
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
VLM
CLIP
Re-assign community
ArXiv
PDF
HTML
Papers citing
"EVA: Exploring the Limits of Masked Visual Representation Learning at Scale"
50 / 507 papers shown
Title
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
34
0
0
08 May 2025
Interleave-VLA: Enhancing Robot Manipulation with Interleaved Image-Text Instructions
Cunxin Fan
Xiaosong Jia
Yihang Sun
Yixiao Wang
Jianglan Wei
...
Xiangyu Zhao
M. Tomizuka
Xue Yang
Junchi Yan
Mingyu Ding
LM&Ro
VLM
54
2
0
04 May 2025
DEEMO: De-identity Multimodal Emotion Recognition and Reasoning
Deng Li
Bohao Xing
Xin Liu
Baiqiang Xia
Bihan Wen
H. Kalviainen
VLM
68
0
0
28 Apr 2025
MP-Mat: A 3D-and-Instance-Aware Human Matting and Editing Framework with Multiplane Representation
Siyi Jiao
Wenzheng Zeng
Y. Li
H. Zhang
Changxin Gao
Nong Sang
Mike Zheng Shou
12
0
0
20 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
103
0
0
17 Apr 2025
Stronger, Steadier & Superior: Geometric Consistency in Depth VFM Forges Domain Generalized Semantic Segmentation
Siyu Chen
Ting Han
Changshe Zhang
Xin Luo
Meiliu Wu
Guorong Cai
Jinhe Su
MDE
32
0
0
17 Apr 2025
Multimodal LLM Augmented Reasoning for Interpretable Visual Perception Analysis
Shravan Chaudhari
Trilokya Akula
Yoon Kim
Tom Blake
LRM
40
0
0
16 Apr 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei
Jiacong Wang
Haochen Wang
X. Li
Jun Hao Liew
Jiashi Feng
Zilong Huang
26
1
0
14 Apr 2025
CleanMAP: Distilling Multimodal LLMs for Confidence-Driven Crowdsourced HD Map Updates
Ankit Kumar Shaw
Kun Jiang
Tuopu Wen
Chandan Kumar Sah
Yining Shi
Mengmeng Yang
D. Yang
Xiaoli Lian
26
0
0
14 Apr 2025
Enhancing Multi-task Learning Capability of Medical Generalist Foundation Model via Image-centric Multi-annotation Data
Xun Zhu
Fanbin Mo
Zheng Zhang
J. Wang
Yiming Shi
Ming Wu
Chuang Zhang
Miao Li
Ji Wu
24
0
0
14 Apr 2025
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
Cheng-Yu Hsieh
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Hadi Pouransari
VLM
61
0
0
11 Apr 2025
VC-LLM: Automated Advertisement Video Creation from Raw Footage using Multi-modal LLMs
Dongjun Qian
Kai Su
Yiming Tan
Qishuai Diao
Xian Wu
Chang Liu
Bingyue Peng
Zehuan Yuan
VGen
16
0
0
08 Apr 2025
REEF: Relevance-Aware and Efficient LLM Adapter for Video Understanding
Sakib Reza
Xiyun Song
Heather Yu
Zongfang Lin
Mohsen Moghaddam
Octavia Camps
23
0
0
07 Apr 2025
Delineate Anything: Resolution-Agnostic Field Boundary Delineation on Satellite Imagery
Mykola Lavreniuk
Nataliia Kussul
Andrii Shelestov
Bohdan Yailymov
Yevhenii Salii
Volodymyr Kuzin
Zoltan Szantoi
29
0
0
03 Apr 2025
Rip Current Segmentation: A Novel Benchmark and YOLOv8 Baseline Results
Andrei Dumitriu
Florin Tatui
Florin Miron
Radu Tudor Ionescu
Radu Timofte
37
19
0
03 Apr 2025
Enhanced Cross-modal 3D Retrieval via Tri-modal Reconstruction
Junlong Ren
Hao Wang
36
0
0
02 Apr 2025
RipVIS: Rip Currents Video Instance Segmentation Benchmark for Beach Monitoring and Safety
Andrei Dumitriu
Florin Tatui
Florin Miron
Aakash Ralhan
Radu Tudor Ionescu
Radu Timofte
26
0
0
01 Apr 2025
Evaluating Text-to-Image Synthesis with a Conditional Fréchet Distance
Jaywon Koo
J. Hernandez
Moayed Haji-Ali
Ziyan Yang
Vicente Ordonez
EGVM
67
0
0
27 Mar 2025
Vision as LoRA
Han Wang
Yongjie Ye
Bingru Li
Yuxiang Nie
Jinghui Lu
Jingqun Tang
Yanjie Wang
Can Huang
86
0
0
26 Mar 2025
Scaling Vision Pre-Training to 4K Resolution
Baifeng Shi
Boyi Li
Han Cai
Y. Lu
Sifei Liu
...
Jan Kautz
Song Han
Trevor Darrell
Pavlo Molchanov
Hongxu Yin
CLIP
53
0
0
25 Mar 2025
Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders
Paul Koch
Jörg Krüger
Ankit Chowdhury
O. Heimann
MDE
51
0
0
25 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Gensheng Pei
Tao Chen
Yujia Wang
Xinhao Cai
Xiangbo Shu
Tianfei Zhou
Yazhou Yao
VLM
48
1
0
21 Mar 2025
Enhancing Zero-Shot Image Recognition in Vision-Language Models through Human-like Concept Guidance
Hui Liu
Wenya Wang
Kecheng Chen
Jie Liu
Yibing Liu
Tiexin Qin
Peisong He
Xinghao Jiang
Haoliang Li
BDL
VLM
78
0
0
20 Mar 2025
REVAL: A Comprehension Evaluation on Reliability and Values of Large Vision-Language Models
Jie M. Zhang
Zheng Yuan
Z. Wang
Bei Yan
Sibo Wang
Xiangkui Cao
Zonghui Guo
Shiguang Shan
Xilin Chen
ELM
36
0
0
20 Mar 2025
Visual Position Prompt for MLLM based Visual Grounding
Wei Tang
Yanpeng Sun
Qinying Gu
Zechao Li
VLM
45
0
0
19 Mar 2025
Exploring Disparity-Accuracy Trade-offs in Face Recognition Systems: The Role of Datasets, Architectures, and Loss Functions
S. Jaiswal
Sagnik Basu
Sandipan Sikdar
Animesh Mukherjee
33
0
0
18 Mar 2025
CalliReader: Contextualizing Chinese Calligraphy via an Embedding-Aligned Vision-Language Model
Yuxuan Luo
Jiaqi Tang
Chenyi Huang
Feiyang Hao
Zhouhui Lian
VLM
56
0
0
13 Mar 2025
Measure Twice, Cut Once: Grasping Video Structures and Event Semantics with LLMs for Video Temporal Localization
Zongshang Pang
Mayu Otani
Yuta Nakashima
51
0
0
12 Mar 2025
Multi-Modal Foundation Models for Computational Pathology: A Survey
Dong Li
Guihong Wan
Xintao Wu
Xinyu Wu
Xiaohui Chen
Yi He
Christine G. Lian
Peter K. Sorger
Yevgeniy R. Semenov
Chen Zhao
MedIm
44
0
0
12 Mar 2025
Scale-Aware Pre-Training for Human-Centric Visual Perception: Enabling Lightweight and Generalizable Models
Xuanhan Wang
Huimin Deng
Lianli Gao
Jingkuan Song
VLM
47
0
0
11 Mar 2025
Similarity-Guided Layer-Adaptive Vision Transformer for UAV Tracking
Chaocan Xue
Bineng Zhong
Qihua Liang
Yaozong Zheng
Ning Li
Yuanliang Xue
Shuxiang Song
36
0
0
09 Mar 2025
StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition
Xin Ding
Hao Wu
Y. Yang
Shiqi Jiang
Donglin Bai
Zhibo Chen
Ting Cao
50
0
0
08 Mar 2025
Beyond Cosine Decay: On the effectiveness of Infinite Learning Rate Schedule for Continual Pre-training
Paul Janson
Vaibhav Singh
Paria Mehrbod
Adam Ibrahim
Irina Rish
Eugene Belilovsky
Benjamin Thérien
CLL
73
0
0
04 Mar 2025
Generalizable Prompt Learning of CLIP: A Brief Overview
Fangming Cui
Yonggang Zhang
Xuan Wang
Xule Wang
Liang Xiao
VPVLM
VLM
73
0
0
03 Mar 2025
MedUnifier: Unifying Vision-and-Language Pre-training on Medical Data with Vision Generation Task using Discrete Visual Representations
Ziyang Zhang
Yang Yu
Yucheng Chen
Xulei Yang
S. Yeo
MedIm
45
1
0
02 Mar 2025
Re-Imagining Multimodal Instruction Tuning: A Representation View
Yiyang Liu
James Liang
Ruixiang Tang
Yugyung Lee
Majid Rabbani
...
Raghuveer M. Rao
Lifu Huang
Dongfang Liu
Qifan Wang
Cheng Han
48
0
0
02 Mar 2025
Streaming Video Question-Answering with In-context Video KV-Cache Retrieval
Shangzhe Di
Zhelun Yu
Guanghao Zhang
Haoyuan Li
Tao Zhong
Hao Cheng
Bolin Li
Wanggui He
Fangxun Shu
Hao Jiang
63
4
0
01 Mar 2025
Towards High-performance Spiking Transformers from ANN to SNN Conversion
Zihan Huang
Xinyu Shi
Zecheng Hao
Tong Bu
Jianhao Ding
Zhaofei Yu
Tiejun Huang
28
7
0
28 Feb 2025
Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models
Zhaoyi Liu
Huan Zhang
AAML
72
0
0
25 Feb 2025
UniGS: Unified Language-Image-3D Pretraining with Gaussian Splatting
Haoyuan Li
Yanpeng Zhou
Tao Tang
Jifei Song
Yihan Zeng
Michael C. Kampffmeyer
Hang Xu
Xiaodan Liang
3DGS
57
1
0
25 Feb 2025
Pretrained Image-Text Models are Secretly Video Captioners
Chunhui Zhang
Yiren Jian
Z. Ouyang
Soroush Vosoughi
VLM
63
3
0
20 Feb 2025
VAQUUM: Are Vague Quantifiers Grounded in Visual Data?
Hugh Mee Wong
Rick Nouwen
Albert Gatt
46
0
0
17 Feb 2025
Demystifying Hateful Content: Leveraging Large Multimodal Models for Hateful Meme Detection with Explainable Decisions
Ming Shan Hee
Roy Ka-Wei Lee
VLM
75
0
0
16 Feb 2025
I Think, Therefore I Diffuse: Enabling Multimodal In-Context Reasoning in Diffusion Models
Zhenxing Mi
Kuan-Chieh Jackson Wang
Guocheng Qian
Hanrong Ye
Runtao Liu
Sergey Tulyakov
Kfir Aberman
Dan Xu
LRM
42
0
0
12 Feb 2025
HCMRM: A High-Consistency Multimodal Relevance Model for Search Ads
Guobing Gan
Kaiming Gao
Li Wang
Shen Jiang
Peng Jiang
59
0
0
09 Feb 2025
UNIP: Rethinking Pre-trained Attention Patterns for Infrared Semantic Segmentation
Tao Zhang
Jinyong Wen
Zhen Chen
Kun Ding
S. Xiang
Chunhong Pan
70
1
0
04 Feb 2025
Towards Robust Multimodal Large Language Models Against Jailbreak Attacks
Ziyi Yin
Yuanpu Cao
Han Liu
Ting Wang
Jinghui Chen
Fenhlong Ma
AAML
47
0
0
02 Feb 2025
Vision-Language Model Selection and Reuse for Downstream Adaptation
Hao-Zhe Tan
Zhi-Hua Zhou
Lan-Zhe Guo
Yu-Feng Li
VLM
88
0
0
30 Jan 2025
Rethinking Encoder-Decoder Flow Through Shared Structures
Frederik Laboyrie
M. K. Yucel
Albert Saà-Garriga
AI4CE
40
0
0
24 Jan 2025
Patent Figure Classification using Large Vision-language Models
Sushil Awale
Eric Müller-Budack
Ralph Ewerth
31
0
0
22 Jan 2025
1
2
3
4
...
9
10
11
Next