2211.07636
Cited By
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Computer Vision and Pattern Recognition (CVPR), 2022
14 November 2022
Yuxin Fang
Wen Wang
Binhui Xie
Quan-Sen Sun
Ledell Yu Wu
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
VLM
CLIP
HuggingFace (1 upvote)
Github (2496★)
Papers citing
"EVA: Exploring the Limits of Masked Visual Representation Learning at Scale"
50 / 579 papers shown
Video-Text Dataset Construction from Multi-AI Feedback: Promoting Weak-to-Strong Preference Learning for Video Large Language Models
Hao Yi
Qingyang Li
Yihan Hu
Fuzheng Zhang
Di Zhang
Yong Liu
VGen
355
0
0
25 Nov 2024
LibraGrad: Balancing Gradient Flow for Universally Better Vision Transformer Attributions
Computer Vision and Pattern Recognition (CVPR), 2024
Faridoun Mehri
Mahdieh Soleymani Baghshah
Mohammad Taher Pilehvar
293
3
0
24 Nov 2024
ReWind: Understanding Long Videos with Instructed Learnable Memory
Computer Vision and Pattern Recognition (CVPR), 2024
Anxhelo Diko
Tinghuai Wang
Wassim Swaileh
Shiyan Sun
Ioannis Patras
KELM
VLM
370
4
0
23 Nov 2024
Devils in Middle Layers of Large Vision-Language Models: Interpreting, Detecting and Mitigating Object Hallucinations via Attention Lens
Computer Vision and Pattern Recognition (CVPR), 2024
Zhangqi Jiang
Junkai Chen
Beier Zhu
Tingjin Luo
Yankun Shen
Xu Yang
528
49
0
23 Nov 2024
Chanel-Orderer: A Channel-Ordering Predictor for Tri-Channel Natural Images
Shen Li
Lei Jiang
Wei Wang
Hongwei Hu
Liang Li
347
0
0
20 Nov 2024
Generative Timelines for Instructed Visual Assembly
Alejandro Pardo
Jui-hsien Wang
Guohao Li
Josef Sivic
Bryan C. Russell
Fabian Caba Heilbron
VGen
272
0
0
19 Nov 2024
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
Dengke Zhang
Fagui Liu
Quan Tang
VLM
664
2
0
15 Nov 2024
Classification Done Right for Vision-Language Pre-Training
Neural Information Processing Systems (NeurIPS), 2024
Zilong Huang
Qinghao Ye
Bingyi Kang
Jiashi Feng
Haoqi Fan
CLIP
VLM
419
7
0
05 Nov 2024
UniGuard: Towards Universal Safety Guardrails for Jailbreak Attacks on Multimodal Large Language Models
Sejoon Oh
Yiqiao Jin
Megha Sharma
Donghyun Kim
Eric Ma
Gaurav Verma
Srijan Kumar
332
12
0
03 Nov 2024
Tracking one-in-a-million: Large-scale benchmark for microbial single-cell tracking with experiment-aware robustness metrics
J. Seiffarth
L. Blöbaum
R. D. Paul
N. Friederich
A. J. Yamachui Sitcheu
R. Mikut
H. Scharr
A. Grünberger
K. Nöh
221
5
0
01 Nov 2024
Parameter-Efficient Fine-Tuning Medical Multimodal Large Language Models for Medical Visual Grounding
IEEE International Symposium on Biomedical Imaging (ISBI), 2024
Jinlong He
Pengfei Li
Gang Liu
Shenjun Zhong
233
5
0
31 Oct 2024
Multilingual Vision-Language Pre-training for the Remote Sensing Domain
João Daniel Silva
João Magalhães
D. Tuia
Bruno Martins
CLIP
VLM
236
6
0
30 Oct 2024
Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
Bo Jiang
Shaoyu Chen
Bencheng Liao
Xingyu Zhang
Wei Yin
Qian Zhang
Chang Huang
Wen Liu
Xinyu Wang
VLM
MLLM
LRM
307
77
0
29 Oct 2024
Your Image is Secretly the Last Frame of a Pseudo Video
Wenlong Chen
Wenlin Chen
Lapo Rastrelli
Yingzhen Li
DiffM
VGen
377
0
0
26 Oct 2024
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
International Conference on Learning Representations (ICLR), 2024
Kim Sung-Bin
Oh Hyun-Bin
JungMok Lee
Arda Senocak
Joon Son Chung
Tae-Hyun Oh
MLLM
VLM
445
15
0
23 Oct 2024
PETAH: Parameter Efficient Task Adaptation for Hybrid Transformers in a resource-limited Context
Maximilian Augustin
Syed Shakib Sarwar
Mostafa Elhoushi
Sai Qian Zhang
Yuecheng Li
B. D. Salvo
253
1
0
23 Oct 2024
Towards Real Zero-Shot Camouflaged Object Segmentation without Camouflaged Annotations
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Cheng Lei
Jie Fan
Xinran Li
Tianzhu Xiang
Ao Li
Ce Zhu
Le Zhang
277
4
0
22 Oct 2024
Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly
Neural Information Processing Systems (NeurIPS), 2024
Junsheng Zhou
Yu-Shen Liu
Zhizhong Han
ViT
284
21
0
21 Oct 2024
TIPS: Text-Image Pretraining with Spatial awareness
International Conference on Learning Representations (ICLR), 2024
Kevis-Kokitsi Maninis
Kaifeng Chen
Soham Ghosh
Arjun Karpur
Koert Chen
...
Jan Dlabal
Dan Gnanapragasam
Mojtaba Seyedhosseini
Howard Zhou
Andre Araujo
VLM
439
17
0
21 Oct 2024
A Survey of Hallucination in Large Visual Language Models
Wei Lan
Wenyi Chen
Qingfeng Chen
Shirui Pan
Huiyu Zhou
Yi-Lun Pan
LRM
315
11
0
20 Oct 2024
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
IEEE transactions on multimedia (IEEE TMM), 2024
Muhe Ding
Yang Ma
Pengda Qin
Yue Yu
Yuhong Li
Liqiang Nie
244
4
0
18 Oct 2024
ViCToR: Improving Visual Comprehension via Token Reconstruction for Pretraining LMMs
Yin Xie
Kaicheng Yang
Ninghua Yang
Weimo Deng
Xiangzi Dai
Tiancheng Gu
Yumeng Wang
Xiang An
Yongle Zhao
Ziyong Feng
MLLM
VLM
360
1
0
18 Oct 2024
Fusion from Decomposition: A Self-Supervised Approach for Image Fusion and Beyond
Pengwei Liang
Junjun Jiang
Qing Ma
Xianming Liu
Jiayi Ma
219
5
0
16 Oct 2024
VidCompress: Memory-Enhanced Temporal Compression for Video Understanding in Large Language Models
Xiaohan Lan
Yitian Yuan
Zequn Jie
Lin Ma
VLM
181
4
0
15 Oct 2024
Automatically Generating Visual Hallucination Test Cases for Multimodal Large Language Models
Zhongye Liu
Hongbin Liu
Yuepeng Hu
Zedian Shao
Neil Zhenqiang Gong
VLM
MLLM
158
1
0
15 Oct 2024
Browsing without Third-Party Cookies: What Do You See?
ACM/SIGCOMM Internet Measurement Conference (IMC), 2024
Maxwell Lin
Shihan Lin
Helen Wu
Karen Wang
Xiaowei Yang
BDL
477
45
0
14 Oct 2024
Locality Alignment Improves Vision-Language Models
International Conference on Learning Representations (ICLR), 2024
Ian Covert
Tony Sun
James Zou
Tatsunori Hashimoto
VLM
592
11
0
14 Oct 2024
Large-Scale 3D Medical Image Pre-training with Geometric Context Priors
Linshan Wu
Jiaxin Zhuang
Hao Chen
220
20
0
13 Oct 2024
Large Model for Small Data: Foundation Model for Cross-Modal RF Human Activity Recognition
ACM International Conference on Embedded Networked Sensor Systems (SenSys), 2024
Yuxuan Weng
Guoquan Wu
Tianyue Zheng
Yanbing Yang
Jun Luo
261
17
0
13 Oct 2024
Conjugated Semantic Pool Improves OOD Detection with Pre-trained Vision-Language Models
Neural Information Processing Systems (NeurIPS), 2024
Mengyuan Chen
Junyu Gao
Changsheng Xu
VLM
OODD
332
12
0
11 Oct 2024
OneRef: Unified One-tower Expression Grounding and Segmentation with Mask Referring Modeling
Neural Information Processing Systems (NeurIPS), 2024
Linhui Xiao
Xiaoshan Yang
Fang Peng
Yaowei Wang
Changsheng Xu
ObjD
434
22
0
10 Oct 2024
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
International Conference on Learning Representations (ICLR), 2024
Haoyi Zhu
Honghui Yang
Yating Wang
Jiange Yang
Limin Wang
Tong He
3DH
381
23
0
10 Oct 2024
Break the Visual Perception: Adversarial Attacks Targeting Encoded Visual Tokens of Large Vision-Language Models
ACM Multimedia (MM), 2024
Yubo Wang
Chaohu Liu
Yanqiu Qu
Haoyu Cao
Deqiang Jiang
Linli Xu
MLLM
AAML
150
15
0
09 Oct 2024
From Pixels to Tokens: Revisiting Object Hallucinations in Large Vision-Language Models
Yuying Shang
Xinyi Zeng
Yutao Zhu
Xiao Yang
Zhengwei Fang
Jingyuan Zhang
Jiawei Chen
Zinan Liu
Yu Tian
VLM
MLLM
817
2
0
09 Oct 2024
Treat Visual Tokens as Text? But Your MLLM Only Needs Fewer Efforts to See
Phu Pham
Kun Wan
Yu-Jhe Li
Zeliang Zhang
Daniel Miranda
Ajinkya Kale
Chenliang Xu
249
1
0
08 Oct 2024
AnyAttack: Towards Large-scale Self-supervised Adversarial Attacks on Vision-language Models
Computer Vision and Pattern Recognition (CVPR), 2024
Jiaming Zhang
Junhong Ye
Xingjun Ma
Yige Li
Yunfan Yang
Jitao Sang
Dit-Yan Yeung
AAML
VLM
302
0
0
07 Oct 2024
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
International Conference on Learning Representations (ICLR), 2024
Wanpeng Zhang
Zilong Xie
Yicheng Feng
Yijiang Li
Xingrun Xing
Sipeng Zheng
Zongqing Lu
MLLM
355
10
0
03 Oct 2024
UlcerGPT: A Multimodal Approach Leveraging Large Language and Vision Models for Diabetic Foot Ulcer Image Transcription
International Conference on Pattern Recognition (ICPR), 2024
Reza Basiri
Ali Abedi
Chau Nguyen
Milos R. Popovic
Shehroz S. Khan
LM&MA
MedIm
98
4
0
02 Oct 2024
EMMA: Efficient Visual Alignment in Multi-Modal LLMs
Sara Ghazanfari
Alexandre Araujo
Prashanth Krishnamurthy
Siddharth Garg
Farshad Khorrami
VLM
301
7
0
02 Oct 2024
HELPD: Mitigating Hallucination of LVLMs by Hierarchical Feedback Learning with Vision-enhanced Penalty Decoding
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Fan Yuan
Chi Qin
Xiaogang Xu
Piji Li
VLM
MLLM
162
9
0
30 Sep 2024
E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
Neural Information Processing Systems (NeurIPS), 2024
Ye Liu
Zongyang Ma
Chen Ma
Yang Wu
Ying Shan
Chang Wen Chen
267
52
0
26 Sep 2024
VLM's Eye Examination: Instruct and Inspect Visual Competency of Vision Language Models
Nam Hyeon-Woo
Moon Ye-Bin
Wonseok Choi
Lee Hyun
Tae-Hyun Oh
CoGe
249
3
0
23 Sep 2024
Effectively Enhancing Vision Language Large Models by Prompt Augmentation and Caption Utilization
Minyi Zhao
Jie Wang
Zerui Li
Jiyuan Zhang
Zhenbang Sun
Shuigeng Zhou
MLLM
VLM
319
3
0
22 Sep 2024
NEVLP: Noise-Robust Framework for Efficient Vision-Language Pre-training
Yiyi Tao
Zhuoyue Wang
Hang Zhang
Lun Wang
VLM
326
28
0
15 Sep 2024
Enhancing Long Video Understanding via Hierarchical Event-Based Memory
Dingxin Cheng
Mingda Li
Jingyu Liu
Yongxin Guo
Bin Jiang
Qingbin Liu
Xi Chen
Bo Zhao
271
12
0
10 Sep 2024
Revisiting Prompt Pretraining of Vision-Language Models
Zhenyuan Chen
Lingfeng Yang
Shuo Chen
Zhaowei Chen
Jiajun Liang
Xiang Li
MLLM
VPVLM
VLM
363
4
0
10 Sep 2024
Seeing Through the Mask: Rethinking Adversarial Examples for CAPTCHAs
Yahya Jabary
Andreas Plesner
Turlan Kuzhagaliyev
Roger Wattenhofer
AAML
195
2
0
09 Sep 2024
Top-GAP: Integrating Size Priors in CNNs for more Interpretability, Robustness, and Bias Mitigation
Lars Nieradzik
Henrike Stephani
Janis Keuper
FAtt
AAML
249
1
0
07 Sep 2024
Optimizing CLIP Models for Image Retrieval with Maintained Joint-Embedding Alignment
Similarity Search and Applications (SISAP), 2024
Konstantin Schall
Kai Uwe Barthel
Nico Hezel
Klaus Jung
VLM
291
8
0
03 Sep 2024
Understanding Multimodal Hallucination with Parameter-Free Representation Alignment
Yueqian Wang
Jianxin Liang
Yuxuan Wang
Huishuai Zhang
Dongyan Zhao
237
2
0
02 Sep 2024