2211.07636
EVA: Exploring the Limits of Masked Visual Representation Learning at Scale
Computer Vision and Pattern Recognition (CVPR), 2023
14 November 2022
Yuxin Fang
Wen Wang
Binhui Xie
Quan-Sen Sun
Ledell Yu Wu
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
VLM
CLIP
HuggingFace (1 upvote)
Github (2496★)
Papers citing
"EVA: Exploring the Limits of Masked Visual Representation Learning at Scale"
50 / 579 papers shown
EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning
Mingjie Ma
Zhihuan Yu
Yichao Ma
Guohui Li
LRM
194
2
0
22 Apr 2024
Lost in Space: Probing Fine-grained Spatial Understanding in Vision and Language Resamplers
Georgios Pantazopoulos
Alessandro Suglia
Oliver Lemon
Arash Eshghi
VLM
204
8
0
21 Apr 2024
Dynamic in Static: Hybrid Visual Correspondence for Self-Supervised Video Object Segmentation
Gensheng Pei
Yazhou Yao
Jianbo Jiao
Wenguan Wang
Liqiang Nie
Jinhui Tang
VOS
246
1
0
21 Apr 2024
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu
Yushi Hu
Bangzheng Li
Yu Feng
Haoyu Wang
Xudong Lin
Dan Roth
Noah A. Smith
Wei-Chiu Ma
Ranjay Krishna
VLM
LRM
MLLM
564
305
0
18 Apr 2024
Semantic-Based Active Perception for Humanoid Visual Tasks with Foveal Sensors
Joao Luzio
Alexandre Bernardino
Plinio Moreno
159
0
0
16 Apr 2024
MEEL: Multi-Modal Event Evolution Learning
Zhengwei Tao
Zhi Jin
Junqiang Huang
Xiancai Chen
Xiaoying Bai
Haiyan Zhao
Yifan Zhang
Chongyang Tao
174
1
0
16 Apr 2024
HOI-Ref: Hand-Object Interaction Referral in Egocentric Vision
Siddhant Bansal
Michael Wray
Dima Damen
216
10
0
15 Apr 2024
GLID: Pre-training a Generalist Encoder-Decoder Vision Model
Jihao Liu
Jinliang Zheng
Yu Liu
Jiaming Song
VLM
202
6
0
11 Apr 2024
BRAVE: Broadening the visual encoding of vision-language models
European Conference on Computer Vision (ECCV), 2024
Oğuzhan Fatih Kar
A. Tonioni
Petra Poklukar
Achin Kulshrestha
Amir Zamir
Federico Tombari
MLLM
VLM
296
57
0
10 Apr 2024
SparseAD: Sparse Query-Centric Paradigm for Efficient End-to-End Autonomous Driving
Diankun Zhang
Guoan Wang
Runwen Zhu
Jianbo Zhao
Xiwu Chen
...
Haotian Yao
Chi Zhang
Xiaojun Liu
Xiaoguang Di
Bin Li
238
33
0
10 Apr 2024
Monocular 3D lane detection for Autonomous Driving: Recent Achievements, Challenges, and Outlooks
Fulong Ma
Weiqing Qi
Guoyang Zhao
Linwei Zheng
Sheng Wang
Yuxuan Liu
Ming-Yuan Liu
292
16
0
10 Apr 2024
MoReVQA: Exploring Modular Reasoning Models for Video Question Answering
Juhong Min
Shyamal Buch
Arsha Nagrani
Minsu Cho
Cordelia Schmid
LRM
418
62
0
09 Apr 2024
MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding
Bo He
Hengduo Li
Young Kyun Jang
Menglin Jia
Xuefei Cao
Ashish Shah
Abhinav Shrivastava
Ser-Nam Lim
MLLM
356
180
0
08 Apr 2024
Progressive Alignment with VLM-LLM Feature to Augment Defect Classification for the ASE Dataset
Chih-Chung Hsu
Chia-Ming Lee
Chun-Hung Sun
Kuang-Ming Wu
156
0
0
08 Apr 2024
RoboMP$^2$: A Robotic Multimodal Perception-Planning Framework with Multimodal Large Language Models
Qi Lv
Haochuan Li
Xiang Deng
Rui Shao
Michael Yu Wang
Liqiang Nie
LRM
LM&Ro
231
4
0
07 Apr 2024
Cross-Modal Conditioned Reconstruction for Language-guided Medical Image Segmentation
IEEE Transactions on Medical Imaging (IEEE TMI), 2024
Xiaoshuang Huang
Hongxiang Li
Meng Cao
Long Chen
Chenyu You
Dong An
VLM
264
17
0
03 Apr 2024
What Are We Measuring When We Evaluate Large Vision-Language Models? An Analysis of Latent Factors and Biases
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
A. M. H. Tiong
Junqi Zhao
Boyang Albert Li
Junnan Li
Guosheng Lin
Caiming Xiong
255
12
0
03 Apr 2024
ViTamin: Designing Scalable Vision Models in the Vision-Language Era
Computer Vision and Pattern Recognition (CVPR), 2024
Jieneng Chen
Qihang Yu
Xiaohui Shen
Yaoyao Liu
Liang-Chieh Chen
3DV
VLM
411
50
0
02 Apr 2024
Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning
Rongjie Li
Yu Wu
Xuming He
MLLM
LRM
VLM
208
3
0
01 Apr 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
267
10
0
28 Mar 2024
Toward Interactive Regional Understanding in Vision-Large Language Models
Jungbeom Lee
Sanghyuk Chun
Sangdoo Yun
VLM
304
4
0
27 Mar 2024
Elysium: Exploring Object-level Perception in Videos via MLLM
Hang Wang
Yanjie Wang
Yongjie Ye
Yuxiang Nie
Can Huang
MLLM
315
38
0
25 Mar 2024
If CLIP Could Talk: Understanding Vision-Language Model Representations Through Their Preferred Concept Descriptions
Reza Esfandiarpoor
Cristina Menghini
Stephen H. Bach
CoGe
VLM
305
15
0
25 Mar 2024
A Multimodal Approach for Cross-Domain Image Retrieval
Lucas Iijima
Tania Stathaki
213
1
0
22 Mar 2024
MMIDR: Teaching Large Language Model to Interpret Multimodal Misinformation via Knowledge Distillation
Longzheng Wang
Xiaohan Xu
Lei Zhang
Jiarui Lu
Yongxiu Xu
Hongbo Xu
Xuancheng Huang
Chuang Zhang
304
8
0
21 Mar 2024
Improved Baselines for Data-efficient Perceptual Augmentation of LLMs
Théophane Vallaeys
Mustafa Shukor
Matthieu Cord
Jakob Verbeek
313
16
0
20 Mar 2024
SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models
Tongtian Yue
Jie Cheng
Longteng Guo
Xingyuan Dai
Zijia Zhao
Xingjian He
Gang Xiong
Yisheng Lv
Jing Liu
216
13
0
20 Mar 2024
When Do We Not Need Larger Vision Models?
Baifeng Shi
Ziyang Wu
Maolin Mao
Xin Wang
Trevor Darrell
VLM
LRM
407
70
0
19 Mar 2024
VisualCritic: Making LMMs Perceive Visual Quality Like Humans
Zhipeng Huang
Zhizheng Zhang
Yiting Lu
Zheng-Jun Zha
Zhibo Chen
Baining Guo
MLLM
243
15
0
19 Mar 2024
ViTGaze: Gaze Following with Interaction Features in Vision Transformers
Yuehao Song
Xinggang Wang
Jingfeng Yao
Wenyu Liu
Jinglin Zhang
Xiangmin Xu
ViT
216
15
0
19 Mar 2024
Fusion Transformer with Object Mask Guidance for Image Forgery Analysis
Dimitrios Karageorgiou
Giorgos Kordopatis-Zilos
Symeon Papadopoulos
ViT
195
13
0
18 Mar 2024
Better (pseudo-)labels for semi-supervised instance segmentation
François Porcher
Camille Couprie
Marc Szafraniec
Jakob Verbeek
ISeg
171
3
0
18 Mar 2024
Depth-induced Saliency Comparison Network for Diagnosis of Alzheimer's Disease via Jointly Analysis of Visual Stimuli and Eye Movements
Yu Liu
Wenlin Zhang
Shaochu Wang
Fangyu Zuo
Peiguang Jing
Yong Ji
124
3
0
15 Mar 2024
Knowledge Condensation and Reasoning for Knowledge-based VQA
Dongze Hao
Jian Jia
Longteng Guo
Qunbo Wang
Te Yang
...
Yanhua Cheng
Bo Wang
Quan Chen
Han Li
Jing Liu
186
3
0
15 Mar 2024
UniCode: Learning a Unified Codebook for Multimodal Large Language Models
European Conference on Computer Vision (ECCV), 2024
Sipeng Zheng
Bohan Zhou
Yicheng Feng
Ye Wang
Zongqing Lu
VLM
MLLM
219
14
0
14 Mar 2024
MIM4D: Masked Modeling with Multi-View Video for Autonomous Driving Representation Learning
Jialv Zou
Bencheng Liao
Qian Zhang
Wenyu Liu
Xinggang Wang
239
6
0
13 Mar 2024
ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions
Computer Vision and Pattern Recognition (CVPR), 2024
Chunlong Xia
Xinliang Wang
Feng Lv
Xin Hao
Yifeng Shi
ViT
417
127
0
12 Mar 2024
FocusCLIP: Multimodal Subject-Level Guidance for Zero-Shot Transfer in Human-Centric Tasks
Muhammad Gul Zain Ali Khan
Muhammad Ferjad Naeem
F. Tombari
Luc Van Gool
Didier Stricker
Muhammad Zeshan Afzal
VLM
CLIP
198
0
0
11 Mar 2024
VLM-PL: Advanced Pseudo Labeling Approach for Class Incremental Object Detection via Vision-Language Model
Junsu Kim
Yunhoe Ku
Jihyeon Kim
Junuk Cha
Seungryul Baek
ObjD
VLM
330
24
0
08 Mar 2024
Spatiotemporal Predictive Pre-training for Robotic Motor Control
Jiange Yang
Bei Liu
Jianlong Fu
Bocheng Pan
Gangshan Wu
Limin Wang
369
20
0
08 Mar 2024
Embodied Understanding of Driving Scenarios
European Conference on Computer Vision (ECCV), 2024
Yunsong Zhou
Linyan Huang
Qingwen Bu
Jia Zeng
Tianyu Li
Hang Qiu
Hongzi Zhu
Minyi Guo
Yu Qiao
Hongyang Li
LM&Ro
254
53
0
07 Mar 2024
Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers
Tsai-Shien Chen
Aliaksandr Siarohin
Willi Menapace
Ekaterina Deyneka
Hsiang-wei Chao
...
Yuwei Fang
Hsin-Ying Lee
Jian Ren
Ming-Hsuan Yang
Sergey Tulyakov
VGen
366
338
0
29 Feb 2024
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
318
85
0
29 Feb 2024
VideoMAC: Video Masked Autoencoders Meet ConvNets
Gensheng Pei
Tao Chen
XiRuo Jiang
Huafeng Liu
Zeren Sun
Yazhou Yao
VGen
243
19
0
29 Feb 2024
Vision Transformers with Natural Language Semantics
Young-Kyung Kim
Matías Di Martino
Guillermo Sapiro
ViT
153
7
0
27 Feb 2024
RoboCodeX: Multimodal Code Generation for Robotic Behavior Synthesis
Yao Mu
Junting Chen
Qinglong Zhang
Shoufa Chen
Qiaojun Yu
...
Wenhai Wang
Jifeng Dai
Yu Qiao
Mingyu Ding
Ping Luo
249
46
0
25 Feb 2024
Uncertainty-Aware Evaluation for Vision-Language Models
Vasily Kostumov
Bulat Nutfullin
Oleg Pilipenko
Eugene Ilyushin
ELM
436
16
0
22 Feb 2024
SoMeLVLM: A Large Vision Language Model for Social Media Processing
Xinnong Zhang
Haoyu Kuang
Xinyi Mou
Hanjia Lyu
Kun Wu
Siming Chen
Jiebo Luo
Xuanjing Huang
Zhongyu Wei
MLLM
217
13
0
20 Feb 2024
VideoPrism: A Foundational Visual Encoder for Video Understanding
Long Zhao
N. B. Gundavarapu
Liangzhe Yuan
Hao Zhou
Shen Yan
...
Huisheng Wang
Hartwig Adam
Mikhail Sirotenko
Ting Liu
Boqing Gong
VGen
386
67
0
20 Feb 2024
Pushing Auto-regressive Models for 3D Shape Generation at Capacity and Scalability
Xue-Qing Qian
Yu Wang
Simian Luo
Yinda Zhang
Ying Tai
...
Xiangyang Xue
Bo Zhao
Tiejun Huang
Yunsheng Wu
Yanwei Fu
251
7
0
19 Feb 2024