EVA: Exploring the Limits of Masked Visual Representation Learning at Scale

Computer Vision and Pattern Recognition (CVPR), 2022
14 November 2022
Yuxin Fang
Wen Wang
Binhui Xie
Quan-Sen Sun
Ledell Yu Wu
Xinggang Wang
Tiejun Huang
Xinlong Wang
Yue Cao
    VLM, CLIP

Papers citing "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale"

50 / 579 papers shown
Can Large Multimodal Models Uncover Deep Semantics Behind Images?
Yixin Yang
Zheng Li
Qingxiu Dong
Heming Xia
Zhifang Sui
VLM
186
20
0
17 Feb 2024
II-MMR: Identifying and Improving Multi-modal Multi-hop Reasoning in Visual Question Answering
Jihyung Kil
Farideh Tavazoee
Luan Tuyen Chau
Joo-Kyung Kim
LRM
210
6
0
16 Feb 2024
Question-Instructed Visual Descriptions for Zero-Shot Video Question Answering
David Romero
Thamar Solorio
292
5
0
16 Feb 2024
Mitigating Object Hallucination in Large Vision-Language Models via Image-Grounded Guidance
Linxi Zhao
Yihe Deng
Weitong Zhang
Q. Gu
MLLM
321
30
0
13 Feb 2024
VisLingInstruct: Elevating Zero-Shot Learning in Multi-Modal Language Models with Autonomous Instruction Optimization
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Dongsheng Zhu
Xunzhu Tang
Weidong Han
Jinghui Lu
Yukun Zhao
Guoliang Xing
Junfeng Wang
D. Yin
VLM, MLLM
298
17
0
12 Feb 2024
Open-ended VQA benchmarking of Vision-Language models by exploiting Classification datasets and their semantic hierarchy
International Conference on Learning Representations (ICLR), 2024
Simon Ging
M. A. Bravo
Thomas Brox
VLM
401
19
0
11 Feb 2024
Large Language Models for Captioning and Retrieving Remote Sensing Images
João Daniel Silva
João Magalhães
D. Tuia
Bruno Martins
210
40
0
09 Feb 2024
Examining Gender and Racial Bias in Large Vision-Language Models Using a Novel Dataset of Parallel Images
Kathleen C. Fraser
S. Kiritchenko
271
64
0
08 Feb 2024
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz
Yair Kittenplon
Aviad Aberdam
Elad Ben Avraham
Oren Nuriel
Shai Mazor
Ron Litman
299
36
0
08 Feb 2024
EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters
Quan-Sen Sun
Jinsheng Wang
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Xinlong Wang
VLM, CLIP, MLLM
317
80
0
06 Feb 2024
Video-LaVIT: Unified Video-Language Pre-training with Decoupled Visual-Motional Tokenization
International Conference on Machine Learning (ICML), 2024
Yang Jin
Zhicheng Sun
Kun Xu
Kun Xu
Liwei Chen
...
Yuliang Liu
Chen Zhang
Yang Song
Kun Gai
Yadong Mu
VGen
262
78
0
05 Feb 2024
Delving into Multi-modal Multi-task Foundation Models for Road Scene Understanding: From Learning Paradigm Perspectives
IEEE Transactions on Intelligent Vehicles (TIV), 2024
Sheng Luo
Wei Chen
Wanxin Tian
Rui Liu
Luanxuan Hou
...
Ling Shao
Yi Yang
Bojun Gao
Qun Li
Guobin Wu
411
28
0
05 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
259
8
0
04 Feb 2024
Region-Based Representations Revisited
Michal Shlapentokh-Rothman
Ansel Blume
Yao Xiao
Yuqun Wu
TV Sethuraman
Heyi Tao
Jae Yong Lee
Wilfredo Torres
Yu-Xiong Wang
Derek Hoiem
487
14
0
04 Feb 2024
Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng
Wonjun Kang
Yicong Chen
Hyung Il Koo
Kangwook Lee
MLLM
263
14
0
02 Feb 2024
Hybrid Quantum Vision Transformers for Event Classification in High Energy Physics
Eyup B. Unlu
Marçal Comajoan Cara
Gopal Ramesh Dahale
Zhongtian Dong
Roy T. Forestano
...
Daniel Justice
Kyoungchul Kong
Tom Magorsch
Konstantin T. Matchev
Katia Matcheva
294
13
0
01 Feb 2024
ControlCap: Controllable Region-level Captioning
Yuzhong Zhao
Yue Liu
Zonghao Guo
Weijia Wu
Chen Gong
Fang Wan
QiXiang Ye
425
14
0
31 Jan 2024
Computer Vision for Primate Behavior Analysis in the Wild
Richard Vogg
Timo Lüddecke
Jonathan Henrich
Sharmita Dey
Matthias Nuske
...
Alexander Gail
Stefan Treue
H. Scherberger
Florentin Wörgötter
Alexander S. Ecker
406
15
0
29 Jan 2024
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
Xiao-wen Dong
Pan Zhang
Yuhang Zang
Yuhang Cao
Sijin Yu
...
Conghui He
Xingcheng Zhang
Yu Qiao
Dahua Lin
Yuan Liu
VLM, MLLM
370
344
0
29 Jan 2024
VIALM: A Survey and Benchmark of Visually Impaired Assistance with Large Models
Yi Zhao
Yilin Zhang
Rong Xiang
Jing Li
Hillming Li
337
26
0
29 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL, LRM
512
335
0
24 Jan 2024
STICKERCONV: Generating Multimodal Empathetic Responses from Scratch
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Yiqun Zhang
Fanheng Kong
Peidong Wang
Shuang Sun
Lingshuai Wang
Shi Feng
Daling Wang
Yifei Zhang
Kaisong Song
197
32
0
20 Jan 2024
Image Safeguarding: Reasoning with Conditional Vision Language Model and Obfuscating Unsafe Content Counterfactually
AAAI Conference on Artificial Intelligence (AAAI), 2024
Mazal Bethany
Brandon Wherry
Nishant Vishwamitra
Peyman Najafirad
DiffM
134
8
0
19 Jan 2024
Weakly Supervised Gaussian Contrastive Grounding with Large Multimodal Models for Video Question Answering
Haibo Wang
Chenghang Lai
Yixuan Sun
Weifeng Ge
390
13
0
19 Jan 2024
OMG-Seg: Is One Model Good Enough For All Segmentation?
Xiangtai Li
Haobo Yuan
Wei Li
Henghui Ding
Size Wu
Wenwei Zhang
Yining Li
Kai Chen
Chen Change Loy
VLM, MLLM, ViT
311
106
0
18 Jan 2024
Supervised Fine-tuning in turn Improves Visual Foundation Models
Xiaohu Jiang
Yixiao Ge
Yuying Ge
Dachuan Shi
Chun Yuan
Ying Shan
VLM, CLIP
253
14
0
18 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Jiaming Song
Yu Qiao
Jifeng Dai
AuLLM
240
70
0
18 Jan 2024
SkyEyeGPT: Unifying Remote Sensing Vision-Language Tasks via Instruction Tuning with Large Language Model
Yangfan Zhan
Zhitong Xiong
Yuan Yuan
MLLM
254
120
0
18 Jan 2024
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
International Conference on Machine Learning (ICML), 2024
Lianghui Zhu
Bencheng Liao
Qian Zhang
Xinlong Wang
Wenyu Liu
Xinggang Wang
Mamba
485
1,378
0
17 Jan 2024
Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer
Junhao Zheng
Qianli Ma
Zhen Liu
Binquan Wu
Hu Feng
CLL
344
26
0
17 Jan 2024
UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding
European Conference on Computer Vision (ECCV), 2024
Bowen Shi
Peisen Zhao
Zichen Wang
Yuhang Zhang
Yaoming Wang
...
Wenrui Dai
Junni Zou
Hongkai Xiong
Qi Tian
Xiaopeng Zhang
VLM
193
13
0
12 Jan 2024
Video Anomaly Detection and Explanation via Large Language Models
Hui Lv
Qianru Sun
253
52
0
11 Jan 2024
Latency-aware Road Anomaly Segmentation in Videos: A Photorealistic Dataset and New Metrics
Beiwen Tian
Huan-ang Gao
Leiyao Cui
Yupeng Zheng
Lan Luo
Baofeng Wang
Rong Zhi
Guyue Zhou
Hao Zhao
215
6
0
10 Jan 2024
Revisiting Adversarial Training at Scale
Computer Vision and Pattern Recognition (CVPR), 2024
Zeyu Wang
Xianhang Li
Hongru Zhu
Cihang Xie
427
32
0
09 Jan 2024
Effective pruning of web-scale datasets based on complexity of concept clusters
International Conference on Learning Representations (ICLR), 2024
Amro Abbas
E. Rusak
Kushal Tirumala
Wieland Brendel
Kamalika Chaudhuri
Ari S. Morcos
VLM, CLIP
297
28
0
09 Jan 2024
Denoising Vision Transformers
Jiawei Yang
Katie Z Luo
Jie Li
Kilian Q. Weinberger
Yonglong Tian
Yue Wang
DiffM
247
30
0
05 Jan 2024
BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model
Computer Vision and Pattern Recognition (CVPR), 2024
Yiran Song
Qianyu Zhou
Hefei Ling
Deng-Ping Fan
Xuequan Lu
Lizhuang Ma
VLM
510
20
0
04 Jan 2024
Masked Modeling for Self-supervised Representation Learning on Vision and Beyond
Siyuan Li
Luyuan Zhang
Zedong Wang
Di Wu
Lirong Wu
...
Jun Xia
Cheng Tan
Yang Liu
Baigui Sun
Stan Z. Li
SSL
300
28
0
31 Dec 2023
FerKD: Surgical Label Adaptation for Efficient Distillation
IEEE International Conference on Computer Vision (ICCV), 2023
Zhiqiang Shen
272
4
0
29 Dec 2023
Video Understanding with Large Language Models: A Survey
Yunlong Tang
Jing Bi
Siting Xu
Luchuan Song
Susan Liang
...
Feng Zheng
Jianguo Zhang
Chenliang Xu
Jiebo Luo
Chenliang Xu
VLM
720
170
0
29 Dec 2023
Learning Vision from Models Rivals Learning Vision from Data
Computer Vision and Pattern Recognition (CVPR), 2023
Yonglong Tian
Lijie Fan
Kaifeng Chen
Dina Katabi
Dilip Krishnan
Phillip Isola
279
73
0
28 Dec 2023
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Xiangxiang Chu
Limeng Qiao
Xinyang Lin
Shuang Xu
Yang Yang
...
Fei Wei
Xinyu Zhang
Bo Zhang
Xiaolin Wei
Chunhua Shen
MLLM
312
70
0
28 Dec 2023
ChartBench: A Benchmark for Complex Visual Reasoning in Charts
Zhengzhuo Xu
Sinan Du
Yiyan Qi
Chengjin Xu
Chun Yuan
Jian Guo
440
89
0
26 Dec 2023
FoodLMM: A Versatile Food Assistant using Large Multi-modal Model
Yuehao Yin
Huiyan Qi
B. Zhu
Yue Yu
Yu-Gang Jiang
Chong-Wah Ngo
267
40
0
22 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM, MLLM
641
2,210
0
21 Dec 2023
GSVA: Generalized Segmentation via Multimodal Large Language Models
Computer Vision and Pattern Recognition (CVPR), 2023
Zhuofan Xia
Dongchen Han
Yizeng Han
Xuran Pan
Shiji Song
Gao Huang
VLM
597
125
0
15 Dec 2023
General Object Foundation Model for Images and Videos at Scale
Computer Vision and Pattern Recognition (CVPR), 2023
Junfeng Wu
Yi Jiang
Qihao Liu
Zehuan Yuan
Xiang Bai
Song Bai
VOS, VLM
343
79
0
14 Dec 2023
DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving
Wenhai Wang
Jiangwei Xie
ChuanYang Hu
Haoming Zou
Jianan Fan
Wenwen Tong
Yang Wen
Silei Wu
Hanming Deng
Zhiqi Li
363
217
0
14 Dec 2023
ViLA: Efficient Video-Language Alignment for Video Question Answering
European Conference on Computer Vision (ECCV), 2023
Xijun Wang
Junbang Liang
Chun-Kai Wang
Kenan Deng
Yu Lou
Ming-Chyuan Lin
Shan Yang
325
22
0
13 Dec 2023
Building Universal Foundation Models for Medical Image Analysis with Spatially Adaptive Networks
Lingxiao Luo
Xuanzhong Chen
Bingda Tang
Xinsheng Chen
Rong Han
Chengpeng Hu
Yujiang Li
Ting Chen
MedIm
218
3
0
12 Dec 2023