ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2303.15389
  4. Cited By
EVA-CLIP: Improved Training Techniques for CLIP at Scale

EVA-CLIP: Improved Training Techniques for CLIP at Scale

27 March 2023
Quan-Sen Sun
Yuxin Fang
Ledell Yu Wu
Xinlong Wang
Yue Cao
    CLIP
    VLM
ArXivPDFHTML

Papers citing "EVA-CLIP: Improved Training Techniques for CLIP at Scale"

50 / 357 papers shown
Title
Side4Video: Spatial-Temporal Side Network for Memory-Efficient
  Image-to-Video Transfer Learning
Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning
Huanjin Yao
Wenhao Wu
Zhiheng Li
VLM
87
9
0
27 Nov 2023
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
GPT4Vis: What Can GPT-4 Do for Zero-shot Visual Recognition?
Wenhao Wu
Huanjin Yao
Mengxi Zhang
Yuxin Song
Wanli Ouyang
Jingdong Wang
VLM
22
29
0
27 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact
  with Humans via Natural Language Feedback
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
24
56
0
16 Nov 2023
Chat-UniVi: Unified Visual Representation Empowers Large Language Models
  with Image and Video Understanding
Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding
Peng Jin
Ryuichi Takanobu
Caiwan Zhang
Xiaochun Cao
Li-ming Yuan
MLLM
32
217
0
14 Nov 2023
InfMLLM: A Unified Framework for Visual-Language Tasks
InfMLLM: A Unified Framework for Visual-Language Tasks
Qiang-feng Zhou
Zhibin Wang
Wei Chu
Yinghui Xu
Hao Li
Yuan Qi
MLLM
16
11
0
12 Nov 2023
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts
Yichen Gong
Delong Ran
Jinyuan Liu
Conglei Wang
Tianshuo Cong
Anyu Wang
Sisi Duan
Xiaoyun Wang
MLLM
129
117
0
09 Nov 2023
CogVLM: Visual Expert for Pretrained Language Models
CogVLM: Visual Expert for Pretrained Language Models
Weihan Wang
Qingsong Lv
Wenmeng Yu
Wenyi Hong
Ji Qi
...
Bin Xu
Juanzi Li
Yuxiao Dong
Ming Ding
Jie Tang
VLM
MLLM
17
444
0
06 Nov 2023
CapsFusion: Rethinking Image-Text Data at Scale
CapsFusion: Rethinking Image-Text Data at Scale
Qiying Yu
Quan-Sen Sun
Xiaosong Zhang
Yufeng Cui
Fan Zhang
Yue Cao
Xinlong Wang
Jingjing Liu
VLM
21
53
0
31 Oct 2023
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial
  Understanding
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
Haoxiang Wang
Pavan Kumar Anasosalu Vasu
Fartash Faghri
Raviteja Vemulapalli
Mehrdad Farajtabar
Sachin Mehta
Mohammad Rastegari
Oncel Tuzel
Hadi Pouransari
VLM
11
65
0
23 Oct 2023
SILC: Improving Vision Language Pretraining with Self-Distillation
SILC: Improving Vision Language Pretraining with Self-Distillation
Muhammad Ferjad Naeem
Yongqin Xian
Xiaohua Zhai
Lukas Hoyer
Luc Van Gool
F. Tombari
VLM
17
32
0
20 Oct 2023
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language
  Models
From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang
Yuchen Liu
Songlin Liu
Jiné Zhao
Hao Zhang
Zhen Gao
Xiaopeng Zhang
Jin Li
Hongkai Xiong
MLLM
VLM
31
34
0
13 Oct 2023
Can We Edit Multimodal Large Language Models?
Can We Edit Multimodal Large Language Models?
Siyuan Cheng
Bo Tian
Qingbin Liu
Xi Chen
Yongheng Wang
Huajun Chen
Ningyu Zhang
MLLM
28
28
0
12 Oct 2023
Uni3D: Exploring Unified 3D Representation at Scale
Uni3D: Exploring Unified 3D Representation at Scale
Junsheng Zhou
Jinsheng Wang
Baorui Ma
Yu-Shen Liu
Tiejun Huang
Xinlong Wang
26
86
0
10 Oct 2023
BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex
  Selectivity
BrainSCUBA: Fine-Grained Natural Language Captions of Visual Cortex Selectivity
Andrew F. Luo
Margaret M. Henderson
Michael J. Tarr
Brian Karrer
21
15
0
06 Oct 2023
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language
  Models
ECoFLaP: Efficient Coarse-to-Fine Layer-Wise Pruning for Vision-Language Models
Yi-Lin Sung
Jaehong Yoon
Mohit Bansal
VLM
15
14
0
04 Oct 2023
ReForm-Eval: Evaluating Large Vision Language Models via Unified
  Re-Formulation of Task-Oriented Benchmarks
ReForm-Eval: Evaluating Large Vision Language Models via Unified Re-Formulation of Task-Oriented Benchmarks
Zejun Li
Ye Wang
Mengfei Du
Qingwen Liu
Binhao Wu
...
Zhihao Fan
Jie Fu
Jingjing Chen
Xuanjing Huang
Zhongyu Wei
20
13
0
04 Oct 2023
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative
  Vokens
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
Kaizhi Zheng
Xuehai He
Xin Eric Wang
MLLM
17
91
0
03 Oct 2023
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense
  Prediction
CLIPSelf: Vision Transformer Distills Itself for Open-Vocabulary Dense Prediction
Size Wu
Wenwei Zhang
Lumin Xu
Sheng Jin
Xiangtai Li
Wentao Liu
Chen Change Loy
CLIP
VLM
16
67
0
02 Oct 2023
Making LLaMA SEE and Draw with SEED Tokenizer
Making LLaMA SEE and Draw with SEED Tokenizer
Yuying Ge
Sijie Zhao
Ziyun Zeng
Yixiao Ge
Chen Li
Xintao Wang
Ying Shan
22
127
0
02 Oct 2023
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data
  Filtering
The Devil is in the Details: A Deep Dive into the Rabbit Hole of Data Filtering
Hai-ping Yu
Yu Tian
Sateesh Kumar
Linjie Yang
Heng Wang
VLM
28
17
0
27 Sep 2023
BT-Adapter: Video Conversation is Feasible Without Video Instruction
  Tuning
BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning
Ruyang Liu
Chen Li
Yixiao Ge
Ying Shan
Thomas H. Li
Ge Li
22
29
0
27 Sep 2023
DreamLLM: Synergistic Multimodal Comprehension and Creation
DreamLLM: Synergistic Multimodal Comprehension and Creation
Runpei Dong
Chunrui Han
Yuang Peng
Zekun Qi
Zheng Ge
...
Hao-Ran Wei
Xiangwen Kong
Xiangyu Zhang
Kaisheng Ma
Li Yi
MLLM
28
168
0
20 Sep 2023
Large Content And Behavior Models To Understand, Simulate, And Optimize
  Content And Behavior
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
Ashmit Khandelwal
Aditya Agrawal
Aanisha Bhattacharyya
Yaman Kumar Singla
Somesh Singh
...
Ishita Dasgupta
Stefano Petrangeli
R. Shah
Changyou Chen
Balaji Krishnamurthy
11
8
0
01 Sep 2023
Decoding Natural Images from EEG for Object Recognition
Decoding Natural Images from EEG for Object Recognition
Yonghao Song
Bingchuan Liu
Xiang Li
Nanlin Shi
Yijun Wang
Xiaorong Gao
14
22
0
25 Aug 2023
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual
  Questions
BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions
Wenbo Hu
Y. Xu
Y. Li
W. Li
Z. Chen
Z. Tu
MLLM
VLM
28
120
0
19 Aug 2023
Applicability of scaling laws to vision encoding models
Applicability of scaling laws to vision encoding models
Takuya Matsuyama
K. Sasaki
Shinji Nishimoto
MedIm
6
4
0
01 Aug 2023
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Foundational Models Defining a New Era in Vision: A Survey and Outlook
Muhammad Awais
Muzammal Naseer
Salman Khan
Rao Muhammad Anwer
Hisham Cholakkal
M. Shah
Ming Yang
F. Khan
VLM
18
116
0
25 Jul 2023
What Can Simple Arithmetic Operations Do for Temporal Modeling?
What Can Simple Arithmetic Operations Do for Temporal Modeling?
Wenhao Wu
Yuxin Song
Zhun Sun
Jingdong Wang
Chang Xu
Wanli Ouyang
38
8
0
18 Jul 2023
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Fine-grained Text-Video Retrieval with Frozen Image Encoders
Zuozhuo Dai
Fang Shao
Qingkun Su
Zilong Dong
Siyu Zhu
162
1
0
14 Jul 2023
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding
  and Generation
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
Yi Wang
Yinan He
Yizhuo Li
Kunchang Li
Jiashuo Yu
...
Ping Luo
Ziwei Liu
Yali Wang
Limin Wang
Yu Qiao
VLM
VGen
19
241
0
13 Jul 2023
Emu: Generative Pretraining in Multimodality
Emu: Generative Pretraining in Multimodality
Quan-Sen Sun
Qiying Yu
Yufeng Cui
Fan Zhang
Xiaosong Zhang
Yueze Wang
Hongcheng Gao
Jingjing Liu
Tiejun Huang
Xinlong Wang
MLLM
27
126
0
11 Jul 2023
What Matters in Training a GPT4-Style Language Model with Multimodal
  Inputs?
What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng
Hanbo Zhang
Jiani Zheng
Jiangnan Xia
Guoqiang Wei
Yang Wei
Yuchen Zhang
Tao Kong
MLLM
19
71
0
05 Jul 2023
End-to-end Autonomous Driving: Challenges and Frontiers
End-to-end Autonomous Driving: Challenges and Frontiers
Li Chen
Peng Wu
Kashyap Chitta
Bernhard Jaeger
Andreas Geiger
Hongyang Li
3DV
32
260
0
29 Jun 2023
Towards Open Vocabulary Learning: A Survey
Towards Open Vocabulary Learning: A Survey
Jianzong Wu
Xiangtai Li
Shilin Xu
Haobo Yuan
Henghui Ding
...
Jiangning Zhang
Yu Tong
Xudong Jiang
Bernard Ghanem
Dacheng Tao
ObjD
VLM
27
134
0
28 Jun 2023
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy
  within a \$10,000 Budget; An Extra \$4,000 Unlocks 81.8% Accuracy
CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a \10,000 Budget; An Extra \4,000 Unlocks 81.8% Accuracy
Xianhang Li
Zeyu Wang
Cihang Xie
CLIP
VLM
40
19
0
27 Jun 2023
A Survey on Multimodal Large Language Models
A Survey on Multimodal Large Language Models
Shukang Yin
Chaoyou Fu
Sirui Zhao
Ke Li
Xing Sun
Tong Bill Xu
Enhong Chen
MLLM
LRM
33
551
0
23 Jun 2023
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
Sihan Chen
Xingjian He
Handong Li
Xiaojie Jin
Jiashi Feng
J. Liu
VLM
CLIP
22
8
0
15 Jun 2023
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to
  Enhance Visio-Linguistic Compositional Understanding
Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding
Le Zhang
Rabiul Awal
Aishwarya Agrawal
CoGe
VLM
18
9
0
15 Jun 2023
Towards AGI in Computer Vision: Lessons Learned from GPT and Large
  Language Models
Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models
Lingxi Xie
Longhui Wei
Xiaopeng Zhang
Kaifeng Bi
Xiaotao Gu
Jianlong Chang
Qi Tian
29
6
0
14 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
16
53
0
13 Jun 2023
Fine-Grained Visual Prompting
Fine-Grained Visual Prompting
Lingfeng Yang
Yueze Wang
Xiang Li
Xinlong Wang
Jian Yang
ObjD
VLM
16
59
0
07 Jun 2023
Brain Diffusion for Visual Exploration: Cortical Discovery using Large
  Scale Generative Models
Brain Diffusion for Visual Exploration: Cortical Discovery using Large Scale Generative Models
Andrew F. Luo
Margaret M. Henderson
Leila Wehbe
Michael J. Tarr
DiffM
14
16
0
05 Jun 2023
Evaluating the Capabilities of Multi-modal Reasoning Models with
  Synthetic Task Data
Evaluating the Capabilities of Multi-modal Reasoning Models with Synthetic Task Data
Nathan Vaska
Victoria Helus
LRM
7
1
0
01 Jun 2023
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and
  Dataset
VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
Sihan Chen
Handong Li
Qunbo Wang
Zijia Zhao
Ming-Ting Sun
Xinxin Zhu
J. Liu
30
95
0
29 May 2023
On Evaluating Adversarial Robustness of Large Vision-Language Models
On Evaluating Adversarial Robustness of Large Vision-Language Models
Yunqing Zhao
Tianyu Pang
Chao Du
Xiao Yang
Chongxuan Li
Ngai-man Cheung
Min-Bin Lin
VLM
AAML
MLLM
6
166
0
26 May 2023
ChatBridge: Bridging Modalities with Large Language Model as a Language
  Catalyst
ChatBridge: Bridging Modalities with Large Language Model as a Language Catalyst
Zijia Zhao
Longteng Guo
Tongtian Yue
Si-Qing Chen
Shuai Shao
Xinxin Zhu
Zehuan Yuan
Jing Liu
MLLM
27
51
0
25 May 2023
If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based
  Text-to-Image Generation by Selection
If at First You Don't Succeed, Try, Try Again: Faithful Diffusion-based Text-to-Image Generation by Selection
Shyamgopal Karthik
Karsten Roth
Massimiliano Mancini
Zeynep Akata
24
20
0
22 May 2023
What Makes for Good Visual Tokenizers for Large Language Models?
What Makes for Good Visual Tokenizers for Large Language Models?
Guangzhi Wang
Yixiao Ge
Xiaohan Ding
Mohan S. Kankanhalli
Ying Shan
MLLM
VLM
14
38
0
20 May 2023
An Inverse Scaling Law for CLIP Training
An Inverse Scaling Law for CLIP Training
Xianhang Li
Zeyu Wang
Cihang Xie
VLM
CLIP
35
54
0
11 May 2023
Self-Chained Image-Language Model for Video Localization and Question
  Answering
Self-Chained Image-Language Model for Video Localization and Question Answering
Shoubin Yu
Jaemin Cho
Prateek Yadav
Mohit Bansal
31
129
0
11 May 2023
Previous
12345678
Next