ResearchTrend.AI
EVA-CLIP: Improved Training Techniques for CLIP at Scale


27 March 2023
Quan-Sen Sun
Yuxin Fang
Ledell Yu Wu
Xinlong Wang
Yue Cao
    CLIP
    VLM

Papers citing "EVA-CLIP: Improved Training Techniques for CLIP at Scale"

50 / 357 papers shown
Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval
Naoya Sogi
Takashi Shibata
Makoto Terao
VLM
28
1
0
17 Jul 2024
OmniBind: Large-scale Omni Multimodal Representation via Binding Spaces
Zehan Wang
Ziang Zhang
Hang Zhang
Luping Liu
Rongjie Huang
Xize Cheng
Hengshuang Zhao
Zhou Zhao
30
7
0
16 Jul 2024
Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion
Philipp Allgeuer
Kyra Ahrens
Stefan Wermter
CLIP
VLM
27
3
0
15 Jul 2024
Textual Query-Driven Mask Transformer for Domain Generalized Segmentation
Byeonghyun Pak
Byeongju Woo
Sunghwan Kim
Dae-Hwan Kim
Hoseong Kim
32
3
0
12 Jul 2024
HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models
Runhui Huang
Xinpeng Ding
Chunwei Wang
J. N. Han
Yulong Liu
Hengshuang Zhao
Hang Xu
Lu Hou
Wei Zhang
Xiaodan Liang
VLM
23
8
0
11 Jul 2024
DenseFusion-1M: Merging Vision Experts for Comprehensive Multimodal Perception
Xiaotong Li
Fan Zhang
Haiwen Diao
Yueze Wang
Xinlong Wang
Ling-yu Duan
VLM
24
25
0
11 Jul 2024
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh
Cheng-Yu Hsieh
Shih-Ying Yeh
Louis Béthune
Hadi Pour Ansari
Pavan Kumar Anasosalu Vasu
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Marco Cuturi
58
4
0
09 Jul 2024
Multimodal Language Models for Domain-Specific Procedural Video Summarization
Nafisa Hussain
23
0
0
07 Jul 2024
AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
Yuhan Zhu
Yuyang Ji
Zhiyu Zhao
Gangshan Wu
Limin Wang
VLM
39
7
0
05 Jul 2024
MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis
Asma Alkhaldi
Raneem Alnajim
Layan Alabdullatef
Rawan Alyahya
Jun Chen
Deyao Zhu
Ahmed Z. Alsinan
Mohamed Elhoseiny
LM&MA
MedIm
51
22
0
04 Jul 2024
Lateralization LoRA: Interleaved Instruction Tuning with Modality-Specialized Adaptations
Zhiyang Xu
Minqian Liu
Ying Shen
Joy Rimchala
Jiaxin Zhang
Qifan Wang
Yu Cheng
Lifu Huang
VLM
37
2
0
04 Jul 2024
FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources
Xiyuan Wei
Fanjiang Ye
Ori Yonay
Xingyu Chen
Baixi Sun
Dingwen Tao
Tianbao Yang
VLM
CLIP
46
2
0
01 Jul 2024
Coding for Intelligence from the Perspective of Category
Wenhan Yang
Zixuan Hu
Lilang Lin
Jiaying Liu
Ling-Yu Duan
AI4CE
33
1
0
01 Jul 2024
Learning Robust 3D Representation from CLIP via Dual Denoising
Shuqing Luo
Bowen Qu
Wei-Nan Gao
37
1
0
01 Jul 2024
GenderBias-VL: Benchmarking Gender Bias in Vision Language Models via Counterfactual Probing
Yisong Xiao
Aishan Liu
QianJia Cheng
Zhenfei Yin
Siyuan Liang
Jiapeng Li
Jing Shao
Xianglong Liu
Dacheng Tao
26
4
0
30 Jun 2024
Data curation via joint example selection further accelerates multimodal learning
Talfan Evans
Nikhil Parthasarathy
Hamza Merzic
Olivier J. Hénaff
32
12
0
25 Jun 2024
Cambrian-1: A Fully Open, Vision-Centric Exploration of Multimodal LLMs
Shengbang Tong
Ellis L Brown
Penghao Wu
Sanghyun Woo
Manoj Middepogu
...
Xichen Pan
Austin Wang
Rob Fergus
Yann LeCun
Saining Xie
3DV
MLLM
37
278
0
24 Jun 2024
Prism: A Framework for Decoupling and Assessing the Capabilities of VLMs
Yuxuan Qiao
Haodong Duan
Xinyu Fang
Junming Yang
Lin Chen
Songyang Zhang
Jiaqi Wang
Dahua Lin
Kai Chen
LRM
32
18
0
20 Jun 2024
VGA: Vision GUI Assistant -- Minimizing Hallucinations through Image-Centric Fine-Tuning
Ziyang Meng
Yu Dai
Zezheng Gong
Shaoxiong Guo
Minglong Tang
Tongquan Wei
VLM
16
3
0
20 Jun 2024
Is AI fun? HumorDB: a curated dataset and benchmark to investigate graphical humor
Veedant Jain
Felipe dos Santos Alves Feitosa
Gabriel Kreiman
VLM
33
2
0
19 Jun 2024
Disturbing Image Detection Using LMM-Elicited Emotion Embeddings
Maria Tzelepi
Vasileios Mezaris
18
3
0
18 Jun 2024
Unveiling Encoder-Free Vision-Language Models
Haiwen Diao
Yufeng Cui
Xiaotong Li
Yueze Wang
Huchuan Lu
Xinlong Wang
VLM
32
27
0
17 Jun 2024
Hallucination Mitigation Prompts Long-term Video Understanding
Yiwei Sun
Zhihang Liu
Chuanbin Liu
Bowei Pu
Zhihan Zhang
Hongtao Xie
VLM
MLLM
33
2
0
17 Jun 2024
ClawMachine: Learning to Fetch Visual Tokens for Referential Comprehension
Tianren Ma
Lingxi Xie
Yunjie Tian
Boyu Yang
Yuan Zhang
37
0
0
17 Jun 2024
MINT-1T: Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens
Anas Awadalla
Le Xue
Oscar Lo
Manli Shu
Hannah Lee
...
Silvio Savarese
Caiming Xiong
Ran Xu
Yejin Choi
Ludwig Schmidt
67
23
0
17 Jun 2024
Exploring the Benefits of Vision Foundation Models for Unsupervised Domain Adaptation
B. B. Englert
Fabrizio J. Piva
Tommie Kerssies
Daan de Geus
Gijs Dubbelman
16
10
0
14 Jun 2024
Industrial Language-Image Dataset (ILID): Adapting Vision Foundation Models for Industrial Settings
Keno Moenck
Duc Trung Thieu
Julian Koch
Thorsten Schuppstuhl
VLM
27
0
0
14 Jun 2024
MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding
Fei Wang
Xingyu Fu
James Y. Huang
Zekun Li
Qin Liu
...
Kai-Wei Chang
Dan Roth
Sheng Zhang
Hoifung Poon
Muhao Chen
VLM
31
47
0
13 Jun 2024
Towards Vision-Language Geo-Foundation Model: A Survey
Yue Zhou
Litong Feng
Yiping Ke
Xue Jiang
Junchi Yan
Xue Yang
Wayne Zhang
35
15
0
13 Jun 2024
Comparison Visual Instruction Tuning
Wei Lin
M. Jehanzeb Mirza
Sivan Doveh
Rogerio Feris
Raja Giryes
Sepp Hochreiter
Leonid Karlinsky
46
4
0
13 Jun 2024
OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li
Zhe Chen
Weiyun Wang
Wenhai Wang
Shenglong Ye
...
Dahua Lin
Yu Qiao
Botian Shi
Conghui He
Jifeng Dai
VLM
OffRL
45
19
0
12 Jun 2024
LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang
Zehai He
Wenyi Hong
Yean Cheng
Xiaohan Zhang
...
Shiyu Huang
Bin Xu
Yuxiao Dong
Ming Ding
Jie Tang
ELM
VLM
40
63
0
12 Jun 2024
MoE Jetpack: From Dense Checkpoints to Adaptive Mixture of Experts for Vision Tasks
Xingkui Zhu
Yiran Guan
Dingkang Liang
Yuchao Chen
Yuliang Liu
Xiang Bai
MoE
35
5
0
07 Jun 2024
OVMR: Open-Vocabulary Recognition with Multi-Modal References
Zehong Ma
Shiliang Zhang
Longhui Wei
Qi Tian
VLM
28
0
0
07 Jun 2024
VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
Junjie Zhou
Zheng Liu
Shitao Xiao
Bo Zhao
Yongping Xiong
36
20
0
06 Jun 2024
Parrot: Multilingual Visual Instruction Tuning
Hai-Long Sun
Da-Wei Zhou
Y. Li
Shiyin Lu
Chao Yi
...
Zhao Xu
Weihua Luo
Kaifu Zhang
De-Chuan Zhan
Han-Jia Ye
MLLM
23
9
0
04 Jun 2024
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Model
An-Chieh Cheng
Hongxu Yin
Yang Fu
Qiushan Guo
Ruihan Yang
Jan Kautz
Xiaolong Wang
Sifei Liu
LRM
46
43
0
03 Jun 2024
UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment
Hantao Zhou
Longxiang Tang
Rui Yang
Guanyi Qin
Yan Zhang
Runze Hu
Xiu Li
29
5
0
03 Jun 2024
Ovis: Structural Embedding Alignment for Multimodal Large Language Model
Shiyin Lu
Yang Li
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
Han-Jia Ye
VLM
MLLM
45
35
0
31 May 2024
Shotluck Holmes: A Family of Efficient Small-Scale Large Language Vision Models For Video Captioning and Summarization
Richard Luo
Austin Peng
Adithya Vasudev
Rishabh Jain
34
2
0
31 May 2024
Scaling White-Box Transformers for Vision
Jinrui Yang
Xianhang Li
Druv Pai
Yuyin Zhou
Yi-An Ma
Yaodong Yu
Cihang Xie
ViT
34
9
0
30 May 2024
Jina CLIP: Your CLIP Model Is Also Your Text Retriever
Andreas Koukounas
Georgios Mastrapas
Michael Gunther
Bo Wang
Scott Martens
...
Saahil Ognawala
Susana Guzman
Maximilian Werk
Nan Wang
Han Xiao
VLM
19
13
0
30 May 2024
Typography Leads Semantic Diversifying: Amplifying Adversarial Transferability across Multimodal Large Language Models
Hao-Ran Cheng
Erjia Xiao
Jiahang Cao
Le Yang
Kaidi Xu
Jindong Gu
Renjing Xu
AAML
52
7
0
30 May 2024
Evaluating Vision-Language Models on Bistable Images
Artemis Panagopoulou
Coby Melkin
Chris Callison-Burch
41
0
0
29 May 2024
ContextBLIP: Doubly Contextual Alignment for Contrastive Image Retrieval from Linguistically Complex Descriptions
Honglin Lin
Siyu Li
Gu Nan
Chaoyue Tang
Xueting Wang
...
Yankai Rong
Zhili Zhou
Yutong Gao
Qimei Cui
Xiaofeng Tao
25
0
0
29 May 2024
Enhancing Vision-Language Model with Unmasked Token Alignment
Jihao Liu
Jinliang Zheng
Boxiao Liu
Yu Liu
Hongsheng Li
CLIP
18
0
0
29 May 2024
ViG: Linear-complexity Visual Sequence Learning with Gated Linear Attention
Bencheng Liao
Xinggang Wang
Lianghui Zhu
Qian Zhang
Chang Huang
45
3
0
28 May 2024
Why are Visually-Grounded Language Models Bad at Image Classification?
Yuhui Zhang
Alyssa Unell
Xiaohan Wang
Dhruba Ghosh
Yuchang Su
Ludwig Schmidt
Serena Yeung-Levy
VLM
35
27
0
28 May 2024
OV-DQUO: Open-Vocabulary DETR with Denoising Text Query Training and Open-World Unknown Objects Supervision
Junjie Wang
Bin Chen
Bin Kang
Yulin Li
Yichi Chen
Weizhi Xian
Huifeng Chang
VLM
ObjD
18
7
0
28 May 2024
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
Xin Xiao
Bohong Wu
Jiacong Wang
Chunyuan Li
Xun Zhou
Haoyuan Guo
VLM
34
7
0
28 May 2024