Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2207.12661
Cited By
Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training
26 July 2022
Haoxuan You
Luowei Zhou
Bin Xiao
Noel Codella
Yu Cheng
Ruochen Xu
Shih-Fu Chang
Lu Yuan
CLIP
VLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training"
34 / 34 papers shown
Title
Post-pre-training for Modality Alignment in Vision-Language Foundation Models
Shinýa Yamaguchi
Dewei Feng
Sekitoshi Kanai
Kazuki Adachi
Daiki Chijiwa
VLM
34
0
0
17 Apr 2025
Continual Cross-Modal Generalization
Yan Xia
Hai Huang
Minghui Fang
Zhou Zhao
CLL
52
0
0
01 Apr 2025
Adaptive Perception for Unified Visual Multi-modal Object Tracking
Xiantao Hu
Bineng Zhong
Qihua Liang
Zhiyi Mo
Liangtao Shi
Ying Tai
Jian Yang
38
1
0
10 Feb 2025
Advanced Knowledge Transfer: Refined Feature Distillation for Zero-Shot Quantization in Edge Computing
Inpyo Hong
Youngwan Jo
Hyojeong Lee
Sunghyun Ahn
Sanghyun Park
MQ
49
2
0
26 Dec 2024
Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning
Ilaria Manco
Justin Salamon
Oriol Nieto
23
1
0
17 Sep 2024
Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks
Hunmin Yang
Jongoh Jeong
Kuk-Jin Yoon
AAML
VLM
58
4
0
30 Jul 2024
Mitigate the Gap: Investigating Approaches for Improving Cross-Modal Alignment in CLIP
Sedigheh Eslami
Gerard de Melo
VLM
33
3
0
25 Jun 2024
RAP: Efficient Text-Video Retrieval with Sparse-and-Correlated Adapter
Meng Cao
Haoran Tang
Jinfa Huang
Peng Jin
Can Zhang
Ruyang Liu
Long Chen
Xiaodan Liang
Li-ming Yuan
Ge Li
93
11
0
29 May 2024
FFF: Fixing Flawed Foundations in contrastive pre-training results in very strong Vision-Language models
Adrian Bulat
Yassine Ouali
Georgios Tzimiropoulos
VLM
35
4
0
16 May 2024
Siamese Vision Transformers are Scalable Audio-visual Learners
Yan-Bo Lin
Gedas Bertasius
37
5
0
28 Mar 2024
Unlocking the Potential of Multimodal Unified Discrete Representation through Training-Free Codebook Optimization and Hierarchical Alignment
Hai Huang
Yan Xia
Shengpeng Ji
Shulei Wang
Hanting Wang
Jieming Zhu
Zhenhua Dong
Zhou Zhao
22
6
0
08 Mar 2024
Choosing Wisely and Learning Deeply: Selective Cross-Modality Distillation via CLIP for Domain Generalization
Jixuan Leng
Yijiang Li
Haohan Wang
VLM
19
0
0
26 Nov 2023
GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding
Zekun Li
Wenxuan Zhou
Yao-Yi Chiang
Muhao Chen
SyDa
26
26
0
23 Oct 2023
A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance
Zeyi Huang
Andy Zhou
Zijian Lin
Mu Cai
Haohan Wang
Yong Jae Lee
VLM
OOD
19
27
0
21 Sep 2023
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
Zhiyin Shao
Xinyu Zhang
Changxing Ding
Jian Wang
Jingdong Wang
17
17
0
04 Sep 2023
GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training
Xi Deng
Han Shi
Runhu Huang
Changlin Li
Hang Xu
Jianhua Han
James T. Kwok
Shen Zhao
Wei Zhang
Xiaodan Liang
CLIP
VLM
24
3
0
22 Aug 2023
IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models
Haoxuan You
Rui Sun
Zhecan Wang
Long Chen
Gengyu Wang
Hammad A. Ayyubi
Kai-Wei Chang
Shih-Fu Chang
VLM
MLLM
LRM
39
43
0
24 May 2023
Improved baselines for vision-language pre-training
Enrico Fini
Pietro Astolfi
Adriana Romero Soriano
Jakob Verbeek
M. Drozdzal
SSL
CLIP
VLM
45
22
0
15 May 2023
On Robustness in Multimodal Learning
Brandon McKinzie
Joseph Cheng
Vaishaal Shankar
Yinfei Yang
Jonathon Shlens
Alexander Toshev
25
2
0
10 Apr 2023
Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation
Yaowei Li
Bang-ju Yang
Xuxin Cheng
Zhihong Zhu
Hongxiang Li
Yuexian Zou
11
31
0
28 Mar 2023
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Yuxiao Chen
Jianbo Yuan
Yu Tian
Shijie Geng
Xinyu Li
Ding Zhou
Dimitris N. Metaxas
Hongxia Yang
14
33
0
27 Mar 2023
CoBIT: A Contrastive Bi-directional Image-Text Generation Model
Haoxuan You
Mandy Guo
Zhecan Wang
Kai-Wei Chang
Jason Baldridge
Jiahui Yu
DiffM
37
12
0
23 Mar 2023
Learning Customized Visual Models with Retrieval-Augmented Knowledge
Haotian Liu
Kilho Son
Jianwei Yang
Ce Liu
Jianfeng Gao
Yong Jae Lee
Chunyuan Li
VLM
30
51
0
17 Jan 2023
RILS: Masked Visual Reconstruction in Language Semantic Space
Shusheng Yang
Yixiao Ge
Kun Yi
Dian Li
Ying Shan
Xiaohu Qie
Xinggang Wang
CLIP
24
11
0
17 Jan 2023
CLIPPO: Image-and-Language Understanding from Pixels Only
Michael Tschannen
Basil Mustafa
N. Houlsby
CLIP
VLM
19
47
0
15 Dec 2022
NLIP: Noise-robust Language-Image Pre-training
Runhu Huang
Yanxin Long
Jianhua Han
Hang Xu
Xiwen Liang
Chunjing Xu
Xiaodan Liang
VLM
26
30
0
14 Dec 2022
SuS-X: Training-Free Name-Only Transfer of Vision-Language Models
Vishaal Udandarao
Ankush Gupta
Samuel Albanie
VLM
MLLM
22
103
0
28 Nov 2022
Leveraging per Image-Token Consistency for Vision-Language Pre-training
Yunhao Gou
Tom Ko
Hansi Yang
James T. Kwok
Yu Zhang
Mingxuan Wang
VLM
14
9
0
20 Nov 2022
Cross-Modal Adapter for Text-Video Retrieval
Haojun Jiang
Jianke Zhang
Rui Huang
Chunjiang Ge
Zanlin Ni
Jiwen Lu
Jie Zhou
S. Song
Gao Huang
40
36
0
17 Nov 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
24
21
0
09 Oct 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
388
4,110
0
28 Jan 2022
VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text
Hassan Akbari
Liangzhe Yuan
Rui Qian
Wei-Hong Chuang
Shih-Fu Chang
Yin Cui
Boqing Gong
ViT
240
573
0
22 Apr 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
293
3,683
0
11 Feb 2021
Unified Vision-Language Pre-Training for Image Captioning and VQA
Luowei Zhou
Hamid Palangi
Lei Zhang
Houdong Hu
Jason J. Corso
Jianfeng Gao
MLLM
VLM
250
926
0
24 Sep 2019
1