2205.01917
Cited By
CoCa: Contrastive Captioners are Image-Text Foundation Models
4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM
CLIP
OffRL
Papers citing
"CoCa: Contrastive Captioners are Image-Text Foundation Models"
50 / 910 papers shown
Title
SyCoCa: Symmetrizing Contrastive Captioners with Attentive Masking for Multimodal Alignment
Ziping Ma
Furong Xu
Jian Liu
Ming Yang
Qingpei Guo
VLM
34
3
0
04 Jan 2024
Data-Centric Foundation Models in Computational Healthcare: A Survey
Yunkun Zhang
Jin Gao
Zheling Tan
Lingfeng Zhou
Kexin Ding
Mu Zhou
Shaoting Zhang
Dequan Wang
AI4CE
21
22
0
04 Jan 2024
Towards Truly Zero-shot Compositional Visual Reasoning with LLMs as Programmers
Aleksandar Stanić
Sergi Caelles
Michael Tschannen
LRM
VLM
23
9
0
03 Jan 2024
Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition
Kyle Buettner
Sina Malakouti
Xiang Lorraine Li
Adriana Kovashka
39
3
0
03 Jan 2024
AliFuse: Aligning and Fusing Multi-modal Medical Data for Computer-Aided Diagnosis
Qiuhui Chen
Yi Hong
MedIm
15
1
0
02 Jan 2024
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Alex Jinpeng Wang
Linjie Li
K. Lin
Jianfeng Wang
Kevin Lin
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
VLM
VGen
16
12
0
01 Jan 2024
Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels
Haoning Wu
Zicheng Zhang
Weixia Zhang
Chaofeng Chen
Liang Liao
...
Wenxiu Sun
Qiong Yan
Xiongkuo Min
Guangtao Zhai
Weisi Lin
14
131
0
28 Dec 2023
Prompt Expansion for Adaptive Text-to-Image Generation
Siddhartha Datta
Alexander Ku
Deepak Ramachandran
Peter Anderson
DiffM
19
8
0
27 Dec 2023
LeanVec: Searching vectors faster by making them fit
Mariano Tepper
Ishwar Bhati
Cecilia Aguerrebere
Mark Hildebrand
Ted Willke
VLM
OODD
13
1
0
26 Dec 2023
UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces
Jiannan Wu
Yi-Xin Jiang
Bin Yan
Huchuan Lu
Zehuan Yuan
Ping Luo
VOS
27
17
0
25 Dec 2023
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
Ehsan Abbasnejad
Hamed Damirchi
Ignacio M. Jara
Felipe Bravo-Marquez
A. Hengel
VLM
34
1
0
22 Dec 2023
Leveraging Habitat Information for Fine-grained Bird Identification
Tin Nguyen
Anh Nguyen
VLM
33
0
0
22 Dec 2023
InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
Zhe Chen
Jiannan Wu
Wenhai Wang
Weijie Su
Guo Chen
...
Bin Li
Ping Luo
Tong Lu
Yu Qiao
Jifeng Dai
VLM
MLLM
156
918
0
21 Dec 2023
LingoQA: Video Question Answering for Autonomous Driving
Ana-Maria Marcu
Long Chen
Jan Hünermann
Alice Karnsund
Benoît Hanotte
...
Vijay Badrinarayanan
Alex Kendall
Jamie Shotton
Elahe Arani
Oleg Sinavski
21
31
0
21 Dec 2023
Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast
Guangyin Bao
Qi Zhang
Duoqian Miao
Zixuan Gong
Liang Hu
Ke Liu
Yang Liu
Chongyang Shi
29
8
0
21 Dec 2023
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
Bingbing Wen
Zhengyuan Yang
Jianfeng Wang
Zhe Gan
Bill Howe
Lijuan Wang
MLLM
31
1
0
21 Dec 2023
Learning Object State Changes in Videos: An Open-World Perspective
Zihui Xue
Kumar Ashutosh
Kristen Grauman
VGen
25
18
0
19 Dec 2023
Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning
Bingchen Zhao
Haoqin Tu
Chen Wei
Jieru Mei
Cihang Xie
6
31
0
18 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
G. Loaiza-Ganem
M. Volkovs
35
3
0
15 Dec 2023
General Object Foundation Model for Images and Videos at Scale
Junfeng Wu
Yi-Xin Jiang
Qihao Liu
Zehuan Yuan
Xiang Bai
Song Bai
VOS
VLM
25
38
0
14 Dec 2023
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning
Zhiyue Liu
Jinyuan Liu
Fanrong Ma
CLIP
VLM
27
10
0
14 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
11
4
0
14 Dec 2023
ViLA: Efficient Video-Language Alignment for Video Question Answering
Xijun Wang
Junbang Liang
Chun-Kai Wang
Kenan Deng
Yu Lou
Ming-Chyuan Lin
Shan Yang
24
13
0
13 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedIm
LM&MA
26
20
0
13 Dec 2023
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
Shuyang Sun
Runjia Li
Philip H. S. Torr
Xiuye Gu
Siyang Li
VLM
CLIP
20
32
0
12 Dec 2023
Domain Prompt Learning with Quaternion Networks
Qinglong Cao
Zhengqin Xu
Yuntian Chen
Chao Ma
Xiaokang Yang
VLM
27
10
0
12 Dec 2023
Honeybee: Locality-enhanced Projector for Multimodal LLM
Junbum Cha
Wooyoung Kang
Jonghwan Mun
Byungseok Roh
MLLM
24
111
0
11 Dec 2023
4M: Massively Multimodal Masked Modeling
David Mizrahi
Roman Bachmann
Oğuzhan Fatih Kar
Teresa Yeo
Mingfei Gao
Afshin Dehghan
Amir Zamir
MLLM
39
62
0
11 Dec 2023
RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
Jiashuo Fan
Yaoyuan Liang
Leyao Liu
Shao-Lun Huang
Lei Zhang
30
2
0
11 Dec 2023
Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models
Shitian Zhao
Zhuowan Li
Yadong Lu
Alan L. Yuille
Yan Wang
LRM
60
5
0
09 Dec 2023
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual Understanding
Talfan Evans
Shreya Pathak
Hamza Merzic
Jonathan Schwarz
Ryutaro Tanno
Olivier J. Hénaff
8
16
0
08 Dec 2023
AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making
Shusen Liu
Haichao Miao
Zhimin Li
M. Olson
Valerio Pascucci
P. Bremer
22
8
0
07 Dec 2023
TokenCompose: Text-to-Image Diffusion with Token-level Supervision
Zirui Wang
Zhizhou Sha
Zheng Ding
Yilin Wang
Zhuowen Tu
DiffM
27
21
0
06 Dec 2023
Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey
Shengchao Chen
Guodong Long
Jing Jiang
Dikai Liu
Chengqi Zhang
SyDa
AI4CE
26
23
0
05 Dec 2023
Rejuvenating image-GPT as Strong Visual Representation Learners
Sucheng Ren
Zeyu Wang
Hongru Zhu
Junfei Xiao
Alan L. Yuille
Cihang Xie
VLM
42
7
0
04 Dec 2023
Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval
Dixuan Lin
Yi-Xing Peng
Jingke Meng
Wei-Shi Zheng
27
5
0
04 Dec 2023
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
Feng Wang
Jieru Mei
Alan L. Yuille
VLM
19
54
0
04 Dec 2023
PixelLM: Pixel Reasoning with Large Multimodal Model
Zhongwei Ren
Zhicheng Huang
Yunchao Wei
Yao-Min Zhao
Dongmei Fu
Jiashi Feng
Xiaojie Jin
VLM
MLLM
LRM
28
78
0
04 Dec 2023
How to Configure Good In-Context Sequence for Visual Question Answering
Li Li
Jiawei Peng
Huiyi Chen
Chongyang Gao
Xu Yang
MLLM
15
20
0
04 Dec 2023
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
Andrés Villa
Juan Carlos León Alcázar
Alvaro Soto
Bernard Ghanem
MLLM
VLM
18
9
0
03 Dec 2023
A Comprehensive Study of Vision Transformers in Image Classification Tasks
Mahmoud Khalil
Ahmad Khalil
A. Ngom
ViT
8
8
0
02 Dec 2023
Grounding Everything: Emerging Localization Properties in Vision-Language Transformers
Walid Bousselham
Felix Petersen
Vittorio Ferrari
Hilde Kuehne
ObjD
VLM
29
39
0
01 Dec 2023
Segment and Caption Anything
Xiaoke Huang
Jianfeng Wang
Yansong Tang
Zheng Zhang
Han Hu
Jiwen Lu
Lijuan Wang
Zicheng Liu
MLLM
VLM
21
17
0
01 Dec 2023
Infrared Image Super-Resolution via GAN
Y. Huang
S. Omachi
GAN
21
0
0
01 Dec 2023
LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models
Ying Nie
Wei He
Kai Han
Yehui Tang
Tianyu Guo
Fanyi Du
Yunhe Wang
VLM
13
3
0
01 Dec 2023
Vision-Language Models Learn Super Images for Efficient Partially Relevant Video Retrieval
Taichi Nishimura
Shota Nakada
Masayoshi Kondo
VLM
13
0
0
01 Dec 2023
Green Edge AI: A Contemporary Survey
Yuyi Mao
X. Yu
Kaibin Huang
Ying-Jun Angela Zhang
Jun Zhang
26
16
0
01 Dec 2023
Brainformer: Mimic Human Visual Brain Functions to Machine Vision Models via fMRI
Xuan-Bac Nguyen
Xin Li
Pawan Sinha
Samee U. Khan
Khoa Luu
ViT
MedIm
22
0
0
30 Nov 2023
MLLMs-Augmented Visual-Language Representation Learning
Yanqing Liu
Kai Wang
Wenqi Shao
Ping Luo
Yu Qiao
Mike Zheng Shou
Kaipeng Zhang
Yang You
VLM
21
11
0
30 Nov 2023
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
Wujian Peng
Sicheng Xie
Zuyao You
Shiyi Lan
Zuxuan Wu
VLM
CoGe
MLLM
21
17
0
30 Nov 2023