ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2205.01917
  4. Cited By
CoCa: Contrastive Captioners are Image-Text Foundation Models
v1v2 (latest)

CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLMCLIPOffRL
ArXiv (abs)PDFHTMLHuggingFace (3 upvotes)

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,041 papers shown
Title
LingoQA: Video Question Answering for Autonomous Driving
LingoQA: Video Question Answering for Autonomous Driving
Ana-Maria Marcu
Long Chen
Jan Hünermann
Alice Karnsund
Benoît Hanotte
...
Vijay Badrinarayanan
Alex Kendall
Jamie Shotton
Elahe Arani
Oleg Sinavski
120
89
0
21 Dec 2023
Multimodal Federated Learning with Missing Modality via Prototype Mask
  and Contrast
Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast
Guangyin Bao
Tao Gui
Duoqian Miao
Zixuan Gong
Liang Hu
Ke Liu
Yang Liu
Chongyang Shi
152
16
0
21 Dec 2023
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large
  Multimodal and Language Models
InfoVisDial: An Informative Visual Dialogue Dataset by Bridging Large Multimodal and Language Models
Bingbing Wen
Zhengyuan Yang
Jianfeng Wang
Zhe Gan
Bill Howe
Lijuan Wang
MLLM
164
3
0
21 Dec 2023
Learning Object State Changes in Videos: An Open-World Perspective
Learning Object State Changes in Videos: An Open-World Perspective
Zihui Xue
Kumar Ashutosh
Kristen Grauman
VGen
292
33
0
19 Dec 2023
Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM
  Finetuning
Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning
Bingchen Zhao
Haoqin Tu
Chen Wei
Jieru Mei
Cihang Xie
263
51
0
18 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Data-Efficient Multimodal Fusion on a Single GPUComputer Vision and Pattern Recognition (CVPR), 2023
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
Gabriel Loaiza-Ganem
Anthony L. Caterini
409
8
0
15 Dec 2023
General Object Foundation Model for Images and Videos at Scale
General Object Foundation Model for Images and Videos at ScaleComputer Vision and Pattern Recognition (CVPR), 2023
Junfeng Wu
Yi Jiang
Qihao Liu
Zehuan Yuan
Xiang Bai
Song Bai
VOSVLM
300
74
0
14 Dec 2023
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image
  Captioning
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image CaptioningAAAI Conference on Artificial Intelligence (AAAI), 2023
Zhiyue Liu
Jinyuan Liu
Fanrong Ma
CLIPVLM
212
20
0
14 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language
  Pre-training
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-trainingAAAI Conference on Artificial Intelligence (AAAI), 2023
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIPVLM
262
6
0
14 Dec 2023
ViLA: Efficient Video-Language Alignment for Video Question Answering
ViLA: Efficient Video-Language Alignment for Video Question AnsweringEuropean Conference on Computer Vision (ECCV), 2023
Xijun Wang
Junbang Liang
Chun-Kai Wang
Kenan Deng
Yu Lou
Ming-Chyuan Lin
Shan Yang
289
21
0
13 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human
  Pathology
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedImLM&MA
178
28
0
13 Dec 2023
CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor
CLIP as RNN: Segment Countless Visual Concepts without Training EndeavorComputer Vision and Pattern Recognition (CVPR), 2023
Shuyang Sun
Runjia Li
Juil Sock
Xiuye Gu
Siyang Li
VLMCLIP
419
54
0
12 Dec 2023
Domain Prompt Learning with Quaternion Networks
Domain Prompt Learning with Quaternion NetworksComputer Vision and Pattern Recognition (CVPR), 2023
Qinglong Cao
Zhengqin Xu
Yuntian Chen
Chao Ma
Xiaokang Yang
VLM
228
20
0
12 Dec 2023
Honeybee: Locality-enhanced Projector for Multimodal LLM
Honeybee: Locality-enhanced Projector for Multimodal LLM
Junbum Cha
Wooyoung Kang
Jonghwan Mun
Byungseok Roh
MLLM
333
194
0
11 Dec 2023
4M: Massively Multimodal Masked Modeling
4M: Massively Multimodal Masked Modeling
David Mizrahi
Roman Bachmann
Ouguzhan Fatih Kar
Teresa Yeo
Mingfei Gao
Afshin Dehghan
Amir Zamir
MLLM
254
106
0
11 Dec 2023
RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
Jiashuo Fan
Yaoyuan Liang
Leyao Liu
Shao-Lun Huang
Lei Zhang
255
6
0
11 Dec 2023
Causal-CoG: A Causal-Effect Look at Context Generation for Boosting
  Multi-modal Language Models
Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language ModelsComputer Vision and Pattern Recognition (CVPR), 2023
Shitian Zhao
Zhuowan Li
Yadong Lu
Yaoyao Liu
Yan Wang
LRM
167
14
0
09 Dec 2023
Bad Students Make Great Teachers: Active Learning Accelerates
  Large-Scale Visual Understanding
Bad Students Make Great Teachers: Active Learning Accelerates Large-Scale Visual UnderstandingEuropean Conference on Computer Vision (ECCV), 2023
Talfan Evans
Shreya Pathak
Hamza Merzic
Jonathan Schwarz
Ryutaro Tanno
Olivier J. Hénaff
252
25
0
08 Dec 2023
AVA: Towards Autonomous Visualization Agents through Visual
  Perception-Driven Decision-Making
AVA: Towards Autonomous Visualization Agents through Visual Perception-Driven Decision-Making
Shusen Liu
Haichao Miao
Zhimin Li
M. Olson
Valerio Pascucci
P. Bremer
290
15
0
07 Dec 2023
TokenCompose: Text-to-Image Diffusion with Token-level Supervision
TokenCompose: Text-to-Image Diffusion with Token-level Supervision
Zirui Wang
Zhizhou Sha
Zheng Ding
Yilin Wang
Zhuowen Tu
DiffM
255
14
0
06 Dec 2023
Foundation Models for Weather and Climate Data Understanding: A
  Comprehensive Survey
Foundation Models for Weather and Climate Data Understanding: A Comprehensive Survey
Shengchao Chen
Guodong Long
Jing Jiang
Dikai Liu
Chengqi Zhang
SyDaAI4CE
296
35
0
05 Dec 2023
Rejuvenating image-GPT as Strong Visual Representation Learners
Rejuvenating image-GPT as Strong Visual Representation LearnersInternational Conference on Machine Learning (ICML), 2023
Sucheng Ren
Zeyu Wang
Hongru Zhu
Junfei Xiao
Yaoyao Liu
Cihang Xie
VLM
254
11
0
04 Dec 2023
Cross-Modal Adaptive Dual Association for Text-to-Image Person Retrieval
Cross-Modal Adaptive Dual Association for Text-to-Image Person RetrievalIEEE transactions on multimedia (IEEE TMM), 2023
Dixuan Lin
Yi-Xing Peng
Jingke Meng
Wei-Shi Zheng
167
25
0
04 Dec 2023
SCLIP: Rethinking Self-Attention for Dense Vision-Language Inference
SCLIP: Rethinking Self-Attention for Dense Vision-Language InferenceEuropean Conference on Computer Vision (ECCV), 2023
Feng Wang
Jieru Mei
Yaoyao Liu
VLM
329
116
0
04 Dec 2023
PixelLM: Pixel Reasoning with Large Multimodal Model
PixelLM: Pixel Reasoning with Large Multimodal ModelComputer Vision and Pattern Recognition (CVPR), 2023
Zhongwei Ren
Zhicheng Huang
Yunchao Wei
Yao-Min Zhao
Dongmei Fu
Jiashi Feng
Xiaojie Jin
VLMMLLMLRM
341
184
0
04 Dec 2023
How to Configure Good In-Context Sequence for Visual Question Answering
How to Configure Good In-Context Sequence for Visual Question AnsweringComputer Vision and Pattern Recognition (CVPR), 2023
Li Li
Jiawei Peng
Huiyi Chen
Chongyang Gao
Xu Yang
MLLM
210
36
0
04 Dec 2023
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
Behind the Magic, MERLIM: Multi-modal Evaluation Benchmark for Large Image-Language Models
Andrés Villa
Juan Carlos León Alcázar
Alvaro Soto
Bernard Ghanem
MLLMVLM
256
18
0
03 Dec 2023
A Comprehensive Study of Vision Transformers in Image Classification
  Tasks
A Comprehensive Study of Vision Transformers in Image Classification Tasks
Mahmoud Khalil
Ahmad Khalil
A. Ngom
ViT
185
14
0
02 Dec 2023
Grounding Everything: Emerging Localization Properties in
  Vision-Language Transformers
Grounding Everything: Emerging Localization Properties in Vision-Language TransformersComputer Vision and Pattern Recognition (CVPR), 2023
Walid Bousselham
Felix Petersen
Vittorio Ferrari
Hilde Kuehne
ObjDVLM
319
72
0
01 Dec 2023
Segment and Caption Anything
Segment and Caption AnythingComputer Vision and Pattern Recognition (CVPR), 2023
Xiaoke Huang
Jianfeng Wang
Yansong Tang
Zheng Zhang
Han Hu
Jiwen Lu
Lijuan Wang
Zicheng Liu
MLLMVLM
202
32
0
01 Dec 2023
Infrared Image Super-Resolution via GAN
Infrared Image Super-Resolution via GAN
Y. Huang
S. Omachi
GAN
263
0
0
01 Dec 2023
LightCLIP: Learning Multi-Level Interaction for Lightweight
  Vision-Language Models
LightCLIP: Learning Multi-Level Interaction for Lightweight Vision-Language Models
Ying Nie
Wei He
Kai Han
Yehui Tang
Tianyu Guo
Fanyi Du
Yunhe Wang
VLM
200
5
0
01 Dec 2023
Vision-Language Models Learn Super Images for Efficient Partially
  Relevant Video Retrieval
Vision-Language Models Learn Super Images for Efficient Partially Relevant Video RetrievalACM Transactions on Multimedia Computing, Communications, and Applications (TOMCCAP) (TOMM), 2023
Taichi Nishimura
Shota Nakada
Masayoshi Kondo
VLM
283
6
0
01 Dec 2023
Green Edge AI: A Contemporary Survey
Green Edge AI: A Contemporary SurveyProceedings of the IEEE (Proc. IEEE), 2023
Yuyi Mao
X. Yu
Kaibin Huang
Ying-Jun Angela Zhang
Jun Zhang
358
51
0
01 Dec 2023
Brainformer: Mimic Human Visual Brain Functions to Machine Vision Models
  via fMRI
Brainformer: Mimic Human Visual Brain Functions to Machine Vision Models via fMRI
Xuan-Bac Nguyen
Pawan Sinha
Arabinda Kumar Choudhary
Samee U. Khan
Khoa Luu
ViTMedIm
324
4
0
30 Nov 2023
MLLMs-Augmented Visual-Language Representation Learning
MLLMs-Augmented Visual-Language Representation Learning
Yanqing Liu
Kai Wang
Wenqi Shao
Ping Luo
Yu Qiao
Mike Zheng Shou
Kaipeng Zhang
Yang You
VLM
224
19
0
30 Nov 2023
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language
  Understanding
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language UnderstandingComputer Vision and Pattern Recognition (CVPR), 2023
Wujian Peng
Sicheng Xie
Zuyao You
Shiyi Lan
Zuxuan Wu
VLMCoGeMLLM
486
39
0
30 Nov 2023
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Spacewalk-18: A Benchmark for Multimodal and Long-form Procedural Video Understanding in Novel Domains
Rohan Myer Krishnan
Zitian Tang
Zhiqiu Yu
Chen Sun
451
2
0
30 Nov 2023
GELDA: A generative language annotation framework to reveal visual
  biases in datasets
GELDA: A generative language annotation framework to reveal visual biases in datasets
Krish Kabra
Kathleen M. Lewis
Guha Balakrishnan
VLM
148
1
0
29 Nov 2023
CLIPC8: Face liveness detection algorithm based on image-text pairs and
  contrastive learning
CLIPC8: Face liveness detection algorithm based on image-text pairs and contrastive learning
Xu Liu
Shu Zhou
Yurong Song
Wenzhe Luo
Xin Zhang
132
2
0
29 Nov 2023
Contrastive Vision-Language Alignment Makes Efficient Instruction
  Learner
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner
Lizhao Liu
Xinyu Sun
Tianhang Xiang
Zhuangwei Zhuang
Liuren Yin
Mingkui Tan
VLM
159
4
0
29 Nov 2023
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced
  Training
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced TrainingComputer Vision and Pattern Recognition (CVPR), 2023
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Raviteja Vemulapalli
Oncel Tuzel
CLIPVLM
567
81
0
28 Nov 2023
The curse of language biases in remote sensing VQA: the role of spatial
  attributes, language diversity, and the need for clear evaluation
The curse of language biases in remote sensing VQA: the role of spatial attributes, language diversity, and the need for clear evaluation
Christel Chappuis
Eliot Walt
Vincent Mendez
Sylvain Lobry
B. L. Saux
D. Tuia
226
7
0
28 Nov 2023
Large Model Based Referring Camouflaged Object Detection
Large Model Based Referring Camouflaged Object Detection
Shupeng Cheng
Ge-Peng Ji
Pengda Qin
Deng-Ping Fan
Bowen Zhou
Peng Xu
ObjD
230
13
0
28 Nov 2023
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf
  Vision-Language Models
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language ModelsComputer Vision and Pattern Recognition (CVPR), 2023
Jiayun Luo
Siddhesh Khandelwal
Leonid Sigal
Boyang Albert Li
MLLMVLM
550
11
0
28 Nov 2023
IG Captioner: Information Gain Captioners are Strong Zero-shot
  Classifiers
IG Captioner: Information Gain Captioners are Strong Zero-shot ClassifiersEuropean Conference on Computer Vision (ECCV), 2023
Chenglin Yang
Siyuan Qiao
Yuan Cao
Yu Zhang
Tao Zhu
Yaoyao Liu
Jiahui Yu
VLM
145
3
0
27 Nov 2023
ViT-Lens: Towards Omni-modal Representations
ViT-Lens: Towards Omni-modal RepresentationsComputer Vision and Pattern Recognition (CVPR), 2023
Weixian Lei
Yixiao Ge
Kun Yi
Jianfeng Zhang
Difei Gao
Dylan Sun
Yuying Ge
Ying Shan
Mike Zheng Shou
179
32
0
27 Nov 2023
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning
  Benchmark for Expert AGI
MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGIComputer Vision and Pattern Recognition (CVPR), 2023
Xiang Yue
Yuansheng Ni
Kai Zhang
Tianyu Zheng
Ruoqi Liu
...
Yibo Liu
Wenhao Huang
Huan Sun
Yu-Chuan Su
Wenhu Chen
OSLMELMVLM
841
1,575
0
27 Nov 2023
Efficient Pre-training for Localized Instruction Generation of Videos
Efficient Pre-training for Localized Instruction Generation of Videos
Anil Batra
Davide Moltisanti
Laura Sevilla-Lara
Marcus Rohrbach
Frank Keller
356
0
0
27 Nov 2023
Align before Adapt: Leveraging Entity-to-Region Alignments for
  Generalizable Video Action Recognition
Align before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action RecognitionComputer Vision and Pattern Recognition (CVPR), 2023
Yifei Chen
Dapeng Chen
Ruijin Liu
Sai Zhou
Wenyuan Xue
Wei Peng
240
15
0
27 Nov 2023
Previous
123...101112...192021
Next