ResearchTrend.AI
CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLM
    CLIP
    OffRL

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 910 papers shown
Title
Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization
Seongjae Kang
Dong Bok Lee
Hyungjoon Jang
Sung Ju Hwang
VLM
35
0
0
12 May 2025
Batch Augmentation with Unimodal Fine-tuning for Multimodal Learning
H. M. D. Kabir
S. Mondal
Mohammad Ali Moni
21
0
0
10 May 2025
ULFine: Unbiased Lightweight Fine-tuning for Foundation-Model-Assisted Long-Tailed Semi-Supervised Learning
Enhao Zhang
Chaohua Li
Chuanxing Geng
Songcan Chen
52
0
0
08 May 2025
X-Transfer Attacks: Towards Super Transferable Adversarial Attacks on CLIP
Hanxun Huang
Sarah Monazam Erfani
Yige Li
Xingjun Ma
James Bailey
AAML
34
0
0
08 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Y. Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
49
0
0
07 May 2025
HapticVLM: VLM-Driven Texture Recognition Aimed at Intelligent Haptic Interaction
Muhammad Haris Khan
Miguel Altamirano Cabrera
Dmitrii Iarchuk
Yara Mahmoud
Daria Trinitatova
Issatay Tokmurziyev
Dzmitry Tsetserukou
VLM
34
0
0
05 May 2025
Mitigating Group-Level Fairness Disparities in Federated Visual Language Models
Chaomeng Chen
Zitong Yu
J. Dong
Sen Su
L. Shen
Shutao Xia
Xiaochun Cao
FedML
VLM
62
0
0
03 May 2025
Scalability Matters: Overcoming Challenges in InstructGLM with Similarity-Degree-Based Sampling
Hyun Lee
Chris Yi
Maminur Islam
B.D.S. Aritra
22
0
0
02 May 2025
Dual-Forecaster: A Multimodal Time Series Model Integrating Descriptive and Predictive Texts
Wenfa Wu
Guanyu Zhang
Zheng Tan
Yi Wang
Hongsheng Qi
AI4TS
35
1
0
02 May 2025
SpatialLLM: A Compound 3D-Informed Design towards Spatially-Intelligent Large Multimodal Models
Wufei Ma
Luoxin Ye
Nessa McWeeney
Celso M de Melo
A. Yuille
Jieneng Chen
LRM
57
1
0
01 May 2025
Investigating Zero-Shot Diagnostic Pathology in Vision-Language Models with Efficient Prompt Design
Vasudev Sharma
Ahmed Alagha
Abdelhakim Khellaf
Vincent Quoc-Huy Trinh
Mahdi S. Hosseini
33
0
0
30 Apr 2025
Bayesian Principles Improve Prompt Learning In Vision-Language Models
Mingyu Kim
Jongwoo Ko
Mijung Park
VLM
38
0
0
19 Apr 2025
Chain-of-Thought Textual Reasoning for Few-shot Temporal Action Localization
Hongwei Ji
Wulian Yun
Mengshi Qi
Huadong Ma
LRM
54
0
0
18 Apr 2025
Can Masked Autoencoders Also Listen to Birds?
Lukas Rauch
Ilyass Moummad
René Heinrich
Alexis Joly
Bernhard Sick
Christoph Scholz
24
0
0
17 Apr 2025
AdaptoVision: A Multi-Resolution Image Recognition Model for Robust and Scalable Classification
Md. Sanaullah Chowdhury
Lameya Sabrin
VLM
30
0
0
17 Apr 2025
Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya
Po-Yao (Bernie) Huang
Peize Sun
Jang Hyun Cho
Andrea Madotto
...
Shiyu Dong
Nikhila Ravi
Daniel Li
Piotr Dollár
Christoph Feichtenhofer
ObjD
VOS
103
0
0
17 Apr 2025
GFT: Gradient Focal Transformer
Boris Kriuk
Simranjit Kaur Gill
Shoaib Aslam
Amir Fakhrutdinov
29
0
0
14 Apr 2025
The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
Weixian Lei
Jiacong Wang
Haochen Wang
X. Li
Jun Hao Liew
Jiashi Feng
Zilong Huang
26
1
0
14 Apr 2025
Multimodal Representation Learning Techniques for Comprehensive Facial State Analysis
Kaiwen Zheng
Xuri Ge
Junchen Fu
Jun Peng
J. Jose
CVBM
42
0
0
14 Apr 2025
3D CoCa: Contrastive Learners are 3D Captioners
Ting Huang
Z. Zhang
Y. Wang
Hao Tang
25
0
0
13 Apr 2025
Embodied Image Captioning: Self-supervised Learning Agents for Spatially Coherent Image Descriptions
Tommaso Galliena
Tommaso Apicella
Stefano Rosa
Pietro Morerio
Alessio Del Bue
Lorenzo Natale
32
0
0
11 Apr 2025
Kimi-VL Technical Report
Kimi Team
Angang Du
B. Yin
Bowei Xing
Bowen Qu
...
Zhiqi Huang
Zihao Huang
Zijia Zhao
Z. Chen
Zongyu Lin
MLLM
VLM
MoE
103
0
0
10 Apr 2025
A Survey of Pathology Foundation Model: Progress and Future Directions
Conghao Xiong
Hao Chen
Joseph J. Y. Sung
LM&MA
AI4CE
48
0
0
05 Apr 2025
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
Dahun Kim
A. Piergiovanni
Ganesh Mallya
A. Angelova
CoGe
34
0
0
04 Apr 2025
Safety Modulation: Enhancing Safety in Reinforcement Learning through Cost-Modulated Rewards
Hanping Zhang
Yuhong Guo
OffRL
31
0
0
03 Apr 2025
IDMR: Towards Instance-Driven Precise Visual Correspondence in Multimodal Retrieval
Bangwei Liu
Yicheng Bao
Shaohui Lin
Xuhong Wang
Xin Tan
Y. Wang
Yuan Xie
Chaochao Lu
55
0
0
01 Apr 2025
GECKO: Gigapixel Vision-Concept Contrastive Pretraining in Histopathology
S. Kapse
Pushpak Pati
Srikar Yellapragada
Srijan Das
Rajarsi R. Gupta
Joel H. Saltz
Dimitris Samaras
Prateek Prasanna
VLM
41
0
0
01 Apr 2025
FortisAVQA and MAVEN: a Benchmark Dataset and Debiasing Framework for Robust Multimodal Reasoning
Jie Ma
Zhitao Gao
Qi Chai
J. Liu
P. Wang
Jing Tao
Zhou Su
45
0
0
01 Apr 2025
Hybrid Global-Local Representation with Augmented Spatial Guidance for Zero-Shot Referring Image Segmentation
Ting Liu
Siyuan Li
36
0
0
01 Apr 2025
Self-Evolving Visual Concept Library using Vision-Language Critics
Atharva Sehgal
Patrick Yuan
Ziniu Hu
Yisong Yue
Jennifer J. Sun
Swarat Chaudhuri
VLM
45
0
0
31 Mar 2025
Order Matters: On Parameter-Efficient Image-to-Video Probing for Recognizing Nearly Symmetric Actions
Thinesh Thiyakesan Ponbagavathi
Alina Roitberg
34
0
0
31 Mar 2025
A Survey on Remote Sensing Foundation Models: From Vision to Multimodality
Ziyue Huang
Hongxi Yan
Qiqi Zhan
Shuai Yang
Mingming Zhang
Chenkai Zhang
Yiming Lei
Zeming Liu
Qingjie Liu
Y. Wang
42
0
0
28 Mar 2025
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
Jinjin Zhang
Guodong Wang
Yizhou Jin
Di Huang
42
1
0
24 Mar 2025
Compositional Caching for Training-free Open-vocabulary Attribute Detection
Marco Garosi
Alessandro Conti
Gaowen Liu
Elisa Ricci
Massimiliano Mancini
ObjD
VLM
50
0
0
24 Mar 2025
good4cir: Generating Detailed Synthetic Captions for Composed Image Retrieval
Pranavi Kolouju
Eric Xing
Robert Pless
Nathan Jacobs
Abby Stylianou
3DV
55
0
0
22 Mar 2025
Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection
Gensheng Pei
Tao Chen
Yujia Wang
Xinhao Cai
Xiangbo Shu
Tianfei Zhou
Yazhou Yao
VLM
48
1
0
21 Mar 2025
ModalTune: Fine-Tuning Slide-Level Foundation Models with Multi-Modal Information for Multi-task Learning in Digital Pathology
Vishwesh Ramanathan
Tony Xu
Pushpak Pati
Faruk Ahmed
Maged Goubran
Anne L. Martel
43
0
0
21 Mar 2025
Derm1M: A Million-scale Vision-Language Dataset Aligned with Clinical Ontology Knowledge for Dermatology
Siyuan Yan
Ming Hu
Yiwen Jiang
X. Li
Hao Fei
P. Tschandl
Harald Kittler
Zongyuan Ge
VLM
62
0
0
19 Mar 2025
Squeeze Out Tokens from Sample for Finer-Grained Data Governance
Weixiong Lin
Chen Ju
Haicheng Wang
Shengchao Hu
Shuai Xiao
...
Yuheng Jiao
Mingshuai Yao
Jinsong Lan
Qingwen Liu
Ying Chen
48
0
0
18 Mar 2025
Optimized 3D Gaussian Splatting using Coarse-to-Fine Image Frequency Modulation
Umar Farooq
Jean-Yves Guillemaut
Adrian Hilton
M. Volino
3DGS
59
0
0
18 Mar 2025
Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data
Haozhe Si
Yuxuan Wan
Minh Do
Deepak Vasisht
Han Zhao
Hendrik Hamann
41
0
0
17 Mar 2025
Quantum EigenGame for excited state calculation
David Quiroga
Jason Han
Anastasios Kyrillidis
48
0
0
17 Mar 2025
Dynamic Relation Inference via Verb Embeddings
Omri Suissa
Muhiim Ali
Ariana Azarbal
Hui Shen
Shekhar Pradhan
41
0
0
17 Mar 2025
Safe Vision-Language Models via Unsafe Weights Manipulation
Moreno D'Incà
E. Peruzzo
Xingqian Xu
Humphrey Shi
N. Sebe
Massimiliano Mancini
MU
55
0
0
14 Mar 2025
Leveraging Vision-Language Embeddings for Zero-Shot Learning in Histopathology Images
M. Rahaman
Ewan K. A. Millar
Erik H. W. Meijering
VLM
53
0
0
13 Mar 2025
ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning
Pengfei Luo
Jingbo Zhou
Tong Bill Xu
Yuan Xia
Linli Xu
Enhong Chen
LRM
62
0
0
13 Mar 2025
Towards Understanding Graphical Perception in Large Multimodal Models
Kai Zhang
Jianwei Yang
J. Inala
Chandan Singh
Jianfeng Gao
Yu Su
Chenglong Wang
42
1
0
13 Mar 2025
ForAug: Recombining Foregrounds and Backgrounds to Improve Vision Transformer Training with Bias Mitigation
Tobias Christian Nauen
Brian B. Moser
Federico Raue
Stanislav Frolov
Andreas Dengel
ViT
50
0
0
12 Mar 2025
Is CLIP ideal? No. Can we fix it? Yes!
Raphi Kang
Yue Song
Georgia Gkioxari
Pietro Perona
VLM
50
0
0
10 Mar 2025
Integrating Frequency-Domain Representations with Low-Rank Adaptation in Vision-Language Models
Md Azim Khan
A. Gangopadhyay
Jianwu Wang
Robert F. Erbacher
VLM
52
0
0
08 Mar 2025