CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
VLM, CLIP, OffRL

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,043 papers shown
ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference
Mengcheng Lan
Chaofeng Chen
Yiping Ke
Xinjiang Wang
Xue Jiang
Wayne Zhang
VLM
332
68
0
17 Jul 2024
Object-Aware Query Perturbation for Cross-Modal Image-Text Retrieval
Naoya Sogi
Takashi Shibata
Makoto Terao
VLM
282
4
0
17 Jul 2024
Open Vocabulary Multi-Label Video Classification
Rohit Gupta
Mamshad Nayeem Rizve
Jayakrishnan Unnikrishnan
Ashish Tawari
S. D. Tran
Mubarak Shah
Benjamin Z. Yao
Trishul Chilimbi
VLM
241
5
0
12 Jul 2024
NODE-Adapter: Neural Ordinary Differential Equations for Better Vision-Language Reasoning
Yi Zhang
Chun-Wun Cheng
Ke Yu
Zhihai He
Carola-Bibiane Schonlieb
Angelica I Aviles-Rivero
VLM
251
3
0
11 Jul 2024
Bootstrapping Vision-language Models for Self-supervised Remote Physiological Measurement
Zijie Yue
Miaojing Shi
Hanli Wang
Shuai Ding
Qijun Chen
Shanlin Yang
346
1
0
11 Jul 2024
TIP: Tabular-Image Pre-training for Multimodal Classification with Incomplete Data
Siyi Du
Shaoming Zheng
Yinsong Wang
Wenjia Bai
D. O’Regan
Chen Qin
LMTD
264
20
0
10 Jul 2024
Pseudo-RIS: Distinctive Pseudo-supervision Generation for Referring Image Segmentation
Seonghoon Yu
Paul Hongsuck Seo
Jeany Son
DiffM
419
12
0
10 Jul 2024
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh
Cheng-Yu Hsieh
Shih-Ying Yeh
Louis Béthune
Hadi Pour Ansari
Pavan Kumar Anasosalu Vasu
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Marco Cuturi
383
7
0
09 Jul 2024
Leveraging Task-Specific Knowledge from LLM for Semi-Supervised 3D Medical Image Segmentation
Suruchi Kumari
Aryan Das
Swalpa Kumar Roy
Indu Joshi
Pravendra Singh
224
5
0
06 Jul 2024
Precision at Scale: Domain-Specific Datasets On-Demand
Jesús M. Rodríguez-de-Vera
Imanol G. Estepa
Ignacio Sarasúa
Bhalaji Nagarajan
Petia Radeva
250
3
0
03 Jul 2024
FastCLIP: A Suite of Optimization Techniques to Accelerate CLIP Training with Limited Resources
Xiyuan Wei
Fanjiang Ye
Ori Yonay
Xingyu Chen
Baixi Sun
Dingwen Tao
Tianbao Yang
VLM, CLIP
403
5
0
01 Jul 2024
Semantic Compositions Enhance Vision-Language Contrastive Learning
Maxwell Mbabilla Aladago
Lorenzo Torresani
Soroush Vosoughi
CoGe, VLM, CLIP
178
1
0
01 Jul 2024
PathAlign: A vision-language model for whole slide images in histopathology
Faruk Ahmed
Andrew Sellergren
Lin Yang
Shawn Xu
Boris Babenko
...
S. Shetty
Daniel Golden
Yao Xiao
David F. Steiner
Ellery Wulczyn
LM&MA, VLM
272
28
0
27 Jun 2024
Foundational Models for Pathology and Endoscopy Images: Application for Gastric Inflammation
H. Kerdegari
Kyle Higgins
Dennis Veselkov
I. Laponogov
I. Poļaka
...
Junior Andrea Pescino
M. Leja
M. Dinis-Ribeiro
T. F. Kanonnikoff
Kirill Veselkov
421
6
0
26 Jun 2024
Diffusion Model-Based Video Editing: A Survey
Wenhao Sun
Rong-Cheng Tu
Jingyi Liao
Dacheng Tao
VGen
330
36
0
26 Jun 2024
Visualization Literacy of Multimodal Large Language Models: A Comparative Study
Zhimin Li
Haichao Miao
Valerio Pascucci
Shusen Liu
295
12
0
24 Jun 2024
HEST-1k: A Dataset for Spatial Transcriptomics and Histology Image Analysis
Guillaume Jaume
Paul Doucet
Andrew H. Song
Ming Y. Lu
Cristina Almagro-Pérez
...
Anurag J. Vaidya
Richard J. Chen
Drew F. K. Williamson
Ahrong Kim
Faisal Mahmood
371
86
0
23 Jun 2024
A Simple Framework for Open-Vocabulary Zero-Shot Segmentation
Thomas Stegmüller
Tim Lebailly
Nikola Dukic
Behzad Bozorgtabar
Tinne Tuytelaars
Jean-Philippe Thiran
VLM
434
3
0
23 Jun 2024
Multi-modal Transfer Learning between Biological Foundation Models
Juan Jose Garau-Luis
Patrick Bordes
Liam Gonzalez
Masa Roller
Bernardo P. de Almeida
...
Stefan Laurent
Jan Grzegorzewski
Maren Lang
Thomas Pierrot
Guillaume Richard
AI4CE
313
12
0
20 Jun 2024
StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
Rushikesh Zawar
Shaurya Dewan
Andrew F. Luo
Margaret M. Henderson
Michael J. Tarr
Leila Wehbe
VGen, CoGe
189
1
0
19 Jun 2024
Towards a multimodal framework for remote sensing image change retrieval and captioning
IFIP Working Conference on Database Semantics (IWDS), 2024
Roger Ferrod
Luigi Di Caro
Dino Ienco
206
5
0
19 Jun 2024
GSR-BENCH: A Benchmark for Grounded Spatial Reasoning Evaluation via Multimodal LLMs
Navid Rajabi
Jana Kosecka
196
24
0
19 Jun 2024
SeTAR: Out-of-Distribution Detection with Selective Low-Rank Approximation
Neural Information Processing Systems (NeurIPS), 2024
Yixia Li
Boya Xiong
Guanhua Chen
Yun Chen
OODD
259
7
0
18 Jun 2024
Improving Multi-Agent Debate with Sparse Communication Topology
Yunxuan Li
Yibing Du
Jiageng Zhang
Le Hou
Peter Grabowski
Yeqing Li
Eugene Ie
LLMAG
211
63
0
17 Jun 2024
Duoduo CLIP: Efficient 3D Understanding with Multi-View Images
Han-Hung Lee
Yiming Zhang
Angel X. Chang
3DPC
573
4
0
17 Jun 2024
Light Up the Shadows: Enhance Long-Tailed Entity Grounding with Concept-Guided Vision-Language Models
Yikai Zhang
Qianyu He
Xintao Wang
Siyu Yuan
Jiaqing Liang
Yanghua Xiao
VLM
145
0
0
16 Jun 2024
Explore the Limits of Omni-modal Pretraining at Scale
Yiyuan Zhang
Handong Li
Jing Liu
Xiangyu Yue
VLM, LRM
253
1
0
13 Jun 2024
Aligning Vision Models with Human Aesthetics in Retrieval: Benchmarks and Algorithms
Miaosen Zhang
Yixuan Wei
Zhen Xing
Yifei Ma
Zuxuan Wu
...
Zheng Zhang
Jingdong Sun
Chong Luo
Xin Geng
Baining Guo
VLM
291
2
0
13 Jun 2024
Enhancing Domain Adaptation through Prompt Gradient Alignment
Hoang Phan
Lam C. Tran
Quyen Tran
Trung Le
572
8
0
13 Jun 2024
mOSCAR: A Large-scale Multilingual and Multimodal Document-level Corpus
Matthieu Futeral
A. Zebaze
Pedro Ortiz Suarez
Julien Abadji
Rémi Lacroix
Cordelia Schmid
Rachel Bawden
Benoît Sagot
449
6
0
13 Jun 2024
ConMe: Rethinking Evaluation of Compositional Reasoning for Modern VLMs
Irene Huang
Wei Lin
M. Jehanzeb Mirza
Jacob A. Hansen
Sivan Doveh
...
Trevor Darrell
Chuang Gan
Aude Oliva
Rogerio Feris
Leonid Karlinsky
CoGe, LRM
222
16
0
12 Jun 2024
Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning
Chenyu Yang
Xizhou Zhu
Jinguo Zhu
Weijie Su
Junjie Wang
...
Lewei Lu
Bin Li
Jie Zhou
Yu Qiao
Jifeng Dai
VLM, CLIP
200
8
0
11 Jun 2024
Benchmarking Vision-Language Contrastive Methods for Medical Representation Learning
Shuvendu Roy
Yasaman Parhizkar
Franklin Ogidi
Vahid Reza Khazaie
Michael Colacci
Ali Etemad
Elham Dolatabadi
Arash Afkanpour
VLM
264
1
0
11 Jun 2024
Let Go of Your Labels with Unsupervised Transfer
Artyom Gadetsky
Yulun Jiang
Maria Brbić
VLM
241
13
0
11 Jun 2024
Bridging Language Gaps in Audio-Text Retrieval
Zhiyong Yan
Heinrich Dinkel
Yongqing Wang
Jizhong Liu
Junbo Zhang
Yujun Wang
Bin Wang
VLM
246
10
0
11 Jun 2024
BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models
Wanaiu Huang
183
4
0
10 Jun 2024
Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment
Zijia Song
Z. Zang
Yelin Wang
Guozheng Yang
Jiangbin Zheng
Kaicheng Yu
Wanyu Chen
Stan Z. Li
254
0
0
09 Jun 2024
Understanding Information Storage and Transfer in Multi-modal Large Language Models
Neural Information Processing Systems (NeurIPS), 2024
Samyadeep Basu
Martin Grayson
C. Morrison
Besmira Nushi
Soheil Feizi
Daniela Massiceti
299
31
0
06 Jun 2024
Low-Rank Similarity Mining for Multimodal Dataset Distillation
Yue Xu
Zhilin Lin
Yusong Qiu
Cewu Lu
Yong-Lu Li
DD
279
11
0
06 Jun 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
Alex Jinpeng Wang
Linjie Li
Yiqi Lin
Min Li
Lijuan Wang
Mike Zheng Shou
VLM
284
10
0
04 Jun 2024
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Junho Kim
Hyunjun Kim
Yeonju Kim
Yong Man Ro
MLLM
222
31
0
04 Jun 2024
Few-Shot Classification of Interactive Activities of Daily Living (InteractADL)
Zane Durante
Robathan Harries
Edward Vendrow
Zelun Luo
Yuta Kyuragi
Kazuki Kozuka
Fei-Fei Li
Ehsan Adeli
VLM
258
2
0
03 Jun 2024
ED-SAM: An Efficient Diffusion Sampling Approach to Domain Generalization in Vision-Language Foundation Models
Thanh-Dat Truong
Pawan Sinha
Bhiksha Raj
Jackson Cothren
Khoa Luu
DiffM, VLM
263
2
0
03 Jun 2024
UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment
Hantao Zhou
Longxiang Tang
Rui Yang
Guanyi Qin
Yan Zhang
Yutao Li
Xiu Li
R. Hu
Guangtao Zhai
404
12
0
03 Jun 2024
Quantum Visual Feature Encoding Revisited
Xuan-Bac Nguyen
Hoang-Quan Nguyen
Hugh Churchill
Samee U. Khan
Khoa Luu
227
15
0
30 May 2024
QClusformer: A Quantum Transformer-based Framework for Unsupervised Visual Clustering
Xuan-Bac Nguyen
Hoang-Quan Nguyen
Samuel Yen-Chi Chen
Samee U. Khan
Hugh Churchill
Khoa Luu
287
18
0
30 May 2024
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
169
7
0
29 May 2024
CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval
Xintong Jiang
Yaxiong Wang
Mengjian Li
Yujiao Wu
Bingwen Hu
Xueming Qian
CoGe
303
21
0
29 May 2024
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Vicky Zayats
Peter Chen
Melissa Ferrari
Dirk Padfield
AI4CE
218
1
0
29 May 2024
Wavelet-Based Image Tokenizer for Vision Transformers
Zhenhai Zhu
Radu Soricut
ViT
235
7
0
28 May 2024