2205.01917
Cited By

CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022

Vijay Vasudevan

Mojtaba Seyedhosseini

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown

Understanding Space Is Rocket Science -- Only Top Reasoning Models Can Solve Spatial Understanding Tasks

Mayug Maniparambil

Noel E. O'Connor

Anthony Ventresque

169

0

0

02 Sep 2025

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

197

2

0

01 Sep 2025

DGL-RSIS: Decoupling Global Spatial Context and Local Class Semantics for Training-Free Remote Sensing Image Segmentation

Richard M. Timmerman

112

0

0

30 Aug 2025

Category-level Text-to-Image Retrieval Improved: Bridging the Domain Gap with Diffusion Models and Vision Encoders

Faizan Farooq Khan

Vladan Stojnić

Mohamed Elhoseiny

99

0

0

29 Aug 2025

MobileCLIP2: Improving Multi-Modal Reinforced Training

Pavan Kumar Anasosalu Vasu

Vaishaal Shankar

Alexander Toshev

Hadi Pouransari

432

1

0

28 Aug 2025

SCAR: A Characterization Scheme for Multi-Modal Dataset

67

0

0

27 Aug 2025

Plug-in Feedback Self-adaptive Attention in CLIP for Training-free Open-Vocabulary Segmentation

Konstantinos N. Plataniotis

120

1

0

27 Aug 2025

T-MASK: Temporal Masking for Probing Foundation Models across Camera Views in Driver Monitoring

Thinesh Thiyakesan Ponbagavathi

188

0

0

22 Aug 2025

Normal and Abnormal Pathology Knowledge-Augmented Vision-Language Model for Anomaly Detection in Pathology Images

Anh Tien Nguyen

134

1

0

21 Aug 2025

Boosting Pathology Foundation Models via Few-shot Prompt-tuning for Rare Cancer Subtyping

...

87

2

0

21 Aug 2025

Glo-VLMs: Leveraging Vision-Language Models for Fine-Grained Diseased Glomerulus Classification

...

David J. Pisapia

142

2

0

21 Aug 2025

Controllable Latent Space Augmentation for Digital Pathology

Sofiène Boutaj

Florent Couzinié-Devy

Maria Vakalopoulou

Stergios Christodoulidis

76

0

0

20 Aug 2025

HERAKLES: Hierarchical Skill Compilation for Open-ended LLM Agents

Pierre-Yves Oudeyer

Sylvain Lamprier

185

0

0

20 Aug 2025

EmoSLLM: Parameter-Efficient Adaptation of LLMs for Speech Emotion Recognition

140

1

0

19 Aug 2025

MANGO: Multimodal Attention-based Normalizing Flow Approach to Fusion Learning

Thanh-Dat Truong

Christophe Bobda

252

1

0

13 Aug 2025

Label Smoothing is a Pragmatic Information Bottleneck

115

0

0

12 Aug 2025

Transferable Model-agnostic Vision-Language Model Adaptation for Efficient Weak-to-Strong Generalization

147

0

0

12 Aug 2025

Effortless Vision-Language Model Specialization in Histopathology without Annotation

Marc Aubreville

Katharina Breininger

108

0

0

11 Aug 2025

MobileViCLIP: An Efficient Video-Text Model for Mobile Devices

188

0

0

10 Aug 2025

CLIPin: A Non-contrastive Plug-in to CLIP for Multimodal Semantic Alignment

172

0

0

08 Aug 2025

Continual Learning for VLMs: A Survey and Taxonomy Beyond Forgetting

Alexandra Gomez-Villa

Joost van de Weijer

207

0

0

06 Aug 2025

Toward Errorless Training ImageNet-1k

132

0

0

06 Aug 2025

Live Music Models

Antoine Caillon

Brian McWilliams

Cassie Tarakajian

...

Jason Baldridge

260

2

0

06 Aug 2025

Context-Adaptive Multi-Prompt Embedding with Large Language Models for Vision-Language Alignment

187

0

0

03 Aug 2025

EvoVLMA: Evolutionary Vision-Language Model Adaptation

146

0

0

03 Aug 2025

Enhancing Zero-Shot Brain Tumor Subtype Classification via Fine-Grained Patch-Text Alignment

232

3

0

03 Aug 2025

SaviorRec: Semantic-Behavior Alignment for Cold-Start Recommendation

172

0

0

02 Aug 2025

Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space

151

0

0

31 Jul 2025

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models

161

5

0

30 Jul 2025

MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces

International Joint Conference on Artificial Intelligence (IJCAI), 2025

178

0

0

29 Jul 2025

SmartCLIP: Modular Vision-language Alignment with Identification Guarantees

Computer Vision and Pattern Recognition (CVPR), 2025

226

3

0

29 Jul 2025

On The Role of Pretrained Language Models in General-Purpose Text Embeddings: A Survey

Shouzheng Huang

253

3

0

28 Jul 2025

Group Relative Augmentation for Data Efficient Action Detection

Martin Renqiang Min

162

0

0

28 Jul 2025

SAR-TEXT: A Large-Scale SAR Image-Text Dataset Built with SAR-Narrator and A Progressive Learning Strategy for Downstream Tasks

Qiangjuan Huang

314

0

0

24 Jul 2025

GRR-CoCa: Leveraging LLM Mechanisms in Multimodal Model Architectures

Nicole Catherine Lewis

Christina Gomez

135

0

0

24 Jul 2025

U-MARVEL: Unveiling Key Factors for Universal Multimodal Retrieval via Embedding Learning with MLLMs

242

2

0

20 Jul 2025

LoRA-Loop: Closing the Synthetic Replay Cycle for Continual VLM Learning

292

0

0

17 Jul 2025

Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations

289

2

0

13 Jul 2025

Beyond the Linear Separability Ceiling: Aligning Representations in VLMs

209

0

0

10 Jul 2025

OpenWorldSAM: Extending SAM2 for Universal Image Segmentation with Language Prompts

Priyadarshini Panda

291

0

0

07 Jul 2025

PhenoBench: A Comprehensive Benchmark for Cell Phenotyping

Fabian H. Reith

Claudia Winklmayr

Jerome Luescher

Christian M. Schuerch

Dagmar Kainmueller

J. L. Rumberger

309

0

0

04 Jul 2025

LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation

162

2

0

20 Jun 2025

Dual-Stage Value-Guided Inference with Margin-Based Reward Adjustment for Fast and Faithful VLM Captioning

Adinath Madhavrao Dukre

Muhammad Haris Khan

279

0

0

18 Jun 2025

PRISM2: Unlocking Multi-Modal General Pathology AI with Clinical Dialogue

Eugene Vorontsov

Eric Zimmermann

...

Thomas J. Fuchs

Kristen Severson

217

5

0

16 Jun 2025

Interpretable Text-Guided Image Clustering via Iterative Search

Oisin Mac Aodha

248

0

0

14 Jun 2025

Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

308

0

0

13 Jun 2025

Attention, Please! Revisiting Attentive Probing Through the Lens of Efficiency

Dionysis Christopoulos

Ioannis Kakogeorgiou

Tilemachos Aravanis

Konstantinos Karantzalos

Yannis Avrithis

329

1

0

11 Jun 2025

Canonical Latent Representations in Conditional Diffusion Models

Ehsan Pajouheshgar

Sabine Süsstrunk

250

0

0

11 Jun 2025

SensorLM: Learning the Language of Wearable Sensors

Girish Narayanswamy

...

Shwetak N. Patel

Cecilia Mascolo

Daniel J. McDuff

435

13

0

10 Jun 2025

When Kernels Multiply, Clusters Unify: Fusing Embeddings with the Kronecker Product

206

1

0

10 Jun 2025
