ResearchTrend.AI
  • Papers
  • Communities
  • Events
  • Blog
  • Pricing
Papers
Communities
Social Events
Terms and Conditions
Pricing
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2025 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 2205.01917
  4. Cited By
CoCa: Contrastive Captioners are Image-Text Foundation Models

CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022
Jiahui Yu
Zirui Wang
Vijay Vasudevan
Legg Yeung
Mojtaba Seyedhosseini
Yonghui Wu
    VLM
    CLIP
    OffRL
ArXivPDFHTML

Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 910 papers shown
Title
Bridging Language Gaps in Audio-Text Retrieval
Bridging Language Gaps in Audio-Text Retrieval
Zhiyong Yan
Heinrich Dinkel
Yongqing Wang
Jizhong Liu
Junbo Zhang
Yujun Wang
Bin Wang
VLM
27
4
0
11 Jun 2024
BrainChat: Decoding Semantic Information from fMRI using Vision-language
  Pretrained Models
BrainChat: Decoding Semantic Information from fMRI using Vision-language Pretrained Models
Wanaiu Huang
18
1
0
10 Jun 2024
Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data
  With Soft Alignment
Gentle-CLIP: Exploring Aligned Semantic In Low-Quality Multimodal Data With Soft Alignment
Zijia Song
Z. Zang
Yelin Wang
Guozheng Yang
Jiangbin Zheng
Kaicheng Yu
Wanyu Chen
Stan Z. Li
31
0
0
09 Jun 2024
Understanding Information Storage and Transfer in Multi-modal Large
  Language Models
Understanding Information Storage and Transfer in Multi-modal Large Language Models
Samyadeep Basu
Martin Grayson
C. Morrison
Besmira Nushi
S. Feizi
Daniela Massiceti
18
10
0
06 Jun 2024
Low-Rank Similarity Mining for Multimodal Dataset Distillation
Low-Rank Similarity Mining for Multimodal Dataset Distillation
Yue Xu
Zhilin Lin
Yusong Qiu
Cewu Lu
Yong-Lu Li
DD
41
3
0
06 Jun 2024
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal
  Learning
Leveraging Visual Tokens for Extended Text Contexts in Multi-Modal Learning
Alex Jinpeng Wang
Linjie Li
Yiqi Lin
Min Li
Lijuan Wang
Mike Zheng Shou
VLM
18
3
0
04 Jun 2024
CODE: Contrasting Self-generated Description to Combat Hallucination in
  Large Multi-modal Models
CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models
Junho Kim
Hyunjun Kim
Yeonju Kim
Yong Man Ro
MLLM
34
10
0
04 Jun 2024
Few-Shot Classification of Interactive Activities of Daily Living
  (InteractADL)
Few-Shot Classification of Interactive Activities of Daily Living (InteractADL)
Zane Durante
Robathan Harries
Edward Vendrow
Zelun Luo
Yuta Kyuragi
Kazuki Kozuka
Fei-Fei Li
Ehsan Adeli
VLM
25
0
0
03 Jun 2024
ED-SAM: An Efficient Diffusion Sampling Approach to Domain
  Generalization in Vision-Language Foundation Models
ED-SAM: An Efficient Diffusion Sampling Approach to Domain Generalization in Vision-Language Foundation Models
Thanh-Dat Truong
Xin Li
Bhiksha Raj
Jackson Cothren
Khoa Luu
DiffM
VLM
33
1
0
03 Jun 2024
UniQA: Unified Vision-Language Pre-training for Image Quality and
  Aesthetic Assessment
UniQA: Unified Vision-Language Pre-training for Image Quality and Aesthetic Assessment
Hantao Zhou
Longxiang Tang
Rui Yang
Guanyi Qin
Yan Zhang
Runze Hu
Xiu Li
29
5
0
03 Jun 2024
Quantum Visual Feature Encoding Revisited
Quantum Visual Feature Encoding Revisited
Xuan-Bac Nguyen
Hoang-Quan Nguyen
Hugh Churchill
Samee U. Khan
Khoa Luu
22
9
0
30 May 2024
QClusformer: A Quantum Transformer-based Framework for Unsupervised
  Visual Clustering
QClusformer: A Quantum Transformer-based Framework for Unsupervised Visual Clustering
Xuan-Bac Nguyen
Hoang-Quan Nguyen
Samuel Yen-Chi Chen
Samee U. Khan
Hugh Churchill
Khoa Luu
21
11
0
30 May 2024
Multi-Modal Generative Embedding Model
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
26
3
0
29 May 2024
CaLa: Complementary Association Learning for Augmenting Composed Image
  Retrieval
CaLa: Complementary Association Learning for Augmenting Composed Image Retrieval
Xintong Jiang
Yaxiong Wang
Mengjian Li
Yujiao Wu
Bingwen Hu
Xueming Qian
CoGe
32
4
0
29 May 2024
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Zipper: A Multi-Tower Decoder Architecture for Fusing Modalities
Vicky Zayats
Peter Chen
Melissa Ferrari
Dirk Padfield
AI4CE
30
0
0
29 May 2024
Wavelet-Based Image Tokenizer for Vision Transformers
Wavelet-Based Image Tokenizer for Vision Transformers
Zhenhai Zhu
Radu Soricut
ViT
22
3
0
28 May 2024
Seeing the Image: Prioritizing Visual Correlation by Contrastive
  Alignment
Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment
Xin Xiao
Bohong Wu
Jiacong Wang
Chunyuan Li
Xun Zhou
Haoyuan Guo
VLM
34
7
0
28 May 2024
Visual Anchors Are Strong Information Aggregators For Multimodal Large
  Language Model
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
Haogeng Liu
Quanzeng You
Xiaotian Han
Yongfei Liu
Huaibo Huang
Ran He
Hongxia Yang
26
2
0
28 May 2024
Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
Synergy and Diversity in CLIP: Enhancing Performance Through Adaptive Backbone Ensembling
Cristian Rodriguez-Opazo
Ehsan Abbasnejad
Damien Teney
Edison Marrese-Taylor
Hamed Damirchi
A. Hengel
VLM
20
1
0
27 May 2024
A Provably Effective Method for Pruning Experts in Fine-tuned Sparse
  Mixture-of-Experts
A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts
Mohammed Nowaz Rabbani Chowdhury
Meng Wang
K. E. Maghraoui
Naigang Wang
Pin-Yu Chen
Christopher Carothers
MoE
24
4
0
26 May 2024
ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with
  LLM-Enhanced Cardiological Text
ECG Semantic Integrator (ESI): A Foundation ECG Model Pretrained with LLM-Enhanced Cardiological Text
Han Yu
Peikun Guo
Akane Sano
34
14
0
26 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
64
41
0
23 May 2024
No Filter: Cultural and Socioeconomic Diversity in Contrastive
  Vision-Language Models
No Filter: Cultural and Socioeconomic Diversity in Contrastive Vision-Language Models
Angeline Pouget
Lucas Beyer
Emanuele Bugliarello
Xiao Wang
Andreas Steiner
Xiao-Qi Zhai
Ibrahim M. Alabdulmohsin
VLM
31
7
0
22 May 2024
More Distinctively Black and Feminine Faces Lead to Increased
  Stereotyping in Vision-Language Models
More Distinctively Black and Feminine Faces Lead to Increased Stereotyping in Vision-Language Models
Messi H.J. Lee
Jacob M. Montgomery
Calvin K Lai
VLM
29
0
0
22 May 2024
OpenCarbonEval: A Unified Carbon Emission Estimation Framework in
  Large-Scale AI Models
OpenCarbonEval: A Unified Carbon Emission Estimation Framework in Large-Scale AI Models
Zhaojian Yu
Yinghao Wu
Zhuotao Deng
Yansong Tang
Xiao-Ping Zhang
39
2
0
21 May 2024
Transcriptomics-guided Slide Representation Learning in Computational
  Pathology
Transcriptomics-guided Slide Representation Learning in Computational Pathology
Guillaume Jaume
Lukas Oldenburg
Anurag J. Vaidya
Richard J. Chen
Drew F. K. Williamson
Thomas Peeters
Andrew H. Song
Faisal Mahmood
40
21
0
19 May 2024
A Tale of Two Languages: Large-Vocabulary Continuous Sign Language
  Recognition from Spoken Language Supervision
A Tale of Two Languages: Large-Vocabulary Continuous Sign Language Recognition from Spoken Language Supervision
Charles Raude
Prajwal K R
Liliane Momeni
Hannah Bull
Samuel Albanie
Andrew Zisserman
Gül Varol
SLR
26
5
0
16 May 2024
PRISM: A Multi-Modal Generative Foundation Model for Slide-Level
  Histopathology
PRISM: A Multi-Modal Generative Foundation Model for Slide-Level Histopathology
George Shaikovski
Adam Casson
Kristen Severson
Eric Zimmermann
Yi Kan Wang
...
Peter Hamilton
William A. Moye
Eugene Vorontsov
Siqi Liu
Thomas J. Fuchs
MedIm
30
22
0
16 May 2024
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
Pavan Kumar Anasosalu Vasu
Hadi Pouransari
Fartash Faghri
Oncel Tuzel
VLM
CLIP
22
6
0
14 May 2024
Efficient Vision-Language Pre-training by Cluster Masking
Efficient Vision-Language Pre-training by Cluster Masking
Zihao Wei
Zixuan Pan
Andrew Owens
VLM
21
6
0
14 May 2024
All in One Framework for Multimodal Re-identification in the Wild
All in One Framework for Multimodal Re-identification in the Wild
He Li
Mang Ye
Ming Zhang
Bo Du
25
9
0
08 May 2024
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image
  Retrieval
iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval
Lorenzo Agnolucci
Alberto Baldrati
Marco Bertini
A. Bimbo
35
9
0
05 May 2024
Understanding Retrieval-Augmented Task Adaptation for Vision-Language
  Models
Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models
Yifei Ming
Yixuan Li
VLM
23
7
0
02 May 2024
Hallucination of Multimodal Large Language Models: A Survey
Hallucination of Multimodal Large Language Models: A Survey
Zechen Bai
Pichao Wang
Tianjun Xiao
Tong He
Zongbo Han
Zheng Zhang
Mike Zheng Shou
VLM
LRM
80
139
0
29 Apr 2024
Learning text-to-video retrieval from image captioning
Learning text-to-video retrieval from image captioning
Lucas Ventura
Cordelia Schmid
Gül Varol
3DV
31
3
0
26 Apr 2024
Embracing Diversity: Interpretable Zero-shot classification beyond one
  vector per class
Embracing Diversity: Interpretable Zero-shot classification beyond one vector per class
Mazda Moayeri
Michael G. Rabbat
Mark Ibrahim
Diane Bouchacourt
VLM
41
1
0
25 Apr 2024
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Olivia Wiles
Chuhan Zhang
Isabela Albuquerque
Ivana Kajić
Su Wang
...
Jordi Pont-Tuset
Aida Nematzadeh
Anant Nawalgaria
Jordi Pont-Tuset
Aida Nematzadeh
EGVM
117
13
0
25 Apr 2024
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities
  in Semantic Dataset Deduplication
FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication
Eric Slyman
Stefan Lee
Scott D. Cohen
Kushal Kafle
VLM
23
5
0
24 Apr 2024
MoDE: CLIP Data Experts via Clustering
MoDE: CLIP Data Experts via Clustering
Jiawei Ma
Po-Yao Huang
Saining Xie
Shang-Wen Li
Luke Zettlemoyer
Shih-Fu Chang
Wen-tau Yih
Hu Xu
MoE
CLIP
VLM
18
10
0
24 Apr 2024
SPARO: Selective Attention for Robust and Compositional Transformer
  Encodings for Vision
SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
Ankit Vani
Bac Nguyen
Samuel Lavoie
Ranjay Krishna
Aaron Courville
26
1
0
24 Apr 2024
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster
  Pre-training on Web-scale Image-Text Data
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
Sachin Mehta
Maxwell Horton
Fartash Faghri
Mohammad Hossein Sekhavat
Mahyar Najibi
Mehrdad Farajtabar
Oncel Tuzel
Mohammad Rastegari
VLM
CLIP
29
6
0
24 Apr 2024
Reconstructing the Image Stitching Pipeline: Integrating Fusion and
  Rectangling into a Unified Inpainting Model
Reconstructing the Image Stitching Pipeline: Integrating Fusion and Rectangling into a Unified Inpainting Model
Ziqi Xie
Weidong Zhao
Xianhui Liu
Jian Zhao
Ning Jia
26
0
0
23 Apr 2024
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection
  and Correction
FINEMATCH: Aspect-based Fine-grained Image and Text Mismatch Detection and Correction
Hang Hua
Jing Shi
Kushal Kafle
Simon Jenni
Daoan Zhang
John Collomosse
Scott D. Cohen
Jiebo Luo
CoGe
VLM
42
9
0
23 Apr 2024
AutoAD III: The Prequel -- Back to the Pixels
AutoAD III: The Prequel -- Back to the Pixels
Tengda Han
Max Bain
Arsha Nagrani
Gül Varol
Weidi Xie
Andrew Zisserman
VGen
DiffM
36
20
0
22 Apr 2024
Image Generative Semantic Communication with Multi-Modal Similarity
  Estimation for Resource-Limited Networks
Image Generative Semantic Communication with Multi-Modal Similarity Estimation for Resource-Limited Networks
Eri Hosonuma
Taku Yamazaki
Takumi Miyoshi
Akihito Taya
Yuuki Nishiyama
K. Sezaki
DiffM
18
1
0
17 Apr 2024
Vocabulary-free Image Classification and Semantic Segmentation
Vocabulary-free Image Classification and Semantic Segmentation
Alessandro Conti
Enrico Fini
Massimiliano Mancini
Paolo Rota
Yiming Wang
Elisa Ricci
VLM
35
2
0
16 Apr 2024
CNN-based explanation ensembling for dataset, representation and
  explanations evaluation
CNN-based explanation ensembling for dataset, representation and explanations evaluation
Weronika Hryniewska-Guzik
Luca Longo
P. Biecek
FAtt
43
0
0
16 Apr 2024
Knowledge-enhanced Visual-Language Pretraining for Computational
  Pathology
Knowledge-enhanced Visual-Language Pretraining for Computational Pathology
Xiao Zhou
Xiaoman Zhang
Chaoyi Wu
Ya-Qin Zhang
Weidi Xie
Yanfeng Wang
VLM
27
6
0
15 Apr 2024
The Devil is in the Few Shots: Iterative Visual Knowledge Completion for
  Few-shot Learning
The Devil is in the Few Shots: Iterative Visual Knowledge Completion for Few-shot Learning
Yaohui Li
Qifeng Zhou
Haoxing Chen
Jianbing Zhang
Xinyu Dai
Hao Zhou
VLM
29
0
0
15 Apr 2024
TrafficVLM: A Controllable Visual Language Model for Traffic Video
  Captioning
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Quang Minh Dinh
Minh Khoi Ho
Anh Quan Dang
Hung Phong Tran
30
6
0
14 Apr 2024
Previous
123456...171819
Next