
CoCa: Contrastive Captioners are Image-Text Foundation Models

4 May 2022

Mojtaba Seyedhosseini


Papers citing "CoCa: Contrastive Captioners are Image-Text Foundation Models"

50 / 1,042 papers shown

Deep Correlated Prompting for Visual Recognition with Missing Modalities

Neural Information Processing Systems (NeurIPS), 2024

Wei Feng

462

09 Oct 2024

TuneVLSeg: Prompt Tuning Benchmark for Vision-Language Segmentation Models

Asian Conference on Computer Vision (ACCV), 2024

286

07 Oct 2024

Uncertainty-Guided Enhancement on Driving Perception System via Foundation Models

IEEE International Conference on Robotics and Automation (ICRA), 2024

Mao Ye

252

02 Oct 2024

Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity

Jiaxun Zhang

395

01 Oct 2024

Procedure-Aware Surgical Video-language Pretraining with Hierarchical Knowledge Augmentation

Neural Information Processing Systems (NeurIPS), 2024

404

30 Sep 2024

FAST: A Dual-tier Few-Shot Learning Paradigm for Whole Slide Image Classification

Neural Information Processing Systems (NeurIPS), 2024

Xiaoyuan Luo

198

29 Sep 2024

Vision-Language Models are Strong Noisy Label Detectors

Neural Information Processing Systems (NeurIPS), 2024

Tong Wei

214

29 Sep 2024

From Vision to Audio and Beyond: A Unified Model for Audio-Visual Representation and Generation

International Conference on Machine Learning (ICML), 2024

416

27 Sep 2024

ViKL: A Mammography Interpretation Framework via Multimodal Aggregation of Visual-knowledge-linguistic Features

Xin Wei

Yaling Tao

Changde Du

Gangming Zhao

Yizhou Yu

Jinpeng Li

218

24 Sep 2024

LARE: Latent Augmentation using Regional Embedding with Vision-Language Model

Machine Learning with Applications (MLWA), 2024

Masayuki Goto

247

19 Sep 2024

From Linguistic Giants to Sensory Maestros: A Survey on Cross-Modal Reasoning with Large Language Models

422

19 Sep 2024

MoRAG -- Multi-Fusion Retrieval Augmented Generation for Human Motion

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024

Kalakonda Sai Shashank

Shubh Maheshwari

Ravi Kiran Sarvadevabhatla

VGen DiffM

293

18 Sep 2024

Evaluating Pre-trained Convolutional Neural Networks and Foundation Models as Feature Extractors for Content-based Medical Image Retrieval

Engineering Applications of Artificial Intelligence (EAAI), 2024

343

14 Sep 2024

Phikon-v2, A large and public feature extractor for biomarker prediction

256

13 Sep 2024

ComAlign: Compositional Alignment in Vision-Language Models

Mohammadmahdi Samiei

210

12 Sep 2024

Recent Trends of Multimodal Affective Computing: A Survey from NLP Perspective

Erik Cambria

Hasti Seifi

375

11 Sep 2024

Spindle: Efficient Distributed Training of Multi-Task Large Models via Wavefront Scheduling

International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2024

Yujie Wang

Fangcheng Fu

Jie Zhang

Bin Cui

160

05 Sep 2024

CanvOI, an Oncology Intelligence Foundation Model: Scaling FLOPS Differently

182

04 Sep 2024

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning

Manu Gaur

Darshan Singh

Makarand Tapaswi

939

04 Sep 2024

Rethinking Sparse Lexical Representations for Image Retrieval in the Age of Rising Multi-Modal Large Language Models

163

29 Aug 2024

RSTeller: Scaling Up Visual Language Modeling in Remote Sensing with Rich Linguistic Semantics from Openly Available Data and Large Language Models

ISPRS Journal of Photogrammetry and Remote Sensing (ISPRS J. Photogramm. Remote Sens.), 2024

620

27 Aug 2024

A New Era in Computational Pathology: A Survey on Foundation and Vision-Language Models

Nasim Yahya Soltani

428

23 Aug 2024

Has Multimodal Learning Delivered Universal Intelligence in Healthcare? A Comprehensive Survey

Information Fusion (Inf. Fusion), 2024

Ling Huang

Mengling Feng

293

23 Aug 2024

XDT-CXR: Investigating Cross-Disease Transferability in Zero-Shot Binary Classification of Chest X-Rays

Machine Learning in Health Care (MLHC), 2024

Umaima Rahman

Abhishek Basu

Muhammad Uzair Khattak

Aniq Ur Rahman

MedIm

213

21 Aug 2024

WRIM-Net: Wide-Ranging Information Mining Network for Visible-Infrared Person Re-Identification

European Conference on Computer Vision (ECCV), 2024

Yuan Zichao

263

20 Aug 2024

C$^2$RL: Content and Context Representation Learning for Gloss-free Sign Language Translation and Retrieval

Zhigang Chen

Jun Wan

Yanyan Liang

197

19 Aug 2024

NAVERO: Unlocking Fine-Grained Semantics for Video-Language Compositionality

Hao Yang

Ashwin Swaminathan

Colin Jon Taylor

194

18 Aug 2024

CROME: Cross-Modal Adapters for Efficient Multimodal LLM

Sayna Ebrahimi

Sercan O. Arik

Tejas Nama

Tomas Pfister

188

13 Aug 2024

Contrastive masked auto-encoders based self-supervised hashing for 2D image and 3D point cloud cross-modal retrieval

IEEE International Conference on Multimedia and Expo (ICME), 2024

188

11 Aug 2024

In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation

European Conference on Computer Vision (ECCV), 2024

Dahyun Kang

Minsu Cho

ObjD VLM

390

09 Aug 2024

UniBench: Visual Reasoning Requires Rethinking Vision-Language Beyond Scaling

Neural Information Processing Systems (NeurIPS), 2024

284

09 Aug 2024

ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

European Conference on Computer Vision (ECCV), 2024

Feng Yang

341

07 Aug 2024

Multistain Pretraining for Slide Representation Learning in Pathology

European Conference on Computer Vision (ECCV), 2024

236

05 Aug 2024

Text-Guided Video Masked Autoencoder

European Conference on Computer Vision (ECCV), 2024

167

01 Aug 2024

Conditioned Prompt-Optimization for Continual Deepfake Detection

318

31 Jul 2024

GABInsight: Exploring Gender-Activity Binding Bias in Vision-Language Models

Ali Abdollahi

Mahdi Ghaznavi

Mohammad Reza Karimi Nejad

401

30 Jul 2024

MMTrail: A Multimodal Trailer Video Dataset with Language and Music Descriptions

...

Wenhan Luo

Qifeng Chen

Shanghang Zhang

Qi-fei Liu

Yi-Ting Guo

301

30 Jul 2024

Look Hear: Gaze Prediction for Speech-directed Human Attention

European Conference on Computer Vision (ECCV), 2024

Sounak Mondal

Seoyoung Ahn

Zhibo Yang

Niranjan Balasubramanian

Dimitris Samaras

G. Zelinsky

Minh Hoai

409

28 Jul 2024

MMCLIP: Cross-modal Attention Masked Modelling for Medical Language-Image Pre-Training

Qi Chen

Qi Wu

237

28 Jul 2024

Unified Lexical Representation for Interpretable Visual-Language Alignment

211

25 Jul 2024

QPT V2: Masked Image Modeling Advances Visual Scoring

236

23 Jul 2024

Improved Few-Shot Image Classification Through Multiple-Choice Questions

Dipika Khullar

Emmett Goodman

Negin Sokhandan

151

23 Jul 2024

MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity

...

Yu Qiao

313

22 Jul 2024

In-Context Learning Improves Compositional Understanding of Vision-Language Models

193

22 Jul 2024

The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models

Laura Niss

Kevin Vogt-Lowell

Theodoros Tsiligkaridis

VLM

306

22 Jul 2024

A Multimodal Knowledge-enhanced Whole-slide Pathology Foundation Model

...

Anjia Han

Ronald Cheong Kin Chan

Li Liang

Xiuming Zhang

Hao Chen

436

22 Jul 2024

Large-vocabulary forensic pathological analyses via prototypical cross-modal contrastive learning

Fan Wang

...

209

20 Jul 2024

Multimodal Label Relevance Ranking via Reinforcement Learning

189

18 Jul 2024

ViLLa: Video Reasoning Segmentation with Large Language Model

Yu Qiao

507

18 Jul 2024

ClearCLIP: Decomposing CLIP Representations for Dense Vision-Language Inference

Yiping Ke

332

17 Jul 2024