A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions

14 December 2023
Jack Urbanek, Florian Bordes, Pietro Astolfi, Mary Williamson, Vasu Sharma, Adriana Romero Soriano
Tags: CLIP, 3DV

Papers citing "A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions"

14 citing papers:

Improving Editability in Image Generation with Layer-wise Memory
Daneul Kim, Jaeah Lee, Jaesik Park
Tags: DiffM, KELM
02 May 2025

Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao (Bernie) Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, ..., Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer
Tags: ObjD, VOS
17 Apr 2025

GOAL: Global-local Object Alignment Learning
Hyungyu Choi, Young Kyun Jang, Chanho Eom
Tags: VLM
22 Mar 2025
Know "No'' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
Know "No'' Better: A Data-Driven Approach for Enhancing Negation Awareness in CLIP
J. Park
Jungbeom Lee
Jongyoon Song
Sangwon Yu
Dahuin Jung
Sungroh Yoon
45
0
0
19 Jan 2025

COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training
Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
Tags: VLM
02 Dec 2024

TULIP: Token-length Upgraded CLIP
Ivona Najdenkoska, Mohammad Mahdi Derakhshani, Yuki M. Asano, N. V. Noord, Marcel Worring, Cees G. M. Snoek
Tags: VLM
13 Oct 2024

No Detail Left Behind: Revisiting Self-Retrieval for Fine-Grained Image Captioning
Manu Gaur, Darshan Singh, Makarand Tapaswi
04 Sep 2024

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh, Cheng-Yu Hsieh, Shih-Ying Yeh, Louis Béthune, Hadi Pouransari, Pavan Kumar Anasosalu Vasu, Chun-Liang Li, Ranjay Krishna, Oncel Tuzel, Marco Cuturi
09 Jul 2024

Fantastic Copyrighted Beasts and How (Not) to Generate Them
Luxi He, Yangsibo Huang, Weijia Shi, Tinghao Xie, Haotian Liu, Yue Wang, Luke Zettlemoyer, Chiyuan Zhang, Danqi Chen, Peter Henderson
20 Jun 2024

Multi-Modal Generative Embedding Model
Feipeng Ma, Hongwei Xue, Guangting Wang, Yizhou Zhou, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, Xiaoyan Sun
Tags: VLM
29 May 2024

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven C. H. Hoi
Tags: VLM, MLLM
30 Jan 2023

An Exploration of Hierarchical Attention Transformers for Efficient Long Document Classification
Ilias Chalkidis, Xiang Dai, Manos Fergadiotis, Prodromos Malakasiotis, Desmond Elliott
11 Oct 2022

BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li, Dongxu Li, Caiming Xiong, S. Hoi
Tags: MLLM, BDL, VLM, CLIP
28 Jan 2022

WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
Krishna Srinivasan, K. Raman, Jiecao Chen, Michael Bendersky, Marc Najork
Tags: VLM
02 Mar 2021