Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015

Piotr Dollar

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging

Ho Hin Lee

...

315

09 Oct 2024

M^3EL: A Multi-task Multi-topic Dataset for Multi-modal Entity Linking

247

08 Oct 2024

SIA-OVD: Shape-Invariant Adapter for Bridging the Image-Region Gap in Open-Vocabulary Detection

ACM Multimedia (MM), 2024

208

08 Oct 2024

Precise Model Benchmarking with Only a Few Observations

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Riccardo Fogliato

Pratik Patil

Nil-Jana Akpinar

Mathew Monfort

209

07 Oct 2024

Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Junmo Kim

343

07 Oct 2024

MM-R$^3$: On (In-)Consistency of Vision-Language Models (VLMs)

289

07 Oct 2024

VEDIT: Latent Prediction Architecture For Procedural Video Representation Learning

International Conference on Learning Representations (ICLR), 2024

300

04 Oct 2024

Bridging the Gap between Text, Audio, Image, and Any Sequence: A Novel Approach using Gloss-based Annotation

Sen Fang

171

04 Oct 2024

Toward a Holistic Evaluation of Robustness in CLIP Models

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

349

02 Oct 2024

ASCIIEval: Benchmarking Models' Visual Perception in Text Strings via ASCII Art

247

02 Oct 2024

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

Haotian Zhang

Mingfei Gao

...

Zirui Wang

Yinfei Yang

303

30 Sep 2024

Multimodal LLM Enhanced Cross-lingual Cross-modal Retrieval

ACM Multimedia (MM), 2024

Yabing Wang

Zhibin Wang

Gang Hua

222

30 Sep 2024

Efficient Backdoor Defense in Multimodal Contrastive Learning: A Token-Level Unlearning Method for Mitigating Threats

Xiaochun Cao

273

29 Sep 2024

From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding

...

Huaijian Zhang

299

27 Sep 2024

Emu3: Next-Token Prediction is All You Need

Xinlong Wang

Xiaosong Zhang

Zhengxiong Luo

Quan-Sen Sun

Yufeng Cui

...

Xi Yang

Jingjing Liu

Yonghua Lin

Tiejun Huang

Zhongyuan Wang

MLLM

290

483

27 Sep 2024

IFCap: Image-like Retrieval and Frequency-based Entity Filtering for Zero-shot Captioning

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

217

26 Sep 2024

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models

Computer Vision and Pattern Recognition (CVPR), 2024

Matt Deitke

Christopher Clark

Sangho Lee

Rohun Tripathi

Yue Yang

...

Noah A. Smith

Hannaneh Hajishirzi

Ross Girshick

Ali Farhadi

Aniruddha Kembhavi

OSLM VLM

457

25 Sep 2024

Understanding the Cognitive Complexity in Language Elicited by Product Images

255

25 Sep 2024

Enhancing Advanced Visual Reasoning Ability of Large Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Chaoyi Zhang

Weidong Cai

259

21 Sep 2024

Instruction-guided Multi-Granularity Segmentation and Captioning with Large Multimodal Model

861

20 Sep 2024

JourneyBench: A Challenging One-Stop Vision-Language Understanding Benchmark of Generated Images

Neural Information Processing Systems (NeurIPS), 2024

Zhecan Wang

Junzhang Liu

Chia-Wei Tang

Hani Alomari

Anushka Sivakumar

...

Haoxuan You

A. Ishmam

Kai-Wei Chang

Shih-Fu Chang

Chris Thomas

CoGe VLM

505

19 Sep 2024

OneEncoder: A Lightweight Framework for Progressive Alignment of Modalities

Hanane Azzag

M. Lebbah

ObjD

349

17 Sep 2024

Benchmarking VLMs' Reasoning About Persuasive Atypical Images

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024

378

16 Sep 2024

Evaluating authenticity and quality of image captions via sentiment and semantic analyses

128

14 Sep 2024

Guiding Vision-Language Model Selection for Visual Question-Answering Across Tasks, Domains, and Knowledge Types

Neelabh Sinha

Vinija Jain

Vasu Sharma

187

14 Sep 2024

Alignment of Diffusion Models: Fundamentals, Challenges, and Future

463

11 Sep 2024

FrozenSeg: Harmonizing Frozen Foundation Models for Open-Vocabulary Segmentation

Xiatian Zhu

244

05 Sep 2024

A New People-Object Interaction Dataset and NVS Benchmarks

International Conference on Information Photonics (ICIP), 2024

268

03 Sep 2024

Blocks as Probes: Dissecting Categorization Ability of Large Multimodal Models

British Machine Vision Conference (BMVC), 2024

Jialin Li

150

03 Sep 2024

Expanding on EnCLAP with Auxiliary Retrieval Model for Automated Audio Captioning

Jaeyeon Kim

Jaeyoon Jung

Minjeong Jeon

Sang Hoon Woo

Jinjoo Lee

172

02 Sep 2024

Pre-Training Multimodal Hallucination Detectors with Corrupted Grounding Data

Spencer Whitehead

Jacob Phillips

Sean Hendryx

183

30 Aug 2024

Image-Perfect Imperfections: Safety, Bias, and Authenticity in the Shadow of Text-To-Image Model Evolution

Conference on Computer and Communications Security (CCS), 2024

Yixin Wu

Yun Shen

Michael Backes

Yang Zhang

263

30 Aug 2024

A Survey on Evaluation of Multimodal Large Language Models

Jiaxing Huang

Jingyi Zhang

LM&MA ELM LRM

305

28 Aug 2024

Probing the Robustness of Vision-Language Pretrained Models: A Multimodal Adversarial Attack Approach

Xi Zheng

287

24 Aug 2024

ParGo: Bridging Vision-Language with Partial and Global Views

AAAI Conference on Artificial Intelligence (AAAI), 2024

519

23 Aug 2024

SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

332

21 Aug 2024

Attribution Analysis Meets Model Editing: Advancing Knowledge Correction in Vision Language Models with VisEdit

AAAI Conference on Artificial Intelligence (AAAI), 2024

695

19 Aug 2024

Quality Assessment in the Era of Large Models: A Survey

Zicheng Zhang

Guangtao Zhai

344

17 Aug 2024

Can Large Language Models Understand Symbolic Graphics Programs?

International Conference on Learning Representations (ICLR), 2024

602

15 Aug 2024

Efficient and Versatile Robust Fine-Tuning of Zero-shot Models

European Conference on Computer Vision (ECCV), 2024

232

11 Aug 2024

ArtVLM: Attribute Recognition Through Vision-Based Prefix Language Modeling

European Conference on Computer Vision (ECCV), 2024

Feng Yang

341

07 Aug 2024

Attacks and Defenses for Generative Diffusion Models: A Comprehensive Survey

ACM Computing Surveys (ACM CSUR), 2024

341

06 Aug 2024

GazeXplain: Learning to Predict Natural Language Explanations of Visual Scanpaths

European Conference on Computer Vision (ECCV), 2024

Xianyu Chen

Ming Jiang

Qi Zhao

213

05 Aug 2024

VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024

329

05 Aug 2024

VolDoGer: LLM-assisted Datasets for Domain Generalization in Vision-Language Tasks

309

29 Jul 2024

Start from Video-Music Retrieval: An Inter-Intra Modal Loss for Cross Modal Retrieval

225

28 Jul 2024

LLAVADI: What Matters For Multimodal Large Language Models Distillation

Xiangtai Li

Ming-Hsuan Yang

216

28 Jul 2024

SWIFT: Semantic Watermarking for Image Forgery Thwarting

245

26 Jul 2024

MLLM-CompBench: A Comparative Reasoning Benchmark for Multimodal LLMs

Wei-Lun Chao

345

23 Jul 2024

Multimodal Unlearnable Examples: Protecting Data against Multimodal Contrastive Learning

283

23 Jul 2024