
Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015

Piotr Dollar

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,519 papers shown

jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images

794

11 Dec 2024

FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing

289

10 Dec 2024

Visual Lexicon: Rich Image Features in Language Space

Computer Vision and Pattern Recognition (CVPR), 2024

208

09 Dec 2024

JAPAGEN: Efficient Few/Zero-shot Learning via Japanese Training Dataset Generation with LLM

Pacific Asia Conference on Language, Information and Computation (PACLIC), 2024

Takuro Fujii

Satoru Katsumata

202

09 Dec 2024

Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor

ACM Multimedia (MM), 2024

315

08 Dec 2024

EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios

396

05 Dec 2024

Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference

179

04 Dec 2024

AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?

650

04 Dec 2024

ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?

Simone Paolo Ponzetto

Fahimeh Moafian

Zhixue Zhao

MLLM

370

03 Dec 2024

Progress-Aware Video Frame Captioning

Computer Vision and Pattern Recognition (CVPR), 2024

600

03 Dec 2024

Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model

302

02 Dec 2024

Perception of Visual Content: Differences Between Humans and Foundation Models

International Conference on Web and Social Media (ICWSM), 2024

431

28 Nov 2024

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

336

27 Nov 2024

Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks

Computer Vision and Pattern Recognition (CVPR), 2024

348

24 Nov 2024

Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts

Computer Vision and Pattern Recognition (CVPR), 2024

369

23 Nov 2024

Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification

Computer Vision and Pattern Recognition (CVPR), 2024

635

22 Nov 2024

PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment

526

18 Nov 2024

SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

Computer Vision and Pattern Recognition (CVPR), 2024

385

17 Nov 2024

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Computer Vision and Pattern Recognition (CVPR), 2024

...

197

16 Nov 2024

EVOKE: Elevating Chest X-ray Report Generation via Multi-View Contrastive Learning and Patient-Specific Knowledge

Ruixuan Liu

Tianyi Liu

Kun Xie

Zhicheng Jiao

119

15 Nov 2024

Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions

North American Chapter of the Association for Computational Linguistics (NAACL), 2024

488

13 Nov 2024

Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding

Neural Information Processing Systems (NeurIPS), 2024

139

08 Nov 2024

Image Understanding Makes for A Good Tokenizer for Image Generation

Neural Information Processing Systems (NeurIPS), 2024

203

07 Nov 2024

MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning

...

269

05 Nov 2024

Classification Done Right for Vision-Language Pre-Training

Neural Information Processing Systems (NeurIPS), 2024

415

05 Nov 2024

Phase Diagram of Vision Large Language Models Inference: A Perspective from Interaction across Image and Instruction

245

01 Nov 2024

MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts

Neural Information Processing Systems (NeurIPS), 2024

Mingyu Ding

Jingdong Wang

159

30 Oct 2024

Controlling Language and Diffusion Models by Transporting Activations

International Conference on Learning Representations (ICLR), 2024

Luca Zappella

324

30 Oct 2024

VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration

International Conference on Learning Representations (ICLR), 2024

240

29 Oct 2024

What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration

Neural Information Processing Systems (NeurIPS), 2024

L. Qin

Qiguang Chen

Hao Fei

Zhi Chen

Min Li

Wanxiang Che

207

27 Oct 2024

Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models

Neural Information Processing Systems (NeurIPS), 2024

Liulei Li

Wenguan Wang

Yue Yang

235

26 Oct 2024

Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors

Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2024

264

26 Oct 2024

A Combinatorial Approach to Neural Emergent Communication

International Conference on Computational Linguistics (COLING), 2024

Zheyuan Zhang

147

24 Oct 2024

Probabilistic Language-Image Pre-Training

International Conference on Learning Representations (ICLR), 2024

1.2K

24 Oct 2024

ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning

International Journal of Computer Vision (IJCV), 2024

287

23 Oct 2024

Offline Evaluation of Set-Based Text-to-Image Generation

212

22 Oct 2024

Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss

Zhiqiang Hu

155

22 Oct 2024

Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance

Zhe Chen

...

395

21 Oct 2024

TIPS: Text-Image Pretraining with Spatial awareness

International Conference on Learning Representations (ICLR), 2024

Kevis-Kokitsi Maninis

...

Mojtaba Seyedhosseini

Howard Zhou

Andre Araujo

VLM

436

21 Oct 2024

EVA: An Embodied World Model for Future Video Anticipation

...

229

20 Oct 2024

Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations

375

17 Oct 2024

Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation

Computer Vision and Pattern Recognition (CVPR), 2024

...

390

264

17 Oct 2024

Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation

Qiong Cao

210

17 Oct 2024

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

ACM Multimedia (ACM MM), 2022

377

16 Oct 2024

Learning to Customize Text-to-Image Diffusion In Diverse Context

217

14 Oct 2024

Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering

IEEE Transactions on Image Processing (TIP), 2024

Ting Yu

Kunhao Fu

Jian Zhang

Qingming Huang

Jun Yu

218

12 Oct 2024

Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping

International Conference on Learning Representations (ICLR), 2024

428

11 Oct 2024

A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks

Neural Information Processing Systems (NeurIPS), 2024

198

10 Oct 2024

Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training

Computer Vision and Pattern Recognition (CVPR), 2024

Zhaokai Wang

Yu Qiao

Xizhou Zhu

VLM MLLM

361

10 Oct 2024

Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

927

10 Oct 2024
