Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015

Piotr Dollar

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

50 / 1,519 papers shown

Coeff-Tuning: A Graph Filter Subspace View for Tuning Attention-Based Large Models

Computer Vision and Pattern Recognition (CVPR), 2025

Zichen Miao

Wei Chen

Qiang Qiu

271

24 Mar 2025

4D-Bench: Benchmarking Multi-modal Large Language Models for 4D Object Understanding

...

266

22 Mar 2025

BadToken: Token-level Backdoor Attacks to Multi-modal Large Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

451

20 Mar 2025

UniCrossAdapter: Multimodal Adaptation of CLIP for Radiology Report Generation

295

20 Mar 2025

Deeply Supervised Flow-Based Generative Models

Inkyu Shin

Chenglin Yang

Liang-Chieh Chen

444

18 Mar 2025

Can Large Vision Language Models Read Maps Like a Human?

372

18 Mar 2025

Dynamic Relation Inference via Verb Embeddings

383

17 Mar 2025

Scale Efficient Training for Large Datasets

Computer Vision and Pattern Recognition (CVPR), 2025

Qing Zhou

Junyu Gao

Qi Wang

324

17 Mar 2025

Hyperbolic Safety-Aware Vision-Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

292

15 Mar 2025

Neurons: Emulating the Human Visual Cortex Improves Fidelity and Interpretability in fMRI-to-Video Reconstruction

296

14 Mar 2025

RONA: Pragmatically Diverse Image Captioning with Coherence Relations

Aashish Anantha Ramakrishnan

Aadarsh Anantha Ramakrishnan

Dongwon Lee

309

14 Mar 2025

Vamba: Understanding Hour-Long Videos with Hybrid Mamba-Transformers

303

14 Mar 2025

Cyclic Contrastive Knowledge Transfer for Open-Vocabulary Object Detection

International Conference on Learning Representations (ICLR), 2025

1.0K

14 Mar 2025

FlowTok: Flowing Seamlessly Across Text and Image Tokens

524

13 Mar 2025

Teaching LMMs for Image Quality Scoring and Interpreting

410

12 Mar 2025

Scaling Laws for Conditional Emergence of Multilingual Image Captioning via Generalization from Translation

563

12 Mar 2025

SuperCap: Multi-resolution Superpixel-based Image Captioning

283

11 Mar 2025

LongProLIP: A Probabilistic Vision-Language Model with Long Context Text

Sanghyuk Chun

Sangdoo Yun

VLM

304

11 Mar 2025

Stick to Facts: Towards Fidelity-oriented Product Description Generation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

324

11 Mar 2025

A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions

Rahul Nair

Bhanu Tokas

Neel Shah

391

10 Mar 2025

Task-Agnostic Attacks Against Vision Foundation Models

229

05 Mar 2025

Are Large Vision Language Models Good Game Players?

International Conference on Learning Representations (ICLR), 2025

245

04 Mar 2025

Language-Guided Visual Perception Disentanglement for Image Quality Assessment and Conditional Image Generation

233

04 Mar 2025

UFO: A Unified Approach to Fine-grained Visual Perception via Open-ended Language Interface

448

03 Mar 2025

Abn-BLIP: Abnormality-aligned Bootstrapping Language-Image Pre-training for Pulmonary Embolism Diagnosis and Report Generation from CTPA

Medical Image Analysis (MedIA), 2025

...

269

03 Mar 2025

Stealthy Backdoor Attack in Self-Supervised Learning Vision Encoders for Large Vision Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

Zhaoyi Liu

Huan Zhang

AAML

694

25 Feb 2025

Capability Instruction Tuning: A New Paradigm for Dynamic LLM Routing

AAAI Conference on Artificial Intelligence (AAAI), 2025

451

24 Feb 2025

Fine-Grained Captioning of Long Videos through Scene Graph Consolidation

Sanghyeok Chu

Seonguk Seo

Bohyung Han

595

23 Feb 2025

InterFeedback: Unveiling Interactive Intelligence of Large Multimodal Models via Human Feedback

508

20 Feb 2025

CAPability: A Comprehensive Visual Caption Benchmark for Evaluating Both Correctness and Thoroughness

...

462

19 Feb 2025

RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

495

18 Feb 2025

MindLLM: A Subject-Agnostic and Versatile Model for fMRI-to-Text Decoding

397

18 Feb 2025

Learning to Sample Effective and Diverse Prompts for Text-to-Image Generation

Computer Vision and Pattern Recognition (CVPR), 2025

297

17 Feb 2025

Towards Cross-Lingual Explanation of Artwork in Large-scale Vision Language Models

373

17 Feb 2025

Any Information Is Just Worth One Single Screenshot: Unifying Search With Visualized Information Retrieval

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

960

17 Feb 2025

Scaling Autonomous Agents via Automatic Reward Modeling And Planning

International Conference on Learning Representations (ICLR), 2025

306

17 Feb 2025

Pixel-Level Reasoning Segmentation via Multi-turn Conversations

Annual Meeting of the Association for Computational Linguistics (ACL), 2025

335

13 Feb 2025

PatentLMM: Large Multimodal Model for Generating Descriptions for Patent Figures

AAAI Conference on Artificial Intelligence (AAAI), 2025

373

28 Jan 2025

MASS: Overcoming Language Bias in Image-Text Matching

AAAI Conference on Artificial Intelligence (AAAI), 2025

209

20 Jan 2025

OneLLM: One Framework to Align All Modalities with Language

Computer Vision and Pattern Recognition (CVPR), 2023

552

191

10 Jan 2025

Multimodal Multihop Source Retrieval for Web Question Answering

Navya Yarrabelly

Saloni Mittal

143

07 Jan 2025

A Novel Shape Guided Transformer Network for Instance Segmentation in Remote Sensing Images

IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (IEEE JSTARS), 2024

Dawen Yu

Shunping Ji

ViT

286

03 Jan 2025

VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks

Neural Information Processing Systems (NeurIPS), 2024

...

787

118

03 Jan 2025

A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames

Computer Vision and Pattern Recognition (CVPR), 2023

272

31 Dec 2024

B-AVIBench: Towards Evaluating the Robustness of Large Vision-Language Model on Black-box Adversarial Visual-Instructions

IEEE Transactions on Information Forensics and Security (IEEE TIFS), 2024

214

31 Dec 2024

Incorporating Feature Pyramid Tokenization and Open Vocabulary Semantic Segmentation

International Conference on Artificial Neural Networks (ICANN), 2024

382

18 Dec 2024

Adversarial Hubness in Multi-Modal Retrieval

572

18 Dec 2024

From Simple to Professional: A Combinatorial Controllable Image Captioning Agent

286

15 Dec 2024

Learning to Merge Tokens via Decoupled Embedding for Efficient Vision Transformers

Neural Information Processing Systems (NeurIPS), 2024

Dong Hoon Lee

Seunghoon Hong

230

13 Dec 2024

DocVLM: Make Your VLM an Efficient Reader

Computer Vision and Pattern Recognition (CVPR), 2024

629

11 Dec 2024
