Microsoft COCO Captions: Data Collection and Evaluation Server

1 April 2015

Piotr Dollar

Papers citing "Microsoft COCO Captions: Data Collection and Evaluation Server"

Cacophony: An Improved Contrastive Audio-Text Model

IEEE/ACM Transactions on Audio Speech and Language Processing (TASLP), 2024

325

10 Feb 2024

GPTs Are Multilingual Annotators for Sequence Generation Tasks

Juhwan Choi

Eunju Lee

Kyohoon Jin

Youngbin Kim

156

08 Feb 2024

Question Aware Vision Transformer for Multimodal Reasoning

299

08 Feb 2024

Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models

Jian Yang

237

08 Feb 2024

Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning

263

03 Feb 2024

Can MLLMs Perform Text-to-Image In-Context Learning?

263

02 Feb 2024

SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling

Eileen Wang

S. Han

Josiah Poon

281

01 Feb 2024

EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning

Jaeyeon Kim

Jaeyoon Jung

Jinjoo Lee

Sang Hoon Woo

CLIP VLM

218

31 Jan 2024

EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain

Tong Zhang

434

218

30 Jan 2024

Towards Unified Interactive Visual Grounding in The Wild

299

30 Jan 2024

M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining

Jingdong Chen

Ming Yang

VLM MLLM

220

29 Jan 2024

Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

293

29 Jan 2024

MM-LLMs: Recent Advances in MultiModal Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2024

512

335

24 Jan 2024

Common-Sense Bias Modeling for Classification Tasks

AAAI Conference on Artificial Intelligence (AAAI), 2024

486

24 Jan 2024

Enhancing Object Detection Performance for Small Objects through Synthetic Data Generation and Proportional Class-Balancing Technique: A Comparative Study in Industrial Scenarios

Christiane Plociennik

Martin Ruskowski

197

23 Jan 2024

SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Computer Vision and Pattern Recognition (CVPR), 2024

Dorsa Sadigh

329

548

22 Jan 2024

Text-to-Image Cross-Modal Generation: A Systematic Review

Maciej Żelaszczyk

Jacek Mańdziuk

320

21 Jan 2024

LLMRA: Multi-modal Large Language Model based Restoration Assistant

Xiaoyu Jin

Yuan Shi

Bin Xia

Wenming Yang

201

21 Jan 2024

CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios

358

19 Jan 2024

Supervised Fine-tuning in turn Improves Visual Foundation Models

Chun Yuan

Ying Shan

VLM CLIP

256

18 Jan 2024

MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer

...

Yu Qiao

240

18 Jan 2024

Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer

352

17 Jan 2024

COCO is "ALL" You Need for Visual Instruction Fine-tuning

IEEE International Conference on Multimedia and Expo (ICME), 2024

Hongxia Yang

210

17 Jan 2024

Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding

...

Somayeh Sojoudi

193

09 Jan 2024

CaMML: Context-Aware Multimodal Learner for Large Models

277

06 Jan 2024

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

314

06 Jan 2024

4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency

325

112

28 Dec 2023

Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action

282

274

28 Dec 2023

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey

198

27 Dec 2023

Cloud-Device Collaborative Learning for Multimodal Large Language Models

Yuan Zhang

...

Shanghang Zhang

215

26 Dec 2023

UniRef++: Segment Every Reference Object in Spatial and Temporal Spaces

Huchuan Lu

Ping Luo

273

25 Dec 2023

Learning from Mistakes: Iterative Prompt Relabeling for Text-to-Image Diffusion Model Training

Shanghang Zhang

465

23 Dec 2023

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Weijie Su

...

Ping Luo

Yu Qiao

641

2,210

21 Dec 2023

Generative Multimodal Models are In-Context Learners

...

Tiejun Huang

374

422

20 Dec 2023

Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining

255

19 Dec 2023

CLIM: Contrastive Language-Image Mosaic for Region Representation

Wentao Liu

199

18 Dec 2023

M3DBench: Let's Instruct Large Models with Multi-modal 3D Prompts

307

17 Dec 2023

Simple Image-level Classification Improves Open-vocabulary Object Detection

AAAI Conference on Artificial Intelligence (AAAI), 2023

292

16 Dec 2023

Tell Me What You See: Text-Guided Real-World Image Denoising

IEEE Open Journal of Signal Processing (IEEE Open J. Signal Process.), 2023

E. Yosef

Raja Giryes

DiffM

499

15 Dec 2023

VL-GPT: A Generative Pre-trained Transformer for Vision and Language Understanding and Generation

Sijie Zhao

Ying Shan

191

14 Dec 2023

Pixel Aligned Language Models

Computer Vision and Pattern Recognition (CVPR), 2023

291

14 Dec 2023

CL2CM: Improving Cross-Lingual Cross-Modal Retrieval via Cross-Lingual Knowledge Transfer

AAAI Conference on Artificial Intelligence (AAAI), 2023

Fan Wang

206

14 Dec 2023

Depicting Beyond Scores: Advancing Image Quality Assessment through Multi-modal Language Models

European Conference on Computer Vision (ECCV), 2023

Jinjin Gu

400

14 Dec 2023

Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning

AAAI Conference on Artificial Intelligence (AAAI), 2023

254

14 Dec 2023

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator

461

11 Dec 2023

Stellar: Systematic Evaluation of Human-Centric Personalized Text-to-Image Methods

Panos Achlioptas

Alexandros Benetatos

Iordanis Fostiropoulos

Dimitris Skourtis

276

11 Dec 2023

GlitchBench: Can large multimodal models detect video game glitches?

Computer Vision and Pattern Recognition (CVPR), 2023

Mohammad Reza Taesiri

329

08 Dec 2023

Lyrics: Boosting Fine-grained Language-Vision Alignment and Comprehension via Semantic-aware Visual Objects

229

08 Dec 2023

GPT4SGG: Synthesizing Scene Graphs from Holistic and Region-specific Narratives

Zuyao Chen

Jinlin Wu

Zhen Lei

Zhaoxiang Zhang

Changwen Chen

296

07 Dec 2023

Open-Vocabulary Segmentation with Semantic-Assisted Calibration

Yong Liu

227

07 Dec 2023