Connecting Vision and Language with Localized Narratives

European Conference on Computer Vision (ECCV), 2019

6 December 2019

Papers citing "Connecting Vision and Language with Localized Narratives"

50 / 200 papers shown

SGDiff: Scene Graph Guided Diffusion Model for Image Collaborative SegCaptioning

AAAI Conference on Artificial Intelligence (AAAI), 2025

128

01 Dec 2025

LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight

121

25 Nov 2025

NVIDIA Nemotron Nano V2 VL

Nvidia

Amala Sanjay Deshmukh

...

310

06 Nov 2025

VAMOS: A Hierarchical Vision-Language-Action Model for Capability-Modulated and Steerable Navigation

...

135

23 Oct 2025

FineVision: Open Data Is All You Need

Aritra Roy Gosthipaty

Andrés Marafioti

VLM

195

20 Oct 2025

BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices

Ghaluh Indah Permata Sari

Yie-Tarng Chen

125

12 Oct 2025

Looking to Learn: Token-wise Dynamic Gating for Low-Resource Vision-Language Modelling

Bianca-Mihaela Ganescu

136

09 Oct 2025

One Patch to Caption Them All: A Unified Zero-Shot Captioning Framework

173

03 Oct 2025

Model Merging to Maintain Language-Only Performance in Developmentally Plausible Multimodal Models

160

02 Oct 2025

VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions

156

30 Sep 2025

Long Story Short: Disentangling Compositionality and Long-Caption Understanding in VLMs

Israfel Salazar

Desmond Elliott

Yova Kementchedjhieva

CoGe VLM

203

23 Sep 2025

MAJORScore: A Novel Metric for Evaluating Multimodal Relevance via Joint Representation

101

22 Sep 2025

Case-Based Decision-Theoretic Decoding with Quality Memories

Hiroyuki Deguchi

Masaaki Nagata

161

16 Sep 2025

Towards Understanding Visual Grounding in Visual Language Models

Georgios Pantazopoulos

Eda B. Özyiğit

ObjD

315

12 Sep 2025

VoCap: Video Object Captioning and Segmentation from Any Prompt

261

29 Aug 2025

OVFact: Measuring and Improving Open-Vocabulary Factuality for Long Caption Models

186

25 Jul 2025

LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences

203

25 Jul 2025

AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning

...

176

17 Jul 2025

FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation

266

20 Jun 2025

HalLoc: Token-level Localization of Hallucinations for Vision Language Models

Computer Vision and Pattern Recognition (CVPR), 2025

374

12 Jun 2025

ViCrit: A Verifiable Reinforcement Learning Proxy Task for Visual Perception in VLMs

...

308

11 Jun 2025

MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping

280

02 Jun 2025

ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations

271

29 May 2025

Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion

434

23 May 2025

DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?

223

22 May 2025

DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation

Computer Vision and Pattern Recognition (CVPR), 2025

236

16 May 2025

Learning Graph Representation of Agent Diffusers

Adaptive Agents and Multi-Agent Systems (AAMAS), 2025

Ahmed Nabil Belbachir

Anis Yazidi

AI4CE

490

10 May 2025

FUSION: Fully Integration of Vision-Language Representations for Deep Cross-Modal Understanding

466

14 Apr 2025

Latent Beam Diffusion Models for Generating Visual Sequences

368

26 Mar 2025

MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning

501

26 Mar 2025

StreamMind: Unlocking Full Frame Rate Streaming Video Dialogue through Event-Gated Cognition

927

08 Mar 2025

M2-omni: Advancing Omni-MLLM for Comprehensive Modality Support with Competitive Performance

...

584

26 Feb 2025

BabyLM Turns 3: Call for papers for the 2025 BabyLM workshop

...

320

15 Feb 2025

Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions

AAAI Conference on Artificial Intelligence (AAAI), 2024

524

12 Feb 2025

DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students' Hand-Drawn Math Images

North American Chapter of the Association for Computational Linguistics (NAACL), 2025

296

28 Jan 2025

Findings of the Second BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora

493

06 Dec 2024

BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices

Computer Vision and Pattern Recognition (CVPR), 2024

...

201

16 Nov 2024

SimpsonsVQA: Enhancing Inquiry-Based Learning with a Tailored Dataset

Ngoc Dung Huynh

Mohamed Reda Bouadjenek

Sunil Aryal

Imran Razzak

Hakim Hacid

233

30 Oct 2024

Offline Evaluation of Set-Based Text-to-Image Generation

215

22 Oct 2024

Exploring Curriculum Learning for Vision-Language Tasks: A Study on Small-Scale Multimodal Training

190

20 Oct 2024

Trust but Verify: Programmatic VLM Evaluation in the Wild

163

17 Oct 2024

Compositional Entailment Learning for Hyperbolic Vision-Language Models

International Conference on Learning Representations (ICLR), 2024

Avik Pal

Max van Spengler

Guido Maria D'Amely di Melendugno

Alessandro Flaborea

Fabio Galasso

Pascal Mettes

CoGe

377

09 Oct 2024

Dynamic Prompting of Frozen Text-to-Image Diffusion Models for Panoptic Narrative Grounding

ACM Multimedia (MM), 2024

Hongyu Li

Bin Ma

Jizhong Han

Si Liu

DiffM

234

12 Sep 2024

PiTe: Pixel-Temporal Alignment for Large Video-Language Model

European Conference on Computer Vision (ECCV), 2024

Yang Liu

Pengxiang Ding

Siteng Huang

Min Zhang

Han Zhao

Donglin Wang

214

11 Sep 2024

Building and better understanding vision-language models: insights and future directions

Hugo Laurençon

317

132

22 Aug 2024

TraDiffusion: Trajectory-Based Training-Free Image Generation

Mingrui Wu

Jiayi Ji

Xiaoshuai Sun

Rongrong Ji

207

19 Aug 2024

Look Hear: Gaze Prediction for Speech-directed Human Attention

European Conference on Computer Vision (ECCV), 2024

Sounak Mondal

Seoyoung Ahn

Zhibo Yang

Niranjan Balasubramanian

Dimitris Samaras

G. Zelinsky

Minh Hoai

407

28 Jul 2024

Position: Measure Dataset Diversity, Don't Just Claim It

Dora Zhao

Jerone T. A. Andrews

Orestis Papakyriakopoulos

Alice Xiang

275

11 Jul 2024

Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions

Pavan Kumar Anasosalu Vasu

378

09 Jul 2024

A Single Transformer for Scalable Vision-Language Modeling

290

08 Jul 2024