ResearchTrend.AI

Point and Ask: Incorporating Pointing into Visual Question Answering
arXiv: 2011.13681
27 November 2020
Arjun Mani, Nobline Yoo, William Fu-Hinthorn, Olga Russakovsky
Tags: 3DPC

Papers citing "Point and Ask: Incorporating Pointing into Visual Question Answering"

18 papers
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos, Eda B. Özyiğit
Tags: ObjD
12 Sep 2025

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data
Honglu Zhou, Xiangyu Peng, Shrikant B. Kendre, Michael S Ryoo, Silvio Savarese, Caiming Xiong, Juan Carlos Niebles
03 Sep 2025

LOVA3: Learning to Visual Question Answering, Asking and Assessment
Neural Information Processing Systems (NeurIPS), 2024
Henry Hengyuan Zhao, Pan Zhou, Difei Gao, Zechen Bai, Mike Zheng Shou
21 Feb 2025

The Labyrinth of Links: Navigating the Associative Maze of Multi-modal LLMs
International Conference on Learning Representations (ICLR), 2024
Hong Li, Nanxi Li, Yuanjie Chen, Jianbin Zhu, Qinlu Guo, Cewu Lu, Yong-Lu Li
Tags: MLLM
02 Oct 2024

SkySenseGPT: A Fine-Grained Instruction Tuning Dataset and Model for Remote Sensing Vision-Language Understanding
Junwei Luo, Zhen Pang, Yongjun Zhang, Tingzhu Wang, Linlin Wang, ..., Jiangwei Lao, Jian Wang, Jingdong Chen, Yihua Tan, Yansheng Li
14 Jun 2024

List Items One by One: A New Data Source and Learning Paradigm for Multimodal LLMs
An Yan, Zhengyuan Yang, Junda Wu, Wanrong Zhu, Jianwei Yang, ..., Kevin Qinghong Lin, Jianfeng Wang, Julian McAuley, Jianfeng Gao, Lijuan Wang
Tags: LRM
25 Apr 2024

TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
Ya-Qi Yu, Minghui Liao, Jihao Wu, Yongxin Liao, Xiaoyu Zheng, Wei Zeng
Tags: VLM
14 Apr 2024

Draw-and-Understand: Leveraging Visual Prompts to Enable MLLMs to Comprehend What You Want
Weifeng Lin, Xinyu Wei, Ruichuan An, Shiyang Feng, Bocheng Zou, Yulin Luo, Siyuan Huang, Shanghang Zhang, Jiaming Song
Tags: VLM
29 Mar 2024

Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, Shijian Lu
27 Dec 2023

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick, Guangxing Han, Rui Hou, Sayan Nag, Ser-Nam Lim, Nicolas Ballas, Qifan Wang, Rama Chellappa, Amjad Almahairi
Tags: VLM, MLLM
19 Dec 2023

Genixer: Empowering Multimodal Large Language Models as a Powerful Data Generator
Henry Hengyuan Zhao, Pan Zhou, Mike Zheng Shou
Tags: MLLM, SyDa
11 Dec 2023

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Computer Vision and Pattern Recognition (CVPR), 2023
Mu Cai, Haotian Liu, Dennis Park, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Yong Jae Lee
Tags: VLM, LRM, MLLM
01 Dec 2023

NExT-Chat: An LMM for Chat, Detection and Segmentation
Ao Zhang, Yuan Yao, Wei Ji, Zhiyuan Liu, Tat-Seng Chua
Tags: MLLM, VLM
08 Nov 2023

From CLIP to DINO: Visual Encoders Shout in Multi-modal Large Language Models
Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jiné Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, Hongkai Xiong
Tags: MLLM, VLM
13 Oct 2023

ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning
International Joint Conference on Artificial Intelligence (IJCAI), 2023
Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Hao-Ran Wei, ..., Jian‐Yuan Sun, Yuang Peng, Runpei Dong, Chunrui Han, Xiangyu Zhang
Tags: MLLM, LRM
18 Jul 2023

Localized Questions in Medical Visual Question Answering
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2023
Sergio Tascon-Morales, Pablo Márquez-Neila, Raphael Sznitman
03 Jul 2023

AssistSR: Task-oriented Video Segment Retrieval for Personal AI Assistant
Stan Weixian Lei, Difei Gao, Yuxuan Wang, Dongxing Mao, Zihan Liang, L. Ran, Mike Zheng Shou
30 Nov 2021

A Review on Explainability in Multimodal Deep Neural Nets
IEEE Access, 2021
Gargi Joshi, Rahee Walambe, K. Kotecha
17 May 2021