Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering

17 February 2025

Papers citing "Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering"

50 / 72 papers shown

PhyDetEx: Detecting and Explaining the Physical Plausibility of T2V Models

189

01 Dec 2025

VideoVerse: Does Your T2V Generator Have World Model Capability to Synthesize Videos?

428

09 Oct 2025

Visual Question Decomposition on Multimodal Large Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024

Volker Tresp

Jindong Gu

433

28 Sep 2024

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation

Ying Tai

Lanjun Wang

Zili Yi

DiffM MLLM

298

29 Apr 2024

What Is Missing in Multilingual Visual Reasoning and How to Fix It

Yueqi Song

Simran Khanuja

Graham Neubig

VLM LRM

684

03 Mar 2024

More Agents Is All You Need

465

145

03 Feb 2024

Q&A Prompts: Discovering Rich Visual Clues through Mining Question-Answer Prompts for VQA requiring Diverse World Knowledge

Haibi Wang

Weifeng Ge

LRM

505

19 Jan 2024

CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

Yuntao Du

Xiaojian Ma

448

18 Dec 2023

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin

1.8K

1,402

16 Nov 2023

Improving Zero-shot Visual Question Answering via Large Language Models with Reasoning Question Prompts

ACM Multimedia (ACM MM), 2023

498

15 Nov 2023

CogVLM: Visual Expert for Pretrained Language Models

Neural Information Processing Systems (NeurIPS), 2023

Weihan Wang

Qingsong Lv

Wenmeng Yu

Wenyi Hong

Ji Qi

...

Bin Xu

Juanzi Li

Yuxiao Dong

Ming Ding

Jie Tang

VLM MLLM

840

778

06 Nov 2023

Exploring Question Decomposition for Zero-Shot VQA

Neural Information Processing Systems (NeurIPS), 2023

261

25 Oct 2023

Woodpecker: Hallucination Correction for Multimodal Large Language Models

Science China Information Sciences (Sci China Inf Sci), 2023

Enhong Chen

433

231

24 Oct 2023

Large Language Models are Visual Reasoning Coordinators

Neural Information Processing Systems (NeurIPS), 2023

Bo Li

Ziwei Liu

338

102

23 Oct 2023

A Simple Baseline for Knowledge-Based Visual Question Answering

Alexandros Xenos

Themos Stafylakis

Ioannis Patras

Georgios Tzimiropoulos

387

20 Oct 2023

Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

Heng Ji

384

08 Sep 2023

Towards CausalGPT: A Multi-Agent Approach for Faithful Knowledge Reasoning via Promoting Causal Consistency in LLMs

Tianshui Chen

Liang Lin

LRM

365

23 Aug 2023

A Survey on Large Language Model based Autonomous Agents

Lei Wang

...

Yankai Lin

859

2,667

22 Aug 2023

BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions

AAAI Conference on Artificial Intelligence (AAAI), 2023

477

201

19 Aug 2023

Llama 2: Open Foundation and Fine-Tuned Chat Models

Louis Martin

...

Sharan Narang

Sergey Edunov

12.4K

16,448

18 Jul 2023

Unleashing the Emergent Cognitive Synergy in Large Language Models: A Task-Solving Agent through Multi-Persona Self-Collaboration

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

Heng Ji

768

265

11 Jul 2023

AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn

498

114

14 Jun 2023

Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!

Computer Vision and Pattern Recognition (CVPR), 2023

346

06 Jun 2023

LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day

Neural Information Processing Systems (NeurIPS), 2023

Jianwei Yang

457

1,568

01 Jun 2023

Collaborative Multi-Agent Video Fast-Forwarding

IEEE Transactions on Multimedia (IEEE TMM), 2023

Shuyue Lan

Zhilu Wang

Ermin Wei

Amit K. Roy-Chowdhury

Qi Zhu

229

27 May 2023

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

520

24 May 2023

Improving Factuality and Reasoning in Language Models through Multiagent Debate

International Conference on Machine Learning (ICML), 2023

Yilun Du

Shuang Li

Antonio Torralba

J. Tenenbaum

Igor Mordatch

LLMAG LRM

493

1,502

23 May 2023

CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding

IEEE Transactions on Multimedia (IEEE TMM), 2023

Linhui Xiao

Xiaoshan Yang

Fang Peng

Ming Yan

Yaowei Wang

Changsheng Xu

ObjD VLM

565

15 May 2023

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Neural Information Processing Systems (NeurIPS), 2023

1.9K

3,275

11 May 2023

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Lei Wang

624

670

06 May 2023

Retrieval-based Knowledge Augmented Vision Language Pre-training

ACM Multimedia (ACM MM), 2023

351

27 Apr 2023

Visual Instruction Tuning

Neural Information Processing Systems (NeurIPS), 2023

1.4K

8,828

17 Apr 2023

Generative Agents: Interactive Simulacra of Human Behavior

ACM Symposium on User Interface Software and Technology (UIST), 2023

Cristina Mata

Joseph C. O'Brien

Carrie J. Cai

Meredith Ringel Morris

Abigail Z. Jacobs

Michael S. Bernstein

LM&Ro AI4CE

1.1K

3,775

07 Apr 2023

Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement

IEEE International Conference on Computer Vision (ICCV), 2023

295

119

03 Apr 2023

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Zicheng Liu

411

545

20 Mar 2023

ViperGPT: Visual Inference via Python Execution for Reasoning

IEEE International Conference on Computer Vision (ICCV), 2023

Dídac Surís

Sachit Menon

Carl Vondrick

MLLM LRM ReLM

453

703

14 Mar 2023

LLaMA: Open and Efficient Foundation Language Models

...

20.2K

19,316

27 Feb 2023

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

International Conference on Machine Learning (ICML), 2023

Silvio Savarese

1.6K

7,623

30 Jan 2023

From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models

Computer Vision and Pattern Recognition (CVPR), 2022

561

174

21 Dec 2022

Visual Programming: Compositional visual reasoning without training

Computer Vision and Pattern Recognition (CVPR), 2022

Tanmay Gupta

Aniruddha Kembhavi

ReLM VLM LRM

578

635

18 Nov 2022

PromptCap: Prompt-Guided Task-Aware Image Captioning

Weijia Shi

488

134

15 Nov 2022

Scaling Instruction-Finetuned Language Models

Journal of Machine Learning Research (JMLR), 2022

...

1.8K

4,038

20 Oct 2022

LAION-5B: An open large-scale dataset for training next generation image-text models

Neural Information Processing Systems (NeurIPS), 2022

...

1.5K

4,964

16 Oct 2022

Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Neural Information Processing Systems (NeurIPS), 2022

Oyvind Tafjord

736

2,137

20 Sep 2022

A Unified End-to-End Retriever-Reader Framework for Knowledge-based VQA

ACM Multimedia (ACM MM), 2022

226

30 Jun 2022

A-OKVQA: A Benchmark for Visual Question Answering using World Knowledge

European Conference on Computer Vision (ECCV), 2022

594

866

03 Jun 2022

REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering

Neural Information Processing Systems (NeurIPS), 2022

Lu Yuan

379

108

02 Jun 2022

Flamingo: a Visual Language Model for Few-Shot Learning

Neural Information Processing Systems (NeurIPS), 2022

Jean-Baptiste Alayrac

...

869

5,564

29 Apr 2022

Winoground: Probing Vision and Language Models for Visio-Linguistic Compositionality

Computer Vision and Pattern Recognition (CVPR), 2022

Amanpreet Singh

Douwe Kiela

487

564

07 Apr 2022

Self-Consistency Improves Chain of Thought Reasoning in Language Models

International Conference on Learning Representations (ICLR), 2022

3.7K

6,303

21 Mar 2022