
Fusion of Detected Objects in Text for Visual Question Answering

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019

14 August 2019


Papers citing "Fusion of Detected Objects in Text for Visual Question Answering"

50 / 109 papers shown

GroundSight: Augmenting Vision-Language Models with Grounding Information and De-hallucination

30 Sep 2025

Memory-Augmented Multimodal LLMs for Surgical VQA via Self-Contained Inquiry

292

17 Nov 2024

Vision Model Pre-training on Interleaved Image-Text Data via Latent Compression Learning

Chenyu Yang

Xizhou Zhu

Jinguo Zhu

Weijie Su

Junjie Wang

...

Lewei Lu

Bin Li

Jie Zhou

Yu Qiao

Jifeng Dai

VLM CLIP

252

11 Jun 2024

EventLens: Leveraging Event-Aware Pretraining and Cross-modal Linking Enhances Visual Commonsense Reasoning

224

22 Apr 2024

FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues

Shuang Li

Jiahua Wang

Lijie Wen

LRM

175

29 Mar 2024

Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning

Maurits J. R. Bleeker

Mariya Hendriksen

Andrew Yates

Maarten de Rijke

VLM

363

27 Feb 2024

VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023

235

25 Oct 2023

UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023

487

17 Oct 2023

ELIP: Efficient Discriminative Language-Image Pre-training with Fewer Vision Tokens

Haoyu Zhang

358

28 Sep 2023

Separate and Locate: Rethink the Text in Text-based Visual Question Answering

ACM Multimedia (ACM MM), 2023

330

31 Aug 2023

MoviePuzzle: Visual Narrative Reasoning through Multimodal Order Learning

403

04 Jun 2023

Using Visual Cropping to Enhance Fine-Detail Question Answering of BLIP-Family Models

174

31 May 2023

Deeply Coupled Cross-Modal Prompt Learning

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

Wei Tang

291

29 May 2023

ArK: Augmented Reality with Knowledge Interactive Emergent Ability

...

Yejin Choi

233

01 May 2023

Enhancing object detection robustness: A synthetic and natural perturbation approach

170

20 Apr 2023

Probabilistic Prompt Learning for Dense Prediction

Computer Vision and Pattern Recognition (CVPR), 2023

346

03 Apr 2023

Borrowing Human Senses: Comment-Aware Self-Training for Social Media Multimodal Classification

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Chunpu Xu

Jing Li

VLM

175

27 Mar 2023

Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey

Machine Intelligence Research (MIR), 2023

Yaowei Wang

Yonghong Tian

Wen Gao

AI4CE VLM

604

290

20 Feb 2023

Multi-modal Machine Learning in Engineering Design: A Review and Future Directions

Journal of Computing and Information Science in Engineering (JCISE), 2023

416

14 Feb 2023

A survey on knowledge-enhanced multimodal learning

Artificial Intelligence Review (Artif Intell Rev), 2022

Maria Lymperaiou

Giorgos Stamou

543

19 Nov 2022

Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information

Computer Vision and Pattern Recognition (CVPR), 2022

Weijie Su

Gao Huang

Yu Qiao

Xiaogang Wang

Jie Zhou

Jifeng Dai

265

17 Nov 2022

DiMBERT: Learning Vision-Language Grounded Representations with Disentangled Multimodal-Attention

ACM Transactions on Knowledge Discovery from Data (TKDD), 2021

Xuancheng Ren

Yuexian Zou

243

28 Oct 2022

Masked Vision-Language Transformer in Fashion

Machine Intelligence Research (MIR), 2022

Luc Van Gool

281

27 Oct 2022

Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2022

205

24 Oct 2022

Contrastive Language-Image Pre-Training with Knowledge Graphs

Neural Information Processing Systems (NeurIPS), 2022

Gao Huang

205

17 Oct 2022

Learning to Evaluate Performance of Multi-modal Semantic Localization

IEEE Transactions on Geoscience and Remote Sensing (IEEE TGRS), 2022

Xian Sun

276

14 Sep 2022

Computational Sarcasm Analysis on Social Media: A Systematic Review

231

13 Sep 2022

PreSTU: Pre-Training for Scene-Text Understanding

IEEE International Conference on Computer Vision (ICCV), 2022

Wei-Lun Chao

440

12 Sep 2022

A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch

European Conference on Computer Vision (ECCV), 2022

Diyi Yang

217

05 Aug 2022

Vision-and-Language Pretraining

314

05 Jul 2022

Multimodal Learning with Transformers: A Survey

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

648

934

13 Jun 2022

PEVL: Position-enhanced Pre-training and Prompt Tuning for Vision-language Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2022

Yuan Yao

Qi-An Chen

Ao Zhang

Wei Ji

Zhiyuan Liu

Tat-Seng Chua

Maosong Sun

VLM MLLM

276

23 May 2022

Learning to Answer Visual Questions from Web Videos

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2022

413

10 May 2022

Training and challenging models for text-guided fashion image retrieval

Eric Dodds

Jack Culpepper

Gaurav Srivastava

233

23 Apr 2022

Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks

IEEE Transactions on Image Processing (IEEE TIP), 2022

Liujuan Cao

Yongjian Wu

Feiyue Huang

Rongrong Ji

ViT

176

16 Apr 2022

Visual-Language Navigation Pretraining via Prompt-based Environmental Self-exploration

Annual Meeting of the Association for Computational Linguistics (ACL), 2022

Xiwen Liang

Fengda Zhu

Lingling Li

Hang Xu

Xiaodan Liang

LM&Ro VLM

175

08 Mar 2022

Vision-Language Intelligence: Tasks, Representation Learning, and Large Models

Lei Zhang

229

03 Mar 2022

VLP: A Survey on Vision-Language Pre-training

Machine Intelligence Research (MIR), 2022

Minglun Han

461

310

18 Feb 2022

MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound

Computer Vision and Pattern Recognition (CVPR), 2022

Yejin Choi

558

249

07 Jan 2022

LaTr: Layout-Aware Transformer for Scene-Text VQA

Computer Vision and Pattern Recognition (CVPR), 2021

429

118

23 Dec 2021

Decompose the Sounds and Pixels, Recompose the Events

AAAI Conference on Artificial Intelligence (AAAI), 2021

166

21 Dec 2021

Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks

291

157

02 Dec 2021

UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling

Zicheng Liu

413

139

23 Nov 2021

LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation

Mohammad Abuzar Shaikh

188

04 Sep 2021

Audio-Visual Transformer Based Crowd Counting

270

04 Sep 2021

Auto-Parsing Network for Image Captioning and Visual Question Answering

IEEE International Conference on Computer Vision (ICCV), 2021

Xu Yang

Chongyang Gao

Hanwang Zhang

Jianfei Cai

312

24 Aug 2021

From Two to One: A New Scene Text Recognizer with Visual Language Modeling Network

IEEE International Conference on Computer Vision (ICCV), 2021

281

182

22 Aug 2021

Airbert: In-domain Pretraining for Vision-and-Language Navigation

309

180

20 Aug 2021

Knowledge Perceived Multi-modal Pretraining in E-commerce

Ningyu Zhang

Huajun Chen

289

20 Aug 2021

Exceeding the Limits of Visual-Linguistic Multi-Task Learning

Cameron R. Wolfe

Keld T. Lundgaard

VLM

206

27 Jul 2021