Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1908.07490
Cited By
LXMERT: Learning Cross-Modality Encoder Representations from Transformers
20 August 2019
Hao Hao Tan
Mohit Bansal
VLM
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"LXMERT: Learning Cross-Modality Encoder Representations from Transformers"
50 / 240 papers shown
Title
Knowledge-Informed Deep Learning for Irrigation Type Mapping from Remote Sensing
Oishee Bintey Hoque
Nibir Chandra Mandal
Abhijin Adiga
S. Swarup
S. Nouwakpo
Amanda Wilson
M. Marathe
16
0
0
13 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP
CoGe
VLM
34
0
0
04 May 2025
DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation
Yinfeng Yu
Dongsheng Yang
22
0
0
30 Apr 2025
Multimodal graph representation learning for website generation based on visual sketch
Tung D. Vu
Chung Hoang
Truong-Son Hy
3DV
48
0
0
25 Apr 2025
Think Hierarchically, Act Dynamically: Hierarchical Multi-modal Fusion and Reasoning for Vision-and-Language Navigation
Junrong Yue
Y. Zhang
Chuan Qin
Bo Li
Xiaomin Lie
Xinlei Yu
Wenxin Zhang
Zhendong Zhao
43
0
0
23 Apr 2025
Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning?
Roman Kochnev
Arash Torabi Goodarzi
Zofia Antonina Bentyn
D. Ignatov
Radu Timofte
46
2
0
08 Apr 2025
ChatBEV: A Visual Language Model that Understands BEV Maps
Qingyao Xu
S. Chen
Guang Chen
Yanfeng Wang
Y. Zhang
39
0
0
18 Mar 2025
Enhancing Collective Intelligence in Large Language Models Through Emotional Integration
Likith Kadiyala
Ramteja Sajja
Y. Sermet
Ibrahim Demir
56
0
0
05 Mar 2025
FilterRAG: Zero-Shot Informed Retrieval-Augmented Generation to Mitigate Hallucinations in VQA
S M Sarwar
66
1
0
25 Feb 2025
Towards Top-Down Reasoning: An Explainable Multi-Agent Approach for Visual Question Answering
Zeqing Wang
Wentao Wan
Qiqing Lao
Runmeng Chen
Minjie Lang
Keze Wang
Liang Lin
Liang Lin
LRM
96
3
0
17 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
62
2
0
11 Feb 2025
A Multimodal PDE Foundation Model for Prediction and Scientific Text Descriptions
Elisa Negrini
Yuxuan Liu
Liu Yang
Stanley Osher
Hayden Schaeffer
AI4CE
83
0
0
09 Feb 2025
A Survey of Large Language Models for Healthcare: from Data, Technology, and Applications to Accountability and Ethics
Kai He
Rui Mao
Qika Lin
Yucheng Ruan
Xiang Lan
Mengling Feng
Erik Cambria
LM&MA
AILaw
82
148
0
28 Jan 2025
sDREAMER: Self-distilled Mixture-of-Modality-Experts Transformer for Automatic Sleep Staging
Jingyuan Chen
Yuan Yao
Mie Anderson
Natalie Hauglund
Celia Kjaerby
Verena Untiet
Maiken Nedergaard
Jiebo Luo
41
1
0
28 Jan 2025
Overcoming Language Priors for Visual Question Answering Based on Knowledge Distillation
Daowan Peng
Wei Wei
51
0
0
10 Jan 2025
FOLDER: Accelerating Multi-modal Large Language Models with Enhanced Performance
Haicheng Wang
Zhemeng Yu
Gabriele Spadaro
Chen Ju
Victor Quétu
Enzo Tartaglione
Enzo Tartaglione
VLM
76
3
0
05 Jan 2025
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM
LRM
66
2
0
20 Nov 2024
Efficient Bilinear Attention-based Fusion for Medical Visual Question Answering
Zhilin Zhang
Jie Wang
Zhanghao Qin
Ruiqi Zhu
Xiaoliang Gong
MedIm
33
0
0
28 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
28
0
0
01 Oct 2024
PROSE-FD: A Multimodal PDE Foundation Model for Learning Multiple Operators for Forecasting Fluid Dynamics
Yuxuan Liu
Jingmin Sun
Xinjie He
Griffin Pinney
Zecheng Zhang
Hayden Schaeffer
AI4CE
30
5
0
15 Sep 2024
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGe
VLM
22
0
0
12 Sep 2024
Data Augmentation via Latent Diffusion for Saliency Prediction
Bahar Aydemir
Deblina Bhattacharjee
Tong Zhang
Mathieu Salzmann
Sabine Süsstrunk
18
1
0
11 Sep 2024
What Do VLMs NOTICE? A Mechanistic Interpretability Pipeline for Gaussian-Noise-free Text-Image Corruption and Evaluation
Michal Golovanevsky
William Rudman
Vedant Palit
Ritambhara Singh
Carsten Eickhoff
24
1
0
24 Jun 2024
EAVE: Efficient Product Attribute Value Extraction via Lightweight Sparse-layer Interaction
Li Yang
Qifan Wang
Jianfeng Chi
Jiahao Liu
Jingang Wang
Fuli Feng
Zenglin Xu
Yi Fang
Lifu Huang
Dongfang Liu
20
1
0
10 Jun 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
Ke Xu
AAML
VLM
30
13
0
08 Jun 2024
Towards Semantic Equivalence of Tokenization in Multimodal LLM
Shengqiong Wu
Hao Fei
Xiangtai Li
Jiayi Ji
Hanwang Zhang
Tat-Seng Chua
Shuicheng Yan
MLLM
59
25
0
07 Jun 2024
Can't make an Omelette without Breaking some Eggs: Plausible Action Anticipation using Large Video-Language Models
Himangi Mittal
Nakul Agarwal
Shao-Yuan Lo
Kwonjoon Lee
30
13
0
30 May 2024
Enhancing Large Vision Language Models with Self-Training on Image Comprehension
Yihe Deng
Pan Lu
Fan Yin
Ziniu Hu
Sheng Shen
James Y. Zou
Kai-Wei Chang
Wei Wang
SyDa
VLM
LRM
31
36
0
30 May 2024
FinEmbedDiff: A Cost-Effective Approach of Classifying Financial Documents with Vector Sampling using Multi-modal Embedding Models
Anjanava Biswas
Wrick Talukdar
11
1
0
28 May 2024
A Survey on Vision-Language-Action Models for Embodied AI
Yueen Ma
Zixing Song
Yuzheng Zhuang
Jianye Hao
Irwin King
LM&Ro
64
38
0
23 May 2024
Text-Video Retrieval with Global-Local Semantic Consistent Learning
Haonan Zhang
Pengpeng Zeng
Lianli Gao
Jingkuan Song
Yihang Duan
Xinyu Lyu
Hengtao Shen
VLM
CLIP
23
2
0
21 May 2024
Visual Language Model based Cross-modal Semantic Communication Systems
Feibo Jiang
Chuanguo Tang
Li Dong
Kezhi Wang
Kun Yang
Cunhua Pan
VLM
28
2
0
06 May 2024
Self-Bootstrapped Visual-Language Model for Knowledge Selection and Question Answering
Dongze Hao
Qunbo Wang
Longteng Guo
Jie Jiang
Jing Liu
27
0
0
22 Apr 2024
PDF-MVQA: A Dataset for Multimodal Information Retrieval in PDF-based Visual Question Answering
Yihao Ding
Kaixuan Ren
Jiabin Huang
Siwen Luo
S. Han
29
1
0
19 Apr 2024
Look, Listen, and Answer: Overcoming Biases for Audio-Visual Question Answering
Jie Ma
Min Hu
Pinghui Wang
Wangchun Sun
Lingyun Song
Hongbin Pei
Jun Liu
Youtian Du
30
4
0
18 Apr 2024
ViTextVQA: A Large-Scale Visual Question Answering Dataset for Evaluating Vietnamese Text Comprehension in Images
Quan Van Nguyen
Dan Quang Tran
Huy Quang Pham
Thang Kien-Bao Nguyen
Nghia Hieu Nguyen
Kiet Van Nguyen
N. Nguyen
CoGe
35
3
0
16 Apr 2024
Multimodal Cross-Document Event Coreference Resolution Using Linear Semantic Transfer and Mixed-Modality Ensembles
Abhijnan Nath
Huma Jamil
Shafiuddin Rehan Ahmed
George Baker
Rahul Ghosh
James H. Martin
Nathaniel Blanchard
Nikhil Krishnaswamy
21
2
0
13 Apr 2024
Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts
Övgü Özdemir
Erdem Akagündüz
31
9
0
12 Apr 2024
Contextual Chart Generation for Cyber Deception
David D. Nguyen
David Liebowitz
Surya Nepal
S. Kanhere
Sharif Abuadbba
35
0
0
07 Apr 2024
FSMR: A Feature Swapping Multi-modal Reasoning Approach with Joint Textual and Visual Clues
Shuang Li
Jiahua Wang
Lijie Wen
LRM
16
0
0
29 Mar 2024
Mind the Error! Detection and Localization of Instruction Errors in Vision-and-Language Navigation
Francesco Taioli
Stefano Rosa
A. Castellini
Lorenzo Natale
Alessio Del Bue
Alessandro Farinelli
Marco Cristani
Yiming Wang
33
5
0
15 Mar 2024
GET: Unlocking the Multi-modal Potential of CLIP for Generalized Category Discovery
Enguang Wang
Zhimao Peng
Zhengyuan Xie
Fei Yang
Xialei Liu
Ming-Ming Cheng
48
3
0
15 Mar 2024
Towards Deviation-Robust Agent Navigation via Perturbation-Aware Contrastive Learning
Bingqian Lin
Yanxin Long
Yi Zhu
Fengda Zhu
Xiaodan Liang
QiXiang Ye
Liang Lin
21
5
0
09 Mar 2024
A Comprehensive Survey of Convolutions in Deep Learning: Applications, Challenges, and Future Trends
Abolfazl Younesi
Mohsen Ansari
Mohammadamin Fazli
A. Ejlali
Muhammad Shafique
Joerg Henkel
3DV
33
43
0
23 Feb 2024
Multimodal Transformer With a Low-Computational-Cost Guarantee
Sungjin Park
Edward Choi
28
1
0
23 Feb 2024
WinoViz: Probing Visual Properties of Objects Under Different States
Woojeong Jin
Tejas Srinivasan
Jesse Thomason
Xiang Ren
23
1
0
21 Feb 2024
Convincing Rationales for Visual Question Answering Reasoning
Kun Li
G. Vosselman
Michael Ying Yang
34
1
0
06 Feb 2024
Cross-Modal Coordination Across a Diverse Set of Input Modalities
Jorge Sánchez
Rodrigo Laguna
VLM
13
0
0
29 Jan 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
44
4
0
28 Dec 2023
Guided Image Restoration via Simultaneous Feature and Image Guided Fusion
Xinyi Liu
Qian Zhao
Jie-Kai Liang
Huiyu Zeng
Deyu Meng
Lei Zhang
30
0
0
14 Dec 2023
1
2
3
4
5
Next