LXMERT: Learning Cross-Modality Encoder Representations from Transformers
Hao Hao Tan, Mohit Bansal
arXiv:1908.07490 · 20 August 2019 · Topics: VLM, MLLM

Papers citing "LXMERT: Learning Cross-Modality Encoder Representations from Transformers"

Showing 50 of 1,506 citing papers.
Multimodal Representation Learning by Alternating Unimodal Adaptation
Xiaohui Zhang, Jaehong Yoon, Mohit Bansal, Huaxiu Yao
17 Nov 2023

DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, Ajay Divakaran
16 Nov 2023

Attribute Diversity Determines the Systematicity Gap in VQA
Ian Berlot-Attwell, Kumar Krishna Agrawal, A. M. Carrell, Yash Sharma, Naomi Saphra
15 Nov 2023

Interaction is all You Need? A Study of Robots Ability to Understand and Execute
Kushal Koshti, Nidhir Bhavsar
13 Nov 2023

Improving Vision-and-Language Reasoning via Spatial Relations Modeling
Cheng Yang, Rui Xu, Ye Guo, Peixiang Huang, Yiru Chen, Wenkui Ding, Zhongyuan Wang, Hong Zhou
09 Nov 2023 · Topics: LRM
Zero-shot Translation of Attention Patterns in VQA Models to Natural Language
Leonard Salewski, A. Sophia Koepke, Hendrik P. A. Lensch, Zeynep Akata
08 Nov 2023

LRM: Large Reconstruction Model for Single Image to 3D
Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, Hao Tan
08 Nov 2023 · Topics: 3DV, 3DH

Multitask Multimodal Prompted Training for Interactive Embodied Task Completion
Georgios Pantazopoulos, Malvina Nikandrou, Amit Parekh, Bhathiya Hemanthage, Arash Eshghi, Ioannis Konstas, Verena Rieser, Oliver Lemon, Alessandro Suglia
07 Nov 2023 · Topics: LM&Ro

Scene-Driven Multimodal Knowledge Graph Construction for Embodied AI
Yaoxian Song, Penglei Sun, Haoyu Liu, Li Zhixu, Wei Song, Yanghua Xiao, Xiaofang Zhou
07 Nov 2023 · Topics: LM&Ro
CLIP-Motion: Learning Reward Functions for Robotic Actions Using Consecutive Observations
Xuzhe Dang, Stefan Edelkamp
06 Nov 2023

MetaReVision: Meta-Learning with Retrieval for Visually Grounded Compositional Concept Acquisition
Guangyue Xu, Parisa Kordjamshidi, Joyce Chai
02 Nov 2023

Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection
Sungjune Park, Hyunjun Kim, Y. Ro
02 Nov 2023

From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam, Md Sakib Hossain Shovon, M. F. Mridha, Nilanjan Dey
01 Nov 2023

Neuroformer: Multimodal and Multitask Generative Pretraining for Brain Data
Antonis Antoniades, Yiyi Yu, Joseph Canzano, William Wang, Spencer L. Smith
31 Oct 2023 · Topics: AI4CE
Harvest Video Foundation Models via Efficient Post-Pretraining
Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, Limin Wang, Yu Qiao, Ping Luo
30 Oct 2023 · Topics: CLIP, VLM, VGen

Generating Context-Aware Natural Answers for Questions in 3D Scenes
Mohammed Munzer Dwedari, Matthias Niessner, Dave Zhenyu Chen
30 Oct 2023

This Looks Like Those: Illuminating Prototypical Concepts Using Multiple Visualizations
Chiyu Ma, Brandon Zhao, Chaofan Chen, Cynthia Rudin
28 Oct 2023

3D-Aware Visual Question Answering about Parts, Poses and Occlusions
Xingrui Wang, Wufei Ma, Zhuowan Li, Adam Kortylewski, Alan L. Yuille
27 Oct 2023 · Topics: CoGe
ArchBERT: Bi-Modal Understanding of Neural Architectures and Natural Languages
Mohammad Akbari, Saeed Ranjbar Alvar, Behnam Kamranian, Amin Banitalebi-Dehkordi, Yong Zhang
26 Oct 2023 · Topics: AI4CE

Evaluating Bias and Fairness in Gender-Neutral Pretrained Vision-and-Language Models
Laura Cabello, Emanuele Bugliarello, Stephanie Brandl, Desmond Elliott
26 Oct 2023

Apollo: Zero-shot MultiModal Reasoning with Multiple Experts
Daniela Ben-David, Tzuf Paz-Argaman, Reut Tsarfaty
25 Oct 2023 · Topics: MoE

CAD -- Contextual Multi-modal Alignment for Dynamic AVQA
Asmar Nadeem, Adrian Hilton, R. Dawes, Graham A. Thomas, A. Mustafa
25 Oct 2023

VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal Graphs
Adnen Abdessaied, Lei Shi, Andreas Bulling
25 Oct 2023 · Topics: 3DH
Emergent Communication in Interactive Sketch Question Answering
Zixing Lei, Yiming Zhang, Yuxin Xiong, Siheng Chen
24 Oct 2023

Multimodal Representations for Teacher-Guided Compositional Visual Reasoning
Wafa Aissa, Marin Ferecatu, M. Crucianu
24 Oct 2023 · Topics: LRM

Visually Grounded Continual Language Learning with Selective Specialization
Kyra Ahrens, Lennart Bengtson, Jae Hee Lee, Stefan Wermter
24 Oct 2023

LXMERT Model Compression for Visual Question Answering
Maryam Hashemi, Ghazaleh Mahmoudi, Sara Kodeiri, Hadi Sheikhi, Sauleh Eetemadi
23 Oct 2023 · Topics: VLM

Large Language Models are Visual Reasoning Coordinators
Liangyu Chen, Bo Li, Sheng Shen, Jingkang Yang, Chunyuan Li, Kurt Keutzer, Trevor Darrell, Ziwei Liu
23 Oct 2023 · Topics: VLM, LRM
The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models
Xinyi Chen, Raquel Fernández, Sandro Pezzelle
23 Oct 2023 · Topics: VLM

ITEm: Unsupervised Image-Text Embedding Learning for eCommerce
Baohao Liao, Michael Kozielski, Sanjika Hewavitharana, Jiangbo Yuan, Shahram Khadivi, Tomer Lancewicki
22 Oct 2023 · Topics: SSL

Semantic and Expressive Variation in Image Captions Across Languages
Andre Ye, Sebastin Santy, Jena D. Hwang, Amy X. Zhang, Ranjay Krishna
22 Oct 2023 · Topics: VLM

Large Language Models and Multimodal Retrieval for Visual Word Sense Disambiguation
Anastasia Kritharoula, Maria Lymperaiou, Giorgos Stamou
21 Oct 2023

Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang, Ye-Ting Chen, Fang Wang, Yaoru Sun, Jun Yang, Lizhi Bai
20 Oct 2023 · Topics: SSL
RSAdapter: Adapting Multimodal Models for Remote Sensing Visual Question Answering
Yuduo Wang, Pedram Ghamisi
19 Oct 2023

UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
Yanyang Guo, Fangkai Jiao, Zhiqi Shen, Liqiang Nie, Mohan S. Kankanhalli
17 Oct 2023 · Topics: MLLM

PELA: Learning Parameter-Efficient Models with Low-Rank Approximation
Yangyang Guo, Guangzhi Wang, Mohan S. Kankanhalli
16 Oct 2023

AutoDIR: Automatic All-in-One Image Restoration with Latent Diffusion
Yitong Jiang, Zhaoyang Zhang, Tianfan Xue, Jinwei Gu
16 Oct 2023 · Topics: DiffM

VLIS: Unimodal Language Models Guide Multimodal Language Generation
Jiwan Chung, Youngjae Yu
15 Oct 2023 · Topics: VLM
Overview of ImageArg-2023: The First Shared Task in Multimodal Argument Mining
Zhexiong Liu, Mohamed Elarby, Yang Zhong, Diane Litman
15 Oct 2023

Penetrative AI: Making LLMs Comprehend the Physical World
Huatao Xu, Liying Han, Qirui Yang, Mo Li, Mani Srivastava
14 Oct 2023

JM3D & JM3D-LLM: Elevating 3D Understanding with Joint Multi-modal Cues
Jiayi Ji, Haowei Wang, Changli Wu, Yiwei Ma, Xiaoshuai Sun, Rongrong Ji
14 Oct 2023

Question Answering for Electronic Health Records: A Scoping Review of datasets and models
Jayetri Bardhan, Kirk Roberts, Daisy Zhe Wang
12 Oct 2023

DeltaSpace: A Semantic-aligned Feature Space for Flexible Text-guided Image Editing
Yueming Lyu, Kang Zhao, Bo Peng, Yue Jiang, Yingya Zhang, Jing Dong
12 Oct 2023
Open-Set Knowledge-Based Visual Question Answering with Inference Paths
Jingru Gan, Xinzhe Han, Shuhui Wang, Qingming Huang
12 Oct 2023

Jaeger: A Concatenation-Based Multi-Transformer VQA Model
Jieting Long, Zewei Shi, Penghao Jiang, Yidong Gan
11 Oct 2023

MemSum-DQA: Adapting An Efficient Long Document Extractive Summarizer for Document Question Answering
Nianlong Gu, Yingqiang Gao, Richard H. R. Hahnloser
10 Oct 2023 · Topics: RALM

Video-Teller: Enhancing Cross-Modal Generation with Fusion and Decoupling
Haogeng Liu, Qihang Fan, Tingkai Liu, Linjie Yang, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang
08 Oct 2023 · Topics: VGen

Understanding the Robustness of Multi-modal Contrastive Learning to Distribution Shift
Yihao Xue, Siddharth Joshi, Dang Nguyen, Baharan Mirzasoleiman
08 Oct 2023 · Topics: VLM
HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
Nina Shvetsova, Anna Kukleva, Xudong Hong, Christian Rupprecht, Bernt Schiele, Hilde Kuehne
07 Oct 2023

VLATTACK: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models
Ziyi Yin, Muchao Ye, Tianrong Zhang, Tianyu Du, Jinguo Zhu, Han Liu, Jinghui Chen, Ting Wang, Fenglong Ma
07 Oct 2023 · Topics: AAML, VLM, CoGe