Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining

23 May 2023

Emanuele Bugliarello

Aida Nematzadeh

Lisa Anne Hendricks

Papers citing "Weakly-Supervised Learning of Visual Relations in Multimodal Pretraining"

15 / 15 papers shown

Title
Beyond the Doors of Perception: Vision Transformers Represent Relations Between Objects Michael A. Lepori Alexa R. Tartaglini Wai Keen Vong Thomas Serre Brenden Lake Ellie Pavlick 34 2 0 22 Jun 2024
PartIR: Composing SPMD Partitioning Strategies for Machine Learning Sami Alabed Daniel Belov Bart Chrzaszcz Juliana Franco Dominik Grewe ... Michael Schaarschmidt Timur Sitdikov Agnieszka Swietlik Dimitrios Vytiniotis Joel Wee 21 3 0 20 Jan 2024
Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models Navid Rajabi Jana Kosecka VLM 14 2 0 18 Aug 2023
Kosmos-2: Grounding Multimodal Large Language Models to the World Zhiliang Peng Wenhui Wang Li Dong Y. Hao Shaohan Huang Shuming Ma Furu Wei MLLM ObjD VLM 14 688 0 26 Jun 2023
An Examination of the Robustness of Reference-Free Image Captioning Evaluation Metrics Saba Ahmadi Aishwarya Agrawal 17 6 0 24 May 2023
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention Shijie Geng Jianbo Yuan Yu Tian Yuxiao Chen Yongfeng Zhang CLIP VLM 41 44 0 06 Mar 2023
Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies? Mitja Nikolaus Emmanuelle Salin Stéphane Ayache Abdellah Fourtassi Benoit Favre 11 13 0 21 Oct 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation Junnan Li Dongxu Li Caiming Xiong S. Hoi MLLM BDL VLM CLIP 385 4,010 0 28 Jan 2022
Masked Autoencoders Are Scalable Vision Learners Kaiming He Xinlei Chen Saining Xie Yanghao Li Piotr Dollár Ross B. Girshick ViT TPM 258 7,337 0 11 Nov 2021
Visually Grounded Reasoning across Languages and Cultures Fangyu Liu Emanuele Bugliarello E. Ponti Siva Reddy Nigel Collier Desmond Elliott VLM LRM 92 167 0 28 Sep 2021
Graphhopper: Multi-Hop Scene Graph Reasoning for Visual Question Answering Rajat Koner Hang Li Marcel Hildebrandt Deepan Das Volker Tresp Stephan Günnemann 36 31 0 13 Jul 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts Soravit Changpinyo P. Sharma Nan Ding Radu Soricut VLM 273 1,077 0 17 Feb 2021
Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers Lisa Anne Hendricks John F. J. Mellor R. Schneider Jean-Baptiste Alayrac Aida Nematzadeh 75 110 0 31 Jan 2021
Explainable and Explicit Visual Reasoning over Scene Graphs Jiaxin Shi Hanwang Zhang Juan-Zi Li OCL 146 230 0 05 Dec 2018
Image Generation from Scene Graphs Justin Johnson Agrim Gupta Li Fei-Fei GNN 211 809 0 04 Apr 2018