Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019

16 August 2019

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown

The All-Seeing Project V2: Towards General Relation Comprehension of the Open World

...

Yu Qiao

318

29 Feb 2024

Automatic Creative Selection with Cross-Modal Matching

167

28 Feb 2024

Acquiring Linguistic Knowledge from Multimodal Input

Theodor Amariucai

Alexander Scott Warstadt

CLL

284

27 Feb 2024

Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning

Maurits J. R. Bleeker

Mariya Hendriksen

Andrew Yates

Maarten de Rijke

VLM

322

27 Feb 2024

CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora

331

23 Feb 2024

Exploring Missing Modality in Multimodal Egocentric Datasets

302

21 Jan 2024

POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images

Neural Information Processing Systems (NeurIPS), 2024

258

17 Jan 2024

CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification

European Conference on Information Retrieval (ECIR), 2024

257

11 Jan 2024

Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding

...

Somayeh Sojoudi

191

09 Jan 2024

FM-AE: Frequency-masked Multimodal Autoencoder for Zinc Electrolysis Plate Contact Abnormality Detection

Can Zhou

Hongqiu Zhu

Tianhao Liu

08 Jan 2024

Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances

Cristian Rodriguez-Opazo

Edison Marrese-Taylor

185

22 Dec 2023

Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model

Ser-Nam Lim

386

19 Dec 2023

A Foundational Multimodal Vision Language AI Assistant for Human Pathology

Ming Y. Lu

Bowen Chen

Drew F. K. Williamson

...

210

13 Dec 2023

Open-Vocabulary Segmentation with Semantic-Assisted Calibration

Yong Liu

224

07 Dec 2023

Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning

IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2023

Cong Yang

Zuchao Li

Lefei Zhang

163

02 Dec 2023

ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts

Computer Vision and Pattern Recognition (CVPR), 2023

Mu Cai

Haotian Liu

Dennis Park

Siva Karthik Mustikovela

325

151

01 Dec 2023

A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023

288

30 Nov 2023

LEAP: LLM-Generation of Egocentric Action Programs

280

29 Nov 2023

Contrastive Vision-Language Alignment Makes Efficient Instruction Learner

175

29 Nov 2023

E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer

392

28 Nov 2023

LANS: A Layout-Aware Neural Solver for Plane Geometry Problem

Annual Meeting of the Association for Computational Linguistics (ACL), 2023

247

25 Nov 2023

ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Heng Ji

251

22 Nov 2023

BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning

498

20 Nov 2023

Open-Vocabulary Camouflaged Object Segmentation

Huchuan Lu

330

19 Nov 2023

Active Prompt Learning in Vision Language Models

259

18 Nov 2023

DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback

Heng Ji

439

16 Nov 2023

Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings

Bishwaranjan Bhattacharjee

300

13 Nov 2023

Improving Vision-and-Language Reasoning via Spatial Relations Modeling

168

09 Nov 2023

Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023

Junkyu Jang

Eugene Hwang

Sung-Hyuk Park

154

03 Nov 2023

From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities

Information Fusion (Inf. Fusion), 2023

Md Farhan Ishmam

Md Sakib Hossain Shovon

M. F. Mridha

Nilanjan Dey

399

01 Nov 2023

M2C: Towards Automatic Multimodal Manga Complement

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

Hongcheng Guo

Boyang Wang

Jiaqi Bai

Jiaheng Liu

Jian Yang

Zhoujun Li

214

26 Oct 2023

The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023

208

23 Oct 2023

Jaeger: A Concatenation-Based Multi-Transformer VQA Model

174

11 Oct 2023

I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction

ACM Multimedia Asia (MA), 2023

Yusheng Huang

Zhouhan Lin

161

10 Oct 2023

GRID: A Platform for General Robot Intelligence Development

271

02 Oct 2023

AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ

International Conference on Learning Representations (ICLR), 2023

Jonas Belouadi

Anne Lauscher

Steffen Eger

270

30 Sep 2023

Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search

Qi Wu

230

28 Sep 2023

Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval

AAAI Conference on Artificial Intelligence (AAAI), 2023

Qi Wu

203

28 Sep 2023

Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer

ACM Multimedia (ACM MM), 2023

180

26 Sep 2023

VidChapters-7M: Video Chapters at Scale

Neural Information Processing Systems (NeurIPS), 2023

248

25 Sep 2023

A Survey on Image-text Multimodal Models

Ruifeng Guo

Jingxuan Wei

Linzhuang Sun

Khai-Nguyen Nguyen

Guiyong Chang

Dawei Liu

Sibo Zhang

Zhengbing Yao

Mingjun Xu

Liping Bu

VLM

320

23 Sep 2023

In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval

IEEE International Conference on Computer Vision (ICCV), 2023

Bernt Schiele

229

16 Sep 2023

Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks

Findings (Findings), 2023

Danae Sánchez Villegas

Daniel Preoţiuc-Pietro

Nikolaos Aletras

221

14 Sep 2023

Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation

215

12 Sep 2023

Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models

North American Chapter of the Association for Computational Linguistics (NAACL), 2023

Heng Ji

320

08 Sep 2023

Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification

IEEE International Conference on Computer Vision (ICCV), 2023

Jingdong Wang

239

04 Sep 2023

A Fine-Grained Image Description Generation Method Based on Joint Objectives

Chinese Conference on Computer Supported Cooperative Work and Social Computing (SCWSC), 2023

123

02 Sep 2023

Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications

Wenyi Wu

Karim Bouyarmane

Ismail B. Tutar

30 Aug 2023

Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection

IEEE Transactions on Image Processing (IEEE TIP), 2023

215

30 Aug 2023

Multi-event Video-Text Retrieval

IEEE International Conference on Computer Vision (ICCV), 2023

Gengyuan Zhang

Jisen Ren

Jindong Gu

Volker Tresp

193

22 Aug 2023