1908.06066
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
SSL
VLM
MLLM
Papers citing
"Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"
50 / 518 papers shown
Title
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
310
85
0
29 Feb 2024
Automatic Creative Selection with Cross-Modal Matching
Alex Kim
Jia Huang
Rob Monarch
Jerry Kwac
Anikesh Kamath
P. Khurd
Kailash Thiyagarajan
Goodman Gu
VLM
151
0
0
28 Feb 2024
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
283
3
0
27 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
315
9
0
27 Feb 2024
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
Zijun Long
Xuri Ge
R. McCreadie
Joemon M. Jose
282
12
0
23 Feb 2024
Exploring Missing Modality in Multimodal Egocentric Datasets
Merey Ramazanova
Alejandro Pardo
Humam Alwassel
Guohao Li
EgoV
291
7
0
21 Jan 2024
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
Neural Information Processing Systems (NeurIPS), 2024
Antonín Vobecký
Oriane Siméoni
David Hurych
Spyros Gidaris
Andrei Bursuc
Patrick Pérez
Josef Sivic
247
50
0
17 Jan 2024
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification
European Conference on Information Retrieval (ECIR), 2024
Shubham Gupta
Nandini Saini
Suman Kundu
Debasis Das
242
10
0
11 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
183
2
0
09 Jan 2024
FM-AE: Frequency-masked Multimodal Autoencoder for Zinc Electrolysis Plate Contact Abnormality Detection
Can Zhou
Can Zhou
Hongqiu Zhu
Tianhao Liu
64
9
0
08 Jan 2024
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
Ehsan Abbasnejad
Hamed Damirchi
Ignacio M. Jara
Felipe Bravo-Marquez
Anton Van Den Hengel
VLM
173
1
0
22 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
382
50
0
19 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedIm
LM&MA
190
29
0
13 Dec 2023
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Yong Liu
Sule Bai
Guanbin Li
Yitong Wang
Yansong Tang
VLM
208
47
0
07 Dec 2023
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2023
Cong Yang
Zuchao Li
Lefei Zhang
155
58
0
02 Dec 2023
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts
Computer Vision and Pattern Recognition (CVPR), 2023
Mu Cai
Haotian Liu
Dennis Park
Siva Karthik Mustikovela
Gregory P. Meyer
Yuning Chai
Yong Jae Lee
VLM
LRM
MLLM
313
149
0
01 Dec 2023
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video Retrieval
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
M. Gwilliam
Michael Cogswell
Meng Ye
Karan Sikka
Abhinav Shrivastava
Ajay Divakaran
3DV
271
1
1
30 Nov 2023
LEAP: LLM-Generation of Egocentric Action Programs
Eadom Dessalene
Michael Maynord
Cornelia Fermuller
Yiannis Aloimonos
273
2
0
29 Nov 2023
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner
Lizhao Liu
Xinyu Sun
Tianhang Xiang
Zhuangwei Zhuang
Liuren Yin
Mingkui Tan
VLM
159
4
0
29 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
380
1
0
28 Nov 2023
LANS: A Layout-Aware Neural Solver for Plane Geometry Problem
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Zhong-Zhi Li
Ming-Liang Zhang
Fei Yin
Cheng-Lin Liu
232
21
0
25 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yangyi Chen
Xingyao Wang
Pengfei Yu
Derek Hoiem
Heng Ji
232
14
0
22 Nov 2023
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
Yaning Tan
Mingli Zhu
Aishan Liu
Baoyuan Wu
Xiaochun Cao
Ee-Chien Chang
497
92
0
20 Nov 2023
Open-Vocabulary Camouflaged Object Segmentation
Youwei Pang
Xiaoqi Zhao
Jiaming Zuo
Lihe Zhang
Huchuan Lu
VLM
ObjD
313
12
0
19 Nov 2023
Active Prompt Learning in Vision Language Models
Jihwan Bang
Sumyeong Ahn
Jae-Gil Lee
VLM
239
18
0
18 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
411
99
0
16 Nov 2023
Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings
Fatema Hasan
Yulong Li
James R. Foulds
Shimei Pan
Bishwaranjan Bhattacharjee
291
2
0
13 Nov 2023
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
Cheng Yang
Rui Xu
Ye Guo
Peixiang Huang
Yiru Chen
Wenkui Ding
Zhongyuan Wang
Hong Zhou
LRM
164
8
0
09 Nov 2023
Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit Retrieval
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Junkyu Jang
Eugene Hwang
Sung-Hyuk Park
146
2
0
03 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Information Fusion (Inf. Fusion), 2023
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
385
70
0
01 Nov 2023
M2C: Towards Automatic Multimodal Manga Complement
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Hongcheng Guo
Boyang Wang
Jiaqi Bai
Jiaheng Liu
Jian Yang
Zhoujun Li
198
13
0
26 Oct 2023
The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Xinyi Chen
Raquel Fernández
Sandro Pezzelle
VLM
190
12
0
23 Oct 2023
Jaeger: A Concatenation-Based Multi-Transformer VQA Model
Jieting Long
Zewei Shi
Penghao Jiang
Yidong Gan
159
0
0
11 Oct 2023
I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information Extraction
ACM Multimedia Asia (MA), 2023
Yusheng Huang
Zhouhan Lin
157
7
0
10 Oct 2023
GRID: A Platform for General Robot Intelligence Development
Sai H. Vemprala
Shuhang Chen
Abhinav Shukla
Dinesh Narayanan
Ashish Kapoor
251
11
0
02 Oct 2023
AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZ
International Conference on Learning Representations (ICLR), 2023
Jonas Belouadi
Anne Lauscher
Steffen Eger
262
48
0
30 Sep 2023
Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search
Yuanmin Tang
Daling Wang
Keke Gai
Wenfang Wu
Yifei Zhang
Gang Xiong
Qi Wu
213
4
0
28 Sep 2023
Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image Retrieval
AAAI Conference on Artificial Intelligence (AAAI), 2023
Yuanmin Tang
Jiahao Yu
Keke Gai
Jiamin Zhuang
Gang Xiong
Yue Hu
Qi Wu
191
52
0
28 Sep 2023
Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer
ACM Multimedia (ACM MM), 2023
Zhihao Zhang
Yiwei Chen
Weizhan Zhang
Caixia Yan
Qinghua Zheng
Qi Wang
Wang Chen
140
9
0
26 Sep 2023
VidChapters-7M: Video Chapters at Scale
Neural Information Processing Systems (NeurIPS), 2023
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
234
38
0
25 Sep 2023
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
304
21
0
23 Sep 2023
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
202
6
0
16 Sep 2023
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary Tasks
Findings (Findings), 2023
Danae Sánchez Villegas
Daniel Preoţiuc-Pietro
Nikolaos Aletras
205
4
0
14 Sep 2023
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
Yunhao Ge
Lyne Tchapmi
Brian Nlong Zhao
Neel Joshi
Laurent Itti
Vibhav Vineet
DiffM
215
16
0
12 Sep 2023
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models
North American Chapter of the Association for Computational Linguistics (NAACL), 2023
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
LRM
309
42
0
08 Sep 2023
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identification
IEEE International Conference on Computer Vision (ICCV), 2023
Zhiyin Shao
Xinyu Zhang
Changxing Ding
Jian Wang
Jingdong Wang
231
34
0
04 Sep 2023
A Fine-Grained Image Description Generation Method Based on Joint Objectives
Chinese Conference on Computer Supported Cooperative Work and Social Computing (SCWSC), 2023
Yifan Zhang
Chunzhen Lin
Donglin Cao
Dazhen Lin
EGVM
111
0
0
02 Sep 2023
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications
Wenyi Wu
Karim Bouyarmane
Ismail B. Tutar
54
2
0
30 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object Detection
IEEE Transactions on Image Processing (IEEE TIP), 2023
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
193
9
0
30 Aug 2023
Multi-event Video-Text Retrieval
IEEE International Conference on Computer Vision (ICCV), 2023
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
179
18
0
22 Aug 2023