ResearchTrend.AI
  • Communities
  • Connect sessions
  • AI calendar
  • Organizations
  • Join Slack
  • Contact Sales
Papers
Communities
Social Events
Terms and Conditions
Pricing
Contact Sales
Parameter LabParameter LabTwitterGitHubLinkedInBlueskyYoutube

© 2026 ResearchTrend.AI, All rights reserved.

  1. Home
  2. Papers
  3. 1908.06066
  4. Cited By
Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal
  Pre-training
v1v2v3 (latest)

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2019
16 August 2019
Gen Li
Nan Duan
Yuejian Fang
Ming Gong
Daxin Jiang
Ming Zhou
    SSLVLMMLLM
ArXiv (abs)PDFHTML

Papers citing "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training"

50 / 518 papers shown
The All-Seeing Project V2: Towards General Relation Comprehension of the
  Open World
The All-Seeing Project V2: Towards General Relation Comprehension of the Open World
Weiyun Wang
Yiming Ren
Hao Luo
Tiantong Li
Chenxiang Yan
...
Qingyun Li
Lewei Lu
Xizhou Zhu
Yu Qiao
Jifeng Dai
MLLM
318
85
0
29 Feb 2024
Automatic Creative Selection with Cross-Modal Matching
Automatic Creative Selection with Cross-Modal Matching
Alex Kim
Jia Huang
Rob Monarch
Jerry Kwac
Anikesh Kamath
P. Khurd
Kailash Thiyagarajan
Goodman Gu
VLM
167
0
0
28 Feb 2024
Acquiring Linguistic Knowledge from Multimodal Input
Acquiring Linguistic Knowledge from Multimodal Input
Theodor Amariucai
Alexander Scott Warstadt
CLL
284
4
0
27 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation
  Learning
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
322
9
0
27 Feb 2024
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
CFIR: Fast and Effective Long-Text To Image Retrieval for Large Corpora
Zijun Long
Xuri Ge
R. McCreadie
Joemon M. Jose
331
12
0
23 Feb 2024
Exploring Missing Modality in Multimodal Egocentric Datasets
Exploring Missing Modality in Multimodal Egocentric Datasets
Merey Ramazanova
Alejandro Pardo
Humam Alwassel
Guohao Li
EgoV
302
7
0
21 Jan 2024
POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images
POP-3D: Open-Vocabulary 3D Occupancy Prediction from ImagesNeural Information Processing Systems (NeurIPS), 2024
Antonín Vobecký
Oriane Siméoni
David Hurych
Spyros Gidaris
Andrei Bursuc
Patrick Pérez
Josef Sivic
258
51
0
17 Jan 2024
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention
  Network for Crisis Event Classification
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event ClassificationEuropean Conference on Information Retrieval (ECIR), 2024
Shubham Gupta
Nandini Saini
Suman Kundu
Debasis Das
257
10
0
11 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual
  Concept Understanding
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
191
2
0
09 Jan 2024
FM-AE: Frequency-masked Multimodal Autoencoder for Zinc Electrolysis
  Plate Contact Abnormality Detection
FM-AE: Frequency-masked Multimodal Autoencoder for Zinc Electrolysis Plate Contact Abnormality Detection
Can Zhou
Can Zhou
Hongqiu Zhu
Tianhao Liu
68
9
0
08 Jan 2024
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies
  and Variances
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
Ehsan Abbasnejad
Hamed Damirchi
Ignacio M. Jara
Felipe Bravo-Marquez
Anton Van Den Hengel
VLM
185
1
0
22 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose
  Coarse-to-Fine Vision-Language Model
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLMMLLM
386
50
0
19 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human
  Pathology
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedImLM&MA
210
30
0
13 Dec 2023
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Yong Liu
Sule Bai
Guanbin Li
Yitong Wang
Yansong Tang
VLM
224
48
0
07 Dec 2023
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image
  Captioning
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image CaptioningIEEE Transactions on Geoscience and Remote Sensing (TGRS), 2023
Cong Yang
Zuchao Li
Lefei Zhang
163
61
0
02 Dec 2023
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual
  Prompts
ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual PromptsComputer Vision and Pattern Recognition (CVPR), 2023
Mu Cai
Haotian Liu
Dennis Park
Siva Karthik Mustikovela
Gregory P. Meyer
Yuning Chai
Yong Jae Lee
VLMLRMMLLM
325
151
0
01 Dec 2023
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse
  Captions for Better Long Video Retrieval
A Video is Worth 10,000 Words: Training and Benchmarking with Diverse Captions for Better Long Video RetrievalIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
M. Gwilliam
Michael Cogswell
Meng Ye
Karan Sikka
Abhinav Shrivastava
Ajay Divakaran
3DV
288
1
1
30 Nov 2023
LEAP: LLM-Generation of Egocentric Action Programs
LEAP: LLM-Generation of Egocentric Action Programs
Eadom Dessalene
Michael Maynord
Cornelia Fermuller
Yiannis Aloimonos
280
2
0
29 Nov 2023
Contrastive Vision-Language Alignment Makes Efficient Instruction
  Learner
Contrastive Vision-Language Alignment Makes Efficient Instruction Learner
Lizhao Liu
Xinyu Sun
Tianhang Xiang
Zhuangwei Zhuang
Liuren Yin
Mingkui Tan
VLM
175
4
0
29 Nov 2023
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with
  Semantic Vector-Quantized Tokenizer
E-ViLM: Efficient Video-Language Model via Masked Video Modeling with Semantic Vector-Quantized Tokenizer
Jacob Zhiyuan Fang
Skyler Zheng
Vasu Sharma
Robinson Piramuthu
VLM
392
1
0
28 Nov 2023
LANS: A Layout-Aware Neural Solver for Plane Geometry Problem
LANS: A Layout-Aware Neural Solver for Plane Geometry ProblemAnnual Meeting of the Association for Computational Linguistics (ACL), 2023
Zhong-Zhi Li
Ming-Liang Zhang
Fei Yin
Cheng-Lin Liu
247
21
0
25 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided
  Code-Vision Representation
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision RepresentationConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Yangyi Chen
Xingyao Wang
Pengfei Yu
Derek Hoiem
Heng Ji
251
14
0
22 Nov 2023
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive
  Learning
BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning
Yaning Tan
Mingli Zhu
Aishan Liu
Baoyuan Wu
Xiaochun Cao
Ee-Chien Chang
498
92
0
20 Nov 2023
Open-Vocabulary Camouflaged Object Segmentation
Open-Vocabulary Camouflaged Object Segmentation
Youwei Pang
Xiaoqi Zhao
Jiaming Zuo
Lihe Zhang
Huchuan Lu
VLMObjD
330
13
0
19 Nov 2023
Active Prompt Learning in Vision Language Models
Active Prompt Learning in Vision Language Models
Jihwan Bang
Sumyeong Ahn
Jae-Gil Lee
VLM
259
20
0
18 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact
  with Humans via Natural Language Feedback
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
439
99
0
16 Nov 2023
Teach me with a Whisper: Enhancing Large Language Models for Analyzing
  Spoken Transcripts using Speech Embeddings
Teach me with a Whisper: Enhancing Large Language Models for Analyzing Spoken Transcripts using Speech Embeddings
Fatema Hasan
Yulong Li
James R. Foulds
Shimei Pan
Bishwaranjan Bhattacharjee
300
2
0
13 Nov 2023
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
Improving Vision-and-Language Reasoning via Spatial Relations Modeling
Cheng Yang
Rui Xu
Ye Guo
Peixiang Huang
Yiru Chen
Wenkui Ding
Zhongyuan Wang
Hong Zhou
LRM
168
8
0
09 Nov 2023
Lost Your Style? Navigating with Semantic-Level Approach for
  Text-to-Outfit Retrieval
Lost Your Style? Navigating with Semantic-Level Approach for Text-to-Outfit RetrievalIEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2023
Junkyu Jang
Eugene Hwang
Sung-Hyuk Park
154
2
0
03 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering
  (VQA) Approaches, Challenges, and Opportunities
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and OpportunitiesInformation Fusion (Inf. Fusion), 2023
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
399
71
0
01 Nov 2023
M2C: Towards Automatic Multimodal Manga Complement
M2C: Towards Automatic Multimodal Manga ComplementConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Hongcheng Guo
Boyang Wang
Jiaqi Bai
Jiaheng Liu
Jian Yang
Zhoujun Li
214
13
0
26 Oct 2023
The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained
  Multimodal Models
The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal ModelsConference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Xinyi Chen
Raquel Fernández
Sandro Pezzelle
VLM
208
12
0
23 Oct 2023
Jaeger: A Concatenation-Based Multi-Transformer VQA Model
Jaeger: A Concatenation-Based Multi-Transformer VQA Model
Jieting Long
Zewei Shi
Penghao Jiang
Yidong Gan
174
0
0
11 Oct 2023
I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal
  Information Extraction
I2SRM: Intra- and Inter-Sample Relationship Modeling for Multimodal Information ExtractionACM Multimedia Asia (MA), 2023
Yusheng Huang
Zhouhan Lin
161
7
0
10 Oct 2023
GRID: A Platform for General Robot Intelligence Development
GRID: A Platform for General Robot Intelligence Development
Sai H. Vemprala
Shuhang Chen
Abhinav Shukla
Dinesh Narayanan
Ashish Kapoor
271
11
0
02 Oct 2023
AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with
  TikZ
AutomaTikZ: Text-Guided Synthesis of Scientific Vector Graphics with TikZInternational Conference on Learning Representations (ICLR), 2023
Jonas Belouadi
Anne Lauscher
Steffen Eger
270
48
0
30 Sep 2023
Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal
  Sponsored Search
Align before Search: Aligning Ads Image to Text for Accurate Cross-Modal Sponsored Search
Yuanmin Tang
Daling Wang
Keke Gai
Wenfang Wu
Yifei Zhang
Gang Xiong
Qi Wu
230
4
0
28 Sep 2023
Context-I2W: Mapping Images to Context-dependent Words for Accurate
  Zero-Shot Composed Image Retrieval
Context-I2W: Mapping Images to Context-dependent Words for Accurate Zero-Shot Composed Image RetrievalAAAI Conference on Artificial Intelligence (AAAI), 2023
Yuanmin Tang
Jiahao Yu
Keke Gai
Jiamin Zhuang
Gang Xiong
Yue Hu
Qi Wu
203
53
0
28 Sep 2023
Tile Classification Based Viewport Prediction with Multi-modal Fusion Transformer
Tile Classification Based Viewport Prediction with Multi-modal Fusion TransformerACM Multimedia (ACM MM), 2023
Zhihao Zhang
Yiwei Chen
Weizhan Zhang
Caixia Yan
Qinghua Zheng
Qi Wang
Wang Chen
180
10
0
26 Sep 2023
VidChapters-7M: Video Chapters at Scale
VidChapters-7M: Video Chapters at ScaleNeural Information Processing Systems (NeurIPS), 2023
Antoine Yang
Arsha Nagrani
Ivan Laptev
Josef Sivic
Cordelia Schmid
VGen
248
39
0
25 Sep 2023
A Survey on Image-text Multimodal Models
A Survey on Image-text Multimodal Models
Ruifeng Guo
Jingxuan Wei
Linzhuang Sun
Khai-Nguyen Nguyen
Guiyong Chang
Dawei Liu
Sibo Zhang
Zhengbing Yao
Mingjun Xu
Liping Bu
VLM
320
22
0
23 Sep 2023
In-Style: Bridging Text and Uncurated Videos with Style Transfer for
  Text-Video Retrieval
In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video RetrievalIEEE International Conference on Computer Vision (ICCV), 2023
Nina Shvetsova
Anna Kukleva
Bernt Schiele
Hilde Kuehne
DiffM
229
6
0
16 Sep 2023
Improving Multimodal Classification of Social Media Posts by Leveraging
  Image-Text Auxiliary Tasks
Improving Multimodal Classification of Social Media Posts by Leveraging Image-Text Auxiliary TasksFindings (Findings), 2023
Danae Sánchez Villegas
Daniel Preoctiuc-Pietro
Nikolaos Aletras
221
4
0
14 Sep 2023
Beyond Generation: Harnessing Text to Image Models for Object Detection
  and Segmentation
Beyond Generation: Harnessing Text to Image Models for Object Detection and Segmentation
Yunhao Ge
Lyne Tchapmi
Brian Nlong Zhao
Neel Joshi
Laurent Itti
Vibhav Vineet
DiffM
215
16
0
12 Sep 2023
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language
  Models
Measuring and Improving Chain-of-Thought Reasoning in Vision-Language ModelsNorth American Chapter of the Association for Computational Linguistics (NAACL), 2023
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
LRM
320
42
0
08 Sep 2023
Unified Pre-training with Pseudo Texts for Text-To-Image Person
  Re-identification
Unified Pre-training with Pseudo Texts for Text-To-Image Person Re-identificationIEEE International Conference on Computer Vision (ICCV), 2023
Zhiyin Shao
Xinyu Zhang
Changxing Ding
Jian Wang
Jingdong Wang
239
38
0
04 Sep 2023
A Fine-Grained Image Description Generation Method Based on Joint
  Objectives
A Fine-Grained Image Description Generation Method Based on Joint ObjectivesChinese Conference on Computer Supported Cooperative Work and Social Computing (SCWSC), 2023
Yifan Zhang
Chunzhen Lin
Donglin Cao
Dazhen Lin
EGVM
123
0
0
02 Sep 2023
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes
  in Product Images for e-commerce Vision-Language Applications
Catalog Phrase Grounding (CPG): Grounding of Product Textual Attributes in Product Images for e-commerce Vision-Language Applications
Wenyi Wu
Karim Bouyarmane
Ismail B. Tutar
54
2
0
30 Aug 2023
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object
  Detection
Exploring Multi-Modal Contextual Knowledge for Open-Vocabulary Object DetectionIEEE Transactions on Image Processing (IEEE TIP), 2023
Yifan Xu
Mengdan Zhang
Xiaoshan Yang
Changsheng Xu
ObjD
215
9
0
30 Aug 2023
Multi-event Video-Text Retrieval
Multi-event Video-Text RetrievalIEEE International Conference on Computer Vision (ICCV), 2023
Gengyuan Zhang
Jisen Ren
Jindong Gu
Volker Tresp
193
18
0
22 Aug 2023
Previous
12345...91011
Next