arXiv:2104.03135
Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning
7 April 2021
Zhicheng Huang
Zhaoyang Zeng
Yupan Huang
Bei Liu
Dongmei Fu
Jianlong Fu
VLM
ViT
Papers citing
"Seeing Out of tHe bOx: End-to-End Pre-training for Vision-Language Representation Learning"
50 / 175 papers shown
Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training
Chong Liu
Yuqi Zhang
Hongsong Wang
Weihua Chen
F. Wang
Yan Huang
Yixing Shen
Liang Wang
16
25
0
15 Jun 2023
Global and Local Semantic Completion Learning for Vision-Language Pre-training
Rong-Cheng Tu
Yatai Ji
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
10
2
0
12 Jun 2023
Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark
Li Xu
Bo Liu
Ameer Hamza Khan
Lu Fan
Xiao-Ming Wu
LM&MA
15
9
0
10 Jun 2023
Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes
Alexandros Delitzas
Maria Parelli
Nikolas Hars
G. Vlassis
Sotiris Anagnostidis
Gregor Bachmann
Thomas Hofmann
CLIP
12
19
0
04 Jun 2023
Recent Advances of Local Mechanisms in Computer Vision: A Survey and Outlook of Recent Work
Qiangchang Wang
Yilong Yin
21
0
0
02 Jun 2023
Masked Autoencoder for Unsupervised Video Summarization
Minho Shim
Taeoh Kim
Jinhyung Kim
Dongyoon Wee
15
1
0
02 Jun 2023
Adapting Pre-trained Language Models to Vision-Language Tasks via Dynamic Visual Prompting
Shubin Huang
Qiong Wu
Yiyi Zhou
Weijie Chen
Rongsheng Zhang
Xiaoshuai Sun
Rongrong Ji
VLM
VPVLM
LRM
16
0
0
01 Jun 2023
UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning
Ahmed Masry
P. Kavehzadeh
Do Xuan Long
Enamul Hoque
Shafiq R. Joty
LRM
19
100
0
24 May 2023
Parameter-efficient Tuning of Large-scale Multimodal Foundation Model
Haixin Wang
Xinlong Yang
Jianlong Chang
Di Jin
Jinan Sun
Shikun Zhang
Xiao Luo
Qi Tian
22
22
0
15 May 2023
ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding
Le Xue
Ning Yu
Shu Zhen Zhang
Artemis Panagopoulou
Junnan Li
...
Jiajun Wu
Caiming Xiong
Ran Xu
Juan Carlos Niebles
Silvio Savarese
19
115
0
14 May 2023
Structure-CLIP: Towards Scene Graph Knowledge to Enhance Multi-modal Structured Representations
Yufen Huang
Jiji Tang
Zhuo Chen
Rongsheng Zhang
Xinfeng Zhang
...
Zeng Zhao
Zhou Zhao
Tangjie Lv
Zhipeng Hu
Wen Zhang
VLM
12
21
0
06 May 2023
Few-shot Domain-Adaptive Visually-fused Event Detection from Text
Farhad Moghimifar
Fatemeh Shiri
Van Nguyen
Gholamreza Haffari
Yuanyou Li
VLM
30
2
0
04 May 2023
Rethinking Benchmarks for Cross-modal Image-text Retrieval
Wei-Neng Chen
Linli Yao
Qin Jin
VLM
8
17
0
21 Apr 2023
MoMo: A shared encoder Model for text, image and multi-Modal representations
Rakesh Chada
Zhao-Heng Zheng
P. Natarajan
ViT
13
4
0
11 Apr 2023
FashionSAP: Symbols and Attributes Prompt for Fine-grained Fashion Vision-Language Pre-training
Yunpeng Han
Lisai Zhang
Qingcai Chen
Zhijian Chen
Zhonghua Li
Jianxin Yang
Zhao Cao
AI4TS
VLM
21
11
0
11 Apr 2023
Probabilistic Prompt Learning for Dense Prediction
Hyeongjun Kwon
Taeyong Song
Somi Jeong
Jin-Hwa Kim
Jinhyun Jang
K. Sohn
VLM
14
18
0
03 Apr 2023
Multi-Modal Representation Learning with Text-Driven Soft Masks
Jaeyoo Park
Bohyung Han
SSL
17
4
0
03 Apr 2023
Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens
Yuxiao Chen
Jianbo Yuan
Yu Tian
Shijie Geng
Xinyu Li
Ding Zhou
Dimitris N. Metaxas
Hongxia Yang
14
33
0
27 Mar 2023
HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention
Shijie Geng
Jianbo Yuan
Yu Tian
Yuxiao Chen
Yongfeng Zhang
CLIP
VLM
41
44
0
06 Mar 2023
FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks
Xiaoping Han
Xiatian Zhu
Licheng Yu
Li Zhang
Yi-Zhe Song
Tao Xiang
VLM
11
38
0
04 Mar 2023
The Contribution of Knowledge in Visiolinguistic Learning: A Survey on Tasks and Challenges
Maria Lymperaiou
Giorgos Stamou
VLM
24
4
0
04 Mar 2023
MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering
Jingjing Jiang
Nanning Zheng
MoE
32
6
0
02 Mar 2023
Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
Antoine Yang
Arsha Nagrani
Paul Hongsuck Seo
Antoine Miech
Jordi Pont-Tuset
Ivan Laptev
Josef Sivic
Cordelia Schmid
AI4TS
VLM
18
220
0
27 Feb 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Xiao Wang
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
24
199
0
20 Feb 2023
UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling
Haoyu Lu
Yuqi Huo
Guoxing Yang
Zhiwu Lu
Wei Zhan
M. Tomizuka
Mingyu Ding
25
31
0
13 Feb 2023
Learning by Asking for Embodied Visual Navigation and Task Completion
Ying Shen
Ismini Lourentzou
20
1
0
09 Feb 2023
SwinCross: Cross-modal Swin Transformer for Head-and-Neck Tumor Segmentation in PET/CT Images
Gary Y. Li
Junyu Chen
Se-In Jang
Kuang Gong
Quanzheng Li
ViT
MedIm
31
14
0
08 Feb 2023
Learning to Agree on Vision Attention for Visual Commonsense Reasoning
Zhenyang Li
Yangyang Guo
Ke-Jyun Wang
Fan Liu
Liqiang Nie
Mohan S. Kankanhalli
18
10
0
04 Feb 2023
Masked Autoencoding Does Not Help Natural Language Supervision at Scale
Floris Weers
Vaishaal Shankar
Angelos Katharopoulos
Yinfei Yang
Tom Gunter
CLIP
13
4
0
19 Jan 2023
SPRING: Situated Conversation Agent Pretrained with Multimodal Questions from Incremental Layout Graph
Yuxing Long
Binyuan Hui
Fulong Ye
Yanyang Li
Zhuoxin Han
Caixia Yuan
Yongbin Li
Xiaojie Wang
LLMAG
12
7
0
05 Jan 2023
GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods
Da Yin
Feng Gao
Govind Thattai
Michael F. Johnston
Kai-Wei Chang
VLM
27
15
0
05 Jan 2023
TeViS: Translating Text Synopses to Video Storyboards
Xu Gu
Yuchong Sun
Feiyue Ni
Shizhe Chen
Xihua Wang
Ruihua Song
B. Li
Xiang Cao
DiffM
23
4
0
31 Dec 2022
Position-guided Text Prompt for Vision-Language Pre-training
Alex Jinpeng Wang
Pan Zhou
Mike Zheng Shou
Shuicheng Yan
VLM
19
37
0
19 Dec 2022
Efficient Image Captioning for Edge Devices
Ning Wang
Jiangrong Xie
Hangzai Luo
Qinglin Cheng
Jihao Wu
Mingbo Jia
Linlin Li
VLM
CLIP
23
20
0
18 Dec 2022
BEVBert: Multimodal Map Pre-training for Language-guided Navigation
Dongyan An
Yuankai Qi
Yangguang Li
Yan Huang
Liangsheng Wang
T. Tan
Jing Shao
28
55
0
08 Dec 2022
Controllable Image Captioning via Prompting
Ning Wang
Jiahao Xie
Jihao Wu
Mingbo Jia
Linlin Li
14
23
0
04 Dec 2022
Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning
Yatai Ji
Rong-Cheng Tu
Jie Jiang
Weijie Kong
Chengfei Cai
Wenzhe Zhao
Hongfa Wang
Yujiu Yang
Wei Liu
VLM
32
13
0
24 Nov 2022
A survey on knowledge-enhanced multimodal learning
Maria Lymperaiou
Giorgos Stamou
28
13
0
19 Nov 2022
CapEnrich: Enriching Caption Semantics for Web Images via Cross-modal Pre-trained Knowledge
Linli Yao
Wei-Neng Chen
Qin Jin
VLM
22
10
0
17 Nov 2022
Multi-VQG: Generating Engaging Questions for Multiple Images
Min-Hsuan Yeh
Vicent Chen
Ting-Hao Haung
Lun-Wei Ku
CoGe
6
7
0
14 Nov 2022
Masked Vision-Language Transformer in Fashion
Ge-Peng Ji
Mingchen Zhuge
D. Gao
Deng-Ping Fan
Christos Sakaridis
Luc Van Gool
17
25
0
27 Oct 2022
Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision
T. Wang
Jorma T. Laaksonen
T. Langer
Heikki Arponen
Tom E. Bishop
VLM
16
6
0
24 Oct 2022
Towards Unifying Reference Expression Generation and Comprehension
Duo Zheng
Tao Kong
Ya Jing
Jiaan Wang
Xiaojie Wang
ObjD
27
6
0
24 Oct 2022
EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning
Tiannan Wang
Wangchunshu Zhou
Yan Zeng
Xinsong Zhang
VLM
19
36
0
14 Oct 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS
VLM
16
68
0
12 Oct 2022
MAP: Multimodal Uncertainty-Aware Vision-Language Pre-training Model
Yatai Ji
Junjie Wang
Yuan Gong
Lin Zhang
Yan Zhu
Hongfa Wang
Jiaxing Zhang
Tetsuya Sakai
Yujiu Yang
MLLM
19
29
0
11 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
24
21
0
09 Oct 2022
LAVIS: A Library for Language-Vision Intelligence
Dongxu Li
Junnan Li
Hung Le
Guangsen Wang
Silvio Savarese
S. Hoi
VLM
113
51
0
15 Sep 2022
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering
Jingjing Jiang
Zi-yi Liu
Nanning Zheng
21
8
0
14 Sep 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP
VLM
25
68
0
14 Sep 2022