Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2106.13488
Cited By
Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training
25 June 2021
Hongwei Xue
Yupan Huang
Bei Liu
Houwen Peng
Jianlong Fu
Houqiang Li
Jiebo Luo
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training"
50 / 60 papers shown
Title
Recent Advances of Multimodal Continual Learning: A Comprehensive Survey
Dianzhi Yu
Xinni Zhang
Yankai Chen
Aiwei Liu
Yifei Zhang
Philip S. Yu
Irwin King
VLM
CLL
36
9
0
07 Oct 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
Ke Xu
AAML
VLM
30
13
0
08 Jun 2024
Visual-Text Cross Alignment: Refining the Similarity Score in Vision-Language Models
Jinhao Li
Haopeng Li
S. Erfani
Lei Feng
James Bailey
Feng Liu
VLM
27
3
0
05 Jun 2024
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
26
3
0
29 May 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
21
0
0
09 May 2024
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
Chull Hwan Song
Taebaek Hwang
Jooyoung Yoon
Shunghyun Choi
Yeong Hyeon Gu
19
4
0
01 Apr 2024
Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval
Haocheng Han
Qinghua Zheng
Guangwen Dai
Minnan Luo
Jingdong Wang
22
5
0
08 Mar 2024
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
27
3
0
25 Nov 2023
From Image to Language: A Critical Analysis of Visual Question Answering (VQA) Approaches, Challenges, and Opportunities
Md Farhan Ishmam
Md Sakib Hossain Shovon
M. F. Mridha
Nilanjan Dey
35
35
0
01 Nov 2023
Expedited Training of Visual Conditioned Language Generation via Redundancy Reduction
Yiren Jian
Tingkai Liu
Yunzhe Tao
Chunhui Zhang
Soroush Vosoughi
HX Yang
VLM
13
7
0
05 Oct 2023
Kosmos-2.5: A Multimodal Literate Model
Tengchao Lv
Yupan Huang
Jingye Chen
Lei Cui
Shuming Ma
...
Weiyao Luo
Shaoxiang Wu
Guoxin Wang
Cha Zhang
Furu Wei
VLM
MLLM
15
63
0
20 Sep 2023
StructChart: Perception, Structuring, Reasoning for Visual Chart Understanding
Renqiu Xia
Bo-Wen Zhang
Hao Peng
Hancheng Ye
Xiangchao Yan
Peng Ye
Botian Shi
Yu Qiao
Junchi Yan
8
0
0
20 Sep 2023
Sparkles: Unlocking Chats Across Multiple Images for Multimodal Instruction-Following Models
Yupan Huang
Zaiqiao Meng
Fangyu Liu
Yixuan Su
Nigel Collier
Yutong Lu
MLLM
25
21
0
31 Aug 2023
SINC: Self-Supervised In-Context Learning for Vision-Language Tasks
Yi-Syuan Chen
Yun-Zhu Song
Cheng Yu Yeo
Bei Liu
Jianlong Fu
Hong-Han Shuai
VLM
LRM
16
4
0
15 Jul 2023
Bootstrapping Vision-Language Learning with Decoupled Language Pre-training
Yiren Jian
Chongyang Gao
Soroush Vosoughi
VLM
MLLM
16
25
0
13 Jul 2023
Efficient Token-Guided Image-Text Retrieval with Consistent Multimodal Contrastive Training
Chong Liu
Yuqi Zhang
Hongsong Wang
Weihua Chen
F. Wang
Yan Huang
Yixing Shen
Liang Wang
11
25
0
15 Jun 2023
Image Captioners Are Scalable Vision Learners Too
Michael Tschannen
Manoj Kumar
Andreas Steiner
Xiaohua Zhai
N. Houlsby
Lucas Beyer
VLM
CLIP
9
53
0
13 Jun 2023
Semantic Composition in Visually Grounded Language Models
Rohan Pandey
CoGe
11
1
0
15 May 2023
Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval
Han Fang
Zhifei Yang
Xianghao Zang
Chao Ban
Hao Sun
VGen
17
2
0
13 May 2023
Vision-Language Models in Remote Sensing: Current Progress and Future Trends
Xiang Li
Congcong Wen
Yuan Hu
Zhenghang Yuan
Xiao Xiang Zhu
VLM
8
69
0
09 May 2023
Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention
Mingyu Ding
Yikang Shen
Lijie Fan
Zhenfang Chen
Z. Chen
Ping Luo
J. Tenenbaum
Chuang Gan
ViT
59
9
0
06 Apr 2023
VideoXum: Cross-modal Visual and Textural Summarization of Videos
Jingyang Lin
Hang Hua
Ming Chen
Yikang Li
Jenhao Hsiao
C. Ho
Jiebo Luo
23
30
0
21 Mar 2023
Transformers in Speech Processing: A Survey
S. Latif
Aun Zaidi
Heriberto Cuayáhuitl
Fahad Shamshad
Moazzam Shoukat
Junaid Qadir
33
46
0
21 Mar 2023
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
Xiao Wang
Guangyao Chen
Guangwu Qian
Pengcheng Gao
Xiaoyong Wei
Yaowei Wang
Yonghong Tian
Wen Gao
AI4CE
VLM
24
195
0
20 Feb 2023
TeViS:Translating Text Synopses to Video Storyboards
Xu Gu
Yuchong Sun
Feiyue Ni
Shizhe Chen
Xihua Wang
Ruihua Song
B. Li
Xiang Cao
DiffM
21
3
0
31 Dec 2022
Exploring Vision Transformers as Diffusion Learners
He Cao
Jianan Wang
Tianhe Ren
Xianbiao Qi
Yihao Chen
Yuan Yao
L. Zhang
23
10
0
28 Dec 2022
Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment
Rohan Pandey
Rulin Shao
Paul Pu Liang
Ruslan Salakhutdinov
Louis-Philippe Morency
10
12
0
20 Dec 2022
Alignment-Enriched Tuning for Patch-Level Pre-trained Document Image Models
Lei Wang
Jian He
Xingdong Xu
Ning Liu
Hui-juan Liu
17
2
0
27 Nov 2022
A survey on knowledge-enhanced multimodal learning
Maria Lymperaiou
Giorgos Stamou
28
13
0
19 Nov 2022
AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders
W. G. C. Bandara
Naman Patel
A. Gholami
Mehdi Nikkhah
M. Agrawal
Vishal M. Patel
15
33
0
16 Nov 2022
Stare at What You See: Masked Image Modeling without Reconstruction
Hongwei Xue
Peng Gao
Hongyang Li
Yu Qiao
Hao Sun
Houqiang Li
Jiebo Luo
17
31
0
16 Nov 2022
Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning
Yuchong Sun
Hongwei Xue
Ruihua Song
Bei Liu
Huan Yang
Jianlong Fu
AI4TS
VLM
11
68
0
12 Oct 2022
VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment
Shraman Pramanick
Li Jing
Sayan Nag
Jiachen Zhu
Hardik Shah
Yann LeCun
Ramalingam Chellappa
19
21
0
09 Oct 2022
CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment
Hongwei Xue
Yuchong Sun
Bei Liu
Jianlong Fu
Rui Song
Houqiang Li
Jiebo Luo
CLIP
VLM
12
68
0
14 Sep 2022
Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment
Mustafa Shukor
Guillaume Couairon
Matthieu Cord
VLM
CLIP
19
26
0
29 Aug 2022
GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training
Jaeseok Byun
Taebaek Hwang
Jianlong Fu
Taesup Moon
VLM
13
10
0
08 Aug 2022
Vision-and-Language Pretraining
Thong Nguyen
Cong-Duy Nguyen
Xiaobao Wu
See-Kiong Ng
A. Luu
VLM
CLIP
11
2
0
05 Jul 2022
BridgeTower: Building Bridges Between Encoders in Vision-Language Representation Learning
Xiao Xu
Chenfei Wu
Shachar Rosenman
Vasudev Lal
Wanxiang Che
Nan Duan
30
64
0
17 Jun 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
6
106
0
15 Jun 2022
Multimodal Learning with Transformers: A Survey
P. Xu
Xiatian Zhu
David A. Clifton
ViT
26
518
0
13 Jun 2022
GIT: A Generative Image-to-text Transformer for Vision and Language
Jianfeng Wang
Zhengyuan Yang
Xiaowei Hu
Linjie Li
Kevin Qinghong Lin
Zhe Gan
Zicheng Liu
Ce Liu
Lijuan Wang
VLM
16
524
0
27 May 2022
HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval
Feilong Chen
Xiuyi Chen
Jiaxin Shi
Duzhen Zhang
Jianlong Chang
Qi Tian
VLM
CLIP
27
6
0
24 May 2022
Training Vision-Language Transformers from Captions
Liangke Gui
Yingshan Chang
Qiuyuan Huang
Subhojit Som
Alexander G. Hauptmann
Jianfeng Gao
Yonatan Bisk
VLM
ViT
169
11
0
19 May 2022
SceneTrilogy: On Human Scene-Sketch and its Complementarity with Photo and Text
Pinaki Nath Chowdhury
A. Bhunia
Aneeshan Sain
Subhadeep Koley
Tao Xiang
Yi-Zhe Song
30
29
0
25 Apr 2022
LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking
Yupan Huang
Tengchao Lv
Lei Cui
Yutong Lu
Furu Wei
17
430
0
18 Apr 2022
ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval
Mengjun Cheng
Yipeng Sun
Long Wang
Xiongwei Zhu
Kun Yao
...
Guoli Song
Junyu Han
Jingtuo Liu
Errui Ding
Jingdong Wang
20
47
0
31 Mar 2022
Image-text Retrieval: A Survey on Recent Research and Development
Min Cao
Shiping Li
Juntao Li
Liqiang Nie
Min Zhang
13
81
0
28 Mar 2022
Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting
Chuhui Xue
Wenqing Zhang
Yu Hao
Shijian Lu
Philip H. S. Torr
Song Bai
VLM
27
31
0
08 Mar 2022
Vision-Language Intelligence: Tasks, Representation Learning, and Large Models
Feng Li
Hao Zhang
Yi-Fan Zhang
S. Liu
Jian Guo
L. Ni
Pengchuan Zhang
Lei Zhang
AI4TS
VLM
11
36
0
03 Mar 2022
A Survey of Vision-Language Pre-Trained Models
Yifan Du
Zikang Liu
Junyi Li
Wayne Xin Zhao
VLM
13
177
0
18 Feb 2022
1
2
Next