2111.08276
Cited By
Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts
16 November 2021
Yan Zeng
Xinsong Zhang
Hang Li
VLM
CLIP
Papers citing "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts"
44 / 44 papers shown
Title
Decoupled Global-Local Alignment for Improving Compositional Understanding
Xiaoxing Hu
Kaicheng Yang
J. Z. Wang
Haoran Xu
Ziyong Feng
Y. Wang
VLM
59
0
0
23 Apr 2025
Attention Hijackers: Detect and Disentangle Attention Hijacking in LVLMs for Hallucination Mitigation
Beitao Chen
Xinyu Lyu
Lianli Gao
Jingkuan Song
H. Shen
63
1
0
11 Mar 2025
Can Hallucination Correction Improve Video-Language Alignment?
Lingjun Zhao
Mingyang Xie
Paola Cascante-Bonilla
Hal Daumé III
Kwonjoon Lee
HILM
VLM
57
0
0
20 Feb 2025
Where am I? Cross-View Geo-localization with Natural Language Descriptions
Junyan Ye
Honglin Lin
Leyan Ou
Dairong Chen
Zihao Wang
Conghui He
Weijia Li
76
0
0
22 Dec 2024
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
Shuyu Yang
Yaxiong Wang
Li Zhu
Zhedong Zheng
88
2
0
26 Nov 2024
Text-Guided Coarse-to-Fine Fusion Network for Robust Remote Sensing Visual Question Answering
Zhicheng Zhao
Changfu Zhou
Yu Zhang
Chenglong Li
Xiaoliang Ma
Jin Tang
66
0
0
24 Nov 2024
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks
Hoin Jung
T. Jang
Xiaoqian Wang
VLM
21
2
0
10 Oct 2024
ComAlign: Compositional Alignment in Vision-Language Models
Ali Abdollah
Amirmohammad Izadi
Armin Saghafian
Reza Vahidimajd
Mohammad Mozafari
Amirreza Mirzaei
Mohammadmahdi Samiei
M. Baghshah
CoGe
VLM
25
0
0
12 Sep 2024
ExpertAF: Expert Actionable Feedback from Video
Kumar Ashutosh
Tushar Nagarajan
Georgios Pavlakos
Kris M. Kitani
Kristen Grauman
VGen
42
2
0
01 Aug 2024
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
Yu-Guan Hsieh
Cheng-Yu Hsieh
Shih-Ying Yeh
Louis Béthune
Hadi Pour Ansari
Pavan Kumar Anasosalu Vasu
Chun-Liang Li
Ranjay Krishna
Oncel Tuzel
Marco Cuturi
58
4
0
09 Jul 2024
One Perturbation is Enough: On Generating Universal Adversarial Perturbations against Vision-Language Pre-training Models
Hao Fang
Jiawei Kong
Wenbo Yu
Bin Chen
Jiawei Li
Hao Wu
Ke Xu
AAML
VLM
30
13
0
08 Jun 2024
CLIPLoss and Norm-Based Data Selection Methods for Multimodal Contrastive Learning
Yiping Wang
Yifang Chen
Wendan Yan
Alex Fang
Wenjing Zhou
Kevin G. Jamieson
S. Du
32
7
0
29 May 2024
Understanding the Effect of using Semantically Meaningful Tokens for Visual Representation Learning
N. Kalibhat
Priyatham Kattakinda
Arman Zarei
Nikita Seleznev
Sam Sharpe
Senthil Kumar
S. Feizi
ViT
31
0
0
26 May 2024
Revisiting Text-to-Image Evaluation with Gecko: On Metrics, Prompts, and Human Ratings
Olivia Wiles
Chuhan Zhang
Isabela Albuquerque
Ivana Kajić
Su Wang
...
Jordi Pont-Tuset
Aida Nematzadeh
Anant Nawalgaria
EGVM
117
13
0
25 Apr 2024
MergeNet: Knowledge Migration across Heterogeneous Models, Tasks, and Modalities
Kunxi Li
Tianyu Zhan
Kairui Fu
Shengyu Zhang
Kun Kuang
Jiwei Li
Zhou Zhao
Fei Wu
MoMe
22
0
0
20 Apr 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM
CLIP
62
12
0
05 Mar 2024
ProtChatGPT: Towards Understanding Proteins with Large Language Models
Chao Wang
Hehe Fan
Ruijie Quan
Yi Yang
26
12
0
15 Feb 2024
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
11
4
0
14 Dec 2023
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
Wujian Peng
Sicheng Xie
Zuyao You
Shiyi Lan
Zuxuan Wu
VLM
CoGe
MLLM
21
16
0
30 Nov 2023
Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models
Jiayun Luo
Siddhesh Khandelwal
Leonid Sigal
Boyang Albert Li
MLLM
VLM
27
7
0
28 Nov 2023
Multiscale Superpixel Structured Difference Graph Convolutional Network for VL Representation
Siyu Zhang
Ye-Ting Chen
Fang Wang
Yaoru Sun
Jun Yang
Lizhi Bai
SSL
17
0
0
20 Oct 2023
EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
Junyi Chen
Longteng Guo
Jianxiang Sun
Shuai Shao
Zehuan Yuan
Liang Lin
Dongyu Zhang
MLLM
VLM
MoE
43
9
0
23 Aug 2023
Bridging the Gap: Exploring the Capabilities of Bridge-Architectures for Complex Visual Reasoning Tasks
Kousik Rajesh
Mrigank Raman
M. A. Karim
Pranit Chawla
VLM
23
2
0
31 Jul 2023
HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning
Chia-Wen Kuo
Z. Kira
25
21
0
25 May 2023
ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities
Peng Wang
Shijie Wang
Junyang Lin
Shuai Bai
Xiaohuan Zhou
Jingren Zhou
Xinggang Wang
Chang Zhou
VLM
MLLM
ObjD
16
113
0
18 May 2023
Detecting and Grounding Multi-Modal Media Manipulation
Rui Shao
Tianxing Wu
Ziwei Liu
19
56
0
05 Apr 2023
Cross-Modal Causal Intervention for Medical Report Generation
Weixing Chen
Yang Liu
Ce Wang
Jiarui Zhu
Shen Zhao
Guanbin Li
Cheng-Lin Liu
Liang Lin
21
5
0
16 Mar 2023
ViperGPT: Visual Inference via Python Execution for Reasoning
Dídac Surís
Sachit Menon
Carl Vondrick
MLLM
LRM
ReLM
19
428
0
14 Mar 2023
Focusing On Targets For Improving Weakly Supervised Visual Grounding
V. Pham
Nao Mishima
ObjD
13
1
0
22 Feb 2023
OvarNet: Towards Open-vocabulary Object Attribute Recognition
Keyan Chen
Xiaolong Jiang
Yao Hu
Xu Tang
Yan Gao
Jianqi Chen
Weidi Xie
VLM
ObjD
27
38
0
23 Jan 2023
Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study
Mariya Hendriksen
Svitlana Vakulenko
E. Kuiper
Maarten de Rijke
19
5
0
12 Jan 2023
Beyond Triplet: Leveraging the Most Data for Multimodal Machine Translation
Yaoming Zhu
Zewei Sun
Shanbo Cheng
Yuyang Huang
Liwei Wu
Mingxuan Wang
21
10
0
20 Dec 2022
Position-guided Text Prompt for Vision-Language Pre-training
Alex Jinpeng Wang
Pan Zhou
Mike Zheng Shou
Shuicheng Yan
VLM
11
37
0
19 Dec 2022
G-MAP: General Memory-Augmented Pre-trained Language Model for Domain Tasks
Zhongwei Wan
Yichun Yin
Wei Zhang
Jiaxin Shi
Lifeng Shang
Guangyong Chen
Xin Jiang
Qun Liu
VLM
CLL
21
16
0
07 Dec 2022
Open-vocabulary Attribute Detection
M. A. Bravo
Sudhanshu Mittal
Simon Ging
Thomas Brox
VLM
ObjD
14
30
0
23 Nov 2022
V²L: Leveraging Vision and Vision-language Models into Large-scale Product Retrieval
Wenhao Wang
Yifan Sun
Zongxin Yang
Yi Yang
VLM
8
3
0
26 Jul 2022
Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou
Aishwarya Kamath
Zhe Gan
Pengchuan Zhang
Jianfeng Wang
...
Ce Liu
Yann LeCun
Nanyun Peng
Jianfeng Gao
Lijuan Wang
VLM
ObjD
11
123
0
15 Jun 2022
VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models
Wangchunshu Zhou
Yan Zeng
Shizhe Diao
Xinsong Zhang
CoGe
VLM
17
13
0
30 May 2022
VLP: A Survey on Vision-Language Pre-training
Feilong Chen
Duzhen Zhang
Minglun Han
Xiuyi Chen
Jing Shi
Shuang Xu
Bo Xu
VLM
79
208
0
18 Feb 2022
A Survey on Green Deep Learning
Jingjing Xu
Wangchunshu Zhou
Zhiyi Fu
Hao Zhou
Lei Li
VLM
71
79
0
08 Nov 2021
KD-VLP: Improving End-to-End Vision-and-Language Pretraining with Object Knowledge Distillation
Yongfei Liu
Chenfei Wu
Shao-Yen Tseng
Vasudev Lal
Xuming He
Nan Duan
CLIP
VLM
47
28
0
22 Sep 2021
Conceptual 12M: Pushing Web-Scale Image-Text Pre-Training To Recognize Long-Tail Visual Concepts
Soravit Changpinyo
P. Sharma
Nan Ding
Radu Soricut
VLM
273
1,077
0
17 Feb 2021
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
293
3,683
0
11 Feb 2021
Unifying Vision-and-Language Tasks via Text Generation
Jaemin Cho
Jie Lei
Hao Tan
Mohit Bansal
MLLM
249
518
0
04 Feb 2021