2004.00849
Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers
2 April 2020
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
ViT
Papers citing
"Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers"
50 / 292 papers shown
Countering Multi-modal Representation Collapse through Rank-targeted Fusion
Seulgi Kim
Kiran Kokilepersaud
Mohit Prabhushankar
Ghassan AlRegib
180
0
0
09 Nov 2025
Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems
Saeed Amizadeh
Sara Abdali
Yinheng Li
K. Koishida
214
0
0
18 Sep 2025
Copycat vs. Original: Multi-modal Pretraining and Variable Importance in Box-office Prediction
Qin Chao
Eunsoo Kim
Boyang Albert Li
172
0
0
18 Sep 2025
TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation
Qianqi Lu
Yuxiang Xie
Jing Zhang
Shiwei Zou
Yan Chen
Xidao Luan
224
0
0
16 Sep 2025
OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning
Yanqing Liu
Xianhang Li
Letian Zhang
Zirui Wang
Zeyu Zheng
Yuyin Zhou
Cihang Xie
VLM
241
5
0
01 Sep 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE
VLM
333
0
0
13 Jun 2025
RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer
Haotian Ni
Yake Wei
Hang Liu
Gong Chen
Chong Peng
Hao Lin
Di Hu
OffRL
335
1
0
13 Jun 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Shibin Mei
Hang Wang
Bingbing Ni
347
0
0
16 May 2025
Solar Multimodal Transformer: Intraday Solar Irradiance Predictor using Public Cameras and Time Series
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2025
Yanan Niu
Roy Sarkis
D. Psaltis
Mario Paolone
Christophe Moser
Luisa Lambertini
376
3
0
28 Feb 2025
Vision Language Models in Medicine
Beria Chingnabe Kalpelbe
Angel Gabriel Adaambiik
Wei Peng
VLM
LM&MA
431
7
0
24 Feb 2025
ESANS: Effective and Semantic-Aware Negative Sampling for Large-Scale Retrieval Systems
The Web Conference (WWW), 2025
Haibo Xing
Kanefumi Matsuyama
Hao Deng
Jinxin Hu
Yu Zhang
Xiaoyi Zeng
325
6
0
22 Feb 2025
Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation
International Conference on Multimedia Retrieval (ICMR), 2024
Yuheng Ji
Yue Liu
Zhicheng Zhang
Zhao Zhang
Yuting Zhao
Gang Zhou
Xingwei Zhang
Xinwang Liu
Xiaolong Zheng
VLM
433
4
0
21 Feb 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan
Xianrui Li
Tao Zhang
Zilong Huang
Shilin Xu
...
Yunhai Tong
Lu Qi
Jiashi Feng
Ming-Hsuan Yang
VLM
784
68
0
07 Jan 2025
Foundations of GenIR
Jiaxin Mao
Jingtao Zhan
Wenshu Fan
296
0
0
06 Jan 2025
Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation
Xinkai Du
Quanjie Han
Chao Lv
Yi Liu
Yalin Sun
Hao Shu
Hongbo Shan
Maosong Sun
RALM
401
2
0
25 Dec 2024
MIMIC: Multimodal Islamophobic Meme Identification and Classification
Safrin Sanzida Islam
Sahid Hossain Mustakim
Sadia Ahmmed
Md. Faiyaz Abdullah Sayeedi
Swapnil Khandoker
Syed Tasdid Azam Dhrubo
Nahid Md Lokman Hossain
261
1
0
01 Dec 2024
Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval
IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2024
Zengbao Sun
Ming Zhao
Gaorui Liu
Andre Kaup
344
13
0
22 Nov 2024
A Comprehensive Survey on Visual Question Answering Datasets and Algorithms
Raihan Kabir
Naznin Haque
Md. Saiful Islam
Marium-E. Jannat
CoGe
306
12
0
17 Nov 2024
Renaissance: Investigating the Pretraining of Vision-Language Encoders
Clayton Fields
C. Kennington
VLM
207
1
0
11 Nov 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
ACM Multimedia (ACM MM), 2022
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
400
9
0
16 Oct 2024
Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification
Raja Kumar
Raghav Singhal
Pranamya Kulkarni
Deval Mehta
Kshitij S. Jadhav
502
3
0
26 Sep 2024
Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification
X. Wang
Yuwei Zhou
Bin Huang
Hong Chen
Wenwu Zhu
DiffM
554
9
0
23 Sep 2024
Pixels to Prose: Understanding the art of Image Captioning
Hrishikesh Singh
Aarti Sharma
Millie Pant
3DV
VLM
255
3
0
28 Aug 2024
MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce
Hao Jiang
Haoxiang Zhang
Qingshan Hou
Chaofeng Chen
Weisi Lin
Jingchang Zhang
Annan Wang
96
1
0
27 Aug 2024
Macformer: Transformer with Random Maclaurin Feature Attention
Yuhan Guo
Lizhong Ding
Ye Yuan
Guoren Wang
311
0
0
21 Aug 2024
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
Yuxin Chen
Zongyang Ma
Ziqi Zhang
Chen Ma
Chunfeng Yuan
Bing Li
Junfu Pu
Ying Shan
Xiaojuan Qi
Weiming Hu
190
6
0
10 Jul 2024
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Yuxuan Zhang
Tianheng Cheng
Lianghui Zhu
Lei Liu
Heng Liu
Longjin Ran
Xiaoxin Chen
Wenyu Liu
Xinggang Wang
VLM
656
66
0
28 Jun 2024
An Image is Worth 32 Tokens for Reconstruction and Generation
Qihang Yu
Mark Weber
XueQing Deng
Xiaohui Shen
Daniel Cremers
Liang-Chieh Chen
VLM
ViT
471
236
0
11 Jun 2024
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
223
10
0
29 May 2024
Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval
Rui Yang
Shuang Wang
Yi Han
Yuanheng Li
Dong Zhao
Dou Quan
Yanhe Guo
Licheng Jiao
287
11
0
29 May 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
231
1
0
09 May 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
492
23
0
16 Apr 2024
Bridging Vision and Language Spaces with Assignment Prediction
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
VLM
385
13
0
15 Apr 2024
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Quang Minh Dinh
Minh Khoi Ho
Anh Quan Dang
Hung Phong Tran
300
25
0
14 Apr 2024
Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation
Yichen Yan
Xingjian He
Sihan Chen
Jing Liu
ObjD
214
3
0
12 Apr 2024
GUIDE: Graphical User Interface Data for Execution
Rajat Chawla
Adarsh Jha
Muskaan Kumar
NS Mukunda
Ishaan Bhola
LLMAG
245
5
0
09 Apr 2024
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
Chull Hwan Song
Taebaek Hwang
Jooyoung Yoon
Shunghyun Choi
Yeong Hyeon Gu
211
12
0
01 Apr 2024
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
Haowei Liu
Yaya Shi
Haiyang Xu
Chunfeng Yuan
Qinghao Ye
...
Mingshi Yan
Ji Zhang
Fei Huang
Bing Li
Weiming Hu
VLM
354
1
0
01 Mar 2024
MaskFi: Unsupervised Learning of WiFi and Vision Representations for Multimodal Human Activity Recognition
Jianfei Yang
Shijie Tang
Yuecong Xu
Yunjiao Zhou
Lihua Xie
326
8
0
29 Feb 2024
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Raghav Kapoor
Y. Butala
M. Russak
Jing Yu Koh
Kiran Kamble
Waseem Alshikh
Ruslan Salakhutdinov
LLMAG
546
122
0
27 Feb 2024
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
Jianing Li
Xi Nan
Ming Lu
Li Du
Shanghang Zhang
185
6
0
31 Jan 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
291
5
0
31 Jan 2024
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
Xingning Dong
Qingpei Guo
Tian Gan
Qing Wang
Yue Yu
Xiangyuan Ren
Yuan Cheng
Wei Chu
247
6
0
31 Jan 2024
Memory-Inspired Temporal Prompt Interaction for Text-Image Classification
Xinyao Yu
Hao Sun
Ziwei Niu
Rui Qin
Zhenjia Bai
Yen-Wei Chen
Lanfen Lin
VLM
281
2
0
26 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
254
1
0
11 Jan 2024
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification
European Conference on Information Retrieval (ECIR), 2024
Shubham Gupta
Nandini Saini
Suman Kundu
Debasis Das
323
13
0
11 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
227
2
0
09 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
363
13
0
06 Jan 2024
TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion
Chunyang Cheng
Tianyang Xu
Xiao-Jun Wu
Hui Li
Xi Li
Zhangyong Tang
Josef Kittler
393
49
0
21 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2023
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
383
6
0
14 Dec 2023