arXiv: 2004.00849

Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

2 April 2020

Papers citing "Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers"

50 / 292 papers shown

Countering Multi-modal Representation Collapse through Rank-targeted Fusion

Kiran Kokilepersaud

Mohit Prabhushankar

Ghassan AlRegib

180

0

0

09 Nov 2025

Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems

214

0

0

18 Sep 2025

Copycat vs. Original: Multi-modal Pretraining and Variable Importance in Box-office Prediction

Boyang Albert Li

172

0

0

18 Sep 2025

TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation

224

0

0

16 Sep 2025

OpenVision 2: A Family of Generative Pretrained Visual Encoders for Multimodal Learning

241

5

0

01 Sep 2025

Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs

333

0

0

13 Jun 2025

RollingQ: Reviving the Cooperation Dynamics in Multimodal Transformer

335

1

0

13 Jun 2025

GeoMM: On Geodesic Perspective for Multi-modal Learning

Computer Vision and Pattern Recognition (CVPR), 2025

347

0

0

16 May 2025

Solar Multimodal Transformer: Intraday Solar Irradiance Predictor using Public Cameras and Time Series

IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2025

Christophe Moser

Luisa Lambertini

376

3

0

28 Feb 2025

Vision Language Models in Medicine

Beria Chingnabe Kalpelbe

Angel Gabriel Adaambiik

431

7

0

24 Feb 2025

ESANS: Effective and Semantic-Aware Negative Sampling for Large-Scale Retrieval Systems

The Web Conference (WWW), 2025

Kanefumi Matsuyama

325

6

0

22 Feb 2025

Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation

International Conference on Multimedia Retrieval (ICMR), 2024

433

4

0

21 Feb 2025

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

...

Ming-Hsuan Yang

784

68

0

07 Jan 2025

Foundations of GenIR

296

0

0

06 Jan 2025

Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation

401

2

0

25 Dec 2024

MIMIC: Multimodal Islamophobic Meme Identification and Classification

Safrin Sanzida Islam

Sahid Hossain Mustakim

Md. Faiyaz Abdullah Sayeedi

Swapnil Khandoker

Syed Tasdid Azam Dhrubo

Nahid Md Lokman Hossain

261

1

0

01 Dec 2024

Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval

IEEE Transactions on Geoscience and Remote Sensing (TGRS), 2024

344

13

0

22 Nov 2024

A Comprehensive Survey on Visual Question Answering Datasets and Algorithms

Md. Saiful Islam

Marium-E. Jannat

306

12

0

17 Nov 2024

Renaissance: Investigating the Pretraining of Vision-Language Encoders

207

1

0

11 Nov 2024

CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training

ACM Multimedia (ACM MM), 2022

400

9

0

16 Oct 2024

Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification

Pranamya Kulkarni

Kshitij S. Jadhav

502

3

0

26 Sep 2024

Multi-modal Generative AI: Multi-modal LLMs, Diffusions, and the Unification

554

9

0

23 Sep 2024

Pixels to Prose: Understanding the art of Image Captioning

Hrishikesh Singh

Aarti Sharma

255

3

0

28 Aug 2024

MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce

Jingchang Zhang

96

1

0

27 Aug 2024

Macformer: Transformer with Random Maclaurin Feature Attention

Ye Yuan

311

0

0

21 Aug 2024

How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?

Yuxin Chen

Chunfeng Yuan

Ying Shan

190

6

0

10 Jul 2024

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Lianghui Zhu

Xiaoxin Chen

Wenyu Liu

Xinggang Wang

656

66

0

28 Jun 2024

An Image is Worth 32 Tokens for Reconstruction and Generation

Daniel Cremers

Liang-Chieh Chen

471

236

0

11 Jun 2024

Multi-Modal Generative Embedding Model

Yueyi Zhang

Mike Zheng Shou

223

10

0

29 May 2024

Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval

287

11

0

29 May 2024

Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media

231

1

0

09 May 2024

From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search

Gangyi Ding

492

23

0

16 Apr 2024

Bridging Vision and Language Spaces with Assignment Prediction

385

13

0

15 Apr 2024

TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning

Quang Minh Dinh

Hung Phong Tran

300

25

0

14 Apr 2024

Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation

Jing Liu

214

3

0

12 Apr 2024

GUIDE: Graphical User Interface Data for Execution

Ishaan Bhola

245

5

0

09 Apr 2024

SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining

Chull Hwan Song

211

12

0

01 Apr 2024

Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training

...

Ji Zhang

Fei Huang

Bing Li

Weiming Hu

354

1

0

01 Mar 2024

MaskFi: Unsupervised Learning of WiFi and Vision Representations for Multimodal Human Activity Recognition

Yuecong Xu

Lihua Xie

326

8

0

29 Feb 2024

OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

Ruslan Salakhutdinov

546

122

0

27 Feb 2024

Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis

Li Du

Shanghang Zhang

185

6

0

31 Jan 2024

M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval

Ming Yang

291

5

0

31 Jan 2024

SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks

247

6

0

31 Jan 2024

Memory-Inspired Temporal Prompt Interaction for Text-Image Classification

Yen-Wei Chen

281

2

0

26 Jan 2024

Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection

Fei Huang

254

1

0

11 Jan 2024

CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification

European Conference on Information Retrieval (ECIR), 2024

323

13

0

11 Jan 2024

Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding

Samyak Parajuli

...

Eugenia D Fomitcheva

Somayeh Sojoudi

227

2

0

09 Jan 2024

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models

363

13

0

06 Jan 2024

TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion

393

49

0

21 Dec 2023

TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training

AAAI Conference on Artificial Intelligence (AAAI), 2023

Ji Zhang

383

6

0

14 Dec 2023
