Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers

2 April 2020
Zhicheng Huang
Zhaoyang Zeng
Bei Liu
Dongmei Fu
Jianlong Fu
    ViT

Papers citing "Pixel-BERT: Aligning Image Pixels with Text by Deep Multi-Modal Transformers"

50 / 286 papers shown
Solar Multimodal Transformer: Intraday Solar Irradiance Predictor using Public Cameras and Time Series
Yanan Niu
Roy Sarkis
D. Psaltis
Mario Paolone
Christophe Moser
Luisa Lambertini
36
0
0
28 Feb 2025
Vision Language Models in Medicine
Beria Chingnabe Kalpelbe
Angel Gabriel Adaambiik
Wei Peng
VLM
LM&MA
86
2
0
24 Feb 2025
ESANS: Effective and Semantic-Aware Negative Sampling for Large-Scale Retrieval Systems
Haibo Xing
Kanefumi Matsuyama
Hao Deng
Jinxin Hu
Yu Zhang
Xiaoyi Zeng
36
0
0
22 Feb 2025
Enhancing Adversarial Robustness of Vision-Language Models through Low-Rank Adaptation
Yuheng Ji
Yue Liu
Zhicheng Zhang
Zhao Zhang
Yuting Zhao
Gang Zhou
Xingwei Zhang
Xinwang Liu
Xiaolong Zheng
VLM
108
4
0
21 Feb 2025
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan
X. Li
Tao Zhang
Zilong Huang
Shilin Xu
S. Ji
Yunhai Tong
Lu Qi
Jiashi Feng
Ming Yang
VLM
92
11
0
07 Jan 2025
Foundations of GenIR
Qingyao Ai
Jingtao Zhan
Y. Liu
42
0
0
06 Jan 2025
Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation
Xinkai Du
Quanjie Han
Chao Lv
Y. Liu
Yalin Sun
Hao Shu
Hongbo Shan
Maosong Sun
RALM
35
1
0
25 Dec 2024
MIMIC: Multimodal Islamophobic Meme Identification and Classification
Safrin Sanzida Islam
Sahid Hossain Mustakim
Sadia Ahmmed
Md. Faiyaz Abdullah Sayeedi
Swapnil Khandoker
Syed Tasdid Azam Dhrubo
Nahid Md Lokman Hossain
64
0
0
01 Dec 2024
Cross-Modal Pre-Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval
Zengbao Sun
Ming Zhao
Gaorui Liu
Andre Kaup
88
3
0
22 Nov 2024
A Comprehensive Survey on Visual Question Answering Datasets and Algorithms
Raihan Kabir
Naznin Haque
Md. Saiful Islam
Marium-E. Jannat
CoGe
29
1
0
17 Nov 2024
Renaissance: Investigating the Pretraining of Vision-Language Encoders
Clayton Fields
C. Kennington
VLM
19
0
0
11 Nov 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
52
9
0
16 Oct 2024
Harnessing Shared Relations via Multimodal Mixup Contrastive Learning for Multimodal Classification
Raja Kumar
Raghav Singhal
Pranamya Kulkarni
Deval Mehta
Kshitij Jadhav
15
0
0
26 Sep 2024
Multi-Modal Generative AI: Multi-modal LLM, Diffusion and Beyond
Hong Chen
Xin Wang
Yuwei Zhou
Bin Huang
Yipeng Zhang
Wei Feng
Houlun Chen
Zeyang Zhang
Siao Tang
Wenwu Zhu
DiffM
47
7
0
23 Sep 2024
Pixels to Prose: Understanding the art of Image Captioning
Hrishikesh Singh
Aarti Sharma
Millie Pant
3DV
VLM
25
0
0
28 Aug 2024
MRSE: An Efficient Multi-modality Retrieval System for Large Scale E-commerce
Hao Jiang
Haoxiang Zhang
Qingshan Hou
Chaofeng Chen
Weisi Lin
Jingchang Zhang
Annan Wang
14
0
0
27 Aug 2024
Macformer: Transformer with Random Maclaurin Feature Attention
Yuhan Guo
Lizhong Ding
Ye Yuan
Guoren Wang
41
0
0
21 Aug 2024
How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval?
Yuxin Chen
Zongyang Ma
Ziqi Zhang
Zhongang Qi
Chunfeng Yuan
Bing Li
Junfu Pu
Ying Shan
Xiaojuan Qi
Weiming Hu
33
2
0
10 Jul 2024
EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model
Yuxuan Zhang
Tianheng Cheng
Lianghui Zhu
Lei Liu
Heng Liu
Longjin Ran
Xiaoxin Chen
Wenyu Liu
Xinggang Wang
VLM
51
24
0
28 Jun 2024
An Image is Worth 32 Tokens for Reconstruction and Generation
Qihang Yu
Mark Weber
XueQing Deng
Xiaohui Shen
Daniel Cremers
Liang-Chieh Chen
VLM
ViT
44
79
0
11 Jun 2024
Multi-Modal Generative Embedding Model
Feipeng Ma
Hongwei Xue
Guangting Wang
Yizhou Zhou
Fengyun Rao
Shilin Yan
Yueyi Zhang
Siying Wu
Mike Zheng Shou
Xiaoyan Sun
VLM
26
3
0
29 May 2024
Transcending Fusion: A Multi-Scale Alignment Method for Remote Sensing Image-Text Retrieval
Rui Yang
Shuang Wang
Yi Han
Yuanheng Li
Dong Zhao
Dou Quan
Yanhe Guo
Licheng Jiao
44
3
0
29 May 2024
Similarity Guided Multimodal Fusion Transformer for Semantic Location Prediction in Social Media
Zhizhen Zhang
Ning Wang
Haojie Li
Zhihui Wang
29
0
0
09 May 2024
From Data Deluge to Data Curation: A Filtering-WoRA Paradigm for Efficient Text-based Person Search
Jintao Sun
Zhedong Zheng
Gangyi Ding
32
7
0
16 Apr 2024
Bridging Vision and Language Spaces with Assignment Prediction
Jungin Park
Jiyoung Lee
Kwanghoon Sohn
VLM
29
6
0
15 Apr 2024
TrafficVLM: A Controllable Visual Language Model for Traffic Video Captioning
Quang Minh Dinh
Minh Khoi Ho
Anh Quan Dang
Hung Phong Tran
30
6
0
14 Apr 2024
Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation
Yichen Yan
Xingjian He
Sihan Chen
Jing Liu
ObjD
31
1
0
12 Apr 2024
GUIDE: Graphical User Interface Data for Execution
Rajat Chawla
Adarsh Jha
Muskaan Kumar
NS Mukunda
Ishaan Bhola
LLMAG
27
3
0
09 Apr 2024
SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining
Chull Hwan Song
Taebaek Hwang
Jooyoung Yoon
Shunghyun Choi
Yeong Hyeon Gu
21
4
0
01 Apr 2024
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
Haowei Liu
Yaya Shi
Haiyang Xu
Chunfen Yuan
Qinghao Ye
...
Mingshi Yan
Ji Zhang
Fei Huang
Bing Li
Weiming Hu
VLM
27
0
0
01 Mar 2024
MaskFi: Unsupervised Learning of WiFi and Vision Representations for Multimodal Human Activity Recognition
Jianfei Yang
Shijie Tang
Yuecong Xu
Yunjiao Zhou
Lihua Xie
27
4
0
29 Feb 2024
OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web
Raghav Kapoor
Y. Butala
M. Russak
Jing Yu Koh
Kiran Kamble
Waseem Alshikh
Ruslan Salakhutdinov
LLMAG
51
44
0
27 Feb 2024
Proximity QA: Unleashing the Power of Multi-Modal Large Language Models for Spatial Proximity Analysis
Jianing Li
Xi Nan
Ming Lu
Li Du
Shanghang Zhang
40
1
0
31 Jan 2024
M2-RAAP: A Multi-Modal Recipe for Advancing Adaptation-based Pre-training towards Effective and Efficient Zero-shot Video-text Retrieval
Xingning Dong
Zipeng Feng
Chunluan Zhou
Xuzheng Yu
Ming Yang
Qingpei Guo
VLM
25
2
0
31 Jan 2024
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
Xingning Dong
Qingpei Guo
Tian Gan
Qing Wang
Jianlong Wu
Xiangyuan Ren
Yuan-Chia Cheng
Wei Chu
21
5
0
31 Jan 2024
Memory-Inspired Temporal Prompt Interaction for Text-Image Classification
Xinyao Yu
Hao Sun
Ziwei Niu
Rui Qin
Zhenjia Bai
Yen-Wei Chen
Lanfen Lin
VLM
16
2
0
26 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
29
0
0
11 Jan 2024
CrisisKAN: Knowledge-infused and Explainable Multimodal Attention Network for Crisis Event Classification
Shubham Gupta
Nandini Saini
Suman Kundu
Debasis Das
8
6
0
11 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
16
2
0
09 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
43
8
0
06 Jan 2024
TextFusion: Unveiling the Power of Textual Semantics for Controllable Image Fusion
Chunyang Cheng
Tianyang Xu
Xiao-Jun Wu
Hui Li
Xi Li
Zhangyong Tang
Josef Kittler
16
10
0
21 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
11
4
0
14 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
13
3
0
11 Dec 2023
Adventures of Trustworthy Vision-Language Models: A Survey
Mayank Vatsa
Anubhooti Jain
Richa Singh
20
4
0
07 Dec 2023
Bootstrapping Interactive Image-Text Alignment for Remote Sensing Image Captioning
Cong Yang
Zuchao Li
Lefei Zhang
29
23
0
02 Dec 2023
Mug-STAN: Adapting Image-Language Pretrained Models for General Video Understanding
Ruyang Liu
Jingjia Huang
Wei-Nan Gao
Thomas H. Li
Ge Li
VLM
27
3
0
25 Nov 2023
ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
Yangyi Chen
Xingyao Wang
Manling Li
Derek Hoiem
Heng Ji
25
10
0
22 Nov 2023
DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback
Yangyi Chen
Karan Sikka
Michael Cogswell
Heng Ji
Ajay Divakaran
24
58
0
16 Nov 2023
TENT: Connect Language Models with IoT Sensors for Zero-Shot Activity Recognition
Yunjiao Zhou
Jianfei Yang
Han Zou
Lihua Xie
VLM
29
17
0
14 Nov 2023
High-Performance Transformers for Table Structure Recognition Need Early Convolutions
Sheng-Hsuan Peng
Seongmin Lee
Xiaojing Wang
Rajarajeswari Balasubramaniyan
Duen Horng Chau
ViT
LMTD
19
3
0
09 Nov 2023