ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for
  Vision-and-Language Tasks

Neural Information Processing Systems (NeurIPS), 2019
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL, VLM

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,235 papers shown
Pure Vision Language Action (VLA) Models: A Comprehensive Survey
Dapeng Zhang
Jin Sun
Chenghui Hu
Xiaoyan Wu
Zhenlong Yuan
R. Zhou
Fei Shen
Qingguo Zhou
LM&Ro
379
22
0
23 Sep 2025
M3ET: Efficient Vision-Language Learning for Robotics based on Multimodal Mamba-Enhanced Transformer
Yanxin Zhang
Liang He
Zeyi Kang
Zuheng Ming
Kaixing Zhao
Mamba
167
0
0
22 Sep 2025
VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analy
Shubhashis Roy Dipta
Tz-Ying Wu
Subarna Tripathi
212
0
0
20 Sep 2025
TriSPrompt: A Hierarchical Soft Prompt Model for Multimodal Rumor Detection with Incomplete Modalities
Jiajun Chen
Yangyang Wu
Xiaoye Miao
Mengying Zhu
Meng Xi
115
3
0
18 Sep 2025
Hierarchical Self-Attention: Generalizing Neural Attention Mechanics to Multi-Scale Problems
Saeed Amizadeh
Sara Abdali
Yinheng Li
K. Koishida
217
0
0
18 Sep 2025
Copycat vs. Original: Multi-modal Pretraining and Variable Importance in Box-office Prediction
Qin Chao
Eunsoo Kim
Boyang Albert Li
172
0
0
18 Sep 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu
Shengwu Xiong
Jiajun Zhang
Yaxiong Chen
Bowen Zhou
...
Yang Yang
Yanglin Deng
Yashu Kang
Ye Yuan
Y. Wen
LRM
160
3
0
17 Sep 2025
Cross-modal Full-mode Fine-grained Alignment for Text-to-Image Person Retrieval
Hao Yin
Xin Man
Feiyu Chen
Jie Shao
Heng Tao Shen
178
0
0
17 Sep 2025
TFANet: Three-Stage Image-Text Feature Alignment Network for Robust Referring Image Segmentation
Qianqi Lu
Yuxiang Xie
Jing Zhang
Shiwei Zou
Yan Chen
Xidao Luan
224
0
0
16 Sep 2025
Biomedical Hypothesis Explainability with Graph-Based Context Retrieval
bioRxiv (bioRxiv), 2025
Ilya Tyagin
Saeideh Valipour
Aliaksandra Sikirzhytskaya
M. Shtutman
Ilya Safro
144
0
0
15 Sep 2025
Knowledge-Guided Adaptive Mixture of Experts for Precipitation Prediction
Chen Jiang
Kofi Osei
Sai Deepthi Yeddula
Dongji Feng
Wei-Shinn Ku
117
0
0
14 Sep 2025
Towards Understanding Visual Grounding in Visual Language Models
Georgios Pantazopoulos
Eda B. Özyiğit
ObjD
492
4
0
12 Sep 2025
DualTrack: Sensorless 3D Ultrasound needs Local and Global Context
P. Wilson
Matteo Ronchetti
Rüdiger Göbl
Viktoria Markova
Sebastian Rosenzweig
R. Prevost
P. Mousavi
O. Zettinig
101
3
0
11 Sep 2025
SimCroP: Radiograph Representation Learning with Similarity-driven Cross-granularity Pre-training
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025
Rongsheng Wang
Fenghe Tang
Qingsong Yao
Rui Yan
Xu Zhang
...
Haoran Lai
Zhiyang He
Xiaodong Tao
Zihang Jiang
S. Kevin Zhou
MedIm
198
1
0
10 Sep 2025
Parse Graph-Based Visual-Language Interaction for Human Pose Estimation
Shibang Liu
Xuemei Xie
G. Shi
146
0
0
09 Sep 2025
Hybrid-Tower: Fine-grained Pseudo-query Interaction and Generation for Text-to-Video Retrieval
Bangxiang Lan
Ruobing Xie
Ruixiang Zhao
Xingwu Sun
Zhanhui Kang
Gang Yang
Xirong Li
188
1
0
05 Sep 2025
Artificial intelligence for representing and characterizing quantum systems
Yuxuan Du
Yan Zhu
Y. Zhang
Min-hsiu Hsieh
Patrick Rebentrost
...
Ya-Dong Wu
Jens Eisert
G. Chiribella
Dacheng Tao
B. Sanders
209
4
0
05 Sep 2025
Attn-Adapter: Attention Is All You Need for Online Few-shot Learner of Vision-Language Model
Phuoc-Nguyen Bui
Khanh-Binh Nguyen
Hyunseung Choo
VLM
323
0
0
04 Sep 2025
Structure-aware Contrastive Learning for Diagram Understanding of Multimodal Models
Hiroshi Sasaki
VLM
202
2
0
02 Sep 2025
Street-Level Geolocalization Using Multimodal Large Language Models and Retrieval-Augmented Generation
Yunus Serhat Bicakci
Joseph Shingleton
Anahid Basiri
135
0
0
01 Sep 2025
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Zhenwei Tang
Difan Jiao
Blair Yang
Ashton Anderson
VLM, CoGe
187
1
0
25 Aug 2025
Limitations of Normalization in Attention Mechanism
Timur Mudarisov
Mikhail Burtsev
Tatiana Petrova
Radu State
140
2
0
25 Aug 2025
Explain Before You Answer: A Survey on Compositional Visual Reasoning
Fucai Ke
Joy Hsu
Zhixi Cai
Zixian Ma
Xin Zheng
...
P. D. Haghighi
Gholamreza Haffari
Ranjay Krishna
Jiajun Wu
H. Rezatofighi
ReLM, CoGe, LRM
406
13
0
24 Aug 2025
Cross-Attention Multimodal Fusion for Breast Cancer Diagnosis: Integrating Mammography and Clinical Data with Explainability
Muhaisin Tiyumba Nantogmah
Abdul-Barik Alhassan
Salamudeen Alhassan
175
0
0
21 Aug 2025
GazeProphet: Software-Only Gaze Prediction for VR Foveated Rendering
Farhaan Ebadulla
Chiraag Mudlapur
Gaurav BV
182
0
0
19 Aug 2025
VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine
Ziyang Zhang
Yang Yu
Xulei Yang
S. Yeo
VLM
140
1
0
16 Aug 2025
Recent Advances in Transformer and Large Language Models for UAV Applications
Hamza Kheddar
Yassine Habchi
Mohamed Chahine Ghanem
Mustapha Hemis
Dusit Niyato
197
7
0
15 Aug 2025
A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering
Chenliang Zhang
Lin Wang
Yuanyuan Lu
Yusheng Qi
Kexin Wang
P. Hou
Wenshi Chen
RALM
202
1
0
14 Aug 2025
AME: Aligned Manifold Entropy for Robust Vision-Language Distillation
Guiming Cao
Yuming Ou
AAML, VLM
224
2
0
12 Aug 2025
FLUID: Flow-Latent Unified Integration via Token Distillation for Expert Specialization in Multimodal Learning
Van Duc Cuong
Ta Dinh Tam
Tran Duc Chinh
Nguyen Thi Hanh
165
1
0
10 Aug 2025
Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges
Haifeng Li
Wang Guo
Haiyang Wu
Mengwei Wu
Jipeng Zhang
Qing Zhu
Yu Liu
Xin Huang
Chao Tao
190
2
0
09 Aug 2025
Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian
Chenhao Lin
Zhengyu Zhao
Qian Li
Shuai Liu
Chao Shen
AAML
195
0
0
09 Aug 2025
Natural Language-Driven Viewpoint Navigation for Volume Exploration via Semantic Block Representation
Xuan Zhao
Jun Tao
125
0
0
09 Aug 2025
MultiCheck: Strengthening Web Trust with Unified Multimodal Fact Verification
Aditya Kishore
Gaurav Kumar
Jasabanta Patro
215
0
0
07 Aug 2025
Surformer v1: Transformer-Based Surface Classification Using Tactile and Vision Features
Manish Kansana
Elias Hossain
Shahram Rahimi
Noorbakhsh Amiri Golilarz
ViT
161
3
0
07 Aug 2025
Accelerating Conditional Prompt Learning via Masked Image Modeling for Vision-Language Models
Phuoc-Nguyen Bui
Khanh-Binh Nguyen
Hyunseung Choo
VLM
307
1
0
07 Aug 2025
Does Multimodality Improve Recommender Systems as Expected? A Critical Analysis and Future Directions
Hongyu Zhou
Yinan Zhang
Aixin Sun
Zhiqi Shen
152
2
0
07 Aug 2025
Latent Expression Generation for Referring Image Segmentation and Grounding
S. Yu
Joonbeom Hong
Joonseok Lee
Jeany Son
ObjD
293
2
0
07 Aug 2025
RegionMed-CLIP: A Region-Aware Multimodal Contrastive Learning Pre-trained Model for Medical Image Understanding
Tianchen Fang
Guiru Liu
MedIm, VLM
169
3
0
07 Aug 2025
Chain of Questions: Guiding Multimodal Curiosity in Language Models
Nima Iji
Kia Dashtipour
LRM
195
0
0
06 Aug 2025
Parameter-Efficient Single Collaborative Branch for Recommendation
ACM Conference on Recommender Systems (RecSys), 2025
Marta Moscati
Shah Nawaz
Markus Schedl
BDL
201
0
0
05 Aug 2025
VLM4D: Towards Spatiotemporal Awareness in Vision Language Models
Shijie Zhou
Alexander Vilesov
Xuehai He
Ziyu Wan
Shuwang Zhang
Aditya Nagachandra
Di Chang
DongDong Chen
Xin Eric Wang
A. Kadambi
VLM
269
0
0
04 Aug 2025
A Unified Perception-Language-Action Framework for Adaptive Autonomous Driving
Yi Zhang
Erik Leo Haß
Kuo-Yi Chao
Nenad Petrovic
Yinglei Song
Chengdong Wu
Alois C. Knoll
187
3
0
31 Jul 2025
From Image Captioning to Visual Storytelling
Admitos Passadakis
Yingjin Song
Albert Gatt
DiffM
278
0
0
31 Jul 2025
DeltaVLM: Interactive Remote Sensing Image Change Analysis via Instruction-guided Difference Perception
Pei Deng
Wenqian Zhou
Hanlin Wu
157
0
0
30 Jul 2025
Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques
Weide Liu
Wei Zhou
Jun Liu
Ping Hu
Jun Cheng
Jungong Han
Weisi Lin
3DV
290
4
0
30 Jul 2025
Goal-Based Vision-Language Driving
Santosh Patapati
Trisanth Srinivasan
225
0
0
30 Jul 2025
Color as the Impetus: Transforming Few-Shot Learner
Chaofei Qi
Zhitai Liu
Jianbin Qiu
VLM
318
0
0
29 Jul 2025
A Survey on Generative Model Unlearning: Fundamentals, Taxonomy, Evaluation, and Future Direction
Xiaohua Feng
Jiaming Zhang
Fengyuan Yu
C. Wang
Li Zhang
Kaixiang Li
Yuyuan Li
Chaochao Chen
Jianwei Yin
MU
355
2
0
26 Jul 2025
Closing the Modality Gap for Mixed Modality Search
Binxu Li
Yuhui Zhang
Xiaohan Wang
Weixin Liang
Ludwig Schmidt
Serena Yeung-Levy
VLM
178
5
0
25 Jul 2025