ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

Neural Information Processing Systems (NeurIPS), 2019
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL, VLM

Papers citing "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"

50 / 2,231 papers shown
InternVideo-Next: Towards General Video Foundation Models without Video-Text Supervision
Chenting Wang
Yuhan Zhu
Yicheng Xu
Jiange Yang
Ziang Yan
Yali Wang
Yi Wang
Limin Wang
VGen
77
0
0
01 Dec 2025
PSA-MF: Personality-Sentiment Aligned Multi-Level Fusion for Multimodal Sentiment Analysis
Heng Xie
Kang Zhu
Zhengqi Wen
Jianhua Tao
Xuefei Liu
Ruibo Fu
Changsheng Li
128
0
0
01 Dec 2025
Generative Adversarial Gumbel MCTS for Abstract Visual Composition Generation
Zirui Zhao
Boye Niu
David Hsu
W. Lee
GAN
108
0
0
01 Dec 2025
Artwork Interpretation with Vision Language Models: A Case Study on Emotions and Emotion Symbols
Sebastian Padó
Kerstin Thomas
33
0
0
28 Nov 2025
VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models
Silin Cheng
Kai Han
MLLM, VPVLM, VLM
190
0
0
27 Nov 2025
Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
Jiaying Hong
Ting Zhu
Thanet Markchom
Huizhi Liang
8
0
0
27 Nov 2025
DialBench: Towards Accurate Reading Recognition of Pointer Meter using Large Foundation Models
Futian Wang
Chaoliu Weng
Xiao Wang
Zhen Chen
Zhicheng Zhao
Jin Tang
16
0
0
26 Nov 2025
ScenarioCLIP: Pretrained Transferable Visual Language Models and Action-Genome Dataset for Natural Scene Analysis
Advik Sinha
Saurabh Atreya
Aashutosh A V
Sk Aziz Ali
Abhijit Das
CLIP
112
0
0
25 Nov 2025
Towards Efficient VLMs: Information-Theoretic Driven Compression via Adaptive Structural Pruning
Zhaoqi Xu
Yingying Zhang
Jian Li
Jianwei Guo
Qiannan Zhu
Hua Huang
VLM
40
0
0
24 Nov 2025
Affective Multimodal Agents with Proactive Knowledge Grounding for Emotionally Aligned Marketing Dialogue
Lin Yu
Xiaofei Han
Yifei Kang
Chiung-Yi Tseng
Danyang Zhang
Ziqian Bi
Zhimo Han
8
0
0
21 Nov 2025
C2F-Space: Coarse-to-Fine Space Grounding for Spatial Instructions using Vision-Language Models
Nayoung Oh
Dohyun Kim
Junhyeong Bang
Rohan Paul
Daehyung Park
107
0
0
19 Nov 2025
MFI-ResNet: Efficient ResNet Architecture Optimization via MeanFlow Compression and Selective Incubation
Nuolin Sun
Linyuan Wang
Haonan Wei
Lei Li
Bin Yan
113
0
0
16 Nov 2025
Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy
Italian National Conference on Sensors (INS), 2025
Vinit Mehta
Charu Sharma
Karthick Thiyagarajan
LM&Ro
348
0
0
14 Nov 2025
LEMUR: Large scale End-to-end MUltimodal Recommendation
Computers & graphics (CG), 2024
Xintian Han
Honggang Chen
Quan Lin
Jingyue Gao
X. Ren
...
Zhe Wang
Yuchao Zheng
Jingjian Lin
Di Wu
Junfeng Ge
OffRL
108
0
0
14 Nov 2025
MULTIBENCH++: A Unified and Comprehensive Multimodal Fusion Benchmarking Across Specialized Domains
Leyan Xue
Zongbo Han
Kecheng Xue
Xiaohong Liu
Guangyu Wang
C. Zhang
108
0
0
09 Nov 2025
Semantic-Guided Natural Language and Visual Fusion for Cross-Modal Interaction Based on Tiny Object Detection
Xian-Hong Huang
Hui-Kai Su
Chi-Chia Sun
Jun-Wei Hsieh
ObjD
324
0
0
07 Nov 2025
Dynamic Routing Between Experts: A Data-Efficient Approach to Continual Learning in Vision-Language Models
Jay Mohta
Kenan E. Ak
Dimitrios Dimitriadis
Yan Xu
Mingwei Shen
CLL, VLM
242
0
0
03 Nov 2025
A Retrospect to Multi-prompt Learning across Vision and Language
IEEE International Conference on Computer Vision (ICCV), 2023
Ziliang Chen
Xin Huang
Quanlong Guan
Liang Lin
Weiqi Luo
VPVLM, VLM
349
9
0
31 Oct 2025
FOCUS: Efficient Keyframe Selection for Long Video Understanding
Zirui Zhu
Hailun Xu
Yang Luo
Yong Liu
Kanchan Sarkar
Zhenheng Yang
Yang You
124
0
0
31 Oct 2025
Masked Diffusion Captioning for Visual Feature Learning
Chao Feng
Zihao Wei
Andrew Owens
DiffM
199
0
0
30 Oct 2025
Gaze-VLM: Bridging Gaze and VLMs through Attention Regularization for Egocentric Understanding
Anupam Pani
Yanchao Yang
76
0
0
24 Oct 2025
Modest-Align: Data-Efficient Alignment for Vision-Language Models
Jiaxiang Liu
Yuan Wang
Jiawei Du
Joey Tianyi Zhou
Mingkun Xu
Zuozhu Liu
VLM
112
0
0
24 Oct 2025
Tibetan Language and AI: A Comprehensive Survey of Resources, Methods and Challenges
Cheng Huang
Nyima Tashi
Fan Gao
Yutong Liu
J. Li
...
Guojie Tang
Xiangxiang Wang
Jia Zhang
Tsengdar J. Lee
Yongbin Yu
96
0
0
22 Oct 2025
Towards a Generalizable Fusion Architecture for Multimodal Object Detection
Jad Berjawi
Yoann Dupas
Christophe Cérin
65
0
0
20 Oct 2025
ProtoMol: Enhancing Molecular Property Prediction via Prototype-Guided Multimodal Learning
Yingxu Wang
Kunyu Zhang
Jiaxin Huang
Nan Yin
Siwei Liu
Eran Segal
112
2
0
19 Oct 2025
The Hidden Cost of Modeling P(X): Vulnerability to Membership Inference Attacks in Generative Text Classifiers
Owais Makroo
Siva Rajesh Kasa
Sumegh Roychowdhury
Karan Gupta
Nikhil Pattisapu
Santhosh Kumar Kasa
Sumit Negi
SILM
178
0
0
17 Oct 2025
FlexiReID: Adaptive Mixture of Expert for Multi-Modal Person Re-Identification
Zhen Sun
Lei Tan
Yunhang Shen
Chengmao Cai
Xing Sun
Pingyang Dai
Liujuan Cao
Rongrong Ji
68
0
0
17 Oct 2025
Hi-Agent: Hierarchical Vision-Language Agents for Mobile Device Control
Zhe Wu
Hongjin Lu
Junliang Xing
C. Zhang
Yin Zhu
...
Kai Li
Kun Shao
Jianye Hao
Jun Wang
Yuanchun Shi
LM&Ro
96
0
0
16 Oct 2025
MaskCaptioner: Learning to Jointly Segment and Caption Object Trajectories in Videos
Gabriel Fiastre
Antoine Yang
Cordelia Schmid
VOS
353
0
0
16 Oct 2025
Pretraining in Actor-Critic Reinforcement Learning for Robot Locomotion
Jiale Fan
Andrei Cramariuc
Tifanny Portela
Marco Hutter
88
0
0
14 Oct 2025
Unifying Vision-Language Latents for Zero-label Image Caption Enhancement
Sanghyun Byun
Jung Guack
Mohanad Odema
Baisub Lee
Jacob Song
Woo Seong Chung
VLM
67
0
0
14 Oct 2025
CoDefend: Cross-Modal Collaborative Defense via Diffusion Purification and Prompt Optimization
Fengling Zhu
Boshi Liu
Jingyu Hua
Sheng Zhong
DiffM, AAML
98
0
0
13 Oct 2025
Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
A H M Rezaul Karim
Ozlem Uzuner
92
0
0
12 Oct 2025
BitMar: Low-Bit Multimodal Fusion with Episodic Memory for Edge Devices
Euhid Aman
Esteban Carlin
Hsing-Kuo Pao
Giovanni Beltrame
Ghaluh Indah Permata Sari
Yie-Tarng Chen
100
0
0
12 Oct 2025
Image-to-Video Transfer Learning based on Image-Language Foundation Models: A Comprehensive Survey
Jinxuan Li
Chaolei Tan
Haoxuan Chen
Jianxin Ma
Jian-Fang Hu
Wei-Shi Zheng
Jianhuang Lai
VLM
121
1
0
12 Oct 2025
Learning from Mistakes: Enhancing Harmful Meme Detection via Misjudgment Risk Patterns
Wenshuo Wang
Ziyou Jiang
Junjie Wang
Mingyang Li
Jie Huang
Yuekai Huang
Zhiyuan Chang
Feiyan Duan
Qing Wang
139
0
0
10 Oct 2025
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
Weikai Huang
Jieyu Zhang
Taoyang Jia
Chenhao Zheng
Ziqi Gao
J. S. Park
Winson Han
Ranjay Krishna
185
0
0
10 Oct 2025
Zero-shot image privacy classification with Vision-Language Models
Alina Elena Baia
Alessio Xompero
Andrea Cavallaro
VLM
72
0
0
10 Oct 2025
The Artificial Intelligence Cognitive Examination: A Survey on the Evolution of Multimodal Evaluation from Recognition to Reasoning
Mayank Ravishankara
Varindra V. Persad Maharaj
ELM
149
0
0
05 Oct 2025
GROK: From Quantitative Biomarkers to Qualitative Diagnosis via a Grounded MLLM with Knowledge-Guided Instruction
Zhuangzhi Gao
Hongyi Qin
He Zhao
Qinkai Yu
Feixiang Zhou
...
Uazman Alam
Alena Shantsila
Wahbi El-Bouri
Gregory Y.H. Lip
Yalin Zheng
81
0
0
05 Oct 2025
Landmark-Guided Knowledge for Vision-and-Language Navigation
International Conference on Intelligent Computing (ICIC), 2025
Dongsheng Yang
Meiling Zhu
Yinfeng Yu
LM&Ro
107
0
0
30 Sep 2025
Resolving Ambiguity in Gaze-Facilitated Visual Assistant Interaction Paradigm
Zeyu Wang
Baiyu Chen
Kun Yan
Hongjing Piao
Hao Xue
Flora D. Salim
Yuanchun Shi
Yuntao Wang
84
0
0
26 Sep 2025
Johnson-Lindenstrauss Lemma Guided Network for Efficient 3D Medical Segmentation
Jinpeng Lu
Linghan Cai
Yinda Chen
Guo Tang
Songhan Jiang
Haoyuan Shi
Zhiwei Xiong
118
0
0
26 Sep 2025
Integrating Object Interaction Self-Attention and GAN-Based Debiasing for Visual Question Answering
Zhifei Li
Feng Qiu
Yiran Wang
Yujing Xia
Kui Xiao
Miao Zhang
Yan Zhang
124
0
0
25 Sep 2025
Retrieval over Classification: Integrating Relation Semantics for Multimodal Relation Extraction
Lei Hei
Tingjing Liao
Yingxin Pei
Yiyang Qi
Jiaqi Wang
Ruiting Li
Feiliang Ren
108
0
0
25 Sep 2025
Audio-Visual Separation with Hierarchical Fusion and Representation Alignment
Han Hu
Dongheng Lin
Qiming Huang
Yuqi Hou
Hyung Jin Chang
Jianbo Jiao
80
0
0
24 Sep 2025
Single-Branch Network Architectures to Close the Modality Gap in Multimodal Recommendation
Christian Ganhor
Marta Moscati
Anna Hausberger
Shah Nawaz
Markus Schedl
HAI, OffRL
108
0
0
23 Sep 2025
Pure Vision Language Action (VLA) Models: A Comprehensive Survey
Dapeng Zhang
Jin Sun
Chenghui Hu
Xiaoyan Wu
Zhenlong Yuan
R. Zhou
Fei Shen
Qingguo Zhou
LM&Ro
225
15
0
23 Sep 2025
LCMF: Lightweight Cross-Modality Mambaformer for Embodied Robotics VQA
Zeyi Kang
Liang He
Yanxin Zhang
Zuheng Ming
Kaixing Zhao
88
0
0
23 Sep 2025
M3ET: Efficient Vision-Language Learning for Robotics based on Multimodal Mamba-Enhanced Transformer
Yanxin Zhang
Liang He
Zeyi Kang
Zuheng Ming
Kaixing Zhao
Mamba
122
0
0
22 Sep 2025