arXiv: 1908.02265
ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
Neural Information Processing Systems (NeurIPS), 2019
6 August 2019
Jiasen Lu
Dhruv Batra
Devi Parikh
Stefan Lee
SSL
VLM
Papers citing
"ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks"
50 / 2,232 papers shown
Describe Anything Model for Visual Question Answering on Text-rich Images
Yen-Linh Vu
Dinh-Thang Duong
Truong-Binh Duong
Anh-Khoi Nguyen
Thanh-Huy Nguyen
...
Jianhua Xing
Xingjian Li
Tianyang Wang
Ulas Bagci
Min Xu
VLM
280
2
0
16 Jul 2025
ChordPrompt: Orchestrating Cross-Modal Prompt Synergy for Multi-Domain Incremental Learning in CLIP
Zhiyuan Wang
Bokui Chen
VLM
LRM
210
0
0
24 Jun 2025
LaVi: Efficient Large Vision-Language Models via Internal Feature Modulation
Tongtian Yue
Longteng Guo
Yepeng Tang
Zijia Zhao
Xinxin Zhu
Hua Huang
Jing Liu
MLLM
VLM
162
2
0
20 Jun 2025
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
Yong-Jin Liu
SongLi Wu
Sule Bai
Jiahao Wang
Yitong Wang
Yansong Tang
VLM
VOS
330
2
0
19 Jun 2025
Understanding GUI Agent Localization Biases through Logit Sharpness
Xingjian Tao
Yiwei Wang
Yujun Cai
Zhicheng YANG
Jing Tang
LLMAG
175
4
0
18 Jun 2025
Segmenting Visuals With Querying Words: Language Anchors For Semi-Supervised Image Segmentation
Numair Nadeem
Saeed Anwar
Muhammad Asad
Abdul Bais
VLM
300
0
0
16 Jun 2025
Dynamic Modality Scheduling for Multimodal Large Models via Confidence, Uncertainty, and Semantic Consistency
Hiroshi Tanaka
Anika Rao
Hana Satou
Michael Johnson
Sofia García
160
0
0
15 Jun 2025
Generative or Discriminative? Revisiting Text Classification in the Era of Transformers
Siva Rajesh Kasa
Karan Gupta
Sumegh Roychowdhury
Ashutosh Kumar
Yaswanth Biruduraju
Santhosh Kumar Kasa
Nikhil Pattisapu
Arindam Bhattacharya
Shailendra Agarwal
Vijay Huddar
187
3
0
13 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE
VLM
308
0
0
13 Jun 2025
Intention-Conditioned Flow Occupancy Models
Chongyi Zheng
S. Park
Sergey Levine
Benjamin Eysenbach
AI4TS
OffRL
AI4CE
304
2
0
10 Jun 2025
MrM: Black-Box Membership Inference Attacks against Multimodal RAG Systems
Peiru Yang
Jinhua Yin
Haoran Zheng
Xueying Bai
Huili Wang
Yufei Sun
Xintian Li
Shangguang Wang
Yongfeng Huang
Tao Qi
AAML
168
0
0
09 Jun 2025
Representation Decomposition for Learning Similarity and Contrastness Across Modalities for Affective Computing
Yuanhe Tian
Pengsen Cheng
Guoqing Jin
Lei Zhang
Yan Song
132
3
0
08 Jun 2025
OpenFace 3.0: A Lightweight Multitask System for Comprehensive Facial Behavior Analysis
IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2025
Jiewen Hu
Leena Mathur
Paul Pu Liang
Louis-Philippe Morency
CVBM
183
1
0
03 Jun 2025
MINT: Multimodal Instruction Tuning with Multimodal Interaction Grouping
Xiaojun Shan
Qi Cao
Xing Han
Haofei Yu
Paul Liang
280
1
0
02 Jun 2025
GETReason: Enhancing Image Context Extraction through Hierarchical Multi-Agent Reasoning
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Shikhhar Siingh
Abhinav Rawat
Chitta Baral
Vivek Gupta
342
0
0
28 May 2025
Flexible Tool Selection through Low-dimensional Attribute Alignment of Vision and Language
Guangfu Hao
Haojie Wen
Liangxuna Guo
Yang Chen
Yanchao Bi
S. Yu
320
0
0
28 May 2025
E2E Process Automation Leveraging Generative AI and IDP-Based Automation Agent: A Case Study on Corporate Expense Processing
Cheonsu Jeong
Seongmin Sim
Hyoyoung Cho
Sungsu Kim
Byounggwan Shin
275
3
0
27 May 2025
Co-AttenDWG: Co-Attentive Dimension-Wise Gating and Expert Fusion for Multi-Modal Offensive Content Detection
IEEE Transactions on Artificial Intelligence (IEEE TAI), 2025
Md. Mithun Hossain
Md. Shakil Hossain
Sudipto Chaki
M. F. Mridha
444
0
0
25 May 2025
Representation Discrepancy Bridging Method for Remote Sensing Image-Text Retrieval
Hailong Ning
Siying Wang
Tao Lei
Xiaopeng Cao
Huanmin Dou
Bin Zhao
Asoke K. Nandi
Petia Radeva
157
1
0
22 May 2025
R1-ShareVL: Incentivizing Reasoning Capability of Multimodal Large Language Models via Share-GRPO
Huanjin Yao
Qixiang Yin
Jingyi Zhang
Min Yang
Yibo Wang
...
Fei Su
Li Shen
Minghui Qiu
Dacheng Tao
Jiaxing Huang
LRM
299
25
0
22 May 2025
Large Language models for Time Series Analysis: Techniques, Applications, and Challenges
Feifei Shi
Xueyan Yin
Kang Wang
Wanyu Tu
Qifu Sun
Huansheng Ning
AI4TS
206
0
0
21 May 2025
Domain Adaptation of VLM for Soccer Video Understanding
Tiancheng Jiang
Henry Wang
Md Sirajus Salekin
Parmida Atighehchian
Shinan Zhang
VLM
366
3
0
20 May 2025
InstanceBEV: Unifying Instance and BEV Representation for 3D Panoptic Segmentation
Feng Li
Zhaoyue Wang
Zhaoyue Wang
Mohammad Masum Billah
Yunduan Cui
Kun Xu
329
1
0
20 May 2025
ReactDiff: Latent Diffusion for Facial Reaction Generation
Neural Networks (NN), 2025
Jiaming Li
Sheng Wang
Xin Wang
Yitao Zhu
Honglin Xiong
Zixu Zhuang
Qian Wang
DiffM
VGen
278
1
0
20 May 2025
Multi-modal contrastive learning adapts to intrinsic dimensions of shared latent variables
Yu Gui
Cong Ma
Zongming Ma
SSL
315
2
0
18 May 2025
Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models
Kai Tang
Jinhao You
Xiuqi Ge
Hanze Li
Yichen Guo
Xiande Huang
MLLM
484
3
0
18 May 2025
Enhanced Multimodal Hate Video Detection via Channel-wise and Modality-wise Fusion
Yinghui Zhang
Tailin Chen
Yuchen Zhang
Zeyu Fu
245
6
0
17 May 2025
Open Set Domain Adaptation with Vision-language models via Gradient-aware Separation
Applied and Computational Engineering (ACE), 2025
Haoyang Chen
VLM
242
0
0
16 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Shibin Mei
Hang Wang
Bingbing Ni
314
0
0
16 May 2025
DPSeg: Dual-Prompt Cost Volume Learning for Open-Vocabulary Semantic Segmentation
Computer Vision and Pattern Recognition (CVPR), 2025
Ziyu Zhao
Xiaoguang Li
Linjia Shi
Nasrin Imanpour
Song Wang
VLM
241
2
0
16 May 2025
On the Interplay of Human-AI Alignment, Fairness, and Performance Trade-offs in Medical Imaging
International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2025
Haozhe Luo
Ziyu Zhou
Zixin Shu
Aurélie Pahud de Mortanges
Robert Berke
Mauricio Reyes
226
1
0
15 May 2025
Prioritizing Image-Related Tokens Enhances Vision-Language Pre-Training
Yiran Chen
Yuan Yao
Tong Zhang
Heng Ji
VLM
352
1
0
13 May 2025
Probabilistic Embeddings for Frozen Vision-Language Models: Uncertainty Quantification with Gaussian Process Latent Variable Models
Conference on Uncertainty in Artificial Intelligence (UAI), 2025
Aishwarya Venkataramanan
P. Bodesheim
Joachim Denzler
BDL
VLM
410
2
0
08 May 2025
OpenVision: A Fully-Open, Cost-Effective Family of Advanced Vision Encoders for Multimodal Learning
Xianhang Li
Yixiao Liu
Haoqin Tu
Hongru Zhu
Cihang Xie
VLM
895
5
0
07 May 2025
Unified Multimodal Understanding and Generation Models: Advances, Challenges, and Opportunities
Wei Wei
Jintao Guo
Shanshan Zhao
Minghao Fu
Lunhao Duan
...
Guo-Hua Wang
Qing-Guo Chen
Zhao Xu
Weihua Luo
Kaifu Zhang
DiffM
1.1K
31
0
05 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP
CoGe
VLM
220
0
0
04 May 2025
A Survey of Robotic Navigation and Manipulation with Physics Simulators in the Era of Embodied AI
Lik Hang Kenny Wong
Xueyang Kang
Kaixin Bai
Jianwei Zhang
394
11
0
01 May 2025
Synergy-CLIP: Extending CLIP with Multi-modal Integration for Robust Representation Learning
IEEE Access (IEEE Access), 2025
Sangyeon Cho
Jangyeong Jeon
Mingi Kim
Junyeong Kim
CLIP
VLM
452
1
0
30 Apr 2025
Detecting and Mitigating Hateful Content in Multimodal Memes with Vision-Language Models
Minh-Hao Van
Xintao Wu
VLM
365
1
0
30 Apr 2025
DOPE: Dual Object Perception-Enhancement Network for Vision-and-Language Navigation
International Conference on Multimedia Retrieval (ICMR), 2025
Yinfeng Yu
Dongsheng Yang
344
2
0
30 Apr 2025
Multimodal Large Language Models for Medicine: A Comprehensive Survey
Jiarui Ye
Hao Tang
LM&MA
484
14
0
29 Apr 2025
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
Hugo Georgenthum
Cristian Cosentino
Fabrizio Marozzo
Pietro Liò
MedIm
923
1
0
28 Apr 2025
A Survey of Task-Oriented Knowledge Graph Reasoning: Status, Applications, and Prospects
Guanglin Niu
Bo Li
Yangguang Lin
LRM
273
2
0
27 Apr 2025
ShapeSpeak: Body Shape-Aware Textual Alignment for Visible-Infrared Person Re-Identification
Shuanglin Yan
Neng Dong
Shuang Li
Rui Yan
Hao Tang
Jing Qin
1.0K
0
0
25 Apr 2025
A Genealogy of Foundation Models in Remote Sensing
Kevin Lane
Morteza Karimzadeh
350
1
0
24 Apr 2025
Detecting and Understanding Hateful Contents in Memes Through Captioning and Visual Question-Answering
International Conference on Conceptual Structures (ICCS), 2025
Ali Anaissi
Junaid Akram
Kunal Chaturvedi
Ali Braytee
255
2
0
23 Apr 2025
VLM as Policy: Common-Law Content Moderation Framework for Short Video Platform
Xingyu Lu
Tianke Zhang
Chang Meng
Xinyu Wang
Jinpeng Wang
...
Hai-Tao Zheng
Fan Yang
Yan Li
Di Zhang
Kun Gai
OffRL
242
6
0
21 Apr 2025
EmoSEM: Segment and Explain Emotion Stimuli in Visual Art
Jing Zhang
Dan Guo
Zhangbin Li
Meng Wang
304
0
0
20 Apr 2025
Hadamard product in deep learning: Introduction, Advances and Challenges
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2025
Grigorios G. Chrysos
Yongtao Wu
Razvan Pascanu
Philip Torr
Volkan Cevher
AAML
348
14
0
17 Apr 2025
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
Efthymios Georgiou
Vassilis Katsouros
Yannis Avrithis
Alexandros Potamianos
394
1
0
15 Apr 2025