2004.06165
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
European Conference on Computer Vision (ECCV), 2020
13 April 2020
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
Lei Zhang
Lijuan Wang
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
Papers citing
"Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks"
50 / 1,171 papers shown
SLIP: Structural-aware Language-Image Pretraining for Vision-Language Alignment
Wenbo Lu
CLIP
VLM
65
0
0
04 Nov 2025
Enhancing Adversarial Transferability in Visual-Language Pre-training Models via Local Shuffle and Sample-based Attack
North American Chapter of the Association for Computational Linguistics (NAACL), 2025
Xin Liu
Aoyang Zhou
AAML
28
0
0
02 Nov 2025
Modest-Align: Data-Efficient Alignment for Vision-Language Models
Jiaxiang Liu
Yuan Wang
Jiawei Du
Joey Tianyi Zhou
Mingkun Xu
Zuozhu Liu
VLM
20
0
0
24 Oct 2025
See, Think, Act: Online Shopper Behavior Simulation with VLM Agents
Yimeng Zhang
Jiri Gesi
Ran Xue
Tian Wang
Ziyi Wang
...
Qingjun Cui
Yufan Guo
Jing Huang
Mubarak Shah
Dakuo Wang
OffRL
60
0
0
22 Oct 2025
Graph4MM: Weaving Multimodal Learning with Structural Information
Xuying Ning
Dongqi Fu
Tianxin Wei
Wujiang Xu
Jingrui He
16
3
0
19 Oct 2025
Class Prototypes based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos
Computer Vision and Pattern Recognition (CVPR), 2023
Rohit Gupta
Anirban Roy
Claire Christensen
Sujeong Kim
Sarah Gerard
Madeline Cincebeaux
Ajay Divakaran
Todd Grindal
M. Shah
48
19
0
13 Oct 2025
Towards Self-Refinement of Vision-Language Models with Triangular Consistency
Yunlong Deng
Guangyi Chen
Tianpei Gu
Lingjing Kong
Yan Li
Zeyu Tang
Kun Zhang
24
1
0
12 Oct 2025
Vision Language Models: A Survey of 26K Papers
Fengming Lin
3DV
VLM
70
0
0
10 Oct 2025
Conditional Representation Learning for Customized Tasks
Honglin Liu
Chao Sun
Peng Hu
Yunfan Li
Xi Peng
52
0
0
06 Oct 2025
The Artificial Intelligence Cognitive Examination: A Survey on the Evolution of Multimodal Evaluation from Recognition to Reasoning
Mayank Ravishankara
Varindra V. Persad Maharaj
ELM
85
0
0
05 Oct 2025
CoT Referring: Improving Referring Expression Tasks with Grounded Reasoning
Qihua Dong
Luis Figueroa
Handong Zhao
Kushal Kafle
Jason Kuen
Zhihong Ding
Scott D. Cohen
Y. Fu
ObjD
LRM
90
0
0
03 Oct 2025
Team Xiaomi EV-AD VLA: Caption-Guided Retrieval System for Cross-Modal Drone Navigation -- Technical Report for IROS 2025 RoboSense Challenge Track 4
L. Zhang
Erjia Xiao
Y. Zhang
Haoxiang Fu
Ruibin Hu
Yanbiao Ma
Wenbo Ding
L. Chen
Hangjun Ye
Xiaoshuai Hao
40
0
0
03 Oct 2025
Pre-training CLIP against Data Poisoning with Optimal Transport-based Matching and Alignment
Tong Zhang
Kuofeng Gao
Jiawang Bai
Leo Yu Zhang
Xin Yin
Zonghui Wang
Shouling Ji
Wenzhi Chen
24
1
0
23 Sep 2025
RACap: Relation-Aware Prompting for Lightweight Retrieval-Augmented Image Captioning
Xiaosheng Long
Hanyu Wang
Zhentao Song
Kun Luo
Hongde Liu
40
0
0
19 Sep 2025
MARS2 2025 Challenge on Multimodal Reasoning: Datasets, Methods, Results, Discussion, and Outlook
Peng Xu
Shengwu Xiong
Jiajun Zhang
Yaxiong Chen
Bowen Zhou
...
Yang Yang
Yanglin Deng
Yashu Kang
Ye Yuan
Y. Wen
LRM
43
1
0
17 Sep 2025
DyKen-Hyena: Dynamic Kernel Generation via Cross-Modal Attention for Multimodal Intent Recognition
Yifei Wang
Wenbin Wang
Yong Luo
28
0
0
12 Sep 2025
FLUX-Reason-6M & PRISM-Bench: A Million-Scale Text-to-Image Reasoning Dataset and Comprehensive Benchmark
Rongyao Fang
Aldrich Yu
Chengqi Duan
Linjiang Huang
S. Bai
Yuxuan Cai
Kun Wang
Si Liu
Xihui Liu
Xue Yang
EGVM
VGen
ReLM
LRM
142
4
0
11 Sep 2025
Prototype-Aware Multimodal Alignment for Open-Vocabulary Visual Grounding
Jiangnan Xie
Xiaolong Zheng
Liang Zheng
ObjD
93
0
0
08 Sep 2025
Embedding Font Impression Word Tags Based on Co-occurrence
Yugo Kubota
Seiichi Uchida
3DV
44
0
0
26 Aug 2025
SEAM: Semantically Equivalent Across Modalities Benchmark for Vision-Language Models
Zhenwei Tang
Difan Jiao
Blair Yang
Ashton Anderson
VLM
CoGe
58
1
0
25 Aug 2025
On the Security and Privacy of Federated Learning: A Survey with Attacks, Defenses, Frameworks, Applications, and Future Directions
Daniel Gutiérrez
Yelizaveta Falkouskaya
Jose L. Hernandez-Ramos
Aris Anagnostopoulos
I. Chatzigiannakis
A. Vitaletti
FedML
48
1
0
19 Aug 2025
VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine
Ziyang Zhang
Yang Yu
Xulei Yang
S. Yeo
VLM
50
0
0
16 Aug 2025
WeatherPrompt: Multi-modality Representation Learning for All-Weather Drone Visual Geo-Localization
Jiahao Wen
Hang Yu
Zhedong Zheng
60
1
0
13 Aug 2025
RORPCap: Retrieval-based Objects and Relations Prompt for Image Captioning
Jinjing Gu
Tianbao Qin
Yuanyuan Pu
Zhengpeng Zhao
VLM
40
0
0
10 Aug 2025
Adversarial Video Promotion Against Text-to-Video Retrieval
Qiwei Tian
Chenhao Lin
Zhengyu Zhao
Qian Li
Shuai Liu
Chao Shen
AAML
66
0
0
09 Aug 2025
Remote Sensing Image Intelligent Interpretation with the Language-Centered Perspective: Principles, Methods and Challenges
Haifeng Li
Wang Guo
Haiyang Wu
Mengwei Wu
Jipeng Zhang
Qing Zhu
Yu Liu
Xin Huang
Chao Tao
82
0
0
09 Aug 2025
Modality-Aware Feature Matching: A Comprehensive Review of Single- and Cross-Modality Techniques
Weide Liu
Wei Zhou
Jun Liu
Ping Hu
Jun Cheng
Jungong Han
Weisi Lin
3DV
135
2
0
30 Jul 2025
When Better Eyes Lead to Blindness: A Diagnostic Study of the Information Bottleneck in CNN-LSTM Image Captioning Models
International Journal of Computer Applications (IJCA), 2025
Hitesh Kumar Gupta
VLM
126
0
0
24 Jul 2025
Describe Anything Model for Visual Question Answering on Text-rich Images
Yen-Linh Vu
Dinh-Thang Duong
Truong-Binh Duong
Anh-Khoi Nguyen
Thanh-Huy Nguyen
...
Jianhua Xing
Xingjian Li
Tianyang Wang
Ulas Bagci
Min Xu
VLM
151
2
0
16 Jul 2025
LEGO Co-builder: Exploring Fine-Grained Vision-Language Modeling for Multimodal LEGO Assembly Assistants
Haochen Huang
Jiahuan Pei
Mohammad Aliannejadi
Xin Sun
Moonisa Ahsan
Chuang Yu
Zhaochun Ren
Pablo César
Junxiao Wang
VLM
92
0
0
07 Jul 2025
PEVLM: Parallel Encoding for Vision-Language Models
Letian Kang
Shixian Luo
Yiqiang Li
Yuxin Yin
Shenxuan Zhou
Xiaoyang Yu
Jin Yang
Yong Wu
MLLM
VLM
146
0
0
24 Jun 2025
VLN-R1: Vision-Language Navigation via Reinforcement Fine-Tuning
Zhangyang Qi
Zhixiong Zhang
Yizhou Yu
Jiaqi Wang
Hengshuang Zhao
LM&Ro
AI4TS
191
17
0
20 Jun 2025
Stepping Out of Similar Semantic Space for Open-Vocabulary Segmentation
Yong-Jin Liu
SongLi Wu
Sule Bai
Jiahao Wang
Yitong Wang
Yansong Tang
VLM
VOS
174
0
0
19 Jun 2025
Manager: Aggregating Insights from Unimodal Experts in Two-Tower VLMs and MLLMs
Xiao Xu
L. Qin
Wanxiang Che
Min-Yen Kan
MoE
VLM
164
0
0
13 Jun 2025
FREE: Fast and Robust Vision Language Models with Early Exits
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Divya J. Bajpai
M. Hanawal
VLM
77
1
0
07 Jun 2025
Distill CLIP (DCLIP): Enhancing Image-Text Retrieval via Cross-Modal Transformer Distillation
Daniel Csizmadia
Andrei Codreanu
Victor Sim
Vighnesh Prabhu
Michael Lu
Kevin Zhu
Sean O'Brien
CLIP
VLM
247
2
0
25 May 2025
GeoMM: On Geodesic Perspective for Multi-modal Learning
Computer Vision and Pattern Recognition (CVPR), 2025
Shibin Mei
Hang Wang
Bingbing Ni
154
0
0
16 May 2025
Structural-Temporal Coupling Anomaly Detection with Dynamic Graph Transformer
Chang Zong
Yueting Zhuang
Jian Shao
Weiming Lu
249
1
0
13 May 2025
Compositional Image-Text Matching and Retrieval by Grounding Entities
Madhukar Reddy Vongala
Saurabh Srivastava
Jana Kosecka
CLIP
CoGe
VLM
146
0
0
04 May 2025
Enhancing Surgical Documentation through Multimodal Visual-Temporal Transformers and Generative AI
Hugo Georgenthum
Cristian Cosentino
Fabrizio Marozzo
Pietro Liò
MedIm
739
1
0
28 Apr 2025
Symbolic Representation for Any-to-Any Generative Tasks
Computer Vision and Pattern Recognition (CVPR), 2025
Jianfei Chen
Xiaoye Zhu
Yanjie Wang
Tianyang Liu
Xinhui Chen
...
Yifei Ke
Qingbin Liu
Yiwen Yuan
Julian McAuley
Li Li
DiffM
138
0
0
24 Apr 2025
Towards Explainable AI: Multi-Modal Transformer for Video-based Image Description Generation
Lakshita Agarwal
Bindu Verma
ViT
120
0
0
23 Apr 2025
Zero-Shot, But at What Cost? Unveiling the Hidden Overhead of MILS's LLM-CLIP Framework for Image Captioning
Yassir Benhammou
Alessandro Tiberio
Gabriel Trautmann
Suman Kalyan
MLLM
VLM
121
0
0
21 Apr 2025
DeepMLF: Multimodal language model with learnable tokens for deep fusion in sentiment analysis
Efthymios Georgiou
Vassilis Katsouros
Yannis Avrithis
Alexandros Potamianos
175
1
0
15 Apr 2025
COUNTS: Benchmarking Object Detectors and Multimodal Large Language Models under Distribution Shifts
Computer Vision and Pattern Recognition (CVPR), 2025
Jiansheng Li
Xingxuan Zhang
Hao Zou
Yige Guo
Renzhe Xu
Yilong Liu
Chuzhao Zhu
Yue He
Peng Cui
VLM
165
0
0
14 Apr 2025
The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?
Hao Yin
Guangzong Si
Zilei Wang
856
0
0
14 Apr 2025
How Can Objects Help Video-Language Understanding?
Zitian Tang
Shijie Wang
Junho Cho
Jaewook Yoo
Chen Sun
202
1
0
10 Apr 2025
Enabling Collaborative Parametric Knowledge Calibration for Retrieval-Augmented Vision Question Answering
Jiaqi Deng
Kaize Shi
Zonghan Wu
Huan Huo
Dingxian Wang
Guandong Xu
105
0
0
05 Apr 2025
COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking
Information Fusion (Inf. Fusion), 2025
Chunhui Zhang
Li Liu
Jialin Gao
Xin Sun
Hao Wen
Xi Zhou
Shiming Ge
Yucheng Wang
166
1
0
02 Apr 2025
Enhancing Image Resolution of Solar Magnetograms: A Latent Diffusion Model Approach
Francesco P. Ramunno
Paolo Massa
Vitaliy Kinakh
Brandon Panos
A. Csillaghy
Slava Voloshynovskiy
DiffM
183
0
0
31 Mar 2025