Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
2004.06165
Cited By
v1
v2
v3
v4
v5 (latest)
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
European Conference on Computer Vision (ECCV), 2020
13 April 2020
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
Lei Zhang
Lijuan Wang
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks"
50 / 1,171 papers shown
Title
MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning
Yiwei Ma
Guohai Xu
Xiaoshuai Sun
Jiayi Ji
Jie Lou
Debing Zhang
Rongrong Ji
371
6
0
26 Mar 2025
Unseen from Seen: Rewriting Observation-Instruction Using Foundation Models for Augmenting Vision-Language Navigation
Ziming Wei
Bingqian Lin
Yunshuang Nie
Jiaqi Chen
Shikui Ma
Hang Xu
Xiaodan Liang
298
2
0
23 Mar 2025
HA-VLN 2.0: An Open Benchmark and Leaderboard for Human-Aware Navigation in Discrete and Continuous Environments with Dynamic Multi-Human Interactions
Yifei Dong
Fengyi Wu
Qi He
Heng Li
Heng Li
...
Yuxuan Zhou
Yuxuan Zhou
Jingdong Sun
Zhi-Qi Cheng
Alexander G. Hauptmann
LM&Ro
186
1
0
18 Mar 2025
Stitch-a-Recipe: Video Demonstration from Multistep Descriptions
Chi Hsuan Wu
Kumar Ashutosh
Kristen Grauman
DiffM
194
0
0
18 Mar 2025
Lifting the Veil on Visual Information Flow in MLLMs: Unlocking Pathways to Faster Inference
Computer Vision and Pattern Recognition (CVPR), 2025
Hao Yin
Guangzong Si
Zilei Wang
181
4
0
17 Mar 2025
SuperCap: Multi-resolution Superpixel-based Image Captioning
Henry Senior
Luca Rossi
Gregory Slabaugh
Shanxin Yuan
VLM
207
0
0
11 Mar 2025
A Woman with a Knife or A Knife with a Woman? Measuring Directional Bias Amplification in Image Captions
Rahul Nair
Bhanu Tokas
Neel Shah
215
0
0
10 Mar 2025
TPC: Cross-Temporal Prediction Connection for Vision-Language Model Hallucination Reduction
Chao Wang
Weiwei Fu
Yang Zhou
MLLM
VLM
249
2
0
06 Mar 2025
CL-MoE: Enhancing Multimodal Large Language Model with Dual Momentum Mixture-of-Experts for Continual Visual Question Answering
Computer Vision and Pattern Recognition (CVPR), 2025
Tianyu Huai
Jie Zhou
Xingjiao Wu
Qin Chen
Qingchun Bai
Ze Zhou
Liang He
MoE
185
7
0
01 Mar 2025
Data Distributional Properties As Inductive Bias for Systematic Generalization
Computer Vision and Pattern Recognition (CVPR), 2025
Felipe del-Rio
Alain Raymond-Sáez
Daniel Florea
Rodrigo Toro Icarte
Julio Hurtado
Cristian B. Calderon
Á. Soto
AI4CE
272
1
0
27 Feb 2025
MICINet: Multi-Level Inter-Class Confusing Information Removal for Reliable Multimodal Classification
Tianze Zhang
Shu Shen
Chao Chen
281
0
0
27 Feb 2025
Grad-ECLIP: Gradient-based Visual and Textual Explanations for CLIP
Chenyang Zhao
Kun Wang
J. H. Hsiao
Antoni B. Chan
CLIP
203
6
0
26 Feb 2025
Can Hallucination Correction Improve Video-Language Alignment?
Annual Meeting of the Association for Computational Linguistics (ACL), 2025
Lingjun Zhao
Mingyang Xie
Paola Cascante-Bonilla
Hal Daumé III
Kwonjoon Lee
HILM
VLM
239
1
0
20 Feb 2025
Color Universal Design Neural Network for the Color Vision Deficiencies
Sunyong Seo
Jinho Park
272
2
0
12 Feb 2025
Composite Sketch+Text Queries for Retrieving Objects with Elusive Names and Complex Interactions
AAAI Conference on Artificial Intelligence (AAAI), 2024
Prajwal Gatti
Kshitij Parikh
Dhriti Prasanna Paul
Manish Gupta
Anand Mishra
396
3
0
12 Feb 2025
Vision-Language Models for Edge Networks: A Comprehensive Survey
IEEE Internet of Things Journal (IEEE IoT J.), 2025
Ahmed Sharshar
Latif U. Khan
Waseem Ullah
Mohsen Guizani
VLM
273
10
0
11 Feb 2025
Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search
The Web Conference (WWW), 2025
J. He
Shengeng Tang
Ao Liu
Lechao Cheng
Jingjing Wu
Yanyan Wei
159
2
0
05 Feb 2025
Efficient Redundancy Reduction for Open-Vocabulary Semantic Segmentation
Lin Chen
Qi Yang
Kun Ding
Tianying Wang
Gang Shen
Fei Li
Qiyuan Cao
Shiming Xiang
VLM
127
1
0
29 Jan 2025
The Quest for Visual Understanding: A Journey Through the Evolution of Visual Question Answering
Anupam Pandey
Deepjyoti Bodo
Arpan Phukan
Asif Ekbal
271
2
0
13 Jan 2025
Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation
Xinkai Du
Quanjie Han
Chao Lv
Yi Liu
Yalin Sun
Hao Shu
Hongbo Shan
Maosong Sun
RALM
269
0
0
25 Dec 2024
Reframing Image Difference Captioning with BLIP2IDC and Synthetic Augmentation
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Gautier Evennou
Antoine Chaffin
Vivien Chappelier
Ewa Kijak
DiffM
200
1
0
20 Dec 2024
G-VEval: A Versatile Metric for Evaluating Image and Video Captions Using GPT-4o
AAAI Conference on Artificial Intelligence (AAAI), 2024
Tony Cheng Tong
Sirui He
Z. Shao
Dit-Yan Yeung
192
12
0
18 Dec 2024
Bringing Multimodality to Amazon Visual Search System
Knowledge Discovery and Data Mining (KDD), 2024
Xinliang Zhu
Michael Huang
Han Ding
Jinyu Yang
Kelvin Chen
...
Son Dinh Tran
Benjamin Z. Yao
Doug Gray
Anuj Bindal
Arnab Dhua
191
7
0
17 Dec 2024
Overview of TREC 2024 Medical Video Question Answering (MedVidQA) Track
D. Gupta
Dina Demner-Fushman
LM&MA
155
1
0
15 Dec 2024
UCDR-Adapter: Exploring Adaptation of Pre-Trained Vision-Language Models for Universal Cross-Domain Retrieval
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Haoyu Jiang
Zhi-Qi Cheng
Gabriel Moreira
Jiawen Zhu
Yuxuan Zhou
Bukun Ren
Jun-Yan He
Jingdong Sun
Xian-Sheng Hua
VLM
207
0
0
14 Dec 2024
Advances in Transformers for Robotic Applications: A Review
Nikunj Sanghai
Nik Bear Brown
AI4CE
247
4
0
13 Dec 2024
Composed Image Retrieval for Training-Free Domain Conversion
IEEE Workshop/Winter Conference on Applications of Computer Vision (WACV), 2024
Nikos Efthymiadis
Bill Psomas
Zakaria Laskar
Konstantinos Karantzalos
Yannis Avrithis
Ondřej Chum
Giorgos Tolias
236
2
0
04 Dec 2024
Learning to Reason Iteratively and Parallelly for Complex Visual Reasoning Scenarios
Neural Information Processing Systems (NeurIPS), 2024
Shantanu Jaiswal
Debaditya Roy
Basura Fernando
Cheston Tan
ReLM
LRM
247
4
0
20 Nov 2024
A Comprehensive Survey on Visual Question Answering Datasets and Algorithms
Raihan Kabir
Naznin Haque
Md. Saiful Islam
Marium-E. Jannat
CoGe
173
5
0
17 Nov 2024
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation
Dengke Zhang
Fagui Liu
Quan Tang
VLM
407
2
0
15 Nov 2024
Spider: Any-to-Many Multimodal LLM
Jinxiang Lai
Jie Zhang
Jun Liu
Jian Li
Xiaocheng Lu
Song Guo
MLLM
392
4
0
14 Nov 2024
ViTOC: Vision Transformer and Object-aware Captioner
Feiyang Huang
251
2
0
09 Nov 2024
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
Neural Information Processing Systems (NeurIPS), 2024
Jaeyoo Park
Jin Young Choi
Jeonghyung Park
Bohyung Han
VLM
47
7
0
08 Nov 2024
Semantic-Aligned Adversarial Evolution Triangle for High-Transferability Vision-Language Attack
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2024
Yang Liu
Sensen Gao
Qing Guo
Ke Ma
Yihao Huang
Simeng Qin
Yang Liu
Ivor Tsang Fellow
Xiaochun Cao
AAML
143
5
0
04 Nov 2024
Aggregate-and-Adapt Natural Language Prompts for Downstream Generalization of CLIP
Neural Information Processing Systems (NeurIPS), 2024
Chen Huang
Skyler Seto
Samira Abnar
David Grangier
Navdeep Jaitly
J. Susskind
VLM
159
4
0
31 Oct 2024
Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Donghoon Kim
Gusang Lee
Kyuhong Shim
B. Shim
205
5
0
29 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
International Journal of Computer Vision (IJCV), 2024
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
181
2
0
23 Oct 2024
EVC-MF: End-to-end Video Captioning Network with Multi-scale Features
Tian-Zi Niu
Zhen-Duo Chen
Xin Luo
Xin-Shun Xu
108
0
0
22 Oct 2024
RA-BLIP: Multimodal Adaptive Retrieval-Augmented Bootstrapping Language-Image Pre-training
IEEE transactions on multimedia (IEEE TMM), 2024
Muhe Ding
Yang Ma
Pengda Qin
Yue Yu
Yuhong Li
Liqiang Nie
140
2
0
18 Oct 2024
Fine-Grained Verifiers: Preference Modeling as Next-token Prediction in Vision-Language Alignment
International Conference on Learning Representations (ICLR), 2024
Chenhang Cui
An Zhang
Yiyang Zhou
Zhaorun Chen
Gelei Deng
Huaxiu Yao
Tat-Seng Chua
420
12
0
18 Oct 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
ACM Multimedia (ACM MM), 2022
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
265
9
0
16 Oct 2024
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
Computer Vision and Pattern Recognition (CVPR), 2024
Jian Yang
Dacheng Yin
Yizhou Zhou
Fengyun Rao
Wei-dong Zhai
Yang Cao
Zheng-jun Zha
DiffM
225
11
0
14 Oct 2024
Prompting Video-Language Foundation Models with Domain-specific Fine-grained Heuristics for Video Question Answering
Ting Yu
Kunhao Fu
Shuhui Wang
Qingming Huang
Jun Yu
207
6
0
12 Oct 2024
Positive-Augmented Contrastive Learning for Vision-and-Language Evaluation and Training
International Journal of Computer Vision (IJCV), 2024
Sara Sarto
Nicholas Moratelli
Marcella Cornia
Lorenzo Baraldi
Rita Cucchiara
175
8
0
09 Oct 2024
Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity
Hanqi Jiang
Xixuan Hao
Yuzhou Huang
Chong Ma
Jiaxun Zhang
Yi Pan
Ruimao Zhang
MedIm
265
1
0
01 Oct 2024
TROPE: TRaining-Free Object-Part Enhancement for Seamlessly Improving Fine-Grained Zero-Shot Image Captioning
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2024
Joshua Forster Feinglass
Yezhou Yang
131
0
0
30 Sep 2024
Multi-modal Generative AI: Multi-modal LLMs, Diffusions and the Unification
X. Wang
Yuwei Zhou
Bin Huang
Hong Chen
Wenwu Zhu
DiffM
285
1
0
23 Sep 2024
A Survey on Multimodal Benchmarks: In the Era of Large AI Models
Lin Li
Guikun Chen
Hanrong Shi
Jun Xiao
Long Chen
243
22
0
21 Sep 2024
KALE: An Artwork Image Captioning System Augmented with Heterogeneous Graph
International Joint Conference on Artificial Intelligence (IJCAI), 2024
Yanbei Jiang
Krista A. Ehinger
Jey Han Lau
SLR
185
6
0
17 Sep 2024
QTG-VQA: Question-Type-Guided Architectural for VideoQA Systems
Zhixian He
Pengcheng Zhao
Fuwei Zhang
Shujin Lin
209
0
0
14 Sep 2024
Previous
1
2
3
4
5
...
22
23
24
Next