Communities
Connect sessions
AI calendar
Organizations
Join Slack
Contact Sales
Search
Open menu
Home
Papers
2004.06165
Cited By
v1
v2
v3
v4
v5 (latest)
Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks
European Conference on Computer Vision (ECCV), 2020
13 April 2020
Xiujun Li
Xi Yin
Chunyuan Li
Pengchuan Zhang
Xiaowei Hu
Lei Zhang
Lijuan Wang
Houdong Hu
Li Dong
Furu Wei
Yejin Choi
Jianfeng Gao
VLM
Re-assign community
ArXiv (abs)
PDF
HTML
Papers citing
"Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks"
50 / 1,171 papers shown
Title
MeaCap: Memory-Augmented Zero-shot Image Captioning
Zequn Zeng
Yan Xie
Hao Zhang
Chiyu Chen
Zhengjue Wang
Boli Chen
VLM
194
33
0
06 Mar 2024
Enhancing Vision-Language Pre-training with Rich Supervisions
Yuan Gao
Kunyu Shi
Pengkai Zhu
Edouard Belval
Oren Nuriel
Srikar Appalaraju
Shabnam Ghadar
Vijay Mahadevan
Zhuowen Tu
Stefano Soatto
VLM
CLIP
288
15
0
05 Mar 2024
Non-autoregressive Sequence-to-Sequence Vision-Language Models
Kunyu Shi
Qi Dong
Luis Goncalves
Zhuowen Tu
Stefano Soatto
VLM
245
4
0
04 Mar 2024
Semantics-enhanced Cross-modal Masked Image Modeling for Vision-Language Pre-training
Haowei Liu
Yaya Shi
Haiyang Xu
Chunfen Yuan
Qinghao Ye
...
Mingshi Yan
Ji Zhang
Fei Huang
Bing Li
Weiming Hu
VLM
133
0
0
01 Mar 2024
VIXEN: Visual Text Comparison Network for Image Difference Captioning
Alexander Black
Jing Shi
Yifei Fai
Tu Bui
John Collomosse
173
8
0
29 Feb 2024
Automatic Creative Selection with Cross-Modal Matching
Alex Kim
Jia Huang
Rob Monarch
Jerry Kwac
Anikesh Kamath
P. Khurd
Kailash Thiyagarajan
Goodman Gu
VLM
111
0
0
28 Feb 2024
Demonstrating and Reducing Shortcuts in Vision-Language Representation Learning
Maurits J. R. Bleeker
Mariya Hendriksen
Andrew Yates
Maarten de Rijke
VLM
195
9
0
27 Feb 2024
VCD: A Dataset for Visual Commonsense Discovery in Images
Xiangqing Shen
Yurun Song
Siwei Wu
Rui Xia
190
7
0
27 Feb 2024
Bridging the Gap between 2D and 3D Visual Question Answering: A Fusion Approach for 3D VQA
Wentao Mo
Yang Liu
111
23
0
24 Feb 2024
NaVid: Video-based VLM Plans the Next Step for Vision-and-Language Navigation
JIazhao Zhang
Kunyu Wang
Rongtao Xu
Gengze Zhou
Yicong Hong
Xiaomeng Fang
Qi Wu
Dongbin Zhao
Wang He
LM&Ro
415
124
0
24 Feb 2024
SIMPLOT: Enhancing Chart Question Answering by Distilling Essentials
Wonjoong Kim
S. Park
Yeonjun In
Seokwon Han
Chanyoung Park
LRM
ReLM
174
3
0
22 Feb 2024
Distinctive Image Captioning: Leveraging Ground Truth Captions in CLIP Guided Reinforcement Learning
Antoine Chaffin
Ewa Kijak
Vincent Claveau
148
2
0
21 Feb 2024
A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation
Yunxin Li
Baotian Hu
Tong Lu
Lin Ma
Yuxin Ding
Min Zhang
182
3
0
21 Feb 2024
WinoViz: Probing Visual Properties of Objects Under Different States
Woojeong Jin
Tejas Srinivasan
Jesse Thomason
Xiang Ren
153
1
0
21 Feb 2024
ProtChatGPT: Towards Understanding Proteins with Large Language Models
Chao Wang
Hehe Fan
Ruijie Quan
Yi Yang
165
20
0
15 Feb 2024
Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search
Yifei Yuan
Clemencia Siro
Mohammad Aliannejadi
Maarten de Rijke
Wai Lam
82
13
0
12 Feb 2024
GeReA: Question-Aware Prompt Captions for Knowledge-based Visual Question Answering
Ziyu Ma
Shutao Li
Bin Sun
Jianfei Cai
Zuxiang Long
Fuyan Ma
160
7
0
04 Feb 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL
LRM
292
303
0
24 Jan 2024
ModaVerse: Efficiently Transforming Modalities with LLMs
Computer Vision and Pattern Recognition (CVPR), 2024
Xinyu Wang
Bohan Zhuang
Qi Wu
106
17
0
12 Jan 2024
Efficient Vision-and-Language Pre-training with Text-Relevant Image Patch Selection
Wei Ye
Chaoya Jiang
Haiyang Xu
Chenhao Ye
Chenliang Li
Mingshi Yan
Shikun Zhang
Songhang Huang
Fei Huang
VLM
118
1
0
11 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
92
2
0
09 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
220
11
0
06 Jan 2024
Detours for Navigating Instructional Videos
Computer Vision and Pattern Recognition (CVPR), 2024
Kumar Ashutosh
Zihui Xue
Tushar Nagarajan
Kristen Grauman
318
6
0
03 Jan 2024
Social Media Ready Caption Generation for Brands
Himanshu Maheshwari
Koustava Goswami
Apoorv Saxena
Balaji Vasan Srinivasan
115
1
0
03 Jan 2024
COSMO: COntrastive Streamlined MultimOdal Model with Interleaved Pre-Training
Alex Jinpeng Wang
Linjie Li
Kevin Qinghong Lin
Jianfeng Wang
Kevin Lin
Zhengyuan Yang
Lijuan Wang
Mike Zheng Shou
VLM
VGen
146
15
0
01 Jan 2024
3VL: Using Trees to Improve Vision-Language Models' Interpretability
IEEE Transactions on Image Processing (IEEE TIP), 2023
Nir Yellinek
Leonid Karlinsky
Raja Giryes
CoGe
VLM
436
7
0
28 Dec 2023
DocMSU: A Comprehensive Benchmark for Document-level Multimodal Sarcasm Understanding
Hang Du
Gu Nan
Sicheng Zhang
Binzhu Xie
Junrui Xu
Hehe Fan
Qimei Cui
Xiaofeng Tao
Xudong Jiang
90
8
0
26 Dec 2023
Cycle-Consistency Learning for Captioning and Grounding
Ning Wang
Jiajun Deng
Mingbo Jia
ObjD
135
12
0
23 Dec 2023
Unveiling Backbone Effects in CLIP: Exploring Representational Synergies and Variances
Cristian Rodriguez-Opazo
Edison Marrese-Taylor
Ehsan Abbasnejad
Hamed Damirchi
Ignacio M. Jara
Felipe Bravo-Marquez
Anton Van Den Hengel
VLM
117
1
0
22 Dec 2023
Towards More Faithful Natural Language Explanation Using Multi-Level Contrastive Learning in VQA
Chengen Lai
Shengli Song
Shiqi Meng
Jingyang Li
Sitong Yan
Guangneng Hu
164
8
0
21 Dec 2023
ECAMP: Entity-centered Context-aware Medical Vision Language Pre-training
Rongsheng Wang
Qingsong Yao
Zihang Jiang
Zhiyang He
Xiaodong Tao
Zihang Jiang
S.Kevin Zhou
MedIm
VLM
197
0
0
20 Dec 2023
Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pretraining
Bumsoo Kim
Yeonsik Jo
Jinhyung Kim
S. Kim
VLM
165
10
0
19 Dec 2023
Expediting Contrastive Language-Image Pretraining via Self-distilled Encoders
Bumsoo Kim
Jinhyung Kim
Yeonsik Jo
S. Kim
VLM
200
5
0
19 Dec 2023
Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model
Shraman Pramanick
Guangxing Han
Rui Hou
Sayan Nag
Ser-Nam Lim
Nicolas Ballas
Qifan Wang
Rama Chellappa
Amjad Almahairi
VLM
MLLM
258
46
0
19 Dec 2023
Context Disentangling and Prototype Inheriting for Robust Visual Grounding
Wei Tang
Liang Li
Xuejing Liu
Lu Jin
Jinhui Tang
Zechao Li
178
38
0
19 Dec 2023
Pedestrian Attribute Recognition via CLIP based Prompt Vision-Language Fusion
Tianlin Li
Jiandong Jin
Chenglong Li
Jin Tang
Cheng Zhang
Wei Wang
VLM
141
30
0
17 Dec 2023
p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models
Haoyuan Wu
Xinyun Zhang
Peng Xu
Peiyu Liao
Xufeng Yao
Bei Yu
VLM
93
0
0
17 Dec 2023
Data-Efficient Multimodal Fusion on a Single GPU
Computer Vision and Pattern Recognition (CVPR), 2023
Noël Vouitsis
Zhaoyan Liu
S. Gorti
Valentin Villecroze
Jesse C. Cresswell
Guangwei Yu
Gabriel Loaiza-Ganem
Anthony L. Caterini
209
7
0
15 Dec 2023
Improving Cross-modal Alignment with Synthetic Pairs for Text-only Image Captioning
AAAI Conference on Artificial Intelligence (AAAI), 2023
Zhiyue Liu
Jinyuan Liu
Fanrong Ma
CLIP
VLM
148
17
0
14 Dec 2023
TiMix: Text-aware Image Mixing for Effective Vision-Language Pre-training
AAAI Conference on Artificial Intelligence (AAAI), 2023
Chaoya Jiang
Wei Ye
Haiyang Xu
Qinghao Ye
Mingshi Yan
Ji Zhang
Shikun Zhang
CLIP
VLM
173
5
0
14 Dec 2023
A Survey of Generative AI for Intelligent Transportation Systems
Huan Yan
Yong Li
130
14
0
13 Dec 2023
Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation
Annual Meeting of the Association for Computational Linguistics (ACL), 2023
Wenting Chen
Linlin Shen
Jingyang Lin
Jiebo Luo
Xiang Li
Yixuan Yuan
MedIm
170
25
0
13 Dec 2023
EZ-CLIP: Efficient Zeroshot Video Action Recognition
Shahzad Ahmad
S. Chanda
Yogesh S Rawat
VLM
199
11
0
13 Dec 2023
ToViLaG: Your Visual-Language Generative Model is Also An Evildoer
Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Xinpeng Wang
Xiaoyuan Yi
Han Jiang
Shanlin Zhou
Zhihua Wei
Xing Xie
160
22
0
13 Dec 2023
A Foundational Multimodal Vision Language AI Assistant for Human Pathology
Ming Y. Lu
Bowen Chen
Drew F. K. Williamson
Richard J. Chen
Kenji Ikamura
...
Ivy Liang
L. Le
Tong Ding
Anil V. Parwani
Faisal Mahmood
MedIm
LM&MA
126
26
0
13 Dec 2023
Interfacing Foundation Models' Embeddings
Neural Information Processing Systems (NeurIPS), 2023
Xueyan Zou
Linjie Li
Jianfeng Wang
Jianwei Yang
Mingyu Ding
...
Hao Zhang
Shilong Liu
Arul Aravinthan
Yong Jae Lee
Lijuan Wang
34
3
0
12 Dec 2023
RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning
Jiashuo Fan
Yaoyuan Liang
Leyao Liu
Shao-Lun Huang
Lei Zhang
215
5
0
11 Dec 2023
MAFA: Managing False Negatives for Vision-Language Pre-training
Jaeseok Byun
Dohoon Kim
Taesup Moon
VLM
265
9
0
11 Dec 2023
Open-Vocabulary Segmentation with Semantic-Assisted Calibration
Yong Liu
Sule Bai
Guanbin Li
Yitong Wang
Yansong Tang
VLM
144
39
0
07 Dec 2023
Deep Multimodal Fusion for Surgical Feedback Classification
Rafal Kocielnik
Elyssa Y. Wong
Timothy N. Chu
Lydia Lin
De-An Huang
Jiayun Wang
A. Anandkumar
Andrew J. Hung
93
3
0
06 Dec 2023
Previous
1
2
3
4
5
6
...
22
23
24
Next