Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
1504.00325
Cited By
Microsoft COCO Captions: Data Collection and Evaluation Server
1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Microsoft COCO Captions: Data Collection and Evaluation Server"
50 / 1,387 papers shown
Title
ArcSin: Adaptive ranged cosine Similarity injected noise for Language-Driven Visual Tasks
Yang Liu
Xiaomin Yu
Gongyu Zhang
Christos Bergeles
Prokar Dasgupta
Alejandro Granados
Sebastien Ourselin
42
2
0
27 Feb 2024
Towards Open-ended Visual Quality Comparison
Haoning Wu
Hanwei Zhu
Zicheng Zhang
Erli Zhang
Chaofeng Chen
...
Qiong Yan
Xiaohong Liu
Guangtao Zhai
Shiqi Wang
Weisi Lin
AAML
59
49
0
26 Feb 2024
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
Santiago Castro
Amir Ziai
Avneesh Saluja
Zhuoning Yuan
Rada Mihalcea
MLLM
CoGe
VLM
34
5
0
22 Feb 2024
Vision-Language Navigation with Embodied Intelligence: A Survey
Peng Gao
Peng Wang
Feng Gao
Fei-Yue Wang
Ruyue Yuan
LM&Ro
37
2
0
22 Feb 2024
VL-Trojan: Multimodal Instruction Backdoor Attacks against Autoregressive Visual Language Models
Jiawei Liang
Siyuan Liang
Man Luo
Aishan Liu
Dongchen Han
Ee-Chien Chang
Xiaochun Cao
42
37
0
21 Feb 2024
CODIS: Benchmarking Context-Dependent Visual Comprehension for Multimodal Large Language Models
Fuwen Luo
Chi Chen
Zihao Wan
Zhaolu Kang
Qidong Yan
...
Xiaoyue Mi
Peng Li
Ning Ma
Maosong Sun
Yang Liu
40
5
0
21 Feb 2024
A Multimodal In-Context Tuning Approach for E-Commerce Product Description Generation
Yunxin Li
Baotian Hu
Wenhan Luo
Lin Ma
Yuxin Ding
Min-Ling Zhang
53
1
0
21 Feb 2024
CounterCurate: Enhancing Physical and Semantic Visio-Linguistic Compositional Reasoning via Counterfactual Examples
Jianrui Zhang
Mu Cai
Tengyang Xie
Yong Jae Lee
LRM
43
18
0
20 Feb 2024
ConVQG: Contrastive Visual Question Generation with Multimodal Guidance
Li Mi
Syrielle Montariol
J. Castillo-Navarro
Xianjie Dai
Antoine Bosselut
D. Tuia
30
4
0
20 Feb 2024
Language-guided Image Reflection Separation
Haofeng Zhong
Yuchen Hong
Shuchen Weng
Jinxiu Liang
Boxin Shi
26
7
0
19 Feb 2024
Interpretable Embedding for Ad-hoc Video Search
Jiaxin Wu
Chong-Wah Ngo
16
29
0
19 Feb 2024
Cobra Effect in Reference-Free Image Captioning Metrics
Zheng Ma
Changxin Wang
Yawen Ouyang
Fei Zhao
Jianbing Zhang
Shujian Huang
Jiajun Chen
30
2
0
18 Feb 2024
Assessing News Thumbnail Representativeness: Counterfactual text can enhance the cross-modal matching ability
Yejun Yoon
Seunghyun Yoon
Kunwoo Park
21
0
0
17 Feb 2024
PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter
Junfei Xiao
Zheng Xu
Alan L. Yuille
Shen Yan
Boyu Wang
33
3
0
16 Feb 2024
Recovering the Pre-Fine-Tuning Weights of Generative Models
Eliahu Horwitz
Jonathan Kahana
Yedid Hoshen
50
9
0
15 Feb 2024
Social Reward: Evaluating and Enhancing Generative AI through Million-User Feedback from an Online Creative Community
Arman Isajanyan
Artur Shatveryan
David Kocharyan
Zhangyang Wang
Humphrey Shi
EGVM
68
5
0
15 Feb 2024
DoRA: Weight-Decomposed Low-Rank Adaptation
Shih-yang Liu
Chien-Yi Wang
Hongxu Yin
Pavlo Molchanov
Yu-Chiang Frank Wang
Kwang-Ting Cheng
Min-Hung Chen
27
340
0
14 Feb 2024
OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM
Yutao Hu
Tian-Xin Li
Quanfeng Lu
Wenqi Shao
Junjun He
Yu Qiao
Ping Luo
ELM
LM&MA
32
51
0
14 Feb 2024
Visually Dehallucinative Instruction Generation
Sungguk Cha
Jusung Lee
Younghyun Lee
Cheoljong Yang
MLLM
22
5
0
13 Feb 2024
A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs
Zicheng Zhang
Haoning Wu
Erli Zhang
Guangtao Zhai
Weisi Lin
VLM
24
8
0
11 Feb 2024
Cacophony: An Improved Contrastive Audio-Text Model
Ge Zhu
Jordan Darefsky
Zhiyao Duan
AuLLM
43
11
0
10 Feb 2024
GPTs Are Multilingual Annotators for Sequence Generation Tasks
Juhwan Choi
Eunju Lee
Kyohoon Jin
Youngbin Kim
25
10
0
08 Feb 2024
Question Aware Vision Transformer for Multimodal Reasoning
Roy Ganz
Yair Kittenplon
Aviad Aberdam
Elad Ben Avraham
Oren Nuriel
Shai Mazor
Ron Litman
42
20
0
08 Feb 2024
Get What You Want, Not What You Don't: Image Content Suppression for Text-to-Image Diffusion Models
Senmao Li
J. Weijer
Taihang Hu
Fahad Shahbaz Khan
Qibin Hou
Yaxing Wang
Jian Yang
DiffM
45
27
0
08 Feb 2024
Variance Alignment Score: A Simple But Tough-to-Beat Data Selection Method for Multimodal Contrastive Learning
Yiping Wang
Yifang Chen
Wendan Yan
Kevin G. Jamieson
S. Du
28
5
0
03 Feb 2024
Can MLLMs Perform Text-to-Image In-Context Learning?
Yuchen Zeng
Wonjun Kang
Yicong Chen
Hyung Il Koo
Kangwook Lee
MLLM
33
9
0
02 Feb 2024
SCO-VIST: Social Interaction Commonsense Knowledge-based Visual Storytelling
Eileen Wang
S. Han
Josiah Poon
11
5
0
01 Feb 2024
EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning
Jaeyeon Kim
Jaeyoon Jung
Jinjoo Lee
Sang Hoon Woo
CLIP
VLM
23
21
0
31 Jan 2024
EarthGPT: A Universal Multi-modal Large Language Model for Multi-sensor Image Comprehension in Remote Sensing Domain
Wei Zhang
Miaoxin Cai
Tong Zhang
Zhuang Yin
Xuerui Mao
24
88
0
30 Jan 2024
Towards Unified Interactive Visual Grounding in The Wild
Jie Xu
Hanbo Zhang
Qingyi Si
Yifeng Li
Xuguang Lan
Tao Kong
LM&Ro
30
5
0
30 Jan 2024
M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining
Qingpei Guo
Furong Xu
Hanxiao Zhang
Wang Ren
Ziping Ma
Lin Ju
Jian Wang
Jingdong Chen
Ming Yang
VLM
MLLM
27
2
0
29 Jan 2024
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
Yue Fan
Jing Gu
KAI-QING Zhou
Qianqi Yan
Shan Jiang
Ching-Chen Kuo
Xinze Guan
Xin Eric Wang
29
7
0
29 Jan 2024
MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang
Yahan Yu
Jiahua Dong
Chenxing Li
Dan Su
Chenhui Chu
Dong Yu
OffRL
LRM
52
179
0
24 Jan 2024
Common-Sense Bias Modeling for Classification Tasks
Miao Zhang
Zee fryer
Ben Colman
Ali Shahriyari
Gaurav Bharaj
30
0
0
24 Jan 2024
Enhancing Object Detection Performance for Small Objects through Synthetic Data Generation and Proportional Class-Balancing Technique: A Comparative Study in Industrial Scenarios
Jibinraj Antony
Vinit Hegiste
Ali Nazeri
Hooman Tavakoli
Snehal Walunj
Christiane Plociennik
Martin Ruskowski
31
2
0
23 Jan 2024
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities
Boyuan Chen
Zhuo Xu
Sean Kirmani
Brian Ichter
Danny Driess
Pete Florence
Dorsa Sadigh
Leonidas J. Guibas
Fei Xia
LRM
ReLM
49
206
0
22 Jan 2024
Text-to-Image Cross-Modal Generation: A Systematic Review
Maciej Żelaszczyk
Jacek Mañdziuk
35
3
0
21 Jan 2024
LLMRA: Multi-modal Large Language Model based Restoration Assistant
Xiaoyu Jin
Yuan Shi
Bin Xia
Wenming Yang
36
4
0
21 Jan 2024
CBVS: A Large-Scale Chinese Image-Text Benchmark for Real-World Short Video Search Scenarios
Xiangshuo Qiao
Xianxin Li
Xiaozhe Qu
Jie M. Zhang
Yang Liu
Yu Luo
Cihang Jin
Jin Ma
VLM
33
0
0
19 Jan 2024
Supervised Fine-tuning in turn Improves Visual Foundation Models
Xiaohu Jiang
Yixiao Ge
Yuying Ge
Dachuan Shi
Chun Yuan
Ying Shan
VLM
CLIP
46
8
0
18 Jan 2024
MM-Interleaved: Interleaved Image-Text Generative Modeling via Multi-modal Feature Synchronizer
Changyao Tian
Xizhou Zhu
Yuwen Xiong
Weiyun Wang
Zhe Chen
...
Tong Lu
Jie Zhou
Hongsheng Li
Yu Qiao
Jifeng Dai
AuLLM
85
42
0
18 Jan 2024
Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer
Junhao Zheng
Qianli Ma
Zhen Liu
Binquan Wu
Hu Feng
CLL
26
14
0
17 Jan 2024
COCO is "ALL'' You Need for Visual Instruction Fine-tuning
Xiaotian Han
Yiqi Wang
Bohan Zhai
Quanzeng You
Hongxia Yang
VLM
MLLM
33
2
0
17 Jan 2024
Let's Go Shopping (LGS) -- Web-Scale Image-Text Dataset for Visual Concept Understanding
Yatong Bai
Utsav Garg
Apaar Shanker
Haoming Zhang
Samyak Parajuli
...
Eugenia D Fomitcheva
E. Branson
Aerin Kim
Somayeh Sojoudi
Kyunghyun Cho
16
2
0
09 Jan 2024
CaMML: Context-Aware Multimodal Learner for Large Models
Yixin Chen
Shuai Zhang
Boran Han
Tong He
Bo Li
VLM
32
4
0
06 Jan 2024
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
Xin He
Longhui Wei
Lingxi Xie
Qi Tian
43
8
0
06 Jan 2024
4DGen: Grounded 4D Content Generation with Spatial-temporal Consistency
Yuyang Yin
Dejia Xu
Zhangyang Wang
Yao-Min Zhao
Yunchao Wei
3DGS
47
72
0
28 Dec 2023
Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action
Jiasen Lu
Christopher Clark
Sangho Lee
Zichen Zhang
Savya Khosla
Ryan Marten
Derek Hoiem
Aniruddha Kembhavi
VLM
MLLM
37
144
0
28 Dec 2023
Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey
Jiaxing Huang
Jingyi Zhang
Kai Jiang
Han Qiu
Shijian Lu
41
22
0
27 Dec 2023
Cloud-Device Collaborative Learning for Multimodal Large Language Models
Guanqun Wang
Jiaming Liu
Chenxuan Li
Junpeng Ma
Yuan Zhang
...
Kevin Zhang
Maurice Chong
Ray Zhang
Yijiang Liu
Shanghang Zhang
41
7
0
26 Dec 2023
Previous
1
2
3
...
6
7
8
...
26
27
28
Next