1504.00325
Cited By
v1
v2 (latest)
Microsoft COCO Captions: Data Collection and Evaluation Server
1 April 2015
Xinlei Chen
Hao Fang
Nayeon Lee
Ramakrishna Vedantam
Saurabh Gupta
Piotr Dollar
C. L. Zitnick
Papers citing
"Microsoft COCO Captions: Data Collection and Evaluation Server"
50 / 1,519 papers shown
jina-clip-v2: Multilingual Multimodal Embeddings for Text and Images
Andreas Koukounas
Georgios Mastrapas
Bo Wang
Mohammad Kalim Akram
Sedigheh Eslami
Michael Gunther
Isabelle Mohr
Saba Sturua
Scott Martens
Nan Wang
VLM
794
20
0
11 Dec 2024
FireFlow: Fast Inversion of Rectified Flow for Image Semantic Editing
Yingying Deng
Xiangyu He
Changwang Mei
Peisong Wang
Fan Tang
289
35
0
10 Dec 2024
Visual Lexicon: Rich Image Features in Language Space
Computer Vision and Pattern Recognition (CVPR), 2024
Xudong Wang
Xingyi Zhou
Alireza Fathi
Trevor Darrell
Cordelia Schmid
VLM
208
7
0
09 Dec 2024
JAPAGEN: Efficient Few/Zero-shot Learning via Japanese Training Dataset Generation with LLM
Pacific Asia Conference on Language, Information and Computation (PACLIC), 2024
Takuro Fujii
Satoru Katsumata
202
0
0
09 Dec 2024
Learning to Correction: Explainable Feedback Generation for Visual Commonsense Reasoning Distractor
ACM Multimedia (MM), 2024
Jiali Chen
Xusen Hei
Yuqi Xue
Yuancheng Wei
Jiayuan Xie
Yi Cai
Qing Li
MLLM
LRM
315
11
0
08 Dec 2024
EgoPlan-Bench2: A Benchmark for Multimodal Large Language Model Planning in Real-World Scenarios
Lu Qiu
Yuying Ge
Yi Chen
Yixiao Ge
Mingyu Ding
Xihui Liu
LLMAG
LRM
396
18
0
05 Dec 2024
Partially Conditioned Patch Parallelism for Accelerated Diffusion Model Inference
XiuYu Zhang
Zening Luo
Michelle E. Lu
DiffM
179
3
0
04 Dec 2024
AdvDreamer Unveils: Are Vision-Language Models Truly Ready for Real-World 3D Variations?
Shouwei Ruan
Hanqin Liu
Yao Huang
Xiaoqi Wang
Caixin Kang
Hang Su
Yinpeng Dong
Xingxing Wei
VGen
650
1
0
04 Dec 2024
ScImage: How Good Are Multimodal Large Language Models at Scientific Text-to-Image Generation?
Leixin Zhang
Steffen Eger
Yinjie Cheng
Weihe Zhai
Jonas Belouadi
Christoph Leiter
Simone Paolo Ponzetto
Fahimeh Moafian
Zhixue Zhao
MLLM
370
4
0
03 Dec 2024
Progress-Aware Video Frame Captioning
Computer Vision and Pattern Recognition (CVPR), 2024
Zihui Xue
Joungbin An
Xitong Yang
Kristen Grauman
600
6
0
03 Dec 2024
Align-KD: Distilling Cross-Modal Alignment Knowledge for Mobile Vision-Language Model
Qianhan Feng
Wenshuo Li
Tong Lin
Xinghao Chen
VLM
302
7
0
02 Dec 2024
Perception of Visual Content: Differences Between Humans and Foundation Models
International Conference on Web and Social Media (ICWSM), 2024
Nardiena A. Pratama
Shaoyang Fan
Gianluca Demartini
VLM
431
0
0
28 Nov 2024
VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis
Donggoo Kang
Dasol Jeong
Hyunmin Lee
Sangwoo Park
Hasil Park
Sunkyu Kwon
Yeongjoon Kim
Joonki Paik
MLLM
VLM
336
1
0
27 Nov 2024
Chain of Attack: On the Robustness of Vision-Language Models Against Transfer-Based Adversarial Attacks
Computer Vision and Pattern Recognition (CVPR), 2024
Peng Xie
Yequan Bie
Jianda Mao
Yangqiu Song
Yang Wang
Hao Chen
Kani Chen
AAML
348
7
0
24 Nov 2024
Lifelong Knowledge Editing for Vision Language Models with Low-Rank Mixture-of-Experts
Computer Vision and Pattern Recognition (CVPR), 2024
Qizhou Chen
Chengyu Wang
Dakan Wang
Taolin Zhang
Wangyue Li
Xiaofeng He
KELM
369
5
0
23 Nov 2024
Neuro-Symbolic Evaluation of Text-to-Video Models using Formal Verification
Computer Vision and Pattern Recognition (CVPR), 2024
S P Sharan
Minkyu Choi
Sahil Shah
Harsh Goel
Mohammad Omama
Sandeep Chinchali
EGVM
635
5
0
22 Nov 2024
PSA-VLM: Enhancing Vision-Language Model Safety through Progressive Concept-Bottleneck-Driven Alignment
Zhendong Liu
Yuanbi Nie
Yingshui Tan
Xiangyu Yue
Qiushi Cui
Chongjun Wang
Xiaoyong Zhu
Jian Xu
Bo Zheng
526
0
0
18 Nov 2024
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization
Computer Vision and Pattern Recognition (CVPR), 2024
Hongrui Jia
Chaoya Jiang
Haiyang Xu
Wei Ye
Mengfan Dong
Ming Yan
Ji Zhang
Fei Huang
Shikun Zhang
MLLM
385
7
0
17 Nov 2024
BlueLM-V-3B: Algorithm and System Co-Design for Multimodal Large Language Models on Mobile Devices
Computer Vision and Pattern Recognition (CVPR), 2024
Xudong Lu
Yinghao Chen
Cheng Chen
Hui Tan
Boheng Chen
...
Aojun Zhou
Yafei Wen
Xiaoxin Chen
Shuai Ren
Jiaming Song
197
19
0
16 Nov 2024
EVOKE: Elevating Chest X-ray Report Generation via Multi-View Contrastive Learning and Patient-Specific Knowledge
Kang Liu
Zhuoqi Ma
Kun Xie
Zhicheng Jiao
Qiguang Miao
Ruixuan Liu
Tianyi Liu
119
0
0
15 Nov 2024
Bridging the Visual Gap: Fine-Tuning Multimodal Models with Knowledge-Adapted Captions
North American Chapter of the Association for Computational Linguistics (NAACL), 2024
Moran Yanuka
Assaf Ben-Kish
Yonatan Bitton
Idan Szpektor
Raja Giryes
VLM
488
4
0
13 Nov 2024
Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
Neural Information Processing Systems (NeurIPS), 2024
Jaeyoo Park
Jin Young Choi
Jeonghyung Park
Bohyung Han
VLM
139
8
0
08 Nov 2024
Image Understanding Makes for A Good Tokenizer for Image Generation
Neural Information Processing Systems (NeurIPS), 2024
Luting Wang
Yang Zhao
Zijian Zhang
Jiashi Feng
Si Liu
Bingyi Kang
VLM
203
9
0
07 Nov 2024
MME-Finance: A Multimodal Finance Benchmark for Expert-level Understanding and Reasoning
Ziliang Gan
Shilin Zhou
D. Zhang
Yu Lu
Che Liu
...
Haipang Wu
Chaoyou Fu
Z. Xu
Rongjunchen Zhang
Yong Dai
269
28
0
05 Nov 2024
Classification Done Right for Vision-Language Pre-Training
Neural Information Processing Systems (NeurIPS), 2024
Zilong Huang
Qinghao Ye
Bingyi Kang
Jiashi Feng
Haoqi Fan
CLIP
VLM
415
7
0
05 Nov 2024
Phase Diagram of Vision Large Language Models Inference: A Perspective from Interaction across Image and Instruction
Houjing Wei
Hakaze Cho
Yuting Shi
MLLM
245
1
0
01 Nov 2024
MoLE: Enhancing Human-centric Text-to-image Diffusion via Mixture of Low-rank Experts
Neural Information Processing Systems (NeurIPS), 2024
Jie Zhu
Yukang Chen
Mingyu Ding
Ping Luo
Leye Wang
Jingdong Wang
DiffM
159
10
0
30 Oct 2024
Controlling Language and Diffusion Models by Transporting Activations
International Conference on Learning Representations (ICLR), 2024
P. Rodríguez
Arno Blaas
Stephen Zhang
Luca Zappella
N. Apostoloff
Marco Cuturi
Xavier Suau
LLMSV
324
15
0
30 Oct 2024
VL-Cache: Sparsity and Modality-Aware KV Cache Compression for Vision-Language Model Inference Acceleration
International Conference on Learning Representations (ICLR), 2024
Dezhan Tu
Danylo Vashchilenko
Yuzhe Lu
Panpan Xu
VLM
240
22
0
29 Oct 2024
What Factors Affect Multi-Modal In-Context Learning? An In-Depth Exploration
Neural Information Processing Systems (NeurIPS), 2024
L. Qin
Qiguang Chen
Hao Fei
Zhi Chen
Min Li
Wanxiang Che
207
26
0
27 Oct 2024
Human-Object Interaction Detection Collaborated with Large Relation-driven Diffusion Models
Neural Information Processing Systems (NeurIPS), 2024
Liulei Li
Wenguan Wang
Yue Yang
235
21
0
26 Oct 2024
Sensor2Text: Enabling Natural Language Interactions for Daily Activity Tracking Using Wearable Sensors
Proceedings of the ACM on Interactive Mobile Wearable and Ubiquitous Technologies (IMWUT), 2024
Wenqiang Chen
Jiaxuan Cheng
Leyao Wang
Wei Zhao
Wojciech Matusik
264
14
0
26 Oct 2024
A Combinatorial Approach to Neural Emergent Communication
International Conference on Computational Linguistics (COLING), 2024
Zheyuan Zhang
147
1
0
24 Oct 2024
Probabilistic Language-Image Pre-Training
International Conference on Learning Representations (ICLR), 2024
Sanghyuk Chun
Wonjae Kim
Song Park
Sangdoo Yun
MLLM
VLM
CLIP
1.2K
14
2
24 Oct 2024
ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
International Journal of Computer Vision (IJCV), 2024
Zhiwei Hao
Jianyuan Guo
Li Shen
Yong Luo
Han Hu
Yonggang Wen
VLM
287
4
0
23 Oct 2024
Offline Evaluation of Set-Based Text-to-Image Generation
Negar Arabzadeh
Fernando Diaz
Junfeng He
EGVM
212
1
0
22 Oct 2024
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss
Zesen Cheng
Hang Zhang
Kehan Li
Sicong Leng
Zhiqiang Hu
Fei Wu
Deli Zhao
Xin Li
Lidong Bing
155
3
0
22 Oct 2024
Mini-InternVL: A Flexible-Transfer Pocket Multimodal Model with 5% Parameters and 90% Performance
Zhangwei Gao
Zhe Chen
Erfei Cui
Yiming Ren
Weiyun Wang
...
Lewei Lu
Tong Lu
Yu Qiao
Jifeng Dai
Wenhai Wang
VLM
395
87
0
21 Oct 2024
TIPS: Text-Image Pretraining with Spatial awareness
International Conference on Learning Representations (ICLR), 2024
Kevis-Kokitsi Maninis
Kaifeng Chen
Soham Ghosh
Arjun Karpur
Koert Chen
...
Jan Dlabal
Dan Gnanapragasam
Mojtaba Seyedhosseini
Howard Zhou
Andre Araujo
VLM
436
17
0
21 Oct 2024
EVA: An Embodied World Model for Future Video Anticipation
Yatian Wang
Hengyuan Zhang
Chun-Kai Fan
Xingqun Qi
Rongyu Zhang
...
Chi-Min Chan
Wei Xue
Wenhan Luo
Shanghang Zhang
VGen
229
17
0
20 Oct 2024
Debiasing Large Vision-Language Models by Ablating Protected Attribute Representations
Neale Ratzlaff
Matthew Lyle Olson
Musashi Hinck
Shao-Yen Tseng
Vasudev Lal
Phillip Howard
375
4
0
17 Oct 2024
Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation
Computer Vision and Pattern Recognition (CVPR), 2024
Chengyue Wu
Xiaokang Chen
Z. F. Wu
Yiyang Ma
Xingchao Liu
...
Wen Liu
Zhenda Xie
Xingkai Yu
Chong Ruan
Ping Luo
AI4TS
390
264
0
17 Oct 2024
Temporal-Enhanced Multimodal Transformer for Referring Multi-Object Tracking and Segmentation
Changcheng Xiao
Qiong Cao
Yujie Zhong
Xiang Zhang
Tao Wang
Canqun Yang
L. Lan
210
3
0
17 Oct 2024
CMAL: A Novel Cross-Modal Associative Learning Framework for Vision-Language Pre-Training
ACM Multimedia (ACM MM), 2022
Zhiyuan Ma
Jianjun Li
Guohui Li
Kaiyan Huang
VLM
377
9
0
16 Oct 2024
Learning to Customize Text-to-Image Diffusion In Diverse Context
Taewook Kim
Wei Chen
Qiang Qiu
DiffM
217
6
0
14 Oct 2024
Multi-granularity Contrastive Cross-modal Collaborative Generation for End-to-End Long-term Video Question Answering
IEEE Transactions on Image Processing (TIP), 2024
Ting Yu
Kunhao Fu
Jian Zhang
Qingming Huang
Jun Yu
218
6
0
12 Oct 2024
Dynamic Multimodal Evaluation with Flexible Complexity by Vision-Language Bootstrapping
International Conference on Learning Representations (ICLR), 2024
Yue Yang
Shanghang Zhang
Wenqi Shao
Kaipeng Zhang
Yi Bin
Yu Wang
Ping Luo
428
15
0
11 Oct 2024
A Unified Debiasing Approach for Vision-Language Models across Modalities and Tasks
Neural Information Processing Systems (NeurIPS), 2024
Hoin Jung
T. Jang
Xiaoqian Wang
VLM
198
16
0
10 Oct 2024
Mono-InternVL: Pushing the Boundaries of Monolithic Multimodal Large Language Models with Endogenous Visual Pre-training
Computer Vision and Pattern Recognition (CVPR), 2024
Gen Luo
Xue Yang
Wenhan Dou
Zhaokai Wang
Jifeng Dai
Yu Qiao
Xizhou Zhu
VLM
MLLM
361
66
0
10 Oct 2024
Insight Over Sight: Exploring the Vision-Knowledge Conflicts in Multimodal LLMs
Annual Meeting of the Association for Computational Linguistics (ACL), 2024
Xiaoyuan Liu
Wenxuan Wang
Youliang Yuan
Shu Yang
Qiuzhi Liu
Pinjia He
Zhaopeng Tu
927
2
0
10 Oct 2024