Papers
Communities
Events
Blog
Pricing
Search
Open menu
Home
Papers
2312.00081
Cited By
Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding
30 November 2023
Wujian Peng
Sicheng Xie
Zuyao You
Shiyi Lan
Zuxuan Wu
VLM
CoGe
MLLM
Re-assign community
ArXiv
PDF
HTML
Papers citing
"Synthesize, Diagnose, and Optimize: Towards Fine-Grained Vision-Language Understanding"
20 / 20 papers shown
Title
SITE: towards Spatial Intelligence Thorough Evaluation
W. Wang
Reuben Tan
Pengyue Zhu
Jianwei Yang
Zhengyuan Yang
Lijuan Wang
Andrey Kolobov
Jianfeng Gao
Boqing Gong
39
0
0
08 May 2025
Decoupled Global-Local Alignment for Improving Compositional Understanding
Xiaoxing Hu
Kaicheng Yang
J. Z. Wang
Haoran Xu
Ziyong Feng
Y. Wang
VLM
45
0
0
23 Apr 2025
Can Text-to-Video Generation help Video-Language Alignment?
Luca Zanella
Massimiliano Mancini
Willi Menapace
Sergey Tulyakov
Yiming Wang
Elisa Ricci
DiffM
VGen
55
0
0
24 Mar 2025
Concept-as-Tree: Synthetic Data is All You Need for VLM Personalization
Ruichuan An
Kai Zeng
Ming Lu
Sihan Yang
Renrui Zhang
Huitong Ji
Qizhe Zhang
Y. Luo
Hao Liang
Wentao Zhang
56
0
0
17 Mar 2025
Enhancing Vision-Language Compositional Understanding with Multimodal Synthetic Data
Haoxin Li
Boyang Li
CoGe
67
0
0
03 Mar 2025
SCA3D: Enhancing Cross-modal 3D Retrieval via 3D Shape and Caption Paired Data Augmentation
Junlong Ren
Hao Wu
Hui Xiong
H. Wang
63
0
0
26 Feb 2025
Safety Without Semantic Disruptions: Editing-free Safe Image Generation via Context-preserving Dual Latent Reconstruction
J. Vice
Naveed Akhtar
Richard I. Hartley
Ajmal Saeed Mian
Ajmal Mian
DiffM
82
0
0
21 Nov 2024
VCBench: A Controllable Benchmark for Symbolic and Abstract Challenges in Video Cognition
Chenglin Li
Qianglong Chen
Zhi Li
Feng Tao
Yin Zhang
29
0
0
14 Nov 2024
TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
Maitreya Patel
Abhiram Kusumba
Sheng Cheng
Changhoon Kim
Tejas Gokhale
Chitta Baral
Yezhou Yang
CoGe
CLIP
45
7
0
04 Nov 2024
Progressive Compositionality in Text-to-Image Generative Models
Xu Han
Linghao Jin
Xiaofeng Liu
Paul Pu Liang
CoGe
93
2
0
22 Oct 2024
Preserving Multi-Modal Capabilities of Pre-trained VLMs for Improving Vision-Linguistic Compositionality
Youngtaek Oh
Jae-Won Cho
Dong-Jin Kim
In So Kweon
Junmo Kim
VLM
CoGe
CLIP
14
4
0
07 Oct 2024
A Survey on Multimodal Benchmarks: In the Era of Large AI Models
Lin Li
Guikun Chen
Hanrong Shi
Jun Xiao
Long Chen
34
8
0
21 Sep 2024
A Survey on Evaluation of Multimodal Large Language Models
Jiaxing Huang
Jingyi Zhang
LM&MA
ELM
LRM
43
20
0
28 Aug 2024
Diagnosing the Compositional Knowledge of Vision Language Models from a Game-Theoretic View
Jin Wang
Shichao Dong
Yapeng Zhu
Kelu Yao
Weidong Zhao
Chao Li
Ping Luo
CoGe
LRM
30
2
0
27 May 2024
SimDA: Simple Diffusion Adapter for Efficient Video Generation
Zhen Xing
Qi Dai
Hang-Rui Hu
Zuxuan Wu
Yu-Gang Jiang
VGen
DiffM
12
81
0
18 Aug 2023
Equivariant Similarity for Vision-Language Foundation Models
Tan Wang
Kevin Qinghong Lin
Linjie Li
Chung-Ching Lin
Zhengyuan Yang
Hanwang Zhang
Zicheng Liu
Lijuan Wang
CoGe
30
44
0
25 Mar 2023
CyCLIP: Cyclic Contrastive Language-Image Pretraining
Shashank Goel
Hritik Bansal
S. Bhatia
Ryan A. Rossi
Vishwa Vinay
Aditya Grover
CLIP
VLM
163
131
0
28 May 2022
BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Junnan Li
Dongxu Li
Caiming Xiong
S. Hoi
MLLM
BDL
VLM
CLIP
382
4,010
0
28 Jan 2022
Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision
Chao Jia
Yinfei Yang
Ye Xia
Yi-Ting Chen
Zarana Parekh
Hieu H. Pham
Quoc V. Le
Yun-hsuan Sung
Zhen Li
Tom Duerig
VLM
CLIP
293
2,875
0
11 Feb 2021
Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation
Golnaz Ghiasi
Yin Cui
A. Srinivas
Rui Qian
Tsung-Yi Lin
E. D. Cubuk
Quoc V. Le
Barret Zoph
ISeg
223
835
0
13 Dec 2020
1