Learning to Compose Soft Prompts for Compositional Zero-Shot Learning

International Conference on Learning Representations (ICLR), 2022
Abstract

We introduce compositional soft prompting (CSP), a parameter-efficient learning technique to improve the zero-shot compositionality of large-scale pretrained vision-language models (VLMs). VLMs can represent arbitrary classes as natural language prompts in their flexible text encoders, but they underperform state-of-the-art methods on compositional zero-shot benchmark tasks. To improve VLMs, we propose a novel form of soft prompting. We treat the attributes and objects that are composed to define classes as learnable tokens of vocabulary and tune them on multiple prompt compositions. During inference, we recompose the learned attribute-object vocabulary in new combinations. We show that CSP outperforms the original VLM on benchmark datasets by an average of 10.9 percentage points on AUC. CSP also outperforms CoOp, a soft prompting method that tunes the prefix context, by an average of 5.8 percentage points on AUC. We perform additional experiments to show that CSP improves generalization to attribute-only classification, higher-order attribute-attribute-object compositions, and combinations of pretrained attributes and fine-tuned objects.
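The recomposition idea described above can be sketched in a few lines: keep one learnable embedding per attribute and per object, compose any attribute-object pair into a prompt, and score it against an image embedding. This is a minimal NumPy illustration, not the paper's implementation: mean pooling stands in for the VLM's text encoder, the frozen prefix context is random, and all names and dimensions are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 8  # toy embedding dimension

# Hypothetical vocabularies: attributes and objects compose into classes.
attributes = ["red", "sliced"]
objects = ["apple", "tomato"]

# CSP's core idea: a learnable embedding per attribute and per object
# (initialized randomly here; the paper initializes from the VLM's
# word embeddings and fine-tunes them on seen compositions).
attr_emb = {a: rng.normal(size=DIM) for a in attributes}
obj_emb = {o: rng.normal(size=DIM) for o in objects}

def compose_prompt(attr, obj, context):
    """Build a prompt embedding: frozen prefix context tokens followed
    by the learnable attribute and object tokens, then mean-pooled.
    (Mean pooling is a stand-in for the VLM text encoder.)"""
    tokens = np.stack(list(context) + [attr_emb[attr], obj_emb[obj]])
    pooled = tokens.mean(axis=0)
    return pooled / np.linalg.norm(pooled)

def scores(image_emb, pairs, context):
    """Cosine similarity between an image embedding and every
    composed attribute-object prompt."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    return {p: float(compose_prompt(*p, context) @ image_emb) for p in pairs}

context = [rng.normal(size=DIM) for _ in range(3)]  # frozen prefix tokens
# At inference, recompose the learned vocabulary into all pairs,
# including attribute-object combinations never seen during tuning.
pairs = [(a, o) for a in attributes for o in objects]
image = rng.normal(size=DIM)
s = scores(image, pairs, context)
print(max(s, key=s.get))  # predicted (attribute, object) pair
```

The key contrast with prefix-tuning methods such as CoOp is which tokens are learned: CoOp tunes the shared context prefix, while CSP tunes the attribute and object vocabulary itself, which is what allows recomposition into unseen pairs at test time.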
