Efficient In-Context Learning in Vision-Language Models for Egocentric Videos

Conference on Empirical Methods in Natural Language Processing (EMNLP), 2023
Abstract

Recent advancements in text-only large language models (LLMs) have highlighted the benefit of in-context learning for adapting to new tasks with a few demonstrations. However, extending in-context learning to large vision-language models (VLMs) trained on huge amounts of naturalistic vision-language data has shown limited success, particularly for egocentric videos, due to high data collection costs. We propose a novel training method, Efficient In-context Learning on Egocentric Videos (EILEV), which elicits in-context learning in VLMs for egocentric videos without requiring massive, naturalistic egocentric video datasets. EILEV involves architectural and training data adaptations that allow the model to process contexts interleaved with video clips and narrations, sampling of in-context examples from clusters of similar verbs and nouns, and use of data with skewed marginal distributions featuring a long tail of infrequent verbs and nouns, as well as homonyms and synonyms. Our evaluations show that EILEV-trained models outperform larger VLMs trained on huge amounts of naturalistic data at in-context learning. Furthermore, they can generalize via in-context learning not only to out-of-distribution but also to novel, rare egocentric videos and texts, demonstrating potential for applications requiring cost-effective training and rapid post-deployment adaptability. Our code and demo are available at https://github.com/yukw777/EILEV.
