ResearchTrend.AI

arXiv:2309.09858
Unsupervised Open-Vocabulary Object Localization in Videos

18 September 2023
Ke Fan, Zechen Bai, Tianjun Xiao, Dominik Zietlow, Max Horn, Zixu Zhao, Carl-Johann Simon-Gabriel, Mike Zheng Shou, Francesco Locatello, Bernt Schiele, Thomas Brox, Zheng-Wei Zhang, Yanwei Fu, Tong He
Abstract

In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via an object-centric approach with slot attention and then assigns text to the obtained slots. The latter is achieved by reading localized semantic information from the pre-trained CLIP model in an unsupervised manner. The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks.
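The two-stage pipeline in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the simplified `slot_attention` routine (no learned projections or GRU update), the feature dimensions, and the random stand-in features are all assumptions made for the sake of a runnable example; in the paper, image features and text embeddings would come from the pre-trained CLIP model.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention(feats, n_slots=4, dim=64, iters=3, seed=0):
    """Simplified slot attention: slots compete for input features.

    feats: (N, dim) array of per-frame patch features (stand-ins here).
    Returns (n_slots, dim) slot vectors, each ideally binding one object.
    """
    rng = np.random.default_rng(seed)
    slots = rng.normal(size=(n_slots, dim))
    for _ in range(iters):
        # Softmax over the *slot* axis makes slots compete for each feature.
        attn = softmax(slots @ feats.T / np.sqrt(dim), axis=0)  # (n_slots, N)
        attn = attn / attn.sum(axis=1, keepdims=True)           # weighted mean
        slots = attn @ feats                                    # update slots
    return slots

def label_slots(slots, text_emb, labels):
    """Stage two: assign each slot the label of the closest text embedding
    (cosine similarity), mimicking open-vocabulary naming with CLIP text
    features. `text_emb` here is random; in practice it would be CLIP's
    encoding of candidate category names."""
    s = slots / np.linalg.norm(slots, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    return [labels[i] for i in (s @ t.T).argmax(axis=1)]

if __name__ == "__main__":
    feats = np.random.default_rng(1).normal(size=(100, 64))   # fake patch features
    slots = slot_attention(feats)                             # (4, 64)
    text_emb = np.random.default_rng(2).normal(size=(3, 64))  # fake text features
    print(label_slots(slots, text_emb, ["cat", "dog", "car"]))
```

The key property the sketch preserves is that attention is normalized over slots rather than over features, which is what makes slots partition the input into object-like groups before any text is involved.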
