
Find Everything: A General Vision Language Model Approach to Multi-Object Search

1 October 2024
Daniel Choi
Angus Fung
Haitong Wang
Aaron Hao Tan
Abstract

The Multi-Object Search (MOS) problem involves navigating to a sequence of locations to maximize the likelihood of finding target objects while minimizing travel costs. In this paper, we introduce a novel approach to the MOS problem, called Finder, which leverages vision language models (VLMs) to locate multiple objects across diverse environments. Specifically, our approach introduces multi-channel score maps to track and reason about multiple objects simultaneously during navigation, along with a score map technique that combines scene-level and object-level semantic correlations. Experiments in both simulated and real-world settings showed that Finder outperforms existing methods using deep reinforcement learning and VLMs. Ablation and scalability studies further validated our design choices and robustness with increasing numbers of target objects, respectively. Website: this https URL
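To make the multi-channel score map idea concrete, below is a minimal Python sketch of how such maps could be fused and queried. The function names, the linear fusion rule, and the alpha weight are all assumptions for illustration; the paper's actual score map construction and VLM scoring are not specified in the abstract.

import numpy as np

# Hypothetical sketch: one HxW score channel per target object, each fusing
# a scene-level prior (e.g. room-type relevance) with an object-level
# correlation map (e.g. VLM similarity to the target). The fusion rule and
# names below are assumptions, not the paper's implementation.

def fuse_score_maps(scene_prior, object_scores, alpha=0.5):
    """Combine scene-level and object-level scores per target channel.

    scene_prior:   (H, W) array, shared across all targets.
    object_scores: (K, H, W) array, one channel per target object.
    alpha:         weight on the object-level term (assumed hyperparameter).
    """
    fused = (1.0 - alpha) * scene_prior[None, :, :] + alpha * object_scores
    return fused  # (K, H, W) multi-channel score map

def next_waypoint(fused, found_mask):
    """Pick the most promising map cell over all not-yet-found targets."""
    active = fused[~found_mask]  # drop channels for already-found objects
    flat_best = np.argmax(active.max(axis=0))
    return np.unravel_index(flat_best, active.shape[1:])  # (row, col)

# Toy usage on a 4x4 grid with two target objects.
H = W = 4
scene = np.random.rand(H, W)
objects = np.random.rand(2, H, W)
maps = fuse_score_maps(scene, objects)
print(next_waypoint(maps, found_mask=np.array([False, False])))

Keeping one channel per target, rather than a single shared map, lets the planner reason about all remaining objects at once and naturally supports more targets by adding channels, which is consistent with the scalability behavior the abstract describes.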

@article{choi2025_2410.00388,
  title={Find Everything: A General Vision Language Model Approach to Multi-Object Search},
  author={Daniel Choi and Angus Fung and Haitong Wang and Aaron Hao Tan},
  journal={arXiv preprint arXiv:2410.00388},
  year={2025}
}