DST-Det: Simple Dynamic Self-Training for Open-Vocabulary Object Detection
Open-vocabulary object detection (OVOD) aims to detect objects beyond the set of classes observed during training. This work presents a simple yet effective strategy that leverages the zero-shot classification ability of pre-trained vision-language models (VLMs), such as CLIP, to directly discover proposals of possible novel classes. Unlike previous works that ignore novel classes during training and rely solely on the region proposal network (RPN) for novel object detection, our method selectively filters proposals based on specific design criteria. The resulting sets of identified proposals serve as pseudo-labels for potential novel classes during the training phase. This self-training strategy improves the recall and accuracy of novel classes without requiring additional annotations or datasets. We further propose a simple offline pseudo-label generation strategy to refine the object detector. Empirical evaluations on three datasets, including LVIS, V3Det, and COCO, demonstrate significant improvements over the baseline performance without incurring additional parameters or computational costs during inference. In particular, compared with the previous method F-VLM, our method achieves a 1.7% improvement on the LVIS dataset. We also achieve an improvement of over 6.5% on the recent challenging V3Det dataset. When combined with the recent method CLIPSelf, our method also achieves 46.7 novel-class AP on COCO without introducing extra data for pretraining.
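To illustrate the core idea of mining pseudo-labels for novel classes with a VLM, the sketch below scores RPN proposals with CLIP zero-shot classification and keeps confident ones. This is a minimal illustration, not the paper's exact pipeline: it assumes the OpenAI `clip` package, proposals supplied by an external RPN as `(x1, y1, x2, y2)` boxes, and an illustrative score threshold and top-1 filtering rule in place of the paper's specific design criteria.

```python
# Sketch: CLIP-based pseudo-label mining for novel classes.
# Assumptions: `pip install clip` (OpenAI CLIP); the class names,
# prompt template, and 0.8 threshold are illustrative placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

novel_classes = ["unicycle", "beaker", "gondola"]  # hypothetical novel names
prompts = [f"a photo of a {c}" for c in novel_classes]
with torch.no_grad():
    text_feats = model.encode_text(clip.tokenize(prompts).to(device))
    text_feats /= text_feats.norm(dim=-1, keepdim=True)

def mine_pseudo_labels(image: Image.Image, proposals, score_thresh=0.8):
    """Score each RPN proposal with CLIP zero-shot classification and
    keep confident detections as pseudo-labels for novel classes."""
    pseudo_labels = []
    for box in proposals:  # box = (x1, y1, x2, y2) in pixel coordinates
        crop = preprocess(image.crop(box)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(crop)
            img_feat /= img_feat.norm(dim=-1, keepdim=True)
            probs = (100.0 * img_feat @ text_feats.T).softmax(dim=-1)[0]
        score, idx = probs.max(dim=0)
        if score.item() >= score_thresh:  # simple confidence filter
            pseudo_labels.append((box, novel_classes[idx], score.item()))
    return pseudo_labels
```

In the actual method these pseudo-labels would be fed back into detector training as supervision for novel classes; here the crop-and-classify loop only conveys the mechanism of turning class-agnostic proposals into class-labeled training signal.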