When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models
- ObjDVLM

Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing a supervised YOLO detector with zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Using evaluations on 5,000 stratified COCO images and 500 diverse product images, combined with Total Cost of Ownership (TCO) modeling, we derive break-even thresholds for architecture selection. Results show supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini Flash 2.5 and 71.3% for GPT-4 on standard categories; supervised deployment incurs a fixed annotation expense for a 100-category system, while per-inference costs at 100,000 inferences are $0.00050 for Gemini Flash 2.5 and $0.143 for GPT-4. We provide decision frameworks showing that the optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.
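As a rough illustration of the break-even logic behind the TCO comparison, the sketch below computes the inference volume at which a supervised pipeline's fixed annotation cost is amortized against a VLM's per-call price. The per-inference prices for Gemini Flash 2.5 and GPT-4 are the ones quoted above; the annotation cost and self-hosted inference cost are hypothetical placeholders, not figures from the paper.

```python
# Hedged sketch of a break-even calculation between a supervised detector
# and a zero-shot VLM. The annotation and self-hosted inference costs below
# are illustrative assumptions, not the paper's reported values.

def break_even_inferences(fixed_cost_supervised: float,
                          cost_per_inf_supervised: float,
                          cost_per_inf_vlm: float) -> float:
    """Inference volume N* at which the two total costs match.

    Supervised TCO(N) = fixed_cost_supervised + N * cost_per_inf_supervised
    Zero-shot  TCO(N) = N * cost_per_inf_vlm
    Setting them equal gives N* = fixed / (vlm - supervised).
    """
    marginal_saving = cost_per_inf_vlm - cost_per_inf_supervised
    if marginal_saving <= 0:
        return float("inf")  # the VLM is never undercut on marginal cost
    return fixed_cost_supervised / marginal_saving


if __name__ == "__main__":
    ANNOTATION_COST = 50_000.0   # placeholder: labeling + training a 100-category YOLO
    YOLO_COST_PER_INF = 0.0001   # placeholder: amortized self-hosted GPU cost
    for name, vlm_cost in [("Gemini Flash 2.5", 0.00050), ("GPT-4", 0.143)]:
        n_star = break_even_inferences(ANNOTATION_COST, YOLO_COST_PER_INF, vlm_cost)
        print(f"{name}: supervised pipeline pays off after ~{n_star:,.0f} inferences")
```

Under these assumed costs, the higher a VLM's per-call price, the sooner the fixed annotation investment is recovered; the paper's decision frameworks generalize this trade-off across volume, category stability, budget, and accuracy requirements.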