
When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

28 pages (main), 12 figures, 4 tables, 2-page bibliography
Abstract

Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Evaluating on 5,000 stratified COCO images and 500 diverse product images, and combining the results with Total Cost of Ownership (TCO) modeling, we derive break-even thresholds for architecture selection. Results show supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4 on standard categories; the annotation expense for a 100-category system is $10,800, and the accuracy advantage only pays off beyond 55 million inferences (151,000 images/day for one year). On diverse product categories Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. Cost-per-correct-detection favors Gemini ($0.00050) and GPT-4 ($0.00067) over YOLO ($0.143) at 100,000 inferences. We provide decision frameworks showing that optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.
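As a rough illustration of the break-even arithmetic in the abstract, the Python sketch below computes cost-per-correct-detection (total cost divided by expected correct detections) and the inference volume at which a fixed upfront cost is recovered. The accuracies, the $10,800 annotation cost, and the 100,000-inference comparison point come from the abstract; the per-inference unit costs are hypothetical placeholders chosen so the break-even lands near the reported ~55 million inferences, and the paper's full TCO model includes further cost components, so exact figures differ.

```python
# Sketch of cost-per-correct-detection and break-even volume.
# Unit costs below are assumed placeholders, not figures from the paper.

FIXED_COST = {"yolo": 10_800.0, "gemini": 0.0, "gpt4": 0.0}        # upfront annotation cost (USD)
UNIT_COST = {"yolo": 1.44e-4, "gemini": 3.40e-4, "gpt4": 4.78e-4}  # assumed per-inference cost (USD)
ACCURACY = {"yolo": 0.912, "gemini": 0.685, "gpt4": 0.713}         # standard-category accuracy


def cost_per_correct_detection(model: str, n: int) -> float:
    """Total cost of ownership divided by the expected number of correct detections."""
    total = FIXED_COST[model] + UNIT_COST[model] * n
    return total / (n * ACCURACY[model])


def break_even_volume(fixed: float, own_unit: float, rival_unit: float) -> float:
    """Inference count at which the option with the fixed cost becomes cheaper overall."""
    return fixed / (rival_unit - own_unit)


if __name__ == "__main__":
    n = 100_000
    for m in ("yolo", "gemini", "gpt4"):
        print(f"{m}: ${cost_per_correct_detection(m, n):.5f} per correct detection at {n:,} inferences")
    vol = break_even_volume(FIXED_COST["yolo"], UNIT_COST["yolo"], UNIT_COST["gemini"])
    print(f"YOLO breaks even vs. Gemini at about {vol:,.0f} inferences")
```

Under these assumed unit costs, the break-even volume is fixed cost divided by the per-inference cost gap, about 55.1 million inferences, consistent with the abstract's threshold.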
