v1v2 (latest)

When Does Supervised Training Pay Off? The Hidden Economics of Object Detection in the Era of Vision-Language Models

13 October 2025

Samer Al-Hamadani

ObjD

VLM

ArXiv (abs)PDF HTML Github (47636★)

Main:28 Pages

12 Figures

Bibliography:2 Pages

4 Tables

Abstract

Object detection traditionally relies on costly manual annotation. We present the first comprehensive cost-effectiveness analysis comparing supervised YOLO and zero-shot vision-language models (Gemini Flash 2.5 and GPT-4). Evaluated on 5,000 stratified COCO images and 500 diverse product images, combined with Total Cost of Ownership modeling, we derive break-even thresholds for architecture selection. Results show supervised YOLO attains 91.2% accuracy versus 68.5% for Gemini and 71.3% for GPT-4 on standard categories; the annotation expense for a 100-category system is $10,800, and the accuracy advantage only pays off beyond 55 million inferences (151,000 images/day for one year). On diverse product categories Gemini achieves 52.3% and GPT-4 55.1%, while supervised YOLO cannot detect untrained classes. Cost-per-correct-detection favors Gemini ($ 0.00050) and GPT-4 ( $0.00067) over YOLO ($ 0.143) at 100,000 inferences. We provide decision frameworks showing that optimal architecture choice depends on inference volume, category stability, budget, and accuracy requirements.

View on arXiv

Comments on this paper