Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review

Abstract

Recent advancements in machine learning (ML) and deep learning (DL), particularly through the introduction of foundation models (FMs), have significantly enhanced surgical scene understanding within minimally invasive surgery (MIS). This paper surveys the integration of state-of-the-art ML and DL technologies, including Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and foundation models such as the Segment Anything Model (SAM), into surgical workflows. These technologies improve segmentation accuracy, instrument tracking, and phase recognition in surgical endoscopic video analysis. The paper explores the challenges these technologies face, such as data variability and computational demands, and discusses ethical considerations and integration hurdles in clinical settings. Highlighting the roles of FMs, we bridge the technological capabilities with clinical needs and outline future research directions to enhance the adaptability, efficiency, and ethical alignment of AI applications in surgery. Our findings suggest that substantial progress has been made; however, more focused efforts are required to achieve seamless integration of these technologies into clinical workflows, ensuring they complement surgical practice by enhancing precision, reducing risks, and optimizing patient outcomes.
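
To make the prompt-based segmentation workflow mentioned in the abstract concrete, the following is a minimal sketch using the publicly available segment-anything package to segment an instrument in a single endoscopic frame. It is an illustration rather than the method of any paper surveyed here; the checkpoint path, frame filename, and point-prompt coordinates are placeholders.

```python
# Minimal sketch: prompt-based instrument segmentation on one endoscopic frame
# with the Segment Anything Model (SAM). Checkpoint path, frame path, and the
# point prompt are illustrative placeholders, not values from the paper.
import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load a pretrained SAM backbone (ViT-B variant) from a local checkpoint.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
predictor = SamPredictor(sam)

# Read one endoscopic frame; SAM expects an RGB uint8 image.
frame = cv2.cvtColor(cv2.imread("endoscopic_frame.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

# A single foreground click on the instrument shaft serves as the prompt.
point_coords = np.array([[320, 240]])  # (x, y) in pixels, placeholder
point_labels = np.array([1])           # 1 = foreground point

masks, scores, _ = predictor.predict(
    point_coords=point_coords,
    point_labels=point_labels,
    multimask_output=True,             # return several candidate masks
)

# Keep the highest-scoring candidate as the instrument mask.
best_mask = masks[np.argmax(scores)]
print("Mask shape:", best_mask.shape, "predicted IoU:", scores.max())
```

In practice, surgical adaptations of SAM typically fine-tune the prompt encoder or mask decoder on endoscopic data, since zero-shot prompting alone often struggles with specular reflections, smoke, and tissue occlusion.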

@article{khan2025_2502.14886,
  title={Surgical Scene Understanding in the Era of Foundation AI Models: A Comprehensive Review},
  author={Ufaq Khan and Umair Nawaz and Adnan Qayyum and Shazad Ashraf and Muhammad Bilal and Junaid Qadir},
  journal={arXiv preprint arXiv:2502.14886},
  year={2025}
}