EAAI Journal 2026 Journal Article
A lightweight and real-time surgical action detection framework using multi-contextual and decoupled representations
- Siming Zheng
- A.S.M. Sharifuzzaman Sagar
- Yu Chen
- Jun Hoong Chan
- Zehao Yu
- Shi Ying
- Jianfeng Lu
Accurate detection of surgical actions in minimally invasive procedures is a critical step toward intelligent operative assistance systems. In this work, we propose the Surgical You Only Look Once detector (Surg-YOLO), an efficient, high-precision surgical action detection framework built on the YOLO version 11 (YOLOv11) architecture and optimized for the spatio-temporal complexity of surgical environments. Surg-YOLO integrates three architectural innovations: the Enhanced Spatial Pyramid Pooling-Fast (ESPPF) module, which captures rich multi-scale spatial features; the Spatio-Temporal Multi-scale Context Aggregation Module (ST-MCAM), which enhances temporal reasoning and contextual awareness across frames; and the Decoupled Dual-Branch Prediction Head (DDPH), which refines classification and localization independently. Extensive experiments on a large-scale surgical action dataset show that Surg-YOLO outperforms existing baseline models in detection accuracy across multiple evaluation thresholds. Qualitative visualizations further confirm the model’s ability to localize subtle and concurrent surgical actions precisely. These results highlight Surg-YOLO’s potential as a reliable solution for real-time surgical action detection.
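The abstract does not define its "evaluation thresholds". In object detection they are commonly intersection-over-union (IoU) cutoffs that decide whether a predicted box counts as a correct detection. The following is a minimal sketch under that assumption, not the authors' evaluation code; the function names, box format `(x1, y1, x2, y2)`, and the threshold values are all hypothetical.

```python
from typing import Dict, Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def matches_at_thresholds(
    pred: Box,
    gt: Box,
    thresholds: Sequence[float] = (0.5, 0.75, 0.9),  # illustrative values only
) -> Dict[float, bool]:
    """For each IoU threshold, report whether pred counts as a hit on gt."""
    score = iou(pred, gt)
    return {t: score >= t for t in thresholds}


# A predicted action box that is slightly taller than the ground truth.
result = matches_at_thresholds((0, 0, 10, 10), (0, 0, 10, 8))
```

A stricter threshold demands tighter localization, so the same prediction can count as correct at a loose cutoff and incorrect at a strict one. Reporting accuracy at several cutoffs therefore separates coarse detection quality from precise localization quality.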