AAAI Conference 2025 Conference Paper
Stable Mean Teacher for Semi-supervised Video Action Detection
- Akash Kumar
- Sirshapan Mitra
- Yogesh Singh Rawat
In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatio-temporal localization in addition to classification, and a limited amount of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end student-teacher-based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel ErrOr Recovery (EoR) module, which learns from students' mistakes on labeled samples and transfers this to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatio-temporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To overcome this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, which leads to coherent temporal detections. We evaluate our approach on four different spatio-temporal detection benchmarks: UCF101-24, JHMDB21, AVA, and Youtube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of data, it provides a competitive performance compared to the supervised baseline trained on 100% annotations on UCF101-24 and JHMDB21 respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and Youtube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain.