Author name cluster

Yogesh Singh Rawat

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers

1 author row

AAAI Conference 2025 Conference Paper

Stable Mean Teacher for Semi-supervised Video Action Detection

Akash Kumar
Sirshapan Mitra
Yogesh Singh Rawat

In this work, we focus on semi-supervised learning for video action detection. Video action detection requires spatio-temporal localization in addition to classification, and a limited number of labels makes the model prone to unreliable predictions. We present Stable Mean Teacher, a simple end-to-end student-teacher-based framework that benefits from improved and temporally consistent pseudo labels. It relies on a novel ErrOr Recovery (EoR) module, which learns from the student's mistakes on labeled samples and transfers this knowledge to the teacher to improve pseudo labels for unlabeled samples. Moreover, existing spatio-temporal losses do not take temporal coherency into account and are prone to temporal inconsistencies. To overcome this, we present Difference of Pixels (DoP), a simple and novel constraint focused on temporal consistency, which leads to coherent temporal detections. We evaluate our approach on four different spatio-temporal detection benchmarks: UCF101-24, JHMDB21, AVA, and Youtube-VOS. Our approach outperforms the supervised baselines for action detection by an average margin of 23.5% on UCF101-24, 16% on JHMDB21, and 3.3% on AVA. Using merely 10% and 20% of the data, it achieves performance competitive with the supervised baselines trained on 100% of the annotations on UCF101-24 and JHMDB21, respectively. We further evaluate its effectiveness on AVA for scaling to large-scale datasets and on Youtube-VOS for video object segmentation, demonstrating its generalization capability to other tasks in the video domain.
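The student-teacher setup named above builds on the general mean-teacher idea, in which the teacher's weights track an exponential moving average (EMA) of the student's weights. The sketch below shows only that generic EMA update, assuming parameters are stored as NumPy arrays in a dict; it is not the authors' Stable Mean Teacher and omits their EoR module and DoP constraint. The function name `ema_update` and the decay values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ema_update(teacher, student, alpha=0.99):
    """Hypothetical helper: return teacher params moved toward the student by EMA.

    teacher, student: dicts mapping parameter names to NumPy arrays (assumed layout).
    alpha: EMA decay; larger values make the teacher change more slowly.
    """
    return {name: alpha * teacher[name] + (1.0 - alpha) * student[name]
            for name in teacher}

# Usage: a teacher starting at zero drifts toward a fixed student.
teacher = {"w": np.zeros(3)}
student = {"w": np.ones(3)}
for _ in range(10):
    teacher = ema_update(teacher, student, alpha=0.9)
print(teacher["w"])  # each entry equals 1 - 0.9**10, about 0.6513
```

Because the teacher averages many past student states, its pseudo labels tend to be smoother than the student's own predictions, which is the usual motivation for this design.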


AAAI Conference 2024 Conference Paper

Semi-supervised Active Learning for Video Action Detection

Ayush Singh
Aayush J Rana
Akash Kumar
Shruti Vyas
Yogesh Singh Rawat

In this work, we focus on label-efficient learning for video action detection. We develop a novel semi-supervised active learning approach that utilizes both labeled and unlabeled data, along with informative sample selection, for action detection. Video action detection requires spatio-temporal localization along with classification, which poses several challenges for both active learning (informative sample selection) and semi-supervised learning (pseudo label generation). First, we propose NoiseAug, a simple augmentation strategy that effectively selects informative samples for video action detection. Next, we propose fft-attention, a novel technique based on high-pass filtering that enables effective utilization of pseudo labels for SSL in video action detection by emphasizing the relevant activity regions within a video. We evaluate the proposed approach on three benchmark datasets: UCF101-24, JHMDB-21, and Youtube-VOS. First, we demonstrate its effectiveness on video action detection, where the proposed approach outperforms prior works in semi-supervised and weakly-supervised learning, along with several baseline approaches, on both UCF101-24 and JHMDB-21. Next, we show its effectiveness on Youtube-VOS for video object segmentation, demonstrating its generalization capability to other dense prediction tasks in videos.
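The abstract describes fft-attention only as being based on high-pass filtering. As a rough, hypothetical illustration of that ingredient alone, the sketch below removes low spatial frequencies from a single grayscale frame using NumPy's FFT and returns a normalized magnitude map, which responds most strongly where intensity changes sharply. The function name `highpass_attention`, the disk-shaped cutoff, and the `radius` parameter are assumptions; the paper's actual formulation may differ.

```python
import numpy as np

def highpass_attention(frame, radius=4, eps=1e-9):
    """Hypothetical sketch: high-pass filter a 2-D frame and return a map in [0, 1]."""
    h, w = frame.shape
    # Centered spectrum: fftshift moves the zero-frequency bin to (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(frame))
    yy, xx = np.ogrid[:h, :w]
    # Zero a disk of low frequencies around the DC component.
    low = (yy - h // 2) ** 2 + (xx - w // 2) ** 2 <= radius ** 2
    spectrum[low] = 0
    # Back to the spatial domain; keep the magnitude of the high-frequency residue.
    filtered = np.abs(np.fft.ifft2(np.fft.ifftshift(spectrum)))
    peak = filtered.max()
    # A (near-)constant frame has no high-frequency content: return all zeros.
    if peak <= eps:
        return np.zeros_like(filtered)
    return filtered / peak

# Usage: a bright square on a dark background yields a map highlighting its edges.
frame = np.zeros((32, 32))
frame[8:24, 8:24] = 1.0
attention = highpass_attention(frame)
```

Normalizing by the peak keeps the map in a fixed range so it could be used as a multiplicative weight; this choice is part of the sketch, not something stated in the abstract.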
