
AAAI 2026

MSTDiff: Multiscale-Aware Transformer Diffusion Network for Video Object Detection

Conference Paper, AAAI Technical Track on Computer Vision VII

Abstract

Video object detection is a fundamental yet challenging task in computer vision. Recently, DETR-based methods have gained prominence in this domain owing to their powerful global modeling capabilities. However, these methods still face two key limitations: frame-agnostic initialization of object queries and scale-agnostic attention mechanisms, which hinder their ability to capture the appearance variations of dynamic objects and to model temporal consistency across frames. To alleviate these limitations, we propose a multiscale-aware transformer diffusion network (MSTDiff), a novel framework for the video object detection task that introduces two technical improvements over existing methods. First, we design a diffusion-driven adaptive query module, which models the object query distribution through a diffusion process conditioned on input frames, enabling an adaptive, content-aware initialization of object queries. Second, we develop a multiscale-aware transformer encoder module, which combines multi-head convolutional units with attention mechanisms to enhance multi-scale feature representations while preserving global dependency modeling. We conduct extensive experiments on the public ImageNet VID dataset, and the results demonstrate that our MSTDiff achieves 87.7% mAP with a ResNet-101 backbone, outperforming most previous state-of-the-art video object detection methods.
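The core idea of the diffusion-driven query module can be illustrated with a toy sketch: object queries are drawn by running a DDPM-style reverse diffusion process whose noise estimate is conditioned on a per-frame feature. Everything here (the linear stand-in denoiser `W`, the `frame_feat` conditioning, the schedule) is illustrative and assumed, not the paper's learned components.

```python
import numpy as np

def diffusion_query_init(frame_feat, num_queries=4, dim=8, steps=10, seed=0):
    """Toy sketch of frame-conditioned query initialization via
    reverse diffusion (hypothetical; not the paper's implementation)."""
    rng = np.random.default_rng(seed)
    q = rng.standard_normal((num_queries, dim))   # q_T ~ N(0, I)
    betas = np.linspace(1e-4, 0.02, steps)        # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    # Stand-in for a learned denoiser: a fixed linear map applied to the
    # query plus the frame condition (broadcast over queries).
    W = rng.standard_normal((dim, dim)) * 0.1
    for t in reversed(range(steps)):
        eps_hat = (q + frame_feat) @ W            # conditioned noise estimate
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        q = (q - coef * eps_hat) / np.sqrt(alphas[t])   # DDPM posterior mean
        if t > 0:                                 # add noise except at t = 0
            q = q + np.sqrt(betas[t]) * rng.standard_normal(q.shape)
    return q                                      # content-aware initial queries
```

The sketch only shows the mechanism: different frame features steer the reverse process toward different query initializations, in contrast to the frame-agnostic learned embeddings used by standard DETR variants.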


Context

Venue
AAAI Conference on Artificial Intelligence