DiffusionPose: Markov-Optimized Diffusion Model for Human Pose Estimation

Zhigang Wang; Zhenguang Liu; Shaojing Fan; Sifan Wu; Yingying Jiao

doi:10.1609/aaai.v40i12.38012


AAAI 2026


Conference Paper | AAAI Technical Track on Computer Vision IX | Artificial Intelligence


Abstract

Video-based human pose estimation has long been a nontrivial task due to its dynamic nature and challenging detection scenarios such as occlusion and defocus. Inspired by the success of diffusion models, researchers have applied them to video pose estimation, outperforming traditional joint detection methods. However, existing diffusion model-based methods still face challenges like slow convergence and unstable pose generation. To tackle these issues, we propose DiffusionPose, a novel framework for video pose estimation that integrates diffusion models with optimization strategies: (1) We combine the emerging Mamba with Transformers to balance global and local spatio-temporal modeling. (2) We integrate Markov Random Fields into the reverse diffusion process to enhance the denoising of pose heatmaps, particularly addressing the issue of confused generation of occluded joints. (3) We mathematically formulate a Markov objective to supervise the heatmap denoising process, enabling the model to generate anatomically plausible skeletons. Our method achieves state-of-the-art performance on three large-scale benchmark datasets. Interestingly, it shows surprising robustness in challenging video scenarios, improving the accuracy of the most difficult ankle joint by 16.9% compared to the previous best diffusion model-based method on the Challenging-PoseTrack dataset.
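
To make the role of a Markov Random Field over joint heatmaps more concrete, here is a minimal sketch of one round of neighbor-to-neighbor message passing on a skeleton graph. It is not the DiffusionPose implementation, which the abstract does not specify. The three-joint chain, the box-shaped pairwise potential, and the names `EDGES`, `box_blur`, and `mrf_refine` are all assumptions made for illustration. The idea shown is only that a confident neighboring joint (the knee) can suppress a spurious peak in an ambiguous heatmap (the ankle).

```python
import numpy as np

# Hypothetical 3-joint chain: 0 = hip, 1 = knee, 2 = ankle (not from the paper).
EDGES = [(0, 1), (1, 2)]


def normalize(h):
    """Clip to positive values and scale each heatmap to sum to 1."""
    h = np.clip(h, 1e-12, None)
    return h / h.sum(axis=(-2, -1), keepdims=True)


def box_blur(h, radius):
    """Sum each pixel's (2*radius+1)^2 neighborhood with zero padding.

    Assumed stand-in for a pairwise potential meaning
    'connected joints lie within `radius` pixels of each other'.
    """
    height, width = h.shape
    padded = np.pad(h, radius)
    out = np.zeros_like(h)
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out += padded[dy:dy + height, dx:dx + width]
    return out


def mrf_refine(heatmaps, edges=EDGES, radius=3, eps=1e-6):
    """One round of message passing: reweight each joint by its neighbors."""
    beliefs = normalize(heatmaps)
    refined = beliefs.copy()
    for i, j in edges:
        # Messages are computed from the unrefined beliefs of the other joint.
        refined[i] *= box_blur(beliefs[j], radius) + eps
        refined[j] *= box_blur(beliefs[i], radius) + eps
    return normalize(refined)


def peak(h):
    """Return the (row, col) of the heatmap maximum as plain ints."""
    return tuple(int(v) for v in np.unravel_index(np.argmax(h), h.shape))


# Toy input: the ankle heatmap has a true peak near the knee and a
# slightly stronger spurious peak far away (e.g. caused by occlusion).
heatmaps = np.zeros((3, 16, 16))
heatmaps[0, 4, 4] = 1.0      # hip
heatmaps[1, 6, 5] = 1.0      # knee
heatmaps[2, 8, 6] = 0.9      # ankle, plausible location
heatmaps[2, 14, 14] = 1.0    # ankle, spurious location

refined = mrf_refine(heatmaps)
print("ankle peak before:", peak(heatmaps[2]))
print("ankle peak after: ", peak(refined[2]))
```

In this toy setup the spurious ankle peak receives almost no support from the knee heatmap, so the refined maximum moves to the location that is consistent with the skeleton. A learned pairwise potential and repeated application inside each reverse diffusion step would be natural extensions, but both are assumptions beyond what the abstract states.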
