Arrow Research search

Author name cluster

Sifan Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
1 author row

Possible papers (13)

AAAI Conference 2026 Conference Paper

Attentive Keypoint Identification: Progressive Spatiotemporal Refinement for Video-based Human Pose Estimation

  • Sifan Wu
  • Haipeng Chen
  • Yingda Lyu
  • Shaojing Fan
  • Zhigang Wang
  • Zhenguang Liu
  • Yingying Jiao

Video-based human pose estimation has vast applications such as action recognition, sports analytics, and crime detection. However, this task is challenging as it involves interpreting both spatial context and temporal dynamics to accurately localize human anatomical keypoints in video sequences. Current approaches, often based on attention mechanisms, perform well but struggle in challenging scenarios like rapid motion and pose occlusion. We attribute these failures to two fundamental limitations: spatial uniformity, where models indiscriminately assign attention to both joint-relevant features and background clutter, thereby introducing spatial noise; and temporal rigidity, an inability to adapt to large joint displacements, resulting in severe feature misalignment during rapid motion. To overcome these challenges, we introduce PSTPose, a novel progressive spatiotemporal refinement framework. Specifically, to address the spatial uniformity problem, we propose a Discriminative Feature Enhancement (DFE) module that emphasizes joint-relevant features and a Feature Cluster Grouping (FCG) module that forms compact, semantically meaningful regions. For the temporal rigidity problem, we introduce a Deformable Spatiotemporal Fusion (DSF) module that adaptively aligns features across consecutive frames via deformation-aware sampling. This design ensures robust keypoint localization, particularly in cluttered and dynamic scenes. Extensive experiments on three large-scale benchmarks (PoseTrack2017, PoseTrack2018, and PoseTrack21) demonstrate that PSTPose establishes a new state-of-the-art.

AAAI Conference 2026 Conference Paper

DiffusionPose: Markov-Optimized Diffusion Model for Human Pose Estimation

  • Zhigang Wang
  • Zhenguang Liu
  • Shaojing Fan
  • Sifan Wu
  • Yingying Jiao

Video-based human pose estimation has long been a nontrivial task due to its dynamic nature and challenging detection scenarios such as occlusion and defocus. Inspired by the success of diffusion models, researchers have applied them to video pose estimation, outperforming traditional joint detection methods. However, existing diffusion model-based methods still face challenges like slow convergence and unstable pose generation. To tackle these issues, we propose DiffusionPose, a novel framework for video pose estimation that integrates diffusion models with optimization strategies: (1) We combine the emerging Mamba with Transformers to balance global and local spatio-temporal modeling. (2) We integrate Markov Random Fields into the reverse diffusion process to enhance the denoising of pose heatmaps, particularly addressing the issue of confused generation of occluded joints. (3) We mathematically formulate a Markov objective to supervise the heatmap denoising process, enabling the model to generate anatomically plausible skeletons. Our method achieves state-of-the-art performance on three large-scale benchmark datasets. Interestingly, it shows surprising robustness in challenging video scenarios, improving the accuracy of the most difficult ankle joint by 16.9% compared to the previous best diffusion model-based method on the Challenging-PoseTrack dataset.

AAAI Conference 2026 Conference Paper

Dual Coding Theory in Action: Language-Assisted Human Pose Estimation in Videos

  • Sifan Wu
  • Haipeng Chen
  • Yingda Lyu
  • Shaojing Fan
  • Zhigang Wang
  • Zhenguang Liu
  • Yingying Jiao

Video-based human pose estimation aims to localize keypoints across frames, enabling robust analysis of human motion in applications such as sports, surveillance, and healthcare. However, existing methods rely solely on visual cues, limiting their robustness in complex scenes involving occlusion, motion blur, or poor lighting. In contrast, dual coding theory from psychology suggests that human cognition is inherently multimodal: we learn by integrating visual perception with linguistic context to form structured, semantic understandings of the world. Visual input provides concrete spatiotemporal grounding, while language offers symbolic abstraction that enhances reasoning and generalization. Motivated by this cognitive principle, we present the first framework that explicitly incorporates language as an auxiliary modality to enhance video-based pose estimation. To address the lack of paired video-text datasets, we first employ a Multimodal Large Language Model (MLLM) to generate textual descriptions of human interactions from videos. We then propose a novel coarse-to-fine multimodal alignment pipeline: a cross-modal semantic interaction module establishes initial grounding between spatiotemporal visual features and textual embeddings, while an optimal transport-based feature matching mechanism enforces fine-grained, geometry-aware alignment. This cognitively inspired design enables more accurate and robust pose estimation, especially in visually challenging scenes like occlusion and motion blur. Extensive experiments on three benchmarks confirm that our method consistently outperforms state-of-the-art approaches.

AAAI Conference 2026 Conference Paper

What to Ask Next? Probing the Imaginative Reasoning of LLMs with TurtleSoup Puzzles

  • Mengtao Zhou
  • Sifan Wu
  • Huan Zhang
  • Qi Sima
  • Bang Liu

We investigate the capacity of Large Language Models (LLMs) for imaginative reasoning—the proactive construction, testing, and revision of hypotheses in information-sparse environments. Existing benchmarks, often static or focused on social deduction, fail to capture the dynamic, exploratory nature of this reasoning process. To address this gap, we introduce a comprehensive research framework based on the classic "Turtle Soup" game, integrating a benchmark, an agent, and an evaluation protocol. We present TurtleSoup-Bench, the first large-scale, bilingual, interactive benchmark for imaginative reasoning, comprising 800 turtle soup stories sourced from both the Internet and expert authors. We also propose Mosaic-Agent, a novel agent designed to assess LLMs' performance in this setting. To evaluate reasoning quality, we develop a multi-dimensional protocol measuring logical consistency, detail completion, and conclusion alignment. Experiments with leading LLMs reveal clear capability limits, common failure patterns, and a significant performance gap compared to humans. Our work offers new insights into LLMs' imaginative reasoning and establishes a foundation for future research on exploratory agent behavior.

AAAI Conference 2025 Conference Paper

Causal-Inspired Multitask Learning for Video-Based Human Pose Estimation

  • Haipeng Chen
  • Sifan Wu
  • Zhigang Wang
  • Yifang Yin
  • Yingying Jiao
  • Yingda Lyu
  • Zhenguang Liu

Video-based human pose estimation has long been a fundamental yet challenging problem in computer vision. Previous studies focus on spatio-temporal modeling through the enhancement of architecture design and optimization strategies. However, they overlook the causal relationships among the joints, leading to models that may be overly tailored to the training data and thus generalize poorly to challenging scenes. Therefore, adequate causal reasoning capability, coupled with good interpretability of the model, are both indispensable prerequisites for achieving reliable results. In this paper, we pioneer a causal perspective on pose estimation and introduce a causal-inspired multitask learning framework, consisting of two stages. In the first stage, we try to endow the model with causal spatio-temporal modeling ability by introducing two self-supervision auxiliary tasks. Specifically, these auxiliary tasks enable the network to infer challenging keypoints based on observed keypoint information, thereby imbuing causal reasoning capabilities into the model and making it robust to challenging scenes. In the second stage, we argue that not all feature tokens contribute equally to pose estimation. Prioritizing causal (keypoint-relevant) tokens is crucial to achieve reliable results, which could improve the interpretability of the model. To this end, we propose a Token Causal Importance Selection module to identify the causal tokens and non-causal tokens (e.g., background and objects). Additionally, non-causal tokens could provide potentially beneficial cues but may be redundant. We further introduce a non-causal token clustering module to merge the similar non-causal tokens. Extensive experiments show that our method outperforms state-of-the-art methods on three large-scale benchmark datasets.

AAAI Conference 2025 Conference Paper

Optimizing Human Pose Estimation Through Focused Human and Joint Regions

  • Yingying Jiao
  • Zhigang Wang
  • Zhenguang Liu
  • Shaojing Fan
  • Sifan Wu
  • Zheqi Wu
  • Zhuoyue Xu

Human pose estimation has given rise to a broad spectrum of novel and compelling applications, including action recognition, sports analysis, as well as surveillance. However, accurate video pose estimation remains an open challenge. One aspect that has been overlooked so far is that existing methods learn motion clues from all pixels rather than focusing on the target human body, making them easily misled and disrupted by unimportant information such as background changes or movements of other people. Additionally, while current Transformer-based pose estimation methods have demonstrated impressive performance with global modeling, they struggle with local context perception and precise positional identification. In this paper, we try to tackle these challenges from three aspects: (1) We propose a bilayer Human-Keypoint Mask module that performs coarse-to-fine visual token refinement, which gradually zooms in on the target human body and keypoints while masking out unimportant figure regions. (2) We further introduce a novel deformable cross attention mechanism and a bidirectional separation strategy to adaptively aggregate spatial and temporal motion clues from constrained surrounding contexts. (3) We mathematically formulate the deformable cross attention, constraining the model to focus solely on regions centered on the target person's body. Empirically, our method achieves state-of-the-art performance on three large-scale benchmark datasets. A remarkable highlight is that our method achieves an 84.8 mean Average Precision (mAP) on the challenging wrist joint, which significantly outperforms the 81.5 mAP achieved by the current state-of-the-art method on the PoseTrack2017 dataset.

AAAI Conference 2025 Conference Paper

SpatioTemporal Learning for Human Pose Estimation in Sparsely-Labeled Videos

  • Yingying Jiao
  • Zhigang Wang
  • Sifan Wu
  • Shaojing Fan
  • Zhenguang Liu
  • Zhuoyue Xu
  • Zheqi Wu

Human pose estimation in videos remains a challenge, largely due to the reliance on extensive manual annotation of large datasets, which is expensive and labor-intensive. Furthermore, existing approaches often struggle to capture long-range temporal dependencies and overlook the complementary relationship between temporal pose heatmaps and visual features. To address these limitations, we introduce STDPose, a novel framework that enhances human pose estimation by learning spatiotemporal dynamics in sparsely-labeled videos. STDPose incorporates two key innovations: 1) A novel Dynamic-Aware Mask to capture long-range motion context, allowing for a nuanced understanding of pose changes. 2) A system for encoding and aggregating spatiotemporal representations and motion dynamics to effectively model spatiotemporal relationships, improving the accuracy and robustness of pose estimation. STDPose establishes a new performance benchmark for both video pose propagation (i.e., propagating pose annotations from labeled frames to unlabeled frames) and pose estimation tasks, across three large-scale evaluation datasets. Additionally, utilizing pseudo-labels generated by pose propagation, STDPose achieves competitive performance with only 26.7% labeled data.

AAAI Conference 2023 Conference Paper

Identify Event Causality with Knowledge and Analogy

  • Sifan Wu
  • Ruihui Zhao
  • Yefeng Zheng
  • Jian Pei
  • Bang Liu

Event causality identification (ECI) aims to identify the causal relationship between events, which plays a crucial role in deep text understanding. Due to the diversity of real-world causality events and the difficulty of obtaining sufficient training data, existing ECI approaches have poor generalizability and struggle to identify the relation between seldom-seen events. In this paper, we propose to utilize both external knowledge and internal analogy to improve ECI. On the one hand, we utilize a commonsense knowledge graph called ConceptNet to enrich the description of an event sample and reveal the commonalities or associations between different events. On the other hand, we retrieve similar events as analogy examples and glean useful experiences from such analogous neighbors to better identify the relationship between a new event pair. By better understanding different events through external knowledge and making an analogy with similar events, we can alleviate the data sparsity issue and improve model generalizability. Extensive evaluations on two benchmark datasets show that our model outperforms other baseline methods by around 18% in F1 score on average.

IJCAI Conference 2022 Conference Paper

Copy Motion From One to Another: Fake Motion Video Generation

  • Zhenguang Liu
  • Sifan Wu
  • Chejian Xu
  • Xiang Wang
  • Lei Zhu
  • Shuang Wu
  • Fuli Feng

One compelling application of artificial intelligence is to generate a video of a target person performing arbitrary desired motion (from a source person). While the state-of-the-art methods are able to synthesize a video demonstrating similar broad stroke motion details, they are generally lacking in texture details. A pertinent manifestation appears as distorted faces, feet, and hands, and such flaws are very sensitively perceived by human observers. Furthermore, current methods typically employ GANs with an L2 loss to assess the authenticity of the generated videos, inherently requiring a large amount of training samples to learn the texture details for adequate video generation. In this work, we tackle these challenges from three aspects: 1) We disentangle each video frame into foreground (the person) and background, focusing on generating the foreground to reduce the underlying dimension of the network output. 2) We propose a theoretically motivated Gromov-Wasserstein loss that facilitates learning the mapping from a pose to a foreground image. 3) To enhance texture details, we encode facial features with geometric guidance and employ local GANs to refine the face, feet, and hands. Extensive experiments show that our method is able to generate realistic target person videos, faithfully copying complex motions from a source person. Our code and datasets are released at https://github.com/Sifann/FakeMotion.

JBHI Journal 2022 Journal Article

Semi-Supervised Learning for Automatic Atrial Fibrillation Detection in 24-Hour Holter Monitoring

  • Peng Zhang
  • Yuting Chen
  • Fan Lin
  • Sifan Wu
  • Xiaoyun Yang
  • Qiang Li

Paroxysmal atrial fibrillation (AF) is generally diagnosed by long-term dynamic electrocardiogram (ECG) monitoring. Identifying AF episodes from long-term ECG data can place a heavy burden on clinicians. Many machine-learning-based automatic AF detection methods have been proposed to solve this issue. However, these methods require numerous annotated data to train the model, and the annotation of AF in long-term ECG is extremely time-consuming. Reducing the demand for labeled data can effectively improve the clinical practicability of automatic AF detection methods. In this study, we developed a novel semi-supervised learning method that generated modified low-entropy labels of unlabeled samples for training a deep learning model to automatically detect paroxysmal AF in 24 h Holter monitoring data. Our method employed a 1D CNN-LSTM neural network with RR intervals as input and used a small amount of labeled training data together with numerous unlabeled data for training the neural network. This method was evaluated using a 24 h Holter monitoring dataset collected from 1000 paroxysmal AF patients. Using labeled samples from only 10 patients for model training, our method achieved a sensitivity of 97.8%, specificity of 97.9%, and accuracy of 97.9% in five-fold cross-validation. Compared to the supervised learning method with complete labeled samples, the detection accuracy of our method was only 0.5% lower, while the workload of data annotation was significantly reduced by more than 98%. In general, this is the first study to apply semi-supervised learning techniques for automatic AF detection using ECG. Our method can effectively reduce the demand for AF data annotations and can improve the clinical practicability of automatic AF detection.

AAAI Conference 2021 Conference Paper

Knowledge Refinery: Learning from Decoupled Label

  • Qianggang Ding
  • Sifan Wu
  • Tao Dai
  • Hao Sun
  • Jiadong Guo
  • Zhang-Hua Fu
  • Shutao Xia

Recently, a variety of regularization techniques have been widely applied in deep neural networks, which mainly focus on the regularization of weight parameters to encourage generalization effectively. Label regularization techniques have also been proposed to soften the labels, but they neglect the relations among classes. Among them, the technique of knowledge distillation proposes to distill the soft label, which contains the knowledge of class relations. However, this technique requires pre-training an extra, cumbersome teacher model. In this paper, we propose a method called Knowledge Refinery (KR), which enables the neural network to learn the relation of classes on-the-fly without the teacher-student training strategy. We propose the definition of decoupled labels, which consist of the original hard label and the residual label. To exhibit the generalization of KR, we evaluate our method in both fields of computer vision and natural language processing. Our empirical results show consistent performance gains under all experimental settings.

NeurIPS Conference 2020 Conference Paper

Adversarial Sparse Transformer for Time Series Forecasting

  • Sifan Wu
  • Xi Xiao
  • Qianggang Ding
  • Peilin Zhao
  • Ying Wei
  • Junzhou Huang

Many approaches have been proposed for time series forecasting, in light of its significance in wide applications including business demand prediction. However, the existing methods suffer from two key limitations. Firstly, most point prediction models only predict an exact value for each time step without flexibility, which can hardly capture the stochasticity of data. Even probabilistic prediction using likelihood estimation suffers from these problems in the same way. Besides, most of them use the auto-regressive generative mode, where ground-truth is provided during training and replaced by the network's own one-step-ahead output during inference, causing error accumulation in inference. Thus they may fail to forecast time series over a long time horizon due to this error accumulation. To solve these issues, in this paper, we propose a new time series forecasting model -- Adversarial Sparse Transformer (AST), based on Generative Adversarial Networks (GANs). Specifically, AST adopts a Sparse Transformer as the generator to learn a sparse attention map for time series forecasting, and uses a discriminator to improve the prediction performance at the sequence level. Extensive experiments on several real-world datasets show the effectiveness and efficiency of our method.

IJCAI Conference 2020 Conference Paper

Hierarchical Multi-Scale Gaussian Transformer for Stock Movement Prediction

  • Qianggang Ding
  • Sifan Wu
  • Hao Sun
  • Jiadong Guo
  • Jian Guo

Predicting the price movement of finance securities like stocks is an important but challenging task, due to the uncertainty of financial markets. In this paper, we propose a novel approach based on the Transformer to tackle the stock movement prediction task. Furthermore, we present several enhancements for the proposed basic Transformer. Firstly, we propose a Multi-Scale Gaussian Prior to enhance the locality of the Transformer. Secondly, we develop an Orthogonal Regularization to avoid learning redundant heads in the multi-head self-attention mechanism. Thirdly, we design a Trading Gap Splitter for the Transformer to learn hierarchical features of high-frequency finance data. Compared with other popular recurrent neural networks such as LSTM, the proposed method has the advantage of mining extremely long-term dependencies from financial time series. Experimental results show our proposed models outperform several competitive methods in stock price prediction tasks for the NASDAQ exchange market and the China A-shares market.