Arrow Research search

Author name cluster

Dan Guo

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers
1 author row

Possible papers

26

AAAI Conference 2026 Conference Paper

A Closer Look at Knowledge Distillation in Spiking Neural Network Training

  • Xu Liu
  • Na Xia
  • Jinxing Zhou
  • Jingyuan Xu
  • Dan Guo

Spiking Neural Networks (SNNs) have become popular due to their excellent energy efficiency, yet they remain challenging to train effectively. Recent works address this by introducing knowledge distillation (KD) techniques, with pre-trained artificial neural networks (ANNs) used as teachers and the target SNNs as students. This is commonly accomplished through a straightforward element-wise alignment of intermediate features and prediction logits from ANNs and SNNs, often neglecting the intrinsic differences between their architectures. Specifically, ANN outputs exhibit a continuous distribution, whereas SNN outputs are characterized by sparsity and discreteness. To mitigate this issue, we introduce two innovative KD strategies. Firstly, we propose the Saliency-scaled Activation Map Distillation (SAMD), which aligns the spike activation map of the student SNN with the class-aware activation map of the teacher ANN. Rather than performing KD directly on the raw features of the ANN and SNN, our SAMD directs the student to learn from saliency activation maps that exhibit greater semantic and distribution consistency. Additionally, we propose a Noise-smoothed Logits Distillation (NLD), which utilizes Gaussian noise to smooth the sparse logits of the student SNN, facilitating alignment with the continuous logits of the teacher ANN. Extensive experiments on multiple datasets demonstrate the effectiveness of our methods.
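
As a rough illustration of the Noise-smoothed Logits Distillation described above, the sketch below adds Gaussian noise to the student SNN logits before a temperature-scaled KL alignment with the teacher ANN logits; the function name, noise scale, and temperature are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def noise_smoothed_logits_distillation(student_logits, teacher_logits,
                                        noise_std=0.1, temperature=4.0):
    """Hypothetical sketch of noise-smoothed logits distillation (NLD)."""
    # Smooth the sparse, discrete SNN logits with Gaussian noise (assumed std).
    smoothed_student = student_logits + noise_std * torch.randn_like(student_logits)
    # Align the softened student distribution with the continuous teacher logits.
    student_log_prob = F.log_softmax(smoothed_student / temperature, dim=-1)
    teacher_prob = F.softmax(teacher_logits / temperature, dim=-1)
    return F.kl_div(student_log_prob, teacher_prob, reduction="batchmean") * temperature ** 2

# Toy usage: a batch of 8 samples over 10 classes.
loss = noise_smoothed_logits_distillation(torch.randn(8, 10), torch.randn(8, 10))
```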

AAAI Conference 2026 Conference Paper

AgentMental: An Interactive Multi-Agent Framework for Explainable and Adaptive Mental Health Assessment

  • Jinpeng Hu
  • Ao Wang
  • Qianqian Xie
  • Zhuo Li
  • Hui Ma
  • Dan Guo

Mental health assessment is crucial for early intervention and effective treatment, yet traditional clinician-based approaches are limited by the shortage of qualified professionals. Recent advances in artificial intelligence have sparked growing interest in automated psychological assessment, yet most existing approaches are constrained by their reliance on static text analysis, limiting their ability to capture deeper and more informative insights that emerge through dynamic interaction and iterative questioning. Therefore, in this paper, we propose a multi-agent framework for mental health evaluation that simulates clinical doctor-patient dialogues, with specialized agents assigned to questioning, adequacy evaluation, scoring, and updating. In detail, we introduce an adaptive questioning mechanism in which an evaluation agent assesses the adequacy of user responses to determine the necessity of generating targeted follow-up queries to address ambiguity and missing information. Additionally, we employ a tree-structured memory in which the root node encodes the user's basic information, while child nodes (e.g., topic and statement) organize key information according to distinct symptom categories and interaction turns. This memory is dynamically updated throughout the interaction to reduce redundant questioning and enhance the information extraction and contextual tracking capabilities. Experimental results on the DAIC-WOZ dataset illustrate the effectiveness of our proposed method, which achieves better performance than existing approaches. Our code is released at https://github.com/MindIntLab-HFUT/AgentMental.
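
A minimal sketch of the tree-structured memory described above, assuming a root node for basic user information and child nodes organized by symptom topic; the class and method names are hypothetical and not taken from the released code.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryNode:
    """One node of a hypothetical tree-structured dialogue memory."""
    name: str                      # e.g. "root", a symptom topic, or "statement"
    content: str = ""              # summarized information stored at this node
    children: list = field(default_factory=list)

    def update(self, topic: str, statement: str) -> None:
        """Attach a new statement under the matching topic, creating it if needed."""
        for child in self.children:
            if child.name == topic:
                child.children.append(MemoryNode(name="statement", content=statement))
                return
        topic_node = MemoryNode(name=topic)
        topic_node.children.append(MemoryNode(name="statement", content=statement))
        self.children.append(topic_node)

# Toy usage: the root holds basic info, topic nodes collect statements per symptom category.
memory = MemoryNode(name="root", content="age: 29; referral: self")
memory.update("sleep", "reports difficulty falling asleep most nights")
memory.update("mood", "describes persistent low mood over two weeks")
```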

AAAI Conference 2026 Conference Paper

Bidirectional Counterfactual Distillation for Review-Based Recommendation

  • Sheng Sang
  • Shujie Li
  • Shuaiyang Li
  • Kang Liu
  • Teng Li
  • Wei Jia
  • Dan Guo
  • Feng Xue

Review-based recommendation methods typically integrate multiple behaviors, including interactions, reviews, and ratings, to model user preferences. To effectively extract preference signals from diverse behaviors, some studies train multiple student models to capture distinct behavioral patterns, and leverage online distillation to facilitate collaborative learning among them. However, we argue that these techniques suffer from bias contamination from rating distributions and feature homogenization during cross-behavior knowledge transfer: (1) Rating distribution bias, arising from non-uniform historical ratings, propagates across behaviors through distillation, contaminating the true preference representations of other behaviors. (2) Static distillation strategies often lead to homogenized behavioral features, hindering the learning of behavior-specific preferences. To address these issues, we propose a novel Bidirectional Counterfactual Distillation (BiCoD) framework for review-based recommendation. In BiCoD, we first design an adversarial counterfactual distillation module to suppress the impact of non-uniform rating distributions on distillation, thereby preventing it from contaminating the user's true preference representations across behaviors. Subsequently, we introduce a stage-aware bidirectional distillation strategy to enhance the distinctiveness of behavioral features, facilitating the effective learning of behavior-specific preferences. Extensive experiments on five real-world datasets validate the effectiveness and superiority of the proposed framework.

AAAI Conference 2026 Conference Paper

CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization

  • Jinxing Zhou
  • Ziheng Zhou
  • Yanghao Zhou
  • Yuxin Mao
  • Zhangling Duan
  • Dan Guo

The Dense Audio-Visual Event Localization (DAVEL) task aims to temporally localize events in untrimmed videos that occur simultaneously in both the audio and visual modalities. This paper explores DAVEL under a new and more challenging weakly-supervised setting (W-DAVEL task), where only video-level event labels are provided and the temporal boundaries of each event are unknown. We address W-DAVEL by exploiting cross-modal salient anchors, which are defined as reliable timestamps that are well predicted under weak supervision and exhibit highly consistent event semantics across audio and visual modalities. Specifically, we propose a Mutual Event Agreement Evaluation module, which generates an agreement score by measuring the discrepancy between the predicted audio and visual event classes. Then, the agreement score is utilized in a Cross-modal Salient Anchor Identification module, which identifies the audio and visual anchor features through global-video and local temporal window identification mechanisms. The anchor features after multimodal integration are fed into an Anchor-based Temporal Propagation module to enhance event semantic encoding in the original temporal audio and visual features, facilitating better temporal localization under weak supervision. We establish benchmarks for W-DAVEL on both the UnAV-100 and ActivityNet1.3 datasets. Extensive experiments demonstrate that our method achieves state-of-the-art performance.
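
As a rough sketch of the Mutual Event Agreement Evaluation idea, the snippet below scores each timestamp by the negated discrepancy between the audio and visual class probabilities; the exact formulation in the paper may differ.

```python
import torch

def mutual_event_agreement(audio_probs: torch.Tensor, visual_probs: torch.Tensor) -> torch.Tensor:
    """Hypothetical agreement score per timestamp.

    audio_probs, visual_probs: (T, C) per-segment event-class probabilities.
    Returns a (T,) score that is high when the two modalities predict
    consistent event semantics (low discrepancy) at that timestamp.
    """
    discrepancy = (audio_probs - visual_probs).abs().mean(dim=-1)  # (T,)
    return 1.0 - discrepancy  # higher means more cross-modal agreement

# Timestamps whose agreement exceeds a threshold could serve as salient anchors.
audio = torch.softmax(torch.randn(20, 25), dim=-1)
visual = torch.softmax(torch.randn(20, 25), dim=-1)
anchors = (mutual_event_agreement(audio, visual) > 0.95).nonzero(as_tuple=True)[0]
```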

AAAI Conference 2026 Conference Paper

LinProVSR: Linguistics-Knowledge Guided Progressive Disambiguation Network for Visual Speech Recognition

  • Feng Xue
  • Baochao Zhu
  • Wei Jia
  • Shujie Li
  • Yu Li
  • Jinrui Zhang
  • Shengeng Tang
  • Dan Guo

Visual Speech Recognition (VSR), commonly known as lipreading, enables the recognition of spoken text by analyzing visual features of the lips. Due to the subtlety of lip movements, its recognition is much harder than other motion recognition tasks. Existing VSR models face the challenge of viseme ambiguity when processing phonemes with similar pronunciations: multiple phonemes share similar viseme features, leading to a notable drop in lipreading accuracy. To address this issue, this study proposes a Linguistics-Knowledge Guided Progressive Disambiguation Network for Visual Speech Recognition (LinProVSR). First, an ambiguous sample set is constructed based on linguistic knowledge to provide supervisory signals for the model's training. Then, a Progressive Contrastive Disambiguation Network (PCDN) is designed, which progressively enhances the model's ability to capture the subtle viseme differences corresponding to similar phonemes through viseme-phoneme contrastive disambiguation in the encoding stage and text contrastive disambiguation in the decoding stage. Furthermore, we pioneer the Ambiguous Word Error Rate (AWER) metric, specifically designed for evaluating the recognition of phonetically ambiguous text, and verify the effectiveness of the proposed method on multiple public datasets, achieving a significant breakthrough especially in distinguishing visually similar phonemes.

AAAI Conference 2026 Conference Paper

SIAM: Towards Generalizable Articulated Object Modeling via Single Robot-Object Interaction

  • Yuyan Liu
  • Li Zhang
  • Di Wu
  • Yan Zhang
  • Anran Huang
  • Zhi Wang
  • Liu Liu
  • Dan Guo

Articulated object modeling, which represents interconnected rigid bodies with their geometry, part segmentation, articulation tree, and physical properties, is crucial for robotic perception and manipulation. Recently, methods such as SAGCI leverage Interactive Perception (IP) to refine models through robot interaction. However, SAGCI suffers from prior dependency (requiring initialization), neglects kinematic/dynamic constraints, and generates non-watertight meshes. To overcome these limitations, we propose SIAM, a novel framework for efficient and generalizable Single-Interaction Articulated Modeling. Given an initial point cloud, SIAM first performs a minimal robot interaction to trigger object motion. It then precisely segments parts by analyzing point cloud differences before and after the interaction. For joint parameter estimation, we introduce an optimization incorporating novel kinematic energy constraints, enhancing physical consistency. Finally, we reconstruct a high-quality, topologically watertight mesh by learning 3D Gaussian Primitives from multi-view RGB-D observations under deformation. Extensive experiments on the PartNet-Mobility benchmark demonstrate state-of-the-art articulation modeling performance. Successful real-world deployment with an xArm robot further validates the framework's practicality and transferability. SIAM achieves accurate, prior-free modeling with significantly reduced interaction cost.

TIST Journal 2025 Journal Article

Alleviating Confirmation Bias in Learning with Noisy Labels via Two-Network Collaboration

  • Chenglong Xu
  • Peipei Song
  • Shengeng Tang
  • Dan Guo
  • Xun Yang

Deep neural networks (DNNs) have achieved remarkable success in various computer vision tasks, e.g., image classification. However, most of the existing models depend heavily on annotated data, where label noise is inevitable. Training with such noisy data negatively impacts the generalization performance of DNNs. To this end, recent advances in learning with noisy labels (LNL) adopt the sample selection strategy that identifies clean samples from the noisy dataset to update DNNs, using semi-supervised learning where rejected samples are treated as unlabeled data. However, existing LNL methods often overlook the varying fitting difficulties of different classes, resulting in suboptimal sample selection and confirmation bias, and consequently, the errors accumulate during semi-supervised training. In this article, we propose a novel method, TNCollab, which aims at alleviating confirmation bias in both sample selection and semi-supervised training stages via two-network collaboration. Specifically, we introduce a class-adaptive threshold for sample selection to address the varying fitting difficulties across different classes. Additionally, we construct a hard set consisting of samples where the two networks disagree and introduce a noise-robust loss to extract potentially useful information while maintaining robustness against label noise. Furthermore, we propose a dual consistency loss to ensure consistent predictions between the networks across different augmented views of the same sample, facilitating mutual learning. Extensive experiments demonstrate that TNCollab achieves state-of-the-art performance on image classification and facial expression recognition tasks, particularly on CIFAR-10, CIFAR-100, WebVision, Clothing1M, Tiny-ImageNet, and RAF-DB datasets, showing improved visual understanding and generalization capabilities. Our codes are available at https://github.com/Delete12137/TNCollab.
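
The class-adaptive threshold for sample selection could look roughly like the following sketch, which assumes a per-class loss quantile as the selection rule; this is an illustrative assumption, not TNCollab's exact criterion.

```python
import numpy as np

def class_adaptive_selection(losses, labels, num_classes, quantile=0.5):
    """Hypothetical class-adaptive clean-sample selection.

    Instead of one global small-loss threshold, each class uses its own
    threshold (here, a per-class loss quantile), so classes that are harder
    to fit are not under-selected.
    """
    selected = np.zeros(len(losses), dtype=bool)
    for c in range(num_classes):
        idx = np.where(labels == c)[0]
        if len(idx) == 0:
            continue
        threshold = np.quantile(losses[idx], quantile)  # class-specific threshold
        selected[idx] = losses[idx] <= threshold
    return selected

# Toy usage: select the smaller-loss half of each class as "clean".
losses = np.random.rand(1000)
labels = np.random.randint(0, 10, size=1000)
clean_mask = class_adaptive_selection(losses, labels, num_classes=10)
```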

AAAI Conference 2025 Conference Paper

AugRefer: Advancing 3D Visual Grounding via Cross-Modal Augmentation and Spatial Relation-based Referring

  • Xinyi Wang
  • Na Zhao
  • Zhiyuan Han
  • Dan Guo
  • Xun Yang

3D visual grounding (3DVG), which aims to correlate a natural language description with the target object within a 3D scene, is a significant yet challenging task. Despite recent advancements in this domain, existing approaches commonly face a shortage of text-3D pairs: the amount and diversity of pairs available for training are limited. Moreover, they fall short in effectively leveraging different contextual clues (e.g., rich spatial relations within the 3D visual space) for grounding. To address these limitations, we propose AugRefer, a novel approach for advancing 3D visual grounding. AugRefer introduces cross-modal augmentation designed to extensively generate diverse text-3D pairs by placing objects into 3D scenes and creating accurate and semantically rich descriptions using foundation models. Notably, the resulting pairs can be utilized by any existing 3DVG method to enrich its training data. In addition, AugRefer presents a language-spatial adaptive decoder that effectively adapts the potential referring objects based on the language description and various 3D spatial relations. Extensive experiments on three benchmark datasets clearly validate the effectiveness of AugRefer.

AAAI Conference 2025 Conference Paper

Dense Audio-Visual Event Localization Under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration

  • Ziheng Zhou
  • Jinxing Zhou
  • Wei Qian
  • Shengeng Tang
  • Xiaojun Chang
  • Dan Guo

In the field of audio-visual learning, most research tasks focus exclusively on short videos. This paper focuses on the more practical Dense Audio-Visual Event Localization (DAVEL) task, advancing audio-visual scene understanding for longer, untrimmed videos. This task seeks to identify and temporally pinpoint all events simultaneously occurring in both audio and visual streams. Typically, each video encompasses dense events of multiple classes, which may overlap on the timeline, each exhibiting varied durations. Given these challenges, effectively exploiting the audio-visual relations and the temporal features encoded at various granularities becomes crucial. To address these challenges, we introduce a novel CCNet, comprising two core modules: the Cross-Modal Consistency Collaboration (CMCC) and the Multi-Temporal Granularity Collaboration (MTGC). Specifically, the CMCC module contains two branches: a cross-modal interaction branch and a temporal consistency-gated branch. The former branch facilitates the aggregation of consistent event semantics across modalities through the encoding of audio-visual relations, while the latter branch guides one modality's focus to pivotal event-relevant temporal areas as discerned in the other modality. The MTGC module includes a coarse-to-fine collaboration block and a fine-to-coarse collaboration block, providing bidirectional support among coarse- and fine-grained temporal features. Extensive experiments on the UnAV-100 dataset validate our module design, resulting in a new state-of-the-art performance in dense audio-visual event localization.

AAAI Conference 2025 Conference Paper

MOL-Mamba: Enhancing Molecular Representation with Structural & Electronic Insights

  • Jingjing Hu
  • Dan Guo
  • Zhan Si
  • Deguang Liu
  • Yunfeng Diao
  • Jing Zhang
  • Jinxing Zhou
  • Meng Wang

Molecular representation learning plays a crucial role in various downstream tasks, such as molecular property prediction and drug design. To accurately represent molecules, Graph Neural Networks (GNNs) and Graph Transformers (GTs) have shown potential in the realm of self-supervised pretraining. However, existing approaches often overlook the relationship between molecular structure and electronic information, as well as the internal semantic reasoning within molecules. This omission of fundamental chemical knowledge in graph semantics leads to incomplete molecular representations, missing the integration of structural and electronic data. To address these issues, we introduce MOL-Mamba, a framework that enhances molecular representation by combining structural and electronic insights. MOL-Mamba consists of an Atom & Fragment Mamba-Graph (MG) for hierarchical structural reasoning and a Mamba-Transformer (MT) fuser for integrating molecular structure and electronic correlation learning. Additionally, we propose a Structural Distribution Collaborative Training and E-semantic Fusion Training framework to further enhance molecular representation learning. Extensive experiments demonstrate that MOL-Mamba outperforms state-of-the-art baselines across eleven chemical-biological molecular datasets.

AAAI Conference 2025 Conference Paper

Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing

  • Pengcheng Zhao
  • Jinxing Zhou
  • Yang Zhao
  • Dan Guo
  • Yanxiang Chen

The Audio-Visual Video Parsing task aims to recognize and temporally localize all events occurring in either the audio or visual stream, or both. Capturing accurate event semantics for each audio/visual segment is vital. Prior works directly utilize the extracted holistic audio and visual features for intra- and cross-modal temporal interactions. However, each segment may contain multiple events, resulting in semantically mixed holistic features that can lead to semantic interference during intra- or cross-modal interactions: the event semantics of one segment may incorporate semantics of unrelated events from other segments. To address this issue, our method begins with a Class-Aware Feature Decoupling (CAFD) module, which explicitly decouples the semantically mixed features into distinct class-wise features, including multiple event-specific features and a dedicated background feature. The decoupled class-wise features enable our model to selectively aggregate useful semantics for each segment from clearly matched classes contained in other segments, preventing semantic interference from irrelevant classes. Specifically, we further design a Fine-Grained Semantic Enhancement module for encoding intra- and cross-modal relations. It comprises a Segment-wise Event Co-occurrence Modeling (SECM) block and a Local-Global Semantic Fusion (LGSF) block. The SECM exploits inter-class dependencies of concurrent events within the same timestamp with the aid of a novel event co-occurrence loss. The LGSF further enhances the event semantics of each segment by incorporating relevant semantics from more informative global video features. Extensive experiments validate the effectiveness of the proposed modules and loss functions, resulting in a new state-of-the-art parsing performance.
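
A minimal sketch of decoupling a holistic segment feature into class-wise features plus a background feature; the per-class linear projections are an assumption for illustration and may differ from the CAFD module's actual design.

```python
import torch
import torch.nn as nn

class ClassAwareDecoupling(nn.Module):
    """Hypothetical class-aware feature decoupling.

    Maps holistic segment features into C event-specific features plus one
    background feature, each of dimension D, via per-class projections.
    """
    def __init__(self, dim: int, num_classes: int):
        super().__init__()
        # num_classes event projections + 1 background projection
        self.projections = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_classes + 1)])

    def forward(self, segment_feat: torch.Tensor) -> torch.Tensor:
        # segment_feat: (B, T, D) -> decoupled: (B, T, C+1, D)
        return torch.stack([proj(segment_feat) for proj in self.projections], dim=2)

decoupler = ClassAwareDecoupling(dim=256, num_classes=25)
decoupled = decoupler(torch.randn(2, 10, 256))  # shape (2, 10, 26, 256)
```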

AAAI Conference 2025 Conference Paper

Patch-level Sounding Object Tracking for Audio-Visual Question Answering

  • Zhangbin Li
  • Jinxing Zhou
  • Jing Zhang
  • Shengeng Tang
  • Kun Li
  • Dan Guo

Answering questions related to audio-visual scenes, i.e., the AVQA task, is becoming increasingly popular. A critical challenge is accurately identifying and tracking sounding objects related to the question along the timeline. In this paper, we present a new Patch-level Sounding Object Tracking (PSOT) method. It begins with a Motion-driven Key Patch Tracking (M-KPT) module, which relies on visual motion information to identify salient visual patches with significant movements that are more likely to relate to sounding objects and questions. We measure the patch-wise motion intensity map between neighboring video frames and utilize it to construct and guide a motion-driven graph network. Meanwhile, we design a Sound-driven KPT (S-KPT) module to explicitly track sounding patches. This module also involves a graph network, with the adjacency matrix regularized by the audio-visual correspondence map. The M-KPT and S-KPT modules are performed in parallel for each temporal segment, allowing balanced tracking of salient and sounding objects. Based on the tracked patches, we further propose a Question-driven KPT (Q-KPT) module to retain patches highly relevant to the question, ensuring the model focuses on the most informative clues. The audio-visual-question features are updated during the processing of these modules, which are then aggregated for final answer prediction. Extensive experiments on standard datasets demonstrate the effectiveness of our method, achieving competitive performance even compared to recent large-scale pretraining-based approaches.
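
The patch-wise motion intensity map between neighboring frames might be computed along these lines, assuming per-frame patch features (e.g., from a ViT backbone); this is a sketch, not the paper's exact definition.

```python
import torch

def patch_motion_intensity(patch_feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical patch-wise motion intensity map.

    patch_feats: (T, P, D) patch features for T frames and P patches.
    Returns (T-1, P): the feature change of each patch between adjacent frames,
    which can be normalized and used to weight a motion-driven graph.
    """
    diff = patch_feats[1:] - patch_feats[:-1]           # (T-1, P, D)
    intensity = diff.norm(dim=-1)                        # (T-1, P)
    # Normalize per frame so intensities are comparable across time.
    return intensity / (intensity.max(dim=-1, keepdim=True).values + 1e-6)

motion = patch_motion_intensity(torch.randn(8, 196, 768))  # e.g. 8 frames, 14x14 patches
```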

AAAI Conference 2025 Conference Paper

PhysDiff: Physiology-based Dynamicity Disentangled Diffusion Model for Remote Physiological Measurement

  • Wei Qian
  • Gaoji Su
  • Dan Guo
  • Jinxing Zhou
  • Xiaobai Li
  • Bin Hu
  • Shengeng Tang
  • Meng Wang

Recent works on remote PhotoPlethysmoGraphy (rPPG) estimation typically use techniques like CNNs and Transformers to encode implicit features from facial videos for prediction. These methods learn to directly map facial videos to the static values of rPPG signals, overlooking the inherent dynamic characteristics of the rPPG sequence. Moreover, the rPPG signal is extremely weak and highly susceptible to interference from various sources of noise, including illumination conditions, head movements, and variations in skin tone. To address these limitations, we propose a Physiology-based dynamicity disentangled diffusion (PhysDiff) model particularly designed for robust rPPG estimation. PhysDiff leverages the diffusion model to learn the distribution of the quasi-periodic rPPG signal and uses a dynamicity disentanglement strategy to capture two dynamic characteristics of the temporal rPPG signal, i.e., trend and amplitude. This disentanglement is motivated by the underlying dynamic physiological processes of vasodilation and vasoconstriction, ensuring a more precise representation of the rPPG signal. The disentangled components are then used as pivotal conditions in the proposed spatial-temporal hybrid denoiser for rPPG reconstruction. Besides, we introduce a periodicity-based multi-hypothesis selection strategy in model inference, which compares the natural periodicity of multiple generated rPPG hypotheses and selects the most favorable one as the final prediction. Extensive experiments on four datasets demonstrate that our PhysDiff significantly outperforms prior methods on both intra-dataset and cross-dataset testing.
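
The periodicity-based multi-hypothesis selection could be sketched as follows, under the assumption that the periodicity score is the share of band power carried by the dominant spectral peak within a plausible heart-rate band; the paper's criterion may be defined differently.

```python
import numpy as np

def select_most_periodic(hypotheses: np.ndarray, fs: float = 30.0) -> np.ndarray:
    """Hypothetical periodicity-based selection among generated rPPG hypotheses.

    hypotheses: (N, T) candidate rPPG signals sampled at fs Hz.
    Returns the hypothesis whose power spectrum is most concentrated around a
    single peak inside the 0.7-3 Hz heart-rate band.
    """
    freqs = np.fft.rfftfreq(hypotheses.shape[1], d=1.0 / fs)
    band = (freqs >= 0.7) & (freqs <= 3.0)
    spectra = np.abs(np.fft.rfft(hypotheses, axis=1)) ** 2
    band_power = spectra[:, band]
    # Periodicity score: fraction of band power carried by the strongest bin.
    scores = band_power.max(axis=1) / (band_power.sum(axis=1) + 1e-8)
    return hypotheses[np.argmax(scores)]

best = select_most_periodic(np.random.randn(5, 300))  # 5 hypotheses, 10 s at 30 fps
```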

AAAI Conference 2025 Conference Paper

Prototypical Calibrating Ambiguous Samples for Micro-Action Recognition

  • Kun Li
  • Dan Guo
  • Guoliang Chen
  • Chunxiao Fan
  • Jingyuan Xu
  • Zhiliang Wu
  • Hehe Fan
  • Meng Wang

Micro-Action Recognition (MAR) has gained increasing attention due to its crucial role as a form of non-verbal communication in social interactions, with promising potential for applications in human communication and emotion analysis. However, current approaches often overlook the inherent ambiguity in micro-actions, which arises from the wide category range and subtle visual differences between categories. This oversight hampers the accuracy of micro-action recognition. In this paper, we propose a novel Prototypical Calibrating Ambiguous Network (PCAN) to unleash and mitigate the ambiguity of MAR. Firstly, we employ a hierarchical action-tree to identify ambiguous samples, categorizing them into distinct sets of false negatives and false positives, considering both body- and action-level categories. Secondly, we implement an ambiguous contrastive refinement module to calibrate these ambiguous samples by regulating the distance between ambiguous samples and their corresponding prototypes. This calibration process aims to pull false negative (FN) samples closer to their respective prototypes and push false positive (FP) samples apart from their affiliated prototypes. In addition, we propose a new prototypical diversity amplification loss to strengthen the model's capacity by amplifying the differences between different prototypes. Finally, we propose a prototype-guided rectification to rectify prediction by incorporating the representability of prototypes. Extensive experiments conducted on the benchmark dataset demonstrate the superior performance of our method compared to existing approaches.
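
A rough sketch of the pull/push calibration on ambiguous samples: false negatives are pulled toward their prototypes and false positives are pushed beyond a margin; the distance metric and margin below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_calibration_loss(features, prototypes, labels, is_false_negative, margin=1.0):
    """Hypothetical prototypical calibration of ambiguous samples.

    features:   (N, D) embeddings of ambiguous samples.
    prototypes: (C, D) class prototypes.
    labels:     (N,) class index each sample is affiliated with.
    is_false_negative: (N,) bool; True -> pull toward its prototype,
                       False (false positive) -> push away beyond a margin.
    """
    dist = F.pairwise_distance(features, prototypes[labels])   # (N,)
    pull = dist[is_false_negative].pow(2).mean() if is_false_negative.any() else features.new_zeros(())
    fp = ~is_false_negative
    push = F.relu(margin - dist[fp]).pow(2).mean() if fp.any() else features.new_zeros(())
    return pull + push

# Toy usage with 16 ambiguous samples and 52 class prototypes.
loss = prototype_calibration_loss(torch.randn(16, 128), torch.randn(52, 128),
                                  torch.randint(0, 52, (16,)), torch.rand(16) > 0.5)
```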

AAAI Conference 2025 Conference Paper

Sign-IDD: Iconicity Disentangled Diffusion for Sign Language Production

  • Shengeng Tang
  • Jiayi He
  • Dan Guo
  • Yanyan Wei
  • Feng Li
  • Richang Hong

Sign Language Production (SLP) aims to generate semantically consistent sign videos from textual statements, where the conversion from textual glosses to sign poses (G2P) is a crucial step. Existing G2P methods typically treat sign poses as discrete three-dimensional coordinates and directly fit them, which overlooks the relative positional relationships among joints. To this end, we provide a new perspective, constraining joint associations and gesture details by modeling the limb bones to improve the accuracy and naturalness of the generated poses. In this work, we propose a pioneering iconicity disentangled diffusion framework, termed Sign-IDD, specifically designed for SLP. Sign-IDD incorporates a novel Iconicity Disentanglement (ID) module to bridge the gap between relative positions among joints. The ID module disentangles the conventional 3D joint representation into a 4D bone representation, comprising the 3D spatial direction vector and 1D spatial distance vector between adjacent joints. Additionally, an Attribute Controllable Diffusion (ACD) module is introduced to further constrain joint associations, in which the attribute separation layer aims to separate the bone direction and length attributes, and the attribute control layer is designed to guide the pose generation by leveraging the above attributes. The ACD module utilizes the gloss embeddings as semantic conditions and finally generates sign poses from noise embeddings. Extensive experiments on PHOENIX14T and USTC-CSL datasets validate the effectiveness of our method.
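
The joint-to-bone disentanglement can be pictured with the short sketch below, which converts 3D joints into unit bone directions plus bone lengths given a parent index per joint; the skeleton indexing is an assumption, and the paper's 4D bone representation may be organized differently.

```python
import torch

def joints_to_bones(joints: torch.Tensor, parents: torch.Tensor):
    """Hypothetical conversion of 3D joints into a bone representation.

    joints:  (T, J, 3) joint coordinates over T frames.
    parents: (J,) index of each joint's parent in the skeleton tree.
    Returns per-bone unit direction vectors (T, J, 3) and lengths (T, J, 1),
    i.e. a direction-plus-distance description of adjacent-joint relations.
    """
    bones = joints - joints[:, parents]                 # child minus parent
    length = bones.norm(dim=-1, keepdim=True)           # (T, J, 1)
    direction = bones / (length + 1e-8)                 # (T, J, 3), unit vectors
    return direction, length

# Toy usage with a 5-joint chain where joint 0 is its own parent (root).
parents = torch.tensor([0, 0, 1, 2, 3])
direction, length = joints_to_bones(torch.randn(4, 5, 3), parents)
```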

AAAI Conference 2024 Conference Paper

EulerMormer: Robust Eulerian Motion Magnification via Dynamic Filtering within Transformer

  • Fei Wang
  • Dan Guo
  • Kun Li
  • Meng Wang

Video Motion Magnification (VMM) aims to break the resolution limit of human visual perception capability and reveal the imperceptible minor motion that contains valuable information in the macroscopic domain. However, challenges arise in this task due to photon noise inevitably introduced by photographic devices and spatial inconsistency in amplification, leading to flickering artifacts in static fields and motion blur and distortion in dynamic fields in the video. Existing methods focus on explicit motion modeling without emphasizing prioritized denoising during the motion magnification process. This paper proposes a novel dynamic filtering strategy to achieve static-dynamic field adaptive denoising. Specifically, based on Eulerian theory, we separate texture and shape to extract motion representation through inter-frame shape differences, expecting to leverage these subdivided features to solve this task finely. Then, we introduce a novel dynamic filter that eliminates noise cues and preserves critical features in the motion magnification and amplification generation phases. Overall, our unified framework, EulerMormer, is a pioneering effort to equip learning-based VMM with a Transformer. The core of the dynamic filter lies in a global dynamic sparse cross-covariance attention mechanism that explicitly removes noise while preserving vital information, coupled with a multi-scale dual-path gating mechanism that selectively regulates the dependence on different frequency features to reduce spatial attenuation and complement motion boundaries. Extensive experiments demonstrate that EulerMormer achieves more robust video motion magnification from the Eulerian perspective, significantly outperforming state-of-the-art methods. The source code is available at https://github.com/VUT-HFUT/EulerMormer.

AAAI Conference 2024 Conference Paper

KPA-Tracker: Towards Robust and Real-Time Category-Level Articulated Object 6D Pose Tracking

  • Liu Liu
  • Anran Huang
  • Qi Wu
  • Dan Guo
  • Xun Yang
  • Meng Wang

Our life is populated with articulated objects. Current category-level articulation estimation works largely focus on predicting part-level 6D poses on static point cloud observations. In this paper, we tackle the problem of category-level online robust and real-time 6D pose tracking of articulated objects, where we propose KPA-Tracker, a novel 3D KeyPoint based Articulated object pose Tracker. Given an RGB-D image or a partial point cloud at the current frame as well as the estimated per-part 6D poses from the last frame, our KPA-Tracker can effectively update the poses with learned 3D keypoints between the adjacent frames. Specifically, we first canonicalize the input point cloud and formulate the pose tracking as an inter-frame pose increment estimation task. To learn consistent and separate 3D keypoints for every rigid part, we build KPA-Gen that outputs the high-quality ordered 3D keypoints in an unsupervised manner. During pose tracking on the whole video, we further propose a keypoint-based articulation tracking algorithm that mines keyframes as reference for accurate pose updating. We provide extensive experiments on validating our KPA-Tracker on various datasets ranging from synthetic point cloud observation to real-world scenarios, which demonstrates the superior performance and robustness of the KPA-Tracker. We believe that our work has the potential to be applied in many fields including robotics, embodied intelligence and augmented reality. All the datasets and codes are available at https://github.com/hhhhhar/KPA-Tracker.

AAAI Conference 2024 Conference Paper

Object-Aware Adaptive-Positivity Learning for Audio-Visual Question Answering

  • Zhangbin Li
  • Dan Guo
  • Jinxing Zhou
  • Jing Zhang
  • Meng Wang

This paper focuses on the Audio-Visual Question Answering (AVQA) task that aims to answer questions derived from untrimmed audible videos. To generate accurate answers, an AVQA model is expected to find the most informative audio-visual clues relevant to the given questions. In this paper, we propose to explicitly consider fine-grained visual objects in video frames (object-level clues) and explore the multi-modal relations (i.e., the object, audio, and question) in terms of feature interaction and model optimization. For the former, we present an end-to-end object-oriented network that adopts a question-conditioned clue discovery module to concentrate audio/visual modalities on respective keywords of the question and designs a modality-conditioned clue collection module to highlight closely associated audio segments or visual objects. For model optimization, we propose an object-aware adaptive-positivity learning strategy that selects the highly semantic-matched multi-modal pair as positivity. Specifically, we design two object-aware contrastive loss functions to identify the highly relevant question-object pairs and audio-object pairs, respectively. These selected pairs are constrained to have larger similarity values than the mismatched pairs. The positivity-selecting process is adaptive as the positivity pairs selected in each video frame may be different. These two object-aware objectives help the model understand which objects are exactly relevant to the question and which are making sounds. Extensive experiments on the MUSIC-AVQA dataset demonstrate the proposed method is effective in finding favorable audio-visual clues and also achieves new state-of-the-art question-answering performance. The code is available at https://github.com/zhangbin-ai/APL.
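
As a rough illustration of the adaptive-positivity idea, the sketch below selects, for one frame, the most similar question-object pair as the positive and contrasts it against the mismatched pairs; the similarity measure and temperature are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def adaptive_positivity_loss(question_emb, object_embs, temperature=0.07):
    """Hypothetical adaptive-positivity contrastive loss for one frame.

    question_emb: (D,) question feature; object_embs: (K, D) object features.
    The most similar question-object pair is adaptively selected as the
    positive, and is constrained to score higher than the mismatched pairs.
    """
    sims = F.cosine_similarity(question_emb.unsqueeze(0), object_embs, dim=-1)  # (K,)
    positive = sims.argmax()                      # adaptive positive per frame
    logits = sims / temperature
    return F.cross_entropy(logits.unsqueeze(0), positive.unsqueeze(0))

loss = adaptive_positivity_loss(torch.randn(256), torch.randn(10, 256))
```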

AAAI Conference 2024 Conference Paper

Text-Based Occluded Person Re-identification via Multi-Granularity Contrastive Consistency Learning

  • Xinyi Wu
  • Wentao Ma
  • Dan Guo
  • Tongqing Zhou
  • Shan Zhao
  • Zhiping Cai

Text-based Person Re-identification (T-ReID), which aims at retrieving a specific pedestrian image from a collection of images via text-based information, has received significant attention. However, previous research has overlooked a challenging yet practical form of T-ReID: dealing with image galleries mixed with occluded and inconsistent personal visuals, instead of ideal visuals with a full-body and clear view. Its major challenges lie in the insufficiency of benchmark datasets and the enlarged semantic gap incurred by arbitrary occlusions, as well as the modality gap between the text description and the visual representation of the target person. To alleviate these issues, we first design an Occlusion Generator (OGor) for the automatic generation of artificial occluded images from generic surveillance images. Then, a fine-granularity token selection mechanism is proposed to minimize the negative impact of occlusion for robust feature learning, and a novel multi-granularity contrastive consistency alignment framework is designed to leverage intra-/inter-granularity of visual-text representations for semantic alignment of occluded visuals and query texts. Experimental results demonstrate that our method exhibits superior performance. We believe this work could inspire the community to investigate more dedicated designs for implementing T-ReID in real-world scenarios. The source code is available at https://github.com/littlexinyi/MGCC.

AAAI Conference 2024 Conference Paper

Towards Understanding Future: Consistency Guided Probabilistic Modeling for Action Anticipation

  • Zhao Xie
  • Yadong Shi
  • Kewei Wu
  • Yaru Cheng
  • Dan Guo

Action anticipation aims to infer the action in the unobserved segment (future segment) from the observed segment (past segment). Existing methods focus on learning key past semantics to predict the future, but they do not model the temporal continuity between the past and the future. However, past actions are always highly uncertain when anticipating the unobserved future. The absence of temporal continuity smoothing between the video's past and future segments may result in an inconsistent anticipation of the future action. In this work, we aim to smooth the global semantic changes between the past and future segments. We propose a Consistency-guided Probabilistic Model (CPM), which focuses on learning globally temporal probabilistic consistency to inhibit unexpected temporal inconsistency. The CPM is deployed on the Transformer architecture and includes three modules of future semantics estimation, global semantics estimation, and global distribution estimation, involving the learning of past-to-future semantics, past-and-future semantics, and semantically probabilistic distributions. To achieve the smoothness of temporal continuity, we follow the principle of variational analysis and describe two probabilistic distributions, i.e., a past-aware distribution and a global-aware distribution, which help to estimate the evidence lower bound of future anticipation. We maximize the evidence lower bound of future semantics by reducing the distribution distance between the above two distributions for model optimization. Extensive experiments demonstrate the effectiveness of our method, and the CPM achieves state-of-the-art performance on Epic-Kitchen100, Epic-Kitchen55, and EGTEA-GAZE.
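
The distribution-matching term of the evidence lower bound might be written as a standard Gaussian KL divergence, as in the generic variational sketch below (assuming both the past-aware and global-aware distributions are diagonal Gaussians); this is an illustration, not the paper's exact objective.

```python
import torch

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians.

    In a CPM-style setup, q could be the past-aware distribution and p the
    global-aware (past-and-future) distribution; minimizing this KL pulls the
    two distributions together, tightening the bound on future anticipation.
    """
    var_q, var_p = logvar_q.exp(), logvar_p.exp()
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(dim=-1).mean()

# Toy usage with 8 samples and 64-dimensional latent distributions.
loss = gaussian_kl(torch.randn(8, 64), torch.zeros(8, 64),
                   torch.randn(8, 64), torch.zeros(8, 64))
```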

AAAI Conference 2021 Conference Paper

Proposal-Free Video Grounding with Contextual Pyramid Network

  • Kun Li
  • Dan Guo
  • Meng Wang

The challenge of video grounding - localizing activities in an untrimmed video via a natural language query - is to tackle the semantics of vision and language consistently along the temporal dimension. Most existing proposal-based methods are trapped by computational cost with extensive candidate proposals. In this paper, we propose a novel proposal-free framework named Contextual Pyramid Network (CPNet) to investigate multi-scale temporal correlation in the video. Specifically, we propose a pyramid network to extract 2D contextual correlation maps at different temporal scales (T×T, T/2×T/2, T/4×T/4), where the 2D correlation map (past → current & current ← future) is designed to model the relations between any two moments in the video. In other words, CPNet progressively replenishes the temporal contexts and refines the location of the queried activity by enlarging the temporal receptive fields. Finally, we implement a temporal self-attentive regression (i.e., proposal-free regression) to predict the activity boundary from the above hierarchical context-aware 2D correlation maps. Extensive experiments on ActivityNet Captions, Charades-STA, and TACoS datasets demonstrate that our approach outperforms state-of-the-art methods.
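
The multi-scale 2D correlation maps can be pictured with the sketch below, where entry (i, j) combines the features of moments i and j; the scale set T, T/2, T/4 follows the abstract, while the pooling and fusion choices are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_correlation_map(feats: torch.Tensor) -> torch.Tensor:
    """Hypothetical 2D map relating every pair of moments in a video.

    feats: (T, D) per-moment features.
    Returns (T, T, 2D), where entry (i, j) concatenates the features of
    moments i and j, modeling past->current and current<-future relations.
    """
    T, D = feats.shape
    a = feats.unsqueeze(1).expand(T, T, D)   # feature of moment i, broadcast over j
    b = feats.unsqueeze(0).expand(T, T, D)   # feature of moment j, broadcast over i
    return torch.cat([a, b], dim=-1)

def pyramid_maps(feats: torch.Tensor):
    """Build correlation maps at temporal scales T, T/2, and T/4 via average pooling."""
    maps = []
    for stride in (1, 2, 4):
        pooled = F.avg_pool1d(feats.t().unsqueeze(0), kernel_size=stride).squeeze(0).t()
        maps.append(temporal_correlation_map(pooled))
    return maps

maps = pyramid_maps(torch.randn(32, 256))  # shapes (32,32,512), (16,16,512), (8,8,512)
```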

IJCAI Conference 2020 Conference Paper

Recurrent Relational Memory Network for Unsupervised Image Captioning

  • Dan Guo
  • Yang Wang
  • Peipei Song
  • Meng Wang

Unsupervised image captioning with no annotations is an emerging challenge in computer vision, where existing works usually adopt GAN (Generative Adversarial Network) models. In this paper, we propose a novel memory-based network rather than GAN, named Recurrent Relational Memory Network (R2M). Unlike complicated and sensitive adversarial learning that performs poorly for long sentence generation, R2M implements a concepts-to-sentence memory translator through two-stage memory mechanisms: fusion and recurrent memories, correlating the relational reasoning between common visual concepts and the generated words over long periods. R2M encodes visual context through unsupervised training on images, while enabling the memory to learn from an irrelevant textual corpus in a supervised fashion. Our solution has fewer learnable parameters and higher computational efficiency than GAN-based methods, which suffer heavily from parameter sensitivity. We experimentally validate the superiority of R2M over state-of-the-art methods on all benchmark datasets.

IJCAI Conference 2019 Conference Paper

Connectionist Temporal Modeling of Video and Language: a Joint Model for Translation and Sign Labeling

  • Dan Guo
  • Shengeng Tang
  • Meng Wang

Online sign interpretation suffers from challenges presented by hybrid semantics learning among sequential variations of visual representations, sign linguistics, and textual grammars. This paper proposes a Connectionist Temporal Modeling (CTM) network for sentence translation and sign labeling. To acquire short-term temporal correlations, a Temporal Convolution Pyramid (TCP) module is applied to 2D CNN features to realize (2D+1D) 'pseudo 3D' CNN features. CTM aligns the 'pseudo 3D' features with the original 3D CNN clip features and fuses them. Next, we implement a connectionist decoding scheme for long-term sequential learning. Here, we embed dynamic programming into the decoding scheme, which learns the temporal mapping among features, sign labels, and the generated sentence directly. The sign labeling obtained by dynamic programming is treated as pseudo labels. Finally, we utilize these pseudo supervision cues in an end-to-end framework. A joint objective function is designed to measure feature correlation, entropy regularization on sign labeling, and probability maximization on sentence decoding. The experimental results on the RWTH-PHOENIX-Weather and USTC-CSL datasets demonstrate the effectiveness of the proposed approach.

IJCAI Conference 2019 Conference Paper

Dense Temporal Convolution Network for Sign Language Translation

  • Dan Guo
  • Shuo Wang
  • Qi Tian
  • Meng Wang

The sign language translation (SLT) task, which aims to translate a sign language video into natural language, is weakly supervised, given that there is no exact mapping relationship between visual actions and textual words in a sentence label. To align the sign language actions and translate them into the respective words automatically, this paper proposes a dense temporal convolution network, termed DenseTCN, which captures the actions from hierarchical views. Within this network, a temporal convolution (TC) is designed to learn the short-term correlation among adjacent features and is further extended to a dense hierarchical structure. In the k-th TC layer, we integrate the outputs of all preceding layers together: (1) the TC in a deeper layer essentially has larger receptive fields, which captures long-term temporal context through hierarchical content transition; (2) the integration addresses the SLT problem from different views, including embedded short-term and extended long-term sequential learning. Finally, we adopt the CTC loss and a fusion strategy to learn the feature-wise classification and generate the translated sentence. The experimental results on two popular sign language benchmarks, i.e., PHOENIX and USTC-ConSents, demonstrate the effectiveness of our proposed method in terms of various measurements.
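
A minimal sketch of the dense hierarchical temporal convolution idea, where each TC layer consumes the concatenation of all preceding outputs, trained with a standard CTC loss; channel sizes, kernel width, and vocabulary size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseTemporalConv(nn.Module):
    """Hypothetical dense stack of 1D temporal convolutions.

    The k-th layer takes the channel-wise concatenation of the input and all
    previous layer outputs, so deeper layers see increasingly long-term context.
    """
    def __init__(self, in_dim=512, growth=128, num_layers=4, kernel_size=5):
        super().__init__()
        self.layers = nn.ModuleList()
        dim = in_dim
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv1d(dim, growth, kernel_size, padding=kernel_size // 2),
                nn.ReLU(inplace=True)))
            dim += growth
        self.out_dim = dim

    def forward(self, x):                 # x: (B, in_dim, T)
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)    # (B, out_dim, T)

# Frame-wise classification over the vocabulary, optimized with CTC.
model = DenseTemporalConv()
classifier = nn.Conv1d(model.out_dim, 1233, kernel_size=1)   # vocabulary size is illustrative
log_probs = classifier(model(torch.randn(2, 512, 100))).log_softmax(dim=1).permute(2, 0, 1)
ctc = nn.CTCLoss(blank=0)
loss = ctc(log_probs, torch.randint(1, 1233, (2, 12)),
           torch.full((2,), 100), torch.full((2,), 12))
```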

IJCAI Conference 2019 Conference Paper

Dual Visual Attention Network for Visual Dialog

  • Dan Guo
  • Hui Wang
  • Meng Wang

Visual dialog is a challenging task, which involves multi-round semantic transformations between vision and language. This paper aims to address cross-modal semantic correlation for visual dialog. Motivated by the fact that Vg (global vision), Vl (local vision), Q (question), and H (history) are inseparably related, the paper proposes a novel Dual Visual Attention Network (DVAN) to realize (Vg, Vl, Q, H) -> A. DVAN is a three-stage query-adaptive attention model. In order to acquire an accurate A (answer), it first explores textual attention, which imposes the question on the history to pick out the related context H'. Then, based on Q and H', it applies respective visual attentions to discover related global image visual hints Vg' and local object-based visual hints Vl'. Next, a dual crossing visual attention is proposed: Vg' and Vl' are mutually embedded to learn the complementarity of visual semantics. Finally, the attended textual and visual features are combined to infer the answer. Experimental results on the VisDial v0.9 and v1.0 datasets validate the effectiveness of the proposed approach.

AAAI Conference 2018 Conference Paper

Hierarchical LSTM for Sign Language Translation

  • Dan Guo
  • Wengang Zhou
  • Houqiang Li
  • Meng Wang

Continuous Sign Language Translation (SLT) is a challenging task due to its specific linguistics under sequential gesture variation without word alignment. Current hybrid HMM- and CTC (Connectionist Temporal Classification)-based models are proposed to solve frame- or word-level alignment. They may fail to tackle cases where the word order is inconsistent with the visual content of the sentence. To solve this issue, this paper proposes a hierarchical-LSTM (HLSTM) encoder-decoder model with visual content and word embedding for SLT. It tackles different granularities by conveying spatio-temporal transitions among frames, clips, and viseme units. It first explores spatio-temporal cues of video clips by 3D CNN and packs appropriate visemes by online key clip mining with adaptive variable length. After pooling on the recurrent outputs of the top layer of HLSTM, a temporal attention-aware weighting mechanism is proposed to balance the intrinsic relationship among viseme source positions. At last, another two LSTM layers are used to separately recurse viseme vectors and translate semantics. By preserving the original visual content via 3D CNN and the top layer of HLSTM, the model shortens the encoding time step of the bottom two LSTM layers with less computational complexity while attaining more nonlinearity. Our proposed model exhibits promising performance on signer-independent tests with seen sentences and also outperforms the comparison algorithms on unseen sentences.