EAAI Journal 2026 Journal Article
BBANet: Bilateral biological auditory-inspired neural network for heart sound classification
- Yang Tan
- Haojie Zhang
- Jingwen Xu
- Hanhan Wu
- Kun Qian
- Bin Hu
- Yoshiharu Yamamoto
- Björn W. Schuller
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
JBHI Journal 2026 Journal Article
Computer audition-based methods have attracted a great deal of attention in the field of disease detection due to their significant advantages, e.g., non-invasive and convenient operation. Among them, the introduction of information representations inspired by human auditory perception, e.g., Mel-frequency transformation, gives these methods great potential to approach and even exceed the limits of the human auditory system. However, according to previous research, it remains challenging to fairly assess whether information representations inspired by human auditory perception have a significant positive effect on disease detection. Moreover, performance differences among various information representations and their underlying causes are yet to be thoroughly investigated and analyzed. To this end, we propose an interpretable comparative study on information representations inspired by human auditory perception for disease detection. First, the detection accuracy of different information representations is investigated on two sound datasets (a psychological and a physiological disease) based on the classical model and the proposed Temporal-Spatial Multi-Scale Perception Network. Then, the noise robustness of these information representations is compared by introducing Gaussian noise at varying signal-to-noise ratios (SNRs). Finally, by combining the human auditory perception mechanism with explainable AI techniques, we analyze the reasons for performance differences among various information representations from qualitative and quantitative perspectives. Experimental results demonstrate that information representations inspired by human auditory perception can improve the performance of disease detection with statistical significance. Furthermore, Gammatone Frequency Cepstral Coefficients (GFCCs) outperform other information representations by achieving the highest accuracy, particularly under noisy conditions. 
The interpretable results further reveal the underlying reasons for GFCC's superior performance, highlighting its ability to capture critical auditory features robustly across varying noise levels. These findings emphasize the potential of auditory perception-inspired representations in advancing computer audition-based disease detection systems and provide a solid foundation for future research in this domain.
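The robustness protocol described in this abstract (injecting Gaussian noise at a controlled SNR) can be sketched in a few lines. The helper below is a hypothetical illustration of that protocol, not code from the paper; the function name and defaults are assumptions.

```python
import numpy as np

def add_noise_at_snr(signal, snr_db, rng=None):
    """Add white Gaussian noise to a signal at a target SNR in dB.

    Hypothetical helper illustrating the Gaussian-noise-at-varying-SNRs
    robustness protocol described above; not the paper's code.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    signal = np.asarray(signal, dtype=float)
    signal_power = np.mean(signal ** 2)
    # Scale the noise power so that 10*log10(P_signal / P_noise) == snr_db.
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Sanity check: the realized SNR should be close to the 10 dB target.
x = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 16000))
y = add_noise_at_snr(x, snr_db=10.0)
realized_snr = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
```

Sweeping `snr_db` over a grid (e.g., 20 dB down to -5 dB) and re-evaluating each representation at every level reproduces the kind of robustness comparison the study performs.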
JBHI Journal 2026 Journal Article
Anxiety disorders (AD) are prevalent psychiatric conditions that profoundly impact adolescent neural development. Abnormal delta-beta cross-frequency coupling (CFC) has been identified as a key electrophysiological marker of altered neural dynamics in individuals with AD. However, most existing studies focus on static analysis within restricted brain regions and predefined frequency bands, which limits the understanding of large-scale dynamic neural communication. Therefore, we propose a novel cross-frequency coupling directed brain network (CFCDBN) framework, which integrates personalized CFC estimation and causal information flow modeling to capture the dynamic interactions of the brain network in AD. Personalized CFC significantly improves the precise representation of AD-related neural dynamics through adaptive frequency-band division and individualized oscillation-feature extraction, overcoming the limitations of traditional CFC methods. The analysis reveals significant delta-beta coupling abnormalities in the left hemisphere of AD patients, accompanied by disrupted directional pathways involving the thalamus, precuneus, and insula. These findings suggest impaired emotional and cognitive communication from subcortical to cortical regions. To validate the efficacy of CFCDBN in distinguishing AD patients from healthy individuals, we developed a direction-aware graph neural network (DA-GNN) model that uses CFCDBN representations as input to capture dynamic neural patterns in causal brain connectivity. Experimental results show that the model consistently outperforms traditional machine learning methods and undirected GNN baselines in automatic AD identification, achieving a classification accuracy of 77.8% and confirming the value of CFCDBN as a robust biomarker for AD-related network dysfunction. 
These findings not only deepen our understanding of the neural dynamics underlying AD, but also lay the foundation for personalized and mechanism-driven neuromodulation strategies. The core implementation of the CFCDBN framework is available on GitHub: https://github.com/wdxcjnb6/CFCDBN.
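One widely used delta-beta coupling estimate (which the personalized CFC in this abstract refines) is the correlation between band-limited amplitude envelopes. The sketch below illustrates that baseline measure on synthetic data; the function names and band edges are assumptions, and this is not the CFCDBN implementation.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_envelope(x, fs, lo, hi, order=4):
    """Band-pass filter x and return its Hilbert amplitude envelope."""
    sos = butter(order, [lo, hi], btype="band", fs=fs, output="sos")
    return np.abs(hilbert(sosfiltfilt(sos, x)))

def delta_beta_coupling(x, fs):
    """Pearson correlation between delta (1-4 Hz) and beta (13-30 Hz)
    amplitude envelopes -- one common delta-beta CFC estimate."""
    return np.corrcoef(band_envelope(x, fs, 1.0, 4.0),
                       band_envelope(x, fs, 13.0, 30.0))[0, 1]

# Synthetic check: a shared slow amplitude modulation couples the bands.
fs = 250.0
t = np.arange(0, 20, 1 / fs)
rng = np.random.default_rng(0)
slow = 1.0 + 0.8 * np.sin(2 * np.pi * 0.2 * t)      # common modulator
coupled = slow * np.sin(2 * np.pi * 2 * t) + slow * np.sin(2 * np.pi * 20 * t)
uncoupled = (slow * np.sin(2 * np.pi * 2 * t) + np.sin(2 * np.pi * 20 * t)
             + 0.05 * rng.standard_normal(t.size))
```

A fixed 1-4 Hz / 13-30 Hz split is exactly the predefined-band assumption the paper's adaptive frequency-band division is designed to relax.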
AAAI Conference 2026 Conference Paper
Cytological images originate from exfoliated cells, collected via liquid-based slides and digitized into whole slide images (WSIs). Unlike histological WSIs that exhibit continuous and well-structured tissue, cytological WSIs are sparse in spatial distribution and unstructured in cellular relationships. Typically, the nucleus serves as the primary diagnostic feature, while surrounding cytoplasmic information plays a supportive role. These unique characteristics limit the development of effective foundation models and hinder the transferability of histology-based models for cytopathology. To address this, we propose **Cyto-SSL**, the first self-supervised pretraining framework for cytological images. It introduces **Nuclei-Centered Perturbation**, which highlights individual nuclei by perturbing non-nuclear regions. We also design an SR-Transformer module, which complements this by using sparse attention to concentrate on diagnostically relevant scattered cells, while iRPE helps the model capture local spatial relationships and avoid unnecessary attention to irrelevant global structures. Experimental results show that **Cyto-SSL** enhances performance across diverse cytological datasets and Multiple Instance Learning (MIL) methods. On a WSI-level dataset, it achieved 95.67% accuracy and outperformed ImageNet-pretrained ResNet-50 by 11.33%, demonstrating superior feature representation for cytological analysis. Additionally, **Cyto-SSL** modules are plug-and-play, easily integrated into other pretraining frameworks, yielding a 2.6% accuracy gain across different SSL methods.
JBHI Journal 2026 Journal Article
Sparse-channel EEG emotion recognition focuses on selecting specific brain regions or a small number of channels to achieve efficient and robust emotion recognition. Although previous studies have demonstrated excellent performance using dense EEG signals, sparse-channel EEG poses a challenge to recognition performance due to its limited feature representation capability. To address these challenges, we propose a Decoupled Feature Interaction (DFI) method to improve sparse-channel EEG emotion recognition. The proposed method flexibly focuses on decoupled features while enabling adaptive cross-feature information interaction, aiming to enhance the contribution of each feature in sparse EEG data. Specifically, we design a self-supervised auxiliary task that enhances representation learning while generating augmented data. The representations of the original and augmented data are decoupled into two components: invariant features and adaptive features. DFI supervises these decoupled features in a high-dimensional space to maximize their separation. Each decoupled component is dynamically attended to within DFI, with cross-attention applied to adaptive features and self-attention applied to invariant features, enabling both inter- and intra-feature interactions. We evaluate the proposed method on public datasets, and the results consistently demonstrate its superiority over existing emotion recognition methods. To evaluate the model under real-world conditions, we constructed a private dataset containing 3-channel electroencephalogram recordings. On this dataset, DFI achieved an accuracy of 98.58% and an F1 score of 98.92% in binary emotion classification, clearly demonstrating its superiority over existing methods.
JBHI Journal 2026 Journal Article
Depression is a prevalent mental disorder with severe socio-economic implications, and its early identification and intervention are crucial for mitigating disease progression. However, existing machine learning and deep learning-based approaches for depression recognition exhibit limited generalization across individuals, making them less adaptable to new subjects and restricting their practical applications. To address this issue, we propose a cross-subject depression recognition method based on Multi-Source Few-Shot Adaptation (MSFSA) using electroencephalography (EEG). The proposed method integrates multi-source domain adaptation and ensemble learning strategies. Specifically, the multi-source domain adaptation module employs an alternating training mechanism combining unsupervised domain adaptation and few-shot adaptation, reducing the model's dependency on specific subjects. Meanwhile, ensemble learning improves model robustness and stability by aggregating multiple model predictions, reducing the impact of individual model biases and enhancing classification reliability. Experiments were conducted on the public MODMA EEG dataset, comprising 53 subjects (24 patients with major depressive disorder and 29 healthy controls), with a theoretical chance level of 50% in the cross-subject classification setting. Leveraging Alpha and low-Gamma band features as the key contributing factors, the proposed method achieves a significant improvement in accuracy, reaching 87.12% under the 10-fold cross-subject validation protocol, outperforming traditional machine learning methods, existing EEG-based depression recognition models, and advanced domain adaptation algorithms, including the state-of-the-art HEMAsNet (80.67%) and WDANet (70.94%) on the same dataset. 
These findings indicate that the proposed approach effectively reduces subject dependency in EEG-based depression recognition and provides a promising solution for improving cross-subject adaptability.
JBHI Journal 2026 Journal Article
Depression is a common and serious mental disorder, characterized by persistent low mood, loss of interest, cognitive dysfunction, and physiological changes. Patients may experience symptoms such as sleep disturbances, changes in appetite, fatigue, and low self-esteem, with severe cases potentially leading to suicidal behavior. Because patients with depression and healthy controls differ in emotional processing and attention allocation, eye movement characteristics such as fixation patterns, saccade amplitude, and attentional bias have been used as physiological signals for depression detection. Many researchers have developed depression recognition models based on ocular imaging. However, convolutional neural networks, which rely on local receptive fields, can only capture local features in ocular imaging. This paper proposes the Multi-Scale Temporal-Frequency Attention Network (MTFNet), which innovatively integrates multi-scale time-frequency-domain attention into the Video Swin Transformer. Through the Multi-Scale Temporal-Frequency Attention Module (MTFAM), MTFNet learns the most important regions in eye movement images, enabling it to capture features more effectively from sequential data and gain a deeper understanding of the structure within eye movement images. Experimental results show that the proposed method achieves a high accuracy of 76.8% on a self-collected eye movement image dataset, outperforming most models. This work provides a novel approach to research on depression recognition based on eye movement images.
JBHI Journal 2026 Journal Article
Depression remains a leading cause of suicide among college students, highlighting the need for effective and scalable screening methods. Internet usage behavior has shown strong potential for identifying depressive tendencies, but privacy concerns limit its practical use. In this study, we propose a privacy-conscious cross-scale adaptive transformer designed for irregular time series data derived from weakly private online behavior, such as application categories and usage patterns, while excluding content-sensitive or personally identifiable information. Our model incorporates an adaptive sampling strategy to unify temporal resolutions and uses a cross-scale attention mechanism to capture depression-related behavioral patterns. We compared several classic models for irregular time series data, and the proposed method outperformed them, offering a promising, non-intrusive approach for depression detection based on privacy-conscious online activity patterns.
AAAI Conference 2026 Conference Paper
Spatial transcriptomics provides unprecedented opportunities to analyze gene patterns while preserving spatial tissue architecture. However, traditional deep learning methods for spatial transcriptomics analysis face significant challenges in multi-modal data integration, spatial dependency modeling, and biological knowledge incorporation, while existing large language models lack explicit spatial modeling capabilities for transcriptomic data. We therefore present Spatial Transcriptomics Embedding with Large Language Models (ST-LLM), a novel, simple, and effective approach that transforms intricate spatial graph structures into structured textual representations suitable for large language models (LLMs). ST-LLM dynamically constructs graph adjacency using reinforcement learning paradigms to adaptively optimize spatial relationships, converts the resulting graphs into hierarchical textual descriptions with spatial context, and leverages pre-trained semantic understanding to generate high-dimensional spatial-aware representations. Comprehensive experiments on 14 datasets demonstrate that ST-LLM achieves comparable or better performance than traditional models. ST-LLM shows that LLM embeddings provide a new, simple, and effective path to encoding biological knowledge in spatial transcriptomics.
AAAI Conference 2026 Conference Paper
Gradient perturbation mechanisms, such as differential privacy (DP), aim to defend against gradient inversion attacks (GIA) by injecting noise into the shared gradients. Recent studies have shown that DP-based defenses lack robustness against advanced GIAs. However, existing gradient inversion methods typically rely on iterative refinement and assume static noise, resulting in low efficiency and limited reconstruction fidelity under high-noise conditions. In this paper, we propose Venom, a novel gradient inversion attack method based on a liquid diffusion mechanism. Venom reconstructs private data directly from DP-protected gradients without requiring any prior knowledge of the noise distribution. Specifically, we design a Structural Prior Extraction (SPE) module that analytically extracts deep feature representations from perturbed gradients through energy-based aggregation, enabling stable pre-reconstruction of users' latent data features. We further introduce a Diffusion-driven Liquid Recovery Network (Diff-LRN) for high-fidelity image reconstruction. Unlike traditional diffusion models that rely on iterative sampling with predefined noise schedules, Diff-LRN performs deterministic single-step reconstruction using adaptive liquid neural dynamics to handle spatially heterogeneous noise patterns. Experiments across four benchmarks demonstrate that Venom achieves a speedup of up to 38,343× over state-of-the-art attacks while maintaining high reconstruction fidelity under strong DP settings. These results challenge prevailing assumptions about DP robustness and underscore the need for more resilient privacy-preserving mechanisms in federated learning.
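The DP-based defense that Venom targets is typically the DP-SGD gradient perturbation: clip each per-example gradient in L2, average, and add Gaussian noise. The sketch below is a minimal illustration of that mechanism (function name and defaults are assumptions), not the paper's implementation or the attack itself.

```python
import numpy as np

def dp_sanitize_gradients(per_example_grads, clip_norm=1.0,
                          noise_multiplier=1.0, rng=None):
    """DP-SGD-style gradient perturbation: clip each per-example gradient
    to clip_norm in L2, average, then add Gaussian noise. A minimal
    sketch of the defense such attacks target, not the paper's code."""
    rng = np.random.default_rng(0) if rng is None else rng
    clipped = []
    for g in per_example_grads:
        g = np.asarray(g, dtype=float)
        scale = min(1.0, clip_norm / (np.linalg.norm(g) + 1e-12))
        clipped.append(g * scale)
    mean = np.mean(clipped, axis=0)
    # Noise standard deviation follows the usual sigma = z * C / n scaling.
    sigma = noise_multiplier * clip_norm / len(per_example_grads)
    return mean + rng.normal(0.0, sigma, size=mean.shape)

# With the noise disabled, the output is just the clipped-gradient mean.
grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]   # L2 norms 5.0 and 0.5
clean_mean = dp_sanitize_gradients(grads, clip_norm=1.0, noise_multiplier=0.0)
```

Because the released quantity is this noisy mean, attacks such as Venom must reconstruct inputs without knowing the realized noise, which is precisely the high-noise regime the abstract discusses.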
ICML Conference 2025 Conference Paper
We introduce Adjoint Sampling, a highly scalable and efficient algorithm for learning diffusion processes that sample from unnormalized densities, or energy functions. It is the first on-policy approach that allows significantly more gradient updates than the number of energy evaluations and model samples, allowing us to scale to much larger problem settings than previously explored by similar methods. Our framework is theoretically grounded in stochastic optimal control and shares the same theoretical guarantees as Adjoint Matching, being able to train without the need for corrective measures that push samples towards the target distribution. We show how to incorporate key symmetries, as well as periodic boundary conditions, for modeling molecules in both Cartesian and torsional coordinates. We demonstrate the effectiveness of our approach through extensive experiments on classical energy functions, and further scale up to neural network-based energy models where we perform amortized conformer generation across many molecular systems. To encourage further research in developing highly scalable sampling methods, we plan to open source these challenging benchmarks, where successful methods can directly impact progress in computational chemistry. Code and benchmarks are provided at https://github.com/facebookresearch/adjoint_sampling.
JBHI Journal 2025 Journal Article
The development of affective computing and medical electronic technologies has led to the emergence of Artificial Intelligence (AI)-based methods for the early detection of depression. However, previous studies have often overlooked the necessity for the AI-assisted diagnosis system to be wearable and accessible in practical scenarios for depression recognition. In this work, we present an on-board executable multi-feature transfer-enhanced fusion model for our custom-designed wearable three-lead Electroencephalogram (EEG) sensor, based on EEG data collected from 73 depressed patients and 108 healthy controls. Experimental results show that the proposed model exhibits low computational complexity (65.0 K parameters), promising Floating-Point Operations (FLOPs) performance (25.6 M), real-time processing (1.5 s/execution), and low power consumption (320.8 mW). Furthermore, it requires only 202.0 KB of Random Access Memory (RAM) and 279.6 KB of Read-Only Memory (ROM) when deployed on the EEG sensor. Despite its low computational and spatial complexity, the model achieves a notable classification accuracy of 95.2%, specificity of 94.0%, and sensitivity of 96.9% under independent test conditions. These results underscore the potential of deploying the model on the wearable three-lead EEG sensor for assisting in the diagnosis of depression.
EAAI Journal 2025 Journal Article
ICRA Conference 2025 Conference Paper
This paper addresses a distributed leader-follower formation control problem for a group of agents, each using a body-fixed camera with a limited field of view (FOV) for state estimation. The main challenge arises from the need to coordinate the agents' movements with their cameras' FOV to maintain visibility of the leader for accurate and reliable state estimation. To address this challenge, we propose a novel perception-aware distributed leader-follower safe control scheme that incorporates FOV limits as state constraints. A Control Barrier Function (CBF) based quadratic program is employed to ensure the forward invariance of a safety set defined by these constraints. Furthermore, new neural-network-based and double-bounding-box-based estimators, combined with temporal filters, are developed to estimate system states directly from real-time image data, providing consistent performance across various environments. Comparison results in the Gazebo simulator demonstrate the effectiveness and robustness of the proposed framework in two distinct environments.
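For a single affine CBF constraint, the safety-filtering quadratic program mentioned above has a closed-form solution: project the nominal input onto the safe half-space. The sketch below shows that one-constraint case; the function signature and the constraint form are illustrative assumptions, and the paper's QP (which encodes FOV limits and may carry multiple constraints) is more elaborate.

```python
import numpy as np

def cbf_safe_control(u_nom, Lg_h, Lf_h, h, alpha=1.0):
    """Closed-form solution of the single-constraint CBF quadratic program

        min_u ||u - u_nom||^2   s.t.   Lf_h + Lg_h @ u + alpha * h >= 0,

    i.e., a Euclidean projection of the nominal input onto the safe
    half-space. Illustrative sketch, not the paper's controller."""
    u_nom = np.asarray(u_nom, dtype=float)
    Lg_h = np.asarray(Lg_h, dtype=float)
    slack = Lf_h + Lg_h @ u_nom + alpha * h
    if slack >= 0.0:
        return u_nom                                 # nominal input already safe
    return u_nom - Lg_h * slack / (Lg_h @ Lg_h)      # minimal correction

# Example: the nominal input violates the constraint and gets projected
# onto the constraint boundary.
u_safe = cbf_safe_control(u_nom=[1.0, 0.0], Lg_h=[0.0, 1.0], Lf_h=-0.5, h=0.0)
```

When the nominal input already satisfies the barrier condition, the filter leaves it untouched, which is what makes this scheme minimally invasive with respect to the formation-tracking objective.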
ICLR Conference 2025 Conference Paper
The rapid advancements in Vision-Language Models (VLMs) have shown great potential in tackling mathematical reasoning tasks that involve visual context. Unlike humans who can reliably apply solution steps to similar problems with minor modifications, we found that state-of-the-art VLMs like GPT-4o can consistently fail in these scenarios, revealing limitations in their mathematical reasoning capabilities. In this paper, we investigate the **mathematical reasoning robustness** in VLMs and evaluate how well these models perform under different variants of the same question, such as changes in visual numerical values or function graphs. While several vision-based math benchmarks have been developed to assess VLMs' problem-solving capabilities, these benchmarks contain only static sets of problems and cannot easily evaluate mathematical reasoning robustness. To fill this gap, we introduce **DynaMath**, a dynamic visual math benchmark designed for in-depth assessment of VLMs. **DynaMath** includes 501 high-quality, multi-topic *seed* questions, *each represented as a Python program*. Those programs are carefully designed and annotated to enable the automatic generation of a much larger set of *concrete* questions, including many different types of visual and textual variations. **DynaMath** allows us to evaluate the generalization ability of VLMs, by assessing their performance under varying input conditions of a seed question. We evaluated 14 state-of-the-art VLMs with 5,010 generated concrete questions (10 per seed question). Our results show that the worst-case model accuracy, defined as the percentage of correctly answered seed questions in all 10 variants, is significantly lower than the average-case accuracy. In addition, many models show high consistency in answering these questions -- the incorrectness of a certain variant of a seed question is not only due to inherent randomness. 
Our analysis emphasizes the need to study the robustness of VLMs' reasoning abilities, and **DynaMath** provides valuable insights to guide the development of more reliable models for mathematical reasoning.
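The "seed question as a Python program" design can be illustrated with a toy example: a program that samples concrete values and emits a (question, ground-truth) variant on each call. This is a hypothetical stand-in in the spirit of the benchmark, not one of DynaMath's 501 actual seed questions.

```python
import random

def linear_graph_seed(rng):
    """A toy DynaMath-style seed question: sample concrete parameters and
    return a (question_text, ground_truth_answer) variant. Hypothetical
    example, not an actual benchmark seed question."""
    a = rng.randint(2, 9)   # slope shown in the (hypothetical) graph
    b = rng.randint(2, 9)   # intercept shown in the graph
    question = (f"The graph shows the line y = {a}x + {b}. "
                f"What is the value of y when x = 3?")
    return question, 3 * a + b

# Generate 10 concrete variants of the seed, mirroring the paper's
# 10-variants-per-seed evaluation protocol.
rng = random.Random(42)
variants = [linear_graph_seed(rng) for _ in range(10)]
```

Because every variant carries a programmatically computed ground truth, worst-case accuracy over the variants of a seed can be scored automatically, which is what separates this design from a static benchmark.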
EAAI Journal 2025 Journal Article
IJCAI Conference 2025 Conference Paper
Long video understanding with Large Language Models (LLMs) enables the description of objects that are not explicitly present in the training data. However, continuous changes in known objects and the emergence of new ones require up-to-date knowledge of objects and their dynamics for effective understanding of the open world. To alleviate this, we propose an efficient Retrieval-Enhanced Video Understanding method, dubbed REVU, which leverages external knowledge to enhance the performance of open-world learning. First, REVU introduces an extensible external text-object memory with minimal text-visual mapping, involving static and dynamic multimodal information to help LLMs-based models align text and vision features. Second, REVU retrieves object information from external databases and dynamically integrates frame-specific data from videos, enabling effective knowledge aggregation to comprehend the open world. We conducted experiments on multiple benchmark datasets, and our model demonstrates strong adaptability to out-of-domain data without requiring additional fine-tuning or re-training. Experiments on benchmark video understanding datasets reveal that our model achieves state-of-the-art performance and robust generalization.
JBHI Journal 2025 Journal Article
Federated learning (FL) has gained prominence in electroencephalogram (EEG)-based emotion recognition because of its ability to enable secure collaborative training without centralized data. However, traditional FL faces challenges due to model and data heterogeneity in smart healthcare settings. For example, medical institutions have varying computational resources, which creates a need for personalized local models. Moreover, EEG data from medical institutions typically face data heterogeneity issues stemming from limitations in participant availability, ethical constraints, and cultural differences among subjects, which can slow model convergence and degrade model performance. To address these challenges, we propose FedKDC, a novel FL framework that incorporates clustered knowledge distillation (CKD). This method introduces a consensus-based distributed learning mechanism to facilitate the clustering process. It then enhances the convergence speed through intraclass distillation and reduces the negative impact of heterogeneity through interclass distillation. Additionally, we introduce a DriftGuard mechanism to mitigate client drift, along with an entropy reducer to decrease the entropy of aggregated knowledge. The framework is validated on the SEED, SEED-IV, SEED-FRA, and SEED-GER datasets, demonstrating its effectiveness in scenarios where both the data and the models are heterogeneous. Experimental results show that FedKDC outperforms other FL frameworks in emotion recognition, achieving a maximum average accuracy of 85.2%, and in convergence efficiency, with faster and more stable convergence.
JBHI Journal 2025 Journal Article
Segmentation of cell nuclei from three-dimensional (3D) volumetric fluorescence microscopy images is crucial for biological and clinical analyses. In recent years, convolutional neural networks have become the reliable 3D medical image segmentation standard. However, convolutional layers are limited by their finite receptive fields and weight-sharing mechanisms. Consequently, they struggle to effectively model long-range dependencies and spatial correlations, which may lead to inadequate nuclei segmentation. Moreover, the diversity in nuclear appearance and density poses additional challenges. This work proposes a lightweight multi-layer deep aggregation network, MLDA-Net, incorporating Wide Receptive Field Attention (WRFA). This module effectively simulates the large receptive field generated by self-attention in the Swin Transformer while requiring fewer model parameters. This design implements an extended global sensory field that enhances the ability to capture a wide range of spatial information. In addition, the multiple cross-attention (MCA) module in MLDA-Net enhances the output features of different resolutions from the encoder while maintaining global effectiveness. The Multi-Path Aggregation Feature Pyramid Network (MAFPN) receives multi-scale outputs from the MCA module, generating a robust hierarchical feature pyramid for the final prediction. MLDA-Net outperforms state-of-the-art networks, including 3DU-Net, nnFormer, UNETR, SwinUNETR, and 3DUXNET, on the 3D volumetric datasets NucMM and MitoEM. It achieves average performance improvements of 4% to 7% in F1 score, MIoU, and PQ metrics, thereby establishing new benchmark results.
AAAI Conference 2025 Conference Paper
Recent works on remote PhotoPlethysmoGraphy (rPPG) estimation typically use techniques like CNNs and Transformers to encode implicit features from facial videos for prediction. These methods learn to directly map facial videos to the static values of rPPG signals, overlooking the inherent dynamic characteristics of rPPG sequence. Moreover, the rPPG signal is extremely weak and highly susceptible to interference from various sources of noise, including illumination conditions, head movements, and variations in skin tone. To address these limitations, we propose a Physiology-based dynamicity disentangled diffusion (PhysDiff) model particularly designed for robust rPPG estimation. PhysDiff leverages the diffusion model to learn the distribution of quasi-periodic rPPG signal and uses a dynamicity disentanglement strategy to capture two dynamic characteristics in temporal rPPG signal, i.e., trend and amplitude. This disentanglement is motivated by the underlying dynamic physiological processes of vasodilation and vasoconstriction, ensuring a more precise representation of the rPPG signal. The disentangled components are then used as pivotal conditions in the proposed spatial-temporal hybrid denoiser for rPPG reconstruction. Besides, we introduce a periodicity-based multi-hypothesis selection strategy in model inference, which compares the natural periodicity of multiple generated rPPG hypotheses and selects the most favorable one as the final prediction. Extensive experiments on four datasets demonstrate that our PhysDiff significantly outperforms prior methods on both intra-dataset and cross-dataset testing.
JBHI Journal 2025 Journal Article
Schizophrenia (SZ) is a severe mental disorder characterized by hallucinations, delusions, cognitive impairments, and social withdrawal. It leads to a series of brain abnormalities, particularly the deformation of the hippocampus and amygdala, which are highly associated with emotion, memory, and motivation. Most previous studies have used hippocampal and amygdaloid volume, whereas surface-based morphometry reflects nuclear deformation more finely; however, it remains unclear how hippocampal and amygdaloid morphometry relates to schizophrenic pathology and whether it can serve as a biomarker. In this study, we extracted individual multivariate morphometry statistics (MMS) of the hippocampus and amygdala from MRI images and analyzed the morphometric differences between groups. After dictionary learning and max pooling, we obtained reduced-dimensional features and used machine learning algorithms for individual diagnosis. The results showed that the hippocampus in the schizophrenia group was significantly atrophied bilaterally, and the atrophied areas were symmetrical. Subregions of the amygdala are both atrophied and expanded; in particular, the right amygdala shows a greater degree and extent of deformation. Using the random forest classifier, the classification accuracies based on hippocampal and amygdaloid morphometric features are 94.52% and 94.57%, respectively, and the accuracy of classification combining the two sets of morphometric features reached 96.57%. Our study demonstrates the efficacy of MMS in identifying morphometric differences of the hippocampus and amygdala between healthy controls and patients with schizophrenia, and these findings emphasize the potential of MMS as a reliable biomarker for the diagnosis of schizophrenia.
NeurIPS Conference 2025 Conference Paper
Modern engineering, spanning electrical, mechanical, aerospace, civil, and computer disciplines, stands as a cornerstone of human civilization and the foundation of our society. However, engineering design poses a fundamentally different challenge for large language models (LLMs) compared with traditional textbook-style problem solving or factual question answering. Although existing benchmarks have driven progress in areas such as language understanding, code synthesis, and scientific problem solving, real-world engineering design demands the synthesis of domain knowledge, navigation of complex trade-offs, and management of the tedious processes that consume much of practicing engineers' time. Despite these shared challenges across engineering disciplines, no benchmark currently captures the unique demands of engineering design work. In this work, we introduce EngDesign, an Engineering Design benchmark that evaluates LLMs' abilities to perform practical design tasks across nine engineering domains. Unlike existing benchmarks that focus on factual recall or question answering, EngDesign uniquely emphasizes LLMs' ability to synthesize domain knowledge, reason under constraints, and generate functional, objective-oriented engineering designs. Each task in EngDesign represents a real-world engineering design problem, accompanied by a detailed task description specifying design goals, constraints, and performance requirements. EngDesign pioneers a simulation-based evaluation paradigm that moves beyond textbook knowledge to assess genuine engineering design capabilities and shifts evaluation from static answer checking to dynamic, simulation-driven functional verification, marking a crucial step toward realizing the vision of engineering Artificial General Intelligence (AGI).
JBHI Journal 2025 Journal Article
Facial expressions have been widely used for depression recognition because they are intuitive and convenient to access. Pupil diameter contains rich emotional information that is already reflected in facial video streams. However, the spatiotemporal correlation between pupillary changes and facial behavior changes induced by emotional stimuli has not been explored in existing studies. This paper presents a novel multimodal fusion algorithm, Trial Selection Tensor Canonical Correlation Analysis (TSTCCA), to optimize the feature space and build a more robust depression recognition model, which innovatively combines the spatiotemporal relevance and complementarity between facial expression and pupil diameter features. TSTCCA explores the interaction between trials and obtains an effective fusion representation of the two modalities from a trial subset related to depression. The experimental results show that TSTCCA achieves the highest accuracy of 78.81% with a subset of 25 trials.
NeurIPS Conference 2025 Conference Paper
Learning-based neural network (NN) control policies have shown impressive empirical performance. However, obtaining stability guarantees and estimates of the region of attraction of these learned neural controllers is challenging due to the lack of stable and scalable training and verification algorithms. Although previous works in this area have achieved great success, much conservatism remains in their frameworks. In this work, we propose a novel two-stage training framework to jointly synthesize a controller and a Lyapunov function for continuous-time systems. By leveraging a Zubov-inspired region of attraction characterization to directly estimate stability boundaries, we propose a novel training-data sampling strategy and a domain-updating mechanism that significantly reduce the conservatism in training. Moreover, unlike existing works on continuous-time systems that rely on an SMT solver to formally verify the Lyapunov condition, we extend the state-of-the-art neural network verifier $\alpha,\beta$-CROWN with the capability of performing automatic bound propagation through the Jacobian of dynamical systems and a novel verification scheme that avoids expensive bisection. To demonstrate the effectiveness of our approach, we conduct numerical experiments by synthesizing and verifying controllers on several challenging nonlinear systems across multiple dimensions. We show that our training can yield regions of attraction with volume $5$ to $1.5\times 10^{5}$ times larger compared to the baselines, and our verification on continuous systems can be up to $40$ to $10{,}000$ times faster compared to the traditional SMT solver dReal. Our code is available at https://github.com/Verified-Intelligence/Two-Stage_Neural_Controller_Training.
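The Lyapunov conditions being certified here can be illustrated with a naive sampling check, far weaker than the formal verification the paper describes: for a candidate $V$ and hypothetical dynamics $f$ (both invented for illustration), a falsification pass tests $V(x) > 0$ and $\dot V(x) = \nabla V(x)^\top f(x) < 0$ on random points in a ball.

```python
import numpy as np

def f(x):
    # Hypothetical stable nonlinear dynamics dx/dt = f(x), for illustration only.
    return np.array([-x[0] + 0.1 * x[1] ** 2, -x[1]])

def V(x):
    # Candidate Lyapunov function V(x) = ||x||^2.
    return float(x @ x)

def V_dot(x):
    # Lie derivative of V along f: grad V(x) . f(x), with grad V(x) = 2x.
    return float(2 * x @ f(x))

def sample_check(radius=0.5, n=1000, seed=0):
    # Empirically check V > 0 and dV/dt < 0 at random nonzero points in a
    # ball around the origin -- a falsification pass, not a formal proof.
    rng = np.random.default_rng(seed)
    for _ in range(n):
        x = rng.uniform(-radius, radius, size=2)
        if np.linalg.norm(x) < 1e-6:
            continue  # conditions are only required away from the equilibrium
        if V(x) <= 0 or V_dot(x) >= 0:
            return False
    return True
```

Passing such a check only means no counterexample was sampled; the point of the paper's verifier is to close exactly this gap with sound bound propagation.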
JBHI Journal 2024 Journal Article
Functional connectivity (FC) networks, built from analyses of resting-state functional magnetic resonance imaging (rs-fMRI), serve as efficacious biomarkers for identifying Autism Spectrum Disorder (ASD) patients. Given the neurobiological heterogeneity across individuals and the unique presentation of ASD symptoms, fusing individualized information into the diagnosis becomes essential. However, this aspect is overlooked in most methods. Furthermore, existing methods typically focus on studying direct pairwise connections between brain ROIs, while disregarding interactions between indirectly connected neighbors. To overcome the above challenges, we build common FC and individualized FC by tangent Pearson embedding (TP) and common orthogonal basis extraction (COBE), respectively, and present a novel multiview brain transformer (MBT) aimed at effectively fusing the common and individualized information of subjects. MBT is mainly constructed from transformer layers with a diffusion kernel (DK), a fusion quality-inspired weighting module (FQW), a similarity loss, and an orthonormal clustering fusion readout module (OCFRead). The DK transformer can incorporate higher-order random walk methods to capture wider interactions among indirectly connected brain regions. FQW promotes adaptive fusion of features between views, and the similarity loss and OCFRead are placed on the last layer to accomplish the ultimate integration of information. In our method, the TP, DK, and FQW modules all help to model wider connectivity in the brain, making up for the shortcomings of traditional methods. We conducted experiments on the public ABIDE dataset based on the AAL and CC200 templates. Our framework has shown promising results, outperforming state-of-the-art methods on both templates. This suggests its potential as a valuable approach for clinical ASD diagnosis.
JBHI Journal 2024 Journal Article
Brain anatomical age is an effective feature for assessing the status of the brain, such as atypical development and aging. Although some deep learning models have been developed for estimating infant brain age, their performance has been unsatisfactory because few of them considered the developmental characteristics of brain anatomy during the perinatal period, the most rapid and complex developmental stage across the lifespan. The present study proposed an attention-based hemispheric relation inference network (HRINet) that takes advantage of the nature of brain structural lateralization during early development. This model captures the inter-hemispheric relationship using a graph attention mechanism and transmits lateralization information as features to describe the interactive development between bilateral hemispheres. The HRINet was used to estimate the brain age of 531 preterm and full-term neonates from the Developing Human Connectome Project (dHCP) database based on two metrics (mean curvature and sulcal depth) characterizing the folding morphology of the cortex. Our results showed that the HRINet outperformed other benchmark models in fitting the perinatal brain age, with a mean absolute error of 0.53 and a determination coefficient of 0.89. We also verified the generalizability of the HRINet on an extra independent dataset collected from the Gansu Provincial Maternity and Child-care Hospital. Furthermore, by applying the best-performing model to an independent dataset consisting of 47 scans of preterm infants at term-equivalent age, we showed that the predicted age was significantly lower than the chronological age, suggesting delayed development of premature brains. Our results demonstrate the effectiveness and generalizability of the HRINet in estimating infant brain age, providing promising clinical applications for assessing neonatal brain maturity.
RLJ Journal 2024 Journal Article
Large language models (LLMs) encode a vast amount of world knowledge acquired from massive text datasets. Recent studies have demonstrated that LLMs can assist an embodied agent in solving complex sequential decision making tasks by providing high-level instructions. However, interactions with LLMs can be time-consuming: in many practical scenarios, LLMs require a significant amount of storage space and can only be deployed on remote cloud servers. Additionally, using commercial LLMs can be costly, since they may charge based on usage frequency. In this paper, we explore how to enable intelligent, cost-effective interactions between a downstream task-oriented agent and an LLM. We find that this problem can be naturally formulated as a Markov decision process (MDP), and propose When2Ask, a reinforcement learning based approach that learns when it is necessary to query the LLM for high-level instructions to accomplish a target task. On one side, When2Ask discourages unnecessary redundant interactions; on the other, it enables the agent to identify and follow useful instructions from the LLM. This enables the agent to halt an ongoing plan and transition to a more suitable one based on new environmental observations. Experiments on MiniGrid and Habitat environments that entail planning sub-goals demonstrate that When2Ask learns to solve target tasks with only a few necessary interactions with the LLM, significantly reducing interaction costs in testing environments compared with baseline methods. Our code is available at: https://github.com/ZJLAB-AMMI/LLM4RL.
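The ask-or-act MDP described above can be illustrated with a toy tabular Q-learning mediator (not the paper's implementation; the states, rewards, and query cost below are all invented for illustration): the agent learns to pay the query cost only when its current plan has become outdated.

```python
import numpy as np

# States: 0 = current plan still valid ("fresh"), 1 = plan outdated ("stale").
# Actions: 0 = follow the current plan, 1 = query the LLM for a new plan.
FRESH, STALE = 0, 1
FOLLOW, ASK = 0, 1

def step(state, action, rng):
    # Hypothetical environment dynamics and rewards.
    if action == ASK:
        return -0.2, FRESH                      # pay query cost, plan refreshed
    if state == FRESH:
        # Following a valid plan earns reward; it may become outdated.
        return 1.0, STALE if rng.random() < 0.3 else FRESH
    return -1.0, STALE                          # following an outdated plan fails

def train(steps=50_000, alpha=0.1, gamma=0.9, eps=0.2, seed=0):
    # Plain epsilon-greedy tabular Q-learning over the 2x2 state-action space.
    rng = np.random.default_rng(seed)
    Q = np.zeros((2, 2))
    s = FRESH
    for _ in range(steps):
        a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(Q[s]))
        r, s2 = step(s, a, rng)
        Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
        s = s2
    return Q

Q = train()
# Learned policy: follow the plan while it is fresh, ask only when it is stale.
```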
RLC Conference 2024 Conference Paper
Large language models (LLMs) encode a vast amount of world knowledge acquired from massive text datasets. Recent studies have demonstrated that LLMs can assist an embodied agent in solving complex sequential decision making tasks by providing high-level instructions. However, interactions with LLMs can be time-consuming: in many practical scenarios, LLMs require a significant amount of storage space and can only be deployed on remote cloud servers. Additionally, using commercial LLMs can be costly, since they may charge based on usage frequency. In this paper, we explore how to enable intelligent, cost-effective interactions between a downstream task-oriented agent and an LLM. We find that this problem can be naturally formulated as a Markov decision process (MDP), and propose When2Ask, a reinforcement learning based approach that learns when it is necessary to query the LLM for high-level instructions to accomplish a target task. On one side, When2Ask discourages unnecessary redundant interactions; on the other, it enables the agent to identify and follow useful instructions from the LLM. This enables the agent to halt an ongoing plan and transition to a more suitable one based on new environmental observations. Experiments on MiniGrid and Habitat environments that entail planning sub-goals demonstrate that When2Ask learns to solve target tasks with only a few necessary interactions with the LLM, significantly reducing interaction costs in testing environments compared with baseline methods. Our code is available at: https://github.com/ZJLAB-AMMI/LLM4RL.
JBHI Journal 2024 Journal Article
Recently, psychophysiological computing has received considerable attention. Owing to easy acquisition at a distance and less conscious initiation, gait-based emotion recognition is considered a valuable research branch in the field of psychophysiological computing. However, most existing methods rarely explore the spatio-temporal context of gait, which limits their ability to capture the higher-order relationship between emotion and gait. In this paper, we draw on a range of research, including psychophysiological computing and artificial intelligence, to propose an integrated emotion perception framework called EPIC, which can discover novel joint topology and generate thousands of synthetic gaits from spatio-temporal interaction context. First, we analyze the joint coupling among non-adjacent joints by calculating the Phase Lag Index (PLI), which can discover latent connections among body joints. Second, to synthesize more sophisticated and accurate gait sequences, we explore the effect of spatio-temporal constraints and propose a new loss function that utilizes the Dynamic Time Warping (DTW) algorithm and a pseudo-velocity curve to constrain the output of Gated Recurrent Units (GRUs). Finally, Spatial Temporal Graph Convolution Networks (ST-GCN) are used to classify emotions using the generated and real data. Experimental results demonstrate that our approach achieves an accuracy of 89.66% and outperforms the state-of-the-art methods on the Emotion-Gait dataset.
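As a minimal illustration of the DTW component of such a loss, here is the plain dynamic-programming alignment cost between two 1-D sequences; this is the textbook algorithm, not the paper's differentiable training loss, and the sequences stand in for generated versus real joint trajectories.

```python
import numpy as np

def dtw_cost(a, b):
    # Dynamic Time Warping: minimal cumulative cost of aligning sequence a
    # to sequence b, allowing each element to match one or more elements
    # of the other sequence (monotone, contiguous warping path).
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])          # local matching cost
            D[i, j] = d + min(D[i - 1, j],        # insertion
                              D[i, j - 1],        # deletion
                              D[i - 1, j - 1])    # match
    return float(D[n, m])
```

Because warping can repeat elements, a sequence and its time-stretched copy align at zero cost, which is what makes DTW attractive for comparing gaits executed at different speeds.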
JBHI Journal 2024 Journal Article
Ubiquitous sensing has been widely applied in smart healthcare, providing an opportunity for intelligent heart sound auscultation. However, smart devices contain sensitive information, raising user privacy concerns. To this end, federated learning (FL) has been adopted as an effective solution, enabling decentralised learning without data sharing, thus preserving data privacy in the Internet of Health Things (IoHT). Nevertheless, traditional FL requires the same architectural models to be trained across local clients and global servers, leading to a lack of model heterogeneity and client personalisation. For medical institutions with private data clients, this study proposes Fed-MStacking, a heterogeneous FL framework that incorporates a stacking ensemble learning strategy to support clients in building their own models. A secondary objective of this study is to address scenarios involving local clients whose data have inconsistent labelling. Specifically, a local client may contain only one case type, and its data cannot be shared within or outside the institution. To train a global multi-class classifier, we aggregate missing-class information from all clients at each institution and build meta-data, which then participate in FL training via a meta-learner. We apply the proposed framework to a multi-institutional heart sound database. The experiments utilise random forests (RFs), feedforward neural networks (FNNs), and convolutional neural networks (CNNs) as base classifiers. The results show that heterogeneous stacking of local models performs better than homogeneous stacking.
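The stacking idea can be sketched as follows, with toy 1-D data and hypothetical base classifiers standing in for the heterogeneous RF/FNN/CNN clients; the meta-learner here is a simple least-squares combiner over base outputs, not the paper's federated meta-learner.

```python
import numpy as np

# Toy two-class data (a hypothetical stand-in for heart sound features).
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(-1, 0.5, 100), rng.normal(1, 0.5, 100)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# Heterogeneous "base classifiers": each client may use a different model type.
base_models = [
    lambda x: (x > 0).astype(float),           # threshold rule
    lambda x: (x > -0.2).astype(float),        # biased threshold rule
    lambda x: 1.0 / (1.0 + np.exp(-3 * x)),    # logistic score
]

def meta_features(x):
    # Stacking: base model outputs become meta-features for the meta-learner.
    return np.stack([m(x) for m in base_models], axis=1)

# Minimal meta-learner: least-squares weights over base outputs plus a bias.
Z = np.hstack([meta_features(X), np.ones((len(X), 1))])
w, *_ = np.linalg.lstsq(Z, y, rcond=None)

def stacked_predict(x):
    z = np.hstack([meta_features(np.atleast_1d(x)), np.ones((np.atleast_1d(x).size, 1))])
    return (z @ w > 0.5).astype(int)
```

In the federated setting, only the base model outputs (not raw data) would cross client boundaries, which is what makes stacking compatible with the privacy constraints described above.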
AAAI Conference 2024 Conference Paper
Focus stacking is a technique in computational photography that synthesizes a single all-in-focus image from images taken at different focal planes. Previous works struggle to produce a high-quality all-in-focus image that meets two goals: high fidelity to its source images and good visual quality without defects or abnormalities. This paper proposes a novel method based on analysis and modeling of the optical imaging process. It adopts a foreground segmentation and diffusion elimination architecture: foreground segmentation lets most areas of the all-in-focus image inherit information from the source images to achieve high fidelity, while diffusion elimination models the physical imaging process and specifically targets the transition region (TR) problem, a long-neglected issue that degrades the visual quality of synthesized images. Extensive experiments on a simulated dataset, an existing realistic dataset, and our proposed BetaFusion dataset show that our method can generate high-quality all-in-focus images by achieving both goals simultaneously; in particular, it successfully solves the TR problem and eliminates the visual degradation it causes.
JBHI Journal 2024 Journal Article
Depression is a prevalent mental disorder that affects a significant portion of the global population. Despite recent advancements in EEG-based depression recognition models rooted in machine learning and deep learning approaches, many lack comprehensive consideration of depression's pathogenesis, leading to limited neuroscientific interpretability. To address these issues, we propose a hemisphere asymmetry network (HEMAsNet) inspired by the brain for depression recognition from EEG signals. HEMAsNet employs a combination of multi-scale Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) blocks to extract temporal features from both hemispheres of the brain. Moreover, the model introduces a unique ‘Callosum-like’ block, inspired by the corpus callosum's pivotal role in facilitating inter-hemispheric information transfer within the brain. This block enhances information exchange between hemispheres, potentially improving depression recognition accuracy. To validate the performance of HEMAsNet, we first confirmed the asymmetric features of frontal lobe EEG in the MODMA dataset. Subsequently, our method achieved a depression recognition accuracy of 0.8067, indicating its effectiveness in increasing classification performance. Furthermore, we conducted a comprehensive investigation from spatial and frequency perspectives, demonstrating HEMAsNet's innovation in explaining model decisions. The advantages of HEMAsNet lie in its ability to achieve more accurate and interpretable recognition of depression through the simulation of physiological processes, integration of spatial information, and incorporation of the Callosum-like block.
IJCAI Conference 2024 Conference Paper
Recent studies have uncovered the potential of Large Language Models (LLMs) in addressing complex sequential decision-making tasks through the provision of high-level instructions. However, LLM-based agents lack specialization in tackling specific target problems, particularly in real-time dynamic environments. Additionally, deploying an LLM-based agent in practical scenarios can be both costly and time-consuming. On the other hand, reinforcement learning (RL) approaches train agents that specialize in the target task but often suffer from low sampling efficiency and high exploration costs. In this paper, we introduce a novel framework that addresses these challenges by training a smaller, specialized student RL agent using instructions from an LLM-based teacher agent. By incorporating the guidance from the teacher agent, the student agent can distill the prior knowledge of the LLM into its own model. Consequently, the student agent can be trained with significantly less data. Moreover, through further training with environment feedback, the student agent surpasses the capabilities of its teacher for completing the target task. We conducted experiments on challenging MiniGrid and Habitat environments, specifically designed for embodied AI research, to evaluate the effectiveness of our framework. The results clearly demonstrate that our approach achieves superior performance compared to strong baseline methods. Our code is available at https://github.com/ZJLAB-AMMI/LLM4Teach.
YNIMG Journal 2024 Journal Article
JBHI Journal 2024 Journal Article
Text content analysis for depression detection using machine learning techniques has become a prominent area of research. However, previous studies focused mainly on analyzing the textual content, neglecting the fundamental factors driving text generation. Consequently, existing models face the challenge of poor generalization to out-of-domain data as they struggle to capture the crucial features of depression. To address this, we propose a novel computational perspective of “stimulus-response patterns” that brings us closer to the essence of clinical diagnosis of depression. Adopting this computational perspective allows us to conceptually unify diverse datasets and generalize this perspective to common datasets in the field. We introduce the Stimulus-Response Patterns-aware Network (SRP-Net) as an exemplary approach within this computational perspective. To assess the performance of SRP-Net, we constructed a multi-stimulus dataset and conducted experimental evaluations, demonstrating its exceptional cross-stimulus generalizability. Furthermore, we demonstrated the promising performance of SRP-Net in real medical scenarios and conducted an interpretability analysis of the stimulus-response patterns. Our research investigates the critical role of stimulus-response patterns in enhancing the generalizability of text-based depression detection models, which can potentially facilitate data-driven depression detection to approach the diagnostic accuracy of psychiatrists.
AAAI Conference 2024 Conference Paper
Link prediction is a fundamental task of graph machine learning, and Graph Neural Network (GNN) based methods have become the mainstream approach due to their good performance. However, the typical practice learns node representations through neighborhood aggregation, lacking awareness of the structural relationships between target nodes. Recently, some methods have attempted to address this issue by node labeling tricks. However, they still rely on the node-centric neighborhood message passing of GNNs, which we believe involves two limitations in terms of information perception and transmission for link prediction. First, it cannot perceive long-range structural information due to the restricted receptive fields. Second, there may be information loss when a node-centric model is applied to a link-centric task. In addition, we empirically find that neighbor node features can introduce noise for link prediction. To address these issues, we propose a structural information enhanced link prediction framework, which removes the neighbor node features while letting the GNN fit neighborhood graph structures in a more focused manner. Furthermore, we introduce a Binary Structural Transformer (BST) to encode the structural relationships between target nodes, complementing the deficiency of GNNs. Our approach achieves remarkable results on multiple popular benchmarks, including ranking first on ogbl-ppa, ogbl-citation2, and Pubmed.
JBHI Journal 2023 Journal Article
Depression is a heterogeneous syndrome with certain individual differences among subjects. Exploring a feature selection method that can effectively mine the intra-group commonness and inter-group differences in depression recognition is therefore of great significance. This study proposed a new clustering-fusion feature selection method. A hierarchical clustering (HC) algorithm was used to capture the heterogeneity distribution of subjects. Averaging and similarity network fusion (SNF) algorithms were adopted to characterize the brain network atlas of different populations. Difference analysis was also utilized to obtain features with discriminant power. Experiments showed that, compared with traditional feature selection methods, the HCSNF method yielded the optimal classification results for depression recognition in both the sensor and source layers of electroencephalography (EEG) data. Especially in the beta band of EEG data at the sensor layer, the classification performance was improved by more than 6%. Moreover, the long-distance connections between the parietal-occipital lobe and other brain regions not only have high discriminative power but also significantly correlate with depressive symptoms, indicating the important role of these features in depression recognition. Therefore, this study may provide methodological guidance for the discovery of reproducible electrophysiological biomarkers and new insights into the common neuropathological mechanisms of heterogeneous depression diseases.
NeurIPS Conference 2023 Conference Paper
The applications of direct policy search in reinforcement learning and continuous control have received increasing attention. In this work, we present novel theoretical results on the complexity of derivative-free policy optimization on an important class of robust control tasks, namely structured $H_\infty$ synthesis with static output feedback. Optimal $H_\infty$ synthesis under structural constraints leads to a constrained nonconvex nonsmooth problem and is typically addressed using subgradient-based policy search techniques that are built upon the concept of the Goldstein subdifferential or other notions of enlarged subdifferentials. In this paper, we study the complexity of finding $(\delta, \epsilon)$-stationary points for such nonsmooth robust control design tasks using policy optimization methods that can only access the zeroth-order oracle (i.e., the $H_\infty$ norm of the closed-loop system). First, we study the exact oracle setting and identify the coerciveness of the cost function to prove high-probability feasibility/complexity bounds for derivative-free policy optimization on this problem. Next, we derive a sample complexity result for multi-input multi-output (MIMO) $H_\infty$-norm estimation. We combine this with our analysis to obtain the first sample complexity of model-free, trajectory-based, zeroth-order policy optimization on finding $(\delta, \epsilon)$-stationary points for structured $H_\infty$ control. Numerical results are also provided to demonstrate our theory.
JBHI Journal 2023 Journal Article
Numerous studies have shown that accurate analysis of neurological disorders contributes to the early diagnosis of brain disorders and provides a window for diagnosing psychiatric disorders associated with brain atrophy. The emergence of geometric deep learning approaches provides a new way to characterize geometric variations on brain networks. However, brain network data suffer from high heterogeneity and noise. Consequently, geometric deep learning methods struggle to identify discriminative and clinically meaningful representations from complex brain networks, resulting in poor diagnostic accuracy. Hence, the primary challenge in the diagnosis of brain diseases is to enhance the identification of discriminative features. To this end, this paper presents a dual-attention deep manifold harmonic discrimination (DA-DMHD) method for early diagnosis of neurodegenerative diseases. Here, a low-dimensional manifold projection is first learned to comprehensively exploit the geometric features of the brain network. Further, attention blocks with discrimination are proposed to learn a representation, which facilitates learning of group-dependent discriminant matrices to guide downstream analysis of group-specific references. Our proposed DA-DMHD model is evaluated on two independent datasets, ADNI and ADHD-200. Experimental results demonstrate that the model can tackle the hard-to-capture challenge of heterogeneous brain network topological differences and obtain excellent classification performance, in terms of both accuracy and robustness, compared with several existing state-of-the-art methods.
JBHI Journal 2023 Journal Article
Depression is a serious and common psychiatric disease characterized by emotional and cognitive dysfunction. In addition, the rates of clinical diagnosis and treatment for depression are low. Therefore, the accurate recognition of depression is important for its effective treatment. Electroencephalogram (EEG) signals, which can objectively reflect the inner states of human brains, are regarded as promising physiological tools that can enable effective and efficient clinical depression diagnosis and recognition. However, one of the challenges regarding EEG-based depression recognition involves sufficiently optimizing the spatial information derived from the multichannel space of EEG signals. Consequently, we propose an adaptive channel fusion method via improved focal loss (FL) functions for depression recognition based on EEG signals to effectively address this challenge. In this method, we propose two improved FL functions that can enhance the separability of hard examples by upweighting their losses as optimization objectives and can optimize the channel weights by a proposed adaptive channel fusion framework. The experimental results obtained on two EEG datasets show that the developed channel fusion method can achieve improved classification performance. The learned channel weights include the individual characteristics of each EEG epoch, which can effectively optimize the spatial information of each EEG epoch via the channel fusion method. In addition, the proposed method performs better than the state-of-the-art channel fusion methods.
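For reference, the standard binary focal loss that such improved FL functions build on up-weights hard examples via the modulating factor $(1-p_t)^\gamma$; this is the textbook form, not the paper's improved variants.

```python
import math

def focal_loss(p, y, gamma=2.0):
    # Binary focal loss. p is the predicted probability of class 1,
    # y is the true label in {0, 1}. The factor (1 - pt)^gamma shrinks the
    # loss of easy examples (pt near 1) and leaves hard examples dominant;
    # gamma = 0 recovers the ordinary cross-entropy loss.
    pt = p if y == 1 else 1.0 - p   # probability assigned to the true class
    return -((1.0 - pt) ** gamma) * math.log(pt)
```

With gamma = 2, a confident correct prediction (pt = 0.9) incurs roughly 1% of its cross-entropy loss, while an uncertain one (pt = 0.6) keeps 16%, which is the up-weighting of hard examples the abstract refers to.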
NeurIPS Conference 2023 Conference Paper
Neural networks are known to be susceptible to adversarial samples: small variations of natural examples crafted to deliberately mislead the models. While they can be easily generated using gradient-based techniques in digital and physical scenarios, they often differ greatly from the actual data distribution of natural images, resulting in a trade-off between strength and stealthiness. In this paper, we propose a novel framework dubbed Diffusion-Based Projected Gradient Descent (Diff-PGD) for generating realistic adversarial samples. By exploiting a gradient guided by a diffusion model, Diff-PGD ensures that adversarial samples remain close to the original data distribution while maintaining their effectiveness. Moreover, our framework can be easily customized for specific tasks such as digital attacks, physical-world attacks, and style-based attacks. Compared with existing methods for generating natural-style adversarial samples, our framework enables the separation of optimizing the adversarial loss from other surrogate losses (e.g., content/smoothness/style loss), making it more stable and controllable. Finally, we demonstrate that the samples generated using Diff-PGD have better transferability and anti-purification power than traditional gradient-based methods.
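As background, here is a minimal sketch of plain PGD, the gradient-based baseline that Diff-PGD extends with diffusion guidance; the quadratic "loss" below is a toy stand-in for a classifier loss with respect to the input.

```python
import numpy as np

def pgd_attack(x0, grad_fn, step=0.01, eps=0.1, iters=40):
    # Projected Gradient Descent: ascend the loss with signed gradient steps,
    # projecting back into the L-infinity ball of radius eps around x0.
    x = x0.copy()
    for _ in range(iters):
        x = x + step * np.sign(grad_fn(x))
        x = np.clip(x, x0 - eps, x0 + eps)   # project to ||x - x0||_inf <= eps
    return x

# Toy "model": loss L(x) = ||x||^2 with gradient 2x, standing in for a
# classifier loss evaluated at the input.
loss = lambda x: float(x @ x)
grad = lambda x: 2 * x

x0 = np.array([0.3, -0.2])
x_adv = pgd_attack(x0, grad)   # pushed to the corner of the eps-ball
```

The trade-off the abstract mentions is visible even here: the attack maximizes the loss but ignores what natural data looks like, which is exactly the gap a diffusion-guided gradient is meant to close.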
JBHI Journal 2023 Journal Article
Since brain network organization is essentially governed by the harmonic waves derived from the eigen-system of the underlying Laplacian matrix, discovering harmonic-based alterations provides a new window to understand the pathogenic mechanism of Alzheimer's disease (AD) in a unified reference space. However, current studies estimating the reference (common harmonic waves) from individual harmonic waves are often sensitive to outliers, because the reference is obtained by averaging heterogeneous individual brain networks. To address this challenge, we propose a novel manifold learning approach to identify a set of outlier-immunized common harmonic waves. The backbone of our framework is calculating the geometric median of all individual harmonic waves on the Stiefel manifold, instead of the Fréchet mean, thus improving the robustness of the learned common harmonic waves to outliers. A manifold optimization scheme with theoretically guaranteed convergence is tailored to solve our method. The experimental results on synthetic and real data demonstrate that the common harmonic waves learned by our approach are not only more robust to outliers than those of state-of-the-art methods, but also provide a putative imaging biomarker to predict the early stage of AD.
NeurIPS Conference 2023 Conference Paper
Recently, deep equilibrium models (DEQs) have drawn increasing attention from the machine learning community. However, DEQs are much less understood in terms of certified robustness than their explicit network counterparts. In this paper, we advance the understanding of certified robustness of DEQs via exploiting the connections between various Lipschitz network parameterizations for both explicit and implicit models. Importantly, we show that various popular Lipschitz network structures, including convex potential layers (CPL), SDP-based Lipschitz layers (SLL), almost orthogonal layers (AOL), Sandwich layers, and monotone DEQs (MonDEQ) can all be reparameterized as special cases of the Lipschitz-bounded equilibrium networks (LBEN) without changing the prescribed Lipschitz constant in the original network parameterization. A key feature of our reparameterization technique is that it preserves the Lipschitz prescription used in different structures. This opens the possibility of achieving improved certified robustness of DEQs via a combination of network reparameterization, structure-preserving regularization, and LBEN-based fine-tuning. We also support our theoretical understanding with new empirical results, which show that our proposed method improves the certified robust accuracy of DEQs on classification tasks. All codes and experiments are made available at \url{https://github.com/AaronHavens/ExploitingLipschitzDEQ}.
AAAI Conference 2023 Conference Paper
This paper focuses on contrastive learning for gait-based emotion recognition. The existing contrastive learning approaches are rarely suitable for learning skeleton-based gait representations, which suffer from limited gait diversity and inconsistent semantics. In this paper, we propose a Cross-coordinate contrastive learning framework utilizing Ambiguity samples for self-supervised Gait-based Emotion representation (CAGE). First, we propose ambiguity transform to push positive samples into ambiguous semantic space. By learning similarities between ambiguity samples and positive samples, our model can learn higher-level semantics of the gait sequences and maintain semantic diversity. Second, to encourage learning the semantic invariance, we uniquely propose cross-coordinate contrastive learning between the Cartesian coordinate and the Spherical coordinate, which brings rich supervisory signals to learn the intrinsic semantic consistency information. Exhaustive experiments show that CAGE improves existing self-supervised methods by 5%–10% accuracy, and it achieves comparable or even superior performance to supervised methods.
JBHI Journal 2022 Journal Article
Currently, depression has become a common mental disorder, especially among postgraduates. It is reported that postgraduates have a higher risk of depression than the general public, and they are more sensitive to contact with others. Thus, a non-contact and effective method for detecting people at risk of depression has become an urgent demand. In order to make the recognition of depression more reliable and convenient, we propose a multi-modal gait analysis-based depression detection method that combines the skeleton modality and the silhouette modality. Firstly, we propose a skeleton feature set to describe depression and train a Long Short-Term Memory (LSTM) model on the resulting feature sequences. Secondly, we generate Gait Energy Images (GEIs) as silhouette features from RGB videos, and design two Convolutional Neural Network (CNN) models with a new loss function to extract silhouette features from the front and side perspectives. Then, we construct a multi-modal fusion model that fuses the front- and side-view silhouettes at the feature level and the classification results of the different modalities at the decision level. The proposed multi-modal model achieved an accuracy of 85.45% on a dataset of 200 postgraduate students (including 86 with depression), 5.17% higher than the best single-mode model. The multi-modal method also shows improved generalization by reducing gender differences. Furthermore, we design a vivid 3D visualization of the gait skeletons, and our results imply that gait is a potent biometric for depression detection.
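A Gait Energy Image is the per-pixel average of a sequence of aligned binary silhouettes; pixels covered by the body in every frame stay bright, rarely covered ones fade. A minimal sketch (shapes and names are ours, not the paper's):

```python
import numpy as np

def gait_energy_image(silhouettes):
    """Average aligned binary silhouettes of shape (T, H, W) into one
    grey-level Gait Energy Image of shape (H, W)."""
    return np.asarray(silhouettes, dtype=float).mean(axis=0)

# tiny 2-frame, 2x2-pixel example
frames = [np.array([[1, 0], [1, 1]]), np.array([[1, 0], [0, 1]])]
gei = gait_energy_image(frames)   # pixels present in both frames stay at 1.0
```

In the paper, such GEIs (computed from full silhouette sequences) are the inputs to the two CNN branches.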
JBHI Journal 2022 Journal Article
The importance of detecting whether a person wears a face mask while speaking has increased tremendously since the outbreak of SARS-CoV-2 (COVID-19), as wearing a mask can help to reduce the spread of the virus and mitigate the public health crisis. Besides affecting human speech characteristics related to frequency, face masks cause temporal interferences in speech, altering its pace, rhythm, and pronunciation speed. In this regard, this paper presents two effective neural network models to detect surgical masks from audio. The proposed architectures are both based on Convolutional Neural Networks (CNNs), chosen as an optimal approach for the spatial processing of audio signals. One architecture applies a Long Short-Term Memory (LSTM) network to model the time dependencies; through an additional attention mechanism, this LSTM-based architecture extracts more salient temporal information. The other architecture (named ConvTx) retrieves the relative position of a sequence through the positional encoder of a transformer module. To assess to what extent the two architectures complement each other when modelling temporal dynamics, we also explore combinations of LSTMs and Transformers in three hybrid models. Finally, we investigate whether data augmentation techniques, such as using transitions between audio frames, and gender-dependent frameworks affect the performance of the proposed architectures. Our experimental results show that one of the hybrid models achieves the best performance, surpassing existing state-of-the-art results for the task at hand.
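The ConvTx branch relies on a transformer positional encoder. The abstract does not spell out the formula, so the following is the common sinusoidal encoding in the Vaswani style, offered only as a plausible sketch of that component:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding: PE[pos, 2i] = sin(pos / 10000^(2i/d)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d)); returns shape (seq_len, d_model)."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

pe = positional_encoding(seq_len=4, d_model=6)
```

Adding `pe` to the frame embeddings injects each frame's relative position before the transformer's attention layers.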
NeurIPS Conference 2022 Conference Paper
Direct policy search has been widely applied in modern reinforcement learning and continuous control. However, the theoretical properties of direct policy search on nonsmooth robust control synthesis have not been fully understood. The optimal $\mathcal{H}_\infty$ control framework aims at designing a policy to minimize the closed-loop $\mathcal{H}_\infty$ norm, and is arguably the most fundamental robust control paradigm. In this work, we show that direct policy search is guaranteed to find the global solution of the robust $\mathcal{H}_\infty$ state-feedback control design problem. Notice that policy search for optimal $\mathcal{H}_\infty$ control leads to a constrained nonconvex nonsmooth optimization problem, where the nonconvex feasible set consists of all the policies stabilizing the closed-loop dynamics. We show that for this nonsmooth optimization problem, all Clarke stationary points are global minima. Next, we identify the coerciveness of the closed-loop $\mathcal{H}_\infty$ objective function, and prove that all the sublevel sets of the resultant policy search problem are compact. Based on these properties, we show that Goldstein's subgradient method and its implementable variants are guaranteed to stay in the nonconvex feasible set and eventually find the global optimal solution of the $\mathcal{H}_\infty$ state-feedback synthesis problem. Our work builds a new connection between nonconvex nonsmooth optimization theory and robust control, leading to an interesting global convergence result for direct policy search on optimal $\mathcal{H}_\infty$ synthesis.
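Goldstein's method steps along the minimal-norm element of the delta-subdifferential, the convex hull of gradients within a delta-ball. The following toy 1-D sampling approximation on f(x) = |x| is entirely our simplification, not the paper's $\mathcal{H}_\infty$ algorithm; in 1-D the minimal-norm element of the hull of sampled gradients is easy to compute:

```python
import numpy as np

def goldstein_step(f_grad, x, delta=0.1, n_samples=50, seed=0):
    """One approximate Goldstein subgradient step in 1-D: sample gradients in
    a delta-ball around x; if 0 lies in their convex hull, declare
    (near-)stationarity, else take a normalized step of length delta."""
    rng = np.random.default_rng(seed)
    grads = np.array([f_grad(x + delta * (2 * rng.random() - 1))
                      for _ in range(n_samples)])
    lo, hi = grads.min(), grads.max()
    if lo <= 0.0 <= hi:                 # 0 in the hull: near-stationary point
        return x, True
    g = lo if abs(lo) < abs(hi) else hi # minimal-norm element of [lo, hi]
    return x - delta * g / abs(g), False

# minimize f(x) = |x| starting from x = 1.3
grad = lambda x: 1.0 if x >= 0 else -1.0
x, done = 1.3, False
for _ in range(30):
    x, done = goldstein_step(grad, x, delta=0.1)
    if done:
        break
```

The method halts once the sampled delta-ball straddles the kink at 0, mirroring how the paper's variants certify approximate stationarity for the nonsmooth $\mathcal{H}_\infty$ objective.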
ICML Conference 2022 Conference Paper
Heavy Ball (HB) is nowadays one of the most popular momentum methods in non-convex optimization. It has been widely observed that incorporating the Heavy Ball dynamic in gradient-based methods accelerates the training process of modern machine learning models. However, progress on establishing the theoretical foundations of this acceleration lags far behind its empirical success. Existing provable acceleration results cover only quadratic or close-to-quadratic functions, since current techniques for showing HB's acceleration are limited to the case when the Hessian is fixed. In this work, we develop some new techniques that help show acceleration beyond quadratics, which is achieved by analyzing how the change of the Hessian at two consecutive time points affects the convergence speed. Based on our technical results, we identify a class of Polyak-Lojasiewicz (PL) optimization problems for which provable acceleration can be achieved via HB. Moreover, our analysis demonstrates a benefit of adaptively setting the momentum parameter.
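The Heavy Ball update adds a momentum term beta * (x_t − x_{t−1}) to plain gradient descent. A minimal numerical sketch on a one-dimensional quadratic (the hyperparameters are illustrative, not those analyzed in the paper):

```python
import numpy as np

def heavy_ball(grad, x0, lr=0.1, beta=0.9, iters=500):
    """Heavy Ball iteration: x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1})."""
    x_prev, x = x0, x0
    for _ in range(iters):
        x_next = x - lr * grad(x) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x

# f(x) = 0.5 * x^2, so grad(x) = x; the unique minimizer is x = 0
x_star = heavy_ball(lambda x: x, x0=np.array([5.0]))
```

On a quadratic the iterate follows a fixed linear recurrence, which is exactly why a fixed Hessian makes the classical analysis tractable; the paper's contribution is handling the Hessian changing between consecutive steps.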
YNIMG Journal 2022 Journal Article
JBHI Journal 2021 Journal Article
Depression is a mental disorder with emotional and cognitive dysfunction. The main clinical characteristic of depression is a significant and persistent low mood. As reported, depression is a leading cause of disability worldwide. Moreover, the rate of recognition and treatment for depression is low. Therefore, the detection and treatment of depression are urgent. Multichannel electroencephalogram (EEG) signals, which reflect the working status of the human brain, can be used to develop an objective and promising tool for augmenting the clinical effects in the diagnosis and detection of depression. However, when a large number of EEG channels are acquired, the information redundancy and computational complexity of the EEG signals increase; thus, effective channel selection algorithms are required not only for machine learning feasibility, but also for practicality in clinical depression detection. Consequently, we propose an optimal channel selection method for EEG-based depression detection via kernel-target alignment (KTA) to effectively resolve the abovementioned issues. In this method, we consider a modified version of KTA that measures the similarity between the kernel matrix for channel selection and the target matrix as an objective function, and we optimize this objective function by a proposed optimal channel selection strategy. Experimental results on two EEG datasets show that channel selection can effectively increase the classification performance, and that even if we rely only on a small subset of channels, the results remain acceptable. The selected channels are in line with the expected latent cortical activity patterns in depression detection. Moreover, the experimental results demonstrate that our method outperforms state-of-the-art channel selection approaches.
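Kernel-target alignment measures the normalized Frobenius inner product between a kernel matrix and the ideal label kernel T = y yᵀ. The paper optimizes a modified variant for channel selection, but the basic quantity it builds on can be sketched as:

```python
import numpy as np

def kernel_target_alignment(K, y):
    """Alignment <K, T>_F / (||K||_F ||T||_F) between kernel matrix K and the
    ideal target T = y y^T for binary labels y in {-1, +1}."""
    T = np.outer(y, y)
    return np.sum(K * T) / (np.linalg.norm(K) * np.linalg.norm(T))

y = np.array([1, 1, -1, -1])
perfect = kernel_target_alignment(np.outer(y, y), y)   # K identical to the target
```

Channel subsets whose induced kernel aligns better with the label structure are preferred, which is the intuition behind using KTA as the selection objective.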
NeurIPS Conference 2021 Conference Paper
Direct policy search serves as one of the workhorses in modern reinforcement learning (RL), and its applications in continuous control tasks have recently attracted increasing attention. In this work, we investigate the convergence theory of policy gradient (PG) methods for learning the linear risk-sensitive and robust controller. In particular, we develop PG methods that can be implemented in a derivative-free fashion by sampling system trajectories, and establish both global convergence and sample complexity results for two fundamental settings in risk-sensitive and robust control: the finite-horizon linear exponential quadratic Gaussian problem, and the finite-horizon linear-quadratic disturbance attenuation problem. As a by-product, our results also provide the first sample complexity for the global convergence of PG methods on solving zero-sum linear-quadratic dynamic games, a nonconvex-nonconcave minimax optimization problem that serves as a baseline setting in multi-agent reinforcement learning (MARL) with continuous spaces. One feature of our algorithms is that during the learning phase, a certain level of robustness/risk-sensitivity of the controller is preserved, a property we term implicit regularization, which is an essential requirement in safety-critical control systems.
IJCAI Conference 2021 Conference Paper
Skeleton-based person re-identification (Re-ID) is an emerging open topic providing great value for safety-critical applications. Existing methods typically extract hand-crafted features or model skeleton dynamics from the trajectory of body joints, while they rarely explore valuable relation information contained in body structure or motion. To fully explore body relations, we construct graphs to model human skeletons from different levels, and for the first time propose a Multi-level Graph encoding approach with Structural-Collaborative Relation learning (MG-SCR) to encode discriminative graph features for person Re-ID. Specifically, considering that structurally-connected body components are highly correlated in a skeleton, we first propose a multi-head structural relation layer to learn different relations of neighbor body-component nodes in graphs, which helps aggregate key correlative features for effective node representations. Second, inspired by the fact that body-component collaboration in walking usually carries recognizable patterns, we propose a cross-level collaborative relation layer to infer collaboration between different level components, so as to capture more discriminative skeleton graph features. Finally, to enhance graph dynamics encoding, we propose a novel self-supervised sparse sequential prediction task for model pre-training, which facilitates encoding high-level graph semantics for person Re-ID. MG-SCR outperforms state-of-the-art skeleton-based methods, and it achieves superior performance to many multi-modal methods that utilize extra RGB or depth features. Our codes are available at https://github.com/Kali-Hac/MG-SCR.
NeurIPS Conference 2020 Conference Paper
Reinforcement learning (RL) algorithms can fail to generalize due to the gap between the simulation and the real world. One standard remedy is to use robust adversarial RL (RARL) that accounts for this gap during the policy training, by modeling the gap as an adversary against the training agent. In this work, we reexamine the effectiveness of RARL under a fundamental robust control setting: the linear quadratic (LQ) case. We first observe that the popular RARL scheme that greedily alternates agents’ updates can easily destabilize the system. Motivated by this, we propose several other policy-based RARL algorithms whose convergence behaviors are then studied both empirically and theoretically. We find: i) the conventional RARL framework (Pinto et al., 2017) can learn a destabilizing policy if the initial policy does not enjoy the robust stability property against the adversary; and ii) with robustly stabilizing initializations, our proposed double-loop RARL algorithm provably converges to the global optimal cost while maintaining robust stability on-the-fly. We also examine the stability and convergence issues of other variants of policy-based RARL algorithms, and then discuss several ways to learn robustly stabilizing initializations. From a robust control perspective, we aim to provide some new and critical angles about RARL, by identifying and addressing the stability issues in this fundamental LQ setting in continuous control. Our results make an initial attempt toward better theoretical understandings of policy-based RARL, the core approach in Pinto et al., 2017.
IJCAI Conference 2020 Conference Paper
Gait-based person re-identification (Re-ID) is valuable for safety-critical applications, and using only 3D skeleton data to extract discriminative gait features for person Re-ID is an emerging open topic. Existing methods either adopt hand-crafted features or learn gait features by traditional supervised learning paradigms. Unlike previous methods, we for the first time propose a generic gait encoding approach that can utilize unlabeled skeleton data to learn gait representations in a self-supervised manner. Specifically, we first propose to introduce self-supervision by learning to reconstruct input skeleton sequences in reverse order, which facilitates learning richer high-level semantics and better gait representations. Second, inspired by the fact that motion's continuity endows temporally adjacent skeletons with higher correlations (“locality”), we propose a locality-aware attention mechanism that encourages learning larger attention weights for temporally adjacent skeletons when reconstructing the current skeleton, so as to learn locality when encoding gait. Finally, we propose Attention-based Gait Encodings (AGEs), which are built using context vectors learned by locality-aware attention, as final gait representations. AGEs are directly utilized to realize effective person Re-ID. Our approach typically improves existing skeleton-based methods by 10-20% Rank-1 accuracy, and it achieves comparable or even superior performance to multi-modal methods with extra RGB or depth information.
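In the paper the locality preference is learned through the attention objective; as a fixed-bias caricature of the same idea (entirely our simplification, not the paper's mechanism), one can bias softmax attention toward time steps near the skeleton being reconstructed:

```python
import numpy as np

def locality_biased_attention(scores, center, sigma=2.0):
    """Softmax over attention scores with an additive Gaussian penalty that
    favours time steps close to `center` (the frame being reconstructed)."""
    t = np.arange(len(scores))
    biased = scores - (t - center) ** 2 / (2 * sigma ** 2)
    e = np.exp(biased - biased.max())   # numerically stable softmax
    return e / e.sum()

# with uniform raw scores, weights peak at the centre and decay symmetrically
w = locality_biased_attention(np.zeros(7), center=3)
```

The resulting weights concentrate on temporally adjacent frames, which is the behaviour the locality-aware attention mechanism is trained to exhibit.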
YNICL Journal 2020 Journal Article
NeurIPS Conference 2019 Conference Paper
In this paper, we provide a unified analysis of temporal difference learning algorithms with linear function approximators by exploiting their connections to Markov jump linear systems (MJLS). We tailor the MJLS theory developed in the control community to characterize the exact behaviors of the first and second order moments of a large family of temporal difference learning algorithms. For both the IID and Markov noise cases, we show that the evolution of some augmented versions of the mean and covariance matrix of the TD estimation error exactly follows the trajectory of a deterministic linear time-invariant (LTI) dynamical system. Applying the well-known LTI system theory, we obtain closed-form expressions for the mean and covariance matrix of the TD estimation error at any time step. We provide a tight matrix spectral radius condition to guarantee the convergence of the covariance matrix of the TD estimation error, and perform a perturbation analysis to characterize the dependence of the TD behaviors on learning rate. For the IID case, we provide an exact formula characterizing how the mean and covariance matrix of the TD estimation error converge to the steady state values at a linear rate. For the Markov case, we use our formulas to explain how the behaviors of TD learning algorithms are affected by learning rate and the underlying Markov chain. For both cases, upper and lower bounds for the mean square TD error are provided. The mean square TD error is shown to converge linearly to an exact limit.
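The object being analyzed is the standard temporal difference recursion with linear function approximation; a one-line sketch of that update (the MJLS analysis itself is not reproduced here, and the toy problem is ours):

```python
import numpy as np

def td0_update(theta, phi_s, phi_s_next, reward, alpha=0.1, gamma=0.9):
    """TD(0) with a linear value approximator V(s) = theta @ phi(s):
    theta <- theta + alpha * (r + gamma * V(s') - V(s)) * phi(s)."""
    td_error = reward + gamma * theta @ phi_s_next - theta @ phi_s
    return theta + alpha * td_error * phi_s

# toy chain: one state with feature [1.0], reward 1, then termination (phi = [0.0]);
# the fixed point is V(s) = 1
theta = np.zeros(1)
for _ in range(300):
    theta = td0_update(theta, np.array([1.0]), np.array([0.0]), reward=1.0)
```

Stacking the mean and covariance of the estimation error of such recursions is what yields the deterministic LTI dynamics exploited in the paper.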
AIIM Journal 2019 Journal Article
JBHI Journal 2019 Journal Article
Currently, depression has become a common mental disorder and one of the main causes of disability worldwide. Because depressive symptoms vary across individuals, designing comprehensive and effective depression detection methods has become an urgent demand. This study explored the physiological and behavioral perspectives simultaneously, fusing pervasive electroencephalography (EEG) and vocal signals to make the detection of depression more objective, effective, and convenient. After extracting several effective features from these two types of signals, we trained six representative classifiers on each modality, then captured the diversity and correlation of the decisions from the different classifiers using a co-decision tensor, and combined these decisions into the final classification result with a multi-agent strategy. Experimental results on 170 subjects (81 depressed patients and 89 normal controls) showed that the proposed multi-modal depression detection strategy is superior to the single-modal classifiers and other typical late fusion strategies in accuracy, F1-score, and sensitivity. This work indicates that late fusion of pervasive physiological and behavioral signals is promising for depression detection, and that the multi-agent strategy can effectively exploit the diversity and correlation of different classifiers to reach a better final decision.
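The paper's co-decision tensor and multi-agent combination are more elaborate, but the underlying decision-level fusion idea can be sketched as a weighted average of per-classifier class probabilities (names and weights here are ours):

```python
import numpy as np

def late_fusion(probabilities, weights=None):
    """Fuse rows of per-classifier class probabilities, shape
    (n_classifiers, n_classes), by a normalized weighted average;
    return the fused distribution and the argmax class."""
    P = np.asarray(probabilities, dtype=float)
    w = np.ones(len(P)) if weights is None else np.asarray(weights, dtype=float)
    w = w / w.sum()                 # normalize classifier weights
    fused = w @ P
    return fused, int(np.argmax(fused))

# three classifiers voting over two classes (e.g. control vs. depressed)
fused, label = late_fusion([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
```

Weighting classifiers by validation performance, or by agreement statistics as the co-decision tensor does, refines this plain average.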
TIST Journal 2017 Journal Article
In many research and application areas, such as information retrieval and machine learning, we often have to deal with a probability distribution that mixes one distribution relevant to our task at hand with another that is irrelevant and that we want to remove. Thus, separating the irrelevant distribution from the mixture distribution is an essential problem. This article focuses on an application in information retrieval, where relevance feedback is a widely used technique to build a refined query model based on a set of feedback documents. However, in practice, the relevance feedback set, even when provided by users explicitly or implicitly, is often a mixture of relevant and irrelevant documents. Consequently, the resultant query model (typically a term distribution) is often a mixture rather than a true relevance term distribution, leading to a negative impact on the retrieval performance. To tackle this problem, we recently proposed a Distribution Separation Method (DSM), which aims to approximate the true relevance distribution by separating a seed irrelevance distribution from the mixture one. While it achieved a promising performance in an empirical evaluation with simulated explicit irrelevance feedback data, it has not been deployed in the scenario where the irrelevance feedback data must be obtained automatically. In this article, we propose a substantial extension of the basic DSM from two perspectives: developing a further regularization framework and deploying DSM in the automatic irrelevance feedback scenario. Specifically, in order to prevent the output distribution of DSM from drifting away from the true relevance distribution when the quality of the seed irrelevant distribution (the input to DSM) is not guaranteed, we propose a DSM regularization framework to constrain the estimation of the relevance distribution.
This regularization framework includes three algorithms, each corresponding to a regularization strategy incorporated in the objective function of DSM. In addition, we exploit DSM in automatic (i.e., pseudo) irrelevance feedback, by automatically detecting the seed irrelevant documents via three different document reranking methods. We have carried out extensive experiments based on various TREC datasets, in order to systematically evaluate the proposed methods. The experimental results demonstrate the effectiveness of our proposed approaches in comparison with various strong baselines.
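At its core, the separation problem assumes the feedback model is a mixture, mixture = lam * relevant + (1 - lam) * irrelevant, and solves for the relevant part. A minimal sketch of that inversion (the clipping and renormalization are our pragmatic additions for noisy estimates, not DSM's actual regularizers):

```python
import numpy as np

def separate_distribution(mixture, irrelevant, lam):
    """Recover the relevant term distribution from
    mixture = lam * relevant + (1 - lam) * irrelevant, then renormalize."""
    est = (np.asarray(mixture) - (1 - lam) * np.asarray(irrelevant)) / lam
    est = np.clip(est, 0.0, None)   # clip negatives caused by estimation noise
    return est / est.sum()

# exact recovery when the mixture model holds
relevant = np.array([0.7, 0.2, 0.1])
irrelevant = np.array([0.1, 0.1, 0.8])
mixture = 0.6 * relevant + 0.4 * irrelevant
recovered = separate_distribution(mixture, irrelevant, lam=0.6)
```

When the seed irrelevant distribution is itself estimated (the automatic feedback scenario), this inversion can drift, which is exactly what the proposed regularization framework guards against.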
IS Journal 2015 Journal Article
The authors summarize the main aspects of brain informatics-based big data interacting in the social-cyber-physical space of the Wisdom Web of Things (W2T). In particular, they focus on how to realize human-level collective intelligence as a big data sharing mind: a harmonized collectivity of consciousness on the W2T that uses brain-inspired intelligent technologies to provide wisdom services. Finally, the authors propose five guiding principles for a deeper understanding of the nature of the vigorous interaction and interdependence of brain, body, and environment.
JBHI Journal 2013 Journal Article
A new model to remove ocular artifacts (OA) from electroencephalograms (EEGs) is presented. The model is based on discrete wavelet transformation (DWT) and adaptive noise cancellation (ANC). Using simulated and measured data, the accuracy of the model is compared with the accuracy of other existing methods based on stationary wavelet transforms and our previous work based on wavelet packet transform and independent component analysis. A particularly novel feature of the new model is the use of DWTs to construct an OA reference signal, using the three lowest frequency wavelet coefficients of the EEGs. The results show that the new model recovers true EEG signals more accurately and also tracks them better. Because the new model requires only single-channel sources, it is well suited for use in portable environments, where constraints on acceptable wearable sensor attachments usually dictate single-channel devices. The model is also applied and evaluated against data recorded within the EU FP7 project Online Predictive Tools for Intervention in Mental Illness (OPTIMI). The results show that the proposed model is effective in removing OAs and meets the requirements of portable systems used for patient monitoring as typified by the OPTIMI project.
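The ANC stage can be sketched with a classic LMS adaptive filter that learns to predict the artifact in the primary channel from a reference signal. This is a generic LMS sketch with synthetic signals of our making; in the paper, the reference comes from the DWT's low-frequency coefficients rather than a separately recorded channel:

```python
import numpy as np

def lms_noise_cancel(primary, reference, mu=0.005, order=4):
    """LMS adaptive noise cancellation: an FIR filter on `reference` learns to
    predict the artifact in `primary`; the error signal e is the cleaned output."""
    w = np.zeros(order)
    cleaned = np.zeros_like(primary)
    for n in range(order - 1, len(primary)):
        x = reference[n - order + 1:n + 1][::-1]   # most recent samples first
        e = primary[n] - w @ x                     # cleaned sample (error signal)
        w += 2 * mu * e * x                        # LMS weight update
        cleaned[n] = e
    return cleaned

rng = np.random.default_rng(1)
t = np.arange(3000)
clean = np.sin(2 * np.pi * t / 100)     # surrogate for the true EEG signal
reference = rng.standard_normal(3000)   # artifact reference signal
primary = clean + 0.8 * reference       # contaminated recording
cleaned = lms_noise_cancel(primary, reference)
```

Because the filter only needs the contaminated channel plus a derived reference, this style of cancellation fits the single-channel portable setting the paper targets.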
IS Journal 2012 Journal Article
To overcome public transportation problems during the 16th Asian Games held in Guangzhou, China, PtMS (Parallel Transportation Management System), a novel application of Intelligent Transportation Systems, was introduced for effective and convenient traffic management. Results show that PtMS successfully enhanced public traffic management, raising it from experience-based policy formulation plus manual implementation to scientific computing-based policy generation plus implementation with intelligent systems.
IS Journal 2011 Journal Article
Technical advances in the neuroelectric recordings and in the computational tools for the analysis of the brain activity and connectivity make it now possible to follow and to quantify, in real time, the interactive brain activity in a group of subjects engaged in social interactions. The degree of interaction between persons can then be assessed by "reading" their neuroelectric activities. Imaging the social brain can thus open a new area of study in neuroscience.