EAAI Journal 2026 Journal Article
Expert consensus-driven spatial-temporal graph neural network for enhanced diagnosis of chronic fetal distress
- Yefei Zhang
- Yanjun Deng
- Yi Yuan
- Bingxin Ruan
- Zhidong Zhao
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
EAAI Journal 2026 Journal Article
AAAI Conference 2026 Conference Paper
While Multimodal Large Language Models (MLLMs) show immense promise for achieving truly human-like interactions, progress is hindered by the lack of fine-grained evaluation frameworks for human-centered scenarios, encompassing both the understanding of complex human intentions and the provision of empathetic, context-aware responses. Here we introduce HumanSense, a comprehensive benchmark designed to evaluate the human-centered perception and interaction capabilities of MLLMs, with a particular focus on deep understanding of extended multimodal contexts and the formulation of rational feedback. Our evaluation reveals that leading MLLMs still have considerable room for improvement, particularly for advanced interaction-oriented tasks. Supplementing visual input with audio and text information yields substantial improvements, and Omni-modal models show advantages on these tasks. Furthermore, grounded in the observation that appropriate feedback stems from a contextual analysis of the interlocutor's needs and emotions, we posit that reasoning ability serves as the key to unlocking it. We devise a multi-stage, modality-progressive reinforcement learning approach, resulting in HumanSense-Omni-Reasoning, which substantially enhances performance on higher-level understanding and interactive tasks. Additionally, we observe that successful reasoning processes appear to exhibit consistent thought patterns. By designing corresponding prompts, we also enhance the performance of non-reasoning models in a training-free manner.
IJCAI Conference 2025 Conference Paper
Missing values in multivariate time series data present significant challenges to effective analysis. Existing methods for multivariate time series analysis either ignore missing data, sacrificing performance, or follow the impute-then-analyze paradigm, which suffers from redundant training and error accumulation, leading to biased results and suboptimal performance. In this paper, we propose INTER, a novel end-to-end framework for incomplete multivariate time series analysis, which bypasses imputation by leveraging pre-trained language models to learn the distribution of incomplete time series data. INTER incorporates two novel components: the missing-rate-aware time series patch-dropping (MPD) strategy and the missing-aware Transformer block, both of which we propose to enhance model generalization, robustness, and the ability to capture underlying patterns in the observed incomplete time series. Moreover, we theoretically prove that the MPD strategy exhibits lower sample variance for time series with the same dropout rate compared to other dropping strategies. Extensive experiments on 11 public real-world time series datasets demonstrate that INTER improves accuracy by over 20% compared to state-of-the-art methods, while maintaining competitive computational efficiency.
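The missing-rate-aware patch-dropping idea can be sketched in a few lines. This is an illustrative greedy variant, not the paper's implementation; the function name and the greedy highest-missing-rate-first selection rule are assumptions:

```python
import numpy as np

def mpd_drop(mask, drop_frac):
    """Illustrative greedy variant of missing-rate-aware patch dropping (MPD).

    mask: (num_patches, patch_len) bool array, True = observed value.
    Patches with the highest missing rate are dropped first, so the
    discarded patches carry the least observed information.
    """
    missing_rate = 1.0 - mask.mean(axis=1)             # per-patch missing rate
    k = int(round(drop_frac * len(mask)))              # how many patches to drop
    drop_idx = np.argsort(-missing_rate, kind="stable")[:k]
    keep = np.ones(len(mask), dtype=bool)
    keep[drop_idx] = False
    return keep

# Three patches: fully observed, 75% missing, 50% missing.
mask = np.array([[1, 1, 1, 1],
                 [1, 0, 0, 0],
                 [1, 1, 0, 0]], dtype=bool)
keep = mpd_drop(mask, drop_frac=1 / 3)    # drops the mostly-missing patch
```

Under this greedy rule, the dropped patch is the one contributing the fewest observed values, which matches the intuition behind the lower-sample-variance claim above.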
YNIMG Journal 2025 Journal Article
Synaptic plasticity plays a crucial role in the extinction of fearful memories. Low-intensity transcranial ultrasound stimulation (TUS) can modulate synaptic plasticity and promote the extinction of fear memories. However, the mechanism by which TUS promotes the extinction of fear memory remains unclear. This study aimed to explore whether and how synaptic plasticity under TUS is involved in modulating fear memory, and the role of the brain-derived neurotrophic factor (BDNF)-tropomyosin-related kinase B (TrkB) signaling pathway in this process. We used behavioral tests and two-photon fluorescence imaging to investigate the modulatory effects of TUS on fear memory, and examined the formation and elimination of dendritic spines and the calcium activity of pyramidal neurons in the prefrontal cortex of mice in vivo. We found that TUS of the prefrontal cortex can promote fear memory extinction in mice while promoting dendritic spine formation, reducing dendritic spine elimination, increasing pyramidal neuron activity, and enhancing the expression of BDNF and its receptor TrkB. Conversely, inhibiting the BDNF-TrkB signaling pathway weakened these effects of ultrasound stimulation. Our study demonstrated that TUS can promote the extinction of fear memories, indicating that TUS has potential for use in the clinical treatment of patients with fear memory.
IJCAI Conference 2025 Conference Paper
Multivariate time series (MTS) data in real-world scenarios are often incomplete, which hinders effective data analysis. Therefore, MTS imputation has been widely studied to facilitate various MTS tasks. Existing imputation methods primarily initialize missing values with zeros in order to perform incomplete MTS encoding, which impedes the model's capacity to precisely discern the missing distribution. Moreover, these methods often overlook the global similarity across time series and are limited to local information within each sample. To this end, we propose a novel multivariate time series imputation network, named MMNet. MMNet introduces a Missing-Aware Embedding (MAE) approach to adaptively represent incomplete MTS, allowing the model to better distinguish between missing and observed data. Furthermore, we design a Memory-Enhanced Encoder (MEE) aimed at modeling prior knowledge through a memory mechanism, enabling better utilization of the global similarity within the time series. Building upon this, MMNet incorporates a Multi-Scale Mixing architecture (MSM) that leverages information from multiple scales to enhance the final imputation. Extensive experiments on four public real-world datasets demonstrate that MMNet yields a more than 25% gain in performance compared with the state-of-the-art methods.
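The missing-aware embedding idea (replacing zero fills with a dedicated learned missing token, so the encoder can tell "missing" apart from "observed zero") can be sketched as follows. This is a minimal sketch: the shapes, names, and the linear value embedding are assumptions, not the paper's architecture:

```python
import numpy as np

def missing_aware_embed(x, mask, w_val, missing_token):
    """Minimal sketch of a missing-aware embedding.

    x:             (T,) values, arbitrary at missing steps
    mask:          (T,) bool, True = observed
    w_val:         (d,) value-embedding weights (a linear embedding, assumed)
    missing_token: (d,) learned token shared by all missing steps
    """
    value_emb = np.nan_to_num(x)[:, None] * w_val[None, :]          # (T, d)
    return np.where(mask[:, None], value_emb, missing_token[None, :])

x = np.array([1.0, np.nan, 2.0])
mask = np.array([True, False, True])
emb = missing_aware_embed(x, mask, w_val=np.array([1.0, 0.5]),
                          missing_token=np.array([9.0, 9.0]))
# The missing step receives the token [9, 9] instead of a zero embedding.
```

Because the missing token is distinct from the embedding of an observed zero, downstream layers can condition on missingness itself rather than conflating it with a zero reading.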
YNIMG Journal 2024 Journal Article
Working memory in attention deficit hyperactivity disorder (ADHD) is closely related to cortical functional network connectivity (CFNC), such as abnormal connections between the frontal, temporal, occipital cortices and with other brain regions. Low-intensity transcranial ultrasound stimulation (TUS) has the advantages of non-invasiveness, high spatial resolution, and high penetration depth and can improve ADHD memory behavior. However, how it modulates CFNC in ADHD and the CFNC mechanism that improves working memory behavior in ADHD remain unclear. In this study, we observed working memory impairment in ADHD rats, establishing a corresponding relationship between changes in CFNCs and the behavioral state during the working memory task. Specifically, we noted abnormalities in the information transmission and processing capabilities of CFNC in ADHD rats while performing working memory tasks. These abnormalities manifested in the network integration ability of specific areas, as well as the information flow and functional differentiation of CFNC. Furthermore, our findings indicate that TUS effectively enhances the working memory ability of ADHD rats by modulating information transmission, processing, and integration capabilities, along with adjusting the information flow and functional differentiation of CFNC. Additionally, we explain the CFNC mechanism through which TUS improves working memory in ADHD. In summary, these findings suggest that CFNCs are important in working memory behaviors in ADHD.
YNIMG Journal 2024 Journal Article
Memory is closely associated with neuronal activity and dendritic spine formation. Low-intensity transcranial ultrasound stimulation (TUS) improves the memory of individuals with vascular dementia (VD). However, it is unclear whether neuronal activity and dendritic spine formation under ultrasound stimulation are involved in memory improvement in VD. In this study, we found that seven days of TUS improved memory in a VD model while simultaneously increasing pyramidal neuron activity, promoting dendritic spine formation, and reducing dendritic spine elimination. These effects lasted for 7 days but had disappeared by 14 days after TUS. Neuronal activity and dendritic spine formation strongly corresponded to improvements in memory behavior over time. In addition, we found that memory, neuronal activity, and dendritic spines in VD mice could not be restored again by a further 7 days of TUS delivered 28 days later. Collectively, these findings suggest that TUS increases neuronal activity and promotes dendritic spine formation, and is thus important for improving memory in patients with VD.
ICML Conference 2023 Conference Paper
Text-to-audio (TTA) systems have recently gained attention for their ability to synthesize general audio based on text descriptions. However, previous studies in TTA have limited generation quality with high computational costs. In this study, we propose AudioLDM, a TTA system that is built on a latent space to learn continuous audio representations from contrastive language-audio pretraining (CLAP) embeddings. The pretrained CLAP models enable us to train LDMs with audio embeddings while providing text embeddings as the condition during sampling. By learning the latent representations of audio signals without modelling the cross-modal relationship, AudioLDM improves both generation quality and computational efficiency. Trained on AudioCaps with a single GPU, AudioLDM achieves state-of-the-art TTA performance compared to other open-sourced systems, measured by both objective and subjective metrics. AudioLDM is also the first TTA system that enables various text-guided audio manipulations (e.g., style transfer) in a zero-shot fashion. Our implementation and demos are available at https://audioldm.github.io.
YNIMG Journal 2023 Journal Article
Previous studies have demonstrated that transcranial ultrasound stimulation (TUS) not only modulates cerebral hemodynamics, neural activity, and neurovascular coupling characteristics in resting samples but also exerts a significant inhibitory effect on neural activity in task samples. However, the effect of TUS on cerebral blood oxygenation and neurovascular coupling in task samples remains to be elucidated. To answer this question, we first used forepaw electrical stimulation in mice to elicit the corresponding cortical excitation, then stimulated this cortical region using different modes of TUS while simultaneously recording the local field potential via electrophysiological acquisition and hemodynamics via optical intrinsic signal imaging. The results indicate that, for mice in a peripheral sensory stimulation state, TUS with a duty cycle of 50% can (1) enhance the amplitude of the cerebral blood oxygenation signal, (2) reduce the time-frequency characteristics of the evoked potential, (3) reduce the strength of neurovascular coupling in the time domain, (4) enhance the strength of neurovascular coupling in the frequency domain, and (5) reduce the time-frequency cross-coupling of the neurovasculature. These results indicate that, under specific parameters, TUS can modulate cerebral blood oxygenation and neurovascular coupling in mice in a peripheral sensory stimulation state. This study opens up a new area of investigation for the potential applicability of TUS in brain diseases related to cerebral blood oxygenation and neurovascular coupling.
YNIMG Journal 2023 Journal Article
AAAI Conference 2023 Conference Paper
The creation of a parameterized stylized character involves careful selection of numerous parameters, known as "avatar vectors", that can be interpreted by the avatar engine. Existing unsupervised avatar vector estimation methods that auto-create avatars for users, however, often fail to work because of the domain gap between realistic faces and stylized avatar images. To this end, we propose SwiftAvatar, a novel avatar auto-creation framework that is evidently superior to previous works. SwiftAvatar introduces dual-domain generators to create pairs of realistic faces and avatar images using shared latent codes. The latent codes can then be bridged with the avatar vectors as pairs by performing GAN inversion on the avatar images rendered from the engine using avatar vectors. In this way, we are able to synthesize as much high-quality paired data as possible, consisting of avatar vectors and their corresponding realistic faces. We also propose semantic augmentation to improve the diversity of synthesis. Finally, a lightweight avatar vector estimator is trained on the synthetic pairs to implement efficient auto-creation. Our experiments demonstrate the effectiveness and efficiency of SwiftAvatar on two different avatar engines. The superiority and advantageous flexibility of SwiftAvatar are also verified in both subjective and objective evaluations.
YNIMG Journal 2023 Journal Article
The present study aimed to investigate the effectiveness of closed-loop transcranial ultrasound stimulation (closed-loop TUS) as a non-invasive, high temporal-spatial resolution method for modulating brain function to enhance memory. For this purpose, we applied closed-loop TUS to the CA1 region of the rat hippocampus for 7 consecutive days at different phases of theta cycles, then evaluated memory performance through behavioral testing and recorded the neural activity. Closed-loop TUS applied at the peak phase of theta cycles significantly improved memory performance in rats, as evidenced by behavioral testing. Furthermore, closed-loop TUS modified the power and cross-frequency coupling strength of local field potentials (LFPs) during the memory task and modulated neuronal activity patterns and synaptic transmission, depending on the phase of stimulation relative to the theta rhythm. These results demonstrate that closed-loop TUS modulates neural activity and memory performance in a phase-dependent manner: its efficacy is contingent on the timing of stimulation with respect to the theta phase. Moreover, the improvement in memory performance after closed-loop TUS was found to be persistent.
AAAI Conference 2022 Conference Paper
Motion completion, as a challenging and fundamental problem, is of great significance in film and game applications. For different motion completion application scenarios (inbetweening, in-filling, and blending), most previous methods deal with the completion problems with case-by-case methodology designs. In this work, we propose a simple but effective method to solve multiple motion completion problems under a unified framework and achieve a new state-of-the-art accuracy on LaFAN1 (+17% better than the previous SoTA) under multiple evaluation settings. Inspired by the recent great success of self-attention-based transformer models, we consider the completion as a sequence-to-sequence prediction problem. Our method consists of three modules: a standard transformer encoder with self-attention that learns long-range dependencies of input motions, a trainable mixture embedding module that models temporal information and encodes different key-frame combinations in a unified form, and a new motion perceptual loss for better capturing high-frequency movements. Our method can predict multiple missing frames within a single forward propagation in real-time without post-processing. We also introduce a novel large-scale dance movement dataset for exploring the scaling capability of our method and its effectiveness in complex motion applications.
IJCAI Conference 2022 Conference Paper
In this paper, we present a novel double diffusion based neural radiance field, dubbed DD-NeRF, to reconstruct human body geometry and render the human body appearance in novel views from a sparse set of images. We first propose a double diffusion mechanism to achieve expressive representations of input images by fully exploiting human body priors and image appearance details at two levels. At the coarse level, we first model the coarse human body poses and shapes via an unclothed 3D deformable vertex model as guidance. At the fine level, we present a multi-view sampling network to capture subtle geometric deformations and image detailed appearances, such as clothing and hair, from multiple input views. Considering the sparsity of the two-level features, we diffuse them into feature volumes in the canonical space to construct neural radiance fields. Then, we present a signed distance function (SDF) regression network to construct body surfaces from the diffused features. Thanks to our double diffused representations, our method can even synthesize novel views of unseen subjects. Experiments on various datasets demonstrate that our approach outperforms the state-of-the-art in both geometric reconstruction and novel view synthesis.
JBHI Journal 2022 Journal Article
The novel coronavirus disease (COVID-19) is a pandemic that has caused 4 million deaths and more than 200 million infections worldwide (as of August 4, 2021). Rapid and accurate diagnosis of COVID-19 infection is critical to controlling the spread of the epidemic. In order to quickly and efficiently detect COVID-19 and reduce its threat to human survival, we first propose a detection framework based on reinforcement learning for COVID-19 diagnosis, which constructs a mixed loss function that integrates the advantages of multiple loss functions. We use the accuracy on the validation set as the reward value and obtain the initial model for the next epoch by selecting the model with the maximum reward in each epoch. We also propose a prediction framework that integrates multiple detection frameworks using parameter sharing to predict the progression of patients' disease without additional training. We also constructed a higher-quality version of the CT image dataset containing 247 cases screened by professional physicians and obtained better results on this dataset. Meanwhile, we used two other COVID-19 datasets for external verification and still achieved high accuracy without additional training. The experimental results show that our classification accuracy reaches 98.31%, and the precision, sensitivity, specificity, and AUC (area under the curve) are 98.82%, 97.99%, 98.67%, and 0.989, respectively. The accuracy on external verification reaches 93.34% and 91.05%, and the accuracy of our prediction framework is 91.54%. Extensive experiments demonstrate that our proposed method is effective and robust for COVID-19 detection and prediction.
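The epoch-level selection scheme described above (validation accuracy as the reward, with the best-reward model seeding the next epoch) can be sketched generically. The `train_step`, `evaluate`, and candidates-per-epoch structure below are assumptions for illustration, not the paper's code:

```python
def search_epochs(train_step, evaluate, init_state, num_epochs, num_trials, rng):
    """Reward-guided training loop sketch: per epoch, try several candidate
    updates, score each with the reward function (here standing in for
    validation accuracy), and carry the max-reward candidate forward."""
    state = init_state
    for _ in range(num_epochs):
        candidates = [train_step(state, rng) for _ in range(num_trials)]
        state = max(candidates, key=evaluate)    # keep the max-reward model
    return state

# Toy check with a deterministic "rng" (an iterator of update deltas).
deltas = iter([1, -2, 3, 0, -1, 5])
final = search_epochs(train_step=lambda s, r: s + next(r),
                      evaluate=lambda s: s,      # stand-in reward
                      init_state=0, num_epochs=2, num_trials=3, rng=deltas)
# epoch 1 candidates: 1, -2, 3 -> keep 3; epoch 2 candidates: 3, 2, 8 -> keep 8
```

The key design point is that selection happens between epochs, so a poor update within an epoch never contaminates the starting point of the next one.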
IJCAI Conference 2021 Conference Paper
Music-to-dance translation is an emerging and powerful feature in recent role-playing games. Previous works on this topic treat music-to-dance as a supervised motion generation problem based on time-series data. However, these methods require a large number of training data pairs and may suffer from degraded movements. This paper provides a new solution to this task by re-formulating the translation as a piece-wise dance phrase retrieval problem based on choreography theory. With such a design, players are allowed to optionally edit the dance movements on top of our generation, whereas other regression-based methods ignore such user interactivity. Considering that dance motion capture is expensive and requires the assistance of professional dancers, we train our method in a semi-supervised fashion with a large unlabeled music dataset (20x larger than our labeled one) and also introduce self-supervised pre-training to improve training stability and generalization performance. Experimental results suggest that our method not only generalizes well over various styles of music but also succeeds in choreography for game players. Our project, including the large-scale dataset and supplemental materials, is available at https://github.com/FuxiCV/music-to-dance.
AAAI Conference 2021 Conference Paper
Self-supervised learning shows great potential in monocular depth estimation, using image sequences as the only source of supervision. Although people have tried to use high-resolution images for depth estimation, the accuracy of prediction has not been significantly improved. In this work, we find the core reason comes from inaccurate depth estimation in large-gradient regions, making the bilinear interpolation error gradually disappear as the resolution increases. To obtain more accurate depth estimation in large-gradient regions, it is necessary to obtain high-resolution features with spatial and semantic information. Therefore, we present an improved DepthNet, HR-Depth, with two effective strategies: (1) redesigning the skip-connection in DepthNet to get better high-resolution features and (2) proposing a feature fusion Squeeze-and-Excitation (fSE) module to fuse features more efficiently. Using ResNet-18 as the encoder, HR-Depth surpasses all previous state-of-the-art (SoTA) methods with the fewest parameters at both high and low resolution. Moreover, previous SoTA methods are based on fairly complex and deep networks with many parameters, which limits their real applications. Thus, we also construct a lightweight network which uses MobileNetV3 as the encoder. Experiments show that the lightweight network can perform on par with many large models like Monodepth2 at high resolution with only 20% of the parameters. All codes and models are available at https://github.com/shawLyu/HR-Depth.
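The observation that interpolation error concentrates in large-gradient regions can be seen in a 1-D analogue of bilinear upsampling (illustrative only; the toy depth profile is an assumption, not data from the paper):

```python
import numpy as np

def linear_upsample(x, factor):
    """1-D linear interpolation upsampling: the 1-D analogue of bilinear
    upsampling of a low-resolution depth map."""
    xs = np.arange(len(x))
    qs = np.linspace(0, len(x) - 1, (len(x) - 1) * factor + 1)
    return np.interp(qs, xs, x)

# Toy depth profile with one sharp edge (a large-gradient region).
hi = np.array([1.0] * 8 + [5.0] * 8)   # "ground truth" high-res depth
lo = hi[::2]                            # low-res version (subsampled)
rec = linear_upsample(lo, factor=2)     # reconstructed high-res estimate
err = np.abs(rec - hi[:len(rec)])
# err is zero in the flat regions and peaks exactly at the edge.
```

Flat regions are reconstructed perfectly, so the residual error is pinned to the depth discontinuity; this is why sharper high-resolution features, rather than plain upsampling, are needed at large-gradient regions.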
AAAI Conference 2021 Conference Paper
In this paper, we propose an effective global relation learning algorithm to recommend an appropriate location of a building unit for in-game customization of a residential home complex. Given a construction layout, we propose a visual context-aware graph generation network that learns the implicit global relations among the scene components and infers the location of a new building unit. The proposed network takes as input the scene graph and the corresponding top-view depth image. It provides location recommendations for a newly added building unit by learning an auto-regressive edge distribution conditioned on existing scenes. We also introduce a global graph-image matching loss to enhance the awareness of essential geometry semantics of the site. Qualitative and quantitative experiments demonstrate that the recommended locations well reflect the implicit spatial rules of components in residential estates, and that our method is instructive and practical for locating building units in the 3D scene of a complex construction.
YNIMG Journal 2021 Journal Article
AAAI Conference 2021 Conference Paper
Many deep learning based 3D face reconstruction methods have been proposed recently; however, few of them have applications in games. Current game character customization systems either require players to manually adjust considerable face attributes to obtain the desired face, or have limited freedom of facial shape and texture. In this paper, we propose an automatic character face creation method that predicts both facial shape and texture from a single portrait, and it can be integrated into most existing 3D games. Although 3D Morphable Face Model (3DMM) based methods can restore accurate 3D faces from single images, the topology of the 3DMM mesh differs from the meshes used in most games. To acquire high-fidelity texture, existing methods require a large amount of face texture data for training, while building such datasets is time-consuming and laborious. Besides, a dataset collected under laboratory conditions may not generalize well to in-the-wild situations. To tackle these problems, we propose 1) a low-cost facial texture acquisition method, 2) a shape transfer algorithm that can transform the shape of a 3DMM mesh to games, and 3) a new pipeline for training 3D game face reconstruction networks. The proposed method not only produces detailed and vivid game characters similar to the input portrait, but also eliminates the influence of lighting and occlusions. Experiments show that our method outperforms state-of-the-art methods used in games. Code and dataset are available at https://github.com/FuxiCV/MeInGame.
AAAI Conference 2021 Conference Paper
The paper proposes a novel generative adversarial network for one-shot face reenactment, which can animate a single face image to a different pose-and-expression (provided by a driving image) while keeping its original appearance. The core of our network is a novel mechanism called appearance adaptive normalization, which can effectively integrate the appearance information from the input image into our face generator by modulating the feature maps of the generator using the learned adaptive parameters. Furthermore, we specially design a local net to reenact the local facial components (i.e., eyes, nose, and mouth) first, which is a much easier task for the network to learn and can in turn provide explicit anchors to guide our face generator to learn the global appearance and pose-and-expression. Extensive quantitative and qualitative experiments demonstrate the significant efficacy of our model compared with prior one-shot methods.
AAAI Conference 2021 Conference Paper
In this paper we tackle the problem of pose guided person image generation, which aims to transfer a person image from the source pose to a novel target pose while maintaining the source appearance. Given the inefficiency of standard CNNs in handling large spatial transformation, we propose a structure-aware flow based method for high-quality person image generation. Specifically, instead of learning the complex overall pose changes of human body, we decompose the human body into different semantic parts (e.g., head, torso, and legs) and apply different networks to predict the flow fields for these parts separately. Moreover, we carefully design the network modules to effectively capture the local and global semantic correlations of features within and among the human parts respectively. Extensive experimental results show that our method can generate high-quality results under large pose discrepancy and outperforms state-of-the-art methods in both qualitative and quantitative comparisons.
YNIMG Journal 2020 Journal Article
AAAI Conference 2020 Conference Paper
With the rapid development of Role-Playing Games (RPGs), players are now allowed to edit the facial appearance of their in-game characters with their preferences rather than using default templates. This paper proposes a game character auto-creation framework that generates in-game characters according to a player's input face photo. Different from previous methods that are designed based on neural style transfer or monocular 3D face reconstruction, we re-formulate the character auto-creation process from a different point of view: by predicting a large set of physically meaningful facial parameters under a self-supervised learning paradigm. Instead of updating facial parameters iteratively at the input end of the renderer as suggested by previous methods, which is time-consuming, we introduce a facial parameter translator so that the creation can be done efficiently through a single forward propagation from the face embeddings to the parameters, with a considerable 1000x computational speedup. Despite its high efficiency, interactivity is preserved in our method, where users are allowed to optionally fine-tune the facial parameters of our creation according to their needs. Our approach also shows better robustness than previous methods, especially for photos with head-pose variance. Comparison results and ablation analysis on seven public face verification datasets suggest the effectiveness of our method.
AAAI Conference 2020 Conference Paper
Head pose estimation from RGB images without depth information is a challenging task due to the loss of spatial information as well as large head pose variations in the wild. The performance of existing landmark-free methods remains unsatisfactory as the quality of estimated pose is inferior. In this paper, we propose a novel three-branch network architecture, termed as Feature Decoupling Network (FDN), a more powerful architecture for landmark-free head pose estimation from a single RGB image. In FDN, we first propose a feature decoupling (FD) module to explicitly learn the discriminative features for each pose angle by adaptively recalibrating its channel-wise responses. Besides, we introduce a cross-category center (CCC) loss to constrain the distribution of the latent variable subspaces and thus we can obtain more compact and distinct subspaces. Extensive experiments on both in-the-wild and controlled environment datasets demonstrate that the proposed method outperforms other state-of-the-art methods based on a single RGB image and behaves on par with approaches based on multimodal input resources.
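The channel-wise recalibration that the FD module relies on can be sketched in a squeeze-and-excitation style. This is a simplified, hypothetical form: the single gating weight per channel and the sigmoid gate are assumptions, not the paper's exact module:

```python
import numpy as np

def recalibrate(features, w):
    """Squeeze-and-excitation-style channel recalibration sketch.

    features: (C, H, W) feature maps; w: (C,) per-channel gating weights
    (a simplified stand-in for the FD module's learned recalibration).
    """
    squeezed = features.mean(axis=(1, 2))         # global average pool per channel
    gates = 1.0 / (1.0 + np.exp(-w * squeezed))   # sigmoid gate per channel
    return features * gates[:, None, None]        # rescale each channel

feats = np.ones((2, 2, 2))
out = recalibrate(feats, w=np.array([0.0, 100.0]))
# channel 0 is gated by sigmoid(0) = 0.5; channel 1 by sigmoid(100) ~ 1
```

The effect is that channels useful for a given pose angle can be amplified and the rest suppressed, which is the "adaptively recalibrating its channel-wise responses" step described above.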