Arrow Research search

Author name cluster

Jia Jia

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
1 author row

Possible papers

29

AAAI Conference 2026 Conference Paper

Emotion-Conditioned Motion Sub-spaces with Flow Matching for Real-Time Audio-Driven Talking Heads

  • Haoyu Wang
  • Xiaozhe Xin
  • Xiaoyu Qin
  • Meiguang Jin
  • Junfeng Ma
  • Dan Xu
  • Jia Jia

Recent advances in audio-driven talking-head synthesis have brought lip-sync precision close to human perception, yet emotional fidelity and real-time inference remain open challenges. Existing pipelines typically disentangle lip articulation, facial expression, and head pose in latent space; this rigid factorization ignores the intrinsic coupling between articulation and affect (e.g., downward lip corners when sad), thus limiting expressiveness. We cast speech-conditioned facial motion as a sample from an emotion-conditioned distribution in a motion latent space. Concretely, we (i) learn a motion dictionary of orthogonal bases with an autoencoder via self-supervision, (ii) construct emotion-conditioned sub-spaces within the latent space, and (iii) design a layer-progressive cross-attention fusion module that modulates a flow-matching sampler with both audio and emotion signals. Only ten reverse ODE steps are required to generate a motion-latent trajectory, enabling real-time end-to-end latency. Extensive experiments on MEAD and RAVDESS show that our method outperforms recent GAN- and diffusion-based baselines in emotion accuracy while running at around 75 FPS on a single desktop GPU. The proposed framework delivers the first emotionally expressive Audio2Face system that simultaneously achieves lip-sync accuracy, affective realism, and real-time performance.
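
The ten-step reverse-ODE sampling mentioned above amounts to fixed-step Euler integration of a learned velocity field from t=0 to t=1. A minimal sketch, where the toy velocity field below is a hypothetical stand-in for the paper's audio- and emotion-conditioned network:

```python
def euler_sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical stand-in for the learned conditional velocity field:
# a field that simply points toward one fixed "target" motion latent.
target = [1.0, -2.0, 0.5]
def toy_velocity(x, t):
    return [ti - xi for ti, xi in zip(target, x)]

sample = euler_sample(toy_velocity, [0.0, 0.0, 0.0], steps=10)
```

With ten steps, each Euler update moves a tenth of the way along the current velocity; the real system trades integration accuracy for latency in exactly this way.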

IS Journal 2026 Journal Article

Ordinal Prompt-Regularized Graph Optimal Transport for Image Ordinal Estimation

  • Xiangkai Wang
  • Kai Zhang
  • Xiaoxu Liu
  • Jia Jia
  • Maozhi Zhang

This article proposes a novel approach for image ordinal estimation, leveraging the power of optimal transport (OT) and prompt learning. Traditional ordinal regression methods primarily focus on learning a model to predict numerical scores, which may not directly reflect the intrinsic order. To address this limitation, we introduce a framework, termed ordinal prompt-regularized graph optimal transport (OPGOT), which utilizes OT to align the distribution of images with that of ordinal labels. First, we incorporate prompt learning with pretrained text encoders to construct ordinal prompts through a token-wise distance-based weighting scheme, enabling the model to capture the semantic relationships between ordinal categories. Second, OPGOT matches the graphs of image features and prompt embeddings by optimizing the OT objective with a language-image cost. Hence, the learned transport plan reflects the intrinsic ordinal relationships. We conduct extensive evaluations on four benchmark datasets covering different scenarios, demonstrating that OPGOT achieves significant improvements over existing methods.
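
The transport-plan idea above can be sketched with entropically regularized OT (Sinkhorn iterations). This is a generic illustration, not the paper's formulation: the cost matrix here is a hypothetical rank-distance "language-image" cost, and both marginals are taken uniform.

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropically regularized OT between two uniform marginals (Sinkhorn)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # transport plan: how much mass flows from image i to ordinal label j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical cost: absolute distance between ordinal ranks 0..2.
cost = [[abs(i - j) for j in range(3)] for i in range(3)]
plan = sinkhorn(cost)
```

Because the cost grows with rank distance, the learned plan concentrates mass near the diagonal, which is the sense in which a transport plan "reflects the intrinsic ordinal relationships".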

YNICL Journal 2025 Journal Article

A multi-modal study on cerebrovascular dysfunction in cognitive decline of de novo Parkinson’s disease

  • Hongwei Li
  • Xiali Shao
  • Jia Jia
  • Bingyi Wang
  • Jian Wang
  • Kai Liu
  • Jinhan Chen
  • Zhensen Chen

BACKGROUND: Vascular risk factors are increasingly implicated in Parkinson's disease (PD), but the role of cerebrovascular dysfunction in early-stage PD remains unclear. Here, we investigated resting-state cerebrovascular reactivity (RS-CVR), cerebral blood flow (CBF), arterial morphological changes, and corresponding alterations in functional connectivity density (FCD) in de novo PD patients with different cognitive status. METHODS: 25 de novo PD patients with mild cognitive impairment (PD-MCI), 34 with normal cognition (PD-NC), and 48 healthy controls (HCs) underwent neuropsychological assessments and multimodal MRI. CBF was derived from arterial spin labeling, RS-CVR and FCD were generated from resting-state functional MRI, and arterial morphology was extracted from the magnitude images of multi-echo gradient echo imaging. RESULTS: RS-CVR was significantly decreased in PD patients, particularly in the left occipital gyrus and posterior cerebral artery (PCA) territories. Long-range FCD was reduced in the left inferior occipital gyrus in both PD-NC and PD-MCI compared to HCs (p = 0.005, p < 0.001). In PD-MCI, negative correlations between Stroop Color-Word Test time and RS-CVR in the distal right PCA (r = -0.71, pFDR = 0.030) and middle left PCA (r = -0.66, pFDR = 0.044) were observed. A significant correlation was found between decreased long-range FCD in the left inferior occipital gyrus and poorer Trail Making Test Part B performance (r = -0.63, pFDR = 0.029) in PD-MCI. No significant group differences in CBF were found, but significant dilation of the left PCA and compensatory CBF increases in the corresponding territory were observed in PD-MCI (r = 0.57, pFDR = 0.023). DISCUSSION: Microvascular dysfunction, rather than perfusion defects, might underlie early-stage de novo PD, especially in patients with PD-MCI.

NeurIPS Conference 2024 Conference Paper

Skinned Motion Retargeting with Dense Geometric Interaction Perception

  • Zijie Ye
  • Jia-Wei Liu
  • Jia Jia
  • Shikun Sun
  • Mike Zheng Shou

Capturing and maintaining geometric interactions among different body parts is crucial for successful motion retargeting in skinned characters. Existing approaches often overlook body geometries or add a geometry correction stage after skeletal motion retargeting. This results in conflicts between skeleton interaction and geometry correction, leading to issues such as jitter, interpenetration, and contact mismatches. To address these challenges, we introduce a new retargeting framework, MeshRet, which directly models the dense geometric interactions in motion retargeting. Initially, we establish dense mesh correspondences between characters using semantically consistent sensors (SCS), effective across diverse mesh topologies. Subsequently, we develop a novel spatio-temporal representation called the dense mesh interaction (DMI) field. This field, a collection of interacting SCS feature vectors, skillfully captures both contact and non-contact interactions between body geometries. By aligning the DMI field during retargeting, MeshRet not only preserves motion semantics but also prevents self-interpenetration and ensures contact preservation. Extensive experiments on the public Mixamo dataset and our newly collected ScanRet dataset demonstrate that MeshRet achieves state-of-the-art performance. Code is available at https://github.com/abcyzj/MeshRet.

NeurIPS Conference 2024 Conference Paper

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

  • Houlun Chen
  • Xin Wang
  • Hong Chen
  • Zeyang Zhang
  • Wei Feng
  • Bin Huang
  • Jia Jia
  • Wenwu Zhu

Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark requiring methods to localize the best-matched moment from the corpus among other partially matched candidates. To improve dataset construction efficiency and guarantee high-quality data annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline to generate captions with RelIable FInE-grained statics and Dynamics. Specifically, we resort to large language models (LLMs) and large multimodal models (LMMs) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out inaccurate annotations caused by LLM hallucination, we propose a Fine-Granularity Aware Noise Evaluator, where we fine-tune a video foundation model with contrastive and matching losses augmented by disturbed hard negatives. With VERIFIED, we construct a more challenging fine-grained VCMR benchmark containing Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG, which demonstrates a high level of annotation quality. We evaluate several state-of-the-art VCMR models on the proposed dataset, revealing that there is still significant scope for fine-grained video understanding in VCMR.
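
The hard-negative-augmented contrastive loss can be sketched with a standard InfoNCE-style objective. This is a generic illustration of the principle, not the paper's exact loss; the similarity scores and temperature below are hypothetical:

```python
import math

def info_nce(sim_pos, sims_neg, temperature=0.07):
    """InfoNCE-style contrastive loss for one (video, caption) pair.

    sim_pos: similarity to the matching caption.
    sims_neg: similarities to disturbed hard-negative captions.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

easy = info_nce(0.9, [0.1, 0.2])    # distant negatives: small loss
hard = info_nce(0.9, [0.85, 0.88])  # disturbed hard negatives: larger loss
```

Hard negatives that sit close to the positive dominate the denominator, so they contribute the strongest gradient signal, which is why disturbed near-miss captions sharpen the fine-grained evaluator.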

AAAI Conference 2023 Conference Paper

What Does Your Face Sound Like? 3D Face Shape towards Voice

  • Zhihan Yang
  • Zhiyong Wu
  • Ying Shan
  • Jia Jia

Face-based speech synthesis provides a practical solution to generate voices from human faces. However, directly using 2D face images leads to problems of uninterpretability and entanglement. In this paper, to address these issues, we introduce 3D face shape, which (1) has an anatomical relationship with voice characteristics, partaking in the "bone conduction" of human timbre production, and (2) is naturally independent of irrelevant factors by excluding the blending process. We devise a three-stage framework to generate speech from 3D face shapes. Fully considering timbre production in anatomical and acquired terms, our framework incorporates three additional relevant attributes: face texture, facial features, and demographics. Experiments and subjective tests demonstrate that our method can generate utterances matching faces well, with good audio quality and voice diversity. We also explore and visualize how the voice changes with the face. Case studies show that our method upgrades face-voice inference to personalized custom-made voice creation, revealing a promising prospect in virtual human and dubbing applications.

YNICL Journal 2022 Journal Article

Locus coeruleus integrity correlates with inhibitory functions of the fronto-subthalamic ‘hyperdirect’ pathway in Parkinson’s disease

  • Biman Xu
  • Tingting He
  • Yuan Lu
  • Jia Jia
  • Barbara J. Sahakian
  • Trevor W. Robbins
  • Lirong Jin
  • Zheng Ye

A long-running debate concerns whether dopamine or noradrenaline deficiency drives response disinhibition in Parkinson's disease (PD). This study aimed to investigate whether damage to the locus coeruleus (LC) or substantia nigra (SN) might impact inhibitory functions of the fronto-subthalamic hyperdirect or fronto-striatal indirect pathway. Patients with PD (n = 29, 13 women) and matched healthy controls (n = 29, 15 women) participated in this cross-sectional study. LC and SN integrity was assessed using neuromelanin-sensitive MRI. Response inhibition was measured using fMRI with a stop-signal task. In healthy controls, LC (but not SN) integrity correlated with the stopping-related activity of the right inferior frontal gyrus (IFG) and right subthalamic nucleus (STN), which further correlated with stop-signal reaction time (SSRT). PD patients showed reduced LC integrity, longer SSRT, and lower stopping-related activity over the right IFG, pre-supplementary motor area, and right caudate nucleus than healthy controls. In PD patients, the relationship between SSRT and the fronto-subthalamic pathway was preserved. However, LC integrity no longer correlated with the stopping-related right IFG or right STN activity. No contribution of SN integrity was found during stopping. In conclusion, LC (but not SN) might modulate inhibitory functions of the right IFG-STN pathway. Damage to the LC might impact the right IFG-STN pathway during stopping, leading to response disinhibition in PD.

AAAI Conference 2021 Conference Paper

Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation based Deep Learning Approach

  • Suping Zhou
  • Jia Jia
  • Zhiyong Wu
  • Zhihan Yang
  • Yanfeng Wang
  • Wei Chen
  • Fanbo Meng
  • Shuo Huang

Effective emotion inference from user queries helps to give a more personified response in Voice Dialogue Applications (VDAs). The tremendous number of VDA users brings in diverse emotion expressions. How can we achieve high emotion-inference performance on large-scale internet voice data in VDAs? Traditionally, research on speech emotion recognition has been based on acted voice datasets, which have limited speakers but strong and clear emotion expressions. Inspired by this, in this paper, we propose a novel approach to leverage acted voice data with strong emotion expressions to enhance large-scale unlabeled internet voice data with diverse emotion expressions for emotion inference. Specifically, we propose a novel semi-supervised multi-modal curriculum augmentation deep learning framework. First, to learn more general emotion cues, we adopt a curriculum-learning-based epoch-wise training strategy, which trains our model guided by strong and balanced emotion samples from acted voice data and subsequently leverages weak and unbalanced emotion samples from internet voice data. Second, to employ more diverse emotion expressions, we design a Multi-path Mixmatch Multimodal Deep Neural Network (MMMD), which effectively learns feature representations for multiple modalities and trains on labeled and unlabeled data with hybrid semi-supervised methods for superior generalisation and robustness. Experiments on an internet voice dataset with 500,000 utterances show our method outperforms several alternative baselines (+10.09% in terms of F1), while an acted corpus with 2,397 utterances contributes 4.35%. To further compare our method with state-of-the-art techniques on traditional acted voice datasets, we also conduct experiments on the public dataset IEMOCAP. The results reveal the effectiveness of the proposed approach.
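
The epoch-wise curriculum idea, starting from strong acted samples and gradually mixing in weak internet samples, can be sketched as a simple batch-composition schedule. The linear ramp and the 0.8 cap below are hypothetical choices, not the paper's:

```python
def curriculum_mix(epoch, total_epochs, start_weak=0.0, end_weak=0.8):
    """Fraction of weak (internet) samples per batch, ramped linearly by epoch."""
    frac = epoch / max(1, total_epochs - 1)
    return start_weak + frac * (end_weak - start_weak)

# Early epochs rely on acted data; later epochs mix in more internet data.
schedule = [round(curriculum_mix(e, 5), 2) for e in range(5)]
```

A training loop would then draw each batch with `curriculum_mix(epoch, ...)` of its samples from the weak pool and the remainder from the strong acted pool.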

AAAI Conference 2020 Conference Paper

Mining Unfollow Behavior in Large-Scale Online Social Networks via Spatial-Temporal Interaction

  • Haozhe Wu
  • Zhiyuan Hu
  • Jia Jia
  • Yaohua Bu
  • Xiangnan He
  • Tat-Seng Chua

Online Social Networks (OSNs) evolve through two pervasive behaviors: follow and unfollow, which respectively signify relationship creation and relationship dissolution. Research on social network evolution mainly focuses on the follow behavior, while the unfollow behavior has largely been ignored. Mining unfollow behavior is challenging because a user's decision to unfollow is affected not only by the simple combination of the user's attributes, such as informativeness and reciprocity, but also by the complex interaction among them. Meanwhile, prior datasets seldom contain sufficient records for inferring such complex interaction. To address these issues, we first construct a large-scale real-world Weibo dataset, which records detailed post content and relationship dynamics of 1.8 million Chinese users. Next, we define users' attributes as two categories: spatial attributes (e.g., social role of the user) and temporal attributes (e.g., post content of the user). Leveraging the constructed dataset, we systematically study how the interaction effects between users' spatial and temporal attributes contribute to the unfollow behavior. Afterwards, we propose a novel unified model with heterogeneous information (UMHI) for unfollow prediction. Specifically, our UMHI model: 1) captures users' spatial attributes through the social network structure; 2) infers users' temporal attributes through user-posted content and unfollow history; and 3) models the interaction between spatial and temporal attributes with nonlinear MLP layers. Comprehensive evaluations on the constructed dataset demonstrate that the proposed UMHI model outperforms baseline methods by 16.44 on average in terms of precision. In addition, factor analyses verify that both spatial attributes and temporal attributes are essential for mining unfollow behavior.

AAAI Conference 2020 Conference Paper

PEIA: Personality and Emotion Integrated Attentive Model for Music Recommendation on Social Media Platforms

  • Tiancheng Shen
  • Jia Jia
  • Yan Li
  • Yihui Ma
  • Yaohua Bu
  • Hanjie Wang
  • Bo Chen
  • Tat-Seng Chua

With the rapid expansion of digital music formats, it is indispensable to recommend to users their favorite music. In music recommendation, users' personality and emotion greatly affect their music preference, in a long-term and a short-term manner respectively, while rich social media data provides effective feedback on this information. In this paper, aiming at music recommendation on social media platforms, we propose a Personality and Emotion Integrated Attentive model (PEIA), which fully utilizes social media data to comprehensively model users' long-term taste (personality) and short-term preference (emotion). Specifically, it takes full advantage of personality-oriented user features, emotion-oriented user features, and music features of multi-faceted attributes. Hierarchical attention is employed to distinguish the important factors when incorporating the latent representations of users' personality and emotion. Extensive experiments on a large real-world dataset of 171,254 users demonstrate the effectiveness of our PEIA model, which achieves an NDCG of 0.5369, outperforming the state-of-the-art methods. We also perform detailed parameter analysis and feature contribution analysis, which further verify our scheme and demonstrate the significance of co-modeling user personality and emotion in music recommendation.
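
The NDCG metric reported above is standard and worth making concrete: discounted cumulative gain of the predicted ranking, normalized by the ideal ranking. A minimal reference implementation with hypothetical relevance scores:

```python
import math

def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain for one ranked list.

    relevances: graded relevance of items in the predicted rank order.
    """
    k = k or len(relevances)
    def dcg(rels):
        # position i contributes rel / log2(i + 2): rank 1 divides by log2(2)=1
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

score = ndcg([3, 2, 0, 1])  # one mildly mis-ordered list (toy relevances)
```

A perfectly ordered list scores exactly 1.0, so PEIA's 0.5369 is read against that ceiling.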

IJCAI Conference 2019 Conference Paper

Design and Implementation of a Disambiguity Framework for Smart Voice Controlled Devices

  • Kehua Lei
  • Tianyi Ma
  • Jia Jia
  • Cunjun Zhang
  • Zhihan Yang

With about 100 million people using them recently, SVCDs (Smart Voice Controlled Devices) are becoming commonplace. Whether at home or in an office, multiple appliances are usually under the control of a single SVCD, and several people may manipulate an SVCD simultaneously. However, present SVCDs fail to handle these situations appropriately. In this paper, we propose a novel framework for SVCDs to eliminate the ambiguity of orders in single-user and multi-user scenarios. We also design an algorithm combining Word2Vec and emotion detection for the device to resolve ambiguity. Finally, we apply our framework to a virtual smart home scene, and its performance indicates that our strategy resolves these problems commendably.

IJCAI Conference 2019 Conference Paper

Towards Discriminative Representation Learning for Speech Emotion Recognition

  • Runnan Li
  • Zhiyong Wu
  • Jia Jia
  • Yaohua Bu
  • Sheng Zhao
  • Helen Meng

In intelligent speech interaction, automatic speech emotion recognition (SER) plays an important role in understanding user intention. While sentimental speech has different speaker characteristics but similar acoustic attributes, one vital challenge in SER is how to learn robust and discriminative representations for emotion inference. In this paper, inspired by human emotion perception, we propose a novel representation learning component (RLC) for SER systems, which is constructed with Multi-head Self-attention and a Global Context-aware Attention Long Short-Term Memory Recurrent Neural Network (GCA-LSTM). With the ability of the Multi-head Self-attention mechanism to model element-wise correlative dependencies, the RLC can exploit the common patterns of sentimental speech features to import more emotion-salient information into representation learning. By employing GCA-LSTM, the RLC can selectively focus on emotion-salient factors with consideration of the entire utterance context, and gradually produce discriminative representations for emotion inference. Experiments on the public emotional benchmark database IEMOCAP and a large realistic interaction database demonstrate the superior performance of the proposed SER framework, with 6.6% to 26.7% relative improvement in unweighted accuracy compared to state-of-the-art techniques.
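
The self-attention building block referenced above is scaled dot-product attention. A single-head, pure-Python sketch on a toy two-frame sequence (the multi-head and GCA-LSTM parts are omitted):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over a sequence of feature vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        # similarity of this frame to every frame, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0]]  # toy frame features
y = attention(x, x, x)        # self-attention: each frame attends over all frames
```

Each output frame is a convex combination of all frames, weighted by similarity, which is how element-wise correlative dependencies across an utterance are modeled.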

IJCAI Conference 2018 Conference Paper

Cross-Domain Depression Detection via Harvesting Social Media

  • Tiancheng Shen
  • Jia Jia
  • Guangyao Shen
  • Fuli Feng
  • Xiangnan He
  • Huanbo Luan
  • Jie Tang
  • Thanassis Tiropanis

Depression detection is a significant issue for human well-being. In previous studies, online detection has proven effective on Twitter, enabling proactive care for depressed users. However, owing to cultural differences, replicating the method on other social media platforms, such as Chinese Weibo, might lead to poor performance because of insufficient available labeled (self-reported depression) data for model training. In this paper, we study the interesting but challenging problem of enhancing detection in a certain target domain (e.g., Weibo) with ample Twitter data as the source domain. We first systematically analyze the depression-related feature patterns across domains and summarize two major detection challenges, namely isomerism and divergency. We further propose a cross-domain Deep Neural Network model with a Feature Adaptive Transformation & Combination strategy (DNN-FATC) that transfers the relevant information across heterogeneous domains. Experiments demonstrate improved performance compared to existing heterogeneous transfer methods or training directly in the target domain (over 3.4% improvement in F1), indicating the potential of our model to enable depression detection via social media for more countries with different cultural settings.
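
Cross-domain feature adaptation of the kind described can be sketched at its simplest as per-dimension moment matching: rescale each source-domain feature so its mean and spread match the target domain. This is a minimal stand-in for illustration, not the paper's FATC strategy:

```python
import statistics as st

def align_features(source, target):
    """Shift/scale each source feature dimension to the target's mean and std."""
    mu_s = [st.mean(col) for col in zip(*source)]
    sd_s = [st.pstdev(col) or 1.0 for col in zip(*source)]  # guard zero spread
    mu_t = [st.mean(col) for col in zip(*target)]
    sd_t = [st.pstdev(col) or 1.0 for col in zip(*target)]
    return [[(x - ms) / ss * ts + mt
             for x, ms, ss, ts, mt in zip(row, mu_s, sd_s, sd_t, mu_t)]
            for row in source]

# Toy 1-D features: hypothetical "Twitter" source vs. "Weibo" target scales.
aligned = align_features([[0.0], [2.0]], [[10.0], [14.0]])
```

After alignment the source features occupy the target's feature scale, so a classifier trained on the adapted source is no longer misled by raw distribution shift.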

AAAI Conference 2018 Conference Paper

Inferring Emotion from Conversational Voice Data: A Semi-Supervised Multi-Path Generative Neural Network Approach

  • Suping Zhou
  • Jia Jia
  • Qi Wang
  • Yufei Dong
  • Yufeng Yin
  • Kehua Lei

To give a more humanized response in Voice Dialogue Applications (VDAs), inferring emotion states from users' queries may play an important role. However, in VDAs, we have a tremendous number of users and a massive scale of unlabeled data with high-dimensional features from multimodal information, which challenge traditional speech emotion recognition methods. In this paper, to better infer emotion from conversational voice data, we propose a semi-supervised multi-path generative neural network. Specifically, we first build a novel supervised multi-path deep neural network framework. To avoid high-dimensional input, raw features are trained by groups in local classifiers. The high-level features of each local classifier are then concatenated as input to a global classifier. These two kinds of classifiers are trained simultaneously through a single objective function to achieve more effective and discriminative emotion inference. To further solve the labeled-data-scarcity problem, we extend the multi-path deep neural network to a generative model based on a semi-supervised variational autoencoder (semi-VAE), which is able to train on labeled and unlabeled data simultaneously. Experiments based on a 24,000-sample real-world dataset collected from the Sogou Voice Assistant (SVAD13) and the benchmark dataset IEMOCAP show that our method significantly outperforms the existing state-of-the-art results.

AAAI Conference 2018 System Paper

Lookine: Let the Blind Hear a Smile

  • Yaohua Bu
  • Jia Jia
  • Yuhan Tang
  • Xuan Zang
  • Tianyu Gao

It is believed that nonverbal visual information, including facial expressions, facial micro-actions, and head movements, plays a significant role in fundamental social communication. Unfortunately, the blind cannot access such necessary information. Therefore, we propose a social-assistant system, Lookine, to help them go beyond this limitation. For Lookine, we apply novel techniques including facial expression recognition, facial action recognition, and head pose estimation, and obey “barrier-free” principles in our design. In experiments, the algorithm evaluation and a user study show that our system has promising accuracy, good real-time performance, and great user experience.

IJCAI Conference 2018 Conference Paper

Mental Health Computing via Harvesting Social Media Data

  • Jia Jia

Mental health has become a general concern of people nowadays. It is of vital importance to detect and manage mental health issues before they turn into severe problems. Traditional psychological interventions are reliable, but expensive and slow to take effect. With the rapid development of social media, people are increasingly sharing their daily lives and interacting with friends online. Via harvesting social media data, we comprehensively study the detection of mental wellness, with two typical mental problems, stress and depression, as specific examples. Starting with binary user-level detection, we expand our research towards multiple contexts, by considering the trigger and level of mental health problems, and involving different social media platforms of different cultures. We construct several benchmark real-world datasets for analysis and propose a series of multi-modal detection models, whose effectiveness is verified by extensive experiments. We also conduct in-depth analyses to reveal the underlying online behaviors regarding these mental health issues.

AAAI Conference 2017 System Paper

A Virtual Personal Fashion Consultant: Learning from the Personal Preference of Fashion

  • Jingtian Fu
  • Yejun Liu
  • Jia Jia
  • Yihui Ma
  • Fanhang Meng
  • Huan Huang

Besides fashion, personalization is another important factor in dressing. How to balance fashion trends and personal preference to better appreciate wearing is a non-trivial task. In previous work we developed a demo, Magic Mirror, to recommend clothing collocations based on the fashion trend. However, the diversity of people's aesthetics is huge. To meet different demands, Magic Mirror is upgraded in this paper: it can give recommendations by considering both the fashion trend and personal preference, and work as a private clothing consultant. For more suitable recommendations, the virtual consultant learns users' tastes and preferences from their behaviors using a genetic algorithm. Users can get collocations or matched top/bottom recommendations after choosing an occasion and style. They can also get a report about their fashion state and aesthetic standpoint on recent wearing.
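
Preference learning with a genetic algorithm can be sketched as evolving a weight vector toward whatever the user's feedback rewards. Everything below is a toy: the fitness function standing in for "user preference", the population size, and the mutation rate are all hypothetical:

```python
import random

def evolve(fitness, dim=4, pop_size=20, gens=40, seed=0):
    """Minimal genetic algorithm: tournament selection, 1-point crossover, mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        nxt = []
        for _ in range(pop_size):
            p1 = max(rng.sample(pop, 2), key=fitness)  # tournament of 2
            p2 = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(dim)
            child = p1[:cut] + p2[cut:]                # 1-point crossover
            if rng.random() < 0.2:                     # occasional mutation
                child[rng.randrange(dim)] = rng.random()
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Hypothetical preference: the user "likes" weights close to [1, 0, 1, 0].
target = [1.0, 0.0, 1.0, 0.0]
fit = lambda w: -sum((wi - ti) ** 2 for wi, ti in zip(w, target))
best = evolve(fit)
```

In a real consultant, `fitness` would be derived from user behavior (clicks, saves, rejections) rather than a known target vector.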

JBHI Journal 2017 Journal Article

Analyzing and Identifying Teens’ Stressful Periods and Stressor Events From a Microblog

  • Qi Li
  • Yuanyuan Xue
  • Liang Zhao
  • Jia Jia
  • Ling Feng

Increased health problems among adolescents caused by psychological stress have aroused worldwide attention. Long-standing stress without targeted assistance and guidance negatively impacts the healthy growth of adolescents, threatening the future development of our society. So far, research has focused on detecting adolescent psychological stress revealed by each individual post on microblogs. However, beyond stressful moments, identifying teens' stressful periods and the stressor events that trigger each stressful period is more desirable to understand the stress from appearance to essence. In this paper, we define the problem of identifying teens' stressful periods and stressor events from the open social media microblog. Starting from a case study of adolescents' posting behaviors during stressful school events, we build a Poisson-based probability model for the correlation between stressor events and stressful posting behaviors through a series of posts on Tencent Weibo (referred to as the microblog throughout the paper). With the model, we discover teens' maximal stressful periods and further extract details of possible stressor events that cause the stressful periods. We generalize and present the extracted stressor events in a hierarchy based on common stress dimensions and event types. Taking 122 scheduled stressful study-related events in a high school as the ground truth, we test the approach on 124 students' posts from January 1, 2012 to February 1, 2015 and obtain promising experimental results: (stressful periods: recall 0.761, precision 0.737, and F1-measure 0.734) and (top-3 stressor events: recall 0.763, precision 0.756, and F1-measure 0.759). The most prominent stressor events extracted are in the self-cognition domain, followed by the school life domain. This conforms to the adolescent psychological finding that problems in school life are usually accompanied by teens' inner cognition problems. Compared with the state-of-the-art top-1 personal life event detection approach, our stressor event detection method is 13.72% higher in precision, 19.18% higher in recall, and 16.50% higher in F1-measure, demonstrating the effectiveness of our proposed framework.
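
The Poisson-based modeling idea can be illustrated in miniature: treat daily post counts as Poisson-distributed under a baseline rate and flag days whose counts are improbably high. The baseline rate, counts, and threshold below are hypothetical, and this is a one-day sketch of what the paper does over whole periods:

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam): 1 minus the CDF up to k-1."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def stressful_days(daily_counts, baseline_rate, alpha=0.01):
    """Indices of days whose post count is improbable under the Poisson baseline."""
    return [d for d, c in enumerate(daily_counts)
            if poisson_sf(c, baseline_rate) < alpha]

# Toy week of post counts around a hypothetical exam: days 3-4 spike.
days = stressful_days([2, 1, 3, 12, 11, 2, 1], baseline_rate=2.0)
```

Consecutive flagged days would then be merged into a candidate stressful period, from which stressor-event details are extracted.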

AAAI Conference 2017 System Paper

AniDraw: When Music and Dance Meet Harmoniously

  • Yaohua Bu
  • Tang Taoran
  • Jia Jia
  • Ma Zhiyuan
  • Wu Songyao
  • You Yuming

In this paper, we present a demo, AniDraw, which can help users practice the coordination between their hands, mouth, and eyes by combining the elements of music, painting, and dance. Users can sketch a cartoon character through multi-touch screens and then hum songs, which drives the cartoon character to dance, creating a lively animation. In the technical realization, we apply an acoustic-driving mechanism in which AniDraw extracts time-domain acoustic features to map to the intensity of dances, frequency-domain ones to map to the style of dances, and high-level ones, including onsets and tempos, to map to the start, duration, and speed of dances. AniDraw can not only stimulate users' enthusiasm for artistic creation, but also enhance their aesthetic sense of harmony.

IJCAI Conference 2017 Conference Paper

Depression Detection via Harvesting Social Media: A Multimodal Dictionary Learning Solution

  • Guangyao Shen
  • Jia Jia
  • Liqiang Nie
  • Fuli Feng
  • Cunjun Zhang
  • Tianrui Hu
  • Tat-Seng Chua
  • Wenwu Zhu

Depression is a major contributor to the overall global burden of disease. Traditionally, doctors diagnose depressed people face to face by referring to clinical depression criteria. However, more than 70% of patients do not consult doctors at the early stages of depression, which leads to further deterioration of their condition. Meanwhile, people are increasingly relying on social media to disclose emotions and share their daily lives; thus social media has successfully been leveraged to help detect physical and mental diseases. Inspired by these observations, our work aims at timely depression detection via harvesting social media data. We construct well-labeled depression and non-depression datasets on Twitter, and extract six depression-related feature groups covering not only the clinical depression criteria, but also online behaviors on social media. With these feature groups, we propose a multimodal depressive dictionary learning model to detect depressed users on Twitter. A series of experiments are conducted to validate this model, which outperforms several baselines by 3% to 10%. Finally, we analyze a large-scale dataset on Twitter to reveal the underlying online behaviors of depressed and non-depressed users.

AAAI Conference 2017 Conference Paper

Multi-Task Deep Learning for User Intention Understanding in Speech Interaction Systems

  • Yishuang Ning
  • Jia Jia
  • Zhiyong Wu
  • Runnan Li
  • Yongsheng An
  • Yanfeng Wang
  • Helen Meng

Speech interaction systems have been gaining popularity in recent years. The main purpose of these systems is to generate more satisfactory responses according to users' speech utterances, in which the most critical problem is to analyze user intention. Research shows that user intention conveyed through speech is expressed not only by content, but is also closely related to users' speaking manners (e.g., with or without acoustic emphasis). How to incorporate these heterogeneous attributes to infer user intention remains an open problem. In this paper, we define Intention Prominence (IP) as the semantic combination of focus by text and emphasis by speech, and propose a multi-task deep learning framework to predict IP. Specifically, we first use long short-term memory (LSTM), which is capable of modeling long short-term contextual dependencies, to detect focus and emphasis, and incorporate the tasks of focus and emphasis detection with multi-task learning (MTL) to reinforce the performance of each other. We then employ a Bayesian network (BN) to incorporate multimodal features (focus, emphasis, and location reflecting users' dialect conventions) to predict IP based on feature correlations. Experiments on a dataset of 135,566 utterances collected from the real-world Sogou Voice Assistant illustrate that our method can outperform the comparison methods by 6.9-24.5% in terms of F1-measure. Moreover, real practice in the Sogou Voice Assistant indicates that our method can improve performance on user intention understanding by 7%.

AAAI Conference 2017 System Paper

SenseRun: Real-Time Running Routes Recommendation towards Providing Pleasant Running Experiences

  • Jiayu Long
  • Jia Jia
  • Han Xu

In this demo, we develop a mobile running application, SenseRun, that incorporates landscape experiences into route recommendation. We first define landscape experience (perceived enjoyment of the landscape that motivates running) in terms of public natural area and traffic density. Based on landscape experience, we categorize locations into three types (natural, leisure, and traffic space) and assign each a different base weight. Real-time context factors (weather, season, and hour of the day) are used to adjust the weights. We propose a multi-attribute method that recommends weighted routes based on an MVT (Marginal Value Theorem) k-shortest-paths algorithm. We also use a landscape-aware sounds algorithm as a supplement to the landscape experience. Experimental results show that SenseRun can enhance running experiences and helps promote regular physical activity.
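The idea of context-adjusted space-type weights can be illustrated with a small Dijkstra sketch; the graph, multipliers, and weather rule below are hypothetical, and the actual system uses an MVT-based k-shortest-paths algorithm rather than plain shortest path:

```python
import heapq

# Hypothetical base cost multipliers for the three space types;
# lower cost means a more pleasant segment to run through.
BASE = {"natural": 0.6, "leisure": 0.8, "traffic": 1.5}

def adjusted(space_type, weather):
    """Adjust a segment's multiplier for real-time context (toy rule)."""
    w = BASE[space_type]
    if weather == "rain" and space_type == "natural":
        w *= 3.0  # open natural areas become unattractive in rain
    return w

def best_route(graph, src, dst, weather):
    """Dijkstra over cost = segment length * context-adjusted multiplier."""
    pq, seen = [(0.0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, length, stype in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(
                    pq, (cost + length * adjusted(stype, weather), nxt, path + [nxt]))
    return float("inf"), []

# Toy segment graph: node -> [(next_node, length_m, space_type), ...]
graph = {
    "start": [("park", 400, "natural"), ("road", 250, "traffic")],
    "park":  [("finish", 300, "leisure")],
    "road":  [("finish", 300, "traffic")],
}

sunny_cost, sunny_path = best_route(graph, "start", "finish", "sunny")
rainy_cost, rainy_path = best_route(graph, "start", "finish", "rain")
```

Under these toy numbers, the recommended route shifts from the park to the road segment when it rains.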

AAAI Conference 2017 Conference Paper

Towards Better Understanding the Clothing Fashion Styles: A Multimodal Deep Learning Approach

  • Yihui Ma
  • Jia Jia
  • Suping Zhou
  • Jingtian Fu
  • Yejun Liu
  • Zijian Tong

In this paper, we aim to better understand clothing fashion styles. Two challenges remain: 1) how to quantitatively describe the fashion styles of various clothing, and 2) how to model the subtle relationship between visual features and fashion styles, especially considering clothing collocations. Using the words that people commonly use to describe clothing fashion styles on shopping websites, we build a Fashion Semantic Space (FSS) based on Kobayashi’s aesthetics theory to describe clothing fashion styles quantitatively and universally. We then propose a novel fashion-oriented multimodal deep learning model, the Bimodal Correlative Deep Autoencoder (BCDA), to capture the internal correlation in clothing collocations. Employing a benchmark dataset we build with 32,133 full-body fashion show images, we use BCDA to map the visual features to the FSS. The experimental results indicate that our model outperforms several alternative baselines by 13% in terms of MSE, confirming that our model can better understand clothing fashion styles. To further demonstrate the advantages of our model, we conduct some interesting case studies, including fashion trend analyses of brands, clothing collocation recommendation, etc.
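The core intuition, that upper- and lower-body features share a joint low-dimensional code because collocations are correlated, can be shown with a drastically simplified linear analogue of BCDA (an SVD stands in for the deep autoencoder; all data and dimensions are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy visual features for 40 outfits: correlated upper- and lower-body halves.
upper = rng.normal(size=(40, 6))
lower = 0.5 * upper + 0.5 * rng.normal(size=(40, 6))
X = np.hstack([upper, lower])  # joint bimodal representation

def shared_code_mse(X, k):
    """Best rank-k linear 'autoencoder' of the joint features via SVD."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    recon = (U[:, :k] * s[:k]) @ Vt[:k] + mu
    return float(np.mean((X - recon) ** 2))

mse_small, mse_large = shared_code_mse(X, 2), shared_code_mse(X, 6)
```

Because the two halves are correlated, a compact shared code already reconstructs both modalities reasonably well; a larger code reduces the reconstruction MSE further.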

AAAI Conference 2016 Conference Paper

Learning to Appreciate the Aesthetic Effects of Clothing

  • Jia Jia
  • Jie Huang
  • Guangyao Shen
  • Tao He
  • Zhiyuan Liu
  • Huanbo Luan
  • Chao Yan

How do people describe clothing? Words like “formal” or “casual” are usually used. However, recent works often focus on accurately recognizing or extracting visual features (e.g., sleeve length, color distribution, and clothing pattern) from clothing images. How can we bridge the gap between visual features and aesthetic words? In this paper, we formulate this task as a novel three-level framework: visual features (VF) - image-scale space (ISS) - aesthetic words space (AWS). Leveraging the art-field image-scale space as an intermediate layer, we first propose a Stacked Denoising Autoencoder Guided by Correlative Labels (SDAE-GCL) to map the visual features to the image-scale space; then, according to the semantic distances computed by WordNet::Similarity, we map the aesthetic words most often used in online clothing shops to the image-scale space as well. Employing upper-body menswear images downloaded from several global online clothing shops as experimental data, the results indicate that the proposed three-level framework captures the subtle relationship between visual features and aesthetic words better than several baselines. To demonstrate that our three-level framework and its implementation methods are universally applicable, we finally present some interesting analyses of the fashion trend of menswear over the last 10 years.
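Once both images and words live in the 2-D image-scale space, describing an image reduces to a nearest-word lookup. A minimal sketch, with hypothetical word coordinates (the real positions come from WordNet::Similarity distances, not these numbers):

```python
import math

# Hypothetical coordinates on the image scale:
# x in [-1 (warm), +1 (cool)], y in [-1 (soft), +1 (hard)].
AESTHETIC_WORDS = {
    "casual":   (-0.6, -0.4),
    "romantic": (-0.3, -0.8),
    "formal":   ( 0.5,  0.6),
    "modern":   ( 0.7,  0.2),
}

def nearest_word(point):
    """Map an image-scale coordinate (e.g., the SDAE-GCL output for an
    image) to its closest aesthetic word."""
    return min(AESTHETIC_WORDS, key=lambda w: math.dist(point, AESTHETIC_WORDS[w]))

label = nearest_word((0.6, 0.5))
```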

AAAI Conference 2016 Conference Paper

Moodee: An Intelligent Mobile Companion for Sensing Your Stress from Your Social Media Postings

  • Huijie Lin
  • Jia Jia
  • Jie Huang
  • Enze Zhou
  • Jingtian Fu
  • Yejun Liu
  • Huanbo Luan

In this demo, we build a practical mobile application, Moodee, to help detect and relieve users’ psychological stress by leveraging users’ social media data from online social networks, and provide an interactive user interface that presents users’ and friends’ psychological stress states in a visual and intuitive way. Given users’ online social media data as input, Moodee intelligently and automatically detects users’ stress states. Moreover, Moodee recommends links to users to help relieve their stress. The main technology behind this demo is a novel hybrid model, a factor graph combined with a deep neural network, which can leverage social media content and social interaction information for stress detection. We believe Moodee can be helpful to people’s mental health, which is a vital problem in the modern world.

AAAI Conference 2016 Conference Paper

Representation Learning of Knowledge Graphs with Entity Descriptions

  • Ruobing Xie
  • Zhiyuan Liu
  • Jia Jia
  • Huanbo Luan
  • Maosong Sun

Representation learning (RL) of knowledge graphs aims to project both entities and relations into a continuous low-dimensional space. Most methods concentrate on learning representations from knowledge triples indicating relations between entities. In fact, most knowledge graphs also contain concise descriptions of entities, which existing methods cannot utilize well. In this paper, we propose a novel RL method for knowledge graphs that takes advantage of entity descriptions. More specifically, we explore two encoders, continuous bag-of-words and a deep convolutional neural model, to encode the semantics of entity descriptions. We then learn knowledge representations from both triples and descriptions. We evaluate our method on two tasks: knowledge graph completion and entity classification. Experimental results on real-world datasets show that our method outperforms other baselines on both tasks, especially under the zero-shot setting, which indicates that our method is capable of building representations for novel entities from their descriptions. The source code of this paper can be obtained from https://github.com/xrb92/DKRL.
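The two ingredients, translation-based triple scoring and a CBOW description encoder for entities without structural embeddings, can be sketched in NumPy. The embeddings and vocabulary are toy values; the relation vector is constructed so the example triple holds exactly under the TransE-style assumption h + r ≈ t:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 8

# Structure-based entity embeddings (toy values).
entity = {"beijing": rng.normal(size=dim), "china": rng.normal(size=dim)}
# Built so that (beijing, capital_of, china) satisfies h + r = t exactly.
relation = {"capital_of": entity["china"] - entity["beijing"]}

def transe_score(h, r, t):
    """Translation-based energy ||h + r - t||: lower = more plausible."""
    return float(np.linalg.norm(h + r - t))

def cbow_encode(description, word_emb):
    """CBOW description encoder: average the word vectors."""
    vecs = [word_emb[w] for w in description.split() if w in word_emb]
    return np.mean(vecs, axis=0)

# Zero-shot case: a novel entity with no structural embedding is
# represented by encoding its textual description instead.
desc = "a large city in northern china"
word_emb = {w: rng.normal(size=dim) for w in desc.split()}
novel = cbow_encode(desc, word_emb)

true_score = transe_score(entity["beijing"], relation["capital_of"], entity["china"])
false_score = transe_score(novel, relation["capital_of"], entity["china"])
```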

AAAI Conference 2016 Conference Paper

Social Role-Aware Emotion Contagion in Image Social Networks

  • Yang Yang
  • Jia Jia
  • Boya Wu
  • Jie Tang

Psychological theories suggest that emotion represents the state of mind and instinctive responses of one’s cognitive system (Cannon 1927). Emotions are a complex state of feeling that results in physical and psychological changes that influence our behavior. In this paper, we study the interesting problem of emotion contagion in social networks. In particular, employing an image social network (Flickr) as the basis of our study, we try to unveil how users’ emotional statuses influence each other and how users’ positions in the social network affect their influential strength on emotion. We develop a probabilistic framework that formalizes the problem as a role-aware contagion model. The model is able to predict users’ emotional statuses based on their historical emotional statuses and social structures. Experiments on a large Flickr dataset show that the proposed model significantly outperforms several alternative methods (+31% in terms of F1-score) in predicting users’ emotional status. We also discover several intriguing phenomena. For example, the probability that a user feels happy is roughly linear in the number of friends who are also happy; but on closer inspection, the happiness probability is superlinear in the number of happy friends who act as opinion leaders (Page et al. 1999) in the network and sublinear in the number of happy friends who span structural holes (Burt 2001). This offers a new opportunity to understand the underlying mechanism of emotion contagion in online social networks.
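The linear/superlinear/sublinear shapes of those findings can be made concrete with a toy curve; the coefficients and exponents below are hypothetical illustrations of the reported shapes, not fitted values from the paper:

```python
def p_happy(n_friends=0, n_leaders=0, n_spanners=0):
    """Toy happiness probability: linear in happy friends, superlinear in
    happy opinion leaders, sublinear in happy structural-hole spanners
    (all coefficients hypothetical)."""
    p = 0.05 * n_friends + 0.02 * n_leaders ** 1.5 + 0.08 * n_spanners ** 0.5
    return min(p, 1.0)
```

The superlinear term means each additional happy opinion leader adds more than the previous one, while each additional happy structural-hole spanner adds less.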

IJCAI Conference 2016 Conference Paper

What Does Social Media Say about Your Stress?

  • Huijie Lin
  • Jia Jia
  • Liqiang Nie
  • Guangyao Shen
  • Tat-Seng Chua

With the rise of social media such as Twitter, people are more willing to convey their stressful life events via these platforms. In a sense, it is feasible to detect stress from social media data for proactive health care. In psychology, stress is composed of a stressor and a stress level, where the stressor further comprises a stressor event and a subject. So far, little attention has been paid to estimating the exact stressor and stress level from social media data, due to the following challenges: 1) stressor subject identification, 2) stressor event detection, and 3) data collection and representation. To address these problems, we devise a comprehensive scheme to measure a user's stress level from his/her social media data. In particular, we first build a benchmark dataset and extract a rich set of stress-oriented features. We then propose a novel hybrid multi-task model to detect the stressor event and subject, which is capable of modeling the relatedness among stressor events as well as stressor subjects. Finally, we look up an expert-defined stress table with the detected subject and event to estimate the stressor and stress level. Extensive experiments on real-world datasets verify the effectiveness of our scheme.
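The final table-lookup step is straightforward: the detected (subject, event) pair indexes an expert-defined table. A minimal sketch, with hypothetical entries and a hypothetical 0-5 stress scale (the paper's actual table is not reproduced here):

```python
# Hypothetical expert-defined stress table keyed by (subject, event).
STRESS_TABLE = {
    ("self", "exam"): 3,
    ("self", "work_pressure"): 4,
    ("family", "illness"): 5,
}

def estimate_stress(subject, event):
    """Look up the stress level for a detected (stressor subject, event)
    pair; 0 means no known stressor."""
    return STRESS_TABLE.get((subject, event), 0)
```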

AAAI Conference 2014 Conference Paper

How Do Your Friends on Social Media Disclose Your Emotions?

  • Yang Yang
  • Jia Jia
  • Shumei Zhang
  • Boya Wu
  • Qicong Chen
  • Juanzi Li
  • Chunxiao Xing
  • Jie Tang

Extracting emotions from images has attracted much interest, particularly with the rapid development of social networks. The emotional impact is very important for understanding the intrinsic meaning of images. Despite many studies having been done, most existing methods focus on image content but ignore the emotion of the user who published the image. One interesting question is: how does social effect correlate with the emotion expressed in an image? Specifically, can we leverage friends’ interactions (e.g., discussions) related to an image to help extract the emotions? In this paper, we formally define the problem and propose a novel emotion learning method that jointly models images posted by social users and comments added by their friends. One advantage of the model is that it can distinguish the comments that are closely related to the emotion expression of an image from the irrelevant ones. Experiments on an open Flickr dataset show that the proposed model can significantly improve the accuracy of inferring user emotions (+37.4% in terms of F1). More interestingly, we find that half of the improvement is due to interactions between the 1.0% of closest friends.
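The idea of downweighting comments that are irrelevant to an image's emotion can be sketched as a similarity-based weighting; the softmax-over-cosine scheme and the toy vectors below are illustrative assumptions, not the paper's actual probabilistic model:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def comment_weights(image_vec, comment_vecs):
    """Weight each friend's comment by its cosine similarity to the
    image's emotion representation, downweighting irrelevant comments."""
    sims = [float(np.dot(image_vec, c)
                  / (np.linalg.norm(image_vec) * np.linalg.norm(c)))
            for c in comment_vecs]
    return softmax(sims)

image = np.array([1.0, 0.0, 0.0])          # toy image emotion vector
comments = [np.array([0.9, 0.1, 0.0]),     # on-topic comment
            np.array([0.0, 0.0, 1.0])]     # irrelevant comment
w = comment_weights(image, comments)
```

The on-topic comment receives the larger weight, so it contributes more to the inferred emotion.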