Arrow Research search

Author name cluster

Jia Jia

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

29 papers
1 author row

Possible papers

29

AAAI Conference 2026 Conference Paper

Emotion-Conditioned Motion Sub-spaces with Flow Matching for Real-Time Audio-Driven Talking Heads

  • Haoyu Wang
  • Xiaozhe Xin
  • Xiaoyu Qin
  • Meiguang Jin
  • Junfeng Ma
  • Dan Xu
  • Jia Jia

Recent advances in audio-driven talking-head synthesis have brought lip-sync precision close to human perception, yet emotional fidelity and real-time inference remain open challenges. Existing pipelines typically disentangle lip articulation, facial expression, and head pose in latent space; this rigid factorization ignores the intrinsic coupling between articulation and affect (e.g., downward lip corners when sad), thus limiting expressiveness. We cast speech-conditioned facial motion as a sample from an emotion-conditioned distribution in a motion latent space. Concretely, we (i) learn a motion dictionary of orthogonal bases with an autoencoder via self-supervision, (ii) construct emotion-conditioned sub-spaces within the latent space, and (iii) design a layer-progressive cross-attention fusion module that modulates a flow-matching sampler with both audio and emotion signals. Only ten reverse ODE steps are required to generate a motion-latent trajectory, enabling real-time end-to-end latency. Extensive experiments on MEAD and RAVDESS show that our method outperforms recent GAN- and diffusion-based baselines in emotion accuracy while running at around 75 FPS on a single desktop GPU. The proposed framework delivers the first emotionally expressive Audio2Face system that simultaneously achieves lip-sync accuracy, affective realism, and real-time performance.
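
The ten-step reverse-ODE sampling mentioned above amounts to fixed-step Euler integration of a learned velocity field from t=0 to t=1. A minimal sketch, where the toy velocity field below is a hypothetical stand-in for the paper's audio- and emotion-conditioned network:

```python
def euler_sample(velocity, x0, steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        t = i * dt
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical stand-in for the learned conditional velocity field:
# a field that simply points toward one fixed "target" motion latent.
target = [1.0, -2.0, 0.5]
def toy_velocity(x, t):
    return [ti - xi for ti, xi in zip(target, x)]

sample = euler_sample(toy_velocity, [0.0, 0.0, 0.0], steps=10)
```

With ten steps, each Euler update moves a tenth of the way along the current velocity; the real system trades integration accuracy for latency in exactly this way.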

IS Journal 2026 Journal Article

Ordinal Prompt-Regularized Graph Optimal Transport for Image Ordinal Estimation

  • Xiangkai Wang
  • Kai Zhang
  • Xiaoxu Liu
  • Jia Jia
  • Maozhi Zhang

This article proposes a novel approach for image ordinal estimation, leveraging the power of optimal transport (OT) and prompt learning. Traditional ordinal regression methods primarily focus on learning a model to predict numerical scores, which may not directly reflect the intrinsic order. To address this limitation, we introduce a framework, termed ordinal prompt-regularized graph optimal transport (OPGOT), which utilizes OT to align the distribution of images with that of ordinal labels. First, we incorporate prompt learning with pretrained text encoders to construct ordinal prompts through a token-wise distance-based weighting scheme, enabling the model to capture the semantic relationships between ordinal categories. Second, OPGOT matches the graphs of image features and prompt embeddings by optimizing the OT objective with a language-image cost. Hence, the learned transport plan reflects the intrinsic ordinal relationships. We conduct extensive evaluations on four benchmark datasets covering different scenarios, demonstrating that OPGOT achieves significant improvements over existing methods.
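
The transport-plan idea above can be sketched with entropically regularized OT (Sinkhorn iterations). This is a generic illustration, not the paper's formulation: the cost matrix here is a hypothetical rank-distance "language-image" cost, and both marginals are taken uniform.

```python
import math

def sinkhorn(cost, reg=0.1, iters=200):
    """Entropically regularized OT between two uniform marginals (Sinkhorn)."""
    n, m = len(cost), len(cost[0])
    K = [[math.exp(-c / reg) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    # transport plan: how much mass flows from image i to ordinal label j
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

# Hypothetical cost: absolute distance between ordinal ranks 0..2.
cost = [[abs(i - j) for j in range(3)] for i in range(3)]
plan = sinkhorn(cost)
```

Because the cost grows with rank distance, the learned plan concentrates mass near the diagonal, which is the sense in which a transport plan "reflects the intrinsic ordinal relationships".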

YNICL Journal 2025 Journal Article

A multi-modal study on cerebrovascular dysfunction in cognitive decline of de novo Parkinson’s disease

  • Hongwei Li
  • Xiali Shao
  • Jia Jia
  • Bingyi Wang
  • Jian Wang
  • Kai Liu
  • Jinhan Chen
  • Zhensen Chen

BACKGROUND: Vascular risk factors are increasingly implicated in Parkinson's disease (PD), but the role of cerebrovascular dysfunction in early-stage PD remains unclear. Here, we investigated resting-state cerebrovascular reactivity (RS-CVR), cerebral blood flow (CBF), arterial morphological changes, and corresponding alterations in functional connectivity density (FCD) in de novo PD patients with different cognitive status. METHODS: 25 de novo PD patients with mild cognitive impairment (PD-MCI), 34 with normal cognition (PD-NC), and 48 healthy controls (HCs) underwent neuropsychological assessments and multimodal MRI. CBF was derived from arterial spin labeling, RS-CVR and FCD were generated from resting-state functional MRI, and arterial morphology was extracted from the magnitude images of multi-echo gradient echo imaging. RESULTS: RS-CVR was significantly decreased in PD patients, particularly in the left occipital gyrus and posterior cerebral artery (PCA) territories. Long-range FCD was reduced in the left inferior occipital gyrus in both PD-NC and PD-MCI compared to HCs (p = 0.005, p < 0.001). In PD-MCI, negative correlations between Stroop Color-Word Test time and RS-CVR in the distal right PCA (r = -0.71, pFDR = 0.030) and middle left PCA (r = -0.66, pFDR = 0.044) were observed. A significant correlation was found between decreased long-range FCD in the left inferior occipital gyrus and poorer Trail Making Test Part B performance (r = -0.63, pFDR = 0.029) in PD-MCI. No significant group differences in CBF were found, but significant dilation of the left PCA and compensatory CBF increases in the corresponding territory were observed in PD-MCI (r = 0.57, pFDR = 0.023). DISCUSSION: Microvascular dysfunction, rather than perfusion defects, might underlie early-stage de novo PD, especially in patients with PD-MCI.

NeurIPS Conference 2024 Conference Paper

Skinned Motion Retargeting with Dense Geometric Interaction Perception

  • Zijie Ye
  • Jia-Wei Liu
  • Jia Jia
  • Shikun Sun
  • Mike Zheng Shou

Capturing and maintaining geometric interactions among different body parts is crucial for successful motion retargeting in skinned characters. Existing approaches often overlook body geometries or add a geometry correction stage after skeletal motion retargeting. This results in conflicts between skeleton interaction and geometry correction, leading to issues such as jitter, interpenetration, and contact mismatches. To address these challenges, we introduce a new retargeting framework, MeshRet, which directly models the dense geometric interactions in motion retargeting. Initially, we establish dense mesh correspondences between characters using semantically consistent sensors (SCS), effective across diverse mesh topologies. Subsequently, we develop a novel spatio-temporal representation called the dense mesh interaction (DMI) field. This field, a collection of interacting SCS feature vectors, skillfully captures both contact and non-contact interactions between body geometries. By aligning the DMI field during retargeting, MeshRet not only preserves motion semantics but also prevents self-interpenetration and ensures contact preservation. Extensive experiments on the public Mixamo dataset and our newly collected ScanRet dataset demonstrate that MeshRet achieves state-of-the-art performance. Code is available at https://github.com/abcyzj/MeshRet.

NeurIPS Conference 2024 Conference Paper

VERIFIED: A Video Corpus Moment Retrieval Benchmark for Fine-Grained Video Understanding

  • Houlun Chen
  • Xin Wang
  • Hong Chen
  • Zeyang Zhang
  • Wei Feng
  • Bin Huang
  • Jia Jia
  • Wenwu Zhu

Existing Video Corpus Moment Retrieval (VCMR) is limited to coarse-grained understanding, which hinders precise video moment localization when given fine-grained queries. In this paper, we propose a more challenging fine-grained VCMR benchmark requiring methods to localize the best-matched moment from the corpus among other partially matched candidates. To improve dataset construction efficiency and guarantee high-quality data annotations, we propose VERIFIED, an automatic VidEo-text annotation pipeline to generate captions with RelIable FInE-grained statics and Dynamics. Specifically, we resort to large language models (LLMs) and large multimodal models (LMMs) with our proposed Statics and Dynamics Enhanced Captioning modules to generate diverse fine-grained captions for each video. To filter out inaccurate annotations caused by LLM hallucination, we propose a Fine-Granularity Aware Noise Evaluator, where we fine-tune a video foundation model with contrastive and matching losses augmented by disturbed hard negatives. With VERIFIED, we construct a more challenging fine-grained VCMR benchmark containing Charades-FIG, DiDeMo-FIG, and ActivityNet-FIG, which demonstrates a high level of annotation quality. We evaluate several state-of-the-art VCMR models on the proposed dataset, revealing that there is still significant scope for fine-grained video understanding in VCMR.
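
The hard-negative-augmented contrastive loss can be sketched with a standard InfoNCE-style objective. This is a generic illustration of the principle, not the paper's exact loss; the similarity scores and temperature below are hypothetical:

```python
import math

def info_nce(sim_pos, sims_neg, temperature=0.07):
    """InfoNCE-style contrastive loss for one (video, caption) pair.

    sim_pos: similarity to the matching caption.
    sims_neg: similarities to disturbed hard-negative captions.
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sims_neg]
    m = max(logits)  # subtract max for numerical stability
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))

easy = info_nce(0.9, [0.1, 0.2])    # distant negatives: small loss
hard = info_nce(0.9, [0.85, 0.88])  # disturbed hard negatives: larger loss
```

Hard negatives that sit close to the positive dominate the denominator, so they contribute the strongest gradient signal, which is why disturbed near-miss captions sharpen the fine-grained evaluator.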

AAAI Conference 2023 Conference Paper

What Does Your Face Sound Like? 3D Face Shape towards Voice

  • Zhihan Yang
  • Zhiyong Wu
  • Ying Shan
  • Jia Jia

Face-based speech synthesis provides a practical solution to generate voices from human faces. However, directly using 2D face images leads to problems of uninterpretability and entanglement. In this paper, to address these issues, we introduce 3D face shape, which (1) has an anatomical relationship with voice characteristics, partaking in the "bone conduction" of human timbre production, and (2) is naturally independent of irrelevant factors by excluding the blending process. We devise a three-stage framework to generate speech from 3D face shapes. Fully considering timbre production in anatomical and acquired terms, our framework incorporates three additional relevant attributes: face texture, facial features, and demographics. Experiments and subjective tests demonstrate that our method can generate utterances matching faces well, with good audio quality and voice diversity. We also explore and visualize how the voice changes with the face. Case studies show that our method upgrades face-voice inference to personalized custom-made voice creation, revealing a promising prospect in virtual human and dubbing applications.

YNICL Journal 2022 Journal Article

Locus coeruleus integrity correlates with inhibitory functions of the fronto-subthalamic ‘hyperdirect’ pathway in Parkinson’s disease

  • Biman Xu
  • Tingting He
  • Yuan Lu
  • Jia Jia
  • Barbara J. Sahakian
  • Trevor W. Robbins
  • Lirong Jin
  • Zheng Ye

A long-running debate concerns whether dopamine or noradrenaline deficiency drives response disinhibition in Parkinson's disease (PD). This study aimed to investigate whether damage to the locus coeruleus (LC) or substantia nigra (SN) might impact inhibitory functions of the fronto-subthalamic hyperdirect or fronto-striatal indirect pathway. Patients with PD (n = 29, 13 women) and matched healthy controls (n = 29, 15 women) participated in this cross-sectional study. LC and SN integrity was assessed using neuromelanin-sensitive MRI. Response inhibition was measured using fMRI with a stop-signal task. In healthy controls, LC (but not SN) integrity correlated with the stopping-related activity of the right inferior frontal gyrus (IFG) and right subthalamic nucleus (STN), which further correlated with stop-signal reaction time (SSRT). PD patients showed reduced LC integrity, longer SSRT, and lower stopping-related activity over the right IFG, pre-supplementary motor area, and right caudate nucleus than healthy controls. In PD patients, the relationship between SSRT and the fronto-subthalamic pathway was preserved. However, LC integrity no longer correlated with the stopping-related right IFG or right STN activity. No contribution of SN integrity was found during stopping. In conclusion, LC (but not SN) might modulate inhibitory functions of the right IFG-STN pathway. Damage to the LC might impact the right IFG-STN pathway during stopping, leading to response disinhibition in PD.

AAAI Conference 2021 Conference Paper

Inferring Emotion from Large-scale Internet Voice Data: A Semi-supervised Curriculum Augmentation based Deep Learning Approach

  • Suping Zhou
  • Jia Jia
  • Zhiyong Wu
  • Zhihan Yang
  • Yanfeng Wang
  • Wei Chen
  • Fanbo Meng
  • Shuo Huang

Effective emotion inference from user queries helps to give a more personified response in Voice Dialogue Applications (VDAs). The tremendous number of VDA users brings in diverse emotion expressions. How can we achieve high emotion-inference performance on large-scale internet voice data in VDAs? Traditionally, research on speech emotion recognition has been based on acted voice datasets, which have limited speakers but strong and clear emotion expressions. Inspired by this, in this paper, we propose a novel approach to leverage acted voice data with strong emotion expressions to enhance large-scale unlabeled internet voice data with diverse emotion expressions for emotion inference. Specifically, we propose a novel semi-supervised multi-modal curriculum augmentation deep learning framework. First, to learn more general emotion cues, we adopt a curriculum-learning-based epoch-wise training strategy, which trains our model guided by strong and balanced emotion samples from acted voice data and subsequently leverages weak and unbalanced emotion samples from internet voice data. Second, to employ more diverse emotion expressions, we design a Multi-path Mixmatch Multimodal Deep Neural Network (MMMD), which effectively learns feature representations for multiple modalities and trains on labeled and unlabeled data with hybrid semi-supervised methods for superior generalisation and robustness. Experiments on an internet voice dataset with 500,000 utterances show our method outperforms several alternative baselines (+10.09% in terms of F1), while an acted corpus with 2,397 utterances contributes 4.35%. To further compare our method with state-of-the-art techniques on traditional acted voice datasets, we also conduct experiments on the public dataset IEMOCAP. The results reveal the effectiveness of the proposed approach.
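
The epoch-wise curriculum idea, starting from strong acted samples and gradually mixing in weak internet samples, can be sketched as a simple batch-composition schedule. The linear ramp and the 0.8 cap below are hypothetical choices, not the paper's:

```python
def curriculum_mix(epoch, total_epochs, start_weak=0.0, end_weak=0.8):
    """Fraction of weak (internet) samples per batch, ramped linearly by epoch."""
    frac = epoch / max(1, total_epochs - 1)
    return start_weak + frac * (end_weak - start_weak)

# Early epochs rely on acted data; later epochs mix in more internet data.
schedule = [round(curriculum_mix(e, 5), 2) for e in range(5)]
```

A training loop would then draw each batch with `curriculum_mix(epoch, ...)` of its samples from the weak pool and the remainder from the strong acted pool.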

AAAI Conference 2020 Conference Paper

Mining Unfollow Behavior in Large-Scale Online Social Networks via Spatial-Temporal Interaction

  • Haozhe Wu
  • Zhiyuan Hu
  • Jia Jia
  • Yaohua Bu
  • Xiangnan He
  • Tat-Seng Chua

Online Social Networks (OSNs) evolve through two pervasive behaviors: follow and unfollow, which respectively signify relationship creation and relationship dissolution. Research on social network evolution mainly focuses on the follow behavior, while the unfollow behavior has largely been ignored. Mining unfollow behavior is challenging because a user's decision to unfollow is affected not only by the simple combination of the user's attributes, such as informativeness and reciprocity, but also by the complex interaction among them. Meanwhile, prior datasets seldom contain sufficient records for inferring such complex interaction. To address these issues, we first construct a large-scale real-world Weibo dataset, which records detailed post content and relationship dynamics of 1.8 million Chinese users. Next, we define users' attributes as two categories: spatial attributes (e.g., social role of the user) and temporal attributes (e.g., post content of the user). Leveraging the constructed dataset, we systematically study how the interaction effects between users' spatial and temporal attributes contribute to the unfollow behavior. Afterwards, we propose a novel unified model with heterogeneous information (UMHI) for unfollow prediction. Specifically, our UMHI model: 1) captures users' spatial attributes through the social network structure; 2) infers users' temporal attributes through user-posted content and unfollow history; and 3) models the interaction between spatial and temporal attributes with nonlinear MLP layers. Comprehensive evaluations on the constructed dataset demonstrate that the proposed UMHI model outperforms baseline methods by 16.44 on average in terms of precision. In addition, factor analyses verify that both spatial attributes and temporal attributes are essential for mining unfollow behavior.

AAAI Conference 2020 Conference Paper

PEIA: Personality and Emotion Integrated Attentive Model for Music Recommendation on Social Media Platforms

  • Tiancheng Shen
  • Jia Jia
  • Yan Li
  • Yihui Ma
  • Yaohua Bu
  • Hanjie Wang
  • Bo Chen
  • Tat-Seng Chua

With the rapid expansion of digital music formats, it is indispensable to recommend to users their favorite music. In music recommendation, users' personality and emotion greatly affect their music preference, in a long-term and a short-term manner respectively, while rich social media data provides effective feedback on this information. In this paper, aiming at music recommendation on social media platforms, we propose a Personality and Emotion Integrated Attentive model (PEIA), which fully utilizes social media data to comprehensively model users' long-term taste (personality) and short-term preference (emotion). Specifically, it takes full advantage of personality-oriented user features, emotion-oriented user features, and music features of multi-faceted attributes. Hierarchical attention is employed to distinguish the important factors when incorporating the latent representations of users' personality and emotion. Extensive experiments on a large real-world dataset of 171,254 users demonstrate the effectiveness of our PEIA model, which achieves an NDCG of 0.5369, outperforming the state-of-the-art methods. We also perform detailed parameter analysis and feature contribution analysis, which further verify our scheme and demonstrate the significance of co-modeling user personality and emotion in music recommendation.
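
The NDCG metric reported above is standard and worth making concrete: discounted cumulative gain of the predicted ranking, normalized by the ideal ranking. A minimal reference implementation with hypothetical relevance scores:

```python
import math

def ndcg(relevances, k=None):
    """Normalized discounted cumulative gain for one ranked list.

    relevances: graded relevance of items in the predicted rank order.
    """
    k = k or len(relevances)
    def dcg(rels):
        # position i contributes rel / log2(i + 2): rank 1 divides by log2(2)=1
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

score = ndcg([3, 2, 0, 1])  # one mildly mis-ordered list (toy relevances)
```

A perfectly ordered list scores exactly 1.0, so PEIA's 0.5369 is read against that ceiling.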

IJCAI Conference 2019 Conference Paper

Design and Implementation of a Disambiguity Framework for Smart Voice Controlled Devices

  • Kehua Lei
  • Tianyi Ma
  • Jia Jia
  • Cunjun Zhang
  • Zhihan Yang

With about 100 million people using them recently, SVCDs (Smart Voice Controlled Devices) are becoming commonplace. Whether at home or in an office, multiple appliances are usually under the control of a single SVCD, and several people may manipulate an SVCD simultaneously. However, present SVCDs fail to handle these situations appropriately. In this paper, we propose a novel framework for SVCDs to eliminate the ambiguity of orders in single-user and multi-user scenarios. We also design an algorithm combining Word2Vec and emotion detection for the device to resolve ambiguity. Finally, we apply our framework to a virtual smart home scene, and its performance indicates that our strategy resolves these problems commendably.

IJCAI Conference 2019 Conference Paper

Towards Discriminative Representation Learning for Speech Emotion Recognition

  • Runnan Li
  • Zhiyong Wu
  • Jia Jia
  • Yaohua Bu
  • Sheng Zhao
  • Helen Meng

In intelligent speech interaction, automatic speech emotion recognition (SER) plays an important role in understanding user intention. While sentimental speech has different speaker characteristics but similar acoustic attributes, one vital challenge in SER is how to learn robust and discriminative representations for emotion inference. In this paper, inspired by human emotion perception, we propose a novel representation learning component (RLC) for SER systems, which is constructed with Multi-head Self-attention and a Global Context-aware Attention Long Short-Term Memory Recurrent Neural Network (GCA-LSTM). With the ability of the Multi-head Self-attention mechanism to model element-wise correlative dependencies, the RLC can exploit the common patterns of sentimental speech features to import more emotion-salient information into representation learning. By employing GCA-LSTM, the RLC can selectively focus on emotion-salient factors with consideration of the entire utterance context, and gradually produce discriminative representations for emotion inference. Experiments on the public emotional benchmark database IEMOCAP and a large realistic interaction database demonstrate the superior performance of the proposed SER framework, with 6.6% to 26.7% relative improvement in unweighted accuracy compared to state-of-the-art techniques.
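
The self-attention building block referenced above is scaled dot-product attention. A single-head, pure-Python sketch on a toy two-frame sequence (the multi-head and GCA-LSTM parts are omitted):

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over a sequence of feature vectors."""
    d = len(queries[0])
    out = []
    for q in queries:
        # similarity of this frame to every frame, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]  # stable softmax
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

x = [[1.0, 0.0], [0.0, 1.0]]  # toy frame features
y = attention(x, x, x)        # self-attention: each frame attends over all frames
```

Each output frame is a convex combination of all frames, weighted by similarity, which is how element-wise correlative dependencies across an utterance are modeled.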

IJCAI Conference 2018 Conference Paper

Cross-Domain Depression Detection via Harvesting Social Media

  • Tiancheng Shen
  • Jia Jia
  • Guangyao Shen
  • Fuli Feng
  • Xiangnan He
  • Huanbo Luan
  • Jie Tang
  • Thanassis Tiropanis

Depression detection is a significant issue for human well-being. In previous studies, online detection has proven effective on Twitter, enabling proactive care for depressed users. However, owing to cultural differences, replicating the method on other social media platforms, such as Chinese Weibo, might lead to poor performance because of insufficient available labeled (self-reported depression) data for model training. In this paper, we study the interesting but challenging problem of enhancing detection in a certain target domain (e.g., Weibo) with ample Twitter data as the source domain. We first systematically analyze the depression-related feature patterns across domains and summarize two major detection challenges, namely isomerism and divergency. We further propose a cross-domain Deep Neural Network model with a Feature Adaptive Transformation & Combination strategy (DNN-FATC) that transfers the relevant information across heterogeneous domains. Experiments demonstrate improved performance compared to existing heterogeneous transfer methods or training directly in the target domain (over 3.4% improvement in F1), indicating the potential of our model to enable depression detection via social media for more countries with different cultural settings.
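
Cross-domain feature adaptation of the kind described can be sketched at its simplest as per-dimension moment matching: rescale each source-domain feature so its mean and spread match the target domain. This is a minimal stand-in for illustration, not the paper's FATC strategy:

```python
import statistics as st

def align_features(source, target):
    """Shift/scale each source feature dimension to the target's mean and std."""
    mu_s = [st.mean(col) for col in zip(*source)]
    sd_s = [st.pstdev(col) or 1.0 for col in zip(*source)]  # guard zero spread
    mu_t = [st.mean(col) for col in zip(*target)]
    sd_t = [st.pstdev(col) or 1.0 for col in zip(*target)]
    return [[(x - ms) / ss * ts + mt
             for x, ms, ss, ts, mt in zip(row, mu_s, sd_s, sd_t, mu_t)]
            for row in source]

# Toy 1-D features: hypothetical "Twitter" source vs. "Weibo" target scales.
aligned = align_features([[0.0], [2.0]], [[10.0], [14.0]])
```

After alignment the source features occupy the target's feature scale, so a classifier trained on the adapted source is no longer misled by raw distribution shift.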

AAAI Conference 2018 Conference Paper

Inferring Emotion from Conversational Voice Data: A Semi-Supervised Multi-Path Generative Neural Network Approach

  • Suping Zhou
  • Jia Jia
  • Qi Wang
  • Yufei Dong
  • Yufeng Yin
  • Kehua Lei

To give a more humanized response in Voice Dialogue Applications (VDAs), inferring emotion states from users' queries may play an important role. However, in VDAs, we have a tremendous number of users and a massive scale of unlabeled data with high-dimensional features from multimodal information, which challenge traditional speech emotion recognition methods. In this paper, to better infer emotion from conversational voice data, we propose a semi-supervised multi-path generative neural network. Specifically, we first build a novel supervised multi-path deep neural network framework. To avoid high-dimensional input, raw features are trained by groups in local classifiers. The high-level features of each local classifier are then concatenated as input to a global classifier. These two kinds of classifiers are trained simultaneously through a single objective function to achieve more effective and discriminative emotion inference. To further solve the labeled-data-scarcity problem, we extend the multi-path deep neural network to a generative model based on a semi-supervised variational autoencoder (semi-VAE), which is able to train on labeled and unlabeled data simultaneously. Experiments based on a 24,000-sample real-world dataset collected from the Sogou Voice Assistant (SVAD13) and the benchmark dataset IEMOCAP show that our method significantly outperforms the existing state-of-the-art results.

AAAI Conference 2018 System Paper

Lookine: Let the Blind Hear a Smile

  • Yaohua Bu
  • Jia Jia
  • Yuhan Tang
  • Xuan Zang
  • Tianyu Gao

It is believed that nonverbal visual information, including facial expressions, facial micro-actions, and head movements, plays a significant role in fundamental social communication. Unfortunately, the blind cannot access such necessary information. Therefore, we propose a social-assistant system, Lookine, to help them go beyond this limitation. For Lookine, we apply novel techniques including facial expression recognition, facial action recognition, and head pose estimation, and obey “barrier-free” principles in our design. In experiments, the algorithm evaluation and a user study show that our system has promising accuracy, good real-time performance, and great user experience.

IJCAI Conference 2018 Conference Paper

Mental Health Computing via Harvesting Social Media Data

  • Jia Jia

Mental health has become a general concern of people nowadays. It is of vital importance to detect and manage mental health issues before they turn into severe problems. Traditional psychological interventions are reliable, but expensive and slow to take effect. With the rapid development of social media, people are increasingly sharing their daily lives and interacting with friends online. Via harvesting social media data, we comprehensively study the detection of mental wellness, with two typical mental problems, stress and depression, as specific examples. Starting with binary user-level detection, we expand our research towards multiple contexts, by considering the trigger and level of mental health problems, and involving different social media platforms of different cultures. We construct several benchmark real-world datasets for analysis and propose a series of multi-modal detection models, whose effectiveness is verified by extensive experiments. We also conduct in-depth analyses to reveal the underlying online behaviors regarding these mental health issues.

AAAI Conference 2017 System Paper

A Virtual Personal Fashion Consultant: Learning from the Personal Preference of Fashion

  • Jingtian Fu
  • Yejun Liu
  • Jia Jia
  • Yihui Ma
  • Fanhang Meng
  • Huan Huang

Besides fashion, personalization is another important factor in dressing. How to balance fashion trends and personal preference to better appreciate wearing is a non-trivial task. In previous work we developed a demo, Magic Mirror, to recommend clothing collocations based on the fashion trend. However, the diversity of people's aesthetics is huge. To meet different demands, Magic Mirror is upgraded in this paper: it can give recommendations by considering both the fashion trend and personal preference, and work as a private clothing consultant. For more suitable recommendations, the virtual consultant learns users' tastes and preferences from their behaviors using a genetic algorithm. Users can get collocations or matched top/bottom recommendations after choosing an occasion and style. They can also get a report about their fashion state and aesthetic standpoint on recent wearing.
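
Preference learning with a genetic algorithm can be sketched as evolving a weight vector toward whatever the user's feedback rewards. Everything below is a toy: the fitness function standing in for "user preference", the population size, and the mutation rate are all hypothetical:

```python
import random

def evolve(fitness, dim=4, pop_size=20, gens=40, seed=0):
    """Minimal genetic algorithm: tournament selection, 1-point crossover, mutation."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(dim)] for _ in range(pop_size)]
    for _ in range(gens):
        nxt = []
        for _ in range(pop_size):
            p1 = max(rng.sample(pop, 2), key=fitness)  # tournament of 2
            p2 = max(rng.sample(pop, 2), key=fitness)
            cut = rng.randrange(dim)
            child = p1[:cut] + p2[cut:]                # 1-point crossover
            if rng.random() < 0.2:                     # occasional mutation
                child[rng.randrange(dim)] = rng.random()
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

# Hypothetical preference: the user "likes" weights close to [1, 0, 1, 0].
target = [1.0, 0.0, 1.0, 0.0]
fit = lambda w: -sum((wi - ti) ** 2 for wi, ti in zip(w, target))
best = evolve(fit)
```

In a real consultant, `fitness` would be derived from user behavior (clicks, saves, rejections) rather than a known target vector.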

JBHI Journal 2017 Journal Article

Analyzing and Identifying Teens’ Stressful Periods and Stressor Events From a Microblog

  • Qi Li
  • Yuanyuan Xue
  • Liang Zhao
  • Jia Jia
  • Ling Feng

Increased health problems among adolescents caused by psychological stress have aroused worldwide attention. Long-standing stress without targeted assistance and guidance negatively impacts the healthy growth of adolescents, threatening the future development of our society. So far, research has focused on detecting adolescent psychological stress revealed by each individual post on microblogs. However, beyond stressful moments, identifying teens' stressful periods and the stressor events that trigger each stressful period is more desirable to understand the stress from appearance to essence. In this paper, we define the problem of identifying teens' stressful periods and stressor events from the open social media microblog. Starting from a case study of adolescents' posting behaviors during stressful school events, we build a Poisson-based probability model for the correlation between stressor events and stressful posting behaviors through a series of posts on Tencent Weibo (referred to as the microblog throughout the paper). With the model, we discover teens' maximal stressful periods and further extract details of possible stressor events that cause the stressful periods. We generalize and present the extracted stressor events in a hierarchy based on common stress dimensions and event types. Taking 122 scheduled stressful study-related events in a high school as the ground truth, we test the approach on 124 students' posts from January 1, 2012 to February 1, 2015 and obtain promising experimental results: (stressful periods: recall 0.761, precision 0.737, and F1-measure 0.734) and (top-3 stressor events: recall 0.763, precision 0.756, and F1-measure 0.759). The most prominent stressor events extracted are in the self-cognition domain, followed by the school life domain. This conforms to the adolescent psychological finding that problems in school life are usually accompanied by teens' inner cognition problems. Compared with the state-of-the-art top-1 personal life event detection approach, our stressor event detection method is 13.72% higher in precision, 19.18% higher in recall, and 16.50% higher in F1-measure, demonstrating the effectiveness of our proposed framework.
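
The Poisson-based modeling idea can be illustrated in miniature: treat daily post counts as Poisson-distributed under a baseline rate and flag days whose counts are improbably high. The baseline rate, counts, and threshold below are hypothetical, and this is a one-day sketch of what the paper does over whole periods:

```python
import math

def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam): 1 minus the CDF up to k-1."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(k))
    return 1.0 - cdf

def stressful_days(daily_counts, baseline_rate, alpha=0.01):
    """Indices of days whose post count is improbable under the Poisson baseline."""
    return [d for d, c in enumerate(daily_counts)
            if poisson_sf(c, baseline_rate) < alpha]

# Toy week of post counts around a hypothetical exam: days 3-4 spike.
days = stressful_days([2, 1, 3, 12, 11, 2, 1], baseline_rate=2.0)
```

Consecutive flagged days would then be merged into a candidate stressful period, from which stressor-event details are extracted.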

AAAI Conference 2017 System Paper

AniDraw: When Music and Dance Meet Harmoniously

  • Yaohua Bu
  • Tang Taoran
  • Jia Jia
  • Ma Zhiyuan
  • Wu Songyao
  • You Yuming

In this paper, we present a demo, AniDraw, which can help users practice the coordination between their hands, mouth, and eyes by combining the elements of music, painting, and dance. Users can sketch a cartoon character through multi-touch screens and then hum songs, which drives the cartoon character to dance, creating a lively animation. In the technical realization, we apply an acoustic-driving mechanism in which AniDraw extracts time-domain acoustic features to map to the intensity of dances, frequency-domain ones to map to the style of dances, and high-level ones, including onsets and tempos, to map to the start, duration, and speed of dances. AniDraw can not only stimulate users' enthusiasm for artistic creation, but also enhance their aesthetic sense of harmony.

IJCAI Conference 2017 Conference Paper

Depression Detection via Harvesting Social Media: A Multimodal Dictionary Learning Solution

  • Guangyao Shen
  • Jia Jia
  • Liqiang Nie
  • Fuli Feng
  • Cunjun Zhang
  • Tianrui Hu
  • Tat-Seng Chua
  • Wenwu Zhu

Depression is a major contributor to the overall global burden of disease. Traditionally, doctors diagnose depressed people face to face by referring to clinical depression criteria. However, more than 70% of patients do not consult doctors at the early stages of depression, which leads to further deterioration of their condition. Meanwhile, people are increasingly relying on social media to disclose emotions and share their daily lives; thus social media has successfully been leveraged to help detect physical and mental diseases. Inspired by these observations, our work aims at timely depression detection via harvesting social media data. We construct well-labeled depression and non-depression datasets on Twitter, and extract six depression-related feature groups covering not only the clinical depression criteria, but also online behaviors on social media. With these feature groups, we propose a multimodal depressive dictionary learning model to detect depressed users on Twitter. A series of experiments are conducted to validate this model, which outperforms several baselines by 3% to 10%. Finally, we analyze a large-scale dataset on Twitter to reveal the underlying online behaviors of depressed and non-depressed users.

AAAI Conference 2017 Conference Paper

Multi-Task Deep Learning for User Intention Understanding in Speech Interaction Systems

  • Yishuang Ning
  • Jia Jia
  • Zhiyong Wu
  • Runnan Li
  • Yongsheng An
  • Yanfeng Wang
  • Helen Meng

Speech interaction systems have been gaining popularity in recent years. The main purpose of these systems is to generate more satisfactory responses according to users' speech utterances, in which the most critical problem is to analyze user intention. Research shows that user intention conveyed through speech is expressed not only by content, but is also closely related to users' speaking manners (e.g., with or without acoustic emphasis). How to incorporate these heterogeneous attributes to infer user intention remains an open problem. In this paper, we define Intention Prominence (IP) as the semantic combination of focus by text and emphasis by speech, and propose a multi-task deep learning framework to predict IP. Specifically, we first use long short-term memory (LSTM), which is capable of modeling long short-term contextual dependencies, to detect focus and emphasis, and incorporate the tasks of focus and emphasis detection with multi-task learning (MTL) to reinforce the performance of each other. We then employ a Bayesian network (BN) to incorporate multimodal features (focus, emphasis, and location reflecting users' dialect conventions) to predict IP based on feature correlations. Experiments on a dataset of 135,566 utterances collected from the real-world Sogou Voice Assistant illustrate that our method can outperform the comparison methods by 6.9-24.5% in terms of F1-measure. Moreover, real practice in the Sogou Voice Assistant indicates that our method can improve performance on user intention understanding by 7%.

AAAI Conference 2017 System Paper

SenseRun: Real-Time Running Routes Recommendation towards Providing Pleasant Running Experiences

  • Jiayu Long
  • Jia Jia
  • Han Xu

In this demo, we develop a mobile running application, SenseRun, that incorporates landscape experiences into route recommendation. We first define landscape experience (perceived enjoyment of the landscape that motivates running) in terms of public natural area and traffic density. Based on landscape experience, we categorize locations into three types (natural, leisure, and traffic space) and assign each a different base weight. Real-time context factors (weather, season, and hour of the day) are used to adjust the weights. We propose a multi-attribute method that recommends weighted routes based on an MVT (Marginal Value Theorem) k-shortest-paths algorithm. We also use a landscape-aware sounds algorithm as a supplement to the landscape experience. Experimental results show that SenseRun can enhance running experiences and helps promote regular physical activity.
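The idea of context-adjusted space-type weights can be illustrated with a small Dijkstra sketch; the graph, multipliers, and weather rule below are hypothetical, and the actual system uses an MVT-based k-shortest-paths algorithm rather than plain shortest path:

```python
import heapq

# Hypothetical base cost multipliers for the three space types;
# lower cost means a more pleasant segment to run through.
BASE = {"natural": 0.6, "leisure": 0.8, "traffic": 1.5}

def adjusted(space_type, weather):
    """Adjust a segment's multiplier for real-time context (toy rule)."""
    w = BASE[space_type]
    if weather == "rain" and space_type == "natural":
        w *= 3.0  # open natural areas become unattractive in rain
    return w

def best_route(graph, src, dst, weather):
    """Dijkstra over cost = segment length * context-adjusted multiplier."""
    pq, seen = [(0.0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, length, stype in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(
                    pq, (cost + length * adjusted(stype, weather), nxt, path + [nxt]))
    return float("inf"), []

# Toy segment graph: node -> [(next_node, length_m, space_type), ...]
graph = {
    "start": [("park", 400, "natural"), ("road", 250, "traffic")],
    "park":  [("finish", 300, "leisure")],
    "road":  [("finish", 300, "traffic")],
}

sunny_cost, sunny_path = best_route(graph, "start", "finish", "sunny")
rainy_cost, rainy_path = best_route(graph, "start", "finish", "rain")
```

Under these toy numbers, the recommended route shifts from the park to the road segment when it rains.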

AAAI Conference 2017 Conference Paper

Towards Better Understanding the Clothing Fashion Styles: A Multimodal Deep Learning Approach

  • Yihui Ma
  • Jia Jia
  • Suping Zhou
  • Jingtian Fu
  • Yejun Liu
  • Zijian Tong

In this paper, we aim to better understand clothing fashion styles. Two challenges remain: 1) how to quantitatively describe the fashion styles of various clothing, and 2) how to model the subtle relationship between visual features and fashion styles, especially considering clothing collocations. Using the words that people commonly use to describe clothing fashion styles on shopping websites, we build a Fashion Semantic Space (FSS) based on Kobayashi’s aesthetics theory to describe clothing fashion styles quantitatively and universally. We then propose a novel fashion-oriented multimodal deep learning model, the Bimodal Correlative Deep Autoencoder (BCDA), to capture the internal correlation in clothing collocations. Employing a benchmark dataset we build with 32,133 full-body fashion show images, we use BCDA to map the visual features to the FSS. The experimental results indicate that our model outperforms several alternative baselines by 13% in terms of MSE, confirming that our model can better understand clothing fashion styles. To further demonstrate the advantages of our model, we conduct some interesting case studies, including fashion trend analyses of brands, clothing collocation recommendation, etc.
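The core intuition, that upper- and lower-body features share a joint low-dimensional code because collocations are correlated, can be shown with a drastically simplified linear analogue of BCDA (an SVD stands in for the deep autoencoder; all data and dimensions are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy visual features for 40 outfits: correlated upper- and lower-body halves.
upper = rng.normal(size=(40, 6))
lower = 0.5 * upper + 0.5 * rng.normal(size=(40, 6))
X = np.hstack([upper, lower])  # joint bimodal representation

def shared_code_mse(X, k):
    """Best rank-k linear 'autoencoder' of the joint features via SVD."""
    mu = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mu, full_matrices=False)
    recon = (U[:, :k] * s[:k]) @ Vt[:k] + mu
    return float(np.mean((X - recon) ** 2))

mse_small, mse_large = shared_code_mse(X, 2), shared_code_mse(X, 6)
```

Because the two halves are correlated, a compact shared code already reconstructs both modalities reasonably well; a larger code reduces the reconstruction MSE further.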

AAAI Conference 2016 Conference Paper

Learning to Appreciate the Aesthetic Effects of Clothing

  • Jia Jia
  • Jie Huang
  • Guangyao Shen
  • Tao He
  • Zhiyuan Liu
  • Huanbo Luan
  • Chao Yan

How do people describe clothing? Words like “formal” or “casual” are usually used. However, recent works often focus on accurately recognizing or extracting visual features (e.g., sleeve length, color distribution, and clothing pattern) from clothing images. How can we bridge the gap between visual features and aesthetic words? In this paper, we formulate this task as a novel three-level framework: visual features (VF) - image-scale space (ISS) - aesthetic words space (AWS). Leveraging the art-field image-scale space as an intermediate layer, we first propose a Stacked Denoising Autoencoder Guided by Correlative Labels (SDAE-GCL) to map the visual features to the image-scale space; then, according to the semantic distances computed by WordNet::Similarity, we map the aesthetic words most often used in online clothing shops to the image-scale space as well. Employing upper-body menswear images downloaded from several global online clothing shops as experimental data, the results indicate that the proposed three-level framework captures the subtle relationship between visual features and aesthetic words better than several baselines. To demonstrate that our three-level framework and its implementation methods are universally applicable, we finally present some interesting analyses of the fashion trend of menswear over the last 10 years.
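Once both images and words live in the 2-D image-scale space, describing an image reduces to a nearest-word lookup. A minimal sketch, with hypothetical word coordinates (the real positions come from WordNet::Similarity distances, not these numbers):

```python
import math

# Hypothetical coordinates on the image scale:
# x in [-1 (warm), +1 (cool)], y in [-1 (soft), +1 (hard)].
AESTHETIC_WORDS = {
    "casual":   (-0.6, -0.4),
    "romantic": (-0.3, -0.8),
    "formal":   ( 0.5,  0.6),
    "modern":   ( 0.7,  0.2),
}

def nearest_word(point):
    """Map an image-scale coordinate (e.g., the SDAE-GCL output for an
    image) to its closest aesthetic word."""
    return min(AESTHETIC_WORDS, key=lambda w: math.dist(point, AESTHETIC_WORDS[w]))

label = nearest_word((0.6, 0.5))
```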

AAAI Conference 2016 Conference Paper

Moodee: An Intelligent Mobile Companion for Sensing Your Stress from Your Social Media Postings

  • Huijie Lin
  • Jia Jia
  • Jie Huang
  • Enze Zhou
  • Jingtian Fu
  • Yejun Liu
  • Huanbo Luan

In this demo, we build a practical mobile application, Moodee, to help detect and relieve users’ psychological stress by leveraging users’ social media data from online social networks, and provide an interactive user interface that presents users’ and friends’ psychological stress states in a visual and intuitive way. Given users’ online social media data as input, Moodee intelligently and automatically detects users’ stress states. Moreover, Moodee recommends links to users to help relieve their stress. The main technology behind this demo is a novel hybrid model, a factor graph combined with a deep neural network, which can leverage social media content and social interaction information for stress detection. We believe Moodee can be helpful to people’s mental health, which is a vital problem in the modern world.

AAAI Conference 2016 Conference Paper

Representation Learning of Knowledge Graphs with Entity Descriptions

  • Ruobing Xie
  • Zhiyuan Liu
  • Jia Jia
  • Huanbo Luan
  • Maosong Sun

Representation learning (RL) of knowledge graphs aims to project both entities and relations into a continuous low-dimensional space. Most methods concentrate on learning representations from knowledge triples indicating relations between entities. In fact, most knowledge graphs also contain concise descriptions of entities, which existing methods cannot utilize well. In this paper, we propose a novel RL method for knowledge graphs that takes advantage of entity descriptions. More specifically, we explore two encoders, continuous bag-of-words and a deep convolutional neural model, to encode the semantics of entity descriptions. We then learn knowledge representations from both triples and descriptions. We evaluate our method on two tasks: knowledge graph completion and entity classification. Experimental results on real-world datasets show that our method outperforms other baselines on both tasks, especially under the zero-shot setting, which indicates that our method is capable of building representations for novel entities from their descriptions. The source code of this paper can be obtained from https://github.com/xrb92/DKRL.
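The two ingredients, translation-based triple scoring and a CBOW description encoder for entities without structural embeddings, can be sketched in NumPy. The embeddings and vocabulary are toy values; the relation vector is constructed so the example triple holds exactly under the TransE-style assumption h + r ≈ t:

```python
import numpy as np

rng = np.random.default_rng(3)
dim = 8

# Structure-based entity embeddings (toy values).
entity = {"beijing": rng.normal(size=dim), "china": rng.normal(size=dim)}
# Built so that (beijing, capital_of, china) satisfies h + r = t exactly.
relation = {"capital_of": entity["china"] - entity["beijing"]}

def transe_score(h, r, t):
    """Translation-based energy ||h + r - t||: lower = more plausible."""
    return float(np.linalg.norm(h + r - t))

def cbow_encode(description, word_emb):
    """CBOW description encoder: average the word vectors."""
    vecs = [word_emb[w] for w in description.split() if w in word_emb]
    return np.mean(vecs, axis=0)

# Zero-shot case: a novel entity with no structural embedding is
# represented by encoding its textual description instead.
desc = "a large city in northern china"
word_emb = {w: rng.normal(size=dim) for w in desc.split()}
novel = cbow_encode(desc, word_emb)

true_score = transe_score(entity["beijing"], relation["capital_of"], entity["china"])
false_score = transe_score(novel, relation["capital_of"], entity["china"])
```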

AAAI Conference 2016 Conference Paper

Social Role-Aware Emotion Contagion in Image Social Networks

  • Yang Yang
  • Jia Jia
  • Boya Wu
  • Jie Tang

Psychological theories suggest that emotion represents the state of mind and instinctive responses of one’s cognitive system (Cannon 1927). Emotions are a complex state of feeling that results in physical and psychological changes that influence our behavior. In this paper, we study the interesting problem of emotion contagion in social networks. In particular, employing an image social network (Flickr) as the basis of our study, we try to unveil how users’ emotional statuses influence each other and how users’ positions in the social network affect their influential strength on emotion. We develop a probabilistic framework that formalizes the problem as a role-aware contagion model. The model is able to predict users’ emotional statuses based on their historical emotional statuses and social structures. Experiments on a large Flickr dataset show that the proposed model significantly outperforms several alternative methods (+31% in terms of F1-score) in predicting users’ emotional status. We also discover several intriguing phenomena. For example, the probability that a user feels happy is roughly linear in the number of friends who are also happy; but on closer inspection, the happiness probability is superlinear in the number of happy friends who act as opinion leaders (Page et al. 1999) in the network and sublinear in the number of happy friends who span structural holes (Burt 2001). This offers a new opportunity to understand the underlying mechanism of emotion contagion in online social networks.
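The linear/superlinear/sublinear shapes of those findings can be made concrete with a toy curve; the coefficients and exponents below are hypothetical illustrations of the reported shapes, not fitted values from the paper:

```python
def p_happy(n_friends=0, n_leaders=0, n_spanners=0):
    """Toy happiness probability: linear in happy friends, superlinear in
    happy opinion leaders, sublinear in happy structural-hole spanners
    (all coefficients hypothetical)."""
    p = 0.05 * n_friends + 0.02 * n_leaders ** 1.5 + 0.08 * n_spanners ** 0.5
    return min(p, 1.0)
```

The superlinear term means each additional happy opinion leader adds more than the previous one, while each additional happy structural-hole spanner adds less.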

IJCAI Conference 2016 Conference Paper

What Does Social Media Say about Your Stress?

  • Huijie Lin
  • Jia Jia
  • Liqiang Nie
  • Guangyao Shen
  • Tat-Seng Chua

With the rise of social media such as Twitter, people are more willing to convey their stressful life events via these platforms. In a sense, it is feasible to detect stress from social media data for proactive health care. In psychology, stress is composed of a stressor and a stress level, where the stressor further comprises a stressor event and a subject. So far, little attention has been paid to estimating the exact stressor and stress level from social media data, due to the following challenges: 1) stressor subject identification, 2) stressor event detection, and 3) data collection and representation. To address these problems, we devise a comprehensive scheme to measure a user's stress level from his/her social media data. In particular, we first build a benchmark dataset and extract a rich set of stress-oriented features. We then propose a novel hybrid multi-task model to detect the stressor event and subject, which is capable of modeling the relatedness among stressor events as well as stressor subjects. Finally, we look up an expert-defined stress table with the detected subject and event to estimate the stressor and stress level. Extensive experiments on real-world datasets verify the effectiveness of our scheme.
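The final table-lookup step is straightforward: the detected (subject, event) pair indexes an expert-defined table. A minimal sketch, with hypothetical entries and a hypothetical 0-5 stress scale (the paper's actual table is not reproduced here):

```python
# Hypothetical expert-defined stress table keyed by (subject, event).
STRESS_TABLE = {
    ("self", "exam"): 3,
    ("self", "work_pressure"): 4,
    ("family", "illness"): 5,
}

def estimate_stress(subject, event):
    """Look up the stress level for a detected (stressor subject, event)
    pair; 0 means no known stressor."""
    return STRESS_TABLE.get((subject, event), 0)
```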

AAAI Conference 2014 Conference Paper

How Do Your Friends on Social Media Disclose Your Emotions?

  • Yang Yang
  • Jia Jia
  • Shumei Zhang
  • Boya Wu
  • Qicong Chen
  • Juanzi Li
  • Chunxiao Xing
  • Jie Tang

Extracting emotions from images has attracted much interest, particularly with the rapid development of social networks. The emotional impact is very important for understanding the intrinsic meaning of images. Despite many studies having been done, most existing methods focus on image content but ignore the emotion of the user who published the image. One interesting question is: how does social effect correlate with the emotion expressed in an image? Specifically, can we leverage friends’ interactions (e.g., discussions) related to an image to help extract the emotions? In this paper, we formally define the problem and propose a novel emotion learning method that jointly models images posted by social users and comments added by their friends. One advantage of the model is that it can distinguish the comments that are closely related to the emotion expression of an image from the irrelevant ones. Experiments on an open Flickr dataset show that the proposed model can significantly improve the accuracy of inferring user emotions (+37.4% in terms of F1). More interestingly, we find that half of the improvement is due to interactions between the 1.0% of closest friends.
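The idea of downweighting comments that are irrelevant to an image's emotion can be sketched as a similarity-based weighting; the softmax-over-cosine scheme and the toy vectors below are illustrative assumptions, not the paper's actual probabilistic model:

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())
    return e / e.sum()

def comment_weights(image_vec, comment_vecs):
    """Weight each friend's comment by its cosine similarity to the
    image's emotion representation, downweighting irrelevant comments."""
    sims = [float(np.dot(image_vec, c)
                  / (np.linalg.norm(image_vec) * np.linalg.norm(c)))
            for c in comment_vecs]
    return softmax(sims)

image = np.array([1.0, 0.0, 0.0])          # toy image emotion vector
comments = [np.array([0.9, 0.1, 0.0]),     # on-topic comment
            np.array([0.0, 0.0, 1.0])]     # irrelevant comment
w = comment_weights(image, comments)
```

The on-topic comment receives the larger weight, so it contributes more to the inferred emotion.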