Author name cluster

Jia Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

73 papers

2 author rows

AAAI Conference 2026 Conference Paper

Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation

Xuesong Zhang
Yunbo Xu
Jia Li
Ruonan Liu
Zhenzhen Hu

Navigating unseen environments based on natural language instructions remains difficult for egocentric agents in Vision-and-Language Navigation (VLN). Intuitively, humans inherently ground concrete semantic knowledge within spatial layouts during indoor navigation. Although previous studies have introduced diverse environmental representations to enhance reasoning, other co-occurrence modalities are often naively concatenated with RGB features, resulting in suboptimal utilization of each modality's distinct contribution. Inspired by this, we propose a hierarchical Semantic Understanding and Spatial Awareness (SUSA) architecture to enable agents to perceive and ground environments at diverse scales. Specifically, the Textual Semantic Understanding (TSU) module supports local action prediction by generating view-level descriptions, thereby capturing fine-grained environmental semantics and narrowing the modality gap between instructions and environments. Complementarily, the Depth-enhanced Spatial Perception (DSP) module incrementally constructs a trajectory-level depth exploration map, providing the agent with a coarse-grained comprehension of the global spatial layout. Extensive experiments demonstrate that SUSA's hierarchical representation enrichment not only boosts the navigation performance of the baseline on discrete VLN benchmarks (REVERIE, R2R, and SOON), but also exhibits superior generalization to the continuous R2R-CE.

AAAI Conference 2026 Conference Paper

BAG: Benchmarking Anomaly Detection on Dynamic Graphs

Fengrui Hua
Yiyan Qi
Zikai Wei
Yuxing Tian
Chengjin Xu
Xiaojun Wu
Jia Li
Jian Guo

Anomaly detection in dynamic graphs is a critical area of research that focuses on identifying abnormal components within evolving graph structures that deviate significantly from typical patterns. Despite advancements in traditional temporal pattern mining and deep learning techniques, a comprehensive benchmarking framework for Dynamic Graph Anomaly Detection (DyGAD) has been lacking. To address this gap, we introduce BAG, the first comprehensive benchmark specifically designed for anomaly detection on dynamic graphs. BAG enables extensive evaluation of 25 leading DyGAD models, covering both classical approaches and advanced Dynamic Graph Neural Networks (DGNNs), across 10 diverse real-world datasets that include both synthetic and naturally occurring anomalies. The framework supports evaluations at both the edge and node levels, offering a robust tool to advance DyGAD research. Our main finding is that Continuous-time Dynamic Graph (CTDG) models demonstrate superior performance and potential in detecting anomalies in dynamic graph edges, compared to Discrete-time Dynamic Graph (DTDG) models. Furthermore, the results reveal that existing methods are less effective at detecting organic anomalies, primarily due to the presence of temporal anomalies and highly imbalanced samples. The proposed BAG benchmark significantly enhances the evaluation of DyGAD methods by improving dataset selection, metric application, and model training. Moreover, BAG supports reproducibility and further exploration in this field by integrating all models, datasets, and evaluation protocols into an open-source repository.

JBHI Journal 2026 Journal Article

Beyond NLL: Pathwise Cross-Entropy Loss for Discriminative and Calibrated Event-Time Survival Prediction

Jingmin Long
Jia Li
Jesper Kers
Fons J. Verbeek

Deep survival models are increasingly used for time-to-event prediction under censoring, yet training objectives remain a bottleneck. The widely used discrete-time negative log-likelihood (NLL) supervises hazards and can suffer from temporal information imbalance and gradient attenuation, yielding early-dominated probability mass and degraded late-horizon calibration, especially under heavy censoring and competing risks. We introduce Pathwise Cross-Entropy (PCE), which utilizes a symmetric, full-path objective that directly learns the occurred-by-t trajectory as a Cumulative Incidence Function (CIF). This direct approach seamlessly yields a normalized Probability Mass Function (PMF) for predicting event times, unlike NLL, where the derived PMF is structurally biased toward monotonic decrease, hindering its predictive utility. In a counting-process view, PCE supplies bidirectional gradients and constitutes a strictly proper scoring rule on counting paths. We extend PCE to competing risks with cause-specific supervision that avoids the multinomial coupling in NLL under competing risks. Empirically, across the tabular SEER and a WSI-derived kidney dataset and multiple backbones, PCE consistently improves discrimination (C-index, AUC) and calibration (IBS), produces calibration plots (ECE and PP plots) that are closer to observation, and enables ordinal first-hit time prediction directly with minimal practical monotonicity violations. These results indicate that PCE is a reliable and interpretable objective for single and competing-risk survival.

EAAI Journal 2026 Journal Article

Children’s psychological recognition with a multimodal language model incorporating visual language features

Yao-Dong Chen
Jia Li
Jing Xu

Although large language models have shown notable capability in open-domain tasks such as translation and text classification, emotional computing still requires psychological theory, which keeps the barrier to entry high. This research addresses the difficult problem of children’s emotion perception by developing a child-specific model that improves performance on visual question answering and facial expression classification. We present a new vision-language model, based on the bootstrapping language-image pretraining architecture, trained on text describing children’s traits and visual features. Instructions written by a child psychologist guide the training process. To increase accuracy and reduce memory and computation costs, the model is further fine-tuned using the low-rank adaptation method. Three public children’s emotion classification datasets and seven child psychology survey datasets are used to create the instructions and test model performance. The results show that our models clearly outperform traditional deep learning models across all datasets. After training, our multimodal models extract meaningful visual information and demonstrate image understanding that goes beyond categorization, indicating a grasp of emotional perception.

AAAI Conference 2026 Conference Paper

DAVSP: Safety Alignment for Large Vision-Language Models via Deep Aligned Visual Safety Prompt

Yitong Zhang
Jia Li
Liyi Cai
Ge Li

Large Vision-Language Models (LVLMs) have achieved impressive progress across various applications but remain vulnerable to malicious queries. Existing safety alignment approaches typically fail to resist malicious queries while preserving utility on benign ones effectively. To address these challenges, we propose DAVSP, which is built upon two key innovations. First, we introduce Visual Safety Prompt, which appends a trainable padding region around the input image. It preserves visual features and expands the optimization space. Second, we propose Deep Alignment, a novel approach to train the visual safety prompt through supervision in the model's activation space. It enhances the inherent ability of LVLMs to perceive malicious queries, achieving deeper alignment than prior works. Extensive experiments demonstrate that DAVSP effectively resists malicious queries while preserving benign input utility. Furthermore, DAVSP exhibits great cross-model generation ability. Ablation studies further reveal that both the Visual Safety Prompt and Deep Alignment are essential to the overall effectiveness.

AAAI Conference 2026 Conference Paper

Do Audio-Visual Segmentation Models Truly Segment Sounding Objects?

Jia Li
Wenjie Zhao
Ziru Huang
Yunhui Guo
Yapeng Tian

Unlike traditional visual segmentation, audio-visual segmentation (AVS) requires the model not only to identify and segment objects but also to determine whether they are sound sources. Recent AVS approaches have achieved impressive performance on standard benchmarks. Yet, an important question remains: Do these models genuinely integrate audio-visual cues to segment sounding objects? Our study reveals a fundamental bias in current methods: they tend to generate segmentation masks based predominantly on visual salience, irrespective of the audio context, resulting in unreliable predictions when sounds are absent or irrelevant. To address this challenge, we introduce AVSBench-Robust, a comprehensive benchmark incorporating diverse negative audio scenarios, including silence, noise, and off-screen sounds. We also propose a simple yet effective approach combining balanced training with negative samples and classifier-guided similarity learning. Our extensive experiments show that while state-of-the-art AVS methods consistently fail under negative audio conditions, our approach achieves remarkable improvements in both standard metrics and robustness measures, maintaining near-perfect false positive rates while preserving high-quality segmentation performance.

AAAI Conference 2026 Conference Paper

MedSpaformer: A Transferable Transformer with Multi-Granularity Token Sparsification for Medical Time Series Classification

Jiexia Ye
Weiqi Zhang
Ziyue Li
Jia Li
Fugee Tsung

Accurate medical time series (MedTS) classification is essential for effective clinical diagnosis, yet remains challenging due to complex multi-channel temporal dependencies, information redundancy, and label scarcity. While transformer-based models have shown promise in time series analysis, most are designed for forecasting tasks and fail to fully exploit the unique characteristics of MedTS. In this paper, we introduce MedSpaformer, a transformer-based framework tailored for MedTS classification. It incorporates a sparse token-based dual-attention mechanism that enables global context modeling and token sparsification, allowing dynamic feature refinement by focusing on informative tokens while reducing redundancy. This mechanism is integrated into a multi-granularity cross-channel encoding scheme to capture intra- and inter-granularity temporal dependencies and inter-channel correlations, enabling progressive refinement of task-relevant patterns in medical signals. The sparsification design allows our model to flexibly accommodate inputs with variable lengths and channel dimensions. We also introduce an adaptive label encoder to extract label semantics and address cross-dataset label space misalignment. Together, these components enhance the model’s transferability across heterogeneous medical datasets, which helps alleviate the challenge of label scarcity. Our model outperforms 13 baselines across 7 medical datasets under supervised learning. It also excels in few-shot learning and demonstrates zero-shot capability in both in-domain and cross-domain diagnostics. These results highlight MedSpaformer's robustness and its potential as a unified solution for MedTS classification across diverse settings.

YNIMG Journal 2026 Journal Article

Neural representations of emotional response inhibition reveal trait and state biomarkers in pediatric bipolar disorder

Jia Li
Rong Wang
Jianze Wu
Qian Xiao
Yuan Zhong

Pediatric bipolar disorder (PBD) is characterized by disrupted cognitive control, particularly in response inhibition under emotional interference. However, the neural underpinnings of these deficits, particularly how these impairments vary across emotional valence and whether they reflect trait markers or state alterations, remain unclear. While traditional univariate fMRI analyses reveal broad activation differences, they lack sensitivity to fine-grained neural patterns. This study aims to examine the neural representations of emotional response inhibition in PBD under valence-dependent interference using representational similarity analysis (RSA). We included manic (n = 15) and euthymic (n = 18) PBD patients, along with matched healthy controls (n = 17). Participants completed an emotional Go/NoGo task with happy, sad, and neutral faces during fMRI. Six contrast conditions were modeled to assess trait- and state-related effects. Whole-brain searchlight RSA (8 mm radius) was used to identify regions showing group differences in neural representational patterns. Results showed that emotional response inhibition engaged distributed neural systems, with distinct patterns across valence conditions. Compared to controls, PBD patients exhibited trait-related representational differences during happy inhibition, sad inhibition, and sad-specific inhibition, involving regions such as the precentral gyrus, middle frontal gyrus, and inferior parietal lobule. Manic patients showed state-related reductions in neural representations during sad-specific inhibition within frontal areas compared to euthymic patients. These findings indicate that emotional response inhibition deficits in PBD arise from both trait- and state-dependent abnormalities in neural representations. The study highlights the value of multivariate fMRI in uncovering clinically relevant biomarkers and provides a novel framework for developing phase-specific interventions.

AAAI Conference 2026 Conference Paper

Toward Gaze Target Detection of Young Autistic Children

Shijian Deng
Erin E. Kosloski
Siva Sai Nagender Vasireddy
Jia Li
Randi Sierra Sherwood
Feroz Mohamed Hatha
Siddhi Patel
Pamela R. Rollins

The automatic detection of gaze targets in autistic children through artificial intelligence can be impactful, especially for those who lack access to a sufficient number of professionals to improve their quality of life. This paper introduces a new, real-world AI application for gaze target detection in autistic children, which predicts a child's point of gaze from an activity image. This task is foundational for building automated systems that can measure joint attention—a core challenge in Autism Spectrum Disorder (ASD). To facilitate the study of this challenging application, we collected the first-ever Autism Gaze Target (AGT) Dataset. We further propose a novel social-aware coarse-to-fine (SACF) gaze detection framework that explicitly leverages the social context of a scene to overcome the class imbalance common in autism datasets—a consequence of autistic children's tendency to show reduced gaze to faces. It utilizes a two-pathway architecture with expert models specialized in social and non-social gaze, guided by a context-awareness gate module. The results of our comprehensive experiments demonstrate that our framework achieves new state-of-the-art performance for gaze target detection in this population, significantly outperforming existing methods, especially on the critical minority class of face-directed gaze.

AAAI Conference 2025 Conference Paper

A Comprehensive Evaluation on Event Reasoning of Large Language Models

Zhengwei Tao
Zhi Jin
Yifan Zhang
Xiancai Chen
Haiyan Zhao
Jia Li
Bin Liang
Chongyang Tao

Event reasoning is a fundamental ability that underlies many applications. It requires event schema knowledge to perform global reasoning and must handle the diversity of inter-event relations and reasoning paradigms. The extent to which LLMs excel in event reasoning across various relations and reasoning paradigms has not been thoroughly investigated. Additionally, it is still unclear whether LLMs utilize event knowledge in the same way humans do. To address this gap, we comprehensively evaluate the event reasoning abilities of LLMs across different relations, paradigms, and levels of abstraction. We introduce a novel benchmark, EV2, for EValuation of EVent reasoning. EV2 consists of two levels of evaluation, schema and instance, and is comprehensive in relations and reasoning paradigms. We conduct extensive experiments on EV2. We find that (1) LLMs can perform event reasoning, but their performance is far from satisfactory; (2) event reasoning abilities are imbalanced across relations and paradigms; and (3) LLMs have event schema knowledge but are not aligned with humans in how they utilize it. Based on these findings, we guide the LLMs to utilize event schema knowledge as memory, leading to improvements in event reasoning.

NeurIPS Conference 2025 Conference Paper

Diffusion-Classifier Synergy: Reward-Aligned Learning via Mutual Boosting Loop for FSCIL

Ruitao Wu
Yifan Zhao
Guangyao Chen
Jia Li

Few-Shot Class-Incremental Learning (FSCIL) challenges models to sequentially learn new classes from minimal examples without forgetting prior knowledge, a task complicated by the stability-plasticity dilemma and data scarcity. Current FSCIL methods often struggle with generalization due to their reliance on limited datasets. While diffusion models offer a path for data augmentation, their direct application can lead to semantic misalignment or ineffective guidance. This paper introduces Diffusion-Classifier Synergy (DCS), a novel framework that establishes a mutual boosting loop between the diffusion model and the FSCIL classifier. DCS utilizes a reward-aligned learning strategy, where a dynamic, multi-faceted reward function derived from the classifier's state directs the diffusion model. This reward system operates at two levels: the feature level ensures semantic coherence and diversity using prototype-anchored maximum mean discrepancy and dimension-wise variance matching, while the logits level promotes exploratory image generation and enhances inter-class discriminability through confidence recalibration and cross-session confusion-aware mechanisms. This co-evolutionary process, in which generated images refine the classifier and an improved classifier state yields better reward signals, achieves state-of-the-art performance on FSCIL benchmarks, significantly enhancing both knowledge retention and new class learning.