Arrow Research search

Author name cluster

Jin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

54 papers
2 author rows

Possible papers

54

EAAI Journal 2026 Journal Article

A three-dimensional multi-sensor fusion convolutional network for bearing fault diagnosis under complex small sample conditions

  • Qiang Li
  • Rundong Zhou
  • Xinyu Zhai
  • Jin Wang
  • Qing Lv

To address the problem of inadequate information characterization of single sensors in bearing fault diagnosis, this paper proposes a three-dimensional multi-sensor fusion convolutional network (3D MFCN). Initially, it constructs multi-source inputs by integrating vibration and other fault signals. Subsequently, a three-dimensional feature extraction module (3D FEM) transforms one-dimensional signals into a time-frequency-depth three-dimensional feature tensor via a multi-scale Mel transform. Ultimately, end-to-end fault diagnosis is achieved through a three-dimensional convolution pooling module (3D CPM) in conjunction with a bidirectional long short-term memory network (BiLSTM). Experimental validation demonstrates that 3D MFCN attains over 99.7% classification accuracy across all three datasets, while both the 3D FEM feature extraction and the complete 3D MFCN model exhibit stable performance exceeding 98% in noisy environments, markedly surpassing traditional single-sensor diagnostic methods.
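
The multi-scale time-frequency-depth tensor construction described above can be sketched generically. The snippet below is a minimal illustration only: it uses plain STFT magnitudes pooled into fixed frequency bands in place of a true Mel filterbank, and the window sizes, band/frame counts, and synthetic two-channel sensor signals are all invented for the example.

```python
import numpy as np

def stft_mag(x, win, hop):
    """Magnitude spectrogram of a 1-D signal via a sliding Hann window."""
    frames = [x[i:i + win] * np.hanning(win)
              for i in range(0, len(x) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))  # (time, freq)

def multiscale_tensor(signals, wins=(64, 128, 256), bands=16, frames=30):
    """Stack multi-scale spectrograms of several sensor channels into a
    (depth, time, band) tensor, where depth = channels x scales."""
    layers = []
    for sig in signals:                      # one layer per (channel, scale)
        for w in wins:
            s = stft_mag(sig, w, w // 2)[:frames]        # crop time axis
            edges = np.linspace(0, s.shape[1], bands + 1).astype(int)
            pooled = np.stack([s[:, a:b].mean(axis=1)    # pool freq -> bands
                               for a, b in zip(edges[:-1], edges[1:])], axis=1)
            layers.append(pooled)
    return np.stack(layers)                  # (depth, time, band)

rng = np.random.default_rng(0)
vib = np.sin(2 * np.pi * 0.05 * np.arange(4096)) + 0.1 * rng.standard_normal(4096)
cur = rng.standard_normal(4096)              # stand-in for a second sensor
tensor = multiscale_tensor([vib, cur])
print(tensor.shape)                          # (6, 30, 16)
```

The resulting 3D tensor is the kind of input a 3D convolution-pooling stage could consume; the depth axis here encodes the channel-scale combinations.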

AAAI Conference 2026 Conference Paper

Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency

  • Riling Wei
  • Kelu Yao
  • Chuanguang Yang
  • Jin Wang
  • Zhuoyan Gao
  • Chao Li

Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but raises the cost of knowledge transmission, which we rigorously verify based on optimal transport theory. To mitigate this issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments show that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
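
The optimal-transport view of matching teacher and student features can be illustrated with a generic entropy-regularized Sinkhorn solver. This is a standard textbook routine, not the paper's Lagrangian formulation; the "teacher" and "student" feature matrices, dimensions, and cosine-distance cost are all synthetic assumptions for the sketch.

```python
import numpy as np

def sinkhorn(cost, eps=0.1, iters=200):
    """Entropy-regularized optimal transport between two uniform
    marginals via Sinkhorn iterations; returns the transport plan."""
    n, m = cost.shape
    K = np.exp(-cost / eps)
    v = np.ones(m) / m
    for _ in range(iters):
        u = (1 / n) / (K @ v)      # scale rows to marginal 1/n
        v = (1 / m) / (K.T @ u)    # scale cols to marginal 1/m
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(0)
t = rng.standard_normal((4, 16))                     # teacher features
s = np.vstack([t[:2] + 0.05 * rng.standard_normal((2, 16)),
               rng.standard_normal((2, 16))])        # 2 matched, 2 unmatched
tn = t / np.linalg.norm(t, axis=1, keepdims=True)
sn = s / np.linalg.norm(s, axis=1, keepdims=True)
plan = sinkhorn(1 - tn @ sn.T)                       # cost = cosine distance
print(plan.shape)  # (4, 4) joint coupling over teacher/student samples
```

The plan's mass concentrates where features are semantically close, which is the quantity an ACKD-style method would exploit when semantic overlap is only partial.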

AAAI Conference 2026 Conference Paper

HCC-3D: Hierarchical Compensatory Compression for 98% 3D Token Reduction in Vision-Language Models

  • Liheng Zhang
  • Jin Wang
  • Hui Li
  • Bingfeng Zhang
  • Weifeng Liu

3D understanding has drawn significant attention recently, leveraging Vision-Language Models (VLMs) to enable multi-modal reasoning between point cloud and text data. Current 3D-VLMs directly embed the 3D point clouds into 3D tokens, following large 2D-VLMs with powerful reasoning capabilities. However, this framework incurs a great computational cost that limits its application, and we identify that the bottleneck lies in processing all 3D tokens in the Large Language Model (LLM) part. This raises the question: how can we reduce the computational overhead introduced by 3D tokens while preserving the integrity of their essential information? To address this question, we introduce Hierarchical Compensatory Compression (HCC-3D) to efficiently compress 3D tokens while retaining critical details. Specifically, we first propose a global structure compression (GSC), in which we design global queries to compress all 3D tokens into a few key tokens while keeping overall structural information. Then, to compensate for the information loss in GSC, we further propose an adaptive detail mining (ADM) module that selectively recompresses salient but under-attended features through complementary scoring. Extensive experiments demonstrate that HCC-3D not only achieves extreme compression ratios (approximately 98%) compared to previous 3D VLMs, but also sets new state-of-the-art performance, showing great improvements in both efficiency and performance.
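
The core idea of compressing many tokens into a few learned query tokens can be sketched as single-head cross-attention pooling. The snippet below is a bare illustration under assumed shapes (1024 tokens to 16 queries, giving the roughly 98% reduction the title cites); it omits the learned projections, the ADM compensation stage, and everything else specific to HCC-3D.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_compress(tokens, queries):
    """Cross-attention pooling: K queries attend over N tokens,
    producing K summary tokens (single head, no projections for brevity)."""
    d = tokens.shape[-1]
    attn = softmax(queries @ tokens.T / np.sqrt(d))   # (K, N) weights
    return attn @ tokens                              # (K, d) summaries

rng = np.random.default_rng(0)
point_tokens = rng.standard_normal((1024, 64))   # N = 1024 3D tokens
global_queries = rng.standard_normal((16, 64))   # K = 16 learned queries
compressed = query_compress(point_tokens, global_queries)
reduction = 1 - compressed.shape[0] / point_tokens.shape[0]
print(compressed.shape, reduction)   # 16 tokens remain -> ~98% reduction
```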

EAAI Journal 2026 Journal Article

Knowledge graph-based operation and maintenance risk analysis and early warning approach for railway traction power supply systems

  • Shi Qiu
  • Xiaojian Li
  • Yongjun Chen
  • Weidong Wang
  • Jin Wang
  • Runan Cheng
  • Qasim Zaheer

The railway traction power supply system (RTPSS) is a critical component in the operation of electrified railways. However, as the network expands and maintenance cycles lengthen, it faces increasing operational risk. To enhance the accuracy of risk management and the timeliness of decision-making, this paper presents a risk analysis framework for the operation and maintenance (O&M) of RTPSS by utilizing knowledge graph technology. Initially, natural language processing (NLP) techniques are employed to handle massive fault data, constructing a systematic model to comprehensively represent the global modeling of multi-risk coupling mechanisms and cross-system cascade failures. Subsequently, a method for evaluating the early warning levels of risk events is proposed, which integrates multidimensional data. This method systematically assesses early warning levels by considering risk probability, risk loss data, and network topology data. Finally, the study outlines the process of mapping the early warning levels of RTPSS O&M risk onto knowledge graphs by dynamically integrating physical data with graph-based approaches. This approach enables maintenance personnel to quickly identify and comprehend the operational status of the RTPSS. Case study results demonstrate that the proposed method significantly enhances systematization, comprehensiveness, and observability, providing a more accurate and holistic tool for managing RTPSS O&M risk.

AAAI Conference 2026 Conference Paper

LLMdoctor: Token-Level Flow-Guided Preference Optimization for Efficient Test-Time Alignment of Large Language Models

  • Tiesunlong Shen
  • Rui Mao
  • Jin Wang
  • Heming Sun
  • Jian Zhang
  • Xuejie Zhang
  • Erik Cambria

Aligning Large Language Models (LLMs) with human preferences is critical, yet traditional fine-tuning methods are computationally expensive and inflexible. While test-time alignment offers a promising alternative, existing approaches often rely on distorted trajectory-level signals or inefficient sampling, fundamentally capping performance and failing to preserve the generative diversity of the base model. This paper introduces LLMdoctor, a novel framework for efficient test-time alignment that operates via a patient-doctor paradigm. It integrates token-level reward acquisition with token-level flow-guided preference optimization (TFPO) to steer a large, frozen patient LLM with a smaller, specialized doctor model. Unlike conventional methods that rely on trajectory-level rewards, LLMdoctor first extracts fine-grained, token-level preference signals from the patient model's behavioral variations. These signals then guide the training of the doctor model via TFPO, which establishes flow consistency across all subtrajectories, enabling precise token-by-token alignment while inherently preserving generation diversity. Extensive experiments demonstrate that LLMdoctor significantly outperforms existing test-time alignment methods and even surpasses the performance of full fine-tuning approaches like DPO.
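
The patient-doctor decoding idea, where a small model steers a frozen large model at test time, can be sketched at the level of a single next-token step. This is a generic logit-guidance illustration under invented toy logits, not the TFPO training procedure or the paper's actual interface.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def guided_next_token(patient_logits, doctor_logits, beta=1.0):
    """Test-time guidance: bias the frozen 'patient' model's next-token
    distribution with a small 'doctor' model's token-level preferences."""
    return softmax(patient_logits + beta * doctor_logits)

patient = np.array([2.0, 1.0, 0.5])   # base model favors token 0
doctor = np.array([-3.0, 3.0, 0.0])   # guidance model prefers token 1
p = guided_next_token(patient, doctor)
print(p.argmax())  # guidance shifts probability mass to token 1
```

Because the bias is applied per token rather than per trajectory, this style of steering can preserve the base model's distribution wherever the doctor is indifferent.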

EAAI Journal 2026 Journal Article

Multimodal graph neural network framework for railway fastener tightness assessment from high-resolution point clouds

  • Qasim Zaheer
  • S Muhammad Ahmed Hassan Shah
  • Weidong Wang
  • Haleema Ehsan
  • Chengbo Ai
  • Jin Wang
  • Shi Qiu

Railway fasteners are critical to the safety and structural integrity of railway infrastructure, yet conventional tightness assessment methods based on manual inspection are labor-intensive, subjective, and difficult to scale. This paper presents a hybrid dual-phase framework for automated fastener tightness estimation that integrates multimodal self-supervised contrastive learning with graph-based feature analysis. The framework exploits complementary information from images, depth maps, point clouds, and mesh representations to learn mechanically meaningful features without requiring manual annotations. Experimental results demonstrate stable cross-modal feature alignment, with cosine similarity remaining consistent across configurations, and show that incorporating three-dimensional geometric information increases representational richness, as reflected by higher feature-space separation. Inference-time analysis indicates that multimodal feature extraction requires approximately 3 s for image-depth processing and 2 s for graph-based three-dimensional processing per fastener, supporting practical deployment through offline or batch-based operation. Overall, the proposed framework provides a robust and physically interpretable approach for railway fastener tightness monitoring and establishes a foundation for scalable intelligent maintenance systems.
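
Cross-modal contrastive alignment of the kind described above is commonly trained with a symmetric InfoNCE objective; the sketch below is that standard loss on synthetic embeddings, with the "image-branch"/"geometry-branch" naming, batch size, and temperature all assumed for illustration rather than taken from the paper.

```python
import numpy as np

def info_nce(z_img, z_geo, tau=0.1):
    """Symmetric InfoNCE: matched pairs (row i of each matrix) are
    positives, all other rows in the batch serve as negatives."""
    def norm(z):
        return z / np.linalg.norm(z, axis=1, keepdims=True)
    a, b = norm(z_img), norm(z_geo)
    logits = a @ b.T / tau                 # (N, N) scaled cosine similarities
    labels = np.arange(len(a))
    def xent(l):
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()
    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
z = rng.standard_normal((8, 32))
loss_aligned = info_nce(z, z + 0.01 * rng.standard_normal((8, 32)))
loss_random = info_nce(z, rng.standard_normal((8, 32)))
print(loss_aligned < loss_random)   # aligned modalities give lower loss
```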

EAAI Journal 2026 Journal Article

Numerical spiking neural membrane systems with dendritic spines for diagnosis of infectious spondylitis on Magnetic Resonance Images

  • Hongyan Zhang
  • Qiang Zhang
  • Jin Wang
  • Xiang Yu
  • Yang Li
  • Xiyu Liu
  • Jie Xue

As a branch of the third-generation neural network, the spiking neural membrane system has strong parallelism and low energy consumption. It has achieved success in pattern recognition, combinatorial optimization, power system control and robotics. However, traditional systems rely only on neuron rules for signal processing and transmission, and lack long-term memory. Long-term memory is an important function for biological neurons to achieve learning behavior. To address this limitation, we propose an innovative numerical spiking neural membrane system with dendritic spines, which enables neurons to retain and amplify important information. The system contains four neuron populations for identifying, memorizing, enhancing, and evaluating local salient features, respectively, and can be flexibly integrated into complex integrated membrane systems. In this study, a novel integrated membrane system is designed, which can extract key details from magnetic resonance images (MRI) using neurons, and enhance the salient features using neurons with dendritic spines, so as to improve the accuracy and efficiency of spondylitis diagnosis. The experimental results show that the system outperforms the current state-of-the-art deep learning network and four traditional classifiers in distinguishing tuberculous spondylitis from brucellar spondylitis, highlighting its potential in practical clinical applications.

AAAI Conference 2026 Conference Paper

SAPO: Self-Adaptive Process Optimization Makes Small Reasoners Stronger

  • Kaiyuan Chen
  • Guangmin Zheng
  • Jin Wang
  • Xiaobing Zhou
  • Xuejie Zhang

Existing self-evolution methods overlook the influence of fine-grained reasoning steps, which leads to the reasoner-verifier gap. The computational inefficiency of Monte Carlo (MC) process supervision further exacerbates the difficulty of mitigating this gap. Motivated by the error-related negativity (ERN), whereby a reasoner can localize errors immediately after incorrect decisions to guide rapid adjustments, we propose a Self-Adaptive Process Optimization (SAPO) method for self-improvement in Small Language Models (SLMs). SAPO adaptively and efficiently introduces process supervision signals by actively minimizing the reasoner-verifier gap rather than relying on inefficient MC estimations. Extensive experiments demonstrate that the proposed method outperforms most existing self-evolution methods on two challenging task types: mathematics and code. Additionally, to further investigate SAPO's impact on verifier performance, this work introduces two new benchmarks for process reward models in both mathematical and coding tasks.

EAAI Journal 2026 Journal Article

Self-verified user simulator via code-based interpretation in task-oriented dialogues

  • Xiang Luo
  • Jin Wang
  • Liang-Chih Yu
  • Xuejie Zhang

User simulators are essential for training and evaluating task-oriented dialogue systems (TODs). Recently, large language models (LLMs) have been increasingly adopted to construct user simulators by prompting them to generate natural language utterances and dialogue actions. However, due to the difficulty of controlling structured outputs through natural language prompts alone, these LLM-based simulators often produce incomplete, inconsistent, or invalid dialogue actions, limiting their effectiveness. To tackle this, this paper proposes a self-verified code-based user simulator that guides LLMs to generate intermediate Python code for structured dialogue actions. These code snippets are executed and validated by an external interpreter, and the verified outputs are used to refine the simulator’s behavior. Experiments on the Multi-Domain Wizard-of-Oz (MultiWOZ) dataset demonstrate that our method improves dialogue action accuracy by 4.0%, and significantly enhances utterance diversity, achieving 12.1% more trigrams, an increase of 0.93 in entropy, and a 13.2% gain in measure of textual lexical diversity (MTLD) over 100 dialogue turns. These results highlight the effectiveness of code-level verification in improving the controllability, correctness, and expressiveness of LLM-based user simulators.
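
The execute-and-validate loop described above can be sketched in miniature. The schema, the allowed act names, and the convention that generated code defines an `action` variable are all invented here for illustration; they are not the paper's actual interface.

```python
VALID_ACTS = {"inform", "request", "book"}   # hypothetical act inventory

def validate_action(action):
    """Check a structured dialogue action against a minimal schema."""
    return (isinstance(action, dict)
            and action.get("act") in VALID_ACTS
            and isinstance(action.get("slots"), dict))

def run_and_verify(code_snippet):
    """Execute model-generated Python that must define `action`, then
    validate the result; any failure is surfaced for refinement."""
    scope = {}
    try:
        exec(compile(code_snippet, "<sim>", "exec"), scope)
        action = scope.get("action")
        return (True, action) if validate_action(action) else (False, action)
    except Exception as err:               # syntax/runtime errors also fail
        return False, err

ok_code = 'action = {"act": "request", "slots": {"hotel-area": "?"}}'
bad_code = 'action = {"act": "dance", "slots": {}}'
print(run_and_verify(ok_code)[0], run_and_verify(bad_code)[0])  # True False
```

The point of routing structure through code rather than free text is exactly this: the interpreter gives a hard pass/fail signal that a natural-language prompt cannot.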

AAAI Conference 2026 Conference Paper

Step-GRPO: Enhancing Reasoning Quality and Efficiency via Structured PRM-Based Reinforcement Learning

  • Weijie Li
  • Jin Wang
  • Liang-Chih Yu
  • Xuejie Zhang

Large reasoning models (LRMs) improve performance at test time by thinking longer, but this often leads to overthinking and high computational cost. To address this, recent reinforcement learning (RL) methods adopt outcome-level rewards, such as rule- or prompt-based signals, that favor shorter correct reasoning paths but often overlook reasoning quality. While such rewards neglect intermediate reasoning, dense supervision from process reward models (PRMs) has proven more effective in promoting coherent and high-quality reasoning. However, static PRM supervision introduces two challenges: reward hacking, since fixed rewards poorly capture global reasoning objectives, and the high training cost of obtaining dense reward labels at scale. To overcome these issues, we propose Step Group Relative Policy Optimization (Step-GRPO), a GRPO-based method that integrates step-level PRM signals into sparse trajectory-level feedback, avoiding costly step-level supervision while improving reasoning quality beyond accuracy. In addition, Step-GRPO employs a step-attention mechanism that captures inter-step dependencies and emphasizes critical reasoning steps, effectively mitigating reward hacking. We apply Step-GRPO to train large language models and observe consistent gains in reasoning quality, accuracy, and shorter reasoning traces across multiple math benchmarks, outperforming reinforcement learning baselines at substantially lower cost. Notably, the proposed model achieves 36.7 percent accuracy on AIME 2024 with 11,000 training samples and a training cost of 38 US dollars, surpassing baselines that require over 1,000 US dollars and more than 40,000 samples, demonstrating strong cost-effectiveness and scalability.
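
The group-relative credit assignment that GRPO-family methods build on can be shown in a few lines. This is the standard GRPO advantage over a sampled group, not Step-GRPO's step-attention extension; the rewards below are toy values.

```python
import numpy as np

def group_relative_advantage(rewards):
    """GRPO-style advantage: normalize each trajectory's reward against
    the mean/std of its own sampled group (no learned value function)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# One prompt, a group of 4 sampled reasoning traces with outcome rewards;
# Step-GRPO would additionally fold step-level PRM signals into r.
adv = group_relative_advantage([1.0, 0.0, 0.0, 1.0])
print(adv)  # above-mean traces get positive advantage, others negative
```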

EAAI Journal 2026 Journal Article

Syntax-aware question generation through dependency relations-guided attention

  • Jinhong Li
  • Xuejie Zhang
  • Jin Wang
  • Xiaobing Zhou

Question generation is challenging and has drawn broad interest in recent years. Despite progress, robustly modeling linguistic structure in lengthy, mixed text while suppressing noise remains a major bottleneck. Traditional attention mechanisms offer no explicit constraints and tend to attend to all words, often placing undue weight on irrelevant ones. In this paper, we suggest guiding text modeling through syntax by integrating explicit syntactic constraints into the attention mechanism, yielding more linguistically-driven word representations. Specifically, we introduce syntax-dependency relations of interest into the self-attention network, which, combined with the original self-attention network in the Bidirectional and Auto-Regressive Transformers (BART) encoder through a dual-context architecture, forms a syntax guidance augmented framework (SG-BART), to achieve better linguistically inspired representations. Additionally, we address the issue of exposure bias by employing contrastive learning. This involves constructing contrastive samples, selecting appropriate contrastive loss functions, and designing new decoding strategies to enhance the model’s generalization capability. Our method significantly improves performance across multiple metrics for question generation on the public datasets.
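
Restricting attention to dependency-linked word pairs, as described above, amounts to masking the score matrix before the softmax. The sketch below shows that mechanism with a toy sentence and invented dependency arcs; it is single-head attention without learned projections, not the SG-BART dual-context architecture.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def syntax_masked_attention(q, k, v, dep_edges, n):
    """Self-attention whose scores are restricted to words linked by a
    dependency relation (plus self-loops), instead of all word pairs."""
    mask = np.eye(n, dtype=bool)
    for head, dep in dep_edges:            # make arcs symmetric
        mask[head, dep] = mask[dep, head] = True
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores[~mask] = -1e9                   # block non-dependency pairs
    return softmax(scores) @ v

rng = np.random.default_rng(0)
n, d = 5, 8
x = rng.standard_normal((n, d))            # toy word representations
edges = [(1, 0), (2, 1), (2, 3), (3, 4)]   # toy dependency arcs
out = syntax_masked_attention(x, x, x, edges, n)
print(out.shape)  # (5, 8): each word mixes only with its dependents/head
```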

EAAI Journal 2025 Journal Article

A lightweight and robust detection network for diverse glass surface defects via scale- and shape-aware feature extraction

  • Huan Yu
  • Jin Wang
  • Jingru Yang
  • Yiming Liang
  • Zhihui Li
  • Zhan Wang
  • Haiyan He
  • Xusheng Zhang

As glass usage expands across industries, intelligent glass defect detection is essential for ensuring quality. However, the varying shapes and sizes of defects, coupled with numerous subtle defects and the demand for efficient detection, present challenges for existing methods in achieving both accurate and real-time detection. To address these, we propose the lightweight and robust Glass Surface Defect Network (GSDNet) via scale- and shape-aware feature extraction. Specifically, the novel Shape-aware Feature Extraction (SFE) block, which employs deformable convolution with special linear shape-adaptive offset constraints, forms the feature extraction network, enabling the adaptive extraction of local features for defects with irregular shapes. Meanwhile, the Scale-aware (SA) attention is proposed, incorporating a spatial attention mechanism to guide the model in focusing on key features across different receptive fields, enhancing defect detection at various scales. Finally, to enhance detection efficiency, the Efficient Bidirectional Path Aggregation Network (EBiPAN) is proposed as the feature aggregation module, integrating high-resolution information through bi-directional concatenation to improve small defect detection while avoiding significant additional computational burden. To validate the effectiveness of GSDNet, we compile the first multi-class glass defect dataset, covering 4 types of glass and 12 defect categories. Extensive experiments demonstrate that GSDNet exhibits exceptional accuracy and robustness, consistently outperforming 9 advanced networks, with a 6.8% improvement in mean Average Precision and a notable 10.9% improvement in mean Average Precision (small) over You Only Look Once version 8. Moreover, an optimal balance of accuracy and efficiency is achieved, with a detection speed of 68 frames per second. The dataset and code are publicly available at: https://github.com/FisherYuuri/GSDNet.

AAAI Conference 2025 Conference Paper

Batch Selection for Multi-Label Classification Guided by Uncertainty and Dynamic Label Correlations

  • Ao Zhou
  • Bin Liu
  • Jin Wang
  • Grigorios Tsoumakas

The accuracy of deep neural networks is significantly influenced by the effectiveness of mini-batch construction during training. In single-label scenarios, such as binary and multi-class classification tasks, it has been demonstrated that batch selection algorithms preferring samples with higher uncertainty achieve better performance than difficulty-based methods. Although there are two batch selection methods tailored for multi-label data, none of them leverage important uncertainty information. Adapting the concept of uncertainty to multi-label data is not a trivial task, since there are two issues that should be tackled. First, traditional variance or entropy-based uncertainty measures ignore fluctuations of predictions within sliding windows and the importance of the current model state. Second, existing multi-label methods do not explicitly exploit the label correlations, particularly the uncertainty-based label correlations that evolve during the training process. In this paper, we propose an uncertainty-based multi-label batch selection algorithm. It assesses uncertainty for each label by considering differences between successive predictions and the confidence of current outputs, and further leverages dynamic uncertainty-based label correlations to emphasize instances whose uncertainty is synergistically expressed across multiple labels. Empirical studies demonstrate the effectiveness of our method in improving the performance and accelerating the convergence of various multi-label deep learning models.
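
The per-label uncertainty signal described above combines prediction fluctuation over a sliding window with the confidence of the current output. The snippet below is one illustrative way to combine the two terms, with a hand-picked window of sigmoid outputs; it is not the paper's exact formula and omits the label-correlation weighting.

```python
import numpy as np

def label_uncertainty(pred_history):
    """Per-label uncertainty from a sliding window of sigmoid outputs:
    mean fluctuation between successive predictions, plus closeness of
    the current prediction to the 0.5 decision boundary."""
    h = np.asarray(pred_history)                  # (window, n_labels)
    fluctuation = np.abs(np.diff(h, axis=0)).mean(axis=0)
    confidence_gap = 1 - 2 * np.abs(h[-1] - 0.5)  # 1 at p=0.5, 0 at p in {0,1}
    return fluctuation + confidence_gap           # (n_labels,)

# Label 0: stable and confident; label 1: oscillating near the boundary.
window = [[0.95, 0.40], [0.96, 0.60], [0.94, 0.45]]
u = label_uncertainty(window)
print(u[1] > u[0])  # True: the unstable, low-confidence label scores higher
```

A batch-selection policy would then prefer instances whose uncertainty is high across several (correlated) labels at once.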

AAAI Conference 2025 Conference Paper

Data-Free Black-Box Federated Learning via Zeroth-Order Gradient Estimation

  • Xinge Ma
  • Jin Wang
  • Xuejie Zhang

Federated learning (FL) enables decentralized clients to collaboratively train a global model under the orchestration of a central server without exposing their individual data. However, the iterative exchange of model parameters between the server and clients imposes heavy communication burdens, risks potential privacy leakage, and even precludes collaboration among heterogeneous clients. Distillation-based FL tackles these challenges by exchanging low-dimensional model outputs rather than model parameters, yet it highly relies on a task-relevant auxiliary dataset that is often not available in practice. Data-free FL attempts to overcome this limitation by training a server-side generator to directly synthesize task-specific data samples for knowledge transfer. However, the update rule of the generator requires clients to share on-device models for white-box access, which greatly compromises the advantages of distillation-based FL. This motivates us to explore a data-free and black-box FL framework via Zeroth-order Gradient Estimation (FedZGE), which estimates the gradients after flowing through on-device models in a black-box optimization manner to complete the training of the generator in terms of fidelity, transferability, diversity, and equilibrium, without involving any auxiliary data or sharing any model parameters, thus combining the advantages of both distillation-based FL and data-free FL. Experiments on large-scale image classification datasets and network architectures demonstrate the superiority of FedZGE in terms of data heterogeneity, model heterogeneity, communication efficiency, and privacy protection.
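
The black-box ingredient of this approach, estimating gradients without backpropagating through client models, can be sketched with the textbook two-point zeroth-order estimator. The toy objective and sample counts below are illustrative only; this is not FedZGE's actual generator update.

```python
import numpy as np

def zo_gradient(f, x, mu=1e-3, n_samples=5000, rng=None):
    """Two-point zeroth-order gradient estimate of a black-box f:
    grad f(x) ~ mean[ (f(x + mu*u) - f(x)) / mu * u ] over random u."""
    if rng is None:
        rng = np.random.default_rng(0)
    g = np.zeros_like(x)
    fx = f(x)                       # one query at the base point
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + mu * u) - fx) / mu * u
    return g / n_samples

# Black box: f(x) = ||x||^2, whose true gradient is 2x.
f = lambda x: float(np.sum(x ** 2))
x = np.array([1.0, -2.0, 3.0])
g = zo_gradient(f, x)
print(g)  # close to the true gradient 2*x = [2, -4, 6]
```

Only function evaluations of `f` are used, which is exactly what makes the scheme compatible with clients that expose predictions but never parameters.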

AAAI Conference 2025 Conference Paper

Divide-Solve-Combine: An Interpretable and Accurate Prompting Framework for Zero-shot Multi-Intent Detection

  • Libo Qin
  • Qiguang Chen
  • Jingxuan Zhou
  • Jin Wang
  • Hao Fei
  • Wanxiang Che
  • Min Li

Zero-shot multi-intent detection is capable of capturing multiple intents within a single utterance without any training data, and has gained increasing attention. Building on the success of large language models (LLM), dominant approaches in the literature explore prompting techniques to enable zero-shot multi-intent detection. While significant advancements have been witnessed, the existing prompting approaches still face two major issues: lacking explicit reasoning and lacking interpretability. Therefore, in this paper, we introduce a Divide-Solve-Combine Prompting (DSCP) to address the above issues. Specifically, DSCP explicitly decomposes multi-intent detection into three components: (1) single-intent division prompting is utilized to decompose an input query into distinct sub-sentences, each containing a single intent; (2) intent-by-intent solution prompting is applied to solve each sub-sentence recurrently; and (3) multi-intent combination prompting is employed to combine each sub-sentence result into the final multi-intent result. By decomposition, DSCP allows the model to track the explicit reasoning process and improve interpretability. In addition, we propose an interactive divide-solve-combine prompting (Inter-DSCP) to naturally capture the interaction capabilities of large language models. Experimental results on two standard multi-intent benchmarks (i.e., MixATIS and MixSNIPS) reveal that both DSCP and Inter-DSCP obtain substantial improvements over baselines, achieving superior performance and higher interpretability.

EAAI Journal 2025 Journal Article

Dual-branch crack segmentation network with multi-shape kernel based on convolutional neural network and Mamba

  • Jianming Zhang
  • Dianwen Li
  • Zhigao Zeng
  • Rui Zhang
  • Jin Wang

Cracks are one of the most common pavement diseases. If not promptly repaired, they will hasten the deterioration of the road. Semantic segmentation is the most convenient pavement crack detection method for assessing the damage level. Convolutional neural networks (CNN) excel at extracting local spatial information, but they have limitations in capturing global contextual information. Therefore, a dual-branch crack segmentation network (DBCNet) with Mamba and multi-shape convolutional kernels is proposed. First, a dual-branch encoder is employed to extract both spatial and contextual information, consisting of the spatial branch and the context branch. The cross-like block (CrossBlock), which excels at extracting spatial information horizontally and vertically from cracks, is proposed. Multiple CrossBlocks are stacked to construct a lightweight network as the spatial branch. The improved Visual State Space Model (VMamba) serves as the context branch, modeling long-range dependencies for more accurate pixel-by-pixel segmentation. Second, the Feature Fusion Module (FFM), based on squeeze-and-excitation attention, is constructed to dynamically fuse the features from the two branches layer by layer. Third, a Cross-aware Mamba Module (CMM) with a hybrid CNN-Mamba architecture is proposed to compose the decoder. Fourth, comprehensive evaluations were conducted on three public datasets. Performance on multiple metrics achieved considerable progress, outperforming seven state-of-the-art models. The mean intersection over union (mIoU) on Deepcrack, CrackTree 260, and CFD reached 87.87%, 85.34%, and 81.35%, respectively. Code and data will be available at https://github.com/name191/DBCNet.

AAAI Conference 2025 Conference Paper

Efficient Event-Based Semantic Segmentation via Exploiting Frame-Event Fusion: A Hybrid Neural Network Approach

  • Hebei Li
  • Yansong Peng
  • Jiahui Yuan
  • Peixi Wu
  • Jin Wang
  • Yueyi Zhang
  • Xiaoyan Sun

Event cameras have recently been introduced into image semantic segmentation, owing to their high temporal resolution and other advantageous properties. However, existing event-based semantic segmentation methods often fail to fully exploit the complementary information provided by frames and events, resulting in complex training strategies and increased computational costs. To address these challenges, we propose an efficient hybrid framework for image semantic segmentation, comprising a Spiking Neural Network branch for events and an Artificial Neural Network branch for frames. Specifically, we introduce three specialized modules to facilitate the interaction between these two branches: the Adaptive Temporal Weighting (ATW) Injector, the Event-Driven Sparse (EDS) Injector, and the Channel Selection Fusion (CSF) module. The ATW Injector dynamically integrates temporal features from event data into frame features, enhancing segmentation accuracy by leveraging critical dynamic temporal information. The EDS Injector effectively combines sparse event data with rich frame features, ensuring precise temporal and spatial information alignment. The CSF module selectively merges these features to optimize segmentation performance. Experimental results demonstrate that our framework not only achieves state-of-the-art accuracy across the DDD17-Seg, DSEC-Semantic, and M3ED-Semantic datasets but also significantly reduces energy consumption, achieving a 65% reduction on the DSEC-Semantic dataset.

AAAI Conference 2025 Conference Paper

Event-Enhanced Blurry Video Super-Resolution

  • Dachun Kai
  • Yueyi Zhang
  • Jin Wang
  • Zeyu Xiao
  • Zhiwei Xiong
  • Xiaoyan Sun

In this paper, we tackle the task of blurry video super-resolution (BVSR), aiming to generate high-resolution (HR) videos from low-resolution (LR) and blurry inputs. Current BVSR methods often fail to restore sharp details at high resolutions, resulting in noticeable artifacts and jitter due to insufficient motion information for deconvolution and the lack of high-frequency details in LR frames. To address these challenges, we introduce event signals into BVSR and propose a novel event-enhanced network, Ev-DeblurVSR. To effectively fuse information from frames and events for feature deblurring, we introduce a reciprocal feature deblurring module that leverages motion information from intra-frame events to deblur frame features while reciprocally using global scene context from the frames to enhance event features. Furthermore, to enhance temporal consistency, we propose a hybrid deformable alignment module that fully exploits the complementary motion information from inter-frame events and optical flow to improve motion estimation in the deformable alignment process. Extensive evaluations demonstrate that Ev-DeblurVSR establishes a new state-of-the-art performance on both synthetic and real-world datasets. Notably, on real data, our method is 2.59 dB more accurate and 7.28× faster than the recent best BVSR baseline FMA-Net.

NeurIPS Conference 2025 Conference Paper

FUDOKI: Discrete Flow-based Unified Understanding and Generation via Kinetic-Optimal Velocities

  • Jin Wang
  • Yao Lai
  • Aoxue Li
  • Shifeng Zhang
  • Jiacheng Sun
  • Ning Kang
  • Chengyue Wu
  • Zhenguo Li

The rapid progress of large language models (LLMs) has catalyzed the emergence of multimodal large language models (MLLMs) that unify visual understanding and image generation within a single framework. However, most existing MLLMs rely on autoregressive (AR) architectures, which impose inherent limitations on future development, such as the raster-scan order in image generation and restricted reasoning abilities in causal context modeling. In this work, we challenge the dominance of AR-based approaches by introducing FUDOKI, a unified multimodal model purely based on discrete flow matching, as an alternative to conventional AR paradigms. By leveraging metric-induced probability paths with kinetic optimal velocities, our framework goes beyond the previous masking-based corruption process, enabling iterative refinement with self-correction capability and richer bidirectional context integration during generation. To mitigate the high cost of training from scratch, we initialize FUDOKI from pre-trained AR-based MLLMs and adaptively transition to the discrete flow matching paradigm. Experimental results show that FUDOKI achieves performance comparable to state-of-the-art AR-based MLLMs across both visual understanding and image generation tasks, highlighting its potential as a foundation for next-generation unified multimodal models. Furthermore, we show that applying test-time scaling techniques to FUDOKI yields significant performance gains, further underscoring its promise for future enhancement through reinforcement learning.

NeurIPS Conference 2025 Conference Paper

Jury-and-Judge Chain-of-Thought for Uncovering Toxic Data in 3D Visual Grounding

  • Kaixiang Huang
  • Qifeng Zhang
  • Jin Wang
  • Jingru Yang
  • Yang Zhou
  • Huan Yu
  • Guodong Lu
  • Shengfeng He

3D Visual Grounding (3DVG) faces persistent challenges due to coarse scene-level observations and logically inconsistent annotations, which introduce ambiguities that compromise data quality and hinder effective model supervision. To address these challenges, we introduce Refer-Judge, a novel framework that harnesses the reasoning capabilities of Multimodal Large Language Models (MLLMs) to identify and mitigate toxic data. At the core of Refer-Judge is a Jury-and-Judge Chain-of-Thought paradigm, inspired by the deliberative process of the judicial system. This framework targets the root causes of annotation noise: jurors collaboratively assess 3DVG samples from diverse perspectives, providing structured, multi-faceted evaluations. Judges then consolidate these insights using a Corroborative Refinement strategy, which adaptively reorganizes information to correct ambiguities arising from biased or incomplete observations. Through this two-stage deliberation, Refer-Judge significantly enhances the reliability of data judgments. Extensive experiments demonstrate that our framework not only achieves human-level discrimination at the scene level but also improves the performance of baseline algorithms via data purification. Code is available at https://github.com/Hermione-HKX/Refer_Judge.

YNICL Journal 2025 Journal Article

Language dysfunction and related amyloid-β in Alzheimer’s disease mediated by brain functional connectivity changes

  • Jin-Wen Xiao
  • Jin Wang
  • Jin-Tao Wang
  • Hai-Xia Li
  • Jian-Ping Li
  • Xin-Yi Xie
  • Jie-Li Geng
  • Nan Zhi

BACKGROUND: Language dysfunction occurs early in Alzheimer's disease (AD) and whether amyloid-β pathology is related to language dysfunction remains unclear. Functional connectivity (FC) in language networks is critical for language function. We hypothesize that altered FCs in fronto-temporal regions (core language areas) mediate the association between amyloid-β burden and language impairment. METHODS: A total of 110 individuals were recruited including 44 cognitively unimpaired individuals, 24 mild cognitive impairment patients and 42 AD patients. The acoustic features and semantic content were extracted from speech recordings of cookie-theft picture description for language assessment. Resting-state functional magnetic resonance imaging and 18F-florbetapir positron emission tomography were conducted to estimate fronto-temporal FCs and amyloid-β burden. RESULTS: The acoustic features of average duration of silence segments, percentage of silence duration and ratio of hesitation/speech counts were significantly increased in AD patients, and were correlated with the semantic content. Among them, the average duration of silence segments was positively correlated with global amyloid-β burden (r = 0.30, p = 0.015). The fronto-temporal FCs decreased in AD patients. Mediation analysis revealed that the reduced FC between the left rostroposterior superior temporal sulcus (rpSTS) and the right dorsal inferior frontal gyrus mediated the association between global amyloid-β burden and average duration of silence segments (a*b = 0.19, p = 0.026). Additionally, it also mediated the effect of amyloid-β burden in the left rpSTS region on average duration of silence segments (a*b = 0.15, p = 0.030). 
CONCLUSIONS: Our findings showed that reduced fronto-temporal FCs mediated the association between amyloid-β burden and language dysfunction and further demonstrated the vulnerability of the left rpSTS in response to amyloid-β burden, which might lead to decreased FC with other regions.

AAAI Conference 2025 Conference Paper

MegActor-Sigma: Unlocking Flexible Mixed-Modal Control in Portrait Animation with Diffusion Transformer

  • Shurong Yang
  • Huadong Li
  • Juhao Wu
  • Minhao Jing
  • Linze Li
  • Renhe Ji
  • Jiajun Liang
  • Haoqiang Fan

Diffusion models have demonstrated superior performance in portrait animation. However, current approaches rely on either the visual or the audio modality to control character movements, failing to exploit the potential of mixed-modal control. This challenge arises from the difficulty in balancing the weak control strength of the audio modality and the strong control strength of the visual modality. To address this issue, we introduce MegActor-Sigma: a mixed-modal conditional diffusion transformer (DiT), which can flexibly inject audio and visual modality control signals into portrait animation. Specifically, we make substantial advancements over its predecessor, MegActor, by leveraging the promising model structure of DiT and integrating audio and visual conditions through advanced modules within the DiT framework. To further achieve flexible combinations of mixed-modal control signals, we propose a "Modality Decoupling Control" training strategy to balance the control strength between visual and audio modalities, along with an "Amplitude Adjustment" inference strategy to freely regulate the motion amplitude of each modality. Finally, to facilitate extensive studies in this field, we design several dataset evaluation metrics to filter public datasets and solely use this filtered dataset for training. Extensive experiments demonstrate the superiority of our approach in generating vivid portrait animations.

ICLR Conference 2025 Conference Paper

MMIU: Multimodal Multi-image Understanding for Evaluating Large Vision-Language Models

  • Fanqing Meng
  • Jin Wang
  • Chuanhao Li 0001
  • Quanfeng Lu
  • Hao Tian 0006
  • Tianshuo Yang
  • Jiaqi Liao
  • Xizhou Zhu

The capability to process multiple images is crucial for Large Vision-Language Models (LVLMs) to develop a more thorough and nuanced understanding of a scene. Recent multi-image LVLMs have begun to address this need. However, their evaluation has not kept pace with their development. To fill this gap, we introduce the Multimodal Multi-image Understanding (MMIU) benchmark, a comprehensive evaluation suite designed to assess LVLMs across a wide range of multi-image tasks. MMIU encompasses 7 types of multi-image relationships, 52 tasks, 77K images, and 11K meticulously curated multiple-choice questions, making it the most extensive benchmark of its kind. Our evaluation of nearly 30 popular LVLMs, including both open-source and proprietary models, reveals significant challenges in multi-image comprehension, particularly in tasks involving spatial understanding. Even the most advanced models, such as GPT-4o, achieve only 55.7% accuracy on MMIU. Through multi-faceted analytical experiments, we identify key performance gaps and limitations, providing valuable insights for future model and data improvements. We aim for MMIU to advance the frontier of LVLM research and development. We release the data and code at https://github.com/MMIUBenchmark/MMIU.

AAAI Conference 2025 Conference Paper

Multi-Attribute Multi-Grained Adaptation of Pre-Trained Language Models for Text Understanding from Bayesian Perspective

  • You Zhang
  • Jin Wang
  • Liang-Chih Yu
  • Dan Xu
  • Xuejie Zhang

Current neural networks often employ multi-domain-learning or attribute-injecting mechanisms to incorporate non-independent and identically distributed (non-IID) information for text understanding tasks by capturing individual characteristics and the relationships among samples. However, the extent of the impact of non-IID information and how these methods affect pre-trained language models (PLMs) remains unclear. This study revisits the assumption that non-IID information enhances PLMs to achieve performance improvements from a Bayesian perspective, which unearths and integrates non-IID and IID features. Furthermore, we propose a multi-attribute multi-grained framework for PLM adaptations (M2A), which combines multi-attribute and multi-grained views to mitigate uncertainty in a lightweight manner. We evaluate M2A on prevalent text-understanding datasets and demonstrate its superior performance, particularly when data are implicitly non-IID and PLMs scale larger.

IROS Conference 2025 Conference Paper

RoboNurse-VLA: Robotic Scrub Nurse System based on Vision-Language-Action Model

  • Shunlei Li
  • Jin Wang
  • Rui Dai
  • Wanyu Ma
  • Wing Yin Ng
  • Yingbai Hu
  • Zheng Li 0012

In modern healthcare, the demand for autonomous robotic assistants has grown significantly, particularly in the operating room, where surgical tasks require precision and reliability. Robotic scrub nurses have emerged as a promising solution to improve efficiency and reduce human error during surgery. However, challenges remain in terms of accurately grasping and handing over surgical instruments, especially when dealing with complex objects in dynamic environments. In this work, we introduce RoboNurse-VLA, a novel robotic scrub nurse system based on a Vision-Language-Action (VLA) model. RoboNurse-VLA integrates Segment Anything Model 2 (SAM 2) and Llama 2, leveraging an LLM head to enhance reasoning capabilities. By combining SAM 2’s mask generation with Llama 2’s advanced reasoning, RoboNurse-VLA can accurately interpret task requirements, identify optimal grasping points, and determine appropriate handover poses. Designed for real-time operation, RoboNurse-VLA enables precise grasping and seamless handover of surgical instruments based on voice commands from the surgeon. Utilizing state-of-the-art vision and language models, it effectively addresses challenges related to object detection, pose optimization, and handling difficult-to-grasp instruments. Extensive evaluations demonstrate that RoboNurse-VLA outperforms existing models, achieving high success rates in surgical instrument handovers, even for previously unseen tools and complex objects. This work represents a significant advancement in autonomous surgical assistance, highlighting the potential of VLA models for real-world medical applications. More details can be found at https://robonurse-vla.github.io.

AAAI Conference 2025 Conference Paper

Sample-aware Adaptive Structured Pruning for Large Language Models

  • Jun Kong
  • Xinge Ma
  • Jin Wang
  • Xuejie Zhang

Large language models (LLMs) have achieved outstanding performance in natural language processing, but enormous model sizes and high computational costs limit their practical deployment. Structured pruning can effectively reduce the resource demands for deployment by removing redundant model parameters. However, the randomly selected calibration data and fixed single importance estimation metrics in existing structured pruning methods lead to degraded performance of pruned models. This study introduces AdaPruner, a sample-aware adaptive structured pruning framework for LLMs, aiming to optimize the calibration data and importance estimation metrics in the structured pruning process. Specifically, AdaPruner effectively removes redundant parameters from LLMs by constructing a structured pruning solution space and then employing Bayesian optimization to adaptively search for the optimal calibration data and importance estimation metrics. Experimental results show that the AdaPruner outperforms existing structured pruning methods on a family of LLMs with varying pruning ratios, demonstrating its applicability and robustness. Remarkably, at a 20% pruning ratio, the model pruned with AdaPruner maintains 97% of the performance of the unpruned model.

IJCAI Conference 2025 Conference Paper

TEST-V: TEst-time Support-set Tuning for Zero-shot Video Classification

  • Rui Yan
  • Jin Wang
  • Hongyu Qu
  • Xiaoyu Du
  • Dong Zhang
  • Jinhui Tang
  • Tieniu Tan

Recently, adapting Vision Language Models (VLMs) to zero-shot visual classification by tuning class embedding with a few prompts (Test-time Prompt Tuning, TPT) or replacing class names with generated visual samples (support-set) has shown promising results. However, TPT cannot avoid the semantic gap between modalities while the support-set cannot be tuned. To this end, we draw on each other's strengths and propose a novel framework, namely TEst-time Support-set Tuning for zero-shot Video Classification (TEST-V). It first dilates the support-set with multiple prompts (Multi-prompting Support-set Dilation, MSD) and then erodes the support-set via learnable weights to mine key cues dynamically (Temporal-aware Support-set Erosion, TSE). Specifically, i) MSD expands the support samples for each class based on multiple prompts inquired from LLMs to enrich the diversity of the support-set. ii) TSE tunes the support-set with factorized learnable weights according to the temporal prediction consistency in a self-supervised manner to dig pivotal supporting cues for each class. TEST-V achieves state-of-the-art results across four benchmarks and shows good interpretability.

AAAI Conference 2025 Conference Paper

Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation

  • Kuanghong Liu
  • Jin Wang
  • Kangjian He
  • Dan Xu
  • Xuejie Zhang

Conventional multi-source domain few-shot adaptation (MFDA) faces the challenge of further reducing the load on edge-side devices in low-resource scenarios. Considering the native language-supervised advantage of CLIP and the plug-and-play nature of prompts for transferring CLIP efficiently, this paper introduces an uploadable multi-source few-shot domain adaptation (UMFDA) schema. It is a decentralized edge collaborative learning schema in which the edge-side models must maintain a low computational load, and only a limited amount of annotation in the source domain data is provided, with most of the data being unannotated. Further, this paper proposes a vision-aware multimodal prompt tuning framework (VAMP) under the decentralized schema, where the vision-aware prompt guides the text domain-specific prompt to maintain semantic discriminability and perceive the domain information. The cross-modal semantic and domain distribution alignment losses optimize each edge-side model, while text classifier consistency and semantic diversity losses promote collaborative learning among edge-side models. Extensive experiments were conducted on the OfficeHome and DomainNet datasets to demonstrate the effectiveness of the proposed VAMP in the UMFDA, where it outperformed previous prompt tuning methods.

IROS Conference 2024 Conference Paper

Autonomous Behavior Planning For Humanoid Loco-manipulation Through Grounded Language Model

  • Jin Wang
  • Arturo Laurenzi
  • Nikos G. Tsagarakis

Enabling humanoid robots to autonomously perform loco-manipulation in unstructured environments is crucial and highly challenging for achieving embodied intelligence. This involves robots being able to plan their actions and behaviors in long-horizon tasks while using multi-modality to perceive deviations between task execution and high-level planning. Recently, large language models (LLMs) have demonstrated powerful planning and reasoning capabilities for comprehension and processing of semantic information through robot control tasks, as well as the usability of analytical judgment and decision-making for multi-modal inputs. To leverage the power of LLMs towards humanoid loco-manipulation, we propose a novel language-model based framework that enables robots to autonomously plan behaviors and low-level execution under given textual instructions, while observing and correcting failures that may occur during task execution. To systematically evaluate this framework in grounding LLMs, we created the robot 'action' and 'sensing' behavior library for task planning, conducted mobile manipulation tasks and experiments in both simulated and real environments using the CENTAURO robot, and verified the effectiveness and application of this approach in robotic tasks with autonomous behavioral planning. Video: https://youtu.be/mmnaxthEX34

NeurIPS Conference 2024 Conference Paper

Continuous Heatmap Regression for Pose Estimation via Implicit Neural Representation

  • Shengxiang Hu
  • Huaijiang Sun
  • Dong Wei
  • Xiaoning Sun
  • Jin Wang

Heatmap regression has dominated human pose estimation due to its superior performance and strong generalization. To meet the requirements of traditional explicit neural networks for output form, existing heatmap-based methods discretize the originally continuous heatmap representation into 2D pixel arrays, which leads to performance degradation due to the introduction of quantization errors. This problem is significantly exacerbated as the size of the input image decreases, which makes heatmap-based methods not much better than coordinate regression on low-resolution images. In this paper, we propose a novel neural representation for human pose estimation called NerPE to achieve continuous heatmap regression. Given any position within the image range, NerPE regresses the corresponding confidence scores for body joints according to the surrounding image features, which guarantees continuity in space and confidence during training. Thanks to the decoupling from spatial resolution, NerPE can output the predicted heatmaps at arbitrary resolution during inference without retraining, which easily achieves sub-pixel localization precision. To reduce the computational cost, we design progressive coordinate decoding to cooperate with continuous heatmap regression, in which localization no longer requires the complete generation of high-resolution heatmaps. The code is available at https://github.com/hushengxiang/NerPE.

ICML Conference 2024 Conference Paper

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI

  • Kaining Ying
  • Fanqing Meng
  • Jin Wang
  • Zhiqian Li
  • Han Lin
  • Yue Yang
  • Hao Zhang 0117
  • Wenbo Zhang 0009

Large Vision-Language Models (LVLMs) show significant strides in general-purpose multimodal applications such as visual dialogue and embodied navigation. However, existing multimodal evaluation benchmarks cover a limited number of multimodal tasks testing rudimentary capabilities, falling short in tracking LVLM development. In this study, we present MMT-Bench, a comprehensive benchmark designed to assess LVLMs across massive multimodal tasks requiring expert knowledge and deliberate visual recognition, localization, and reasoning. MMT-Bench comprises 31,325 meticulously curated multi-choice visual questions from various multimodal scenarios such as vehicle driving and embodied navigation, covering 32 core meta-tasks and 162 subtasks in multimodal understanding. Due to its extensive task coverage, MMT-Bench enables the evaluation of LVLMs using a task map, facilitating the discovery of in- and out-of-domain tasks. Evaluation results involving 20 publicly available LVLMs, such as the proprietary GeminiProVision model, underscore the significant challenges posed by MMT-Bench. We anticipate that MMT-Bench will inspire the community to develop next-generation multimodal foundation models aimed at achieving general-purpose multimodal intelligence.

AAAI Conference 2024 Conference Paper

Personalized LoRA for Human-Centered Text Understanding

  • You Zhang
  • Jin Wang
  • Liang-Chih Yu
  • Dan Xu
  • Xuejie Zhang

Effectively and efficiently adapting a pre-trained language model (PLM) for human-centered text understanding (HCTU) is challenging since user tokens are million-level in most personalized applications and do not have concrete explicit semantics. A standard and parameter-efficient approach (e.g., LoRA) necessitates memorizing numerous suits of adapters for each user. In this work, we introduce a personalized LoRA (PLoRA) with a plug-and-play (PnP) framework for the HCTU task. PLoRA is effective, parameter-efficient, and dynamically deployable in PLMs. Moreover, personalized dropout and mutual information maximization strategies are adopted, and hence the proposed PLoRA can be well adapted to few/zero-shot learning scenarios for the cold-start issue. Experiments conducted on four benchmark datasets show that the proposed method outperforms existing methods in full/few/zero-shot learning scenarios for the HCTU task, even though it has fewer trainable parameters. For reproducibility, the code for this paper is available at: https://github.com/yoyo-yun/PLoRA.

NeurIPS Conference 2023 Conference Paper

Birder: Communication-Efficient 1-bit Adaptive Optimizer for Practical Distributed DNN Training

  • Hanyang Peng
  • Shuang Qin
  • Yue Yu
  • Jin Wang
  • Hui Wang
  • Ge Li

Various gradient compression algorithms have been proposed to alleviate the communication bottleneck in distributed learning, and they have demonstrated effectiveness in terms of high compression ratios and theoretically low communication complexity. However, when it comes to practically training modern deep neural networks (DNNs), these algorithms have yet to match the inference performance of uncompressed SGD-momentum (SGDM) and adaptive optimizers (e.g., Adam). More importantly, recent studies suggest that these algorithms actually offer no speed advantages over SGDM/Adam when used with common distributed DNN training frameworks (e.g., DistributedDataParallel (DDP)) in typical settings, due to heavy compression/decompression computation, incompatibility with the efficient All-Reduce, or the requirement of uncompressed warmup at the early stage. For these reasons, we propose a novel 1-bit adaptive optimizer, dubbed Binary randomization adaptive optimizer (Birder). The quantization of Birder can be easily and lightly computed, and it does not require warmup with its uncompressed version in the beginning. Also, we devise Hierarchical-1-bit-All-Reduce to further lower the communication volume. We theoretically prove that it promises the same convergence rate as Adam. Extensive experiments, conducted on 8 to 64 GPUs (1 to 8 nodes) using DDP, demonstrate that Birder achieves comparable inference performance to uncompressed SGDM/Adam, with up to 2.5× speedup for training ResNet-50 and 6.3× speedup for training BERT-Base. Code is publicly available at https://openi.pcl.ac.cn/c2net_optim/Birder.

AAAI Conference 2023 Conference Paper

Joint Multimodal Entity-Relation Extraction Based on Edge-Enhanced Graph Alignment Network and Word-Pair Relation Tagging

  • Li Yuan
  • Yi Cai
  • Jin Wang
  • Qing Li

Multimodal named entity recognition (MNER) and multimodal relation extraction (MRE) are two fundamental subtasks in the multimodal knowledge graph construction task. However, the existing methods usually handle the two tasks independently, which ignores the bidirectional interaction between them. This paper is the first to propose jointly performing MNER and MRE as a joint multimodal entity-relation extraction (JMERE) task. Besides, current MNER and MRE models only consider aligning the visual objects with textual entities in visual and textual graphs but ignore the entity-entity relationships and object-object relationships. To address the above challenges, we propose an edge-enhanced graph alignment network with word-pair relation tagging (EEGA) for the JMERE task. Specifically, we first design a word-pair relation tagging to exploit the bidirectional interaction between MNER and MRE and avoid error propagation. Then, we propose an edge-enhanced graph alignment network to enhance the JMERE task by aligning nodes and edges in the cross-graph. Compared with previous methods, the proposed method can leverage the edge information to assist the alignment between objects and entities and to find the correlations between entity-entity relationships and object-object relationships. Experiments are conducted to show the effectiveness of our model.

AAAI Conference 2023 Conference Paper

Learning to Memorize Entailment and Discourse Relations for Persona-Consistent Dialogues

  • Ruijun Chen
  • Jin Wang
  • Liang-Chih Yu
  • Xuejie Zhang

Maintaining engagement and consistency is particularly important in dialogue systems. Existing works have improved the performance of dialogue systems by intentionally learning interlocutor personas with sophisticated network structures. One issue with this approach is that it requires more personal corpora with annotations. Additionally, these models typically perform the next utterance prediction to generate a response but neglect the discourse coherence in the entire conversation. To address these issues, this study proposes a method of learning to memorize entailment and discourse relations for persona-consistent dialogue tasks. Entailment text pairs in a natural language inference dataset were applied to learn latent entailment relations as external memories via a premise-to-hypothesis generation task. Furthermore, an internal memory with a similar architecture was applied to the discourse information in the dialogue. Placing orthogonality restrictions on these two memory spaces ensures that the latent entailment relations remain dialogue-independent. Both memories collaborate to obtain entailment and discourse representation for the generation, allowing a deeper understanding of both consistency and coherence. Experiments on two large public datasets, PersonaChat and DSTC7-AVSD, demonstrated the effectiveness of the proposed method. Both automatic and human evaluations indicate that the proposed model outperforms several strong baselines in terms of both persona consistency and response coherence. Our source code is available at https://github.com/Chenrj233/LMEDR.

AAAI Conference 2023 Conference Paper

Supervised Contrastive Few-Shot Learning for High-Frequency Time Series

  • Xi Chen
  • Cheng Ge
  • Ming Wang
  • Jin Wang

Significant progress has been made in representation learning, especially with recent success on self-supervised contrastive learning. However, for time series with less intuitive or semantic meaning, sampling bias may be inevitably encountered in unsupervised approaches. Although supervised contrastive learning has shown superior performance by leveraging label information, it may also suffer from class collapse. In this study, we consider a realistic industrial scenario with limited annotation information available. A supervised contrastive framework is developed for high-frequency time series representation and classification, wherein a novel variant of the supervised contrastive loss is proposed to include multiple augmentations while inducing spread within each class. Experiments on four mainstream public datasets, as well as a series of sensitivity and ablation analyses, demonstrate that the learned representations are effective and robust compared with direct supervised learning and self-supervised learning, notably in the minimal few-shot situation.

YNICL Journal 2022 Journal Article

Disrupted coupling between salience network segregation and glucose metabolism is associated with cognitive decline in Alzheimer's disease – A simultaneous resting-state FDG-PET/fMRI study

  • Miao Zhang
  • Ziyun Guan
  • Yaoyu Zhang
  • Wanqing Sun
  • Wenli Li
  • Jialin Hu
  • Binyin Li
  • Guanyu Ye

The aberrant organization and functioning of three core neurocognitive networks (NCNs), i.e., default-mode network (DMN), central executive network (CEN), and salience network (SN), are among the prominent features in Alzheimer's disease (AD). The dysregulation of both intra- and inter-network functional connectivities (FCs) of the three NCNs contributed to AD-related cognitive and behavioral abnormalities. Brain functional network segregation, integrating intra- and inter-network FCs, is essential for maintaining the energetic efficiency of brain metabolism. The association of brain functional network segregation, together with glucose metabolism, with age-related cognitive decline was recently shown. Yet how these joint functional-metabolic biomarkers relate to cognitive decline along with mild cognitive impairment (MCI) and AD remains to be elucidated. In this study, under the framework of the triple-network model, we performed a hybrid FDG-PET/fMRI study to evaluate the concurrent changes of resting-state brain intrinsic FCs and glucose metabolism of the three NCNs across cognitively normal (CN) (N = 24), MCI (N = 21), and AD (N = 21) groups. Lower network segregation and glucose metabolism were observed in all three NCNs in patients with AD. More interestingly, in the SN, the coupled relationship between network segregation and glucose metabolism existed in the CN group (r = 0.523, p = 0.013) and diminished in patients with MCI (r = 0.431, p = 0.065) and AD (r = 0.079, p = 0.748). Finally, the glucose metabolism of the DMN (r = 0.380, p = 0.017) and the network segregation of the SN (r = 0.363, p = 0.023) were significantly correlated with the general cognitive status of the patients. Our findings suggest that the impaired SN segregation and its uncoupled relationship with glucose metabolism contribute to the cognitive decline in AD.

AAAI Conference 2022 Conference Paper

Interpretable Generative Adversarial Networks

  • Chao Li
  • Kelu Yao
  • Jin Wang
  • Boyu Diao
  • Yongjun Xu
  • Quanshi Zhang

Learning a disentangled representation is still a challenge in the field of the interpretability of generative adversarial networks (GANs). This paper proposes a generic method to modify a traditional GAN into an interpretable GAN, which ensures that filters in an intermediate layer of the generator encode disentangled localized visual concepts. Each filter in the layer is supposed to consistently generate image regions corresponding to the same visual concept when generating different images. The interpretable GAN learns to automatically discover meaningful visual concepts without any annotations of visual concepts. The interpretable GAN enables people to modify a specific visual concept on generated images by manipulating feature maps of the corresponding filters in the layer. Our method can be broadly applied to different types of GANs. Experiments have demonstrated the effectiveness of our method.

YNIMG Journal 2021 Journal Article

Reciprocal relations between reading skill and the neural basis of phonological awareness in 7- to 9-year-old children

  • Jin Wang
  • Julia Pines
  • Marc Joanisse
  • James R. Booth

By using a longitudinal design and functional magnetic resonance imaging (fMRI), our previous study (Wang et al., 2020) found a scaffolding effect of early phonological processing in the superior temporal gyrus (STG) in 6-year-old children on later behavioral reading skill in 7.5-year-old children. Other than this previous study, nothing is known about longitudinal change in the bidirectional relation between reading skill and phonological processing in the brain. To fill this gap, in the current study, we used the same experimental paradigm as in Wang et al. (2020) to measure children's reading skill and brain activity during an auditory phonological awareness task, but with children who were 7.5 years old at Time 1 (T1) and about 1.5 years later when they were 9 years old at Time 2 (T2). The phonological awareness task included both small grain (i.e., onset) and large grain (i.e., rhyme) conditions. In a univariate analysis, we found that better reading skill at T1 predicted lower brain activation in IFG at T2 for onset processing after controlling for brain activation and non-verbal IQ at T1. This suggests that early reading ability reduces the effort of phonemic access, thus supporting the refinement hypothesis. When using general psychophysiological interaction (gPPI), we found that higher functional connectivity from IFG to STG for rhyme processing at T1 predicted better reading skill at T2 after controlling for reading skill and non-verbal IQ at T1. This suggests that the early effectiveness of accessing rhyme representations scaffolds reading acquisition. As both results did not survive multiple comparison corrections, replication of these findings is needed. However, both findings are consistent with prior studies demonstrating that phonological access in the frontal lobe becomes important in older elementary school readers. 
Moreover, the refinement effect for onsets is consistent with the hypothesis that learning to read allows for better access to small grain phonology, and the scaffolding effect for rhymes supports the idea that reading progresses to larger grain orthography-to-phonology mapping in older skilled readers. The current study, along with our previous study on younger children, indicates that the development of reading skill is associated with (1) the early importance of the quality of phonological representations to the later access of these representations, and (2) the early importance of small grain sizes to the later development of large grain ones.

IJCAI Conference 2020 Conference Paper

Discovering Subsequence Patterns for Next POI Recommendation

  • Kangzhi Zhao
  • Yong Zhang
  • Hongzhi Yin
  • Jin Wang
  • Kai Zheng
  • Xiaofang Zhou
  • Chunxiao Xing

Next Point-of-Interest (POI) recommendation plays an important role in location-based services. State-of-the-art methods learn the POI-level sequential patterns in the user's check-in sequence but ignore the subsequence patterns that often represent the socio-economic activities or the coherence of preferences of the users. However, it is challenging to integrate the semantic subsequences due to the difficulty of predefining the granularity of the complex but meaningful subsequences. In this paper, we propose the Adaptive Sequence Partitioner with Power-law Attention (ASPPA) to automatically identify each semantic subsequence of POIs and discover their sequential patterns. Our model adopts a state-based stacked recurrent neural network to hierarchically learn the latent structures of the user's check-in sequence. We also design a power-law attention mechanism to integrate domain knowledge in spatial and temporal contexts. Extensive experiments on two real-world datasets demonstrate the effectiveness of our model.
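The abstract does not specify the functional form of the power-law attention; a minimal sketch of one plausible reading, in which attention over candidate check-ins decays as a power law of their spatial or temporal distance (the function name, the `alpha` exponent, and the `1 + d` offset are all assumptions, not the paper's definition):

```python
def power_law_attention(distances, alpha=1.5):
    """Hypothetical power-law attention: weight each candidate check-in by
    a power-law decay of its distance from the current context, then
    normalize the weights into a distribution. `alpha` controls how
    sharply attention falls off with distance."""
    scores = [(1.0 + d) ** (-alpha) for d in distances]
    total = sum(scores)
    return [s / total for s in scores]
```

Under this reading, nearby check-ins dominate the attention distribution, mirroring the power-law distance decay commonly observed in human mobility data.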

YNIMG Journal 2020 Journal Article

Neural representations of phonology in temporal cortex scaffold longitudinal reading gains in 5- to 7-year-old children

  • Jin Wang
  • Marc F. Joanisse
  • James R. Booth

The objective of this study was to investigate whether phonological processes measured through brain activation are crucial for the development of reading skill (i.e., the scaffolding hypothesis) and/or whether learning to read words fine-tunes phonology in the brain (i.e., the refinement hypothesis). We specifically looked at how different grain sizes in two brain regions implicated in phonological processing played a role in this bidirectional relation. According to the dual-stream model of speech processing and previous empirical studies, the posterior superior temporal gyrus (STG) appears to be a perceptual region associated with phonological representations, whereas the dorsal inferior frontal gyrus (IFG) appears to be an articulatory region that accesses phonological representations in STG during more difficult tasks. 36 children completed a reading test outside the scanner and an auditory phonological task which included both small (i.e., onset) and large (i.e., rhyme) grain size conditions inside the scanner when they were 5.5–6.5 years old (Time 1) and once again approximately 1.5 years later (Time 2). To study the scaffolding hypothesis, a regression analysis was carried out by entering brain activation in either STG or IFG for either small (onset > perceptual) or large (rhyme > perceptual) grain size phonological processing at T1 as the predictors and reading skill at T2 as the dependent measure, with several covariates of no interest included. To study the refinement hypothesis, the regression analysis included reading skill at T1 as the predictor and brain activation in either STG or IFG for either small or large grain size phonological processing at T2 as the dependent measures, with several covariates of no interest included. We found that only posterior STG, regardless of grain size, was predictive of reading gains. Parallel models with only behavioral accuracy were not significant.
Taken together, our results suggest that the representational quality of phonology in temporal cortex is crucial for reading development. Moreover, our study provides neural evidence supporting the scaffolding hypothesis, suggesting that brain measures of phonology could be helpful in early identification of reading difficulties.

IJCAI Conference 2019 Conference Paper

Hierarchical Inter-Attention Network for Document Classification with Multi-Task Learning

  • Bing Tian
  • Yong Zhang
  • Jin Wang
  • Chunxiao Xing

Document classification is an essential task in many real world applications. Existing approaches adopt both text semantics and document structure to obtain the document representation. However, these models usually require a large collection of annotated training instances, which are not always feasible, especially in low-resource settings. In this paper, we propose a multi-task learning framework to jointly train multiple related document classification tasks. We devise a hierarchical architecture to make use of the shared knowledge from all tasks to enhance the document representation of each task. We further propose an inter-attention approach to improve the task-specific modeling of documents with global information. Experimental results on 15 public datasets demonstrate the benefits of our proposed model.

TIST Journal 2019 Journal Article

Large-Scale Frequent Episode Mining from Complex Event Sequences with Hierarchies

  • Xiang Ao
  • Haoran Shi
  • Jin Wang
  • Luo Zuo
  • Hongwei Li
  • Qing He

Frequent Episode Mining (FEM), which aims at mining frequent sub-sequences from a single long event sequence, is one of the essential building blocks of the sequence mining research field. Existing studies of FEM suffer from unsatisfactory scalability when faced with complex sequences, since testing whether an episode occurs in a sequence is an NP-complete problem. In this article, we propose a scalable, distributed framework to support FEM on “big” event sequences. Here, “big” means an event sequence that is either very long or contains masses of simultaneous events. Meanwhile, the events in this article are arranged in a predefined hierarchy, which derives abstract events that can form episodes not directly appearing in the input sequence. Specifically, we devise an event-centered and hierarchy-aware partitioning strategy to allocate events from different levels of the hierarchy to local processes. We then present an efficient special-purpose algorithm to improve local mining performance. We also extend our framework to support maximal and closed episode mining in the context of event hierarchies; to the best of our knowledge, ours is the first attempt to define and discover hierarchy-aware maximal and closed episodes. We implement the proposed framework on Apache Spark and conduct experiments on both synthetic and real-world datasets. Experimental results demonstrate the efficiency and scalability of the proposed approach and show that we can find practical patterns when taking event hierarchies into account.

IJCAI Conference 2019 Conference Paper

Learn Smart with Less: Building Better Online Decision Trees with Fewer Training Examples

  • Ariyam Das
  • Jin Wang
  • Sahil M. Gandhi
  • Jae Lee
  • Wei Wang
  • Carlo Zaniolo

Online decision tree models are extensively used in many industrial machine learning applications for real-time classification tasks. These models are highly accurate, scalable and easy to use in practice. The Very Fast Decision Tree (VFDT) is the classic online decision tree induction model that has been widely adopted due to its theoretical guarantees as well as competitive performance. However, VFDT and its variants solely rely on conservative statistical measures like Hoeffding bound to incrementally grow the tree. This makes these models extremely circumspect and limits their ability to learn fast. In this paper, we efficiently employ statistical resampling techniques to build an online tree faster using fewer examples. We first theoretically show that a naive implementation of resampling techniques like non-parametric bootstrap does not scale due to large memory and computational overheads. We mitigate this by proposing a robust memory-efficient bootstrap simulation heuristic (Mem-ES) that successfully expedites the learning process. Experimental results on both synthetic data and large-scale real world datasets demonstrate the efficiency and effectiveness of our proposed technique.
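The Hoeffding bound that VFDT relies on is standard: after n independent observations of a quantity with range R, the true mean differs from the observed mean by at most ε = √(R² ln(1/δ) / 2n) with probability 1 − δ. A minimal sketch of the resulting split test (the function names and the shape of the gain comparison are illustrative, not the paper's code):

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Epsilon such that, with probability 1 - delta, the true mean of a
    variable with the given range lies within epsilon of the mean of n
    i.i.d. observations."""
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2 * n))

def should_split(gain_best: float, gain_second: float,
                 value_range: float, delta: float, n: int) -> bool:
    """VFDT-style test: split only when the observed advantage of the best
    attribute over the runner-up exceeds the Hoeffding bound, i.e. the
    ranking is unlikely to flip with more data."""
    return (gain_best - gain_second) > hoeffding_bound(value_range, delta, n)
```

Because ε shrinks only as 1/√n, the tree agrees to split only after many examples accumulate; this is exactly the conservatism that Mem-ES aims to overcome with bootstrap resampling.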

AIIM Journal 2018 Journal Article

Approximate dynamic programming approaches for appointment scheduling with patient preferences

  • Xin Li
  • Jin Wang
  • Richard Y.K. Fung

During the appointment booking process in out-patient departments, the level of patient satisfaction can be affected by whether or not their preferences can be met, including the choice of physicians and preferred time slot. In addition, because the appointments are sequential, considering future possible requests is also necessary for a successful appointment system. This paper proposes a Markov decision process model for optimizing the scheduling of sequential appointments with patient preferences. In contrast to existing models, the evaluation of a booking decision in this model focuses on the extent to which preferences are satisfied. Characteristics of the model are analysed to develop a system for formulating booking policies. Based on these characteristics, two types of approximate dynamic programming algorithms are developed to avoid the curse of dimensionality. Experimental results suggest directions for further fine-tuning of the model, as well as improving the efficiency of the two proposed algorithms.

IJCAI Conference 2018 Conference Paper

Beyond Polarity: Interpretable Financial Sentiment Analysis with Hierarchical Query-driven Attention

  • Ling Luo
  • Xiang Ao
  • Feiyang Pan
  • Jin Wang
  • Tong Zhao
  • Ningzi Yu
  • Qing He

Sentiment analysis has played a significant role in financial applications in recent years. The informational and emotive aspects of news texts may affect the prices, volatilities, volume of trades, and even potential risks of financial subjects. Previous studies in this field mainly focused on identifying polarity (e.g., positive or negative). However, because financial decisions broadly require justification, polarity alone cannot provide enough evidence for human decision-making processes. Hence an explainable solution is in urgent demand. In this paper, we present an interpretable neural net framework for financial sentiment analysis. First, we design a hierarchical model to learn the representation of a document at multiple granularities. In addition, we propose a query-driven attention mechanism to suit the unique characteristics of financial documents. With domain-specific questions provided by financial analysts, we can discover different spotlights for queries from different aspects. We conduct extensive experiments on a real-world dataset. The results demonstrate that our framework can learn better representations of the document and unearth meaningful clues for addressing different users' preferences. It also outperforms the state-of-the-art methods on sentiment prediction of financial documents.

IJCAI Conference 2017 Conference Paper

Combining Knowledge with Deep Convolutional Neural Networks for Short Text Classification

  • Jin Wang
  • Zhongyuan Wang
  • Dawei Zhang
  • Jun Yan

Text classification is a fundamental task in NLP applications. Most existing work relies on either explicit or implicit text representation to address this problem. While these techniques work well for sentences, they cannot easily be applied to short text because of its shortness and sparsity. In this paper, we propose a framework based on convolutional neural networks that combines explicit and implicit representations of short text for classification. We first conceptualize a short text as a set of relevant concepts using a large taxonomy knowledge base. We then obtain the embedding of the short text by coalescing the words and relevant concepts on top of pre-trained word vectors. We further incorporate character-level features into our model to capture fine-grained subword information. Experimental results on five commonly used datasets show that our proposed method significantly outperforms state-of-the-art methods.

AIIM Journal 2015 Journal Article

Adaptive dynamic programming algorithms for sequential appointment scheduling with patient preferences

  • Jin Wang
  • Richard Y.K. Fung

Objectives: A well-developed appointment system can help increase the utilization of medical facilities in an outpatient department. This paper outlines the development of an appointment system that can make an outpatient department work more efficiently and improve the patient satisfaction level. Methods: A Markov decision process model is proposed to schedule sequential appointments with consideration of patient preferences in order to maximize the patient satisfaction level. Adaptive dynamic programming algorithms are developed to avoid the curse of dimensionality. These algorithms can dynamically capture patient preferences, update the value of being in a state, and thus improve the appointment decisions. Results: Experiments were conducted to investigate the performance of the algorithms. The convergence behaviors under different settings, including the number of iterations needed for convergence and the accuracy of results, were examined. Bias-adjusted Kalman filter step-sizes were found to lead to the best convergence behavior, which stabilized within 5000 iterations. As for the effects of exploration and exploitation, the best convergence behavior resulted when the probability of taking a myopically optimal action equaled 0.9. The performance of the value function approximation algorithm was greatly affected by the combination of basis functions. Under different combinations, errors varied from 2.7% to 8.3%. More preferences resulted in faster convergence, but required longer computation time. Conclusions: System parameters are adaptively updated as bookings are confirmed. The proposed appointment scheduling system could certainly contribute to a better patient satisfaction level during the booking periods.
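The exploration/exploitation setting reported in the abstract (taking the myopically optimal action with probability 0.9) corresponds to a standard epsilon-greedy rule; a minimal sketch under that assumption (the function name and the uniform-random exploration choice are illustrative, not the paper's algorithm):

```python
import random

def choose_action(values, explore_prob=0.1, rng=random):
    """Epsilon-greedy selection over per-action value estimates: with
    probability explore_prob pick a uniformly random action (explore);
    otherwise pick the myopically optimal one (exploit)."""
    if rng.random() < explore_prob:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])
```

With `explore_prob = 0.1` the scheduler exploits its current value estimates 90% of the time, matching the setting the experiments found best.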

JBHI Journal 2013 Journal Article

Thermal Imaging as a Biometrics Approach to Facial Signature Authentication

  • A. M. Guzman
  • M. Goryawala
  • Jin Wang
  • A. Barreto
  • J. Andrian
  • N. Rishe
  • M. Adjouadi

A new thermal imaging framework with unique feature extraction and similarity measurements for face recognition is presented. The research premise is to design specialized algorithms that would extract vasculature information, create a thermal facial signature, and identify the individual. The proposed algorithm is fully integrated and consolidates the critical steps of feature extraction through the use of morphological operators, registration using the Linear Image Registration Tool, and matching through unique similarity measures designed for this task. The novel approach at developing a thermal signature template using four images taken at various instants of time ensured that unforeseen changes in the vasculature over time did not affect the biometric matching process as the authentication process relied only on consistent thermal features. Thirteen subjects were used for testing the developed technique on an in-house thermal imaging system. The matching using the similarity measures showed an average accuracy of 88.46% for skeletonized signatures and 90.39% for anisotropically diffused signatures. The highly accurate results obtained in the matching process clearly demonstrate the ability of the thermal infrared system to extend in application to other thermal-imaging-based systems. Empirical results applying this approach to an existing database of thermal images prove this assertion.