Arrow Research search

Author name cluster

Xuan Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

31 papers
2 author rows

Possible papers

31

EAAI Journal 2026 Journal Article

An intelligent vision-based method for real-time pig disease identification through postural feature analysis

  • Zhe Yin
  • Yue Cao
  • Hong Feng
  • Qiqi Guo
  • Xuan Wang
  • Zhenyu Liu

The health of pig populations is critical to production efficiency and economic viability. By integrating postural characteristics indicative of disease, a model capable of accommodating complex postural variations can facilitate non-contact, low-cost, real-time disease monitoring in pigs. Building on the baseline You Only Look Once Version 10 (YOLOv10) deep learning model, this study proposes an improved model with three core innovations: a lightweight backbone network, spatial and channel reconstruction convolution, and large-scale keypoint attention. The lightweight backbone enhances feature extraction for subtle postural cues, the attention mechanism strengthens focus on key disease postures, and the optimised feature fusion structure improves feature representation and robustness to complex posture variations. Compared with the baseline model, the proposed method demonstrates significantly improved performance. It achieves a mean average precision of 97.66 per cent, corresponding to a 3.9 percentage point increase over the baseline. In particular, the detection precision for African swine fever improves from 94.7 per cent to 98.9 per cent, while the harmonic mean score reaches 91.98 per cent, reflecting a 3.72 percentage point improvement. Despite these notable gains in accuracy, the proposed method reduces the parameter count by 1.80 million and the computational complexity by 1.5 giga floating-point operations, maintaining high computational efficiency without sacrificing detection performance. The results demonstrate the potential practicality of the proposed method for real-time pig disease detection and provide a reliable technical basis for future deployment in intelligent livestock farming systems.

AAAI Conference 2026 Short Paper

BDI-based Opponent Modeling and Strategy Generation for Multi-Issue Negotiation (Student Abstract)

  • Tianzi Ma
  • Yulin Wu
  • Hang Ren
  • Xiaozhen Sun
  • Shuhan Qi
  • Xuan Wang

Accurately modeling opponent behaviors and integrating strategy are key challenges for multi-issue automated negotiation. Existing approaches often isolate preference learning or trend prediction and lack a unified cognitive structure with coordinated reasoning. This paper proposes a BDI (Belief-Desire-Intention)-based opponent modeling and strategy generation framework. The framework analyzes opponent responses (Belief), predicts preference weights and the utility function (Desire), and infers utilities of future offers (Intention). Building on these predictions, we design a responsive strategy, enabling gradual concessions and balanced outcomes. Our main contributions are: D-MBUE in the Desire module, I-DABI in the Intention module, and the BDI Negotiator on top of the modeling modules. Experiments on 45 standard negotiation domains and against 12 representative opponents demonstrate the effectiveness of our BDI framework.

AAAI Conference 2026 Conference Paper

BulletTime4D: Towards High Spatio-Temporal Resolution Dynamic Scene Rendering via Spike-Guided Stereo Vision

  • Yiqian Chang
  • Haoran Xu
  • Qinghong Ye
  • Jianing Li
  • Xuan Wang
  • Wei Zhang
  • Peixi Peng

High spatio‑temporal resolution novel‑view scene rendering is crucial for applications such as sports analysis and scientific experiments. However, existing Dynamic Scene Rendering (DSR) approaches typically rely on conventional RGB cameras with limited frame rates, making it difficult to achieve high spatio‑temporal resolution. In this paper, we present BulletTime4D, a high spatio‑temporal resolution DSR framework, which is the first trial to integrate a spike camera with binocular RGB cameras for dynamic scene reconstruction. Specifically, we first develop a hybrid camera prototype and build a real‑world dynamic scene reconstruction dataset. Then, BulletTime4D presents a multi‑timescale deformation representation by combining low‑frequency spatio‑temporal features with high‑frequency inter‑frame motion features. Finally, a rendering network capable of projecting 4D Gaussians into the spike domain for spike rendering is designed, and a cross‑domain supervision strategy is proposed to achieve high‑frame‑rate texture and color rendering. The results show that BulletTime4D outperforms state‑of‑the‑art methods on both simulated and real‑world datasets. In addition, BulletTime4D can synthesize 300 FPS novel‑view renderings using stereo RGB cameras at 30 FPS and a single spike camera.

EAAI Journal 2026 Journal Article

Data-driven robust topology optimization using surrogate modeling, model reduction, and machine learning

  • Xuan Wang
  • Weiqi Ji

This paper presents a novel data-driven framework for robust topology optimization (RTO) under load uncertainty. The proposed methodology synergistically integrates model reduction, surrogate modeling, and machine learning (ML) to efficiently solve the computationally demanding RTO problem. A Polynomial Chaos Expansion (PCE) surrogate model is employed to accurately compute the statistical moments of the stochastic compliance response required for robust optimization. To drastically reduce computational cost, a linearity-assumption-enhanced substructuring method serves as the model reduction technique. Crucially, a physics-enhanced Artificial Neural Network (ANN) is developed to predict substructure shape functions in real-time, enabling rapid online evaluations during the optimization loop. The effectiveness and superiority of the proposed data-driven approach are rigorously demonstrated through comprehensive comparisons against the full Finite Element Analysis (FEA) method and the linearity-assumption-enhanced substructuring method, and validated using Monte Carlo simulations. Results confirm that the framework achieves significant computational savings while maintaining high accuracy in large-scale robust topology design under uncertainty.
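As an illustrative aside (not the authors' code): when the PCE uses an orthonormal basis, the statistical moments the abstract refers to follow directly from the expansion coefficients, which is what makes the surrogate cheap inside an optimization loop. The coefficient values below are hypothetical.

```python
import numpy as np

def pce_moments(coeffs):
    """Mean and variance of a response expanded in an *orthonormal*
    polynomial chaos basis: the mean is the zeroth coefficient, the
    variance is the sum of squares of the remaining coefficients."""
    c = np.asarray(coeffs, dtype=float)
    return c[0], float(np.sum(c[1:] ** 2))

# Hypothetical compliance expansion coefficients (illustration only).
mean, var = pce_moments([2.0, 0.5, 0.1])
# mean = 2.0, var = 0.5**2 + 0.1**2 = 0.26
```

In an RTO objective these two moments would typically be combined as a weighted sum of mean and standard deviation; fitting the coefficients themselves is the part this sketch does not cover.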

JBHI Journal 2026 Journal Article

Improving 3D Thin Vessel Segmentation in Brain TOF-MRA via a Dual-Space Context-Aware Network

  • Wenqi Shan
  • Xudong Li
  • Xuan Wang
  • Qiang Li
  • Zhiwei Wang

3D cerebrovascular segmentation poses a significant challenge, akin to locating a line within a vast 3D environment. This complexity can be substantially reduced by projecting the vessels onto a 2D plane, enabling easier segmentation. In this paper, we create a vessel-segmentation-friendly space using a clinical visualization technique called maximum intensity projection (MIP). Leveraging this, we propose a Dual-space Context-Aware Network (DCANet) for 3D vessel segmentation, designed to capture even the finest vessel structures accurately. DCANet begins by transforming a magnetic resonance angiography (MRA) volume into a 3D Regional-MIP volume, where each Regional-MIP slice is constructed by projecting adjacent MRA slices. This transformation highlights vessels as prominent continuous curves rather than the small circular or ellipsoidal cross-sections seen in MRA slices. DCANet encodes vessels separately in the MRA and the projected Regional-MIP spaces and introduces the Regional-MIP Image Fusion Block (MIFB) between these dual spaces to selectively integrate contextual features from Regional-MIP into MRA. Following dual-space encoding, DCANet employs a Dual-mask Spatial Guidance TransFormer (DSGFormer) decoder to focus on vessel regions while effectively excluding background areas, which reduces the learning burden and improves segmentation accuracy. We benchmark DCANet on four datasets: two public datasets, TubeTK and IXI-IOP, and two in-house datasets, Xiehe and IXI-HH. The results demonstrate that DCANet achieves superior performance, with improvements in average DSC values of at least 2.26%, 2.17%, 2.62%, and 2.58% for thin vessels, respectively.
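The Regional-MIP construction the abstract describes (each output slice is a maximum intensity projection over a window of adjacent input slices) can be sketched in a few lines; the window size here is an assumption for illustration, not the paper's setting:

```python
import numpy as np

def regional_mip(volume, window=5):
    """Build a Regional-MIP volume: each output slice is the maximum
    intensity projection over `window` adjacent axial slices.
    A hedged sketch of the projection step only, not DCANet itself."""
    depth = volume.shape[0]
    half = window // 2
    out = np.empty_like(volume)
    for z in range(depth):
        lo, hi = max(0, z - half), min(depth, z + half + 1)
        out[z] = volume[lo:hi].max(axis=0)
    return out

# A bright voxel on slice 3 shows up in every projected slice whose
# window reaches slice 3, turning point-like cross-sections into curves.
vol = np.zeros((8, 4, 4))
vol[3, 1, 1] = 1.0
mip = regional_mip(vol, window=5)
```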

AAAI Conference 2026 Conference Paper

JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation

  • Zhenyu Bi
  • Gaurav Srivastava
  • Yang Li
  • Swastik Roy
  • Meng Lu
  • Morteza Ziyadi
  • Xuan Wang

While small language models (SLMs) have shown promise on various reasoning tasks, their ability to judge the correctness of answers remains unclear compared to large language models (LLMs). Prior work on LLM-as-a-judge frameworks typically relies on comparing candidate answers against ground-truth labels or other candidate answers using predefined metrics like entailment. However, this approach is inherently indirect and difficult to fully automate, offering limited support for fine-grained and scalable evaluation of reasoning outputs. In this work, we propose JudgeBoard, a novel evaluation pipeline that directly queries models to assess the correctness of candidate answers without requiring extra answer comparisons. We focus on two core reasoning domains: mathematical reasoning and science/commonsense reasoning, and construct task-specific evaluation leaderboards using both accuracy-based ranking and an Elo-based rating system across five benchmark datasets, enabling consistent model comparison as judges rather than comparators. To improve judgment performance in lightweight models, we propose MAJ (Multi-Agent Judging), a novel multi-agent evaluation framework that leverages multiple interacting SLMs with distinct reasoning profiles to approximate LLM-level judgment accuracy through collaborative deliberation. Experimental results reveal a significant performance gap between SLMs and LLMs in isolated judging tasks. However, our MAJ framework substantially improves the reliability and consistency of SLMs. On the MATH dataset, MAJ using smaller models as backbones performs comparably to, or even better than, their larger counterparts. Our findings highlight that multi-agent SLM systems can potentially match or exceed LLM performance in judgment tasks, with implications for scalable and efficient assessment.
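For readers unfamiliar with Elo-based rating as used for leaderboards, a single generic rating step looks like the following; JudgeBoard's exact parameters are not stated in the abstract, so the K-factor of 32 is an assumption:

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update after a pairwise comparison between judges A and B.
    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    A generic sketch of an Elo step, not JudgeBoard's exact scheme."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    new_a = r_a + k * (score_a - expected_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return new_a, new_b

# With equal ratings, a win moves each rating by k/2.
a, b = elo_update(1500.0, 1500.0, 1.0)
# a = 1516.0, b = 1484.0
```

Iterating such updates over many pairwise judging outcomes yields a ranking that is comparable across models evaluated against different opponents.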

YNIMG Journal 2026 Journal Article

The amplitude and latency of the earliest signal in V1 encode bottom-up saliency by feature conjunction

  • Chen Wu
  • Xiaoning Li
  • Huan Li
  • Xuan Wang
  • Ziang Yin
  • Zeyu Wang
  • Peng Zhang
  • Zhikuan Yang

The neural origin of bottom-up saliency for exogenous attention remains highly controversial. In this study, we investigated whether the earliest activity in the primary visual cortex (V1) encodes saliency signals defined by the eye-of-origin and feature-conjunction information. Electroencephalography (EEG) recordings from the human occipital cortex revealed early responses to eye-of-origin (E) and/or orientation (O) singletons, with larger response amplitudes to the double-feature (EO) singletons. The short onset latency (58-70 ms) and polarity reversal of the responses indicate an origin in the early visual areas, particularly V1. Importantly, the latency and amplitude of these responses predicted behavioral detection performance. Together, these findings suggest that the timing and amplitude of the earliest signals in V1 represent the saliency of combined feature contrasts for bottom-up attention. These signals are unlikely to originate from projections of other proposed source areas of saliency, given the scarcity of monocular neurons needed to process eye-of-origin information.

AAAI Conference 2026 Conference Paper

X-MoGen: Unified Motion Generation Across Humans and Animals

  • Xuan Wang
  • Kai Ruan
  • Liyang Qian
  • Guo Zhi Zhi
  • Chang Su
  • Gaoang Wang

Text-driven motion generation has attracted increasing attention due to its broad applications in virtual reality, animation, and robotics. While existing methods typically model human and animal motion separately, a joint cross-species approach offers key advantages, such as a unified representation and improved generalization. However, morphological differences across species remain a key challenge, often compromising motion plausibility. To address this, we propose X-MoGen, the first unified framework for cross-species text-driven motion generation covering both humans and animals. X-MoGen adopts a two-stage architecture. First, a conditional graph variational autoencoder learns canonical T-pose priors, while an autoencoder encodes motion into a shared latent space regularized by morphological loss. In the second stage, we perform masked motion modeling to generate motion embeddings conditioned on textual descriptions. During training, a morphological consistency module is employed to promote skeletal plausibility across species. To support unified modeling, we construct UniMo4D, a large-scale dataset of 115 species and 119k motion sequences, which integrates human and animal motions under a shared skeletal topology for joint training. Extensive experiments on UniMo4D demonstrate that X-MoGen outperforms state-of-the-art methods on both seen and unseen species.

JBHI Journal 2025 Journal Article

A Novel Dynamic Latent Variables-Based Framework for Enhancing Freezing of Gait Detection in Parkinson's Disease Patients

  • Xuan Wang
  • Lisha Yu
  • S. Joe Qin
  • Yang Zhao

Freezing of Gait (FOG) is one of the most severe symptoms of Parkinson's disease (PD) and often leads to life-threatening falls. Wearable sensor-based technologies coupled with data-driven methods have advanced the timely detection of FOG. However, most existing monitoring methods overlook the dynamics of processes when extracting effective information from high-dimensional sensor data. To tackle these problems, we develop a novel framework for FOG detection by integrating Dynamic Latent Variable (DLV)-based dimensionality reduction strategies and personalized monitoring. First, a multi-channel sliding window mechanism is adopted to extract multiple potentially effective feature sequences. Second, an interpretable DLV-based method incorporating time-lagged terms is designed for the subspace representation of complex high-dimensional sequences. Third, the extracted DLVs are integrated with threshold-based methods or the Statistical Process Control (SPC) method for anomaly detection. We identified distinct variations in gait patterns among individuals, underscoring the importance of personalized approaches. The proposed framework demonstrates its effectiveness in FOG detection through validation on a real-world dataset, achieving a sensitivity of $\mathbf{0.845} \pm \mathbf{0.254}$ and a specificity of $\mathbf{0.842} \pm \mathbf{0.211}$.
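The first stage, a multi-channel sliding window over the sensor stream, can be sketched as follows; the window width and step are illustrative assumptions, not the authors' settings:

```python
import numpy as np

def sliding_windows(signal, width, step):
    """Split a (time, channels) sensor stream into overlapping windows,
    the raw material for per-window feature sequences. A hedged sketch
    under assumed parameters, not the paper's implementation."""
    windows = [signal[s:s + width]
               for s in range(0, len(signal) - width + 1, step)]
    return np.stack(windows)  # shape: (n_windows, width, channels)

# 100 samples from 3 sensor channels, 20-sample windows with 50% overlap.
x = np.random.default_rng(0).normal(size=(100, 3))
w = sliding_windows(x, width=20, step=10)
# w.shape == (9, 20, 3)
```

Each window would then be reduced to features and passed to the DLV model, which additionally accounts for time-lagged dependence between consecutive windows.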

ICML Conference 2025 Conference Paper

Better to Teach than to Give: Domain Generalized Semantic Segmentation via Agent Queries with Diffusion Model Guidance

  • Fan Li
  • Xuan Wang
  • Min Qi
  • Zhaoxiang Zhang 0002
  • Yuelei Xu

Domain Generalized Semantic Segmentation (DGSS) trains a model on a labeled source domain to generalize to unseen target domains with consistent contextual distribution and varying visual appearance. Most existing methods rely on domain randomization or data generation but struggle to capture the underlying scene distribution, resulting in the loss of useful semantic information. Inspired by the diffusion model’s capability to generate diverse variations within a given scene context, we consider harnessing its rich prior knowledge of scene distribution to tackle the challenging DGSS task. In this paper, we propose a novel agent Query-driven learning framework based on Diffusion model guidance for DGSS, named QueryDiff. Our recipe comprises three key ingredients: (1) generating agent queries from segmentation features to aggregate semantic information about instances within the scene; (2) learning the inherent semantic distribution of the scene through agent queries guided by diffusion features; (3) refining segmentation features using optimized agent queries for robust mask predictions. Extensive experiments across various settings demonstrate that our method significantly outperforms previous state-of-the-art methods. Notably, it enhances the model’s ability to generalize effectively to extreme domains, such as cubist art styles. Code is available at https://github.com/FanLiHub/QueryDiff.

UAI Conference 2025 Conference Paper

Enhanced Equilibria-Solving via Private Information Pre-Branch Structure in Adversarial Team Games

  • Chen Qiu
  • Haobo Fu
  • Kai Li
  • Jiajia Zhang
  • Xuan Wang

In ex ante coordinated adversarial team games (ATGs), a team competes against an adversary, and team members can only coordinate their strategies before the game starts. The team-maxmin equilibrium with correlation (TMECor) is a suitable solution concept for extensive-form sequential ATGs. One class of TMECor-solving methods transforms the problem into solving NE in two-player zero-sum games, leveraging well-established tools for the latter. However, existing methods are fundamentally action-based, resulting in poor generalizability and low solving efficiency due to the exponential growth in the size of the transformed game. To address the above issues, we propose an efficient game transformation method based on private information, where all team members are represented by a single coordinator. We designed a structure called private information pre-branch, which makes decisions considering all possible private information from teammates. We prove that the size of the game transformed by our method is exponentially reduced compared to the current state-of-the-art. Moreover, we demonstrate equilibria equivalence. Experimentally, our method achieves a significant speedup of 182.89$\times$ to 694.44$\times$ in scenarios where the current state-of-the-art method can work, such as small-scale Kuhn poker and Leduc poker. Furthermore, our method is applicable to larger games and those with dynamically changing private information, such as Goofspiel.

ICML Conference 2025 Conference Paper

Enhancing Target-unspecific Tasks through a Features Matrix

  • Fangming Cui
  • Yonggang Zhang 0003
  • Xuan Wang
  • Xinmei Tian 0001
  • Jun Yu 0002

Recent developments in prompt learning of large Vision-Language Models (VLMs) have significantly improved performance in target-specific tasks. However, these prompting methods often struggle to tackle target-unspecific or generalizable tasks effectively. This may be because overfitting during training causes the model to forget its general knowledge, which strongly benefits target-unspecific tasks. To alleviate this issue, we propose a novel Features Matrix (FM) approach designed to enhance these models on target-unspecific tasks. Our method extracts and leverages general knowledge, shaping a Features Matrix (FM). Specifically, the FM captures the semantics of diverse inputs from a deep and fine perspective, preserving essential general knowledge, which mitigates the risk of overfitting. Representative evaluations demonstrate that: 1) the FM is compatible with existing frameworks as a generic and flexible module, and 2) the FM is highly effective in enhancing target-unspecific tasks (base-to-novel generalization, domain generalization, and cross-dataset generalization), achieving state-of-the-art performance.

IROS Conference 2025 Conference Paper

Heterogeneous Mixed Traffic Control and Coordination

  • Iftekharul Islam
  • Weizi Li
  • Xuan Wang
  • Shuai Li
  • Kevin Heaslip

Urban intersections with diverse vehicle types, from small cars to large semi-trailers, pose significant challenges for traffic control. This study explores how robot vehicles (RVs) can enhance heterogeneous traffic flow, particularly at unsignalized intersections where traditional methods fail during power outages. Using reinforcement learning (RL) and real-world data, we simulate mixed traffic at complex intersections with RV penetration rates ranging from 10% to 90%. Results show that average waiting times drop by up to 86% and 91% compared to signalized and unsignalized intersections, respectively. We observe a "rarity advantage," where less frequent vehicles benefit the most (up to 87%). Although CO2 emissions and fuel consumption increase with RV penetration, they remain well below those of traditional signalized traffic. Decreased space headways also indicate more efficient road usage. These findings highlight RVs’ potential to improve traffic efficiency and reduce environmental impact in complex, heterogeneous settings.

IROS Conference 2025 Conference Paper

Joint Pedestrian and Vehicle Traffic Optimization in Urban Environments using Reinforcement Learning

  • Bibek Poudel
  • Xuan Wang
  • Weizi Li
  • Lei Zhu
  • Kevin Heaslip

Reinforcement learning (RL) holds significant promise for adaptive traffic signal control. While existing RL-based methods demonstrate effectiveness in reducing vehicular congestion, their predominant focus on vehicle-centric optimization leaves pedestrian mobility needs and safety challenges unaddressed. In this paper, we present a deep RL framework for adaptive control of eight traffic signals along a real-world urban corridor, jointly optimizing both pedestrian and vehicular efficiency. Our single-agent policy is trained using real-world pedestrian and vehicle demand data derived from Wi-Fi logs and video analysis. The results demonstrate significant performance improvements over traditional fixed-time signals, reducing average wait times per pedestrian and per vehicle by up to 67% and 52% respectively, while simultaneously decreasing total wait times for both groups by up to 67% and 53%. Additionally, our results demonstrate generalization capabilities across varying traffic demands, including conditions entirely unseen during training, validating RL’s potential for developing transportation systems that serve all road users.

IJCAI Conference 2025 Conference Paper

Learning Dynamical Coupled Operator For High-dimensional Black-box Partial Differential Equations

  • Yichi Wang
  • Tian Huang
  • Dandan Huang
  • Zhaohai Bai
  • Xuan Wang
  • Lin Ma
  • Haodi Zhang

Deep operator networks (DONs), a class of neural operators that learn mappings between function spaces, have recently emerged as surrogate models for parametric partial differential equations (PDEs). However, their full potential for accurately approximating general black-box PDEs remains underexplored due to challenges in training stability and performance, primarily arising from difficulties in learning mappings between low-dimensional inputs and high-dimensional outputs. Furthermore, inadequate encoding of input functions and query positions limits the generalization ability of DONs. To address these challenges, we propose the Dynamical Coupled Operator (DCO), which incorporates temporal dynamics to learn coupled functions, reducing information loss and improving training robustness. Additionally, we introduce an adaptive spectral input function encoder based on empirical mode decomposition to enhance input function representation, as well as a hybrid location encoder to improve query location encoding. We provide theoretical guarantees on the universal expressiveness of DCO, ensuring its applicability to a wide range of PDE problems. Extensive experiments on real-world, high-dimensional PDE datasets demonstrate that DCO significantly outperforms DONs.

NeurIPS Conference 2025 Conference Paper

Lie Detector: Unified Backdoor Detection via Cross-Examination Framework

  • Xuan Wang
  • Siyuan Liang
  • Dongping Liao
  • Han Fang
  • Aishan Liu
  • Xiaochun Cao
  • Yu-liang Lu
  • Ee-Chien Chang

Institutions with limited data and computing resources often outsource model training to third-party providers in a semi-honest setting, assuming adherence to prescribed training protocols with pre-defined learning paradigm (e.g., supervised or semi-supervised learning). However, this practice can introduce severe security risks, as adversaries may poison the training data to embed backdoors into the resulting model. Existing detection approaches predominantly rely on statistical analyses, which often fail to maintain universally accurate detection accuracy across different learning paradigms. To address this challenge, we propose a unified backdoor detection framework in the semi-honest setting that exploits cross-examination of model inconsistencies between two independent service providers. Specifically, we integrate central kernel alignment to enable robust feature similarity measurements across different model architectures and learning paradigms, thereby facilitating precise recovery and identification of backdoor triggers. We further introduce backdoor fine-tuning sensitivity analysis to distinguish backdoor triggers from adversarial perturbations, substantially reducing false positives. Extensive experiments demonstrate that our method achieves superior detection performance, improving accuracy by 4.4%, 1.7%, and 10.6% over SoTA baselines across supervised, self-supervised, and autoregressive learning tasks, respectively. Notably, it is the first to effectively detect backdoors in multimodal large language models, further highlighting its broad applicability and advancing secure deep learning.
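The kernel-alignment similarity the abstract names (commonly written "centered kernel alignment", CKA) has a simple linear form; this is a generic sketch of the similarity measure only, not the paper's detection pipeline:

```python
import numpy as np

def linear_cka(X, Y):
    """Linear Centered Kernel Alignment between two feature matrices of
    shape (samples, features). It compares representations up to
    isotropic scaling and orthogonal transforms, which is what makes it
    usable across different architectures."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(X.T @ Y, 'fro') ** 2
    den = np.linalg.norm(X.T @ X, 'fro') * np.linalg.norm(Y.T @ Y, 'fro')
    return num / den

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))
# A representation compared with a scaled copy of itself aligns
# perfectly: linear_cka(X, 2 * X) is 1.0 up to floating point.
```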

IROS Conference 2025 Conference Paper

MIAT: Maneuver-Intention-Aware Transformer for Spatio-Temporal Trajectory Prediction

  • Chandra Raskoti
  • Iftekharul Islam
  • Xuan Wang
  • Weizi Li

Accurate vehicle trajectory prediction is critical for safe and efficient autonomous driving, especially in mixed traffic environments where both human-driven and autonomous vehicles co-exist. However, uncertainties introduced by inherent driving behaviors—such as acceleration, deceleration, and left and right maneuvers—pose significant challenges for reliable trajectory prediction. We introduce a Maneuver-Intention-Aware Transformer (MIAT) architecture, which integrates a maneuver intention awareness control mechanism with spatiotemporal interaction modeling to enhance long-horizon trajectory predictions. We systematically investigate the impact of varying awareness of maneuver intention on both short- and long-horizon trajectory predictions. Evaluated on the real-world NGSIM dataset and benchmarked against various transformer- and LSTM-based methods, our approach achieves improvements of up to 4.7% in short-horizon predictions and 1.6% in long-horizon predictions compared to other intention-aware benchmark methods. Moreover, by leveraging the intention awareness control mechanism, MIAT realizes an 11.1% performance boost in long-horizon predictions, with a modest drop in short-horizon performance. The source code and datasets are available at https://github.com/cpraskoti/MIAT.

JBHI Journal 2025 Journal Article

MSTG-Transformer: Multivariate Spatial-Temporal Gated Transformer Model for 3D Skeleton Data-based Fall Risk Prediction

  • Junjie Cao
  • Xuan Wang
  • Keyi Huang
  • Lisha Yu
  • Xiaomao Fan
  • Yang Zhao

As the aging population continues to grow, falls among older adults have become a significant public health concern worldwide. Data-driven approaches for effective fall risk prediction, which integrate standard functional tests with 3D skeleton data from depth sensors, are gaining increasing attention. However, the complex physiological and functional interactions among skeletal keypoints during ambulation pose challenges for multidimensional feature extraction in most predictive models. In this study, we developed a novel approach based on preprocessed 3D skeleton data, named Multivariate Spatial-Temporal Gated Transformer (MSTG-Transformer). This approach consists of three main stages. First, gait cycle sequences are constructed to precisely depict the movement patterns of subjects, amplifying the distinctions between groups. Then, spatial and topological features are extracted via convolutional modules, and a dual-stream encoder block is employed to encode the features of 3D skeleton data across both time steps and time channels. Finally, a voting scheme is used to determine fall risk by integrating the classification results of individual gait cycle segments. Validation experiments on a real-world dataset demonstrate that our proposed approach outperforms classical methods, achieving a superior prediction accuracy of 0.9510 ± 0.0240. Additionally, our study highlights the crucial role of potential interactions between skeletal keypoints in accurately predicting fall risk.

NeurIPS Conference 2025 Conference Paper

No Object Is an Island: Enhancing 3D Semantic Segmentation Generalization with Diffusion Models

  • Fan Li
  • Xuan Wang
  • Xuanbin Wang
  • Zhaoxiang Zhang
  • Yuelei Xu

Enhancing the cross-domain generalization of 3D semantic segmentation is a pivotal task in computer vision that has recently gained increasing attention. Most existing methods, whether using consistency regularization or cross-modal feature fusion, focus solely on individual objects while overlooking implicit semantic dependencies among them, resulting in the loss of useful semantic information. Inspired by the diffusion model's ability to flexibly compose diverse objects into high-quality images across varying domains, we seek to harness its capacity for capturing underlying contextual distributions and spatial arrangements among objects to address the challenging task of cross-domain 3D semantic segmentation. In this paper, we propose a novel cross-modal learning framework based on diffusion models to enhance the generalization of 3D semantic segmentation, named XDiff3D. XDiff3D comprises three key ingredients: (1) constructing object agent queries from diffusion features to aggregate instance semantic information; (2) decoupling fine-grained local details from object agent queries to prevent interference with 3D semantic representation; (3) leveraging object agent queries as an interface to enhance the modeling of object semantic dependencies in 3D representations. Extensive experiments validate the effectiveness of our method, achieving state-of-the-art performance across multiple benchmarks in different task settings. Code is available at \url{https://github.com/FanLiHub/XDiff3D}.

EAAI Journal 2025 Journal Article

Optimizing strategy selection in hidden role games

  • Yingying Xu
  • Chen Qiu
  • Jinheng Xiao
  • Jiajia Zhang
  • Shuhan Qi
  • Xuan Wang

We address hidden-role decision making under uncertainty in The Resistance: Avalon. We present DeepBayes, which augments a standard Counterfactual Regret Minimization Plus (CFR+) decision procedure with two complementary inference components. First, a history-driven role assignment prediction network generates role-assignment hypotheses from past gameplay, which are used to improve the estimation of Counterfactual Values (CFVs). Second, a Bayesian Identity Recognition (BIR) method produces explicit posterior beliefs about opposing identities online as play unfolds. During CFR+ iterations, the algorithm selects actions by jointly considering the CFVs estimated under the generated role assignments and the posterior beliefs from BIR. In five-player Avalon experiments, DeepBayes achieves consistent gains in win rate over strong baselines.
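At its core, the Bayesian Identity Recognition step is a posterior update over role-assignment hypotheses as actions are observed. A minimal sketch of one such update, with hypothetical numbers rather than values from the paper:

```python
import numpy as np

def posterior_update(prior, likelihoods):
    """One Bayesian belief update over role-assignment hypotheses:
    posterior is proportional to prior times the likelihood of the
    observed action under each hypothesis. A hedged sketch of the BIR
    idea, not DeepBayes itself."""
    post = np.asarray(prior, dtype=float) * np.asarray(likelihoods, dtype=float)
    return post / post.sum()

# Three role hypotheses under a uniform prior; the observed vote is
# most likely under the third hypothesis (hypothetical likelihoods).
belief = posterior_update([1 / 3, 1 / 3, 1 / 3], [0.2, 0.3, 0.5])
# belief -> [0.2, 0.3, 0.5]
```

Repeating this update after every observed action concentrates belief on consistent role assignments, which is what lets the CFR+ procedure weight counterfactual values by identity hypotheses.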

NeurIPS Conference 2025 Conference Paper

Spike4DGS: Towards High-Speed Dynamic Scene Rendering with 4D Gaussian Splatting via a Spike Camera Array

  • Qinghong Ye
  • Yiqian Chang
  • Jianing Li
  • Haoran Xu
  • Xuan Wang
  • Wei Zhang
  • Yonghong Tian
  • Peixi Peng

Spike camera with high temporal resolution offers a new perspective on high-speed dynamic scene rendering. Most existing rendering methods rely on Neural Radiance Fields (NeRF) or 3D Gaussian Splatting (3DGS) for static scenes using a monocular spike camera. However, these methods struggle with dynamic motion, while a single camera suffers from limited spatial coverage, making it challenging to reconstruct fine details in high-speed scenes. To address these problems, we propose Spike4DGS, the first high-speed dynamic scene rendering framework with 4D Gaussian Splatting using spike camera arrays. Technically, we first build a multi-view spike camera array to validate our solution, then establish both synthetic and real-world multi-view spike-based reconstruction datasets. Then, we design a multi-view spike-based dense initialization module that obtains dense point clouds and camera poses from continuous spike streams. Finally, we propose a spike-pixel synergy constraint supervision to optimize Spike4DGS, incorporating both rendered image quality loss and dynamic spatiotemporal spike loss. The results show that our Spike4DGS outperforms state-of-the-art methods in terms of novel view rendering quality on both synthetic and real-world datasets. More details are available at https://github.com/Qinghongye/Spike4DGS.

AAAI Conference 2025 Short Paper

Towards Building Human-like Smart Agents in Modern 3D Video Games (Student Abstract)

  • Zhihang Sun
  • Shuhan Qi
  • Xinhao Huang
  • Xinyu Xiao
  • Jiajia Zhang
  • Xuan Wang
  • Peixi Peng

In recent years, reinforcement learning has been widely applied in the field of games. However, most studies focus on helping agents achieve victory, with less attention paid to whether the agents exhibit human-like characteristics. To build human-like agents with high performance, we propose a method for learning the strategies of human players in modern three-dimensional video games. Our method uses a hierarchical framework, learning the basic behaviors and intentions of human players at the lower level through imitation learning, and generalized policies at the higher level through reinforcement learning. Compared with other existing methods, our method demonstrates significant advantages in learning human-like strategies in complex environments.

AAAI Conference 2024 Conference Paper

A Pre-convolved Representation for Plug-and-Play Neural Illumination Fields

  • Yiyu Zhuang
  • Qi Zhang
  • Xuan Wang
  • Hao Zhu
  • Ying Feng
  • Xiaoyu Li
  • Ying Shan
  • Xun Cao

Recent advances in implicit neural representation have demonstrated the ability to recover detailed geometry and material from multi-view images. However, the use of simplified lighting models such as environment maps to represent non-distant illumination, or using a network to fit indirect light modeling without a solid basis, can lead to an undesirable decomposition between lighting and material. To address this, we propose a fully differentiable framework named Neural Illumination Fields (NeIF) that uses radiance fields as a lighting model to handle complex lighting in a physically based way. Together with an integral lobe encoding for roughness-adaptive specular lobes, and by leveraging the pre-convolved background for accurate decomposition, the proposed method represents a significant step towards integrating physically based rendering into the NeRF representation. The experiments demonstrate superior novel-view rendering performance compared to previous works, and the capability to re-render objects under arbitrary NeRF-style environments opens up exciting possibilities for bridging the gap between virtual and real-world scenes.

AAAI Conference 2024 Conference Paper

Decoupling Degradations with Recurrent Network for Video Restoration in Under-Display Camera

  • Chengxu Liu
  • Xuan Wang
  • Yuanting Fan
  • Shuai Li
  • Xueming Qian

Under-display camera (UDC) systems are the foundation of full-screen display devices, in which the lens is mounted under the display. The pixel array of light-emitting diodes used for display diffracts and attenuates incident light, causing various degradations as the light intensity changes. Unlike general video restoration, which recovers video by treating different degradation factors equally, video restoration for UDC systems is more challenging in that it involves removing diverse degradations over time while preserving temporal consistency. In this paper, we introduce a novel video restoration network, called D2RNet, specifically designed for UDC systems. It employs a set of Decoupling Attention Modules (DAM) that effectively separate the various video degradation factors. More specifically, a soft mask generation function is proposed to decompose each frame into flare and haze based on the diffraction arising from incident light of different intensities, followed by the proposed flare and haze removal components that leverage long- and short-term feature learning to handle the respective degradations. Such a design offers a targeted and effective solution to eliminating the various types of degradation in UDC systems. We further extend our design to multiple scales to handle the scale changes of degradation that often occur in long-range videos. To demonstrate the superiority of D2RNet, we propose a large-scale UDC video benchmark built by gathering HDR videos and generating realistically degraded videos using the point spread function measured by a commercial UDC system. Extensive quantitative and qualitative evaluations demonstrate the superiority of D2RNet over other state-of-the-art video restoration and UDC image restoration methods.
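The soft-mask idea — weighting each pixel between a flare component and a haze component based on incident intensity — can be illustrated with a simple sigmoid on pixel intensity. This is a hypothetical stand-in for the paper's learned soft mask generation function; the threshold and sharpness values are illustrative:

```python
import math

def soft_masks(intensity, threshold=0.6, sharpness=10.0):
    """Decompose per-pixel intensities into complementary flare/haze weights.

    Bright, diffraction-driven pixels get high flare weight; the remainder
    is treated as haze. intensity: flat list of values in [0, 1].
    """
    flare = [1.0 / (1.0 + math.exp(-sharpness * (v - threshold)))
             for v in intensity]
    haze = [1.0 - f for f in flare]   # weights sum to 1 at every pixel
    return flare, haze
```

Each branch of a removal network could then be supervised only on the pixels its mask emphasizes, which is the decoupling intuition the abstract describes.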

IJCAI Conference 2024 Conference Paper

P2P: Transforming from Point Supervision to Explicit Visual Prompt for Object Detection and Segmentation

  • Guangqian Guo
  • Dian Shao
  • Chenguang Zhu
  • Sha Meng
  • Xuan Wang
  • Shan Gao

Point-supervised vision tasks, including detection and segmentation, which aim to learn a network that transforms points into pseudo labels, have attracted much attention in recent years. However, the lack of precise object size and boundary annotations in the point-supervised setting results in a large performance gap between point- and fully-supervised methods. In this paper, we propose a novel iterative learning framework, Point to Prompt (P2P), for point-supervised object detection and segmentation, with the key insight of transforming point supervision into explicit visual prompts for a foundation model. P2P is formulated as an iterative refinement process with two stages: Semantic Explicit Prompt Generation (SEPG) and Prompt Guided Spatial Refinement (PGSR). Specifically, SEPG serves as a prompt generator, producing semantic-explicit prompts from point input via a group-based learning strategy. In the PGSR stage, prompts guide the visual foundation model to further refine the object regions, leveraging the outstanding generalization ability of the foundation model. The two stages are iterated multiple times to progressively improve the quality of predictions. Experimental results on multiple datasets demonstrate that P2P achieves SOTA performance in both detection and segmentation tasks, further narrowing the performance gap with fully-supervised methods. The source code and supplementary material can be found at https://github.com/guangqian-guo/P2P.

JBHI Journal 2024 Journal Article

Sensor-Based Multifaceted Feature Extraction and Ensemble Elastic Net Approach for Assessing Fall Risk in Community-Dwelling Older Adults

  • Xuan Wang
  • Lisha Yu
  • Hailiang Wang
  • Kwok Leung Tsui
  • Yang Zhao

Accurate identification of community-dwelling older adults at high fall risk can facilitate timely intervention and significantly reduce fall incidents. Analyzing gait and balance capabilities via feature extraction and modeling through sensor-based motion data has emerged as a viable approach for fall risk assessment. However, the existing approaches for extracting key features related to fall risk lack inclusiveness, with limited consideration of the non-linear characteristics of sensor signals, such as signal complexity, self-similarity, and local stability. In this study, we developed a multifaceted feature extraction scheme employing diverse feature types, including demographic, descriptive statistical, non-linear, spatiotemporal and spectral features, derived from three-axis accelerometers and gyroscope data. This study is the first attempt to investigate non-linear features related to fall risk in multi-task scenarios from a dynamic system perspective. Based on the extracted multifaceted features, we propose an ensemble elastic net (E-E-N) approach for handling imbalanced data and offering high model interpretability. The E-E-N utilizes bootstrap sampling to construct base classifiers and employs a weighting mechanism to aggregate the base classifiers. We conducted a set of validation experiments using real-world data for comprehensive comparative analysis. The results demonstrate that the E-E-N approach exhibits superior predictive performance on fall risk classification. Our proposed approach offers a cost-effective tool for accurately assessing fall risk and alleviating the burden of continuous health monitoring in the long term.
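The ensemble mechanism described above — bootstrap sampling to build base classifiers, then aggregating them with a weighting mechanism — can be sketched generically. The paper's base learners are elastic nets; the mean-threshold classifier below is a hypothetical stand-in so the sketch stays self-contained, and the accuracy-based weights are one plausible choice of weighting:

```python
import random

class MeanThresholdClassifier:
    """Stand-in base learner: thresholds one feature at the midpoint of the
    two class means (the paper uses elastic net base classifiers instead)."""
    def fit(self, xs, ys):
        pos = [x for x, y in zip(xs, ys) if y == 1]
        neg = [x for x, y in zip(xs, ys) if y == 0]
        if pos and neg:
            self.threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
        else:  # degenerate bootstrap sample containing a single class
            self.threshold = sum(xs) / len(xs)
        return self

    def predict(self, x):
        return 1 if x >= self.threshold else 0

def fit_weighted_ensemble(xs, ys, n_estimators=25, seed=0):
    """Fit base learners on bootstrap samples; weight each by its accuracy
    on the full training set."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]          # bootstrap sample
        model = MeanThresholdClassifier().fit([xs[i] for i in idx],
                                              [ys[i] for i in idx])
        weight = sum(model.predict(x) == y for x, y in zip(xs, ys)) / n
        models.append((model, weight))
    return models

def ensemble_predict(models, x):
    """Accuracy-weighted majority vote over the base classifiers."""
    score = sum(w * m.predict(x) for m, w in models)
    return 1 if score >= sum(w for _, w in models) / 2 else 0
```

Bootstrapping also gives a natural handle on class imbalance, since minority-class samples can be over-represented in the resampling step.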

IJCAI Conference 2023 Conference Paper

CiT-Net: Convolutional Neural Networks Hand in Hand with Vision Transformers for Medical Image Segmentation

  • Tao Lei
  • Rui Sun
  • Xuan Wang
  • Yingbo Wang
  • Xi He
  • Asoke Nandi

The hybrid architecture of convolutional neural networks (CNNs) and Transformer are very popular for medical image segmentation. However, it suffers from two challenges. First, although a CNNs branch can capture the local image features using vanilla convolution, it cannot achieve adaptive feature learning. Second, although a Transformer branch can capture the global features, it ignores the channel and cross-dimensional self-attention, resulting in a low segmentation accuracy on complex-content images. To address these challenges, we propose a novel hybrid architecture of convolutional neural networks hand in hand with vision Transformers (CiT-Net) for medical image segmentation. Our network has two advantages. First, we design a dynamic deformable convolution and apply it to the CNNs branch, which overcomes the weak feature extraction ability due to fixed-size convolution kernels and the stiff design of sharing kernel parameters among different inputs. Second, we design a shifted-window adaptive complementary attention module and a compact convolutional projection. We apply them to the Transformer branch to learn the cross-dimensional long-term dependency for medical images. Experimental results show that our CiT-Net provides better medical image segmentation results than popular SOTA methods. Besides, our CiT-Net requires lower parameters and less computational costs and does not rely on pre-training. The code is publicly available at https: //github. com/SR0920/CiT-Net.

EAAI Journal 2023 Journal Article

Deep reinforcement learning-PID based supervisor control method for indirect-contact heat transfer processes in energy systems

  • Xuan Wang
  • Jinwen Cai
  • Rui Wang
  • Gequn Shu
  • Hua Tian
  • Mingtao Wang
  • Bowen Yan

Indirect-contact heat exchangers have been widely used in various energy systems, and the precise tracking control of important heat transfer parameters, such as temperature, is vital for safe and efficient operation. However, the high nonlinearity of heat transfer and large disturbances make optimal control difficult. Considering the strong perception and decision-making capabilities of deep reinforcement learning (DRL), this study proposes a supervisory control method combining DRL and proportional–integral–derivative (PID) control. A set of the fewest conveniently measurable variables was derived as agent observations to describe the heat transfer process effectively and thereby improve control efficiency under large disturbances. In addition, the local heat transfer process was used as a training environment to significantly reduce training costs. Finally, superheat temperature control in a complex organic Rankine cycle was simulated in SIMULINK to evaluate the effectiveness of the proposed observation variables and the training and control methods. The results showed that the proposed control method achieved satisfactory performance. The average absolute tracking error was only 0.246 K under both trained and untrained disturbances, whereas that of PID control was 4.645 K. Compared with model predictive control (MPC), the DRL-PID-based supervisory control performed markedly better under a large disturbance; the average absolute tracking errors under DRL-PID control and MPC were 0.288 K and 0.509 K, respectively.
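The DRL supervisor is model-specific, but the PID layer it sits on top of can be sketched in a few lines: a discrete PID controller tracking a setpoint against a toy first-order lag plant. The gains, time step, and plant constants below are illustrative, not values from the paper:

```python
class PID:
    """Discrete PID controller. In the paper's scheme a DRL agent supervises
    this layer; only the PID loop itself is sketched here."""
    def __init__(self, kp, ki, kd, dt):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = None

    def step(self, setpoint, measurement):
        error = setpoint - measurement
        self.integral += error * self.dt
        derivative = (0.0 if self.prev_error is None
                      else (error - self.prev_error) / self.dt)
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Toy first-order plant: the temperature relaxes toward the control input.
pid = PID(kp=2.0, ki=0.5, kd=0.1, dt=0.1)
temp = 0.0
for _ in range(400):
    u = pid.step(5.0, temp)      # track a 5 K superheat setpoint
    temp += 0.05 * (u - temp)    # simple lag dynamics
```

In a supervisory arrangement, the agent would observe the process and adjust the setpoint or the PID gains between steps, rather than output the actuation signal directly.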

IS Journal 2020 Journal Article

Commonsense Knowledge Enhanced Memory Network for Stance Classification

  • Jiachen Du
  • Lin Gui
  • Ruifeng Xu
  • Yunqing Xia
  • Xuan Wang

Stance classification aims to identify, in text, whether the attitude toward a given target is favorable, negative, or unrelated. Existing models for stance classification leverage only textual representations and ignore commonsense knowledge. To better incorporate commonsense knowledge into stance classification, we propose a novel model named the commonsense knowledge enhanced memory network, which jointly represents the textual and commonsense knowledge of the given target and text. The textual memory module in our model treats the textual representations as memory vectors and uses an attention mechanism to emphasize the important parts. For the commonsense knowledge memory module, we jointly leverage the entity and relation embeddings learned by the TransE model to take full advantage of the constraints of the knowledge graph. Experimental results on the SemEval dataset show that combining the commonsense knowledge memory and textual memory improves stance classification.
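The memory-read operation underlying such a model — scoring memory vectors against a query, softmaxing the scores, and returning the weighted sum — can be sketched generically (a plain dot-product attention read, not the paper's exact architecture):

```python
import math

def attention_read(memory, query):
    """Soft attention over memory vectors.

    memory: list of vectors (equal-length lists of floats)
    query: vector of the same length
    Returns the attention-weighted sum of the memory vectors.
    """
    # Dot-product score for each memory slot.
    scores = [sum(q * m for q, m in zip(query, mem)) for mem in memory]
    # Numerically stable softmax over the scores.
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # Weighted sum of memory vectors.
    dim = len(query)
    return [sum(w * mem[i] for w, mem in zip(weights, memory))
            for i in range(dim)]
```

In the paper's setting, one memory bank would hold textual representations and another the TransE entity/relation embeddings, with the stance query attending over both.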

IS Journal 2015 Journal Article

Footstep-Identification System Based on Walking Interval

  • Xuan Wang
  • Tengfei Yang
  • Yao Yu
  • Ruixin Zhang
  • Fangxia Guo

Footsteps, as a main kind of behavioral trait, are a universally available signal, but constructing an identity verification system based on them remains a challenging problem: footsteps not only reflect a person's physiological basis but also depend on the person's psychological makeup, footwear, and floor. This article describes a novel footstep-identification system. To eliminate footwear and floor variations as limiting factors, the footstep duration and interval times are extracted from the footsteps, and a timing vector is obtained as a feature. To smooth instability in footsteps, the authors developed a novel pattern-recognition method in which the training procedure can be split into several parallel subprocedures, with each subprocedure considering only one class of samples. The system can be periodically retrained using several of the user's most recent successfully identified footsteps. Theoretical and experimental results show that the system is relatively robust to variations in footwear, floor, and the examinee's psychological makeup, and yields better classification performance than existing methods.
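The timing-vector feature described above can be sketched directly: given onset/offset times of consecutive footsteps, collect the step durations and the inter-step intervals (a minimal illustration; the paper's full feature construction and recognition pipeline may differ):

```python
def timing_vector(steps):
    """Build a timing feature vector from footstep events.

    steps: list of (onset, offset) times for consecutive footsteps,
           in seconds, sorted by onset.
    Returns the duration of each step followed by the silent interval
    between each step and the next — quantities that, per the abstract,
    are comparatively insensitive to footwear and floor type.
    """
    durations = [off - on for on, off in steps]
    intervals = [steps[i + 1][0] - steps[i][1]
                 for i in range(len(steps) - 1)]
    return durations + intervals
```

For three steps, this yields a five-dimensional vector (three durations, two intervals) that a per-class classifier can then consume.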