Arrow Research search

Author name cluster

Yan Xu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers
2 author rows

Possible papers (24)

AAAI Conference 2026 Conference Paper

ECD: Evidence-guided Contrastive Decoding in Retrieval-Augmented Generation with Accurate Knowledge Reference Adjustment

  • Yize Sui
  • Yan Xu
  • Kun Hu
  • Jing Ren
  • Wenjing Yang

Retrieval-Augmented Generation (RAG) enhances the quality of question answering by integrating external knowledge with internal knowledge. A robust RAG system needs to precisely regulate how much the response depends on each type of knowledge. The recently proposed context-aware contrastive decoding (CCD) method attempts to achieve this goal by adjusting knowledge reference weights based on differences in the output distributions of LLMs when they rely on different knowledge sources. However, such methods use probabilistic knowledge reference adjustment strategies (such as highest probability or entropy) that only consider the relative confidence of the output at each decoding step, without accounting for its absolute confidence, which can lead to misjudging how much external versus internal knowledge should be referenced during decoding. To this end, we propose a novel decoding method, Evidence-guided Contrastive Decoding (ECD), which performs evidence modeling by constructing a Dirichlet distribution and treating logits as evidence vectors, so as to regulate the reference degree of internal and external knowledge more accurately and ultimately improve the quality of generated responses. Extensive evaluations across four public benchmark datasets on three mainstream LLMs demonstrate the effectiveness and advantages of ECD.
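As a rough illustration of the evidential idea described in this abstract, the Python sketch below treats clipped logits as Dirichlet evidence, derives an absolute confidence from the resulting concentration parameters, and uses that confidence to weight a contrastive combination of with-context and context-free logits. The alpha = evidence + 1 recipe, the mixing rule, and all names are illustrative assumptions, not ECD's actual formulation.

import numpy as np

def evidential_confidence(logits):
    # Treat non-negative transformed logits as Dirichlet evidence and return an
    # absolute confidence in [0, 1]; alpha = evidence + 1 is a common
    # evidential-learning recipe, assumed here for illustration.
    evidence = np.maximum(logits, 0.0)      # evidence vector, e >= 0
    alpha = evidence + 1.0                  # Dirichlet concentration parameters
    uncertainty = alpha.size / alpha.sum()  # vacuity of the Dirichlet
    return 1.0 - uncertainty

def contrastive_decode_step(logits_with_ctx, logits_without_ctx):
    # One decoding step: weight the with-context logits by their evidential
    # confidence before contrasting with the context-free logits (illustrative).
    w = evidential_confidence(logits_with_ctx)
    mixed = (1.0 + w) * logits_with_ctx - w * logits_without_ctx
    probs = np.exp(mixed - mixed.max())
    probs /= probs.sum()
    return int(np.argmax(probs))

rng = np.random.default_rng(0)  # toy example with a 5-token vocabulary
print(contrastive_decode_step(rng.normal(size=5), rng.normal(size=5)))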

JBHI Journal 2026 Journal Article

Multidomain Selective Feature Fusion and Stacking Based Ensemble Framework for EEG-Based Neonatal Sleep Stratification

  • Muhammad Irfan
  • Laishuan Wang
  • Husnain Shahid
  • Yan Xu
  • Abdulhamit Subasi
  • Adnan Munawar
  • Noman Mustafa
  • Chen Chen

Employing a minimal array of electroencephalography (EEG) channels for neonatal sleep stage classification is essential for data acquisition in the Internet of Medical Things (IoMT), as single-channel and edge-based features can reduce data transfer and processing requirements, enhancing cost-effectiveness and practicality. In this paper, we evaluate the efficacy of a single channel and the viability of a binary classification scheme for discerning awake and sleep states and transitions to quiet sleep. For this, two datasets of EEG signals for neonate sleep analysis were recorded from Children's Hospital of Fudan University, Shanghai, comprising recordings from 64 and 19 neonates, respectively. From each epoch, a diverse ensemble of 490 features was extracted through a blend of discrete and continuous wavelet transforms (DWT, CWT), spectral statistics, and temporal features. In addition, we introduced an innovative hybrid univariate and ensemble feature selection approach with multidomain feature fusion, and a stacking-based ensemble classifier that outperforms existing work. We achieved 90.37%, 91.13%, and 94.88% accuracy for sleep/awake, quiet sleep/non-quiet sleep, and quiet sleep/awake, respectively. This was corroborated by significant Kappa values of 77.5%, 80.29%, and 89.76%. Using SelectPercentile, we devised three distinct feature selection mechanisms: one using DWT, one with CWT, and another incorporating both spectral and temporal features. Subsequently, SelectKBest was used to determine the most effective features. For our stacked model, we incorporated a trifecta of the ExtraTree model with variable estimators, a Random Forest, and an Artificial Neural Network (ANN) as base classifiers, and for the final prediction phase, ANN was implemented again. The model's performance was evaluated using K-fold and leave-one-subject-out cross-validation.
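The feature-selection-plus-stacking design described above maps naturally onto scikit-learn. The sketch below is a hedged approximation on synthetic data: the 490-feature input is simulated, and the percentile, k, and estimator settings are placeholders rather than the paper's configuration.

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier, StackingClassifier
from sklearn.feature_selection import SelectKBest, SelectPercentile, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# stand-in for the 490 wavelet/spectral/temporal features per epoch
X, y = make_classification(n_samples=600, n_features=490, n_informative=40, random_state=0)

stack = StackingClassifier(
    estimators=[
        ("extratrees", ExtraTreesClassifier(n_estimators=100, random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
        ("ann", MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)),
    ],
    final_estimator=MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
)

pipe = make_pipeline(
    StandardScaler(),
    SelectPercentile(f_classif, percentile=50),  # univariate pre-selection
    SelectKBest(f_classif, k=60),                # keep the strongest features
    stack,
)

print(cross_val_score(pipe, X, y, cv=5).mean())  # K-fold accuracy on toy data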

JBHI Journal 2026 Journal Article

Restore-RWKV: Efficient and Effective Medical Image Restoration With RWKV

  • Zhiwen Yang
  • Jiayin Li
  • Hui Zhang
  • Dan Zhao
  • Bingzheng Wei
  • Yan Xu

Transformers have revolutionized medical image restoration, but the quadratic complexity still poses limitations for their application to high-resolution medical images. The recent advent of the Receptance Weighted Key Value (RWKV) model in the natural language processing field has attracted much attention due to its ability to process long sequences efficiently. To leverage its advanced design, we propose Restore-RWKV, the first RWKV-based model for medical image restoration. Since the original RWKV model is designed for 1D sequences, we make two necessary modifications for modeling spatial relations in 2D medical images. First, we present a recurrent WKV (Re-WKV) attention mechanism that captures global dependencies with linear computational complexity. Re-WKV incorporates bidirectional attention as the basis for a global receptive field and recurrent attention to effectively model 2D dependencies from various scan directions. Second, we develop an omnidirectional token shift (Omni-Shift) layer that enhances local dependencies by shifting tokens from all directions and across a wide context range. These adaptations make the proposed Restore-RWKV an efficient and effective model for medical image restoration. Even a lightweight variant of Restore-RWKV, with only 1.16 million parameters, achieves comparable or even superior results compared to existing state-of-the-art (SOTA) methods. Extensive experiments demonstrate that the resulting Restore-RWKV achieves SOTA performance across a range of medical image restoration tasks, including PET image synthesis, CT image denoising, MRI image super-resolution, and all-in-one medical image restoration.
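For intuition, the PyTorch sketch below implements a simple directional token shift on a 2D feature map: channel groups are shifted one pixel up, down, left, and right so that every token mixes information from its neighbours. It is a simplified stand-in for the Omni-Shift layer described above, which additionally spans a wider context range.

import torch
import torch.nn.functional as F

def directional_token_shift(x: torch.Tensor) -> torch.Tensor:
    # x: (B, C, H, W). Split channels into four groups and shift each group by
    # one pixel in a different direction, using zero padding at the borders.
    b, c, h, w = x.shape
    g = c // 4
    pad = F.pad(x, (1, 1, 1, 1))                 # pad width and height by 1
    down  = pad[:, 0*g:1*g, 0:h,   1:w+1]        # take the neighbour above
    up    = pad[:, 1*g:2*g, 2:h+2, 1:w+1]        # take the neighbour below
    right = pad[:, 2*g:3*g, 1:h+1, 0:w]          # take the neighbour to the left
    left  = pad[:, 3*g:4*g, 1:h+1, 2:w+2]        # take the neighbour to the right
    rest  = x[:, 4*g:]                           # leftover channels unchanged
    return torch.cat([down, up, right, left, rest], dim=1)

print(directional_token_shift(torch.randn(1, 8, 16, 16)).shape)  # (1, 8, 16, 16)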

NeurIPS Conference 2025 Conference Paper

NopeRoomGS: Indoor 3D Gaussian Splatting Optimization without Camera Pose Input

  • Wenbo Li
  • Yan Xu
  • Mingde Yao
  • Fengjie Liang
  • Jiankai Sun
  • Menglu Wang
  • Guofeng Zhang
  • Linjiang Huang

Recent advances in 3D Gaussian Splatting (3DGS) have enabled real-time, high-fidelity view synthesis, but remain critically dependent on camera poses estimated by Structure-from-Motion (SfM), which is notoriously unreliable in textureless indoor environments. To eliminate this dependency, recent pose-free variants have been proposed, yet they often fail under abrupt camera motion due to unstable initialization and purely photometric objectives. In this work, we introduce Nope-RoomGS, an optimization framework that requires no camera pose input and effectively handles textureless regions and abrupt camera motion in indoor room environments through a local-to-global optimization paradigm for 3DGS reconstruction. In the local stage, we propose a lightweight local neural geometric representation to bootstrap a set of reliable local 3D Gaussians for separate short video clips, regularized by multi-frame tracking constraints and foundation-model depth priors. This enables reliable initialization even in textureless regions or under abrupt camera motion. In the global stage, we fuse local 3D Gaussians into a unified 3DGS representation through an alternating optimization strategy that jointly refines camera poses and Gaussian parameters, effectively mitigating gradient interference between them. Furthermore, we decompose camera pose optimization based on a piecewise planarity assumption, further enhancing robustness under abrupt camera motion. Extensive experiments on Replica, ScanNet, and Tanks & Temples demonstrate the state-of-the-art performance of our method in both camera pose estimation and novel view synthesis.

NeurIPS Conference 2025 Conference Paper

Novel View Synthesis from A Few Glimpses via Test-Time Natural Video Completion

  • Yan Xu
  • Yixing Wang
  • Stella X. Yu

Given just a few glimpses of a scene, can you imagine the movie playing out as the camera glides through it? That’s the lens we take on sparse-input novel view synthesis, not only as filling spatial gaps between widely spaced views, but also as completing a natural video unfolding through space. We recast the task as test-time natural video completion, using powerful priors from pretrained video diffusion models to hallucinate plausible in-between views. Our zero-shot, generation-guided framework produces pseudo views at novel camera poses, modulated by an uncertainty-aware mechanism for spatial coherence. These synthesized frames densify supervision for 3D Gaussian Splatting (3D-GS) for scene reconstruction, especially in under-observed regions. An iterative feedback loop lets 3D geometry and 2D view synthesis inform each other, improving both the scene reconstruction and the generated views. The result is coherent, high-fidelity renderings from sparse inputs without any scene-specific training or fine-tuning. On LLFF, DTU, DL3DV, and MipNeRF-360, our method significantly outperforms strong 3D-GS baselines under extreme sparsity. Our project page is at https://decayale.github.io/project/SV2CGS.

NeurIPS Conference 2025 Conference Paper

QFFT, Question-Free Fine-Tuning for Adaptive Reasoning

  • Wanlong Liu
  • Junxiao Xu
  • Fei Yu
  • Yukang Lin
  • Ke Ji
  • Wenyu Chen
  • Lifeng Shang
  • Yasheng Wang

Recent advancements in Long Chain-of-Thought (CoT) reasoning models have improved performance on complex tasks, but they suffer from overthinking, which generates redundant reasoning steps, especially for simple questions. This paper revisits the reasoning patterns of Long and Short CoT models, observing that the Short CoT patterns offer concise reasoning efficiently, while the Long CoT patterns excel in challenging scenarios where the Short CoT patterns struggle. To enable models to leverage both patterns, we propose Question-Free Fine-Tuning (QFFT), a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. This approach enables the model to adaptively employ both reasoning patterns: it prioritizes the Short CoT patterns and activates the Long CoT patterns only when necessary. Experiments on various mathematical datasets demonstrate that QFFT reduces average response length by more than 50%, while achieving performance comparable to Supervised Fine-Tuning (SFT). Additionally, QFFT exhibits superior performance compared to SFT in noisy, out-of-domain, and low-resource scenarios.
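Conceptually, QFFT only changes how the training examples are assembled. The snippet below contrasts a hypothetical question-free example with a standard SFT example; the field names and prompt format are assumptions for illustration, not the paper's data pipeline.

def to_qfft_example(sample: dict) -> dict:
    # Question-free: drop the question and supervise on the Long CoT response alone.
    return {"prompt": "", "completion": sample["long_cot_response"]}

def to_sft_example(sample: dict) -> dict:
    # Standard SFT keeps the question in the prompt.
    return {"prompt": sample["question"], "completion": sample["long_cot_response"]}

raw = {
    "question": "What is 17 * 24?",
    "long_cot_response": "17 * 24 = 17 * 20 + 17 * 4 = 340 + 68 = 408. The answer is 408.",
}
print(to_qfft_example(raw))
print(to_sft_example(raw))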

JBHI Journal 2025 Journal Article

TDFormer: Top-Down Token Generation for 3D Medical Image Segmentation

  • Hao Du
  • Qihua Dong
  • Yan Xu
  • Jing Liao

Accurate medical image segmentation is critical to effective treatment strategies. Existing transformer-based methods for image segmentation mostly split the input image into a fixed and regular grid and regard the cells in the grid as vision tokens. However, not all tokens are of equal importance in medical segmentation tasks; e.g., tokens in tumor areas must be processed at a higher resolution than background tokens, which can be easily predicted with fewer transformer layers. In this paper, we propose a simple yet efficient segmentation framework called Top-Down Transformer (TDFormer), which incorporates a spatially adaptive token generation scheme into the transformer. The proposed top-down token generation comprises three components: attentiveness calculation, token splitting, and token fusion, whose collaboration gradually fuses redundant background tokens and focuses only on the most critical areas. This allows more computation to be allocated to processing tokens containing delicate details at a finer resolution. Extensive experiments demonstrate the robustness and effectiveness of the proposed TDFormer, showing that our method is superior to other state-of-the-art methods on the following publicly accessible datasets: BTCV Challenge, LiTS, and BraTS 2020. We also dissect our method and evaluate the performance of each component.

NeurIPS Conference 2024 Conference Paper

Pedestrian-Centric 3D Pre-collision Pose and Shape Estimation from Dashcam Perspective

  • Meijun Wang
  • Yu Meng
  • Zhongwei Qiu
  • Chao Zheng
  • Yan Xu
  • Xiaorui Peng
  • Jian Gao

Pedestrian pre-collision pose is one of the key factors determining the degree of pedestrian injury in a pedestrian-vehicle collision. Human pose estimation is an effective way to estimate a pedestrian's emergency pose from accident video. However, pose estimation models trained on existing everyday human pose datasets are not robust to specific poses such as pedestrian pre-collision poses, and human pose data are difficult to obtain in the wild, especially scarce data such as pre-collision poses in traffic scenes. In this paper, we collect pedestrian-vehicle collision poses from the dashcam perspective and construct the first Pedestrian-Vehicle Collision Pose dataset (PVCP) in a semi-automatic way, including 40K+ accident frames and 20K+ pedestrian pre-collision pose annotations (2D, 3D, mesh). Further, we construct a Pedestrian Pre-collision Pose Estimation Network (PPSENet) to estimate the collision pose and shape sequence of pedestrians from pedestrian-vehicle accident videos. PPSENet first estimates the 2D pose from the image (Image to Pose, ITP) and then lifts the 2D pose to a 3D mesh (Pose to Mesh, PTM). Due to the small size of the dataset, we introduce a pre-trained model that learns a human pose prior from large-scale pose datasets, and use iterative regression to estimate the pre-collision pose and shape of pedestrians. Further, we classify the pre-collision pose sequence and introduce a pose class loss, achieving the best accuracy compared with existing relevant state-of-the-art methods. Code and data are available for research at https://github.com/wmj142326/PVCP.

NeurIPS Conference 2023 Conference Paper

Exploiting Contextual Objects and Relations for 3D Visual Grounding

  • Li Yang
  • Chunfeng Yuan
  • Ziqi Zhang
  • Zhongang Qi
  • Yan Xu
  • Wei Liu
  • Ying Shan
  • Bing Li

3D visual grounding, the task of identifying visual objects in 3D scenes based on natural language inputs, plays a critical role in enabling machines to understand and engage with the real-world environment. However, this task is challenging due to the necessity to capture 3D contextual information to distinguish target objects from complex 3D scenes. The absence of annotations for contextual objects and relations further exacerbates the difficulties. In this paper, we propose a novel model, CORE-3DVG, to address these challenges by explicitly learning about contextual objects and relations. Our method accomplishes 3D visual grounding via three sequential modular networks, including a text-guided object detection network, a relation matching network, and a target identification network. During training, we introduce a pseudo-label self-generation strategy and a weakly-supervised method to facilitate the learning of contextual objects and relations, respectively. The proposed techniques allow the networks to focus more effectively on referred objects within 3D scenes by understanding their context better. We validate our model on the challenging Nr3D, Sr3D, and ScanRefer datasets and demonstrate state-of-the-art performance. Our code will be public at https://github.com/yangli18/CORE-3DVG.

JBHI Journal 2022 Journal Article

Vision-Based Finger Tapping Test in Patients With Parkinson’s Disease via Spatial-Temporal 3D Hand Pose Estimation

  • Zhilin Guo
  • Weiqi Zeng
  • Taidong Yu
  • Yan Xu
  • Yang Xiao
  • Xuebing Cao
  • Zhiguo Cao

The finger tapping test is crucial for diagnosing Parkinson’s Disease (PD), but manual visual evaluations can result in score discrepancies due to clinicians’ subjectivity. Moreover, applying wearable sensors requires making physical contact and may hinder PD patients’ raw movement patterns. Accordingly, a novel computer-vision approach is proposed using a depth camera and spatial-temporal 3D hand pose estimation to capture and evaluate PD patients’ 3D hand movement. Within this approach, a temporal encoding module is leveraged to extend A2J’s deep learning framework to counter the pose jittering problem, and a pose refinement process is utilized to alleviate dependency on massive data. Additionally, the first vision-based 3D PD hand dataset of 112 hand samples from 48 PD patients and 11 control subjects is constructed, fully annotated by qualified physicians under clinical settings. Testing on this real-world data, this new model achieves 81.2% classification accuracy, even surpassing that of individual clinicians in comparison, fully demonstrating this proposition’s effectiveness. The demo video can be accessed at https://github.com/ZhilinGuo/ST-A2J.

AAAI Conference 2021 Conference Paper

CrossNER: Evaluating Cross-Domain Named Entity Recognition

  • Zihan Liu
  • Yan Xu
  • Tiezheng Yu
  • Wenliang Dai
  • Ziwei Ji
  • Samuel Cahyawijaya
  • Andrea Madotto
  • Pascale Fung

Cross-domain named entity recognition (NER) models are able to cope with the scarcity issue of NER samples in target domains. However, most of the existing NER benchmarks lack domain-specialized entity types or do not focus on a certain domain, leading to a less effective cross-domain evaluation. To address these obstacles, we introduce a cross-domain NER dataset (CrossNER), a fully-labeled collection of NER data spanning five diverse domains with specialized entity categories for different domains. Additionally, we provide a domain-related corpus, since using it to continue pre-training language models (domain-adaptive pre-training) is effective for domain adaptation. We then conduct comprehensive experiments to explore the effectiveness of leveraging different levels of the domain corpus and pre-training strategies to perform domain-adaptive pre-training for the cross-domain task. Results show that focusing on the fractional corpus containing domain-specialized entities and utilizing a more challenging pre-training strategy in domain-adaptive pre-training are beneficial for NER domain adaptation, and our proposed method can consistently outperform existing cross-domain NER baselines. Nevertheless, the experiments also illustrate the challenge of this cross-domain NER task. We hope that our dataset and baselines will catalyze research in the NER domain adaptation area. The code and data are available at https://github.com/zliucr/CrossNER.
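Domain-adaptive pre-training of the kind described above is commonly implemented as continued masked-language-model training on the domain corpus. The sketch below uses Hugging Face transformers with a toy two-sentence corpus; the base model, corpus, and hyperparameters are placeholders and do not reproduce the CrossNER setup.

from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "bert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

domain_corpus = [  # in practice: the entity-rich domain-related corpus
    "The midfielder was transferred to the club for a record fee.",
    "The striker scored twice in the opening fixture of the season.",
]
encodings = tokenizer(domain_corpus, truncation=True, max_length=128)
dataset = [{"input_ids": ids} for ids in encodings["input_ids"]]

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dapt-ckpt", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15),
)
trainer.train()  # the adapted encoder is then fine-tuned for NER on the target domain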

JBHI Journal 2020 Journal Article

Unsupervised 3D End-to-End Medical Image Registration With Volume Tweening Network

  • Shengyu Zhao
  • Tingfung Lau
  • Ji Luo
  • Eric I-Chao Chang
  • Yan Xu

3D medical image registration is of great clinical importance. However, supervised learning methods require a large amount of accurately annotated corresponding control points (or morphing), which are very difficult to obtain. Unsupervised learning methods ease the burden of manual annotation by exploiting unlabeled data without supervision. In this article, we propose a new unsupervised learning method using convolutional neural networks under an end-to-end framework, Volume Tweening Network (VTN), for 3D medical image registration. We propose three innovative technical components: (1) an end-to-end cascading scheme that resolves large displacement; (2) an efficient integration of an affine registration network; and (3) an additional invertibility loss that encourages backward consistency. Experiments demonstrate that our algorithm is 880x faster (or 3.3x faster without GPU acceleration) than traditional optimization-based methods and achieves state-of-the-art performance in medical image registration.
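The invertibility (backward-consistency) idea can be illustrated compactly: composing the forward and backward displacement fields should bring every voxel back to its starting point, so the composed residual is penalized. The 2D NumPy/SciPy sketch below is a simplified stand-in for VTN's 3D implementation.

import numpy as np
from scipy.ndimage import map_coordinates

def invertibility_loss(fwd, bwd):
    # fwd, bwd: displacement fields of shape (2, H, W) in pixel units.
    h, w = fwd.shape[1:]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    py, px = ys + fwd[0], xs + fwd[1]          # positions reached by the forward field
    bwd_at_fwd = np.stack([                    # backward field sampled at those positions
        map_coordinates(bwd[0], [py, px], order=1, mode="nearest"),
        map_coordinates(bwd[1], [py, px], order=1, mode="nearest"),
    ])
    residual = fwd + bwd_at_fwd                # zero when bwd exactly inverts fwd
    return float(np.mean(residual ** 2))

fwd = np.ones((2, 32, 32)) * 2.0               # a constant translation...
print(invertibility_loss(fwd, -fwd))           # ...and its exact inverse: loss ~ 0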

JBHI Journal 2019 Journal Article

Unsupervised Learning for Cell-Level Visual Representation in Histopathology Images With Generative Adversarial Networks

  • Bo Hu
  • Ye Tang
  • Eric I-Chao Chang
  • Yubo Fan
  • Maode Lai
  • Yan Xu

The visual attributes of cells, such as the nuclear morphology and chromatin openness, are critical for histopathology image analysis. By learning cell-level visual representation, we can obtain a rich mix of features that are highly reusable for various tasks, such as cell-level classification, nuclei segmentation, and cell counting. In this paper, we propose a unified generative adversarial network architecture with a new formulation of loss to perform robust cell-level visual representation learning in an unsupervised setting. Our model is not only label-free and easily trained but also capable of cell-level unsupervised classification with interpretable visualization, which achieves promising results in the unsupervised classification of bone marrow cellular components. Based on the proposed cell-level visual representation learning, we further develop a pipeline that exploits the varieties of cellular elements to perform histopathology image classification, the advantages of which are demonstrated on bone marrow datasets.

IJCAI Conference 2018 Conference Paper

3D-Aided Deep Pose-Invariant Face Recognition

  • Jian Zhao
  • Lin Xiong
  • Yu Cheng
  • Yi Cheng
  • Jianshu Li
  • Li Zhou
  • Yan Xu
  • Jayashree Karlekar

Learning from synthetic faces, though perhaps appealing for high data efficiency, may not bring satisfactory performance due to the distribution discrepancy between synthetic and real face images. To mitigate this gap, we propose a 3D-Aided Deep Pose-Invariant Face Recognition Model (3D-PIM), which automatically recovers realistic frontal faces from arbitrary poses through a 3D face model in a novel way. Specifically, 3D-PIM incorporates a simulator with the aid of a 3D Morphable Model (3DMM) to obtain shape and appearance priors for accelerating face normalization learning, requiring less training data. It further leverages a global-local Generative Adversarial Network (GAN) with multiple critical improvements as a refiner to enhance the realism of both global structures and local details of the face simulator’s output using unlabelled real data only, while preserving the identity information. Qualitative and quantitative experiments on both controlled and in-the-wild benchmarks clearly demonstrate the superiority of the proposed model over state-of-the-art methods.

AAAI Conference 2017 Conference Paper

Optimizing Quantiles in Preference-Based Markov Decision Processes

  • Hugo Gilbert
  • Paul Weng
  • Yan Xu

In the Markov decision process model, policies are usually evaluated by expected cumulative rewards. As this decision criterion is not always suitable, we propose in this paper an algorithm for computing a policy optimal for the quantile criterion. Both finite and infinite horizons are considered. Finally we experimentally evaluate our approach on random MDPs and on a data center control problem.
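To make the quantile criterion concrete, the short simulation below evaluates a fixed policy on a toy problem by the 0.25-quantile of its return distribution rather than by its expectation; the toy reward process is an assumption, and the paper's optimization algorithm is not reproduced here.

import numpy as np

rng = np.random.default_rng(0)
# Stand-in for Monte Carlo rollouts of a fixed policy: each step yields reward 3
# with probability 0.3 and 0 otherwise, over a 10-step horizon.
returns = rng.choice([0.0, 3.0], p=[0.7, 0.3], size=(5000, 10)).sum(axis=1)

print("expected return:", returns.mean())              # the usual criterion
print("0.25-quantile  :", np.quantile(returns, 0.25))  # a risk-sensitive quantile criterion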

ICRA Conference 2016 Conference Paper

Control and experimental validation of robot-assisted automatic measurement system for Multi-Stud Tensioning Machine (MSTM)

  • Meng Li
  • Xingguang Duan
  • Haoyuan Li
  • Tengfei Cui
  • Liang Gao
  • Yue Zhan
  • Yan Xu

The Multi-Stud Tensioning Machine (MSTM) is specialized equipment used to open and seal the cover of the Reactor Pressure Vessel (RPV) during nuclear power plant maintenance. The tensioning residual values of the 58 studs are monitored for procedure evaluation. It is time-consuming for human operators to place the measurement meters into working positions. In order to reduce labor intensity and eliminate radiation exposure time, we develop a robot-assisted automatic measurement system to achieve meter placement and real-time data monitoring. The Field Programmable Gate Array (FPGA)-based distributed control scheme realizes high-speed data acquisition and coordinated control of the 58 node robots. The control software performs data analysis and sends emergency stop signals to the MSTM control PLC. The proposed system is validated at the China Nuclear Power Technology Research Institute. Total operation time decreases from over 580 s to less than 120 s.

IROS Conference 2006 Conference Paper

Camera Calibration Based on the RBF Neural Network with Tunable Nodes for Visual Servoing in Robotics

  • Xiaoping Zong
  • Yan Xu
  • Lei Hao
  • Xiaoli Huai

In this paper, a new approach based on the radial basis function (RBF) network is proposed for solving the camera calibration problem in visual servoing robots. In this approach, an extended multi-input and multi-output orthogonal forward selection algorithm based on the leave-one-out criterion is applied to construct RBF networks with tunable nodes. This algorithm is computationally efficient and is capable of identifying parsimonious RBF networks that generalize well. Moreover, the proposed algorithm is fully automatic, and the user does not need to specify a termination criterion for the construction process. The constructed parsimonious multi-input and multi-output RBF network can complete camera calibration automatically and rapidly, and simulations have shown that the approach is feasible.
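As a rough sketch of the calibration-as-regression idea, the snippet below fits a radial basis function model that maps image coordinates back to world coordinates on synthetic data. SciPy's RBFInterpolator stands in for the paper's orthogonal-forward-selection network with tunable nodes, and the synthetic camera model is an assumption.

import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(1)
world = rng.uniform(-1.0, 1.0, size=(200, 2))       # planar world points (X, Y)

def project(points):
    # synthetic "camera": an affine map plus a mild nonlinearity
    return points @ np.array([[320.0, 12.0], [-8.0, 310.0]]) + 50.0 * points**3

image = project(world) + rng.normal(scale=0.5, size=world.shape)  # noisy observations

model = RBFInterpolator(image, world, kernel="thin_plate_spline", smoothing=1e-3)

test_world = np.array([[0.25, -0.4]])
print(model(project(test_world)))                    # should be close to (0.25, -0.4)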