Arrow Research search

Author name cluster

Shaoyi Du

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers (19)

AAAI Conference 2026 Conference Paper

Cog-RAG: Cognitive-Inspired Dual-Hypergraph with Theme Alignment Retrieval-Augmented Generation

  • Hao Hu
  • Yifan Feng
  • Ruoxue Li
  • Rundong Xue
  • Xingliang Hou
  • Zhiqiang Tian
  • Yue Gao
  • Shaoyi Du

Retrieval-Augmented Generation (RAG) enhances the response quality and domain-specific performance of large language models (LLMs) by incorporating external knowledge to combat hallucinations. In recent research, graph structures have been integrated into RAG to better capture semantic relations between entities. However, such work primarily focuses on low-order pairwise entity relations and overlooks high-order associations among multiple entities. Hypergraph-enhanced approaches address this limitation by modeling multi-entity interactions via hyperedges, but they are typically constrained to inter-chunk entity-level representations, overlooking the global thematic organization and alignment across chunks. Drawing inspiration from the top-down cognitive process of human reasoning, we propose a theme-aligned dual-hypergraph RAG framework (Cog-RAG) that uses a theme hypergraph to capture inter-chunk thematic structure and an entity hypergraph to model high-order semantic relations. Furthermore, we design a cognitive-inspired two-stage retrieval strategy that first activates query-relevant thematic content from the theme hypergraph, and then guides fine-grained recall and diffusion in the entity hypergraph, achieving semantic alignment and consistent generation from global themes to local details. Our extensive experiments demonstrate that Cog-RAG significantly outperforms existing state-of-the-art baseline approaches.

AAAI Conference 2026 Conference Paper

DAPE: Harmonizing Content-Position Encoding for Versatile Dense Visual Prediction

  • Xiuquan Hou
  • Meiqin Liu
  • Senlin Zhang
  • Shaoyi Du

Dense visual prediction tasks, including object detection and segmentation, inherently require precise and discriminative positional information to delineate object boundaries and pixel regions. Recent DETR-based frameworks advance dense prediction tasks through iterative attention applied to content queries, with sampled proposals as position references. However, this paradigm suffers from misaligned sampling distributions and insufficient interaction between content and position features, thereby limiting encoding effectiveness. To overcome these limitations, we investigate the encoding paradigm for content-position harmonization and propose an effective predictor for dense visual tasks, termed DAPE (DETR with hArmonized content-Position Encoding). DAPE introduces explicit position encoding to facilitate content enhancement while maintaining low memory overhead. To achieve this, DAPE comprises a Shifted Query Sampler (SQS) that enforces strict alignment between the distributions of content and position queries, and a 2D Low-Rank Position Encoder (LRPE) that progressively modulates attention maps based on the aligned representations. DAPE provides a unified solution for various dense prediction tasks. Extensive experiments on object detection, instance segmentation, and few-shot detection benchmarks demonstrate that DAPE achieves state-of-the-art performance while reducing memory consumption.

AAAI Conference 2026 Conference Paper

MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation

  • Shengwei Zhao
  • Jingwen Yao
  • Sitong Wei
  • Linhai Xu
  • Yuying Liu
  • Dong Zhang
  • Zhiqiang Tian
  • Shaoyi Du

Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.

JBHI Journal 2026 Journal Article

Multi-Scale Temporal Analysis With a Dual-Branch Attention Network for Interpretable Gait-Based Classification of Neurodegenerative Diseases

  • Wei Zeng
  • Zhangbo Peng
  • Yang Chen
  • Shaoyi Du

The accurate diagnosis of neurodegenerative diseases (NDDs), such as Amyotrophic Lateral Sclerosis (ALS), Huntington’s Disease (HD), and Parkinson’s Disease (PD), remains a clinical challenge due to the complexity and subtlety of gait abnormalities. This paper proposes the Dual-Branch Attention-Enhanced Residual Network (DAERN), a novel deep learning architecture that integrates Dilated Causal Convolutions (DCCBlock) for local gait pattern extraction and Multi-Head Self-Attention (MHSA) for long-range dependency modeling. A Cross-Attention Fusion module enhances feature integration, while SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG) improve interpretability, providing clinically relevant insights into gait-based NDD classification. Uniform Manifold Approximation and Projection (UMAP) visualizations reveal well-separated clusters corresponding to distinct NDD categories, demonstrating the model’s ability to capture discriminative features. Comprehensive ablation studies validate the contributions of model components and preprocessing strategies, highlighting the significance of each in achieving state-of-the-art classification performance. Experimental evaluations on the Gait in Neurodegenerative Disease (GaitNDD) dataset demonstrate that DAERN achieves an accuracy of 99.64%, an F1-score of 99.65%, and an AUC of 0.9997, significantly outperforming conventional deep learning and machine learning baselines. These findings suggest that DAERN could be a valuable and interpretable tool for clinical gait assessment, aiding in early-stage monitoring and automated screening of NDDs, with potential applications in real-time wearable sensor-based gait analysis.
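Dilated causal convolutions, as used in the DCCBlock above, widen the temporal receptive field while keeping each output dependent only on current and past samples. A minimal NumPy sketch of the operation (illustrative only; the function name and shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def dilated_causal_conv(x, kernel, dilation):
    """1D dilated causal convolution: each output sample depends only on
    the current and past inputs, with gaps of `dilation` samples between
    the kernel taps. Left-pads with zeros so the output length matches x."""
    k = len(kernel)
    pad = (k - 1) * dilation                  # causal left padding
    xp = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            # tap i looks i*dilation steps into the past
            out[t] += kernel[i] * xp[pad + t - i * dilation]
    return out
```

Stacking such layers with exponentially growing dilations (1, 2, 4, ...) is the usual way to cover long gait cycles with few parameters.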

AAAI Conference 2026 Conference Paper

Role Hypergraph Contrastive Learning for Multivariate Time-Series Analysis

  • Rundong Xue
  • Hao Hu
  • Zhitao Zeng
  • Xiangmin Han
  • Zhiqiang Tian
  • Shaoyi Du
  • Yue Gao

Multivariate Time-Series (MTS) analysis is crucial across various domains. Considering the spatial and temporal consistency of MTS, existing methods leverage graph structures with temporal augmentation and contrastive learning to achieve robust learning of spatial dependencies and temporal patterns. Given the inherent high-order correlations in MTS, hypergraphs present a promising approach. However, two key challenges limit their further development: 1) Feature-based perspectives capture limited spatial information, while structural perspectives encode richer spatial consistency and evolution dependency; 2) Various semantic patterns (e.g., synergy, inhibition) entangle in sensor correlations, leading to semantic ambiguity. The underlying reason is that conventional hypergraph structures cannot distinguish specific semantic roles within or across hyperedges. Thus, we propose Role Hypergraph Contrastive Learning for MTS analysis. Specifically, we introduce the concept of role to generalize hypergraphs to Role Hypergraphs, enabling precise modeling of sensor correlations by assigning each vertex-hyperedge pair a semantic role. Building on this structure, we design a role hypergraph contrastive learning paradigm to comprehensively capture the spatial and temporal dependencies: From a structural perspective, role hypergraph structural contrasting captures spatial short-term consistency and long-term evolution; from a feature perspective, alignment of complementary role information ensures sensor-level temporal consistency. Experiments on classification and forecasting tasks demonstrate the effectiveness and interpretability of our method.

ICLR Conference 2025 Conference Paper

Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?

  • Yifan Feng 0001
  • Chengwu Yang
  • Xingliang Hou
  • Shaoyi Du
  • Shihui Ying
  • Zongze Wu 0001
  • Yue Gao 0002

Existing benchmarks like NLGraph and GraphQA evaluate LLMs on graphs by focusing mainly on pairwise relationships, overlooking the high-order correlations found in real-world data. Hypergraphs, which can model complex beyond-pairwise relationships, offer a more robust framework but are still underexplored in the context of LLMs. To address this gap, we introduce LLM4Hypergraph, the first comprehensive benchmark comprising 21,500 problems across eight low-order, five high-order, and two isomorphism tasks, utilizing both synthetic and real-world hypergraphs from citation networks and protein structures. We evaluate six prominent LLMs, including GPT-4o, demonstrating our benchmark’s effectiveness in identifying model strengths and weaknesses. Our specialized prompting framework incorporates seven hypergraph languages and introduces two novel techniques, Hyper-BAG and Hyper-COT, which enhance high-order reasoning and achieve an average 4% (up to 9%) performance improvement on structure classification tasks. This work establishes a foundational testbed for integrating hypergraph computational capabilities into LLMs, advancing their comprehension.

ICRA Conference 2025 Conference Paper

ERetinex: Event Camera Meets Retinex Theory for Low-Light Image Enhancement

  • Xuejian Guo
  • Zhiqiang Tian
  • Yuehang Wang
  • Siqi Li 0001
  • Yu Jiang 0006
  • Shaoyi Du
  • Yue Gao 0002

Low-light image enhancement aims to restore under-exposed images captured in dark scenarios. Under such scenarios, traditional frame-based cameras may fail to capture structure and color information due to exposure time limitations. Event cameras are bio-inspired vision sensors that respond to pixel-wise brightness changes asynchronously. Event cameras' high dynamic range is pivotal for visual perception in extreme low-light scenarios, surpassing traditional cameras and enabling applications in challenging dark environments. In this paper, inspired by the success of the retinex theory for traditional frame-based low-light image restoration, we introduce the first method that combines the retinex theory with event cameras and propose a novel retinex-based low-light image restoration framework named ERetinex. Among our contributions, the first is developing a new approach that leverages the high temporal resolution data from event cameras with traditional image information to estimate scene illumination accurately. This method outperforms traditional image-only techniques, especially in low-light environments, by providing more precise lighting information. Additionally, we propose an effective fusion strategy that combines the high dynamic range data from event cameras with the color information of traditional images to enhance image quality. Through this fusion, we can generate clearer and more detail-rich images, maintaining the integrity of visual information even under extreme lighting conditions. The experimental results indicate that our proposed method outperforms state-of-the-art (SOTA) methods, achieving a gain of 1.0613 dB in PSNR while reducing FLOPS by 84.28%. The code is available at https://github.com/lodew920/ERetinex.
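The Retinex theory referenced above factors an observed image into reflectance and illumination, I = R · L, and recovers reflectance in the log domain. A toy single-scale decomposition, using a local-mean (box filter) illumination estimate as a stand-in (a simplification for illustration; ERetinex estimates illumination from event data, not a box filter):

```python
import numpy as np

def retinex_reflectance(img, k=5):
    """Single-scale Retinex sketch: treat a k x k local mean as the
    illumination estimate L, then recover log-reflectance via
    log R = log I - log L (from the Retinex assumption I = R * L)."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')    # replicate borders
    h, w = img.shape
    L = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            L[i, j] = padded[i:i + k, j:j + k].mean()
    eps = 1e-6                                # avoid log(0)
    return np.log(img + eps) - np.log(L + eps)
```

Under uniform illumination the log-reflectance is zero everywhere; shading varies slowly, so subtracting the smoothed log-image removes it while keeping edges and texture.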

JBHI Journal 2025 Journal Article

HSC-T: B-Ultrasound-to-Elastography Translation via Hierarchical Structural Consistency Learning for Thyroid Cancer Diagnosis

  • Hongcheng Han
  • Zhiqiang Tian
  • Qinbo Guo
  • Jue Jiang
  • Shaoyi Du
  • Juan Wang

Elastography ultrasound imaging is increasingly important in the diagnosis of thyroid cancer and other diseases, but its reliance on specialized equipment and techniques limits widespread adoption. This paper proposes a novel multimodal ultrasound diagnostic pipeline that expands the application of elastography ultrasound by translating B-ultrasound (BUS) images into elastography images (EUS). Additionally, to address the limitations of existing image-to-image translation methods, which struggle to effectively model inter-sample variations and accurately capture regional-scale structural consistency, we propose a BUS-to-EUS translation method based on hierarchical structural consistency. By incorporating domain-level, sample-level, patch-level, and pixel-level constraints, our approach guides the model in learning a more precise mapping from BUS to EUS, thereby enhancing diagnostic accuracy. Experimental results demonstrate that the proposed method significantly improves the accuracy of BUS-to-EUS translation on the MTUSI dataset and that the generated elastography images enhance nodule diagnostic accuracy compared to solely using BUS images on the STUSI and the BUSI datasets. This advancement highlights the potential for broader application of elastography in clinical practice.

NeurIPS Conference 2025 Conference Paper

Point-MaDi: Masked Autoencoding with Diffusion for Point Cloud Pre-training

  • Xiaoyang Xiao
  • Runzhao Yao
  • Zhiqiang Tian
  • Shaoyi Du

Self-supervised pre-training is essential for 3D point cloud representation learning, as annotating their irregular, topology-free structures is costly and labor-intensive. Masked autoencoders (MAEs) offer a promising framework but rely on explicit positional embeddings, such as patch center coordinates, which leak geometric information and limit data-driven structural learning. In this work, we propose Point-MaDi, a novel Point cloud Masked autoencoding Diffusion framework for pre-training that integrates a dual-diffusion pretext task into an MAE architecture to address this issue. Specifically, we introduce a center diffusion mechanism in the encoder, noising and predicting the coordinates of both visible and masked patch centers without ground-truth positional embeddings. These predicted centers are processed using a transformer with self-attention and cross-attention to capture intra- and inter-patch relationships. In the decoder, we design a conditional patch diffusion process, guided by the encoder's latent features and predicted centers to reconstruct masked patches directly from noise. This dual-diffusion design drives comprehensive global semantic and local geometric representations during pre-training, eliminating external geometric priors. Extensive experiments on ScanObjectNN, ModelNet40, ShapeNetPart, S3DIS, and ScanNet demonstrate that Point-MaDi achieves superior performance across downstream tasks, surpassing Point-MAE by 5.51% on OBJ-BG, 5.17% on OBJ-ONLY, and 4.34% on PB-T50-RS for 3D object classification on the ScanObjectNN dataset.

IROS Conference 2024 Conference Paper

ModaLink: Unifying Modalities for Efficient Image-to-PointCloud Place Recognition

  • Weidong Xie
  • Lun Luo
  • Nanfei Ye
  • Yi Ren
  • Shaoyi Du
  • Minhang Wang
  • Jintao Xu 0001
  • Rui Ai 0001

Place recognition is an important task for robots and autonomous cars to localize themselves and close loops in pre-built maps. While single-modal sensor-based methods have shown satisfactory performance, cross-modal place recognition, which retrieves images from a point-cloud database, remains a challenging problem. Current cross-modal methods transform images into 3D points using depth estimation for modality conversion, which is usually computationally intensive and needs expensive labeled data for depth supervision. In this work, we introduce a fast and lightweight framework to encode images and point clouds into place-distinctive descriptors. We propose an effective Field of View (FoV) transformation module to convert point clouds into a modality analogous to images. This module eliminates the necessity for depth estimation and helps subsequent modules achieve real-time performance. We further design a non-negative factorization-based encoder to extract mutually consistent semantic features between point clouds and images. This encoder yields more distinctive global descriptors for retrieval. Experimental results on the KITTI dataset show that our proposed methods achieve state-of-the-art performance while running in real time. Additional evaluation on the HAOMO dataset covering a 17 km trajectory further shows the practical generalization capabilities. We have released the implementation of our methods as open source at: https://github.com/haomo-ai/ModaLink.git.
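A common way to convert a point cloud into an image-analogous modality, as the FoV transformation above does, is a spherical (range-view) projection: each 3D point maps to a pixel by its azimuth and elevation, and the pixel stores its range. A generic sketch under assumed sensor FoV parameters (illustrative only, not the paper's exact module):

```python
import numpy as np

def points_to_range_image(points, h=32, w=256,
                          fov_up=np.deg2rad(15), fov_down=np.deg2rad(-25)):
    """Project an (N, 3) point cloud into an (h, w) range image.
    Columns index azimuth, rows index elevation; each pixel keeps
    the range of the closest point that falls into it."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                          # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int) % w
    v = np.clip((fov_up - pitch) / (fov_up - fov_down) * h, 0, h - 1).astype(int)
    img = np.zeros((h, w))
    order = np.argsort(-r)                          # nearest point written last
    img[v[order], u[order]] = r[order]
    return img
```

The resulting 2D grid can be fed to the same convolutional machinery as a camera image, which is what makes descriptor extraction across the two modalities comparable.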

AAAI Conference 2024 Conference Paper

PHFormer: Multi-Fragment Assembly Using Proxy-Level Hybrid Transformer

  • Wenting Cui
  • Runzhao Yao
  • Shaoyi Du

Fragment assembly involves restoring broken objects to their original geometries, and has many applications, such as archaeological restoration. Existing learning-based frameworks have shown potential for solving part assembly problems with semantic decomposition, but cannot handle such geometrical decomposition problems. In this work, we propose a novel assembly framework, a proxy-level hybrid Transformer, with the core idea of using a hybrid graph to model and reason about complex structural relationships between patches of fragments, dubbed proxies. To this end, we propose a hybrid attention module, composed of intra- and inter-attention layers, capturing crucial contextual information within fragments and relative structural knowledge across fragments. Furthermore, we propose an adjacency-aware hierarchical pose estimator, exploiting a decompose-and-integrate strategy. It progressively predicts adjacency probability and relative poses between fragments, and then implicitly infers their absolute poses by dynamic information integration. Extensive experimental results demonstrate that our method effectively reduces assembly errors while maintaining fast inference speed. The code is available at https://github.com/521piglet/PHFormer.

ICLR Conference 2024 Conference Paper

Semantic Flow: Learning Semantic Fields of Dynamic Scenes from Monocular Videos

  • Fengrui Tian
  • Yueqi Duan
  • Angtian Wang
  • Jianfei Guo
  • Shaoyi Du

In this work, we pioneer Semantic Flow, a neural semantic representation of dynamic scenes from monocular videos. In contrast to previous NeRF methods that reconstruct dynamic scenes from the colors and volume densities of individual points, Semantic Flow learns semantics from continuous flows that contain rich 3D motion information. As there is a 2D-to-3D ambiguity problem along the viewing direction when extracting 3D flow features from 2D video frames, we consider the volume densities as opacity priors that describe the contributions of flow features to the semantics on the frames. More specifically, we first learn a flow network to predict flows in the dynamic scene, and propose a flow feature aggregation module to extract flow features from video frames. Then, we propose a flow attention module to extract motion information from flow features, which is followed by a semantic network to output semantic logits of flows. We integrate the logits with volume densities in the viewing direction to supervise the flow features with semantic labels on video frames. Experimental results show that our model is able to learn from multiple dynamic scenes and supports a series of new tasks such as instance-level scene editing, semantic completion, dynamic scene tracking and semantic adaptation on novel scenes.

JBHI Journal 2023 Journal Article

An Uncertainty-Aware and Sex-Prior Guided Biological Age Estimation From Orthopantomogram Images

  • Dong Zhang
  • Jing Yang
  • Shaoyi Du
  • Wenqing Bu
  • Yu-cheng Guo

Bone age, as a measure of biological age (BA), plays an important role in a variety of fields, including forensics, orthodontics, sports, and immigration. Despite its significance, accurate estimation of BA remains a challenge due to the uncertainty error between BA and chronological age (CA) caused by individual diversity, and the difficulty of integrating multiple factors, such as sex and identified or measured anatomical structures, into the estimation process. To address these problems, we propose an uncertainty-aware and sex-prior guided method for biological age estimation from orthopantomogram images (OPGs), named UASP-BAE, which models uncertainty errors while using sex dimorphism as a tractive feature to enhance age-related specific features, aiming to improve the accuracy of BA estimation. Furthermore, considering the global relevance of anatomic structures such as the mandible, teeth, and maxillary sinus, a cross-attention module based on CNN and self-attention is proposed to mine the local texture and global semantic features of OPGs. Moreover, we design a novel age composition loss combining cross-entropy, probability bias, and regression functions, aimed at evaluating BA's uncertainty errors and results to obtain an accurate and robust model. On 10703 OPGs spanning 5.00 to 25.00 years of age, our model achieved a best MAE of 0.8005 years, outperforming popular comparison algorithms, which demonstrates the method's potential for improved accuracy in BA estimation.

AAAI Conference 2020 Conference Paper

CF-LSTM: Cascaded Feature-Based Long Short-Term Networks for Predicting Pedestrian Trajectory

  • Yi Xu
  • Jing Yang
  • Shaoyi Du

Pedestrian trajectory prediction is an important but difficult task in the self-driving and autonomous mobile robot fields because there are complex, unpredictable human-human interactions in crowded scenarios. There have been a large number of studies that attempt to understand humans’ social behavior. However, most of these studies extract location features from only the previous time step while neglecting the vital velocity features. In order to address this issue, we propose a novel feature-cascaded framework for long short-term networks (CF-LSTM) without extra artificial settings or social rules. In this framework, feature information from the previous two time steps is first extracted and then integrated as a cascaded feature to the LSTM, which is able to capture the previous location information and dynamic velocity information simultaneously. In addition, this scene-agnostic cascaded feature is the external manifestation of complex human-human interactions, which can also effectively capture dynamic interaction information in different scenes without any other pedestrians’ information. Experiments on public benchmark datasets indicate that our model achieves better performance than the state-of-the-art methods and that this feature-cascaded framework has the ability to implicitly learn human-human interactions.

IROS Conference 2020 Conference Paper

CoBigICP: Robust and Precise Point Set Registration using Correntropy Metrics and Bidirectional Correspondence

  • Pengyu Yin
  • Di Wang 0028
  • Shaoyi Du
  • Shihui Ying
  • Yue Gao 0002
  • Nanning Zheng 0001

In this paper, we propose a novel probabilistic variant of iterative closest point (ICP) dubbed CoBigICP. The method leverages both local geometrical information and global noise characteristics. Locally, the 3D structure of both target and source clouds is incorporated into the objective function through bidirectional correspondence. Globally, the correntropy error metric is introduced as a noise model to resist outliers. Importantly, the close resemblance between the normal-distributions transform (NDT) and correntropy is revealed. To ease the minimization step, an on-manifold parameterization of the special Euclidean group is proposed. Extensive experiments validate that CoBigICP outperforms several well-known and state-of-the-art methods.
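The correntropy metric mentioned above replaces the squared-error term of classical ICP with a Gaussian (Welsch-type) kernel of the residual, so large residuals contribute almost nothing and outliers are effectively ignored. A minimal sketch of the kernel and the resulting objective (generic robust-estimation form, not the paper's full bidirectional formulation):

```python
import numpy as np

def correntropy_weights(residuals, sigma=1.0):
    """Gaussian correntropy kernel: weight ~ exp(-r^2 / (2 sigma^2)).
    Inliers (small r) get weight near 1; gross outliers decay to 0."""
    return np.exp(-residuals ** 2 / (2 * sigma ** 2))

def correntropy_objective(residuals, sigma=1.0):
    # Maximizing total correntropy fits the inliers robustly,
    # unlike least squares, which is dominated by the largest residuals.
    return correntropy_weights(residuals, sigma).sum()
```

In an ICP-style solver these weights are recomputed each iteration from the current point-pair residuals, turning each alignment step into a weighted least-squares problem.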

IROS Conference 2018 Conference Paper

Accurate Mix-Norm-Based Scan Matching

  • Di Wang 0028
  • Jianru Xue
  • Zhongxing Tao
  • Yang Zhong
  • Dixiao Cui
  • Shaoyi Du
  • Nanning Zheng 0001

Highly accurate mapping and localization is of prime importance for mobile robotics, and its core lies in efficient scan matching. Previous research focuses on designing a robust objective function, while the residual error distribution is often ignored or simply assumed to be unitary or a mixture of simple distributions. In this paper, a mixture of exponential power (MoEP) distributions is proposed to approximate the residual error distribution. The objective function induced by MoEP-based residual error modelling yields a mix-norm-based scan matching algorithm (MiNoM), which enhances matching accuracy and convergence characteristics. Both the transformation parameters (rotation and translation) and the residual error distribution are estimated efficiently via an EM-like algorithm. The optimization of MiNoM is achieved iteratively via two phases: an on-line parameter learning (OPL) phase that learns the residual error distribution for better representation according to the likelihood field model (LFM), and an iteratively reweighted least squares (IRLS) phase that attains the transformation for accuracy and efficiency. Extensive experimental results validate that the proposed MiNoM outperforms several state-of-the-art scan matching algorithms in both convergence characteristics and matching accuracy.
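The IRLS phase above solves a p-norm minimization by repeatedly solving weighted least-squares problems, with weights |r|^(p-2) derived from the current residuals. A compact sketch for a generic linear model (illustrative; MiNoM applies the same idea to the scan-matching transformation, not to a linear system):

```python
import numpy as np

def irls_pnorm(A, b, p=1.2, iters=50, eps=1e-8):
    """Iteratively Reweighted Least Squares for min_x ||Ax - b||_p.
    Each iteration solves a weighted least-squares problem whose
    weights w_i = |r_i|^(p-2) down-weight large residuals for p < 2."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]      # ordinary LS start
    for _ in range(iters):
        r = A @ x - b
        w = np.maximum(np.abs(r), eps) ** (p - 2)  # eps guards division by 0
        sw = np.sqrt(w)
        x = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
    return x
```

With p = 1 this converges toward the median-like robust fit, which is why mix-norm objectives tolerate the heavy-tailed residuals that plain least squares cannot.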

IROS Conference 2014 Conference Paper

Real-time global localization of intelligent road vehicles in lane-level via lane marking detection and shape registration

  • Dixiao Cui
  • Jianru Xue
  • Shaoyi Du
  • Nanning Zheng 0001

In this paper, we propose an accurate and real-time positioning method for intelligent road vehicles in urban environments. The proposed method uses a robust lane marking detection algorithm, as well as an efficient shape registration algorithm between the detected lane markings and a GPS-based road shape prior, to improve the robustness and accuracy of global localization of a road vehicle. We exploit both the state-of-the-art technologies of visual localization based on lane marking detection and the wide availability of Global Positioning System (GPS) based localization. We show that by formulating the positioning problem in a relative sense, we can estimate the vehicle localization in real time and bound its absolute error at centimeter level by a cross-validation scheme. The validation scheme integrates the vision-based lane marking detection with the shape registration, and improves the performance of the overall localization system. The GPS localization can be refined using lane marking detection when the GPS suffers from frequent satellite signal masking or blockage, while lane marking detection is validated and complemented by the GPS-based road shape prior when it does not work well in adverse weather conditions or with poor lane signatures. We extensively evaluate the proposed method with a single forward-looking camera mounted on an autonomous vehicle traveling at 60 km/h through several urban street scenes.

IS Journal 2008 Journal Article

50 Years of Image Processing and Pattern Recognition in China

  • Nanning Zheng
  • Qubo You
  • Gaofeng Meng
  • Jihua Zhu
  • Shaoyi Du
  • Jianyi Liu

This article briefly reviews the development of image recognition in and outside China. It presents theoretical research achievements and applied research as well as several typical applications of image recognition in China. Finally, it discusses future trends in image recognition integrated with cognitive science. This article is part of a special issue on AI in China.