Arrow Research search

Author name cluster

Shaoyi Du

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

19 papers
2 author rows

Possible papers (19)

AAAI Conference 2026 Conference Paper

Cog-RAG: Cognitive-Inspired Dual-Hypergraph with Theme Alignment Retrieval-Augmented Generation

  • Hao Hu
  • Yifan Feng
  • Ruoxue Li
  • Rundong Xue
  • Xingliang Hou
  • Zhiqiang Tian
  • Yue Gao
  • Shaoyi Du

Retrieval-Augmented Generation (RAG) enhances the response quality and domain-specific performance of large language models (LLMs) by incorporating external knowledge to combat hallucinations. In recent research, graph structures have been integrated into RAG to better capture semantic relations between entities. However, such work primarily focuses on low-order pairwise entity relations and overlooks high-order associations among multiple entities. Hypergraph-enhanced approaches address this limitation by modeling multi-entity interactions via hyperedges, but they are typically constrained to inter-chunk entity-level representations, overlooking the global thematic organization and alignment across chunks. Drawing inspiration from the top-down cognitive process of human reasoning, we propose a theme-aligned dual-hypergraph RAG framework (Cog-RAG) that uses a theme hypergraph to capture inter-chunk thematic structure and an entity hypergraph to model high-order semantic relations. Furthermore, we design a cognitive-inspired two-stage retrieval strategy that first activates query-relevant thematic content from the theme hypergraph, and then guides fine-grained recall and diffusion in the entity hypergraph, achieving semantic alignment and consistent generation from global themes to local details. Our extensive experiments demonstrate that Cog-RAG significantly outperforms existing state-of-the-art baseline approaches.

AAAI Conference 2026 Conference Paper

DAPE: Harmonizing Content-Position Encoding for Versatile Dense Visual Prediction

  • Xiuquan Hou
  • Meiqin Liu
  • Senlin Zhang
  • Shaoyi Du

Dense visual prediction tasks, including object detection and segmentation, inherently require precise and discriminative positional information to delineate object boundaries and pixel regions. Recent DETR-based frameworks advance dense prediction tasks through iterative attention applied to content queries, with sampled proposals as position references. However, this paradigm suffers from misaligned sampling distributions and insufficient interaction between content and position features, thereby limiting encoding effectiveness. To overcome these limitations, we investigate the encoding paradigm for content-position harmonization and propose an effective predictor for dense visual tasks, termed DAPE (DETR with hArmonized content-Position Encoding). DAPE introduces explicit position encoding to facilitate content enhancement while maintaining low memory overhead. To achieve this, DAPE comprises a Shifted Query Sampler (SQS) that enforces strict alignment between the distributions of content and position queries, and a 2D Low-Rank Position Encoder (LRPE) that progressively modulates attention maps based on the aligned representations. DAPE provides a unified solution for various dense prediction tasks. Extensive experiments on object detection, instance segmentation, and few-shot detection benchmarks demonstrate that DAPE achieves state-of-the-art performance while reducing memory consumption.

AAAI Conference 2026 Conference Paper

MMRAG-RFT: Two-stage Reinforcement Fine-tuning for Explainable Multi-modal Retrieval-augmented Generation

  • Shengwei Zhao
  • Jingwen Yao
  • Sitong Wei
  • Linhai Xu
  • Yuying Liu
  • Dong Zhang
  • Zhiqiang Tian
  • Shaoyi Du

Multi-modal Retrieval-Augmented Generation (MMRAG) enables highly credible generation by integrating external multi-modal knowledge, thus demonstrating impressive performance in complex multi-modal scenarios. However, existing MMRAG methods fail to clarify the reasoning logic behind retrieval and response generation, which limits the explainability of the results. To address this gap, we propose to introduce reinforcement learning into multi-modal retrieval-augmented generation, enhancing the reasoning capabilities of multi-modal large language models through a two-stage reinforcement fine-tuning framework to achieve explainable multi-modal retrieval-augmented generation. Specifically, in the first stage, rule-based reinforcement fine-tuning is employed to perform coarse-grained point-wise ranking of multi-modal documents, effectively filtering out those that are significantly irrelevant. In the second stage, reasoning-based reinforcement fine-tuning is utilized to jointly optimize fine-grained list-wise ranking and answer generation, guiding multi-modal large language models to output explainable reasoning logic in the MMRAG process. Our method achieves state-of-the-art results on WebQA and MultimodalQA, two benchmark datasets for multi-modal retrieval-augmented generation, and its effectiveness is validated through comprehensive ablation experiments.

JBHI Journal 2026 Journal Article

Multi-Scale Temporal Analysis With a Dual-Branch Attention Network for Interpretable Gait-Based Classification of Neurodegenerative Diseases

  • Wei Zeng
  • Zhangbo Peng
  • Yang Chen
  • Shaoyi Du

The accurate diagnosis of neurodegenerative diseases (NDDs), such as Amyotrophic Lateral Sclerosis (ALS), Huntington’s Disease (HD), and Parkinson’s Disease (PD), remains a clinical challenge due to the complexity and subtlety of gait abnormalities. This paper proposes the Dual-Branch Attention-Enhanced Residual Network (DAERN), a novel deep learning architecture that integrates Dilated Causal Convolutions (DCCBlock) for local gait pattern extraction and Multi-Head Self-Attention (MHSA) for long-range dependency modeling. A Cross-Attention Fusion module enhances feature integration, while SHapley Additive exPlanations (SHAP) and Integrated Gradients (IG) improve interpretability, providing clinically relevant insights into gait-based NDD classification. Uniform Manifold Approximation and Projection (UMAP) visualizations reveal well-separated clusters corresponding to distinct NDD categories, demonstrating the model’s ability to capture discriminative features. Comprehensive ablation studies validate the contributions of model components and preprocessing strategies, highlighting the significance of each in achieving state-of-the-art classification performance. Experimental evaluations on the Gait in Neurodegenerative Disease (GaitNDD) dataset demonstrate that DAERN achieves an accuracy of 99.64%, an F1-score of 99.65%, and an AUC of 0.9997, significantly outperforming conventional deep learning and machine learning baselines. These findings suggest that DAERN could be a valuable and interpretable tool for clinical gait assessment, aiding in early-stage monitoring and automated screening of NDDs, with potential applications in real-time wearable sensor-based gait analysis.
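Dilated causal convolutions, as used in the DCCBlock above, widen the temporal receptive field while keeping each output dependent only on current and past samples. A minimal NumPy sketch of the operation (illustrative only; the function name and shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def dilated_causal_conv(x, kernel, dilation):
    """1D dilated causal convolution: each output sample depends only on
    the current and past inputs, with gaps of `dilation` samples between
    the kernel taps. Left-pads with zeros so the output length matches x."""
    k = len(kernel)
    pad = (k - 1) * dilation                  # causal left padding
    xp = np.concatenate([np.zeros(pad), x])
    out = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(k):
            # tap i looks i*dilation steps into the past
            out[t] += kernel[i] * xp[pad + t - i * dilation]
    return out
```

Stacking such layers with exponentially growing dilations (1, 2, 4, ...) is the usual way to cover long gait cycles with few parameters.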

AAAI Conference 2026 Conference Paper

Role Hypergraph Contrastive Learning for Multivariate Time-Series Analysis

  • Rundong Xue
  • Hao Hu
  • Zhitao Zeng
  • Xiangmin Han
  • Zhiqiang Tian
  • Shaoyi Du
  • Yue Gao

Multivariate Time-Series (MTS) analysis is crucial across various domains. Considering the spatial and temporal consistency of MTS, existing methods leverage graph structures with temporal augmentation and contrastive learning to achieve robust learning of spatial dependencies and temporal patterns. Given the inherent high-order correlations in MTS, hypergraphs present a promising approach. However, two key challenges limit their further development: 1) Feature-based perspectives capture limited spatial information, while structural perspectives encode richer spatial consistency and evolution dependency; 2) Various semantic patterns (e.g., synergy, inhibition) entangle in sensor correlations, leading to semantic ambiguity. The underlying reason is that conventional hypergraph structures cannot distinguish specific semantic roles within or across hyperedges. Thus, we propose Role Hypergraph Contrastive Learning for MTS analysis. Specifically, we introduce the concept of role to generalize hypergraphs to Role Hypergraphs, enabling precise modeling of sensor correlations by assigning each vertex-hyperedge pair a semantic role. Building on this structure, we design a role hypergraph contrastive learning paradigm to comprehensively capture the spatial and temporal dependencies: From a structural perspective, role hypergraph structural contrasting captures spatial short-term consistency and long-term evolution; from a feature perspective, alignment of complementary role information ensures sensor-level temporal consistency. Experiments on classification and forecasting tasks demonstrate the effectiveness and interpretability of our method.

ICLR Conference 2025 Conference Paper

Beyond Graphs: Can Large Language Models Comprehend Hypergraphs?

  • Yifan Feng 0001
  • Chengwu Yang
  • Xingliang Hou
  • Shaoyi Du
  • Shihui Ying
  • Zongze Wu 0001
  • Yue Gao 0002

Existing benchmarks like NLGraph and GraphQA evaluate LLMs on graphs by focusing mainly on pairwise relationships, overlooking the high-order correlations found in real-world data. Hypergraphs, which can model complex beyond-pairwise relationships, offer a more robust framework but are still underexplored in the context of LLMs. To address this gap, we introduce LLM4Hypergraph, the first comprehensive benchmark comprising 21,500 problems across eight low-order, five high-order, and two isomorphism tasks, utilizing both synthetic and real-world hypergraphs from citation networks and protein structures. We evaluate six prominent LLMs, including GPT-4o, demonstrating our benchmark’s effectiveness in identifying model strengths and weaknesses. Our specialized prompting framework incorporates seven hypergraph languages and introduces two novel techniques, Hyper-BAG and Hyper-COT, which enhance high-order reasoning and achieve an average 4% (up to 9%) performance improvement on structure classification tasks. This work establishes a foundational testbed for integrating hypergraph computational capabilities into LLMs, advancing their comprehension.

ICRA Conference 2025 Conference Paper

ERetinex: Event Camera Meets Retinex Theory for Low-Light Image Enhancement

  • Xuejian Guo
  • Zhiqiang Tian
  • Yuehang Wang
  • Siqi Li 0001
  • Yu Jiang 0006
  • Shaoyi Du
  • Yue Gao 0002

Low-light image enhancement aims to restore under-exposed images captured in dark scenarios. Under such scenarios, traditional frame-based cameras may fail to capture structure and color information due to exposure time limitations. Event cameras are bio-inspired vision sensors that respond to pixel-wise brightness changes asynchronously. Event cameras' high dynamic range is pivotal for visual perception in extreme low-light scenarios, surpassing traditional cameras and enabling applications in challenging dark environments. In this paper, inspired by the success of the retinex theory for traditional frame-based low-light image restoration, we introduce the first method that combines the retinex theory with event cameras and propose a novel retinex-based low-light image restoration framework named ERetinex. Among our contributions, the first is developing a new approach that leverages the high temporal resolution data from event cameras with traditional image information to estimate scene illumination accurately. This method outperforms traditional image-only techniques, especially in low-light environments, by providing more precise lighting information. Additionally, we propose an effective fusion strategy that combines the high dynamic range data from event cameras with the color information of traditional images to enhance image quality. Through this fusion, we can generate clearer and more detail-rich images, maintaining the integrity of visual information even under extreme lighting conditions. The experimental results indicate that our proposed method outperforms state-of-the-art (SOTA) methods, achieving a gain of 1.0613 dB in PSNR while reducing FLOPS by 84.28%. The code is available at https://github.com/lodew920/ERetinex.
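The Retinex theory referenced above factors an observed image into reflectance and illumination, I = R · L, and recovers reflectance in the log domain. A toy single-scale decomposition, using a local-mean (box filter) illumination estimate as a stand-in (a simplification for illustration; ERetinex estimates illumination from event data, not a box filter):

```python
import numpy as np

def retinex_reflectance(img, k=5):
    """Single-scale Retinex sketch: treat a k x k local mean as the
    illumination estimate L, then recover log-reflectance via
    log R = log I - log L (from the Retinex assumption I = R * L)."""
    pad = k // 2
    padded = np.pad(img, pad, mode='edge')    # replicate borders
    h, w = img.shape
    L = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            L[i, j] = padded[i:i + k, j:j + k].mean()
    eps = 1e-6                                # avoid log(0)
    return np.log(img + eps) - np.log(L + eps)
```

Under uniform illumination the log-reflectance is zero everywhere; shading varies slowly, so subtracting the smoothed log-image removes it while keeping edges and texture.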

JBHI Journal 2025 Journal Article

HSC-T: B-Ultrasound-to-Elastography Translation via Hierarchical Structural Consistency Learning for Thyroid Cancer Diagnosis

  • Hongcheng Han
  • Zhiqiang Tian
  • Qinbo Guo
  • Jue Jiang
  • Shaoyi Du
  • Juan Wang

Elastography ultrasound imaging is increasingly important in the diagnosis of thyroid cancer and other diseases, but its reliance on specialized equipment and techniques limits widespread adoption. This paper proposes a novel multimodal ultrasound diagnostic pipeline that expands the application of elastography ultrasound by translating B-ultrasound (BUS) images into elastography images (EUS). Additionally, to address the limitations of existing image-to-image translation methods, which struggle to effectively model inter-sample variations and accurately capture regional-scale structural consistency, we propose a BUS-to-EUS translation method based on hierarchical structural consistency. By incorporating domain-level, sample-level, patch-level, and pixel-level constraints, our approach guides the model in learning a more precise mapping from BUS to EUS, thereby enhancing diagnostic accuracy. Experimental results demonstrate that the proposed method significantly improves the accuracy of BUS-to-EUS translation on the MTUSI dataset and that the generated elastography images enhance nodule diagnostic accuracy compared to solely using BUS images on the STUSI and the BUSI datasets. This advancement highlights the potential for broader application of elastography in clinical practice.

NeurIPS Conference 2025 Conference Paper

Point-MaDi: Masked Autoencoding with Diffusion for Point Cloud Pre-training

  • Xiaoyang Xiao
  • Runzhao Yao
  • Zhiqiang Tian
  • Shaoyi Du

Self-supervised pre-training is essential for 3D point cloud representation learning, as annotating their irregular, topology-free structures is costly and labor-intensive. Masked autoencoders (MAEs) offer a promising framework but rely on explicit positional embeddings, such as patch center coordinates, which leak geometric information and limit data-driven structural learning. In this work, we propose Point-MaDi, a novel Point cloud Masked autoencoding Diffusion framework for pre-training that integrates a dual-diffusion pretext task into an MAE architecture to address this issue. Specifically, we introduce a center diffusion mechanism in the encoder, noising and predicting the coordinates of both visible and masked patch centers without ground-truth positional embeddings. These predicted centers are processed using a transformer with self-attention and cross-attention to capture intra- and inter-patch relationships. In the decoder, we design a conditional patch diffusion process, guided by the encoder's latent features and predicted centers to reconstruct masked patches directly from noise. This dual-diffusion design drives comprehensive global semantic and local geometric representations during pre-training, eliminating external geometric priors. Extensive experiments on ScanObjectNN, ModelNet40, ShapeNetPart, S3DIS, and ScanNet demonstrate that Point-MaDi achieves superior performance across downstream tasks, surpassing Point-MAE by 5.51% on OBJ-BG, 5.17% on OBJ-ONLY, and 4.34% on PB-T50-RS for 3D object classification on the ScanObjectNN dataset.

IROS Conference 2024 Conference Paper

ModaLink: Unifying Modalities for Efficient Image-to-PointCloud Place Recognition

  • Weidong Xie
  • Lun Luo
  • Nanfei Ye
  • Yi Ren
  • Shaoyi Du
  • Minhang Wang
  • Jintao Xu 0001
  • Rui Ai 0001

Place recognition is an important task for robots and autonomous cars to localize themselves and close loops in pre-built maps. While single-modal sensor-based methods have shown satisfactory performance, cross-modal place recognition, which retrieves images from a point-cloud database, remains a challenging problem. Current cross-modal methods transform images into 3D points using depth estimation for modality conversion, which is usually computationally intensive and needs expensive labeled data for depth supervision. In this work, we introduce a fast and lightweight framework to encode images and point clouds into place-distinctive descriptors. We propose an effective Field of View (FoV) transformation module to convert point clouds into a modality analogous to images. This module eliminates the necessity for depth estimation and helps subsequent modules achieve real-time performance. We further design a non-negative factorization-based encoder to extract mutually consistent semantic features between point clouds and images. This encoder yields more distinctive global descriptors for retrieval. Experimental results on the KITTI dataset show that our proposed methods achieve state-of-the-art performance while running in real time. Additional evaluation on the HAOMO dataset covering a 17 km trajectory further shows the practical generalization capabilities. We have released the implementation of our methods as open source at: https://github.com/haomo-ai/ModaLink.git.
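A common way to convert a point cloud into an image-analogous modality, as the FoV transformation above does, is a spherical (range-view) projection: each 3D point maps to a pixel by its azimuth and elevation, and the pixel stores its range. A generic sketch under assumed sensor FoV parameters (illustrative only, not the paper's exact module):

```python
import numpy as np

def points_to_range_image(points, h=32, w=256,
                          fov_up=np.deg2rad(15), fov_down=np.deg2rad(-25)):
    """Project an (N, 3) point cloud into an (h, w) range image.
    Columns index azimuth, rows index elevation; each pixel keeps
    the range of the closest point that falls into it."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.linalg.norm(points, axis=1)
    yaw = np.arctan2(y, x)                          # azimuth in [-pi, pi]
    pitch = np.arcsin(np.clip(z / np.maximum(r, 1e-9), -1.0, 1.0))
    u = ((yaw + np.pi) / (2 * np.pi) * w).astype(int) % w
    v = np.clip((fov_up - pitch) / (fov_up - fov_down) * h, 0, h - 1).astype(int)
    img = np.zeros((h, w))
    order = np.argsort(-r)                          # nearest point written last
    img[v[order], u[order]] = r[order]
    return img
```

The resulting 2D grid can be fed to the same convolutional machinery as a camera image, which is what makes descriptor extraction across the two modalities comparable.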

AAAI Conference 2024 Conference Paper

PHFormer: Multi-Fragment Assembly Using Proxy-Level Hybrid Transformer

  • Wenting Cui
  • Runzhao Yao
  • Shaoyi Du

Fragment assembly involves restoring broken objects to their original geometries, and has many applications, such as archaeological restoration. Existing learning-based frameworks have shown potential for solving part assembly problems with semantic decomposition, but cannot handle such geometrical decomposition problems. In this work, we propose a novel assembly framework, a proxy-level hybrid Transformer, with the core idea of using a hybrid graph to model and reason about complex structural relationships between patches of fragments, dubbed proxies. To this end, we propose a hybrid attention module, composed of intra- and inter-attention layers, capturing crucial contextual information within fragments and relative structural knowledge across fragments. Furthermore, we propose an adjacency-aware hierarchical pose estimator, exploiting a decompose-and-integrate strategy. It progressively predicts adjacency probability and relative poses between fragments, and then implicitly infers their absolute poses by dynamic information integration. Extensive experimental results demonstrate that our method effectively reduces assembly errors while maintaining fast inference speed. The code is available at https://github.com/521piglet/PHFormer.

ICLR Conference 2024 Conference Paper

Semantic Flow: Learning Semantic Fields of Dynamic Scenes from Monocular Videos

  • Fengrui Tian
  • Yueqi Duan
  • Angtian Wang
  • Jianfei Guo
  • Shaoyi Du

In this work, we pioneer Semantic Flow, a neural semantic representation of dynamic scenes from monocular videos. In contrast to previous NeRF methods that reconstruct dynamic scenes from the colors and volume densities of individual points, Semantic Flow learns semantics from continuous flows that contain rich 3D motion information. As there is a 2D-to-3D ambiguity problem along the viewing direction when extracting 3D flow features from 2D video frames, we consider the volume densities as opacity priors that describe the contributions of flow features to the semantics on the frames. More specifically, we first learn a flow network to predict flows in the dynamic scene, and propose a flow feature aggregation module to extract flow features from video frames. Then, we propose a flow attention module to extract motion information from flow features, which is followed by a semantic network to output semantic logits of flows. We integrate the logits with volume densities in the viewing direction to supervise the flow features with semantic labels on video frames. Experimental results show that our model is able to learn from multiple dynamic scenes and supports a series of new tasks such as instance-level scene editing, semantic completion, dynamic scene tracking and semantic adaptation on novel scenes.

JBHI Journal 2023 Journal Article

An Uncertainty-Aware and Sex-Prior Guided Biological Age Estimation From Orthopantomogram Images

  • Dong Zhang
  • Jing Yang
  • Shaoyi Du
  • Wenqing Bu
  • Yu-cheng Guo

Bone age, as a measure of biological age (BA), plays an important role in a variety of fields, including forensics, orthodontics, sports, and immigration. Despite its significance, accurate estimation of BA remains a challenge due to the uncertainty error between BA and chronological age (CA) caused by individual diversity, and the difficulty of integrating multiple factors, such as sex and identified or measured anatomical structures, into the estimation process. To address these problems, we propose an uncertainty-aware and sex-prior guided method for biological age estimation from orthopantomogram images (OPGs), named UASP-BAE, which models uncertainty errors while using sex dimorphism as a tractive feature to enhance age-related specific features, aiming to improve the accuracy of BA estimation. Furthermore, considering the global relevance of anatomic structures such as the mandible, teeth, and maxillary sinus, a cross-attention module based on CNN and self-attention is proposed to mine the local texture and global semantic features of OPGs. Moreover, we design a novel age composition loss combining cross-entropy, probability bias, and regression functions, aimed at evaluating BA's uncertainty errors and results to obtain an accurate and robust model. On 10703 OPGs spanning 5.00 to 25.00 years of age, our model achieved a best MAE of 0.8005 years, outperforming popular comparison algorithms, which demonstrates the method's potential for improved accuracy in BA estimation.

AAAI Conference 2020 Conference Paper

CF-LSTM: Cascaded Feature-Based Long Short-Term Networks for Predicting Pedestrian Trajectory

  • Yi Xu
  • Jing Yang
  • Shaoyi Du

Pedestrian trajectory prediction is an important but difficult task in the self-driving and autonomous mobile robot fields because there are complex, unpredictable human-human interactions in crowded scenarios. There have been a large number of studies that attempt to understand humans’ social behavior. However, most of these studies extract location features from only the previous time step while neglecting the vital velocity features. In order to address this issue, we propose a novel feature-cascaded framework for long short-term networks (CF-LSTM) without extra artificial settings or social rules. In this framework, feature information from the previous two time steps is first extracted and then integrated as a cascaded feature to the LSTM, which is able to capture the previous location information and dynamic velocity information simultaneously. In addition, this scene-agnostic cascaded feature is the external manifestation of complex human-human interactions, which can also effectively capture dynamic interaction information in different scenes without any other pedestrians’ information. Experiments on public benchmark datasets indicate that our model achieves better performance than the state-of-the-art methods and that this feature-cascaded framework has the ability to implicitly learn human-human interactions.

IROS Conference 2020 Conference Paper

CoBigICP: Robust and Precise Point Set Registration using Correntropy Metrics and Bidirectional Correspondence

  • Pengyu Yin
  • Di Wang 0028
  • Shaoyi Du
  • Shihui Ying
  • Yue Gao 0002
  • Nanning Zheng 0001

In this paper, we propose a novel probabilistic variant of iterative closest point (ICP) dubbed CoBigICP. The method leverages both local geometrical information and global noise characteristics. Locally, the 3D structure of both target and source clouds is incorporated into the objective function through bidirectional correspondence. Globally, the correntropy error metric is introduced as a noise model to resist outliers. Importantly, the close resemblance between the normal-distributions transform (NDT) and correntropy is revealed. To ease the minimization step, an on-manifold parameterization of the special Euclidean group is proposed. Extensive experiments validate that CoBigICP outperforms several well-known and state-of-the-art methods.
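The correntropy metric mentioned above replaces the squared-error term of classical ICP with a Gaussian (Welsch-type) kernel of the residual, so large residuals contribute almost nothing and outliers are effectively ignored. A minimal sketch of the kernel and the resulting objective (generic robust-estimation form, not the paper's full bidirectional formulation):

```python
import numpy as np

def correntropy_weights(residuals, sigma=1.0):
    """Gaussian correntropy kernel: weight ~ exp(-r^2 / (2 sigma^2)).
    Inliers (small r) get weight near 1; gross outliers decay to 0."""
    return np.exp(-residuals ** 2 / (2 * sigma ** 2))

def correntropy_objective(residuals, sigma=1.0):
    # Maximizing total correntropy fits the inliers robustly,
    # unlike least squares, which is dominated by the largest residuals.
    return correntropy_weights(residuals, sigma).sum()
```

In an ICP-style solver these weights are recomputed each iteration from the current point-pair residuals, turning each alignment step into a weighted least-squares problem.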

IROS Conference 2018 Conference Paper

Accurate Mix-Norm-Based Scan Matching

  • Di Wang 0028
  • Jianru Xue
  • Zhongxing Tao
  • Yang Zhong
  • Dixiao Cui
  • Shaoyi Du
  • Nanning Zheng 0001

Highly accurate mapping and localization is of prime importance for mobile robotics, and its core lies in efficient scan matching. Previous research focuses on designing a robust objective function, while the residual error distribution is often ignored or simply assumed to be unitary or a mixture of simple distributions. In this paper, a mixture of exponential power (MoEP) distributions is proposed to approximate the residual error distribution. The objective function induced by MoEP-based residual error modelling yields a mix-norm-based scan matching algorithm (MiNoM), which enhances matching accuracy and convergence characteristics. Both the transformation parameters (rotation and translation) and the residual error distribution are estimated efficiently via an EM-like algorithm. The optimization of MiNoM is achieved iteratively via two phases: an on-line parameter learning (OPL) phase that learns the residual error distribution for better representation according to the likelihood field model (LFM), and an iteratively reweighted least squares (IRLS) phase that attains the transformation for accuracy and efficiency. Extensive experimental results validate that the proposed MiNoM outperforms several state-of-the-art scan matching algorithms in both convergence characteristics and matching accuracy.
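The IRLS phase above solves a p-norm minimization by repeatedly solving weighted least-squares problems, with weights |r|^(p-2) derived from the current residuals. A compact sketch for a generic linear model (illustrative; MiNoM applies the same idea to the scan-matching transformation, not to a linear system):

```python
import numpy as np

def irls_pnorm(A, b, p=1.2, iters=50, eps=1e-8):
    """Iteratively Reweighted Least Squares for min_x ||Ax - b||_p.
    Each iteration solves a weighted least-squares problem whose
    weights w_i = |r_i|^(p-2) down-weight large residuals for p < 2."""
    x = np.linalg.lstsq(A, b, rcond=None)[0]      # ordinary LS start
    for _ in range(iters):
        r = A @ x - b
        w = np.maximum(np.abs(r), eps) ** (p - 2)  # eps guards division by 0
        sw = np.sqrt(w)
        x = np.linalg.lstsq(sw[:, None] * A, sw * b, rcond=None)[0]
    return x
```

With p = 1 this converges toward the median-like robust fit, which is why mix-norm objectives tolerate the heavy-tailed residuals that plain least squares cannot.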

IROS Conference 2014 Conference Paper

Real-time global localization of intelligent road vehicles in lane-level via lane marking detection and shape registration

  • Dixiao Cui
  • Jianru Xue
  • Shaoyi Du
  • Nanning Zheng 0001

In this paper, we propose an accurate and real-time positioning method for intelligent road vehicles in urban environments. The proposed method uses a robust lane marking detection algorithm, as well as an efficient shape registration algorithm between the detected lane markings and a GPS-based road shape prior, to improve the robustness and accuracy of global localization of a road vehicle. We exploit both the state-of-the-art technologies of visual localization based on lane marking detection and the wide availability of Global Positioning System (GPS) based localization. We show that by formulating the positioning problem in a relative sense, we can estimate the vehicle localization in real time and bound its absolute error at centimeter level by a cross-validation scheme. The validation scheme integrates the vision-based lane marking detection with the shape registration, and improves the performance of the overall localization system. The GPS localization can be refined using lane marking detection when the GPS suffers from frequent satellite signal masking or blockage, while lane marking detection is validated and complemented by the GPS-based road shape prior when it does not work well in adverse weather conditions or with poor lane signatures. We extensively evaluate the proposed method with a single forward-looking camera mounted on an autonomous vehicle traveling at 60 km/h through several urban street scenes.

IS Journal 2008 Journal Article

50 Years of Image Processing and Pattern Recognition in China

  • Nanning Zheng
  • Qubo You
  • Gaofeng Meng
  • Jihua Zhu
  • Shaoyi Du
  • Jianyi Liu

This article briefly reviews the development of image recognition in and outside China. It presents theoretical research achievements and applied research as well as several typical applications of image recognition in China. Finally, it discusses future trends in image recognition integrated with cognitive science. This article is part of a special issue on AI in China.