Arrow Research search

Author name cluster

Rui Yao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers
2 author rows

Possible papers

AAAI Conference 2026 Conference Paper

Causal Decoupling Domain Generalization for Remote Sensing Change Detection

  • Jiaqi Zhao
  • Jianpeng Xie
  • Yong Zhou
  • Wen-Liang Du
  • Hancheng Zhu
  • Rui Yao

While current state-of-the-art Remote Sensing Change Detection (RSCD) methods can achieve impressive results on individual datasets, they become unreliable in unseen environments and imaging conditions, with performance metrics declining by as much as 60% to 80%. At the same time, variable environments and complex imaging conditions are defining characteristics of remote sensing data, calling for generalizable RSCD methods. To address this issue, we propose CDDGNet, a novel RSCD method capable of domain generalization. The method is based on causal decoupling theory: it progressively decouples invariant change features from variable domain features to extract generalizable characteristics, enabling a network trained on a single domain to accurately identify change regions in other domains. First, the Causal Feature Adaptation Module is proposed to preliminarily decouple and simplify feature information during encoding, using wavelet transformation and feature energy spectralization. Second, the Causal Feature Fusion Module is presented to fully decouple features and aggregate significant change features during decoding, through frequency-domain processing and feature re-attention mechanisms. Third, the Decoupling Effect Loss Function is proposed to optimize the process by evaluating the effectiveness of the causal decoupling. Extensive experiments show that our model significantly outperforms existing methods across multiple groups of generalization tasks with varying levels of difficulty.

AAAI Conference 2026 Conference Paper

CLIPDet3D: Vision-Language Collaborative Distillation for 3D Object Detection

  • Jiaqi Zhao
  • Huanfeng Hu
  • Yong Zhou
  • Wen-Liang Du
  • Kunyang Sun
  • Rui Yao
  • Qigong Sun

Multi-view 3D object detection plays a vital role in autonomous driving systems due to its ability to perceive complex scenes accurately. However, real-world driving data often exhibits a long-tailed distribution, causing significant drops in detection accuracy for rare categories in existing methods. To mitigate this issue, we propose CLIPDet3D, a novel vision-language collaborative framework for multi-view 3D object detection. First, to tackle the difficulty of capturing the semantic information of rare categories, a Vision-Language Collaborative Learning strategy is proposed to incorporate class-level semantic priors from CLIP. Second, a Depth Feature Contrastive Distillation module is designed to overcome the large depth estimation error for rare categories by aligning depth features between a teacher and a student network. Furthermore, to alleviate the difficulty in focusing on regions of rare categories, a Dual-Stream Prompt Attention mechanism is devised to inject learnable prompts and compute attention along both horizontal and vertical BEV directions. Evaluations on the nuScenes dataset demonstrate that CLIPDet3D achieves state-of-the-art accuracy while maintaining efficient inference.

AAAI Conference 2026 Conference Paper

DTTNet: Improving Video Shadow Detection via Dark-Aware Guidance and Tokenized Temporal Modeling

  • Zhicheng Li
  • Kunyang Sun
  • Rui Yao
  • Hancheng Zhu
  • Fuyuan Hu
  • Jiaqi Zhao
  • Zhiwen Shao
  • Yong Zhou

Video shadow detection confronts two entwined difficulties: distinguishing shadows from complex backgrounds and modeling dynamic shadow deformations under varying illumination. To address shadow-background ambiguity, we leverage linguistic priors through the proposed Vision-language Match Module (VMM) and a Dark-aware Semantic Block (DSB), extracting text-guided features to explicitly differentiate shadows from dark objects. Furthermore, we introduce adaptive mask reweighting to downweight penumbra regions during training and apply edge masks at the final decoder stage for better supervision. For temporal modeling of variable shadow shapes, we propose a Tokenized Temporal Block (TTB) that decouples spatiotemporal learning. TTB summarizes cross-frame shadow semantics into learnable temporal tokens, enabling efficient sequence encoding with minimal computational overhead. Comprehensive experiments on multiple benchmark datasets demonstrate state-of-the-art accuracy and real-time inference efficiency.

AAAI Conference 2026 Conference Paper

Interpreting Fedspeak with Confidence: A LLM-Based Uncertainty-Aware Framework Guided by Monetary Policy Transmission Paths

  • Rui Yao
  • Qi Chai
  • Jinhai Yao
  • Siyuan Li
  • Junhao Chen
  • Qi Zhang
  • Hao Wang

"Fedspeak", the stylized and often nuanced language used by the U.S. Federal Reserve, encodes implicit policy signals and strategic stances. The Federal Open Market Committee strategically employs Fedspeak as a communication tool to shape market expectations and influence both domestic and global economic conditions. As such, automatically parsing and interpreting Fedspeak presents a high-impact challenge, with significant implications for financial forecasting, algorithmic trading, and data-driven policy analysis. Technically, to enrich the semantic and contextual representation of Fedspeak texts, we incorporate domain-specific reasoning grounded in the monetary policy transmission mechanism. We further introduce a dynamic uncertainty decoding module to assess the confidence of model predictions, thereby enhancing both classification accuracy and model reliability. Experimental results demonstrate that our framework achieves state-of-the-art performance on the policy stance analysis task. Moreover, statistical analysis reveals a significant positive correlation between perceptual uncertainty and model error rates, validating the effectiveness of perceptual uncertainty as a diagnostic signal.

AAAI Conference 2026 Conference Paper

Unified Representation Causal Prompt Distillation for Re-Inference-Free Lifelong Person Re-Identification

  • Jiaqi Zhao
  • Jie Luo
  • Yong Zhou
  • Wen-Liang Du
  • Xixi Li
  • Rui Yao

Lifelong person re-identification (LReID) aims to retrieve the target person from sequentially collected data. Due to significant domain gaps between datasets and the continuous increase of training data from different scenarios, weak inter-domain generalization and catastrophic forgetting issues have remained major challenges for LReID. To tackle these issues, a novel LReID method called Unified Representation Causal Prompt Distillation (URCPD) is proposed. Specifically, to reduce domain gaps among different scene datasets and improve model inter-domain generalization capability, a Feature Decoupling Style Transfer module (FDST) is proposed to map new features into a unified feature space. Furthermore, to reduce the accumulated forgetting of old knowledge during the training stage, a Causal Prompt Distillation module (CPD) is introduced. This module eliminates the re-inference process for distillation and embeds memory prompts to combat catastrophic forgetting. Extensive experiments on five classic LReID seen datasets and seven unseen datasets demonstrate that our method significantly outperforms state-of-the-art methods.

IJCAI Conference 2025 Conference Paper

Beyond Individual and Point: Next POI Recommendation via Region-aware Dynamic Hypergraph with Dual-level Modeling

  • Xixi Li
  • Zhuo Gu
  • Rui Yao
  • Yong Zhou
  • Hancheng Zhu
  • Jiaqi Zhao
  • Wen-Liang Du

Next POI recommendation contributes to the prosperity of various intelligent location-based services. Existing studies focus on exploring sequential patterns and POI interactions using sequential and graph-based methods to enhance recommendation performance. However, they do not effectively exploit geographical information. In addition, methods that model mobility patterns using an individual's limited data may suffer from data sparsity and the information cocoon problem. Moreover, most graph structures focus on adjacent nodes, failing to capture potential high-order associations among POIs. To address these challenges, we propose the Region-aware dynamic Hypergraph learning method with Dual-level interaction Modeling (ReHDM), which exploits users' dynamic mobility beyond the individual and point levels. Specifically, ReHDM utilizes regional encoding to mine the potential spatial relationships among POIs with coarse-grained geographical information. By incorporating POI-level and trajectory-level associations within a hypergraph convolutional network, ReHDM comprehensively captures cross-user collaborative information. Furthermore, ReHDM captures not only dependencies among POIs within each trajectory of a single user, but also the high-order collaborative information across individual user trajectories and associated users' trajectories. Experimental results on three public datasets demonstrate the superiority of ReHDM over the state-of-the-art.

IJCAI Conference 2025 Conference Paper

Counterfactual Knowledge Maintenance for Unsupervised Domain Adaptation

  • Yao Li
  • Yong Zhou
  • Jiaqi Zhao
  • Wen-Liang Du
  • Rui Yao
  • Bing Liu

Traditional unsupervised domain adaptation (UDA) struggles to extract rich semantics due to backbone limitations. Recent large-scale pre-trained visual-language models (VLMs) have shown strong zero-shot learning capabilities in UDA tasks. However, directly using VLMs results in a mixture of semantic and domain-specific information, complicating knowledge transfer. Complex scenes with subtle semantic differences are prone to misclassification, which in turn can result in the loss of features that are crucial for distinguishing between classes. To address these challenges, we propose a novel counterfactual knowledge maintenance UDA framework. Specifically, we employ counterfactual disentanglement to separate the representation of semantic information from domain features, thereby reducing domain bias. Furthermore, to clarify ambiguous visual information specific to classes, we maintain the discriminative knowledge of both visual and textual information. This approach synergistically leverages multimodal information to preserve modality-specific distinguishable features. We conducted extensive experimental evaluations on several public datasets to demonstrate the effectiveness of our method. The source code is available at https://github.com/LiYaolab/CMKUDA.

IJCAI Conference 2025 Conference Paper

GSDet: Gaussian Splatting for Oriented Object Detection

  • Zeyu Ding
  • Jiaqi Zhao
  • Yong Zhou
  • Wen-Liang Du
  • Hancheng Zhu
  • Rui Yao

Oriented object detection has advanced with the development of convolutional neural networks (CNNs) and transformers. However, modern detectors still rely on predefined object candidates, such as anchors in CNN-based methods or queries in transformer-based methods, which struggle to capture spatial information effectively. To address these limitations, we propose GSDet, a novel framework that formulates oriented object detection as Gaussian splatting. Specifically, our approach performs detection within a 3D feature space constructed from image features, where 3D Gaussians are employed to represent oriented objects. These 3D Gaussians are projected onto the image plane to form 2D Gaussians, which are then transformed into oriented boxes. Furthermore, we optimize the mean, anisotropic covariance, and confidence scores of these randomly initialized 3D Gaussians, using a decoder that incorporates 3D Gaussian sampling. Moreover, our method exhibits flexibility, enabling adaptive control and a dynamic number of Gaussians during inference. Experiments on three datasets indicate that GSDet achieves AP50 gains of 0.7% on DIOR-R, 0.3% on DOTA-v1.0, and 0.55% on DOTA-v1.5 when evaluated with adaptive control, and outperforms mainstream detectors.
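The geometric step the abstract describes, projecting a 3D Gaussian onto the image plane and reading off an oriented box, can be sketched with the standard first-order perspective-projection Jacobian. This is an illustrative reconstruction, not the authors' code; the mean, covariance, and focal length below are made-up values.

```python
import numpy as np

# Made-up 3D Gaussian in camera coordinates and a pinhole camera.
mu = np.array([0.5, -0.2, 4.0])            # 3D mean (x, y, z)
Sigma3 = np.diag([0.20, 0.05, 0.10])       # anisotropic 3D covariance
f = 1000.0                                 # focal length in pixels

x, y, z = mu
# Jacobian of the perspective projection (u, v) = (f*x/z, f*y/z)
J = np.array([[f / z, 0.0, -f * x / z**2],
              [0.0, f / z, -f * y / z**2]])

Sigma2 = J @ Sigma3 @ J.T                  # 2D image-plane covariance
evals, evecs = np.linalg.eigh(Sigma2)      # eigenvalues in ascending order

angle = np.arctan2(evecs[1, -1], evecs[0, -1])   # orientation of the major axis
width, height = 2.0 * np.sqrt(evals[::-1])       # box extents from the std devs
```

The eigendecomposition of the 2D covariance yields an oriented box (center, width, height, angle); the paper additionally learns the Gaussian parameters and confidence scores with a decoder, which is beyond this sketch.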

ICML Conference 2025 Conference Paper

Learning Gaussian DAG Models without Condition Number Bounds

  • Constantinos Daskalakis
  • Anthimos Vardis Kandiros
  • Rui Yao

We study the problem of learning the topology of a directed Gaussian Graphical Model under the equal-variance assumption, where the graph has $n$ nodes and maximum in-degree $d$. Prior work has established that $O(d \log n)$ samples are sufficient for this task. However, an important factor that is often overlooked in these analyses is the dependence on the condition number of the covariance matrix of the model. Indeed, all algorithms from prior work require a number of samples that grows polynomially with this condition number. In many cases this is unsatisfactory, since the condition number could grow polynomially with $n$, rendering these prior approaches impractical in high-dimensional settings. In this work, we provide an algorithm that recovers the underlying graph and prove that the number of samples required is independent of the condition number. Furthermore, we establish lower bounds that nearly match the upper bound up to a $d$-factor, thus providing an almost tight characterization of the true sample complexity of the problem. Moreover, under a further assumption that all the variances of the variables are bounded, we design a polynomial-time algorithm that recovers the underlying graph, at the cost of an additional polynomial dependence of the sample complexity on $d$. We complement our theoretical findings with simulations on synthetic datasets that confirm our predictions.
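As a rough illustration of why the equal-variance assumption makes the graph identifiable (a standard bottom-up identification argument from prior work, not this paper's condition-number-free algorithm): with equal error variance sigma^2, the precision matrix satisfies Theta_jj = (1 + sum of squared outgoing edge weights) / sigma^2, so sink nodes minimize the diagonal of Theta and a topological order can be peeled off from the bottom. The graph, weights, and sample size below are made up.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up DAG on 4 nodes: 0 -> 1, 0 -> 2, 1 -> 3, 2 -> 3.
# B[i, j] is the edge weight from parent i to child j.
B = np.zeros((4, 4))
B[0, 1], B[0, 2], B[1, 3], B[2, 3] = 0.8, -0.6, 0.7, 0.5

# Simulate the linear SEM X_j = sum_i B[i, j] * X_i + eps_j, equal variances.
n = 100_000
eps = rng.normal(size=(n, 4))
X = np.zeros_like(eps)
for j in range(4):              # 0, 1, 2, 3 is a valid topological order here
    X[:, j] = X @ B[:, j] + eps[:, j]

# Bottom-up peeling: the current sink minimizes the precision-matrix diagonal.
Sigma = np.cov(X.T)
active, reverse_order = list(range(4)), []
while active:
    Theta = np.linalg.inv(Sigma[np.ix_(active, active)])
    sink = active[int(np.argmin(np.diag(Theta)))]
    reverse_order.append(sink)
    active.remove(sink)

topo = reverse_order[::-1]      # recovered topological order
print(topo)                     # node 0 first, node 3 last
```

Nodes 1 and 2 are interchangeable in the middle of the order; either result is a valid topological order of the DAG. The paper's contribution is achieving such recovery with a sample complexity that does not depend on the condition number of the covariance matrix, which this naive sample-covariance sketch does not address.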

IJCAI Conference 2025 Conference Paper

Modality-Guided Dynamic Graph Fusion and Temporal Diffusion for Self-Supervised RGB-T Tracking

  • Shenglan Li
  • Rui Yao
  • Yong Zhou
  • Hancheng Zhu
  • Kunyang Sun
  • Bing Liu
  • Zhiwen Shao
  • Jiaqi Zhao

To reduce the reliance on large-scale annotations, self-supervised RGB-T tracking approaches have garnered significant attention. However, erroneous pseudo-labels that omit the object region or introduce background noise reduce the effectiveness of modality fusion, while pseudo-label noise triggered by similar objects can further degrade tracking performance. In this paper, we propose GDSTrack, a novel approach that introduces dynamic graph fusion and temporal diffusion to address the above challenges in self-supervised RGB-T tracking. GDSTrack dynamically fuses the modalities of neighboring frames, treats similar objects as distractor noise, and leverages the denoising capability of a generative model. Specifically, by constructing an adjacency matrix via an Adjacency Matrix Generator (AMG), the proposed Modality-guided Dynamic Graph Fusion (MDGF) module uses a dynamic adjacency matrix to guide graph attention, focusing on and fusing the object's coherent regions. Temporal Graph-Informed Diffusion (TGID) models MDGF features from neighboring frames as interference, thus improving robustness against similar-object noise. Extensive experiments conducted on four public RGB-T tracking datasets demonstrate that GDSTrack outperforms the existing state-of-the-art methods. The source code is available at https://github.com/LiShenglana/GDSTrack.

JBHI Journal 2024 Journal Article

NKUT: Dataset and Benchmark for Pediatric Mandibular Wisdom Teeth Segmentation

  • Zhenhuan Zhou
  • Yuzhu Chen
  • Along He
  • Xitao Que
  • Kai Wang
  • Rui Yao
  • Tao Li

Germectomy is a common surgery in pediatric dentistry to prevent the potential dangers caused by impacted mandibular wisdom teeth. Segmentation of mandibular wisdom teeth is a crucial step in surgery planning. However, manually segmenting teeth and bones from 3D volumes is time-consuming and may cause delays in treatment. Deep learning based medical image segmentation methods have demonstrated the potential to reduce the burden of manual annotation, but they still require a lot of well-annotated data for training. In this paper, we first curate a Cone Beam Computed Tomography (CBCT) dataset, NKUT, for the segmentation of pediatric mandibular wisdom teeth. This marks the first publicly available dataset in this domain. Second, we propose a semantic separation scale-specific feature fusion network named WTNet, which introduces two branches to address the teeth and bones segmentation tasks. In WTNet, we design an Input Enhancement (IE) block and a Teeth-Bones Feature Separation (TBFS) block to solve the feature confusion and semantic blur problems in our task. Experimental results suggest that WTNet performs better on NKUT compared to previous state-of-the-art segmentation methods (such as TransUnet), with a maximum DSC lead of nearly 16%.

TIST Journal 2023 Journal Article

Attention-guided Adversarial Attack for Video Object Segmentation

  • Rui Yao
  • Ying Chen
  • Yong Zhou
  • Fuyuan Hu
  • Jiaqi Zhao
  • Bing Liu
  • Zhiwen Shao

Video Object Segmentation (VOS) methods have made many breakthroughs with the help of the continuous development and advancement of deep learning. However, deep learning models are vulnerable to malicious adversarial attacks, which mislead a model into wrong decisions by adding adversarial perturbations that humans cannot perceive to the input image. Such threats remind us that video object segmentation methods are also vulnerable to attacks, which threatens their security. Therefore, we study adversarial attacks on the VOS task to better identify the vulnerabilities of VOS methods, which in turn provides an opportunity to improve their robustness. In this paper, we propose an attention-guided adversarial attack method, which uses spatial attention blocks to capture features with global dependencies to construct correlations between consecutive video frames, and performs multipath aggregation to effectively integrate spatial-temporal perturbations, thereby guiding the deconvolution network to generate adversarial examples with strong attack capability. Specifically, a class loss function is designed to enable the deconvolution network to better activate noise in other regions and suppress activation related to the object class, based on the enhanced feature map of the object class. At the same time, an attentional feature loss is designed to enhance the transferability of the attack. Experimental results on the DAVIS dataset show that the proposed attention-guided adversarial attack method can significantly reduce the segmentation accuracy of OSVOS, reaching a drop rate of 73.6% in the J&F mean on DAVIS 2016. The generated adversarial examples are also highly transferable to other video object segmentation models.

ICRA Conference 2023 Conference Paper

Dimensional Optimization and Anti-Disturbance Analysis of an Upgraded Feed Mechanism in FAST

  • Xiaoyan Wang
  • Bin Zhang 0035
  • Zhaoyang Li
  • Xinyu Gao
  • Fei Zhang 0006
  • Yifan Ma
  • Rui Yao
  • Jianing Yin

The Five-hundred-meter Aperture Spherical radio Telescope (FAST) is a world-renowned large-scale scientific facility with excellent performance for astronomical observation, but it currently cannot observe the center of the Milky Way Galaxy because its observation angle is limited by the heavy weight of the feed cabin. To address this problem, an upgraded feed mechanism (UFM) with a lighter cable structure is designed and employed to replace the existing heavy rigid A-B rotator and Stewart platform in the feed cabin of FAST. The structural dimensions of the UFM are analyzed and optimized under cable tension constraints to meet the requirements of the observation angle. Then, a novel disturbance increment method is proposed to analyze the anti-disturbance ability of the UFM, where a gradually increased disturbance wrench is applied to the UFM with the stiffness matrix iteratively updated. Through the dimensional optimization and further anti-disturbance analysis, the newly designed UFM can indeed meet the higher demand for astronomical observation with a larger observation angle, which benefits from the lightweight cable structure. Besides, the UFM also has appreciable anti-disturbance ability for the long-term stable operation of FAST.

EAAI Journal 2022 Journal Article

Edge-aware and spectral–spatial information aggregation network for multispectral image semantic segmentation

  • Di Zhang
  • Jiaqi Zhao
  • Jingyang Chen
  • Yong Zhou
  • Boyu Shi
  • Rui Yao

Semantic segmentation is a fundamental task in the field of remote sensing image intelligent interpretation and computer vision. Multispectral remote sensing images have attracted more and more researchers’ attention because they can accurately describe different types of reflection spectra. However, inaccurate multispectral feature description leads to edge semantic ambiguity and misclassification of small objects. In this article, we propose a novel network named edge-aware and spectral–spatial information aggregation net (ESSANet) to capture both high-level semantic features and low-level edge details for semantic segmentation of remote sensing images. Specifically, on the one hand, in order to improve the representation ability of discriminant features, we design a two-stream spectral–spatial feature extraction network via 3D hybrid convolution and multi-level aggregation network. On the other hand, in order to eliminate the effect of edge semantic ambiguity, we develop a siamese edge-aware structure and multi-stage edge loss function. Experimental results show that our method achieved 3.5% and 4.09% mean intersection over union (mIoU) score improvements and 2.59% and 3.32% Kappa score improvements compared with the competitive baseline algorithm on the SEN12MS and US3D datasets, respectively. In addition, the method proposed in this paper also achieves a better trade-off between speed and accuracy.

NeurIPS Conference 2021 Conference Paper

Efficient Truncated Linear Regression with Unknown Noise Variance

  • Constantinos Daskalakis
  • Patroklos Stefanou
  • Rui Yao
  • Emmanouil Zampetakis

Truncated linear regression is a classical challenge in Statistics, wherein a label, $y = w^T x + \varepsilon$, and its corresponding feature vector, $x \in \mathbb{R}^k$, are only observed if the label falls in some subset $S \subseteq \mathbb{R}$; otherwise the existence of the pair $(x, y)$ is hidden from observation. Linear regression with truncated observations has remained a challenge, in its general form, since the early works of [Tobin'58, Amemiya '73]. When the distribution of the error is normal with known variance, recent work of [Daskalakis et al. '19] provides computationally and statistically efficient estimators of the linear model, $w$. In this paper, we provide the first computationally and statistically efficient estimators for truncated linear regression when the noise variance is unknown, estimating both the linear model and the variance of the noise. Our estimator is based on an efficient implementation of Projected Stochastic Gradient Descent on the negative log-likelihood of the truncated sample. Importantly, we show that the error of our estimates is asymptotically normal, and we use this to provide explicit confidence regions for our estimates.
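A minimal numerical sketch of the setting (our illustration, not the authors' estimator): with truncation set S = (0, inf), the per-sample negative log-likelihood is log sigma + (y - w.x)^2 / (2 sigma^2) + log Phi(w.x / sigma), and plain gradient descent on (w, log sigma) already recovers both the weights and the unknown noise scale on synthetic data. The paper's contribution is doing this provably and efficiently with projected SGD and establishing asymptotic normality; all problem sizes and constants below are made up.

```python
import numpy as np

def erf_vec(x):
    # Abramowitz & Stegun 7.1.26 approximation of erf (abs error ~1.5e-7).
    sign = np.sign(x)
    x = np.abs(x)
    t = 1.0 / (1.0 + 0.3275911 * x)
    poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741
               + t * (-1.453152027 + t * 1.061405429))))
    return sign * (1.0 - poly * np.exp(-x * x))

Phi = lambda z: 0.5 * (1.0 + erf_vec(z / np.sqrt(2.0)))   # standard normal CDF
phi = lambda z: np.exp(-z**2 / 2.0) / np.sqrt(2.0 * np.pi)

rng = np.random.default_rng(0)
k, sigma_true = 3, 1.5
w_true = np.array([1.0, -2.0, 0.5])

# A pair (x, y) is observed only when the label falls in S = (0, inf).
X_all = rng.normal(size=(20_000, k))
y_all = X_all @ w_true + rng.normal(scale=sigma_true, size=20_000)
X, y = X_all[y_all > 0], y_all[y_all > 0]

# Gradient descent on the truncated negative log-likelihood
#   l(w, sigma) = log sigma + (y - w.x)^2 / (2 sigma^2) + log Phi(w.x / sigma)
w, log_s = np.zeros(k), 0.0
for _ in range(5000):
    s = np.exp(log_s)
    r = (y - X @ w) / s
    z = (X @ w) / s
    mills = phi(z) / Phi(z)                      # inverse Mills ratio
    w -= 0.1 * (X.T @ ((mills - r) / s)) / len(y)
    log_s -= 0.1 * np.mean(1.0 - r**2 - mills * z)

print(np.round(w, 2), round(float(np.exp(log_s)), 2))
```

With enough observed samples the estimates land close to w_true and sigma_true despite roughly half the labels being hidden; a naive least-squares fit on the kept pairs would be biased, since the log Phi term that corrects for the truncation would be missing.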

TIST Journal 2021 Journal Article

Multi-Stage Fusion and Multi-Source Attention Network for Multi-Modal Remote Sensing Image Segmentation

  • Jiaqi Zhao
  • Yong Zhou
  • Boyu Shi
  • Jingsong Yang
  • Di Zhang
  • Rui Yao

With the rapid development of sensor technology, a large amount of remote sensing data has been collected. Extracting feature maps from multi-modal remote sensing images can effectively improve semantic segmentation performance, since the extra modalities provide additional information. How to make full use of multi-modal remote sensing data for semantic segmentation is therefore challenging. Toward this end, we propose a new network called Multi-Stage Fusion and Multi-Source Attention Network ((MS)²-Net) for multi-modal remote sensing data segmentation. The multi-stage fusion module fuses complementary information after calibrating the deviation information by filtering the noise from the multi-modal data. Besides, similar feature points are aggregated by the proposed multi-source attention to enhance the discriminability of features with different modalities. The proposed model is evaluated on publicly available multi-modal remote sensing datasets, and the results demonstrate the effectiveness of the proposed method.

TIST Journal 2020 Journal Article

Video Object Segmentation and Tracking

  • Rui Yao
  • Guosheng Lin
  • Shixiong Xia
  • Jiaqi Zhao
  • Yong Zhou

Object segmentation and object tracking are fundamental research areas in the computer vision community. Both topics face common challenges, such as occlusion, deformation, motion blur, and scale variation; the former also contends with heterogeneous objects, interacting objects, edge ambiguity, and shape complexity, while the latter suffers from difficulties in handling fast motion, out-of-view objects, and real-time processing. Combining the two problems as Video Object Segmentation and Tracking (VOST) can overcome their respective difficulties and improve their performance. VOST can be widely applied to many practical applications such as video summarization, high-definition video compression, human-computer interaction, and autonomous vehicles. This survey aims to provide a comprehensive review of the state-of-the-art VOST methods, classify these methods into different categories, and identify new trends. First, we broadly categorize VOST methods into Video Object Segmentation (VOS) and Segmentation-based Object Tracking (SOT). Each category is further classified into various types based on the segmentation and tracking mechanism. Moreover, we present representative VOS and SOT methods for each time node. Second, we provide a detailed discussion and overview of the technical characteristics of the different methods. Third, we summarize the characteristics of the related video datasets and provide a variety of evaluation metrics. Finally, we point out a set of interesting future directions and draw our own conclusions.

AAAI Conference 2017 Conference Paper

Solving Constrained Combinatorial Optimisation Problems via MAP Inference without High-Order Penalties

  • Zhen Zhang
  • Qinfeng Shi
  • Julian McAuley
  • Wei Wei
  • Yanning Zhang
  • Rui Yao
  • Anton van den Hengel

Solving constrained combinatorial optimization problems via MAP inference is often achieved by introducing extra potential functions for each constraint. This can result in very high order potentials, e.g. a 2nd-order objective with pairwise potentials and a quadratic constraint over all N variables would correspond to an unconstrained objective with an order-N potential. This limits the practicality of such an approach, since inference with high order potentials is tractable only for a few special classes of functions. We propose an approach which is able to solve constrained combinatorial problems using belief propagation without increasing the order. For example, in our scheme the 2nd-order problem above remains order 2 instead of order N. Experiments on applications ranging from foreground detection, image reconstruction, quadratic knapsack, and the M-best solutions problem demonstrate the effectiveness and efficiency of our method. Moreover, we show several situations in which our approach outperforms commercial solvers like CPLEX and others designed for specific constrained MAP inference problems.