Arrow Research search

Author name cluster

Ruoyu Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers

10

TIST Journal 2026 Journal Article

Learning Causality-Aware Exploration with Transformers for Goal-Oriented Navigation

  • Ruoyu Wang
  • Tong Yu
  • Mingjie Li
  • Yuanjiang Cao
  • Yao Liu
  • Lina Yao

Navigation is a fundamental task in Embodied AI research, and recent advances in machine learning algorithms have garnered growing interest in developing versatile Embodied AI systems. However, current research in this domain leaves room for improvement in two respects. First, the direct application of RNNs and Transformers often overlooks the distinct characteristics of navigation tasks compared to traditional sequential data modeling. These architectures are inherently designed to capture long-term dependencies, which are relatively weak in navigation scenarios, potentially limiting their performance on such tasks. Second, the reliance on task-specific configurations, such as pre-trained modules and dataset-specific logic, compromises the generalizability of these methods. We address these constraints by first examining the unique differences between navigation tasks and other sequential data tasks through the lens of causality, presenting a causal framework to elucidate the inadequacies of conventional sequential methods for navigation. Leveraging this causal perspective, we propose Causality-Aware Transformer (CAT) Networks for Navigation, featuring a Causal Understanding Module that enhances the model's environmental understanding capability. Our method is free of task-specific inductive biases and can be trained end-to-end, which enhances its generalizability across various contexts. Empirical evaluations demonstrate that our methodology consistently surpasses benchmark performance across a spectrum of settings, tasks, and simulation environments: specifically, Object Navigation in RoboTHOR, Objective Navigation, Point Navigation in Habitat, and R2R Navigation. Extensive ablation studies attribute the performance gains to the Causal Understanding Module, which demonstrates effectiveness and efficiency in both reinforcement learning and supervised learning settings. Further analysis highlights the robustness of our method, which performs consistently well across diverse experimental settings and varying conditions; this robustness underscores the adaptability and generalizability of our approach, reinforcing its potential for application across a wide range of tasks.

AAAI Conference 2026 Conference Paper

LongSplat: Online Generalizable 3D Gaussian Splatting from Long Sequence Images

  • Guichen Huang
  • Ruoyu Wang
  • Xiangjun Gao
  • Che Sun
  • Yuwei Wu
  • Shenghua Gao
  • Yunde Jia

3D Gaussian Splatting (3DGS) achieves high-fidelity novel view synthesis, but its application in online long-sequence scenarios remains restricted. Existing methods either rely on slow per-scene optimization or lack efficient frame-wise 3DGS updates, making them unsuitable for online long-sequence videos. In this paper, we propose LongSplat, an online real-time 3D Gaussian reconstruction framework designed for long-sequence image input. The core idea of LongSplat is to maintain a global 3DGS set and a streaming 3DGS update mechanism that selectively compresses redundant historical Gaussians and introduces new Gaussians by comparing the current observations with the historical ones. To achieve this, we design a Gaussian-Image Representation (GIR), which encodes 3D Gaussian parameters into a structured, image-like 2D format. GIR enables identity-aware redundancy compression as well as fusion of the current view with historical Gaussians, which supports online reconstruction and adapts the model to long sequences without overwhelming memory or computational costs. Extensive experiments demonstrate that LongSplat achieves state-of-the-art efficiency-quality trade-offs in real-time novel view synthesis, delivering real-time reconstruction while reducing Gaussian counts by 44% compared to per-pixel prediction paradigms.
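The image-like packing idea behind GIR can be sketched as follows. This is a minimal illustration, assuming a simple per-pixel channel layout (position, scale, rotation quaternion, opacity, color); the paper's actual encoding, channel order, and grid assignment are not specified here.

```python
import numpy as np

def gaussians_to_gir(means, scales, rotations, opacities, colors, h, w):
    """Pack per-Gaussian parameters into a structured, image-like 2D grid.

    Each pixel of the (h, w) grid holds the full parameter vector of one
    Gaussian, so ordinary 2D operations (convolution, per-pixel comparison)
    can act on the Gaussian set. Channel layout is illustrative only.
    """
    feat = np.concatenate([means, scales, rotations, opacities, colors], axis=1)
    n = h * w
    assert feat.shape[0] <= n, "grid too small for the Gaussian set"
    gir = np.zeros((n, feat.shape[1]), dtype=np.float32)
    gir[: feat.shape[0]] = feat          # unused pixels stay zero
    return gir.reshape(h, w, -1)         # (H, W, C) image-like tensor

# toy usage: 6 Gaussians (3+3+4+1+3 = 14 channels) packed into a 2x4 grid
rng = np.random.default_rng(0)
g = gaussians_to_gir(
    means=rng.random((6, 3)), scales=rng.random((6, 3)),
    rotations=rng.random((6, 4)), opacities=rng.random((6, 1)),
    colors=rng.random((6, 3)), h=2, w=4,
)
```

Once the set lives in this 2D form, per-pixel comparison of a current-view rendering against the historical grid is what makes identity-aware redundancy checks cheap.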

AAAI Conference 2026 Conference Paper

Seeing from Another Perspective: Evaluating Multi-View Understanding in MLLMs

  • Chun-Hsiao Yeh
  • Chenyu Wang
  • Shengbang Tong
  • Ta-Ying Cheng
  • Ruoyu Wang
  • Tianzhe Chu
  • Yuexiang Zhai
  • Yubei Chen

Multi-view understanding, the ability to reconcile visual information across diverse viewpoints for effective navigation, manipulation, and 3D scene comprehension, is a fundamental challenge for Multi-Modal Large Language Models (MLLMs) deployed as embodied agents. While recent MLLMs have shown impressive advances in high-level reasoning and planning, they frequently fall short when confronted with multi-view geometric consistency and cross-view correspondence. To comprehensively evaluate the challenges of MLLMs in multi-view scene reasoning, we introduce All-Angles Bench, a carefully human-curated benchmark with over 2,100 question-answer pairs from 90 diverse, real-world scenes. Our broad evaluation across 38 general-purpose and 3D spatial reasoning MLLMs reveals a substantial performance gap compared to humans. More critically, our analysis identifies two root failure modes: (1) cross-view object mismatch, the inability to establish consistent object correspondence across views; and (2) cross-view spatial misalignment, the failure to infer accurate camera poses and spatial layouts. These findings underscore a lack of multi-view awareness in current MLLMs, calling for architectural innovations beyond prompt tuning alone. We believe that our benchmark offers valuable insights toward building spatially intelligent MLLMs.

AIIM Journal 2026 Journal Article

Topo-UNet: A topology-aware multi-task network for pulmonary vessel segmentation

  • Lu Liu
  • Ye Yuan
  • Yanxin Ma
  • WEI SHAO
  • Jiahe Song
  • Zhe Wang
  • Ruoyu Wang
  • Wenjun Tan

The precise segmentation of pulmonary vessels is crucial for the early diagnosis and treatment of pulmonary diseases. However, vessel images are frequently compromised by high levels of noise and blurred boundaries, which complicate the extraction of vessel features. Current state-of-the-art (SOTA) methods also encounter challenges such as segmenting fine vessels, interruptions in vessel continuity, and loss of inter-layer information. To address these issues, this study proposes a topology-aware multi-task network called Topo-UNet, which integrates the Bidirectional Slice-wise ConvLSTM (BS-ConvLSTM) module and a topology-aware auxiliary task to enhance the accurate capture of vessel structural features. The BS-ConvLSTM module mitigates discontinuities in vessel structures by extracting spatial continuity features. Meanwhile, the topology-aware auxiliary task employs a Gaussian function to simulate the intensity distribution within vessels, improving the network's capability to accurately identify vessel structures. Additionally, this study introduces a joint auxiliary task-based method for vessel refinement that increases the recognition rate of fine vessels while enhancing segmentation continuity. Extensive experiments were conducted on CT and CTA datasets to evaluate the performance of Topo-UNet. Comparisons with various SOTA methods across multiple metrics show that Topo-UNet demonstrates superior performance in the task of pulmonary vessel segmentation. Specifically, it achieved Dice coefficients of 90.78% and 91.91%, along with Intersection over Union (IoU) scores of 83.31% and 85.09%, across two test datasets. Furthermore, the discussion section presents a grouping evaluation strategy to address the segmentation performance of vessels of varying sizes, and explores a quadratic approach for vessel refinement, enhancing the segmentation of fine vessels. The code of the proposed Topo-UNet is publicly available at https://github.com/liu66-git/Topo-UNet.
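The Gaussian intensity model behind the auxiliary task can be sketched in a few lines. This is an illustrative soft target only: the sigma = radius / 2 choice and the function name are assumptions, not the paper's parameterization.

```python
import numpy as np

def gaussian_vessel_target(dist_to_centerline, radius):
    """Soft auxiliary target that peaks at the vessel centerline and decays
    toward the vessel wall, modeled as a Gaussian of the centerline distance.

    A supervision signal like this rewards the network for recovering the
    tubular intensity profile, not just the binary mask. The sigma choice
    below is illustrative.
    """
    sigma = radius / 2.0
    d = np.asarray(dist_to_centerline, dtype=np.float64)
    return np.exp(-0.5 * (d / sigma) ** 2)

# centerline, mid-radius, and wall of a vessel with radius 2 voxels
t = gaussian_vessel_target([0.0, 1.0, 2.0], radius=2.0)
```

The target is 1.0 on the centerline and decays monotonically toward the wall, giving fine vessels a graded signal even where the binary mask is only one voxel wide.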

NeurIPS Conference 2025 Conference Paper

Adaptive Stochastic Coefficients for Accelerating Diffusion Sampling

  • Ruoyu Wang
  • Beier Zhu
  • Junzhi Li
  • Liangyu Yuan
  • Chi Zhang

Diffusion-based generative processes, formulated as differential equation solving, frequently balance computational speed with sample quality. Our theoretical investigation of ODE- and SDE-based solvers reveals complementary weaknesses: ODE solvers accumulate irreducible gradient error along deterministic trajectories, while SDE methods suffer from amplified discretization errors when the step budget is limited. Building upon this insight, we introduce AdaSDE, a novel single-step SDE solver that aims to unify the efficiency of ODEs with the error resilience of SDEs. Specifically, we introduce a single per-step learnable coefficient, estimated via lightweight distillation, which dynamically regulates the error correction strength to accelerate diffusion sampling. Notably, our framework can be integrated with existing solvers to enhance their capabilities. Extensive experiments demonstrate state-of-the-art performance: at 5 NFE, AdaSDE achieves FID scores of $4.18$ on CIFAR-10, $8.05$ on FFHQ and $6.96$ on LSUN Bedroom. Code is available at https://github.com/WLU-wry02/AdaSDE.
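The role of a per-step coefficient that interpolates between ODE and SDE sampling can be sketched with the standard ODE-SDE interpolation for a variance-exploding diffusion (sigma(t) = t). Here `gamma` stands in for AdaSDE's distilled per-step coefficient; the paper's exact update rule and schedule are not reproduced here.

```python
import numpy as np

def adasde_step(x, t_cur, t_next, score_fn, gamma, rng):
    """One Euler-Maruyama-style reverse step for a VE diffusion, sigma(t) = t.

    Uses the standard ODE-SDE interpolation: gamma = 0 recovers the
    probability-flow ODE step, gamma = 1 the full reverse SDE, and
    intermediate values trade deterministic speed against stochastic
    error correction. A per-step learnable gamma plays the role of
    AdaSDE's distilled coefficient.
    """
    dt = t_next - t_cur                       # negative: noise -> data
    score = score_fn(x, t_cur)                # approximates grad log p_t(x)
    drift = -(1.0 + gamma) * t_cur * score    # ODE drift + gamma-scaled correction
    diffusion = np.sqrt(2.0 * gamma * t_cur * abs(dt))  # matched noise scale
    return x + drift * dt + diffusion * rng.standard_normal(x.shape)

# sanity check: gamma = 0 with a zero score is a drift-free ODE step
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
x_next = adasde_step(x, 1.0, 0.8, lambda x, t: np.zeros_like(x), 0.0, rng)
```

In a learned setting, one `gamma` per sampling step would be fit by the lightweight distillation the abstract describes, rather than fixed by hand.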

IROS Conference 2025 Conference Paper

L2COcc: Lightweight Camera-Centric Semantic Scene Completion via Distillation of LiDAR Model

  • Ruoyu Wang
  • Yukai Ma
  • Yi Yao
  • Sheng Tao
  • Haoang Li
  • Zongzhi Zhu
  • Yong Liu 0007
  • Xingxing Zuo 0001

Semantic Scene Completion (SSC) constitutes a pivotal element in autonomous driving perception systems, tasked with inferring the 3D semantic occupancy of a scene from sensory data. To improve accuracy, prior research has implemented various computationally demanding and memory-intensive 3D operations, imposing significant computational requirements on the platform during training and testing. This paper proposes L2COcc, a lightweight camera-centric SSC framework that also accommodates LiDAR inputs. With our proposed efficient voxel transformer (EVT) and cross-modal knowledge distillation modules, including feature similarity distillation (FSD), TPV distillation (TPVD), and prediction alignment distillation (PAD), our method substantially reduces the computational burden while maintaining high accuracy. The experimental evaluations demonstrate that our proposed method surpasses current state-of-the-art vision-based SSC methods in accuracy on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks. Additionally, our method is more lightweight, reducing both memory consumption and inference time by over 23% compared to current state-of-the-art methods. Code is available at our project page: https://studyingfufu.github.io/L2COcc/.
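A generic sketch of the feature-level distillation idea, where a camera student is pulled toward a LiDAR teacher's intermediate features. This is a cosine-similarity loss commonly used for such distillation, assumed for illustration; the paper's exact FSD formulation may differ.

```python
import numpy as np

def feature_similarity_distillation(student_feat, teacher_feat, eps=1e-8):
    """Cosine-similarity distillation between (B, C, H, W) feature maps.

    Each spatial location's channel vector is normalized to unit length,
    and the loss is 1 - cosine similarity averaged over batch and space,
    so the student matches the teacher's feature directions rather than
    raw magnitudes.
    """
    b, c = student_feat.shape[:2]
    s = student_feat.reshape(b, c, -1)
    t = teacher_feat.reshape(b, c, -1)
    s = s / (np.linalg.norm(s, axis=1, keepdims=True) + eps)
    t = t / (np.linalg.norm(t, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

# identical features give (near-)zero loss; mismatched features do not
f = np.random.default_rng(1).random((2, 8, 4, 4))
loss_same = feature_similarity_distillation(f, f)
```

In training, a loss like this would be added alongside the task loss, with the LiDAR branch frozen as the teacher.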

IJCAI Conference 2025 Conference Paper

QA-MDT: Quality-aware Masked Diffusion Transformer for Enhanced Music Generation

  • Chang Li
  • Ruoyu Wang
  • Lijuan Liu
  • Jun Du
  • Yixuan Sun
  • Zilu Guo
  • Zhengrong Zhang
  • Yuan Jiang

Text-to-music (TTM) generation, which converts textual descriptions into audio, opens up innovative avenues for multimedia creation. Achieving high quality and diversity in this process demands extensive, high-quality data, which are often scarce in available datasets. Most open-source datasets frequently suffer from issues like low-quality waveforms and low text-audio consistency, hindering the advancement of music generation models. To address these challenges, we propose a novel quality-aware training paradigm for generating high-quality, high-musicality music from large-scale, quality-imbalanced datasets. Additionally, by leveraging unique properties in the latent space of musical signals, we adapt and implement a masked diffusion transformer (MDT) model for the TTM task, showcasing its capacity for quality control and enhanced musicality. Furthermore, we introduce a three-stage caption refinement approach to address the issue of low-quality captions. Experiments show state-of-the-art (SOTA) performance on benchmark datasets including MusicCaps and the Song-Describer Dataset with both objective and subjective metrics. Demo audio samples are available at https://qa-mdt.github.io/; code and pretrained checkpoints are open-sourced at https://github.com/ivcylc/OpenMusic.

AAAI Conference 2025 Conference Paper

ScamNet: Toward Explainable Large Language Model-Based Fraudulent Shopping Website Detection

  • Marzieh Bitaab
  • Alireza Karimi
  • Zhuoer Lyu
  • Ahmadreza Mosallanezhad
  • Adam Oest
  • Ruoyu Wang
  • Tiffany Bao
  • Yan Shoshitaishvili

Fraudulent shopping websites pose a significant threat to online consumers and legitimate businesses: in 2023, victims of such scams reported $392 million in losses to the Federal Trade Commission. This alarming trend not only impacts individuals but also erodes societal trust in e-commerce, necessitating urgent countermeasures. While previous studies have attempted to identify these fraudulent websites at scale, they face limitations such as potential bias in data collection, overreliance on easily manipulated features, and a lack of explainable results. This study explores the potential of Large Language Models (LLMs) in identifying fraudulent shopping websites, revealing that current LLMs underperform compared to existing machine learning models. To address this, we propose ScamNet, a fine-tuned LLM for explainable fraudulent shopping website detection. Our experimental results on real-world datasets demonstrate a substantial improvement in detection performance, raising the detection rate from 22.35% to 95.59%, particularly in identifying subtle deceptive tactics such as the use of a legitimate-looking website template. ScamNet offers interpretable insights into its decision-making process, enhancing transparency and overcoming a key limitation of previous approaches.

JMLR Journal 2023 Journal Article

Distributed Nonparametric Regression Imputation for Missing Response Problems with Large-scale Data

  • Ruoyu Wang
  • Miaomiao Su
  • Qihua Wang

Nonparametric regression imputation is commonly used in missing data analysis. However, it suffers from the curse of dimensionality. The problem can be alleviated by the explosive sample sizes of the big data era, but the large-scale data size presents challenges in data storage and in the calculation of estimators. These challenges make classical nonparametric regression imputation methods no longer applicable. This motivates us to develop two distributed nonparametric regression imputation methods: one based on kernel smoothing and the other on the sieve method. The kernel-based distributed imputation method has extremely low communication cost, and the sieve-based distributed imputation method can accommodate more local machines. Response mean estimation is considered to illustrate the proposed imputation methods. Two distributed nonparametric regression imputation estimators are proposed for the response mean, which are proved to be asymptotically normal with asymptotic variances achieving the semiparametric efficiency bound. The proposed methods are evaluated through simulation studies and illustrated in a real data analysis.
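The kernel-based scheme can be sketched as follows: each machine fits a Nadaraya-Watson regression on its locally observed pairs, imputes its own missing responses, and ships a single scalar (its local mean) to the coordinator. This is a schematic of kernel-regression imputation under a toy setup, not the paper's exact estimator or bandwidth choice.

```python
import numpy as np

def local_nw_impute_mean(x, y, observed, h):
    """On one machine: fill missing responses with a Nadaraya-Watson fit on
    the locally observed (x, y) pairs, then return the local mean over
    observed plus imputed responses."""
    xo, yo = x[observed], y[observed]

    def nw(x0):  # Gaussian-kernel regression estimate at x0
        w = np.exp(-0.5 * ((x0 - xo) / h) ** 2)
        return np.sum(w * yo) / np.sum(w)

    imputed = np.array([nw(xi) for xi in x])
    return np.where(observed, y, imputed).mean()

def distributed_mean_estimate(machines, h=0.2):
    """Average the local estimates. Each machine communicates one scalar,
    which is the source of the extremely low communication cost."""
    return float(np.mean([local_nw_impute_mean(x, y, obs, h)
                          for x, y, obs in machines]))

# toy check: y = x on [0, 1], with roughly 40% of responses missing per machine
rng = np.random.default_rng(0)
machines = []
for _ in range(5):
    x = rng.uniform(0.0, 1.0, 400)
    obs = rng.uniform(size=400) > 0.4
    y = np.where(obs, x, np.nan)      # missing responses stored as NaN
    machines.append((x, y, obs))
est = distributed_mean_estimate(machines)
```

Under missing-at-random, the averaged estimate recovers the response mean (here 0.5) up to kernel bias and sampling noise; the paper's estimators additionally attain the semiparametric efficiency bound, which this sketch does not claim.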

NeurIPS Conference 2022 Conference Paper

Characterization of Excess Risk for Locally Strongly Convex Population Risk

  • Mingyang Yi
  • Ruoyu Wang
  • Zhi-Ming Ma

We establish upper bounds for the expected excess risk of models trained by proper iterative algorithms which approximate local minima. Unlike results built upon global strong convexity or global growth conditions (e.g., the PL-inequality), we only require the population risk to be \emph{locally} strongly convex around its local minima. Concretely, our bound for convex problems is of order $\tilde{\mathcal{O}}(1/n)$. For non-convex problems with $d$ model parameters such that $d/n$ is smaller than a threshold independent of $n$, the order of $\tilde{\mathcal{O}}(1/n)$ can be maintained if the empirical risk has no spurious local minima with high probability. Moreover, the bound for non-convex problems becomes $\tilde{\mathcal{O}}(1/\sqrt{n})$ without such an assumption. Our results are derived via algorithmic stability and a characterization of the empirical risk's landscape. Compared with existing algorithmic-stability-based results, our bounds are dimension-insensitive and impose no restrictions on the algorithm's implementation, learning rate, or number of iterations. Our bounds underscore that with a locally strongly convex population risk, models trained by any proper iterative algorithm can generalize well, even for non-convex problems with large $d$.