Arrow Research

Author name cluster

Ruihui Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

AAAI Conference 2026 Conference Paper

LiNeXt: Revisiting LiDAR Completion with Efficient Non-Diffusion Architectures

  • Wenzhe He
  • Xiaojun Chen
  • Ruiqi Wang
  • Ruihui Li
  • Huilong Pi
  • Jiapeng Zhang
  • Zhuo Tang
  • Kenli Li

3D LiDAR scene completion from point clouds is a fundamental component of perception systems in autonomous vehicles. Previous methods have predominantly employed diffusion models for high-fidelity reconstruction. However, their multi-step iterative sampling incurs significant computational overhead, limiting their real-time applicability. To address this, we propose LiNeXt: a lightweight, non-diffusion network optimized for rapid and accurate point cloud completion. Specifically, LiNeXt first applies the Noise-to-Coarse (N2C) Module to denoise the input noisy point cloud in a single pass, thereby obviating the multi-step iterative sampling of diffusion-based methods. The Refine Module then takes the coarse point cloud and its intermediate features from the N2C Module to perform more precise refinement, further enhancing structural completeness. Furthermore, we observe that LiDAR point clouds exhibit a distance-dependent spatial distribution: they are densely sampled at proximal ranges and sparsely sampled at distal ranges. Accordingly, we propose the Distance-aware Selected Repeat strategy to generate a more uniformly distributed noisy point cloud. On the SemanticKITTI dataset, LiNeXt achieves a 199.8-times speedup in inference, reduces Chamfer Distance by 50.7 percent, and uses only 6.1 percent of the parameters compared with LiDiff. These results demonstrate the superior efficiency and effectiveness of LiNeXt for real-time scene completion.
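
The abstract does not spell out how the Distance-aware Selected Repeat strategy chooses repeat counts; the sketch below is one plausible reading, in which points at greater range (which LiDAR samples sparsely) are duplicated more often and then jittered into a noisy input. The parameters max_repeats and noise_std are illustrative, not values from the paper.

```python
import numpy as np

def distance_aware_selected_repeat(points, max_repeats=4, noise_std=0.05, seed=0):
    """Repeat far-range (sparsely sampled) points more often, then jitter,
    to obtain a more uniformly distributed noisy point cloud.
    max_repeats and noise_std are illustrative, not from the paper."""
    rng = np.random.default_rng(seed)
    dist = np.linalg.norm(points, axis=1)                 # range of each point
    # Repeat count grows linearly with normalized range: 1 .. max_repeats.
    reps = 1 + np.floor((max_repeats - 1) * dist / dist.max()).astype(int)
    repeated = np.repeat(points, reps, axis=0)
    return repeated + rng.normal(0.0, noise_std, repeated.shape)

# Toy cloud whose sampling density falls off with range, as in LiDAR scans.
pts = np.random.default_rng(1).normal(size=(1000, 3)) * np.linspace(1, 50, 1000)[:, None]
print(distance_aware_selected_repeat(pts).shape)
```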

AAAI Conference 2026 Conference Paper

PointSLAM++: Robust Dense Neural Gaussian Point Cloud-based SLAM

  • Xu Wang
  • Boyao Han
  • Xiaojun Chen
  • Ying Liu
  • Ruihui Li

Real-time 3D reconstruction is crucial for robotics and augmented reality, yet current simultaneous localization and mapping (SLAM) approaches often struggle to maintain structural consistency and robust pose estimation in the presence of depth noise. This work introduces PointSLAM++, a novel RGB-D SLAM system that leverages a hierarchically constrained neural Gaussian representation to preserve structural relationships while generating Gaussian primitives for scene mapping. It also employs progressive pose optimization to mitigate depth sensor noise, significantly enhancing localization accuracy. Furthermore, it utilizes a dynamic neural representation graph that adjusts the distribution of Gaussian nodes based on local geometric complexity, enabling the map to adapt to intricate scene details in real time. This combination yields high-precision 3D mapping and photorealistic scene rendering. Experimental results show PointSLAM++ outperforms existing 3DGS-based SLAM methods in reconstruction accuracy and rendering quality, demonstrating its advantages for large-scale AR and robotics.
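
As a rough illustration of the complexity-driven density idea (not PointSLAM++'s actual criterion, which the abstract does not specify), the sketch below scores local geometric complexity via the surface-variation eigenvalue ratio of each point's k-NN covariance and splits a Gaussian-node budget proportionally:

```python
import numpy as np

def node_budget_by_complexity(points, k=8, total_nodes=200):
    """Score local geometric complexity as the surface-variation ratio
    (smallest eigenvalue over the eigenvalue sum) of each point's k-NN
    covariance, then allocate a Gaussian-node budget proportionally.
    Purely illustrative; not the paper's criterion."""
    n = len(points)
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)  # brute-force k-NN
    idx = np.argsort(d2, axis=1)[:, 1:k + 1]
    complexity = np.empty(n)
    for i in range(n):
        nb = points[idx[i]] - points[idx[i]].mean(axis=0)
        w = np.linalg.eigvalsh(nb.T @ nb)                 # ascending eigenvalues
        complexity[i] = w[0] / max(w.sum(), 1e-12)        # ~0 on planes, larger on edges
    return np.round(complexity / complexity.sum() * total_nodes).astype(int)

pts = np.random.default_rng(0).normal(size=(60, 3))       # toy cloud
budget = node_budget_by_complexity(pts)
print(budget.sum(), budget.max())
```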

NeurIPS Conference 2025 Conference Paper

Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance

  • Meng Wang
  • Fan Wu
  • Ruihui Li
  • Yunchuan Qin
  • Zhuo Tang
  • Kenli Li

3D Semantic Scene Completion (SSC) provides comprehensive scene geometry and semantics for autonomous driving perception, which is crucial for enabling accurate and reliable decision-making. However, existing SSC methods are limited to capturing sparse information from the current frame or naively stacking multi-frame temporal features, thereby failing to acquire effective scene context. These approaches ignore critical motion dynamics and struggle to achieve temporal consistency. To address the above challenges, we propose a novel temporal SSC method, FlowScene: Learning Temporal 3D Semantic Scene Completion via Optical Flow Guidance. By leveraging optical flow, FlowScene can integrate motion, different viewpoints, occlusions, and other contextual cues, thereby significantly improving the accuracy of 3D scene completion. Specifically, our framework introduces two key components: (1) a Flow-Guided Temporal Aggregation module that aligns and aggregates temporal features using optical flow, capturing motion-aware context and deformable structures; and (2) an Occlusion-Guided Voxel Refinement module that injects occlusion masks and temporally aggregated features into 3D voxel space, adaptively refining voxel representations for explicit geometric modeling. Experimental results demonstrate that FlowScene achieves state-of-the-art performance, with mIoU scores of 17.70 and 20.81 on the SemanticKITTI and SSCBench-KITTI-360 benchmarks, respectively.
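
The core of flow-guided temporal aggregation is warping previous-frame features into the current frame along the optical flow before fusing them. Below is a minimal PyTorch sketch of that warp; the paper's module additionally learns deformable offsets and aggregation weights, which are omitted here, and the naive average fusion at the end is purely illustrative.

```python
import torch
import torch.nn.functional as F

def flow_warp(prev_feat, flow):
    """Warp previous-frame features into the current frame with a dense
    optical-flow field. prev_feat: (B, C, H, W); flow: (B, 2, H, W) in pixels."""
    b, _, h, w = prev_feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(prev_feat)      # (2, H, W), (x, y)
    coords = grid[None] + flow                                     # follow the flow
    # Normalize pixel coordinates to [-1, 1] for grid_sample's convention.
    coords[:, 0] = 2 * coords[:, 0] / (w - 1) - 1
    coords[:, 1] = 2 * coords[:, 1] / (h - 1) - 1
    return F.grid_sample(prev_feat, coords.permute(0, 2, 3, 1), align_corners=True)

cur, prev = torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32)
flow = torch.zeros(1, 2, 32, 32)                                   # identity flow
agg = 0.5 * (cur + flow_warp(prev, flow))                          # naive average fusion
print(agg.shape)  # torch.Size([1, 8, 32, 32])
```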

IROS Conference 2025 Conference Paper

SA-MVSNet: Spatial-aware Multi-view Stereo Network with Attention Cost Volume

  • Haoran Kong
  • Fanzi Zeng
  • Longbao Dai
  • Jingyang Hu
  • Jiang-hao Cai
  • Jianxia Chen
  • Ruihui Li
  • Hongbo Jiang 0001

Deep learning-based multi-view stereo (MVS) methods enable dense point cloud reconstruction in texture-rich areas. However, existing methods incur significant computational costs to capture pixel dependencies for complete reconstruction in low-texture regions. Additionally, discrete depth layers in occluded environments hinder the cost volume’s ability to model object information effectively. To address these issues, we propose a spatial-aware multi-view stereo network with attention cost volume, termed SA-MVSNet. The network introduces the pixel-driven spatial interaction (PDSI) module, which integrates the hierarchical spatial location enhancement mechanism (HSLE) and the spatial context aggregation mechanism (SCA). Leveraging an efficient parallel architecture, the PDSI module captures pixel-level spatial dependencies with the HSLE and strengthens global contextual information through the SCA. This design improves the network’s ability to represent features in low-texture regions while maintaining high inference efficiency. Furthermore, SA-MVSNet incorporates an attention weight generation branch that refines the cost volume by aggregating multi-scale depth cues, effectively mitigating the impact of occlusion. Experiments on the DTU dataset and the Tanks and Temples dataset show that our method outperforms other learning-based methods, achieving superior performance and strong generalization ability.
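
As a hedged sketch of the attention-weight-generation idea (the layer sizes and single-scale design here are assumptions; the paper aggregates multi-scale depth cues), a small 3D-convolutional branch can predict one logit per depth hypothesis and reweight the cost volume with a softmax over the depth dimension:

```python
import torch
import torch.nn as nn

class AttentionCostVolume(nn.Module):
    """Illustrative sketch, not the paper's architecture: a small branch
    predicts per-depth attention weights from the cost volume itself and
    reweights it, emphasizing depth hypotheses consistent across views
    and suppressing occluded ones."""
    def __init__(self, channels=8):
        super().__init__()
        self.weight_branch = nn.Sequential(
            nn.Conv3d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(channels, 1, 1),                 # one logit per depth bin
        )

    def forward(self, cost):                           # cost: (B, C, D, H, W)
        logits = self.weight_branch(cost)              # (B, 1, D, H, W)
        attn = torch.softmax(logits, dim=2)            # normalize over depth bins
        return cost * attn                             # refined cost volume

vol = torch.randn(1, 8, 48, 32, 40)                    # B, C, depth bins, H, W
print(AttentionCostVolume()(vol).shape)
```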

AAAI Conference 2025 Conference Paper

VLScene: Vision-Language Guidance Distillation for Camera-Based 3D Semantic Scene Completion

  • Meng Wang
  • Huilong Pi
  • Ruihui Li
  • Yunchuan Qin
  • Zhuo Tang
  • Kenli Li

Camera-based 3D semantic scene completion (SSC) provides dense geometric and semantic perception for autonomous driving. However, images provide limited information, making the model susceptible to geometric ambiguity caused by occlusion and perspective distortion. Existing methods often lack explicit semantic modeling between objects, limiting their perception of 3D semantic context. To address these challenges, we propose a novel method, VLScene: Vision-Language Guidance Distillation for Camera-based 3D Semantic Scene Completion. The key insight is to use a vision-language model to introduce high-level semantic priors that provide the object spatial context required for 3D scene understanding. Specifically, we design a vision-language guidance distillation process to enhance image features, which can effectively capture semantic knowledge from the surrounding environment and improve spatial context reasoning. In addition, we introduce a geometric-semantic sparse awareness mechanism to propagate geometric structures in the neighborhood and enhance semantic information through contextual sparse interactions. Experimental results demonstrate that VLScene achieves rank-1 performance on the challenging SemanticKITTI and SSCBench-KITTI-360 benchmarks, yielding remarkable mIoU scores of 17.52 and 19.10, respectively.
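
The abstract does not give the exact distillation objective; a common recipe, sketched below, projects the camera backbone's features into the frozen vision-language embedding space and maximizes cosine similarity with detached CLIP features. The projection size and loss form are assumptions, not VLScene's published loss.

```python
import torch
import torch.nn.functional as F

def vl_distill_loss(student_feat, clip_feat, proj):
    """Project student features into the teacher's embedding space and
    maximize cosine similarity with frozen (detached) CLIP features.
    A common distillation recipe; not necessarily VLScene's exact loss."""
    s = F.normalize(proj(student_feat), dim=-1)
    t = F.normalize(clip_feat.detach(), dim=-1)        # teacher stays frozen
    return (1.0 - (s * t).sum(-1)).mean()              # 1 - cosine similarity

# Toy shapes: 4 tokens, 256-d camera-backbone features, 512-d CLIP space.
proj = torch.nn.Linear(256, 512)
loss = vl_distill_loss(torch.randn(4, 256), torch.randn(4, 512), proj)
loss.backward()
print(float(loss))
```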

ECAI Conference 2024 Conference Paper

MC-SORT: A Motion Correction-Based Framework for Long-Term Multiple Object Tracking

  • Xiangyu Li
  • Yunchuan Qin
  • Ruihui Li
  • Guanghua Tan
  • Zhuo Tang
  • Kenli Li 0001

Long-term occlusion is one of the most formidable challenges in Multi-Object Tracking (MOT). The motion models of existing SORT-based trackers are unreliable in estimating the motion states of long-term occluded targets. This is mainly because, as the occlusion period grows, the estimation errors of the motion model accumulate increasingly quickly. In practice, we find that the tracker's error during long-term occlusion is concentrated in the motion model's estimate of the occluded target's velocity. In this work, we demonstrate that appropriately correcting the motion model's velocity estimates during long-term occlusion, and fully exploiting the temporal and attribute information of a target's historical trajectory as association cues, both improve tracker robustness under long-term occlusion. We refer to our proposed motion correction-based framework as MC-SORT, which mainly consists of a Momentum Compensation Module (MCM) and a Backtracking Re-association (BRA) module. The former corrects the estimated motion state of a target during long-term occlusion; the latter uses the temporal and attribute information of the target's historical trajectory as association cues to measure the degree of correlation between a target and a trajectory. MC-SORT is simple, online, real-time, and plug-and-play, and particularly improves the robustness of the tracker under long-term occlusion. Extensive experimental results on the MOT17 and MOT20 datasets demonstrate the robustness and superiority of our framework.
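
One simple way to realize the velocity-correction idea (a hypothetical sketch, not the paper's MCM formula) is to damp the velocity components of a SORT-style Kalman state geometrically with the length of the occlusion, so the extrapolated position drifts less the longer the target stays hidden:

```python
import numpy as np

def compensate_velocity(state, frames_occluded, decay=0.95):
    """Hypothetical momentum-compensation sketch; `decay` is an illustrative
    hyperparameter, not from the paper. A constant-velocity Kalman filter
    keeps extrapolating with a stale velocity during occlusion, so shrinking
    the velocity terms toward zero as the gap lengthens limits the drift.
    state: [x, y, w, h, vx, vy, vw, vh] as in SORT-style trackers."""
    corrected = state.copy()
    corrected[4:] *= decay ** frames_occluded          # damp velocity components
    return corrected

track = np.array([100., 50., 30., 60., 4., -2., 0., 0.])
print(compensate_velocity(track, frames_occluded=20)[4:6])  # damped vx, vy
```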

ICLR Conference 2023 Conference Paper

ISS: Image as Stepping Stone for Text-Guided 3D Shape Generation

  • Zhengzhe Liu
  • Peng Dai 0003
  • Ruihui Li
  • Xiaojuan Qi 0001
  • Chi-Wing Fu

Text-guided 3D shape generation remains challenging due to the absence of a large paired text-shape dataset, the substantial semantic gap between the two modalities, and the structural complexity of 3D shapes. This paper presents a new framework called Image as Stepping Stone (ISS) for the task, introducing 2D images as a stepping stone to connect the two modalities and to eliminate the need for paired text-shape data. Our key contribution is a two-stage feature-space-alignment approach that maps CLIP features to shapes by harnessing a pre-trained single-view reconstruction (SVR) model with multi-view supervision: first, map the CLIP image feature to the detail-rich shape space of the SVR model; then, map the CLIP text feature to the shape space and optimize the mapping by encouraging CLIP consistency between the input text and the rendered images. Further, we formulate a text-guided shape stylization module to dress up the output shapes with novel structures and textures. Beyond existing works on 3D shape generation from text, our new approach is general for creating shapes in a broad range of categories, without requiring paired text-shape data. Experimental results show that our approach outperforms state-of-the-art methods and our baselines in terms of fidelity and consistency with text. Further, our approach can stylize the generated shapes with both realistic and fantasy structures and textures. Code is available at https://github.com/liuzhengzhe/ISS-Image-as-Stepping-Stone-for-Text-Guided-3D-Shape-Generation.
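
A skeleton of the two-stage feature-space alignment might look as follows; every module here (mapper, svr_decoder, render, clip_img_encode) is a toy stand-in with made-up dimensions, chosen only to show how stage 1 aligns CLIP image features with the SVR shape space and stage 2 reuses the same mapper for text with a CLIP-consistency objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins: dimensions and modules are made up for illustration.
mapper = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 256))
svr_decoder = nn.Linear(256, 64)        # stands in for the SVR shape decoder
render = nn.Linear(64, 128)             # stands in for differentiable rendering
clip_img_encode = nn.Linear(128, 512)   # stands in for CLIP's image encoder

def stage1_loss(clip_img_feat, svr_shape_feat):
    # Stage 1: align mapped CLIP *image* features with the SVR shape space.
    return F.mse_loss(mapper(clip_img_feat), svr_shape_feat)

def stage2_loss(clip_txt_feat):
    # Stage 2: push CLIP *text* features through the same mapper, decode and
    # render a shape, and score text/render agreement in CLIP space.
    img_feat = clip_img_encode(render(svr_decoder(mapper(clip_txt_feat))))
    return (1 - F.cosine_similarity(img_feat, clip_txt_feat, dim=-1)).mean()

print(float(stage1_loss(torch.randn(2, 512), torch.randn(2, 256))))
print(float(stage2_loss(torch.randn(2, 512))))
```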