Arrow Research search

Author name cluster

Jiahui Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

AAAI 2026 · Conference Paper

EPSegFZ: Efficient Point Cloud Semantic Segmentation for Few- and Zero-Shot Scenarios with Language Guidance

  • Jiahui Wang
  • Haiyue Zhu
  • Haoren Guo
  • Abdullah Al Mamun
  • Cheng Xiang
  • Tong Heng Lee

Recent approaches to few-shot 3D point cloud semantic segmentation typically require a two-stage learning process: a pre-training stage followed by a few-shot training stage. While effective, these methods rely heavily on pre-training, which limits model flexibility and adaptability. Some models avoid pre-training but then fail to capture sufficient information. In addition, current approaches focus on the visual information in the support set and neglect, or do not fully exploit, other useful data such as textual annotations. This incomplete use of support information impairs model performance and restricts zero-shot ability. To address these limitations, we present a novel pre-training-free network for Efficient Point Cloud Semantic Segmentation in Few- and Zero-shot scenarios (EPSegFZ). EPSegFZ incorporates three key components: a Prototype-Enhanced Registers Attention (ProERA) module and a Dual Relative Positional Encoding (DRPE)-based cross-attention mechanism, which improve feature extraction and build accurate query-prototype correspondences without pre-training, and a Language-Guided Prototype Embedding (LGPE) module, which leverages textual information from the support set to improve few-shot performance and enable zero-shot inference. Extensive experiments show that our method outperforms the state-of-the-art method by 5.68% and 3.82% on the S3DIS and ScanNet benchmarks, respectively.
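To make the prototype-based design concrete, here is a minimal hypothetical sketch (not the authors' code) of language-guided prototype classification: class-name text embeddings are fused with visual support prototypes, and each query point takes the label of its most similar prototype. The fusion weight `alpha`, all feature dimensions, and the random stand-in features are illustrative assumptions.

```python
# Minimal sketch of language-guided prototype segmentation (illustrative only).
import torch
import torch.nn.functional as F

def fuse_prototypes(visual_protos, text_embeds, alpha=0.5):
    """Blend visual prototypes with text embeddings (hypothetical fusion rule).
    Setting alpha=0 would use text alone, i.e., a zero-shot-style prototype."""
    vis = F.normalize(visual_protos, dim=-1)
    txt = F.normalize(text_embeds, dim=-1)
    return F.normalize(alpha * vis + (1 - alpha) * txt, dim=-1)

def segment_query(query_feats, prototypes):
    """Label each query point by its most similar prototype (cosine similarity)."""
    q = F.normalize(query_feats, dim=-1)   # (N, D)
    sims = q @ prototypes.t()              # (N, C)
    return sims.argmax(dim=-1)             # (N,) class ids

# Toy usage: 3 classes, 128-d features, 1000 query points (random stand-ins).
visual_protos = torch.randn(3, 128)   # from the support set
text_embeds = torch.randn(3, 128)     # from a text encoder over class names
query_feats = torch.randn(1000, 128)

protos = fuse_prototypes(visual_protos, text_embeds)
labels = segment_query(query_feats, protos)
print(labels.shape)  # torch.Size([1000])
```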

NeurIPS 2025 · Conference Paper

DynaNav: Dynamic Feature and Layer Selection for Efficient Visual Navigation

  • Jiahui Wang
  • Changhao Chen

Visual navigation is essential for robotics and embodied AI. However, existing foundation models, particularly those with transformer decoders, suffer from high computational overhead and lack interpretability, limiting their deployment on edge devices. To address this, we propose DynaNav, a Dynamic Visual Navigation framework that adapts feature and layer selection to scene complexity. It employs a trainable hard feature selector for sparse operations, enhancing efficiency and interpretability. Additionally, we integrate feature selection into an early-exit mechanism, with Bayesian Optimization determining optimal exit thresholds to reduce computational cost. Extensive experiments on real-world datasets and in simulated environments demonstrate the effectiveness of DynaNav. Compared to ViNT, DynaNav achieves a $2.6\times$ reduction in FLOPs, 42.3% lower inference time, and 32.8% lower memory usage while improving navigation performance across four public datasets.
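The early-exit idea can be illustrated with a hypothetical sketch (not the DynaNav implementation): the network exits after any layer whose exit head clears a per-layer confidence threshold. In the paper the thresholds are tuned offline via Bayesian Optimization; the values and shapes below are made up.

```python
# Generic early-exit forward pass (illustrative, not DynaNav's architecture).
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Toy encoder that can exit after any layer when its exit head's
    softmax confidence clears that layer's threshold."""
    def __init__(self, dim=64, n_layers=4, n_actions=4):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))
        self.exit_heads = nn.ModuleList(nn.Linear(dim, n_actions) for _ in range(n_layers))

    def forward(self, x, thresholds):
        for layer, head, tau in zip(self.layers, self.exit_heads, thresholds):
            x = torch.relu(layer(x))
            logits = head(x)
            conf = logits.softmax(-1).max(-1).values.mean()
            if conf >= tau:            # easy scene: stop computing here
                return logits, conf
        return logits, conf            # hard scene: fall through to the last layer

model = EarlyExitEncoder()
obs = torch.randn(1, 64)
thresholds = [0.9, 0.8, 0.7, 0.0]  # e.g., tuned offline by Bayesian Optimization
logits, conf = model(obs, thresholds)
```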

AAAI 2025 · Conference Paper

Enhancing Multivariate Time-Series Domain Adaptation via Contrastive Frequency Graph Discovery and Language-Guided Adversary Alignment

  • Haoren Guo
  • Haiyue Zhu
  • Jiahui Wang
  • Prahlad Vadakkepat
  • Weng Khuen Ho
  • Tong Heng Lee

Unsupervised domain adaptation (UDA) reduces reliance on labeled data by aligning features between a labeled source domain and an unlabeled target domain, which makes it well suited to multivariate time series (MTS) prediction. However, most MTS UDA methods focus solely on aligning intra-series temporal features and overlook the valuable information in inter-series dependencies. Research has highlighted that analyzing decomposed frequency dependencies in time series can reveal significant trends, noise patterns, and intricate temporal details. To exploit these unexplored frequency dependencies, we introduce the Frequency Graph Discovery Module (FGD), which uncovers and aligns shared frequency information and correlations across domains. Additionally, we propose a Frequency-Contextual Contrastive Learning (FCCL) framework to better capture and align frequency-contextual representations in multivariate time series, ensuring the extraction of label-invariant information for prediction. Furthermore, since existing models overlook the valuable and abundant information outside the source and target datasets, we enhance the MTS UDA prediction model with a Language-guided Adversary Alignment (LAA) module, which leverages the capabilities of Large Language Models (LLMs) to obtain text-encoded label embeddings and align the classification features, thereby improving prediction accuracy. Our model achieves state-of-the-art results on three public multivariate time-series datasets for unsupervised domain adaptation.
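A minimal sketch of the frequency-dependency idea, under the assumption that per-channel magnitude spectra are a reasonable stand-in for the paper's decomposed frequency features: each channel's FFT spectrum becomes a feature, and cosine similarity between spectra defines a cross-channel graph. This is a generic illustration, not the FGD module itself.

```python
# Generic cross-channel frequency graph (illustrative stand-in for FGD).
import numpy as np

def frequency_graph(x):
    """Per-channel magnitude spectra and a channel-to-channel similarity graph.
    x: (channels, timesteps) multivariate series."""
    spec = np.abs(np.fft.rfft(x, axis=1))                            # magnitude spectrum
    spec = spec / (np.linalg.norm(spec, axis=1, keepdims=True) + 1e-8)
    adj = spec @ spec.T                                              # cosine similarity of spectra
    np.fill_diagonal(adj, 0.0)                                       # no self-loops
    return spec, adj

x = np.random.randn(6, 256)      # toy data: 6 sensor channels, 256 timesteps
spec, adj = frequency_graph(x)
print(adj.shape)                 # (6, 6) inter-series frequency dependency graph
```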

NeurIPS 2025 · Conference Paper

Focus-Then-Reuse: Fast Adaptation in Visual Perturbation Environments

  • Jiahui Wang
  • Chao Chen
  • Jiacheng Xu
  • Zongzhang Zhang
  • Yang Yu

Visual reinforcement learning has shown promise in various real-world applications. However, deploying policies in complex real-world environments with visual perturbations remains a significant challenge. We notice that humans tend to filter information at the object level prior to decision-making, facilitating efficient skill transfer across different contexts. Inspired by this, we introduce Focus-Then-Reuse (FTR), a method utilizing a novel object selection mechanism to focus on task-relevant objects, and directly reuse the simulation-trained policy on them. The training of the object selection mechanism integrates prior knowledge from a vision-language model and feedback from the environment. Experimental results on challenging tasks based on DeepMind Control Suite and Franka Emika Robotics demonstrate that FTR enables rapid adaptation in visual perturbation environments and achieves state-of-the-art performance. The source code is available at https://github.com/LAMDA-RL/FTR.
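The focus-then-reuse loop can be sketched hypothetically: given candidate object masks and per-object relevance scores (in the paper these come from a vision-language prior plus environment feedback), keep the top-k objects, mask out everything else, and feed the focused observation to the frozen simulation-trained policy. All shapes, the placeholder policy, and the random masks below are assumptions.

```python
# Illustrative focus-then-reuse step (not the FTR implementation).
import numpy as np

def focus_then_reuse(obs, object_masks, relevance, policy, k=2):
    """Keep the k most task-relevant objects, mask distractors,
    then reuse the frozen simulation-trained policy unchanged."""
    keep = np.argsort(relevance)[-k:]                 # indices of selected objects
    focus_mask = np.any(object_masks[keep], axis=0)   # union of kept object masks
    focused_obs = obs * focus_mask[..., None]         # zero out distractor pixels
    return policy(focused_obs)

# Toy usage with stand-ins: 64x64 RGB observation, 4 candidate objects.
obs = np.random.rand(64, 64, 3)
object_masks = np.random.rand(4, 64, 64) > 0.8       # e.g., from a segmenter
relevance = np.array([0.1, 0.9, 0.3, 0.7])           # e.g., VLM prior + env feedback
policy = lambda o: o.mean()                          # placeholder frozen policy
action = focus_then_reuse(obs, object_masks, relevance, policy)
```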

IROS 2025 · Conference Paper

OpenMIGS: Multi-granularity Information-preserving Open-Vocabulary 3D Gaussian Splatting

  • Jingyu Zhao
  • Jiahui Wang
  • Yinan Deng
  • Yufeng Yue

Open-vocabulary scene understanding is critical for robotics, yet existing 3D Gaussian Splatting (3DGS) methods rely on compressed feature embeddings, compromising semantic fidelity and fine-grained interpretation. Although uncompressed high-dimensional features offer a potential solution, their direct integration imposes prohibitive memory and computational costs. To address this challenge, we propose OpenMIGS, a novel 3DGS-based framework for multi-granularity, information-preserving open-vocabulary understanding across both object and part levels. Specifically, OpenMIGS first constructs object-level Gaussian fields as structured carriers, where a two-stage clustering strategy ensures global consistency in object labeling and a codebook subsequently associates these object labels with their uncompressed high-dimensional features. Building on this, a lightweight implicit field processes the geometric coordinates of object Gaussians to regress part-level high-dimensional features, enabling multi-granularity understanding. Experimental results on multiple datasets show that OpenMIGS outperforms existing methods in open-vocabulary understanding and retrieval tasks. It also supports multi-granularity scene editing for flexible semantic manipulation. The code is available at https://github.com/jingyuzhao1010/OpenMIGS.
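The codebook idea, and why it saves memory, can be sketched generically (this is not the OpenMIGS code): each Gaussian stores only a small object id, while the uncompressed high-dimensional semantic features live once per object in a table; an open-vocabulary text query then selects Gaussians through that table. Dimensions, the similarity threshold, and the random features are assumptions.

```python
# Generic label-to-feature codebook lookup (illustrative only).
import torch
import torch.nn.functional as F

# Hypothetical setup: per-Gaussian object ids index a table of uncompressed
# high-dimensional features (e.g., 512-d, CLIP-like), stored once per object.
num_objects, feat_dim, num_gaussians = 10, 512, 100_000
codebook = F.normalize(torch.randn(num_objects, feat_dim), dim=-1)
object_ids = torch.randint(0, num_objects, (num_gaussians,))  # one id per Gaussian

def query_open_vocab(text_embed, threshold=0.3):
    """Boolean mask over Gaussians whose object feature matches the text query."""
    sims = codebook @ F.normalize(text_embed, dim=-1)   # (num_objects,)
    matched = sims > threshold                           # which objects match
    return matched[object_ids]                           # lift to per-Gaussian mask

text_embed = torch.randn(feat_dim)    # e.g., from a text encoder over "a chair"
mask = query_open_vocab(text_embed)
print(mask.sum().item(), "Gaussians selected")
```

The design point: storing one integer per Gaussian plus one feature vector per object keeps memory far below storing a 512-d feature on every Gaussian.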

NeurIPS 2025 · Conference Paper

SingRef6D: Monocular Novel Object Pose Estimation with a Single RGB Reference

  • Jiahui Wang
  • Haiyue Zhu
  • Haoren Guo
  • Abdullah Al Mamun
  • Cheng Xiang
  • Tong Heng Lee

Recent 6D pose estimation methods demonstrate notable performance but still face some practical limitations. For instance, many of them rely heavily on sensor depth, which may fail with challenging surface conditions, such as transparent or highly reflective materials. In the meantime, RGB-based solutions provide less robust matching performance in low-light and texture-less scenes due to the lack of geometry information. Motivated by these, we propose SingRef6D, a lightweight pipeline requiring only a single RGB image as a reference, eliminating the need for costly depth sensors, multi-view image acquisition, or training view synthesis models and neural fields. This enables SingRef6D to remain robust and capable even under resource-limited settings where depth or dense templates are unavailable. Our framework incorporates two key innovations. First, we propose a token-scaler-based fine-tuning mechanism with a novel optimization loss on top of Depth-Anything v2 to enhance its ability to predict accurate depth, even for challenging surfaces. Our results show a 14.41% improvement (in $\delta_{1.05}$) on REAL275 depth prediction compared to Depth-Anything v2 (with a fine-tuned head). Second, benefiting from depth availability, we introduce a depth-aware matching process that effectively integrates spatial relationships within LoFTR, enabling our system to handle matching for challenging materials and lighting conditions. Evaluations of pose estimation on the REAL275, ClearPose, and Toyota-Light datasets show that our approach surpasses state-of-the-art methods, achieving a 6.1% improvement in average recall.
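For reference, the $\delta_{1.05}$ figure above uses the standard depth-accuracy metric: the fraction of valid pixels whose predicted-to-ground-truth depth ratio (taken in whichever direction is larger) stays below the threshold, here a tight 5% tolerance. A small self-contained example:

```python
# Standard delta depth-accuracy metric, e.g., delta_{1.05}.
import numpy as np

def delta_accuracy(pred, gt, threshold=1.05):
    """Fraction of valid pixels with max(pred/gt, gt/pred) below threshold."""
    valid = gt > 0
    ratio = np.maximum(pred[valid] / gt[valid], gt[valid] / pred[valid])
    return float((ratio < threshold).mean())

# Toy usage: synthetic ground truth and mildly noisy predictions.
gt = np.random.uniform(0.5, 3.0, size=(480, 640))
pred = gt * np.random.uniform(0.96, 1.08, size=gt.shape)
print(delta_accuracy(pred, gt))   # about 0.75 for this noise band
```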

IROS 2024 · Conference Paper

LCP-Fusion: A Neural Implicit SLAM with Enhanced Local Constraints and Computable Prior

  • Jiahui Wang
  • Yinan Deng
  • Yi Yang 0009
  • Yufeng Yue

Recently, dense Simultaneous Localization and Mapping (SLAM) based on neural implicit representations has shown impressive progress in hole filling and high-fidelity mapping. Nevertheless, existing methods either heavily rely on known scene bounds or suffer inconsistent reconstruction due to drift in potential loop-closure regions, or both, which can be attributed to inflexible representations and a lack of local constraints. In this paper, we present LCP-Fusion, a neural implicit SLAM system with enhanced local constraints and computable priors, which uses a sparse voxel octree containing feature grids and SDF priors as a hybrid scene representation, enabling scalability and robustness during mapping and tracking. To enhance the local constraints, we propose a novel sliding-window selection strategy based on visual overlap to address loop closure, and a practical warping loss to constrain relative poses. Moreover, we estimate SDF priors as a coarse initialization for implicit features, which brings additional explicit constraints and robustness, especially when a light but efficient adaptive early-ending strategy is adopted. Experiments demonstrate that our method achieves better localization accuracy and reconstruction consistency than existing RGB-D implicit SLAM systems, especially in challenging real scenes (ScanNet) as well as self-captured scenes with unknown scene bounds. The code is available at https://github.com/laliwang/LCP-Fusion.
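The overlap-based window selection can be sketched generically (this is not the LCP-Fusion code): approximate each frame's visual footprint by the voxels its back-projected depth touches, score candidate keyframes by footprint overlap with the current frame, and keep those in a moderate-overlap band. The band limits and the set-based footprint are illustrative assumptions.

```python
# Illustrative overlap-based keyframe window selection.
import numpy as np

def visual_overlap(frustum_a, frustum_b):
    """Overlap ratio between two frames, approximated by the voxels each observes.
    frustum_*: sets of (i, j, k) voxel indices hit by back-projected depth."""
    return len(frustum_a & frustum_b) / max(len(frustum_a), 1)

def select_window(current, keyframes, lo=0.1, hi=0.8, size=5):
    """Pick keyframes with moderate overlap: enough co-visibility to constrain
    the pose, but low enough to also include distant, loop-closure-like views."""
    scored = [(visual_overlap(current, kf), i) for i, kf in enumerate(keyframes)]
    in_band = [i for s, i in sorted(scored, reverse=True) if lo <= s <= hi]
    return in_band[:size]

# Toy usage: voxel footprints as random index sets.
rng = np.random.default_rng(0)
current = set(map(tuple, rng.integers(0, 20, size=(300, 3))))
keyframes = [set(map(tuple, rng.integers(0, 20, size=(300, 3)))) for _ in range(10)]
print(select_window(current, keyframes))
```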

AAAI 2024 · Conference Paper

SEC: More Accurate Clustering Algorithm via Structural Entropy

  • Junyu Huang
  • Qilong Feng
  • Jiahui Wang
  • Ziyun Huang
  • Jinhui Xu
  • Jianxin Wang

As one of the most popular machine learning tools in unsupervised learning, clustering has been widely used in practical applications. While numerous clustering methods have been proposed, a common issue is that existing methods rely heavily on local neighborhood information during optimization, which leads to suboptimal performance on real-world datasets. Besides, most existing clustering methods use Euclidean distances or densities to measure the similarity between data points, which can constrain their effectiveness on datasets with irregular patterns. Thus, a key challenge is how to effectively capture the global structural information in clustering instances to improve clustering quality. In this paper, we propose a new clustering algorithm, called SEC, which uses global structural information extracted from an encoding tree to guide the clustering optimization process. Based on the relations between data points in the instance, a sparse graph of the clustering instance can be constructed. Leveraging this sparse graph, we propose an iterative encoding-tree method, where hierarchical abstractions of the encoding tree are iteratively extracted as new clustering features to obtain better clustering results. To avoid the influence of easily misclustered data points located on the boundaries of the clustering partitions, which we call "fringe points", we propose an iterative pre-deletion and reassignment technique that deletes and reassigns these points to obtain more resilient and precise clustering results. Empirical experiments on both synthetic and real-world datasets demonstrate that SEC outperforms state-of-the-art clustering methods. On average, clustering accuracy (ACC) increases by 1.7% and normalized mutual information (NMI) by 7.9% over the current state-of-the-art (SOTA) algorithm on synthetic datasets; on real-world datasets, our method outperforms other clustering methods by an average of 12.3% in ACC and 5.2% in NMI.
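As a concrete entry point to the pipeline above, here is a minimal sketch of the sparse-graph construction step, written as a generic kNN similarity graph (not necessarily the paper's exact construction); the encoding-tree extraction built on top of it is omitted.

```python
# Generic sparse kNN similarity graph over a clustering instance.
import numpy as np

def knn_sparse_graph(points, k=10):
    """Connect each point to its k nearest neighbours with Gaussian edge
    weights, so irregular cluster shapes are captured via connectivity
    rather than raw Euclidean distance alone."""
    d2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(d2, np.inf)
    nbrs = np.argsort(d2, axis=1)[:, :k]               # k nearest neighbours per point
    sigma2 = np.median(d2[np.isfinite(d2)])            # bandwidth heuristic
    n = len(points)
    adj = np.zeros((n, n))
    rows = np.repeat(np.arange(n), k)
    cols = nbrs.ravel()
    adj[rows, cols] = np.exp(-d2[rows, cols] / (2 * sigma2))
    return np.maximum(adj, adj.T)                      # symmetrise

points = np.random.randn(200, 2)
adj = knn_sparse_graph(points)
print((adj > 0).sum(), "edges")
```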

ICRA 2023 · Conference Paper

Few-Shot Point Cloud Semantic Segmentation via Contrastive Self-Supervision and Multi-Resolution Attention

  • Jiahui Wang
  • Haiyue Zhu
  • Haoren Guo
  • Abdullah Al Mamun 0002
  • Cheng-Xiang Wang 0001
  • Tong Heng Lee

This paper presents an effective few-shot point cloud semantic segmentation approach for real-world applications. Existing few-shot segmentation methods for point clouds rely heavily on fully supervised pretraining with large annotated datasets, which biases the learned feature extraction toward the pretrained classes. However, since the purpose of few-shot learning is to handle unknown/unseen classes, such class-specific feature extraction during pretraining generalizes poorly to new classes. Moreover, point cloud datasets rarely contain a large number of classes due to the difficulty of annotation. To address these issues, we propose a contrastive self-supervision framework for few-shot learning pretraining, which aims to eliminate the feature-extraction bias through class-agnostic contrastive supervision. Specifically, we implement a novel contrastive learning approach with a learnable augmentor for 3D point clouds to achieve point-wise differentiation, thereby enhancing the pretraining while keeping overfitting under control through self-supervision. Furthermore, we develop a multi-resolution attention module that uses both the nearest and farthest points to extract local and global point information more effectively, and we adopt a center-concentrated multi-prototype to mitigate intra-class sparsity. Comprehensive experiments show that our approach achieves state-of-the-art performance. Moreover, a case study on practical CAM/CAD segmentation demonstrates the effectiveness of our approach for real-world applications.
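The class-agnostic contrastive pretraining can be illustrated with a standard point-wise InfoNCE loss: features of the same point under two augmented views are positives, and all other points serve as negatives. The paper's learnable augmentor is omitted here; the shapes and temperature are assumptions.

```python
# Standard point-wise InfoNCE loss (generic illustration of the pretraining signal).
import torch
import torch.nn.functional as F

def pointwise_info_nce(z1, z2, temperature=0.07):
    """Contrast per-point features across two augmented views of one cloud:
    matching points (diagonal) are positives, all others are negatives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature          # (N, N) similarity matrix
    targets = torch.arange(z1.size(0))          # positive pairs on the diagonal
    return F.cross_entropy(logits, targets)

# Toy usage: 512 points, 64-d features from two augmented views.
z1 = torch.randn(512, 64, requires_grad=True)
z2 = torch.randn(512, 64)
loss = pointwise_info_nce(z1, z2)
loss.backward()
```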