Arrow Research

Author name cluster

Yang Cong

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers

15

AAAI Conference 2026 Conference Paper

Towards Efficient and Effective Interactive 3D Segmentation

  • Wei Cong
  • Yang Cong
  • Jiahua Dong
  • Gan Sun

Interactive 3D segmentation embodies an advanced human-in-the-loop paradigm, where a model iteratively refines the segmentation of objects of interest within a 3D point cloud through user feedback. Existing methods have achieved notable advancements at the expense of substantial resource consumption. To address this challenge, we introduce E2I3D, an efficient and effective model for interactive 3D segmentation. Specifically, we propose a two-stage efficiency-to-effectiveness framework that decouples efficiency from effectiveness, avoiding the high training cost of joint optimization. For efficiency, the first stage performs heterogeneous pruning, which reliably compresses the model by ranking and pruning the constructed heterogeneous groups separately based on gradient compensation. For effectiveness, the second stage applies hierarchical click-aware attention, which integrates geometric details from high-resolution features with global context from low-resolution features to enhance click-guided interaction. Extensive experiments on public datasets demonstrate that E2I3D exceeds state-of-the-art methods in both efficiency and effectiveness. For instance, on the KITTI-360 dataset, E2I3D boosts the IoU for interactive single-object segmentation from 44.4% to 49.0% with 5 user clicks, while simultaneously reducing parameters from 39.3M to 5.7M.
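The abstract does not specify how groups are scored, so the following is only a minimal sketch of gradient-based group ranking for structured pruning, using a first-order Taylor importance criterion as an assumed stand-in for the paper's gradient-compensation criterion; `group_importance` and `channels_to_prune` are hypothetical helpers.

```python
# Sketch: rank output-channel groups of a conv layer by |w * dL/dw| and
# select the lowest-scoring ones for pruning. Run a backward pass first so
# that conv.weight.grad is populated.
import torch
import torch.nn as nn

def group_importance(conv: nn.Conv2d) -> torch.Tensor:
    """One importance score per output channel (first-order Taylor criterion)."""
    w, g = conv.weight, conv.weight.grad
    return (w * g).abs().sum(dim=(1, 2, 3))

def channels_to_prune(conv: nn.Conv2d, ratio: float = 0.5) -> torch.Tensor:
    """Indices of the lowest-importance channels for a given prune ratio."""
    scores = group_importance(conv)
    k = int(ratio * scores.numel())
    return torch.argsort(scores)[:k]
```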

ICRA Conference 2025 Conference Paper

DetailRefine: Towards Fine-Grained and Efficient Online Monocular 3D Reconstruction

  • Fupeng Chu
  • Yang Cong
  • Ronghan Chen

Online monocular 3D reconstruction has attracted widespread attention as it promotes the application of robots in interactive scenarios. Most existing methods focus on 1) real-time reconstruction, 2) accurate voxel feature learning, and 3) effective voxel sparsification algorithms. To these ends, 1) they adopt a coarse-to-fine pipeline in which all non-empty voxels are sent to the next level for refinement. However, this results in over-refinement of flat regions, leading to unnecessary computational overhead. Furthermore, 2) advanced methods explore view visibility but overlook the discriminability among visible views, which limits the representation of learned voxel features. Moreover, 3) existing sparsification algorithms struggle to distinguish detailed from empty voxels, resulting in either the loss of detailed voxels or the retention of empty ones. To tackle these challenges, 1) we present Dynamic Detail Refinement (DDR) to allocate more voxels to detailed regions for refinement, which alleviates the computational burden. Furthermore, 2) we propose Discriminability-Aware Fusion (DAF) to focus on discriminative views, which helps to capture accurate voxel features. In addition, 3) we propose Hierarchical Hybrid Sparsification (HHS) to balance global completeness and local refinement, which helps to preserve detailed voxels effectively across hierarchical levels. Extensive experiments conducted on the representative ScanNet (V2) and 7-Scenes datasets demonstrate the superiority of the proposed method.
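As an illustration of the detail-driven allocation idea, here is a small sketch that spends a fixed refinement budget on the occupied voxels with the highest detail scores rather than refining every non-empty voxel; the detail score and function names are assumptions, not the paper's learned criterion.

```python
# Sketch: pick up to `budget` occupied voxels with the highest detail scores
# and forward only those to the finer level of a coarse-to-fine pipeline.
import torch

def select_voxels_for_refinement(occupancy: torch.Tensor,
                                 detail_score: torch.Tensor,
                                 budget: int) -> torch.Tensor:
    flat_idx = occupancy.flatten().nonzero(as_tuple=True)[0]  # occupied voxels
    scores = detail_score.flatten()[flat_idx]
    top = torch.topk(scores, k=min(budget, flat_idx.numel())).indices
    return flat_idx[top]  # flat indices of voxels sent for refinement
```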

IROS Conference 2025 Conference Paper

Learning Generalizable 3D Manipulation With 10 Demonstrations

  • Yu Ren
  • Yang Cong
  • Bohao Huang
  • Jiahao Long
  • Ronghan Chen
  • Hongbo Li
  • Huijie Fan

Learning robust and generalizable manipulation skills from few demonstrations remains a key challenge in robotics, with broad applications in industrial automation and service robotics. Although recent imitation learning methods have achieved impressive results, they often require a large amount of demonstration data and struggle to generalize across spatial variations. In this work, we propose a framework that learns 3D manipulation policies from only 10 demonstrations while achieving robust generalization to unseen spatial configurations through semantic-guided perception and spatially equivariant policy learning. Our framework consists of two key modules: a Semantic Guided Perception module that extracts task-aware 3D representations from RGB-D inputs using semantic priors, and a Spatial Generalized Decision module that implements a diffusion-based policy preserving spatial equivariance through denoising. Central to our framework is a spatially equivariant training strategy, which adapts 2D data augmentation principles to 3D manipulation by maintaining gripper-object spatial relationships during trajectory augmentation. We validate our framework through extensive experiments on both simulation benchmarks and real-world robotic systems. Our method demonstrates a significant improvement in success rates over state-of-the-art approaches on a series of challenging tasks, particularly under significant object pose variations. This work shows significant potential to advance efficient and generalizable manipulation skill learning in real-world applications.
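The core of the equivariant augmentation is applying one shared rigid transform to both the observation and the action. A minimal sketch, assuming a yaw-plus-translation augmentation and a homogeneous gripper pose (both assumptions; the paper may use a richer transform family):

```python
# Sketch: augment a demonstration by transforming the scene point cloud and
# the gripper pose with the same random rigid transform, preserving their
# relative spatial relationship.
import numpy as np

def augment_scene_and_action(points: np.ndarray, gripper_pose: np.ndarray,
                             max_shift: float = 0.05):
    """points: (N, 3); gripper_pose: (4, 4) homogeneous pose, same frame."""
    theta = np.random.uniform(-np.pi, np.pi)            # random yaw angle
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = np.random.uniform(-max_shift, max_shift, size=3)
    new_points = points @ T[:3, :3].T + T[:3, 3]        # transform observation
    new_pose = T @ gripper_pose                         # same transform on action
    return new_points, new_pose
```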

ICRA Conference 2024 Conference Paper

Marrying NeRF with Feature Matching for One-step Pose Estimation

  • Ronghan Chen
  • Yang Cong
  • Yu Ren

Given the image collection of an object, we aim at building a real-time image-based pose estimation method that requires neither a CAD model nor hours of object-specific training. Recent NeRF-based methods provide a promising solution by directly optimizing the pose from the pixel loss between rendered and target images. However, during inference, they require a long convergence time and suffer from local minima, making them impractical for real-time robot applications. We aim to solve this problem by marrying image matching with NeRF. With 2D matches and depth rendered by NeRF, we directly solve the pose in one step by building 2D-3D correspondences between the target and initial views, thus allowing for real-time prediction. Moreover, to improve the accuracy of the 2D-3D correspondences, we propose a 3D consistent point mining strategy, which effectively discards unfaithful points reconstructed by NeRF. In addition, current NeRF-based methods that naively optimize the pixel loss fail on occluded images. Thus, we further propose a 2D-match-based sampling strategy to preclude the occluded area. Experimental results on representative datasets prove that our method outperforms state-of-the-art methods and improves inference efficiency by 90×, achieving real-time prediction at 6 FPS.
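The one-step solve is standard perspective geometry once the correspondences exist: back-project the initial view's matched pixels using NeRF-rendered depth, then run PnP against the target-view pixels. A hedged sketch (the matcher and renderer are placeholders; only the geometric solve is shown):

```python
# Sketch: solve the relative pose in one step from 2D-3D correspondences
# built with NeRF-rendered depth, using RANSAC-PnP.
import cv2
import numpy as np

def solve_pose(px_init: np.ndarray,    # (N, 2) matched pixels, initial view
               px_tgt: np.ndarray,     # (N, 2) matched pixels, target view
               depth_init: np.ndarray, # (H, W) NeRF-rendered depth
               K: np.ndarray):         # (3, 3) camera intrinsics
    u, v = px_init[:, 0], px_init[:, 1]
    z = depth_init[v.astype(int), u.astype(int)]
    # Back-project matched pixels into the initial camera frame.
    pts3d = np.stack([(u - K[0, 2]) * z / K[0, 0],
                      (v - K[1, 2]) * z / K[1, 1],
                      z], axis=1)
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        pts3d.astype(np.float32), px_tgt.astype(np.float32), K, None)
    return rvec, tvec  # pose of the target view relative to the 3D points
```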

IROS Conference 2022 Conference Paper

Class-Incremental Gesture Recognition Learning with Out-of-Distribution Detection

  • Mingxue Li
  • Yang Cong
  • Yuyang Liu
  • Gan Sun

Gesture recognition is a popular human-computer interaction technology that has been widely applied in many fields (e.g., autonomous driving, medical care, VR, and AR). However, 1) most existing gesture recognition methods focus on fixed recognition scenarios with a handful of gestures, which leads to excessive memory consumption and computation when new gestures must be learned continuously; 2) meanwhile, the performance of popular class-incremental methods degrades significantly on previously learned classes (i.e., catastrophic forgetting) due to the ambiguity and variability of gestures. To tackle these challenges, we propose a novel class-incremental gesture recognition method with out-of-distribution (OOD) detection, which can continuously adapt to new gesture classes and achieve high performance on both learned and new gestures. Specifically, we construct an episodic memory with a subset of learned training samples to protect previous knowledge from forgetting. Moreover, OOD-detection-based memory management is developed to explore the most representative and informative core set from the learned datasets. When a new gesture recognition task with unseen classes arrives, rehearsal enhancement is adopted to increase the diversity of memory exemplars so that they better fit the real characteristics of gesture recognition. After deriving an effective class-incremental gesture recognition strategy, we perform experiments on two representative datasets to validate the superiority of our method. Evaluation experiments demonstrate that our proposed method substantially outperforms state-of-the-art methods, with about 2.17%-3.81% improvement under different class-incremental learning scenarios.
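To make the episodic-memory step concrete, here is a minimal sketch that keeps the per-class samples closest to the class mean (herding-style selection). This mean-distance criterion is a simplified assumption standing in for the paper's OOD-detection-based memory management:

```python
# Sketch: build an episodic memory by keeping, for each class, the samples
# whose embeddings lie closest to the class mean.
import numpy as np

def select_exemplars(features: np.ndarray, labels: np.ndarray,
                     per_class: int) -> dict:
    """features: (N, D) embeddings; returns {class: exemplar indices}."""
    memory = {}
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        mean = features[idx].mean(axis=0)
        dist = np.linalg.norm(features[idx] - mean, axis=1)
        memory[int(c)] = idx[np.argsort(dist)[:per_class]]  # most representative
    return memory
```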

AAAI Conference 2021 Conference Paper

Generative Partial Visual-Tactile Fused Object Clustering

  • Tao Zhang
  • Yang Cong
  • Gan Sun
  • Jiahua Dong
  • Yuyang Liu
  • Zhengming Ding

Visual-tactile fused sensing for object clustering has made significant progress recently, since involving the tactile modality can effectively improve clustering performance. However, missing-data (i.e., partial-data) issues always arise due to occlusion and noise during the data collection process. Most existing partial multi-view clustering methods do not solve this issue well because of the heterogeneous-modality challenge, and naively employing them would inevitably induce a negative effect and further hurt performance. To address these challenges, we propose a Generative Partial Visual-Tactile Fused (i.e., GPVTF) framework for object clustering. More specifically, we first extract features from the partial visual and tactile data, respectively, and encode the extracted features in modality-specific feature subspaces. A conditional cross-modal clustering generative adversarial network is then developed to synthesize one modality conditioned on the other, which compensates for missing samples and aligns the visual and tactile modalities naturally through adversarial learning. Finally, two pseudo-label-based KL-divergence losses are employed to update the corresponding modality-specific encoders. Extensive comparative experiments on three public visual-tactile datasets prove the effectiveness of our method.
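The abstract does not give the exact form of the pseudo-label KL losses, so the sketch below uses the common DEC-style formulation (Student-t soft assignments against a sharpened target distribution) as an assumed stand-in:

```python
# Sketch: a pseudo-label KL-divergence clustering loss of the kind used to
# update a modality-specific encoder (DEC-style formulation).
import torch

def soft_assign(z: torch.Tensor, centroids: torch.Tensor) -> torch.Tensor:
    """Student-t similarity between embeddings z (N, D) and centroids (K, D)."""
    q = 1.0 / (1.0 + torch.cdist(z, centroids).pow(2))
    return q / q.sum(dim=1, keepdim=True)

def kl_clustering_loss(q: torch.Tensor) -> torch.Tensor:
    """KL(P || Q) where P is the sharpened pseudo-label target distribution."""
    p = (q ** 2) / q.sum(dim=0)            # sharpen and re-weight clusters
    p = p / p.sum(dim=1, keepdim=True)
    return (p * (p.log() - q.log())).sum(dim=1).mean()
```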

AAAI Conference 2021 Conference Paper

I3DOL: Incremental 3D Object Learning without Catastrophic Forgetting

  • Jiahua Dong
  • Yang Cong
  • Gan Sun
  • Bingtao Ma
  • Lichen Wang

3D object classification has attracted considerable attention in academic research and industrial applications. However, most existing methods need to access the training data of past 3D object classes when facing the common real-world scenario in which new classes of 3D objects arrive in a sequence. Moreover, the performance of advanced approaches degrades dramatically on past learned classes (i.e., catastrophic forgetting) due to the irregular and redundant geometric structures of 3D point cloud data. To address these challenges, we propose a new Incremental 3D Object Learning (i.e., I3DOL) model, which is the first exploration of continually learning new classes of 3D objects. Specifically, an adaptive-geometric centroid module is designed to construct discriminative local geometric structures, which better characterize the irregular point cloud representation of 3D objects. Afterwards, to prevent the catastrophic forgetting brought by redundant geometric information, a geometric-aware attention mechanism is developed to quantify the contributions of local geometric structures and explore the unique 3D geometric characteristics with high contributions for class-incremental learning. Meanwhile, a score fairness compensation strategy is proposed to further alleviate the catastrophic forgetting caused by unbalanced data between past and new classes of 3D objects, by compensating the biased prediction for new classes in the validation phase. Experiments on representative 3D datasets validate the superiority of our I3DOL framework.
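The abstract leaves the compensation step unspecified; one common way to correct the bias toward new classes is to rescale new-class logits by the ratio of old- to new-class classifier weight norms (the "weight aligning" idea), shown below as an assumed stand-in for the paper's score fairness compensation:

```python
# Sketch: shrink inflated new-class logits using the ratio of mean classifier
# weight norms between old and new classes.
import torch

def compensate_logits(logits: torch.Tensor, fc_weight: torch.Tensor,
                      num_old: int) -> torch.Tensor:
    """logits: (N, C); fc_weight: (C, D); the first num_old classes are old."""
    norms = fc_weight.norm(dim=1)
    gamma = norms[:num_old].mean() / norms[num_old:].mean()
    out = logits.clone()
    out[:, num_old:] *= gamma  # rebalance scores between old and new classes
    return out
```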

JBHI Journal 2021 Journal Article

Multi-Scale Context-Guided Deep Network for Automated Lesion Segmentation With Endoscopy Images of Gastrointestinal Tract

  • Shuai Wang
  • Yang Cong
  • Hancan Zhu
  • Xianyi Chen
  • Liangqiong Qu
  • Huijie Fan
  • Qiang Zhang
  • Mingxia Liu

Accurate lesion segmentation based on endoscopy images is a fundamental task for the automated diagnosis of gastrointestinal tract (GI Tract) diseases. Previous studies usually use hand-crafted features to represent endoscopy images, treating feature definition and lesion segmentation as two standalone tasks. Due to the possible heterogeneity between features and segmentation models, these methods often yield sub-optimal performance. Several fully convolutional networks have recently been developed to jointly perform feature learning and model training for GI Tract disease diagnosis. However, they generally ignore the local spatial details of endoscopy images, as down-sampling operations (e.g., pooling and convolutional striding) may cause irreversible loss of image spatial information. To this end, we propose a multi-scale context-guided deep network (MCNet) for end-to-end lesion segmentation of endoscopy images of the GI Tract, where both global and local contexts are captured as guidance for model training. Specifically, one global subnetwork is designed to extract the global structure and high-level semantic context of each input image. We then design two cascaded local subnetworks based on the output feature maps of the global subnetwork, aiming to capture both local appearance information and relatively high-level semantic information in a multi-scale manner. The feature maps learned by the three subnetworks are then fused for the subsequent task of lesion segmentation. We have evaluated the proposed MCNet on 1,310 endoscopy images from the public EndoVis-Ab and CVC-ClinicDB datasets for abnormality segmentation and polyp segmentation, respectively. Experimental results demonstrate that MCNet achieves 74% and 85% mean intersection over union (mIoU) on the two datasets, respectively, outperforming several state-of-the-art approaches in automated lesion segmentation with endoscopy images of the GI Tract.
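A toy sketch of the fusion pattern the abstract describes: upsample the global (low-resolution, high-level) feature map to the local feature resolution and fuse by concatenation before the segmentation head. The module shape and channel counts are illustrative, not MCNet's actual architecture:

```python
# Sketch: fuse global context with local detail for dense prediction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalLocalFusion(nn.Module):
    def __init__(self, c_global: int, c_local: int, n_classes: int):
        super().__init__()
        self.head = nn.Conv2d(c_global + c_local, n_classes, kernel_size=1)

    def forward(self, f_global: torch.Tensor, f_local: torch.Tensor):
        # Bring the global context up to the local resolution, then fuse.
        f_global = F.interpolate(f_global, size=f_local.shape[-2:],
                                 mode="bilinear", align_corners=False)
        return self.head(torch.cat([f_global, f_local], dim=1))
```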

AAAI Conference 2020 Conference Paper

Lifelong Spectral Clustering

  • Gan Sun
  • Yang Cong
  • Qianqian Wang
  • Jun Li
  • Yun Fu

In the past decades, spectral clustering (SC) has become one of the most effective clustering algorithms. However, most previous studies focus on spectral clustering tasks with a fixed task set and cannot incorporate a new spectral clustering task without access to previously learned tasks. In this paper, we explore the problem of spectral clustering in a lifelong machine learning framework, i.e., Lifelong Spectral Clustering (L2SC). Its goal is to efficiently learn a model for a new spectral clustering task by selectively transferring previously accumulated experience from a knowledge library. Specifically, the knowledge library of L2SC contains two components: 1) an orthogonal basis library, capturing latent cluster centers among the clusters in each pair of tasks; and 2) a feature embedding library, embedding the feature manifold information shared among multiple related tasks. As a new spectral clustering task arrives, L2SC first transfers knowledge from both the basis library and the feature library to obtain an encoding matrix, and further redefines the library base over time to maximize performance across all clustering tasks. Meanwhile, a general online update formulation is derived to alternately update the basis library and feature library. Finally, empirical experiments on several real-world benchmark datasets demonstrate that our L2SC model effectively improves clustering performance compared with other state-of-the-art spectral clustering algorithms.
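As a rough illustration of the knowledge transfer step, the sketch below encodes a new task's spectral embedding against a fixed orthogonal basis library by projection; the alternating library update is omitted, and all names are illustrative assumptions:

```python
# Sketch: encode a new task against an orthogonal basis library and
# reconstruct its spectral embedding from that library.
import numpy as np

def encode_new_task(B: np.ndarray, E_new: np.ndarray) -> np.ndarray:
    """B: (D, K) basis library with orthonormal columns;
    E_new: (D, C) spectral embedding of the new task."""
    return B.T @ E_new  # least-squares encoding reduces to a projection

def reconstruct(B: np.ndarray, code: np.ndarray) -> np.ndarray:
    return B @ code  # library-based approximation of the new embedding
```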

AAAI Conference 2020 Conference Paper

Visual Tactile Fusion Object Clustering

  • Tao Zhang
  • Yang Cong
  • Gan Sun
  • Qianqian Wang
  • Zhenming Ding

Object clustering, which aims at grouping similar objects into one cluster with an unsupervised strategy, has been extensively studied among various data-driven applications. However, most existing state-of-the-art object clustering methods (e.g., single-view or multi-view clustering methods) only explore visual information while ignoring one of the most important sensing modalities, i.e., tactile information, which can help capture different object properties and further boost the performance of the object clustering task. To effectively exploit both visual and tactile modalities for object clustering, in this paper we propose a deep Auto-Encoder-like Non-negative Matrix Factorization framework for visual-tactile fusion clustering. Specifically, deep matrix factorization constrained by an under-complete Auto-Encoder-like architecture is employed to jointly learn a hierarchical expression of the visual-tactile fusion data and preserve the local structure of the data-generating distribution of the visual and tactile modalities. Meanwhile, a graph regularizer is introduced to capture the intrinsic relations of data samples within each modality. Furthermore, we propose a modality-level consensus regularizer to effectively align the visual and tactile data in a common subspace in which the gap between visual and tactile data is mitigated. For model optimization, we present an efficient alternating minimization strategy to solve the proposed model. Finally, we conduct extensive experiments on public datasets to verify the effectiveness of our framework.
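As a shallow stand-in for the deep factorization above, here is a compact sketch of standard graph-regularized NMF with multiplicative updates; it shows only the graph-regularizer mechanics, not the paper's Auto-Encoder-like or consensus terms:

```python
# Sketch: graph-regularized NMF (X ~ U V^T) with multiplicative updates;
# the graph term smooths sample encodings V along the affinity graph W.
import numpy as np

def gnmf(X, W, k, lam=0.1, iters=200, eps=1e-9):
    """X: (D, N) nonnegative data; W: (N, N) sample affinity matrix."""
    Ddeg = np.diag(W.sum(axis=1))
    rng = np.random.default_rng(0)
    U = rng.random((X.shape[0], k))
    V = rng.random((X.shape[1], k))
    for _ in range(iters):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (Ddeg @ V) + eps)
    return U, V  # basis and graph-smoothed encodings
```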

ICRA Conference 2019 Conference Paper

Environment Driven Underwater Camera-IMU Calibration for Monocular Visual-Inertial SLAM

  • Changjun Gu
  • Yang Cong
  • Gan Sun

Most state-of-the-art underwater vision systems are calibrated manually in shallow water and then used in the open sea without recalibration. However, the refractive index of water changes with salinity, temperature, depth, and other underwater environmental indexes, which inevitably generates calibration errors and induces inaccuracy, e.g., in underwater Simultaneous Localization and Mapping (SLAM). To address this issue, in this paper we propose a new underwater Camera-Inertial Measurement Unit (IMU) calibration model, which needs to be calibrated only once in air; both the intrinsic parameters and the extrinsic parameters between the camera and IMU can then be automatically calculated from the environmental indexes. To the best of our knowledge, this is the first work to consider underwater Camera-IMU calibration via environmental indexes. We also build a verification platform to validate the effectiveness of our proposed method in real experiments, and use it for underwater monocular Visual-Inertial SLAM.
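To illustrate the direction of such an environment-driven adjustment: under the common flat-port, small-angle approximation, refraction scales the effective focal length by the water's refractive index. The index-from-environment model below is a made-up placeholder with assumed coefficients, not the paper's calibration model:

```python
# Sketch: adjust in-air camera intrinsics for underwater use from
# environmental indexes, using the flat-port focal-length scaling.
import numpy as np

def water_refractive_index(salinity_psu: float, temp_c: float,
                           depth_m: float) -> float:
    # Placeholder linear corrections around a nominal seawater index;
    # coefficients are illustrative assumptions only.
    return 1.333 + 1.8e-4 * salinity_psu - 1.0e-4 * temp_c + 1.4e-6 * depth_m

def underwater_intrinsics(K_air: np.ndarray, n_water: float) -> np.ndarray:
    K = K_air.astype(float).copy()
    K[0, 0] *= n_water  # fx
    K[1, 1] *= n_water  # fy: focal length magnified by refraction
    return K
```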

AAAI Conference 2018 Conference Paper

Active Lifelong Learning With “Watchdog”

  • Gan Sun
  • Yang Cong
  • Xiaowei Xu

Lifelong learning intends to learn new consecutive tasks depending on previously accumulated experience, i.e., a knowledge library. However, the knowledge among different newly arriving tasks is imbalanced. Therefore, in this paper we try to mimic an effective “human cognition” strategy by actively sorting new tasks by importance in an unknown-to-known process and preferentially selecting the more informative, important tasks to learn. To achieve this, we treat assessing the importance of a new task, i.e., whether it is unknown or not, as an outlier detection issue, and design a hierarchical dictionary learning model consisting of two-level task descriptors to sparsely reconstruct each task under an ℓ0-norm constraint. The newly arriving tasks are sorted by sparse reconstruction score in descending order, and tasks with high reconstruction scores are permitted to pass; we call this mechanism the “watchdog”. Next, the knowledge library of the lifelong learning framework encodes the selected task by transferring previous knowledge, and then automatically updates itself with knowledge from both previously learned tasks and the current task. For model optimization, the alternating direction method is employed to solve our model, and it converges to a fixed point. Extensive experiments on both benchmark datasets and our own dataset demonstrate the effectiveness of our proposed model, especially in task selection and dictionary learning.
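The ranking step can be approximated with a standard greedy ℓ0 surrogate: sparsely code each new task's descriptor against the dictionary with orthogonal matching pursuit and rank by reconstruction quality. The flat (non-hierarchical) dictionary here is a simplifying assumption:

```python
# Sketch: "watchdog"-style ranking of new tasks by sparse reconstruction
# score, using OMP as a greedy surrogate for the l0-norm constraint.
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

def rank_new_tasks(D: np.ndarray, tasks: np.ndarray, k: int = 5) -> np.ndarray:
    """D: (d, m) task-descriptor dictionary; tasks: (n, d) new descriptors."""
    omp = OrthogonalMatchingPursuit(n_nonzero_coefs=k, fit_intercept=False)
    scores = []
    for t in tasks:
        omp.fit(D, t)
        resid = np.linalg.norm(t - D @ omp.coef_)
        scores.append(-resid)                 # high score = well reconstructed
    return np.argsort(scores)[::-1]           # task indices, best first
```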

IROS Conference 2017 Conference Paper

Deep learning of directional truncated signed distance function for robust 3D object recognition

  • Hongsen Liu
  • Yang Cong
  • Shuai Wang 0003
  • Huijie Fan
  • Dongying Tian
  • Yandong Tang

In this paper, we develop a novel 3D object recognition algorithm that performs detection and pose estimation jointly. We focus on analyzing the advantages of the 3D point cloud relative to the RGB-D image and try to eliminate the unpredictability of output values that inevitably occurs in regression tasks. To achieve this, we first adopt the Truncated Signed Distance Function (TSDF) to encode the point cloud and extract compact, discriminative features via an unsupervised deep learning network. This approach not only eliminates the dense scale sampling needed for offline model training, but also reduces the distortion introduced by mapping the 3D shape to the 2D plane and overcomes the dependence on color cues. Then, we train a Hough forest to achieve multi-object detection and 6-DoF pose estimation simultaneously. In addition, we propose a robust multilevel verification strategy that effectively reduces the unpredictability of output values occurring in the Hough regression module. Experiments on public datasets demonstrate that our approach provides effective results comparable to the state of the art.
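For reference, here is a minimal sketch of encoding a point cloud on a voxel grid with a truncated distance function, the representation this paper builds on. For brevity it computes the unsigned truncated distance (a proper TSDF also needs a sign, e.g., from normals or camera rays) and omits the directional extension:

```python
# Sketch: truncated nearest-point distance of a point cloud on a voxel grid.
import numpy as np
from scipy.spatial import cKDTree

def truncated_distance_grid(points: np.ndarray, res: int = 32,
                            trunc: float = 0.1) -> np.ndarray:
    lo, hi = points.min(axis=0), points.max(axis=0)
    axes = [np.linspace(lo[d], hi[d], res) for d in range(3)]
    grid = np.stack(np.meshgrid(*axes, indexing="ij"), axis=-1).reshape(-1, 3)
    dist, _ = cKDTree(points).query(grid)  # distance to the nearest surface point
    return np.minimum(dist, trunc).reshape(res, res, res) / trunc
```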

IROS Conference 2016 Conference Paper

A design of phase-closed-loop nanomachining control based ultrasonic vibration-assisted AFM

  • Jialin Shi
  • Lianqing Liu
  • Peng Yu
  • Yang Cong

This paper proposes a phase-closed-loop nanomachining control method that realizes direct control of the machining depth in ultrasonic vibration-assisted AFM. Because they use the applied force to control the machining depth, conventional AFM machining approaches are unable to machine a nanostructure to a specified depth. With the proposed method, the vibration phase of the micro-cantilever has a specific relationship with the machining depth, so nano-grooves of a desired depth can be machined by using the phase value as the feedback signal for PID control. In this paper, the theoretical analysis and simulation are carried out, and experiments with the phase-closed-loop control method are conducted. The experimental results verify the basic feasibility of the proposed method. The method also demonstrates potential for the fabrication of three-dimensional nanostructures and nanoelectronic devices.
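A toy sketch of the control loop the abstract describes: a discrete PID controller driven by the error between the measured cantilever vibration phase and the phase setpoint corresponding to the desired depth. Gains, units, and the plant interface are illustrative assumptions:

```python
# Sketch: discrete PID on the cantilever phase error; the output adjusts the
# z-axis (depth) command sent to the piezo stage.
class PhasePID:
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral, self.prev_err = 0.0, 0.0

    def step(self, phase_setpoint: float, phase_measured: float) -> float:
        err = phase_setpoint - phase_measured
        self.integral += err * self.dt
        deriv = (err - self.prev_err) / self.dt
        self.prev_err = err
        return self.kp * err + self.ki * self.integral + self.kd * deriv
```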