Arrow Research search

Author name cluster

Gim Hee Lee

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

61 papers
2 author rows

Possible papers

61

AAAI Conference 2026 Conference Paper

Active3D: Active High-Fidelity 3D Reconstruction via Multi-Level Uncertainty Quantification

  • Yan Li
  • Yingzhao Li
  • Gim Hee Lee

In this paper, we present an active exploration framework for high-fidelity 3D reconstruction that incrementally builds a multi-level uncertainty space and selects next-best-views through an uncertainty-driven motion planner. We introduce a hybrid implicit–explicit representation that fuses neural fields with Gaussian primitives to jointly capture global structural priors and locally observed details. Based on this hybrid state, we derive a hierarchical uncertainty volume that quantifies both implicit global structure quality and explicit local surface confidence. To focus optimization on the most informative regions, we propose an uncertainty-driven keyframe selection strategy that anchors high-entropy viewpoints as sparse attention nodes, coupled with a viewpoint-space sliding window for uncertainty-aware local refinement. The planning module formulates next-best-view selection as an Expected Hybrid Information Gain problem and incorporates a risk-sensitive path planner to ensure efficient and safe exploration. Extensive experiments on challenging benchmarks demonstrate that our approach consistently achieves state-of-the-art accuracy, completeness, and rendering quality, highlighting its effectiveness for real-world active reconstruction and robotic perception tasks.

AAAI Conference 2026 Conference Paper

Condensed Data Expansion Using Model Inversion for Knowledge Distillation

  • Kuluhan Binici
  • Shivam Aggarwal
  • Cihan Acar
  • Nam Trung Pham
  • Karianto Leman
  • Gim Hee Lee
  • Tulika Mitra

Condensed datasets offer a compact representation of larger datasets, but training models directly on them or using them to enhance model performance through knowledge distillation (KD) can result in suboptimal outcomes due to limited information. To address this, we propose a method that expands condensed datasets using model inversion, a technique for generating synthetic data based on the impressions of a pre-trained model on its training data. This approach is particularly well-suited for KD scenarios, as the teacher model is already pre-trained and retains knowledge of the original training data. By creating synthetic data that complements the condensed samples, we enrich the training set and better approximate the underlying data distribution, leading to improvements in student model accuracy during knowledge distillation. Our method demonstrates significant gains in KD accuracy compared to using condensed datasets alone and outperforms standard model inversion-based KD methods by up to 11.4% across various datasets and model architectures. Importantly, it remains effective even when using as few as one condensed sample per class, and can also enhance performance in few-shot scenarios where only limited real data samples are available.
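
The "model inversion" step above refers to synthesizing images directly from a pre-trained teacher. Below is a minimal DeepInversion-style sketch for intuition; the objective, input size, and hyperparameters are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical model-inversion sketch (assumed DeepInversion-style objective):
# optimize random inputs so the teacher predicts the target labels while the
# batch statistics match the BatchNorm running statistics.
import torch
import torch.nn.functional as F

def invert_batch(teacher, target_labels, steps=2000, lr=0.1, bn_weight=1.0):
    teacher.eval()
    x = torch.randn(len(target_labels), 3, 32, 32, requires_grad=True)  # assumed CIFAR-sized inputs
    opt = torch.optim.Adam([x], lr=lr)

    # Penalize mismatch between batch statistics and each BatchNorm layer's running statistics.
    bn_losses = []
    def bn_hook(module, inputs, _output):
        feat = inputs[0]
        mean = feat.mean(dim=(0, 2, 3))
        var = feat.var(dim=(0, 2, 3), unbiased=False)
        bn_losses.append(F.mse_loss(mean, module.running_mean)
                         + F.mse_loss(var, module.running_var))
    hooks = [m.register_forward_hook(bn_hook)
             for m in teacher.modules() if isinstance(m, torch.nn.BatchNorm2d)]

    for _ in range(steps):
        bn_losses.clear()
        opt.zero_grad()
        loss = F.cross_entropy(teacher(x), target_labels) + bn_weight * torch.stack(bn_losses).sum()
        loss.backward()
        opt.step()

    for h in hooks:
        h.remove()
    return x.detach()  # synthetic samples that complement the condensed set
```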

AAAI Conference 2026 Conference Paper

HandMCM: Multi-modal Point Cloud-based Correspondence State Space Model for 3D Hand Pose Estimation

  • Wencan Cheng
  • Gim Hee Lee

3D hand pose estimation that involves accurate estimation of 3D human hand keypoint locations is crucial for many human-computer interaction applications such as augmented reality. However, this task poses significant challenges due to self-occlusion of the hands and occlusions caused by interactions with objects. In this paper, we propose HandMCM to address these challenges. Our HandMCM is a novel method based on the powerful state space model (Mamba). By incorporating modules for local information injection/filtering and correspondence modeling, the proposed correspondence Mamba effectively learns the highly dynamic kinematic topology of keypoints across various occlusion scenarios. Moreover, by integrating multi-modal image features, we enhance the robustness and representational capacity of the input, leading to more accurate hand pose estimation. Empirical evaluations on three benchmark datasets demonstrate that our model significantly outperforms current state-of-the-art methods, particularly in challenging scenarios involving severe occlusions. These results highlight the potential of our approach to advance the accuracy and reliability of 3D hand pose estimation in practical applications.

AAAI Conference 2026 Conference Paper

RiemanLine: Riemannian Manifold Representation of 3D Lines for Factor Graph Optimization

  • Yan Li
  • Ze Yang
  • Keisuke Tateno
  • Federico Tombari
  • Liang Zhao
  • Gim Hee Lee

Minimal parametrization of 3D lines plays a critical role in camera localization and structural mapping. Existing representations in robotics and computer vision predominantly handle independent lines, overlooking structural regularities such as sets of parallel lines that are pervasive in man-made environments. This paper introduces RiemanLine, a unified minimal representation for 3D lines formulated on Riemannian manifolds that jointly accommodates both individual lines and parallel-line groups. Our key idea is to decouple each line landmark into global and local components: a shared vanishing direction optimized on the unit sphere, and scaled normal vectors constrained on orthogonal subspaces, enabling compact encoding of structural regularities. For n parallel lines, the proposed representation reduces the parameter space from 4n (orthonormal form) to 2n+2, naturally embedding parallelism without explicit constraints. We further integrate this parameterization into a factor graph framework, allowing global direction alignment and local reprojection optimization within a unified manifold-based bundle adjustment. Extensive experiments on ICL-NUIM, TartanAir, and synthetic benchmarks demonstrate that our method achieves significantly more accurate pose estimation and line reconstruction, while reducing parameter dimensionality and improving convergence stability.
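
Taking the abstract's parameter count at face value, the saving for n parallel lines can be tallied as follows (a back-of-envelope sketch, not the paper's derivation):

```latex
\underbrace{4n}_{\text{orthonormal form: 4 DoF per line}}
\;\longrightarrow\;
\underbrace{2}_{\text{shared direction } \mathbf{d}\in S^{2}}
+\underbrace{2n}_{\text{per-line scaled normal in the subspace orthogonal to } \mathbf{d}}
\;=\;2n+2 .
```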

AAAI Conference 2026 Conference Paper

VPN: Visual Prompt Navigation

  • Shuo Feng
  • Zihan Wang
  • Yuchen Li
  • Rui Kong
  • Hengyi Cai
  • Shuaiqiang Wang
  • Gim Hee Lee
  • Piji Li

While natural language is commonly used to guide embodied agents, the inherent ambiguity and verbosity of language often hinder the effectiveness of language-guided navigation in complex environments. To this end, we propose Visual Prompt Navigation (VPN), a novel paradigm that guides agents to navigate using only user-provided visual prompts within 2D top-view maps. This visual prompt primarily focuses on marking the visual navigation trajectory on a top-down view of a scene, offering intuitive and spatially grounded guidance without relying on language instructions. This makes it friendlier for non-expert users and reduces interpretive ambiguity. We build VPN tasks in both discrete and continuous navigation settings, constructing two new datasets, R2R-VP and R2R-CE-VP, by extending existing R2R and R2R-CE episodes with corresponding visual prompts. Furthermore, we introduce VPNet, a dedicated baseline network to handle the VPN tasks, with two data augmentation strategies: view-level augmentation (altering initial headings and prompt orientations) and trajectory-level augmentation (incorporating diverse trajectories from large-scale 3D scenes), to enhance navigation performance. Extensive experiments evaluate how visual prompt forms, top-view map formats, and data augmentation strategies affect the performance of visual prompt navigation.

NeurIPS Conference 2025 Conference Paper

4D3R: Motion-Aware Neural Reconstruction and Rendering of Dynamic Scenes from Monocular Videos

  • Mengqi Guo
  • Bo Xu
  • Yanyan Li
  • Gim Hee Lee

Novel view synthesis from monocular videos of dynamic scenes with unknown camera poses remains a fundamental challenge in computer vision and graphics. While recent advances in 3D representations such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have shown promising results for static scenes, they struggle with dynamic content and typically rely on pre-computed camera poses. We present 4D3R, a pose-free dynamic neural rendering framework that decouples static and dynamic components through a two-stage approach. Our method first leverages 3D foundational models for initial pose and geometry estimation, followed by motion-aware refinement. 4D3R introduces two key technical innovations: (1) a motion-aware bundle adjustment (MA-BA) module that combines transformer-based learned priors with SAM2 for robust dynamic object segmentation, enabling more accurate camera pose refinement; and (2) an efficient Motion-Aware Gaussian Splatting (MA-GS) representation that uses control points with a deformation field MLP and linear blend skinning to model dynamic motion, significantly reducing computational cost while maintaining high-quality reconstruction. Extensive experiments on real-world dynamic datasets demonstrate that our approach achieves up to 1.8 dB PSNR improvement over state-of-the-art methods, particularly in challenging scenarios with large dynamic objects, while reducing computational requirements by 5× compared to previous dynamic scene representations.
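
For reference, linear blend skinning with control points typically deforms each Gaussian center as a weighted combination of per-control-point rigid transforms; the generic form (not necessarily the exact MA-GS formulation) is:

```latex
\mu_i(t) \;=\; \sum_{k=1}^{K} w_{ik}\,\bigl(\mathbf{R}_k(t)\,\mu_i + \mathbf{t}_k(t)\bigr),
\qquad \sum_{k=1}^{K} w_{ik} = 1,\; w_{ik} \ge 0,
```

where the per-control-point rotations and translations would be predicted by the deformation field MLP mentioned in the abstract.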

ICLR Conference 2025 Conference Paper

ComPC: Completing a 3D Point Cloud with 2D Diffusion Priors

  • Tianxin Huang
  • Zhiwen Yan
  • Yuyang Zhao
  • Gim Hee Lee

3D point clouds directly collected from objects through sensors are often incomplete due to self-occlusion. Conventional methods for completing these partial point clouds rely on manually organized training sets and are usually limited to object categories seen during training. In this work, we propose a test-time framework for completing partial point clouds across unseen categories without any requirement for training. Leveraging point rendering via Gaussian Splatting, we develop techniques of Partial Gaussian Initialization, Zero-shot Fractal Completion, and Point Cloud Extraction that utilize priors from pre-trained 2D diffusion models to infer missing regions and extract uniform completed point clouds. Experimental results on both synthetic and real-world scanned point clouds demonstrate that our approach outperforms existing methods in completing a variety of objects. Our project page is at https://tianxinhuang.github.io/projects/ComPC/.

NeurIPS Conference 2025 Conference Paper

Deep Gaussian from Motion: Exploring 3D Geometric Foundation Models for Gaussian Splatting

  • Yu Chen
  • Rolandos Alexandros Potamias
  • Evangelos Ververas
  • Jifei Song
  • Jiankang Deng
  • Gim Hee Lee

Neural radiance fields (NeRF) and 3D Gaussian Splatting (3DGS) are popular techniques to reconstruct and render photorealistic images. However, the prerequisite of running Structure-from-Motion (SfM) to get camera poses limits their completeness. Although previous methods can reconstruct a few unposed images, they are not applicable when images are unordered or densely captured. In this work, we propose a method to train 3DGS from unposed images. Our method leverages a pre-trained 3D geometric foundation model as the neural scene representation. Since the accuracy of the predicted pointmaps does not suffice for accurate image registration and high-fidelity image rendering, we propose to mitigate the issue by initializing and fine-tuning the pre-trained model from a seed image. The images are then progressively registered and added to the training buffer, which is used to train the model further. We also propose to refine the camera poses and pointmaps by minimizing a point-to-camera ray consistency loss across multiple views. When evaluated on diverse challenging datasets, our method outperforms state-of-the-art pose-free NeRF/3DGS methods in terms of both camera pose accuracy and novel view synthesis, and even renders higher fidelity images than 3DGS trained with COLMAP poses.

NeurIPS Conference 2025 Conference Paper

Distil-E2D: Distilling Image-to-Depth Priors for Event-Based Monocular Depth Estimation

  • Jie Long Lee
  • Gim Hee Lee

Event cameras are neuromorphic vision sensors that asynchronously capture pixel-level intensity changes with high temporal resolution and dynamic range. These make them well suited for monocular depth estimation under challenging lighting conditions. However, progress in event-based monocular depth estimation remains constrained by the quality of supervision: LiDAR-based depth labels are inherently sparse, spatially incomplete, and prone to artifacts. Consequently, these signals are suboptimal for learning dense depth from sparse events. To address this problem, we propose Distil-E2D, a framework that distills depth priors from the image domain into the event domain by generating dense synthetic pseudolabels from co-recorded APS or RGB frames using foundational depth models. These pseudolabels complement sparse LiDAR depths with dense semantically rich supervision informed by large-scale image-depth datasets. To reconcile discrepancies between synthetic and real depths, we introduce a Confidence-Guided Calibrated Depth Loss that learns nonlinear depth alignment and adaptively weights supervision by alignment confidence. Additionally, our architecture integrates past predictions via a Context Transformer and employs a Dual-Decoder Training scheme that enhances encoder representations by jointly learning metric and relative depth abstractions. Experiments on benchmark datasets show that Distil-E2D achieves state-of-the-art performance in event-based monocular depth estimation across both event-only and event+APS settings.
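
One plausible reading of a confidence-guided calibrated depth loss is a per-frame affine alignment of the dense pseudolabels to the sparse LiDAR depths, with supervision down-weighted when the alignment fits poorly. The sketch below is an illustrative assumption; the paper's actual calibration model and weighting may differ.

```python
# Assumed sketch: least-squares scale/shift calibration of pseudo-depth to LiDAR,
# with a residual-based confidence that down-weights poorly aligned frames.
import torch

def calibrated_pseudo_depth_loss(pred, pseudo, lidar, lidar_mask):
    # pred, pseudo: (B,1,H,W) dense depths; lidar: (B,1,H,W) sparse metric depth; lidar_mask: bool
    losses = []
    for p, q, d, m in zip(pred, pseudo, lidar, lidar_mask):
        x, y = q[m], d[m]                                      # pseudo vs. LiDAR at valid pixels
        A = torch.stack([x, torch.ones_like(x)], dim=1)
        sol = torch.linalg.lstsq(A, y.unsqueeze(1)).solution   # fit scale s and shift t
        s, t = sol[0, 0], sol[1, 0]
        aligned = s * q + t                                    # pseudo-depth calibrated to metric scale
        conf = torch.exp(-(s * x + t - y).abs().mean())        # confidence from alignment residuals
        losses.append(conf * (p - aligned).abs().mean())
    return torch.stack(losses).mean()
```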

NeurIPS Conference 2025 Conference Paper

Dynam3D: Dynamic Layered 3D Tokens Empower VLM for Vision-and-Language Navigation

  • Zihan Wang
  • Seungjun Lee
  • Gim Hee Lee

Vision-and-Language Navigation (VLN) is a core task where embodied agents leverage their spatial mobility to navigate in 3D environments toward designated destinations based on natural language instructions. Recently, video-language large models (Video-VLMs) with strong generalization capabilities and rich commonsense knowledge have shown remarkable performance when applied to VLN tasks. However, these models still encounter the following challenges when applied to real-world 3D navigation: 1) Insufficient understanding of 3D geometry and spatial semantics; 2) Limited capacity for large-scale exploration and long-term environmental memory; 3) Poor adaptability to dynamic and changing environments. To address these limitations, we propose Dynam3D, a dynamic layered 3D representation model that leverages language-aligned, generalizable, and hierarchical 3D representations as visual input to train 3D-VLM in navigation action prediction. Given posed RGB-D images, our Dynam3D projects 2D CLIP features into 3D space and constructs multi-level 3D patch-instance-zone representations for 3D geometric and semantic understanding with a dynamic and layer-wise update strategy. Our Dynam3D is capable of online encoding and localization of 3D instances, and dynamically updates them in changing environments to provide large-scale exploration and long-term memory capabilities for navigation. By leveraging large-scale 3D-language pretraining and task-specific adaptation, our Dynam3D sets new state-of-the-art performance on VLN benchmarks including R2R-CE, REVERIE-CE and NavRAG-CE under monocular settings. Furthermore, experiments on pre-exploration, lifelong memory, and a real-world robot validate its effectiveness for practical deployment.

ICLR Conference 2025 Conference Paper

econSG: Efficient and Multi-view Consistent Open-Vocabulary 3D Semantic Gaussians

  • Can Zhang 0007
  • Gim Hee Lee

The primary focus of most recent works on open-vocabulary neural fields is extracting precise semantic features from VLMs and then consolidating them efficiently into a multi-view consistent 3D neural field representation. However, most existing works over-trust SAM to regularize image-level CLIP without any further refinement. Moreover, several existing works improve efficiency through dimensionality reduction of semantic features from 2D VLMs before fusing them with 3DGS semantic fields, which inevitably leads to multi-view inconsistency. In this work, we propose econSG for open-vocabulary semantic segmentation with 3DGS. Our econSG consists of: 1) A Confidence-region Guided Regularization (CRR) that mutually refines SAM and CLIP to get the best of both worlds for precise semantic features with complete and precise boundaries. 2) A low-dimensional contextual space that enforces 3D multi-view consistency while improving computational efficiency by fusing backprojected multi-view 2D features and then performing dimensionality reduction directly on the fused 3D features instead of operating on each 2D view separately. Our econSG shows state-of-the-art performance on four benchmark datasets compared to existing methods. Furthermore, it is also the most efficient to train among all the methods.

NeurIPS Conference 2025 Conference Paper

Event-Driven Dynamic Scene Depth Completion

  • Zhiqiang Yan
  • Jianhao Jiao
  • Zhengxue Wang
  • Gim Hee Lee

Depth completion in dynamic scenes poses significant challenges due to rapid ego-motion and object motion, which can severely degrade the quality of input modalities such as RGB images and LiDAR measurements. Conventional RGB-D sensors often struggle to align precisely and capture reliable depth under such conditions. In contrast, event cameras with their high temporal resolution and sensitivity to motion at the pixel level provide complementary cues that are beneficial in dynamic environments. To this end, we propose EventDC, the first event-driven depth completion framework. It consists of two key components: Event-Modulated Alignment (EMA) and Local Depth Filtering (LDF). Both modules adaptively learn the two fundamental components of convolution operations: offsets and weights conditioned on motion-sensitive event streams. In the encoder, EMA leverages events to modulate the sampling positions of RGB-D features to achieve pixel redistribution for improved alignment and fusion. In the decoder, LDF refines depth estimations around moving objects by learning motion-aware masks from events. Additionally, EventDC incorporates two loss terms to further benefit global alignment and enhance local depth recovery. Moreover, we establish the first benchmark for event-based depth completion comprising one real-world and two synthetic datasets to facilitate future research. Extensive experiments on this benchmark demonstrate the superiority of our EventDC. Project page.

NeurIPS Conference 2025 Conference Paper

FlexEvent: Towards Flexible Event-Frame Object Detection at Varying Operational Frequencies

  • Dongyue Lu
  • Lingdong Kong
  • Gim Hee Lee
  • Camille Simon Chane
  • Wei Ooi

Event cameras offer unparalleled advantages for real-time perception in dynamic environments, thanks to the microsecond-level temporal resolution and asynchronous operation. Existing event detectors, however, are limited by fixed-frequency paradigms and fail to fully exploit the high-temporal resolution and adaptability of event data. To address these limitations, we propose FlexEvent, a novel framework that enables detection at varying frequencies. Our approach consists of two key components: FlexFuse, an adaptive event-frame fusion module that integrates high-frequency event data with rich semantic information from RGB frames, and FlexTune, a frequency-adaptive fine-tuning mechanism that generates frequency-adjusted labels to enhance model generalization across varying operational frequencies. This combination allows our method to detect objects with high accuracy in both fast-moving and static scenarios, while adapting to dynamic environments. Extensive experiments on large-scale event camera datasets demonstrate that our approach surpasses state-of-the-art methods, achieving significant improvements in both standard and high-frequency settings. Notably, our method maintains robust performance when scaling from 20 Hz to 90 Hz and delivers accurate detection up to 180 Hz, proving its effectiveness in extreme conditions. Our framework sets a new benchmark for event-based object detection and paves the way for more adaptable, real-time vision systems.

ICLR Conference 2025 Conference Paper

Generalizable Human Gaussians from Single-View Image

  • Jinnan Chen
  • Chen Li 0038
  • Jianfeng Zhang
  • Lingting Zhu
  • Buzhen Huang
  • Hanlin Chen
  • Gim Hee Lee

In this work, we tackle the task of learning 3D human Gaussians from a single image, focusing on recovering detailed appearance and geometry including unobserved regions. We introduce a single-view generalizable Human Gaussian Model (HGM), which employs a novel generate-then-refine pipeline with the guidance from human body prior and diffusion prior. Our approach uses a ControlNet to refine rendered back-view images from coarse predicted human Gaussians, then uses the refined image along with the input image to reconstruct refined human Gaussians. To mitigate the potential generation of unrealistic human poses and shapes, we incorporate human priors from the SMPL-X model as a dual branch, propagating image features from the SMPL-X volume to the image Gaussians using sparse convolution and attention mechanisms. Given that the initial SMPL-X estimation might be inaccurate, we gradually refine it with our HGM model. We validate our approach on several publicly available datasets. Our method surpasses previous methods in both novel view synthesis and surface reconstruction. Our approach also exhibits strong generalization for cross-dataset evaluation and in-the-wild images.

ICLR Conference 2025 Conference Paper

GenXD: Generating Any 3D and 4D Scenes

  • Yuyang Zhao
  • Chung-Ching Lin
  • Kevin Lin
  • Zhiwen Yan
  • Linjie Li
  • Zhengyuan Yang
  • Jianfeng Wang
  • Gim Hee Lee

Recent developments in 2D visual generation have been remarkably successful. However, 3D and 4D generation remain challenging in real-world applications due to the lack of large-scale 4D data and effective model design. In this paper, we propose to jointly investigate general 3D and 4D generation by leveraging camera and object movements commonly observed in daily life. Due to the lack of real-world 4D data in the community, we first propose a data curation pipeline to obtain camera poses and object motion strength from videos. Based on this pipeline, we introduce a large-scale real-world 4D scene dataset: CamVid-30K. By leveraging all the 3D and 4D data, we develop our framework, GenXD, which allows us to produce any 3D or 4D scene. We propose multiview-temporal modules, which disentangle camera and object movements, to seamlessly learn from both 3D and 4D data. Additionally, GenXD employs masked latent conditions to support a variety of conditioning views. GenXD can generate videos that follow the camera trajectory as well as consistent 3D views that can be lifted into 3D representations. We perform extensive evaluations across various real-world and synthetic datasets, demonstrating GenXD's effectiveness and versatility compared to previous methods in 3D and 4D generation.

ICRA Conference 2025 Conference Paper

LiLoc: Lifelong Localization Using Adaptive Submap Joining and Egocentric Factor Graph

  • Yixin Fang
  • Yanyan Li 0001
  • Kun Qian
  • Federico Tombari
  • Yue Wang
  • Gim Hee Lee

This paper proposes a versatile graph-based lifelong localization framework using LiDAR, LiLoc, which enhances timeliness by maintaining a single central session while improving accuracy through multi-modal factors between the central and subsidiary sessions. First, an adaptive submap joining strategy is employed to generate prior submaps (keyframes and poses) for the central session, and to provide priors for subsidiaries when constraints are needed for robust localization. Next, a coarse-to-fine pose initialization for subsidiary sessions is performed using vertical recognition and ICP refinement in the global coordinate frame. To elevate the accuracy of subsequent localization, we propose an egocentric factor graph (EFG) module that integrates the IMU preintegration, LiDAR odometry and scan match factors in a joint optimization manner. Specifically, the scan match factors are constructed by a novel propagation model that efficiently distributes the prior constraints as edges to the relevant prior pose nodes, weighted by noise based on keyframe registration errors. Additionally, the framework supports flexible switching between two modes: relocalization (RLM) and incremental localization (ILM), based on the proposed overlap-based mechanism to select or update the prior submaps from the central session. The proposed LiLoc is tested on public and custom datasets, demonstrating accurate localization performance against state-of-the-art methods. Our code will be publicly available at https://github.com/Yixin-F/LiLoc.

NeurIPS Conference 2025 Conference Paper

Motion4D: Learning 3D-Consistent Motion and Semantics for 4D Scene Understanding

  • Haoran Zhou
  • Gim Hee Lee

Recent advancements in foundation models for 2D vision have substantially improved the analysis of dynamic scenes from monocular videos. However, despite their strong generalization capabilities, these models often lack 3D consistency, a fundamental requirement for understanding scene geometry and motion, thereby causing severe spatial misalignment and temporal flickering in complex 3D environments. In this paper, we present Motion4D, a novel framework that addresses these challenges by integrating 2D priors from foundation models into a unified 4D Gaussian Splatting representation. Our method features a two-part iterative optimization framework: 1) Sequential optimization, which updates motion and semantic fields in consecutive stages to maintain local consistency, and 2) Global optimization, which jointly refines all attributes for long-term coherence. To enhance motion accuracy, we introduce a 3D confidence map that dynamically adjusts the motion priors, and an adaptive resampling process that inserts new Gaussians into under-represented regions based on per-pixel RGB and semantic errors. Furthermore, we enhance semantic coherence through an iterative refinement process that resolves semantic inconsistencies by alternately optimizing the semantic fields and updating prompts of SAM2. Extensive evaluations demonstrate that our Motion4D significantly outperforms both 2D foundation models and existing 3D-based approaches across diverse scene understanding tasks, including point-based tracking, video object segmentation, and novel view synthesis. Our code is available at https://hrzhou2.github.io/motion4d-web/.

ICLR Conference 2025 Conference Paper

Neuralized Markov Random Field for Interaction-Aware Stochastic Human Trajectory Prediction

  • Zilin Fang
  • David Hsu
  • Gim Hee Lee

Interactive human motions and the continuously changing nature of intentions pose significant challenges for human trajectory prediction. In this paper, we present a neuralized Markov random field (MRF)-based motion evolution method for probabilistic interaction-aware human trajectory prediction. We use the MRF to model each agent's motion and the resulting crowd interactions over time, which makes our method robust against noisy observations and enables group reasoning. We approximate the modeled distribution using two conditional variational autoencoders (CVAEs) for efficient learning and inference. Our proposed method achieves state-of-the-art performance on ADE/FDE metrics across two dataset categories: overhead datasets ETH/UCY, SDD, and NBA, and ego-centric JRDB. Furthermore, our approach allows for real-time stochastic inference in bustling environments, making it well-suited for a 30 FPS video setting. We open-source our code at: https://github.com/AdaCompNUS/NMRF_TrajectoryPrediction.git

NeurIPS Conference 2025 Conference Paper

PoseCrafter: Extreme Pose Estimation with Hybrid Video Synthesis

  • Qing Mao
  • Tianxin Huang
  • Yu Zhu
  • Jinqiu Sun
  • Yanning Zhang
  • Gim Hee Lee

Pairwise camera pose estimation from sparsely overlapping image pairs remains a critical and unsolved challenge in 3D vision. Most existing methods struggle with image pairs that have small or no overlap. Recent approaches attempt to address this by synthesizing intermediate frames using video interpolation and selecting key frames via a self-consistency score. However, the generated frames are often blurry due to small-overlap inputs, and the selection strategies are slow and not explicitly aligned with pose estimation. To address these issues, we propose Hybrid Video Generation (HVG) to synthesize clearer intermediate frames by coupling a video interpolation model with a pose-conditioned novel view synthesis model, and we further propose a Feature Matching Selector (FMS) based on feature correspondence to select intermediate frames appropriate for pose estimation from the synthesized results. Extensive experiments on Cambridge Landmarks, ScanNet, DL3DV-10K, and NAVI demonstrate that, compared to existing SOTA methods, PoseCrafter clearly improves pose estimation performance, especially on examples with small or no overlap.

ICLR Conference 2025 Conference Paper

Segment Any 3D Object with Language

  • Seungjun Lee
  • Yuyang Zhao
  • Gim Hee Lee

In this paper, we investigate Open-Vocabulary 3D Instance Segmentation (OV-3DIS) with free-form language instructions. Earlier works mainly rely on annotated base categories for training, which leads to limited generalization to unseen novel categories. To mitigate the poor generalizability to novel categories, recent works generate class-agnostic masks or project generalized masks from 2D to 3D, subsequently classifying them with the assistance of a 2D foundation model. However, these works often disregard semantic information in the mask generation, leading to sub-optimal performance. Instead, generating generalizable but semantic-aware masks directly from 3D point clouds would result in superior outcomes. To this end, we introduce Segment any 3D Object with LanguagE (SOLE), which is a semantic and geometric-aware visual-language learning framework with strong generalizability by generating semantic-related masks directly from 3D point clouds. Specifically, we propose a multimodal fusion network to incorporate multimodal semantics in both backbone and decoder. In addition, to align the 3D segmentation model with various language instructions and enhance the mask quality, we introduce three types of multimodal associations as supervision. Our SOLE outperforms previous methods by a large margin on ScanNetv2, ScanNet200, and Replica benchmarks, and the results are even close to those of the fully-supervised counterpart despite the absence of class annotations during training. Furthermore, extensive qualitative results demonstrate the versatility of our SOLE to language instructions. The code will be made publicly available.

NeurIPS Conference 2025 Conference Paper

X-Scene: Large-Scale Driving Scene Generation with High Fidelity and Flexible Controllability

  • Yu Yang
  • Alan Liang
  • Jianbiao Mei
  • Yukai Ma
  • Yong Liu
  • Gim Hee Lee

Diffusion models are advancing autonomous driving by enabling realistic data synthesis, predictive end-to-end planning, and closed-loop simulation, with a primary focus on temporally consistent generation. However, large-scale 3D scene generation requiring spatial coherence remains underexplored. In this paper, we present X-Scene, a novel framework for large-scale driving scene generation that achieves geometric intricacy, appearance fidelity, and flexible controllability. Specifically, X-Scene supports multi-granular control, including low-level layout conditioning driven by user input or text for detailed scene composition, and high-level semantic guidance informed by user intent and LLM-enriched prompts for efficient customization. To enhance geometric and visual fidelity, we introduce a unified pipeline that sequentially generates 3D semantic occupancy and corresponding multi-view images and videos, ensuring alignment and temporal consistency across modalities. We further extend local regions into large-scale scenes via consistency-aware outpainting, which extrapolates occupancy and images from previously generated areas to maintain spatial and visual coherence. The resulting scenes are lifted into high-quality 3DGS representations, supporting diverse applications such as simulation and scene exploration. Extensive experiments demonstrate that X-Scene substantially advances controllability and fidelity in large-scale scene generation, empowering data generation and simulation for autonomous driving.

NeurIPS Conference 2024 Conference Paper

DOGS: Distributed-Oriented Gaussian Splatting for Large-Scale 3D Reconstruction Via Gaussian Consensus

  • Yu Chen
  • Gim Hee Lee

The recent advances in 3D Gaussian Splatting (3DGS) show promising results on the novel view synthesis (NVS) task. With its superior rendering performance and high-fidelity rendering quality, 3DGS excels over its NeRF predecessors. Most recent 3DGS methods focus either on improving rendering efficiency or on reducing the model size. On the other hand, the training efficiency of 3DGS on large-scale scenes has not gained much attention. In this work, we propose DoGaussian, a method that trains 3DGS distributedly. Our method first decomposes a scene into K blocks and then introduces the Alternating Direction Method of Multipliers (ADMM) into the training procedure of 3DGS. During training, our DoGaussian maintains one global 3DGS model on the master node and K local 3DGS models on the slave nodes. The K local 3DGS models are dropped after training and we only query the global 3DGS model during inference. The training time is reduced by scene decomposition, and the training convergence and stability are guaranteed through the consensus on the shared 3D Gaussians. Our method accelerates the training of 3DGS by 6+ times when evaluated on large-scale scenes while concurrently achieving state-of-the-art rendering quality. Our code is publicly available at https://github.com/AIBluefisher/DOGS.
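
For context, the consensus scheme described reads like textbook global-consensus ADMM over the Gaussians shared between blocks; a generic form (the exact penalties and sharing scheme in the paper may differ) is:

```latex
\begin{aligned}
x_k^{t+1} &= \arg\min_{x_k}\; f_k(x_k) + \tfrac{\rho}{2}\,\lVert x_k - z^{t} + u_k^{t}\rVert^2
  && \text{(local 3DGS update on worker } k\text{)}\\
z^{t+1} &= \tfrac{1}{K}\sum_{k=1}^{K}\bigl(x_k^{t+1} + u_k^{t}\bigr)
  && \text{(consensus update of the shared global Gaussians)}\\
u_k^{t+1} &= u_k^{t} + x_k^{t+1} - z^{t+1}
  && \text{(dual update)}
\end{aligned}
```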

NeurIPS Conference 2024 Conference Paper

FreeSplat: Generalizable 3D Gaussian Splatting Towards Free View Synthesis of Indoor Scenes

  • Yunsong Wang
  • Tianxin Huang
  • Hanlin Chen
  • Gim Hee Lee

Empowering 3D Gaussian Splatting with generalization ability is appealing. However, existing generalizable 3D Gaussian Splatting methods are largely confined to narrow-range interpolation between stereo images due to their heavy backbones, thus lacking the ability to accurately localize 3D Gaussians and support free-view synthesis across a wide view range. In this paper, we present FreeSplat, a novel framework that is capable of reconstructing geometrically consistent 3D scenes from long-sequence input towards free-view synthesis. Specifically, we first introduce Low-cost Cross-View Aggregation achieved by constructing adaptive cost volumes among nearby views and aggregating features using a multi-scale structure. Subsequently, we present Pixel-wise Triplet Fusion to eliminate the redundancy of 3D Gaussians in overlapping view regions and to aggregate features observed across multiple views. Additionally, we propose a simple but effective free-view training strategy that ensures robust view synthesis across a broader view range regardless of the number of views. Our empirical results demonstrate state-of-the-art novel view synthesis performance in both the quality of rendered color maps and the accuracy of depth maps across different numbers of input views. We also show that FreeSplat performs inference more efficiently and can effectively reduce redundant Gaussians, offering the possibility of feed-forward large scene reconstruction without depth priors. Our code will be made open-source upon paper acceptance.

NeurIPS Conference 2024 Conference Paper

Learning to Decouple the Lights for 3D Face Texture Modeling

  • Tianxin Huang
  • Zhenyu Zhang
  • Ying Tai
  • Gim Hee Lee

Existing research has made impressive strides in reconstructing human facial shapes and textures from images with well-illuminated faces and minimal external occlusions. Nevertheless, it remains challenging to recover accurate facial textures from scenarios with complicated illumination affected by external occlusions, e.g. a face that is partially obscured by items such as a hat. Existing works based on the assumption of single and uniform illumination cannot correctly process these data. In this work, we introduce a novel approach to model 3D facial textures under such unnatural illumination. Instead of assuming single illumination, our framework learns to imitate the unnatural illumination as a composition of multiple separate light conditions combined with learned neural representations, named Light Decoupling. According to experiments on both single images and video sequences, we demonstrate the effectiveness of our approach in modeling facial textures under challenging illumination affected by occlusions.

NeurIPS Conference 2024 Conference Paper

MVSDet: Multi-View Indoor 3D Object Detection via Efficient Plane Sweeps

  • Yating Xu
  • Chen Li
  • Gim Hee Lee

The key challenge of multi-view indoor 3D object detection is to infer accurate geometry information from images for precise 3D detection. Previous methods rely on NeRF for geometry reasoning. However, the geometry extracted from NeRF is generally inaccurate, which leads to sub-optimal detection performance. In this paper, we propose MVSDet which utilizes plane sweep for geometry-aware 3D object detection. To circumvent the requirement for a large number of depth planes for accurate depth prediction, we design a probabilistic sampling and soft weighting mechanism to decide the placement of pixel features on the 3D volume. We select multiple locations that score top in the probability volume for each pixel and use their probability score to indicate the confidence. We further apply recent pixel-aligned Gaussian Splatting to regularize depth prediction and improve detection performance with little computation overhead. Extensive experiments on ScanNet and ARKitScenes datasets are conducted to show the superiority of our model. Our code is available at https://github.com/Pixie8888/MVSDet.
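
A minimal sketch of top-k probabilistic depth placement with soft weighting is given below; the tensor shapes and selection rule are illustrative assumptions rather than MVSDet's actual implementation.

```python
# Assumed sketch: pick the top-k depth planes per pixel from a plane-sweep probability
# volume and use their (renormalized) probabilities as soft weights for feature placement.
import torch

def place_features(feat, prob_volume, depth_values, k=3):
    # feat: (B,C,H,W); prob_volume: (B,D,H,W) softmaxed over D planes; depth_values: (D,)
    topk_prob, topk_idx = prob_volume.topk(k, dim=1)            # (B,k,H,W)
    topk_depth = depth_values[topk_idx]                          # candidate depths per pixel
    weights = topk_prob / topk_prob.sum(dim=1, keepdim=True)     # soft confidence weights
    weighted_feat = feat.unsqueeze(1) * weights.unsqueeze(2)     # (B,k,C,H,W) features to scatter
    return topk_depth, weights, weighted_feat
```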

AAAI Conference 2024 Conference Paper

Robust Visual Recognition with Class-Imbalanced Open-World Noisy Data

  • Na Zhao
  • Gim Hee Lee

Learning from open-world noisy data, where both closed-set and open-set noise co-exist in the dataset, is a realistic but underexplored setting. Only recently, several efforts have been initiated to tackle this problem. However, these works assume the classes are balanced when dealing with open-world noisy data. This assumption often violates the nature of real-world large-scale datasets, where the label distributions are generally long-tailed, i.e. class-imbalanced. In this paper, we study the problem of robust visual recognition with class-imbalanced open-world noisy data. We propose a probabilistic graphical model-based approach: iMRF to achieve label noise correction that is robust to class imbalance via an efficient iterative inference of a Markov Random Field (MRF) in each training mini-batch. Furthermore, we design an agreement-based thresholding strategy to adaptively collect clean samples from all classes that includes corrected closed-set noisy samples while rejecting open-set noisy samples. We also introduce a noise-aware balanced cross-entropy loss to explicitly eliminate the bias caused by class-imbalanced data. Extensive experiments on several benchmark datasets including synthetic and real-world noisy datasets demonstrate the superior performance and robustness of our method over existing methods. Our code is available at https://github.com/Na-Z/LIOND.

NeurIPS Conference 2024 Conference Paper

VCR-GauS: View Consistent Depth-Normal Regularizer for Gaussian Surface Reconstruction

  • Hanlin Chen
  • Fangyin Wei
  • Chen Li
  • Tianxin Huang
  • Yunsong Wang
  • Gim Hee Lee

Although 3D Gaussian Splatting has been widely studied because of its realistic and efficient novel-view synthesis, it is still challenging to extract a high-quality surface from the point-based representation. Previous works improve the surface by incorporating geometric priors from the off-the-shelf normal estimator. However, there are two main limitations: 1) Supervising normal rendered from 3D Gaussians updates only the rotation parameter while neglecting other geometric parameters; 2) The inconsistency of predicted normal maps across multiple views may lead to severe reconstruction artifacts. In this paper, we propose a Depth-Normal regularizer that directly couples normal with other geometric parameters, leading to full updates of the geometric parameters from normal regularization. We further propose a confidence term to mitigate inconsistencies of normal predictions across multiple views. Moreover, we also introduce a densification and splitting strategy to regularize the size and distribution of 3D Gaussians for more accurate surface modeling. Compared with Gaussian-based baselines, experiments show that our approach obtains better reconstruction quality and maintains competitive appearance quality at faster training speed and 100+ FPS rendering. Our code will be made open-source upon paper acceptance.

NeurIPS Conference 2024 Conference Paper

X-Ray: A Sequential 3D Representation For Generation

  • Tao Hu
  • Wenhang Ge
  • Yuyang Zhao
  • Gim Hee Lee

We introduce X-Ray, a novel 3D sequential representation inspired by the penetrability of x-ray scans. X-Ray transforms a 3D object into a series of surface frames at different layers, making it suitable for generating 3D models from images. Our method utilizes ray casting from the camera center to capture geometric and textured details, including depth, normal, and color, across all intersected surfaces. This process efficiently condenses the whole 3D object into a multi-frame video format, motivating the use of a network architecture similar to those in video diffusion models. This design ensures an efficient 3D representation by focusing solely on surface information. Also, we propose a two-stage pipeline to generate 3D objects with an X-Ray Diffusion Model and an Upsampler. We demonstrate the practicality and adaptability of our X-Ray representation by synthesizing the complete visible and hidden surfaces of a 3D object from a single input image. Experimental results reveal the state-of-the-art superiority of our representation in enhancing the accuracy of 3D generation, paving the way for new 3D representation research and practical applications. Our project page is at https://tau-yihouxiang.github.io/projects/X-Ray/X-Ray.html.

ICRA Conference 2023 Conference Paper

AdaSfM: From Coarse Global to Fine Incremental Adaptive Structure from Motion

  • Yu Chen
  • Zihao Yu
  • Shu Song
  • Tianning Yu
  • Jianming Li
  • Gim Hee Lee

Despite the impressive results achieved by many existing Structure from Motion (SfM) approaches, there is still a need to improve the robustness, accuracy, and efficiency on large-scale scenes with many outlier matches and sparse view graphs. In this paper, we propose AdaSfM: a coarse-to-fine adaptive SfM approach that is scalable to large-scale and challenging datasets. Our approach first performs a coarse global SfM that improves the reliability of the view graph by leveraging measurements from low-cost sensors such as Inertial Measurement Units (IMUs) and wheel encoders. Subsequently, the view graph is divided into sub-scenes that are refined in parallel by a fine local incremental SfM regularised by the result from the coarse global SfM to improve the camera registration accuracy and alleviate scene drifts. Finally, our approach uses a threshold-adaptive strategy to align all local reconstructions to the coordinate frame of global SfM. Extensive experiments on large-scale benchmark datasets show that our approach achieves state-of-the-art accuracy and efficiency. [Project Page]

NeurIPS Conference 2023 Conference Paper

GNeSF: Generalizable Neural Semantic Fields

  • Hanlin Chen
  • Chen Li
  • Mengqi Guo
  • Zhiwen Yan
  • Gim Hee Lee

3D scene segmentation based on neural implicit representation has emerged recently with the advantage of training only on 2D supervision. However, existing approaches still require expensive per-scene optimization, which prohibits generalization to novel scenes during inference. To circumvent this problem, we introduce a generalizable 3D segmentation framework based on implicit representation. Specifically, our framework takes in multi-view image features and semantic maps as the inputs instead of only spatial information to avoid overfitting to scene-specific geometric and semantic information. We propose a novel soft voting mechanism to aggregate the 2D semantic information from different views for each 3D point. In addition to the image features, view difference information is also encoded in our framework to predict the voting scores. Intuitively, this allows the semantic information from nearby views to contribute more compared to distant ones. Furthermore, a visibility module is also designed to detect and filter out detrimental information from occluded views. Due to the generalizability of our proposed method, we can synthesize semantic maps or conduct 3D semantic segmentation for novel scenes with solely 2D semantic supervision. Experimental results show that our approach achieves comparable performance with scene-specific approaches. More importantly, our approach can even outperform existing strong supervision-based approaches with only 2D annotations.
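
For illustration, a per-point soft-voting step could look like the sketch below, where the scoring MLP and the view-difference encoding are assumptions rather than the paper's architecture.

```python
# Assumed sketch: weight each view's 2D semantic prediction by a learned score that
# depends on image features and view difference, masking out occluded views.
import torch

def soft_vote(semantics, view_feats, view_diff, visibility, score_mlp):
    # semantics: (V,C) per-view class probabilities for one 3D point
    # view_feats: (V,F) image features; view_diff: (V,E) view-difference encoding
    # visibility: (V,) 1 if the point is visible in the view, else 0
    scores = score_mlp(torch.cat([view_feats, view_diff], dim=-1)).squeeze(-1)  # (V,)
    scores = scores.masked_fill(visibility == 0, float('-inf'))  # filter occluded views
    weights = torch.softmax(scores, dim=0)                        # soft voting weights
    return (weights.unsqueeze(-1) * semantics).sum(dim=0)         # fused class distribution
```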

AAAI Conference 2023 Conference Paper

Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning

  • Na Dong
  • Yongqiang Zhang
  • Mingli Ding
  • Gim Hee Lee

Incremental few-shot object detection aims at detecting novel classes without forgetting knowledge of the base classes with only a few labeled training data from the novel classes. Most related prior works are on incremental object detection and rely on the availability of abundant training samples per novel class, which substantially limits scalability to real-world settings where novel data can be scarce. In this paper, we propose the Incremental-DETR that does incremental few-shot object detection via fine-tuning and self-supervised learning on the DETR object detector. To alleviate severe over-fitting with few novel class data, we first fine-tune the class-specific components of DETR with self-supervision from additional object proposals generated using Selective Search as pseudo labels. We further introduce an incremental few-shot fine-tuning strategy with knowledge distillation on the class-specific components of DETR to encourage the network to detect novel classes without forgetting the base classes. Extensive experiments conducted on standard incremental object detection and incremental few-shot object detection settings show that our approach significantly outperforms state-of-the-art methods by a large margin. Our source code is available at https://github.com/dongnana777/Incremental-DETR.

NeurIPS Conference 2023 Conference Paper

NU-MCC: Multiview Compressive Coding with Neighborhood Decoder and Repulsive UDF

  • Stefan Lionar
  • Xiangyu Xu
  • Min Lin
  • Gim Hee Lee

Remarkable progress has been made in 3D reconstruction from single-view RGB-D inputs. MCC is the current state-of-the-art method in this field, which achieves unprecedented success by combining vision Transformers with large-scale training. However, we identified two key limitations of MCC: 1) The Transformer decoder is inefficient in handling a large number of query points; 2) The 3D representation struggles to recover high-fidelity details. In this paper, we propose a new approach called NU-MCC that addresses these limitations. NU-MCC includes two key innovations: a Neighborhood decoder and a Repulsive Unsigned Distance Function (Repulsive UDF). First, our Neighborhood decoder introduces center points as an efficient proxy of input visual features, allowing each query point to only attend to a small neighborhood. This design not only results in much faster inference speed but also enables the exploitation of finer-scale visual features for improved recovery of 3D textures. Second, our Repulsive UDF is a novel alternative to the occupancy field used in MCC, significantly improving the quality of 3D object reconstruction. Compared to standard UDFs that suffer from holes in results, our proposed Repulsive UDF can achieve more complete surface reconstruction. Experimental results demonstrate that NU-MCC is able to learn a strong 3D representation, significantly advancing the state of the art in single-view 3D reconstruction. In particular, it outperforms MCC by 9.7% in terms of F1-score on the CO3D-v2 dataset with a more than 5x faster running speed.

NeurIPS Conference 2022 Conference Paper

Adversarial Style Augmentation for Domain Generalized Urban-Scene Segmentation

  • Zhun Zhong
  • Yuyang Zhao
  • Gim Hee Lee
  • Nicu Sebe

In this paper, we consider the problem of domain generalization in semantic segmentation, which aims to learn a robust model using only labeled synthetic (source) data. The model is expected to perform well on unseen real (target) domains. Our study finds that the image style variation can largely influence the model's performance and that the style features can be well represented by the channel-wise mean and standard deviation of images. Inspired by this, we propose a novel adversarial style augmentation (AdvStyle) approach, which can dynamically generate hard stylized images during training and thus can effectively prevent the model from overfitting on the source domain. Specifically, AdvStyle regards the style feature as a learnable parameter and updates it by adversarial training. The learned adversarial style feature is used to construct an adversarial image for robust model training. AdvStyle is easy to implement and can be readily applied to different models. Experiments on two synthetic-to-real semantic segmentation benchmarks demonstrate that AdvStyle can significantly improve the model performance on unseen real domains and show that it achieves the state of the art. Moreover, AdvStyle can be employed for domain generalized image classification and produces a clear improvement on the considered datasets.
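
A condensed sketch of the adversarial style step as described (channel-wise style statistics treated as learnable and pushed one gradient-ascent step uphill) is shown below; the step size and where the perturbation is applied are assumptions.

```python
# Assumed sketch of adversarial style augmentation on an input batch.
import torch

def adv_style(images, model, labels, criterion, lr=1.0):
    # images: (B,3,H,W); returns a hard-stylized copy of the batch
    mu = images.mean(dim=(2, 3), keepdim=True)
    sigma = images.std(dim=(2, 3), keepdim=True) + 1e-6
    normalized = (images - mu) / sigma

    adv_mu = mu.clone().requires_grad_(True)
    adv_sigma = sigma.clone().requires_grad_(True)
    loss = criterion(model(adv_sigma * normalized + adv_mu), labels)
    g_mu, g_sigma = torch.autograd.grad(loss, [adv_mu, adv_sigma])

    # Gradient *ascent* on the style statistics makes the restyled batch harder.
    return ((adv_sigma + lr * g_sigma) * normalized + (adv_mu + lr * g_mu)).detach()
```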

AAAI Conference 2022 Conference Paper

Static-Dynamic Co-teaching for Class-Incremental 3D Object Detection

  • Na Zhao
  • Gim Hee Lee

Deep learning-based approaches have shown remarkable performance in the 3D object detection task. However, they suffer from a catastrophic performance drop on the originally trained classes when incrementally learning new classes without revisiting the old data. This “catastrophic forgetting” phenomenon impedes the deployment of 3D object detection approaches in real-world scenarios, where continuous learning systems are needed. In this paper, we study the unexplored yet important class-incremental 3D object detection problem and present the first solution - SDCoT, a novel static-dynamic co-teaching method. Our SDCoT alleviates the catastrophic forgetting of old classes via a static teacher, which provides pseudo annotations for old classes in the new samples and regularizes the current model by extracting previous knowledge with a distillation loss. At the same time, SDCoT consistently learns the underlying knowledge from new data via a dynamic teacher. We conduct extensive experiments on two benchmark datasets and demonstrate the superior performance of our SDCoT over baseline approaches in several incremental learning scenarios. Our code is available at https://github.com/Na-Z/SDCoT.

IROS Conference 2021 Conference Paper

A General Framework for Lifelong Localization and Mapping in Changing Environment

  • Min Zhao
  • Xin Guo
  • Le Song
  • Baoxing Qin
  • Xuesong Shi
  • Gim Hee Lee
  • Guanghui Sun

The environment of most real-world scenarios such as malls and supermarkets changes at all times. A pre-built map that does not account for these changes becomes out-of-date easily. Therefore, it is necessary to have an up-to-date model of the environment to facilitate long-term operation of a robot. To this end, this paper presents a general lifelong simultaneous localization and mapping (SLAM) framework. Our framework uses a multiple session map representation, and exploits an efficient map updating strategy that includes map building, pose graph refinement and sparsification. To mitigate the unbounded increase of memory usage, we propose a map-trimming method based on the Chow-Liu maximum-mutual-information spanning tree. The proposed SLAM framework has been comprehensively validated by over a month of robot deployment in a real supermarket environment. Furthermore, we release a dataset collected from indoor and outdoor changing environments in the hope of accelerating lifelong SLAM research in the community. Our dataset is available at https://github.com/sanduan168/lifelong-SLAM-dataset.

NeurIPS Conference 2021 Conference Paper

Bridging Non Co-occurrence with Unlabeled In-the-wild Data for Incremental Object Detection

  • Na Dong
  • Yongqiang Zhang
  • Mingli Ding
  • Gim Hee Lee

Deep networks have shown remarkable results in the task of object detection. However, their performance suffers critical drops when they are subsequently trained on novel classes without any sample from the base classes originally used to train the model. This phenomenon is known as catastrophic forgetting. Recently, several incremental learning methods are proposed to mitigate catastrophic forgetting for object detection. Despite the effectiveness, these methods require co-occurrence of the unlabeled base classes in the training data of the novel classes. This requirement is impractical in many real-world settings since the base classes do not necessarily co-occur with the novel classes. In view of this limitation, we consider a more practical setting of complete absence of co-occurrence of the base and novel classes for the object detection task. We propose the use of unlabeled in-the-wild data to bridge the non co-occurrence caused by the missing base classes during the training of additional novel classes. To this end, we introduce a blind sampling strategy based on the responses of the base-class model and pre-trained novel-class model to select a smaller relevant dataset from the large in-the-wild dataset for incremental learning. We then design a dual-teacher distillation framework to transfer the knowledge distilled from the base- and novel-class teacher models to the student model using the sampled in-the-wild data. Experimental results on the PASCAL VOC and MS COCO datasets show that our proposed method significantly outperforms other state-of-the-art class-incremental object detection methods when there is no co-occurrence between the base and novel classes during training.

ICRA Conference 2021 Conference Paper

City-scale Scene Change Detection using Point Clouds

  • Zi Jian Yew
  • Gim Hee Lee

We propose a method for detecting structural changes in a city using images captured from vehicular mounted cameras over traversals at two different times. We first generate 3D point clouds for each traversal from the images and approximate GNSS/INS readings using Structure-from-Motion (SfM). A direct comparison of the two point clouds for change detection is not ideal due to inaccurate geo-location information and possible drifts in the SfM. To circumvent this problem, we propose a deep learning-based non-rigid registration on the point clouds which allows us to compare the point clouds for structural change detection in the scene. Furthermore, we introduce a dual thresholding check and post-processing step to enhance the robustness of our method. We collect two datasets for the evaluation of our approach. Experiments show that our method is able to detect scene changes effectively, even in the presence of viewpoint and illumination differences.

NeurIPS Conference 2021 Conference Paper

Coarse-to-fine Animal Pose and Shape Estimation

  • Chen Li
  • Gim Hee Lee

Most existing animal pose and shape estimation approaches reconstruct animal meshes with a parametric SMAL model. This is because the low-dimensional pose and shape parameters of the SMAL model make it easier for deep networks to learn the high-dimensional animal meshes. However, the SMAL model is learned from scans of toy animals with limited pose and shape variations, and thus may not be able to represent highly varying real animals well. This may result in poor fitting of the estimated meshes to the 2D evidence, e.g., 2D keypoints or silhouettes. To mitigate this problem, we propose a coarse-to-fine approach to reconstruct 3D animal mesh from a single image. The coarse estimation stage first estimates the pose, shape and translation parameters of the SMAL model. The estimated meshes are then used as a starting point by a graph convolutional network (GCN) to predict a per-vertex deformation in the refinement stage. This combination of SMAL-based and vertex-based representations benefits from both parametric and non-parametric representations. We design our mesh refinement GCN (MRGCN) as an encoder-decoder structure with hierarchical feature representations to overcome the limited receptive field of traditional GCNs. Moreover, we observe that the global image feature used by existing animal mesh reconstruction works is unable to capture detailed shape information for mesh refinement. We thus introduce a local feature extractor to retrieve a vertex-level feature and use it together with the global feature as the input of the MRGCN. We test our approach on the StanfordExtra dataset and achieve state-of-the-art results. Furthermore, we test the generalization capacity of our approach on the Animal Pose and BADJA datasets. Our code is available at the project website.

ICML Conference 2021 Conference Paper

FILTRA: Rethinking Steerable CNN by Filter Transform

  • Bo Li
  • Qili Wang
  • Gim Hee Lee

Steerable CNNs impose prior knowledge of transformation invariance or equivariance in the network architecture to enhance the network's robustness to geometric transformations of the data and reduce overfitting. Constructing a steerable filter by augmenting a filter with its transformed copies has been an intuitive and widely used technique over the past decades, which we name filter transform in this paper. Recently, the problem of steerable CNNs has been studied from the perspective of group representation theory, which reveals the function space structure of a steerable kernel function. However, it is not yet clear how this theory relates to the filter transform technique. In this paper, we show that kernels constructed by filter transform can also be interpreted within group representation theory. This interpretation helps complete the puzzle of steerable CNN theory and provides a novel and simple approach to implement steerable convolution operators. Experiments are executed on multiple datasets to verify the feasibility of the proposed approach.
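
The filter transform construction analyzed here amounts to stacking transformed copies of a single base filter. A minimal sketch under that reading is below; the random 5x5 filter and the 4-fold rotation group are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from scipy.ndimage import rotate

# Sketch of the filter-transform construction: a steerable filter bank is
# obtained by stacking rotated copies of a single base filter. The base
# filter and the cyclic rotation group C4 below are illustrative choices.

base = np.random.randn(5, 5).astype(np.float32)   # one learnable 5x5 filter
angles = [0, 90, 180, 270]                         # elements of C4 in degrees

bank = np.stack([rotate(base, a, reshape=False, order=1) for a in angles])
print(bank.shape)  # (4, 5, 5): one transformed copy per group element
```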

ICRA Conference 2021 Conference Paper

Learning Spatial Context with Graph Neural Network for Multi-Person Pose Grouping

  • Jiahao Lin
  • Gim Hee Lee

Bottom-up approaches for image-based multi-person pose estimation consist of two stages: (1) keypoint detection and (2) grouping of the detected keypoints to form person instances. Current grouping approaches rely on learned embedding from only visual features that completely ignore the spatial configuration of human poses. In this work, we formulate the grouping task as a graph partitioning problem, where we learn the affinity matrix with a Graph Neural Network (GNN). More specifically, we design a Geometry-aware Association GNN that utilizes spatial information of the keypoints and learns local affinity from the global context. The learned geometry-based affinity is further fused with appearance-based affinity to achieve robust keypoint association. Spectral clustering is used to partition the graph for the formation of the pose instances. Experimental results on two benchmark datasets show that our proposed method outperforms existing appearance-only grouping frameworks, which shows the effectiveness of utilizing spatial context for robust grouping. Source code is available at: https://github.com/jiahaoLjh/PoseGrouping.
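
As a rough sketch of the grouping step, the two affinities can be fused and the keypoint graph partitioned with off-the-shelf spectral clustering; the random affinity matrices and the equal fusion weights below are illustrative assumptions, not the paper's learned components.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

# Sketch of the grouping step: fuse two keypoint affinity matrices and
# partition the keypoint graph with spectral clustering. The random
# affinities and the 0.5/0.5 fusion weights are illustrative assumptions.

n_keypoints, n_persons = 12, 3
rng = np.random.default_rng(0)
A_geo = rng.random((n_keypoints, n_keypoints))   # geometry-based affinity
A_app = rng.random((n_keypoints, n_keypoints))   # appearance-based affinity

A = 0.5 * A_geo + 0.5 * A_app
A = 0.5 * (A + A.T)              # the affinity matrix must be symmetric
np.fill_diagonal(A, 1.0)

labels = SpectralClustering(n_clusters=n_persons,
                            affinity="precomputed",
                            random_state=0).fit_predict(A)
print(labels)                    # one person id per detected keypoint
```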

ICLR Conference 2020 Conference Paper

Identifying through Flows for Recovering Latent Representations

  • Shen Li 0004
  • Bryan Hooi
  • Gim Hee Lee

Identifiability, or recovery of the true latent representations from which the observed data originates, is de facto a fundamental goal of representation learning. Yet, most deep generative models do not address the question of identifiability, and thus fail to deliver on the promise of recovering the true latent sources that generate the observations. Recent work proposed identifiable generative modelling using variational autoencoders (iVAE) with a theory of identifiability. Due to the intractability of the KL divergence between the variational approximate posterior and the true posterior, however, iVAE has to maximize the evidence lower bound (ELBO) of the marginal likelihood, leading to suboptimal solutions in both theory and practice. In contrast, we propose an identifiable framework for estimating latent representations using a flow-based model (iFlow). Our approach directly maximizes the marginal likelihood, allowing for theoretical guarantees on identifiability, thereby dispensing with variational approximations. We derive its optimization objective in analytical form, making it possible to train iFlow in an end-to-end manner. Simulations on synthetic data validate the correctness and effectiveness of our proposed method and demonstrate its practical advantages over other existing methods.
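
The objective a flow-based model maximizes is the exact change-of-variables likelihood rather than an ELBO. A toy worked example with a simple elementwise affine flow is shown below; the flow and its parameters are illustrative, not the model used in the paper.

```python
import numpy as np

# Toy illustration of the exact objective a flow model maximizes:
# log p(x) = log p_Z(f(x)) + log |det df/dx|, here with a simple
# elementwise affine flow z = a * x + b (a and b are illustrative).

def standard_normal_logpdf(z):
    return -0.5 * (z ** 2 + np.log(2 * np.pi)).sum(axis=-1)

def affine_flow_loglik(x, a, b):
    z = a * x + b                         # forward map f(x)
    log_det = np.log(np.abs(a)).sum()     # log |det Jacobian| of f
    return standard_normal_logpdf(z) + log_det

x = np.random.randn(4, 3)                 # a mini-batch of observations
a = np.array([0.5, 2.0, 1.5])
b = np.array([0.1, -0.3, 0.0])
print(affine_flow_loglik(x, a, b))         # exact log-likelihood per sample
```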

IROS Conference 2020 Conference Paper

Point Cloud Completion by Learning Shape Priors

  • Xiaogang Wang 0008
  • Marcelo H. Ang
  • Gim Hee Lee

In view of the difficulty in reconstructing object details in point cloud completion, we propose a shape prior learning method for object completion. The shape priors include geometric information from both the complete and partial point clouds. We design a feature alignment strategy to learn the shape prior from complete points, and a coarse-to-fine strategy to incorporate the partial prior in the fine stage. To learn the complete object prior, we first train a point cloud auto-encoder to extract latent embeddings from complete points. Then we learn a mapping to transfer the point features from partial points to those of the complete points by optimizing feature alignment losses. The feature alignment losses consist of an L2 distance and an adversarial loss obtained by a Maximum Mean Discrepancy Generative Adversarial Network (MMD-GAN). The L2 distance optimizes the partial features towards the complete ones in the feature space, and the MMD-GAN decreases the statistical distance between the two point features in a Reproducing Kernel Hilbert Space. We achieve state-of-the-art performance on the point cloud completion task. Our code is available at https://github.com/xiaogangw/point-cloud-completion-shape-prior.
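
A minimal sketch of the two alignment terms is given below: an L2 distance between partial and complete latent codes plus a kernel MMD between the two feature sets. Using a fixed Gaussian kernel here is a simplification of the MMD-GAN critic described in the abstract, and the random features are placeholders.

```python
import numpy as np

# Sketch of the two feature-alignment terms: an L2 distance between partial
# and complete latent codes, and a kernel MMD between the two feature sets.
# A fixed Gaussian kernel is used in place of the learned MMD-GAN critic.

def gaussian_kernel(x, y, sigma=1.0):
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    return (gaussian_kernel(x, x, sigma).mean()
            + gaussian_kernel(y, y, sigma).mean()
            - 2 * gaussian_kernel(x, y, sigma).mean())

partial_feat = np.random.randn(8, 128)    # latent codes from partial shapes
complete_feat = np.random.randn(8, 128)   # latent codes from complete shapes

l2_loss = ((partial_feat - complete_feat) ** 2).sum(axis=1).mean()
alignment_loss = l2_loss + mmd2(partial_feat, complete_feat)
print(alignment_loss)
```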

ICRA Conference 2020 Conference Paper

PointAtrousGraph: Deep Hierarchical Encoder-Decoder with Point Atrous Convolution for Unorganized 3D Points

  • Liang Pan
  • Chee-Meng Chew
  • Gim Hee Lee

Motivated by the success of encoding multi-scale contextual information for image analysis, we propose our PointAtrousGraph (PAG) - a deep permutation-invariant hierarchical encoder-decoder for efficiently exploiting multi-scale edge features in point clouds. Our PAG is constructed from several novel modules, such as Point Atrous Convolution (PAC), Edge-preserved Pooling (EP) and Edge-preserved Unpooling (EU). Similar to atrous convolution, our PAC can effectively enlarge the receptive fields of filters and thus densely learn multi-scale point features. Following the idea of non-overlapping max-pooling operations, we propose our EP to preserve critical edge features during subsampling. Correspondingly, our EU modules gradually recover spatial information for edge features. In addition, we introduce chained skip subsampling/upsampling modules that directly propagate edge features to the final stage. Particularly, our proposed auxiliary loss functions can further improve our performance. Experimental results show that our PAG outperforms previous state-of-the-art methods on various 3D semantic perception applications.
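
Point atrous convolution rests on a dilated neighbourhood selection: for each point, keep every d-th neighbour among its k*d nearest neighbours, so the receptive field grows without adding neighbours. A small sketch of that sampling rule follows; the values of k, the dilation rate and the random point cloud are illustrative.

```python
import numpy as np

# Sketch of the dilated neighbourhood selection underlying Point Atrous
# Convolution: for each point, rank all other points by distance and keep
# every d-th neighbour among the k*d nearest ones, enlarging the receptive
# field at no extra cost. k and the dilation rate d are illustrative.

def atrous_knn(points, k=4, d=2):
    d2 = ((points[:, None, :] - points[None, :, :]) ** 2).sum(-1)
    order = np.argsort(d2, axis=1)[:, 1:]      # drop the point itself
    return order[:, : k * d : d]               # every d-th of the k*d nearest

pts = np.random.rand(64, 3)
idx = atrous_knn(pts, k=4, d=2)
print(idx.shape)   # (64, 4): dilated neighbour indices per point
```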

ICRA Conference 2020 Conference Paper

Robust 6D Object Pose Estimation by Learning RGB-D Features

  • Meng Tian
  • Liang Pan
  • Marcelo H. Ang
  • Gim Hee Lee

Accurate 6D object pose estimation is fundamental to robotic manipulation and grasping. Previous methods follow a local optimization approach which minimizes the distance between closest point pairs to handle the rotation ambiguity of symmetric objects. In this work, we propose a novel discrete-continuous formulation for rotation regression to resolve this local-optimum problem. We uniformly sample rotation anchors in SO(3), and predict a constrained deviation from each anchor to the target, as well as uncertainty scores for selecting the best prediction. Additionally, the object location is detected by aggregating point-wise vectors pointing to the 3D center. Experiments on two benchmarks, LINEMOD and YCB-Video, show that the proposed method outperforms state-of-the-art approaches. Our code is available at https://github.com/mentian/object-posenet.
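
A rough sketch of how such a discrete-continuous rotation could be decoded is shown below: each sampled anchor in SO(3) is composed with its predicted small deviation, and the anchor with the highest confidence is kept. The anchors, deviations and scores stand in for network outputs and are illustrative only.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

# Sketch of decoding a discrete-continuous rotation: every anchor sampled in
# SO(3) gets a predicted small deviation and a confidence score, and the final
# rotation composes the best-scoring anchor with its deviation. The anchors,
# deviations and scores below are placeholders for network predictions.

anchors = R.random(60)                                     # sampled SO(3) anchors
deviations = R.from_rotvec(0.05 * np.random.randn(60, 3))  # small per-anchor offsets
scores = np.random.rand(60)                                # per-anchor confidence

best = int(np.argmax(scores))
R_pred = anchors[best] * deviations[best]                  # compose anchor and deviation
print(R_pred.as_matrix())
```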

ICRA Conference 2019 Conference Paper

2D3D-Matchnet: Learning To Match Keypoints Across 2D Image And 3D Point Cloud

  • Mengdan Feng
  • Sixing Hu
  • Marcelo H. Ang
  • Gim Hee Lee

Large-scale point clouds generated from 3D sensors are more accurate than their image-based counterparts. However, they are seldom used in visual pose estimation due to the difficulty in obtaining 2D-3D image-to-point-cloud correspondences. In this paper, we propose the 2D3D-MatchNet - an end-to-end deep network architecture to jointly learn the descriptors for 2D and 3D keypoints from the image and point cloud, respectively. As a result, we are able to directly match and establish 2D-3D correspondences between the query image and the 3D point cloud reference map for visual pose estimation. We create our Oxford 2D-3D Patches dataset from the Oxford Robotcar dataset with ground truth camera poses and 2D-3D image-to-point-cloud correspondences for training and testing the deep network. Experimental results verify the feasibility of our approach.

IROS Conference 2019 Conference Paper

Degeneracy in Self-Calibration Revisited and a Deep Learning Solution for Uncalibrated SLAM

  • Bingbing Zhuang
  • Quoc-Huy Tran
  • Gim Hee Lee
  • Loong Fah Cheong
  • Manmohan Chandraker

Self-calibration of camera intrinsics and radial distortion has a long history of research in the computer vision community. However, it remains rare to see real applications of such techniques to modern Simultaneous Localization And Mapping (SLAM) systems, especially in driving scenarios. In this paper, we revisit the geometric approach to this problem, and provide a theoretical proof that explicitly shows the ambiguity between radial distortion and scene depth when two-view geometry is used to self-calibrate the radial distortion. In view of such geometric degeneracy, we propose a learning approach that trains a convolutional neural network (CNN) on a large amount of synthetic data. We demonstrate the utility of our proposed method by applying it as a checkerboard-free calibration tool for SLAM, achieving comparable or superior performance to previous learning and hand-crafted methods.

ICRA Conference 2019 Conference Paper

Discrete Rotation Equivariance for Point Cloud Recognition

  • Jiaxin Li
  • Yingcai Bi
  • Gim Hee Lee

Despite the recent active research on processing point clouds with deep networks, little attention has been paid to the sensitivity of the networks to rotations. In this paper, we propose a deep learning architecture that achieves discrete SO(2)/SO(3) rotation equivariance for point cloud recognition. Specifically, rotating an input point cloud with elements of a rotation group corresponds to a shuffling of the feature vectors generated by our approach. The equivariance is easily reduced to invariance by eliminating the permutation with operations such as maximum or average. Our method can be directly applied to any existing point cloud based network, resulting in significant improvements in performance for rotated inputs. We show state-of-the-art results in classification tasks with various datasets under both SO(2) and SO(3) rotations. In addition, we further analyze the necessary conditions of applying our approach to PointNet [1] based networks.
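
The equivariance-to-invariance reduction can be illustrated with a toy example: computing a feature for every group-rotated copy of the input yields a stack that is merely permuted when the input is rotated, so a max over the group axis is invariant. The 4-element rotation group and the toy feature function below are illustrative stand-ins for the paper's network.

```python
import numpy as np

# Toy illustration of reducing discrete rotation equivariance to invariance:
# compute a feature for each group-rotated copy of the point cloud, stack the
# results along a group axis, and remove the permutation with a max over that
# axis. The 4-element SO(2) group and toy feature function are illustrative.

def rot_z(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def toy_feature(points):                     # stand-in for a point network
    return np.sort(points @ np.array([1.0, 2.0, 3.0]))[-4:]

group = [rot_z(k * np.pi / 2) for k in range(4)]
pts = np.random.rand(100, 3)

stack = np.stack([toy_feature(pts @ g.T) for g in group])   # (|G|, feat)
invariant = stack.max(axis=0)                # rotation-invariant descriptor

rotated_stack = np.stack([toy_feature((pts @ group[1].T) @ g.T) for g in group])
print(np.allclose(invariant, rotated_stack.max(axis=0)))    # True up to numerics
```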

ICRA Conference 2019 Conference Paper

Project AutoVision: Localization and 3D Scene Perception for an Autonomous Vehicle with a Multi-Camera System

  • Lionel Heng
  • Benjamin Choi
  • Zhaopeng Cui
  • Marcel Geppert
  • Sixing Hu
  • Benson Kuan
  • Peidong Liu 0001
  • Rang M. H. Nguyen

Project AutoVision aims to develop localization and 3D scene perception capabilities for a self-driving vehicle. Such capabilities will enable autonomous navigation in urban and rural environments, in day and night, and with cameras as the only exteroceptive sensors. The sensor suite employs many cameras for both 360-degree coverage and accurate multi-view stereo; the use of low-cost cameras keeps the cost of this sensor suite to a minimum. In addition, the project seeks to extend the operating envelope to include GNSS-less conditions which are typical for environments with tall buildings, foliage, and tunnels. Emphasis is placed on leveraging multi-view geometry and deep learning to enable the vehicle to localize and perceive in 3D space. This paper presents an overview of the project, and describes the sensor suite and current progress in the areas of calibration, localization, and perception.

IROS Conference 2017 Conference Paper

Deep learning for 2D scan matching and loop closure

  • Jiaxin Li
  • Huangying Zhan
  • Ben M. Chen
  • Ian D. Reid 0001
  • Gim Hee Lee

Although 2D LiDAR based Simultaneous Localization and Mapping (SLAM) is a relatively mature topic nowadays, the loop closure problem remains challenging due to the lack of distinctive features in 2D LiDAR range scans. Existing research can be roughly divided into correlation based approaches, e.g. scan-to-submap matching, and feature based methods, e.g. bag-of-words (BoW). In this paper, we solve loop closure detection and relative pose transformation using 2D LiDAR within an end-to-end deep learning framework. The algorithm is verified with simulation data and on an Unmanned Aerial Vehicle (UAV) flying in an indoor environment. The loop detection ConvNet alone achieves an accuracy of 98.2% in loop closure detection. With a verification step using the scan matching ConvNet, the false positive rate drops to around 0.001%. The proposed approach processes 6000 pairs of raw LiDAR scans per second on an Nvidia GTX1080 GPU.

IROS Conference 2016 Conference Paper

Object detection and motion planning for automated welding of tubular joints

  • Syeda Mariam Ahmed
  • Yan Zhi Tan
  • Gim Hee Lee
  • Chee-Meng Chew
  • Chee Khiang Pang

Automatic welding of tubular TKY joints is an important and challenging task for the marine and offshore industry. In this paper, a framework for tubular joint detection and motion planning is proposed. The pose of the real tubular joint is detected using RGB-D sensors, which is used to obtain a real-to-virtual mapping for positioning the workpiece in a virtual environment. For motion planning, a Bi-directional Transition-based Rapidly exploring Random Tree (BiTRRT) algorithm is used to generate trajectories for reaching the desired goals. The complete framework is verified with experiments, and the results show that the robot welding torch is able to transit without collision to desired goals which are close to the tubular joint.

IROS Conference 2015 Conference Paper

A minimal solution to the rolling shutter pose estimation problem

  • Olivier Saurer
  • Marc Pollefeys
  • Gim Hee Lee

Artefacts that are present in images taken from a moving rolling shutter camera degrade the accuracy of absolute pose estimation. To alleviate this problem, we introduce an additional linear velocity in the camera projection matrix to approximate the motion of the rolling shutter camera. In particular, we derive a minimal solution using the Gröbner Basis that solves for the absolute pose as well as the motion of a rolling shutter camera. We show that the minimal problem requires 5-point correspondences and gives up to 8 real solutions. We also show that our formulation can be extended to use more than 5-point correspondences. We use RANSAC to robustly find all the inliers. In the final step, we relax the linear velocity assumption and do a non-linear refinement on the full motion, i.e. linear and angular velocities, and pose of the rolling shutter camera with all the inliers. We verify the feasibility and accuracy of our algorithm with both simulated and real-world datasets.
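
The underlying camera model adds a translation that grows linearly with the image row, so a point's projection depends on the row it lands on. A minimal sketch of that model with a short fixed-point iteration is below; the intrinsics, pose and velocity values are illustrative, and this is a projection-model sketch rather than the minimal solver itself.

```python
import numpy as np

# Sketch of the rolling shutter camera model the minimal solver builds on:
# the camera translation varies linearly with the image row, so a point's
# projection is found by a short fixed-point iteration over its row. All
# intrinsics, pose and velocity values below are illustrative.

K = np.diag([500.0, 500.0, 1.0]); K[0, 2], K[1, 2] = 320.0, 240.0
R_wc = np.eye(3)                      # rotation at the first row
t0 = np.array([0.0, 0.0, 0.0])        # translation at the first row
v = np.array([0.0, 1e-4, 0.0])        # linear velocity per row (the RS term)

def project_rs(X, iters=5):
    row = 0.0
    for _ in range(iters):            # the row depends on the projection itself
        t = t0 + row * v
        x = K @ (R_wc @ X + t)
        u, r = x[0] / x[2], x[1] / x[2]
        row = r
    return np.array([u, r])

print(project_rs(np.array([0.2, -0.1, 3.0])))
```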

ICRA Conference 2014 Conference Paper

Infrastructure-based calibration of a multi-camera rig

  • Lionel Heng
  • Mathias Bürki
  • Gim Hee Lee
  • Paul Timothy Furgale
  • Roland Siegwart
  • Marc Pollefeys

The online recalibration of multi-sensor systems is a fundamental problem that must be solved before complex automated systems are deployed in situations such as automated driving. In such situations, accurate knowledge of calibration parameters is critical for the safe operation of automated systems. However, most existing calibration methods for multi-sensor systems are computationally expensive, use installations of known fiducial patterns, and require expert supervision. We propose an alternative approach called infrastructure-based calibration that is efficient, requires no modification of the infrastructure, and is completely unsupervised. In a survey phase, a computationally expensive simultaneous localization and mapping (SLAM) method is used to build a highly accurate map of a calibration area. Once the map is built, many other vehicles are able to use it for calibration as if it were a known fiducial pattern. We demonstrate the effectiveness of this method to calibrate the extrinsic parameters of a multi-camera system. The method does not assume that the cameras have an overlapping field of view and it does not require an initial guess. As the camera rig moves through the previously mapped area, we match features between each set of synchronized camera images and the map. Subsequently, we find the camera poses and inlier 2D-3D correspondences. From the camera poses, we obtain an initial estimate of the camera extrinsics and rig poses, and optimize these extrinsics and rig poses via non-linear refinement. The calibration code is publicly available as a standalone C++ package.

ICRA Conference 2014 Conference Paper

Unsupervised learning of threshold for geometric verification in visual-based loop-closure

  • Gim Hee Lee
  • Marc Pollefeys

A potential loop-closure image pair passes the geometric verification test if the number of inliers from the computation of the geometric constraint with RANSAC exceeds a pre-defined threshold. The choice of the threshold is critical to the success of identifying the correct loop-closure image pairs. However, the value of this threshold often varies for different datasets and is chosen empirically. In this paper, we propose an unsupervised method that learns the threshold for geometric verification directly from the observed inlier counts of all the potential loop-closure image pairs. We model the distribution of the inlier counts from all the potential loop-closure image pairs with a two-component Log-Normal mixture model - one component represents the state of non loop-closure and the other represents the state of loop-closure - and learn the parameters with the Expectation-Maximization algorithm. The intersection of the Log-Normal mixture distributions is the optimal threshold for geometric verification, i.e. the threshold that gives the minimum number of false positive and false negative loop-closures. Our algorithm degenerates when there are too few or no loop-closures, and we propose the χ² test to detect this degeneracy. We verify our proposed method with several large-scale datasets collected with both a multi-camera setup and a stereo camera.
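
A compact way to reproduce the thresholding idea is to fit a two-component Gaussian mixture to the log inlier counts (equivalently, a Log-Normal mixture on the counts) with EM and take the crossing point of the two components as the threshold. The sketch below does this with scikit-learn on simulated counts; the simulated data and the grid search for the crossing point are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Sketch of learning the geometric-verification threshold: fit a two-component
# Gaussian mixture to log inlier counts (i.e. a Log-Normal mixture on the raw
# counts) with EM, then take the threshold where the component posteriors
# cross. The simulated inlier counts are illustrative placeholders.

rng = np.random.default_rng(0)
non_loop = rng.lognormal(mean=2.0, sigma=0.4, size=800)   # low inlier counts
loop = rng.lognormal(mean=4.5, sigma=0.3, size=200)       # high inlier counts
log_counts = np.log(np.concatenate([non_loop, loop]))[:, None]

gmm = GaussianMixture(n_components=2, random_state=0).fit(log_counts)

# Scan for the point where the posterior flips between the two components.
grid = np.linspace(log_counts.min(), log_counts.max(), 2000)[:, None]
labels = np.argmax(gmm.predict_proba(grid), axis=1)
flip = np.argmax(np.abs(np.diff(labels)) > 0)
threshold = np.exp(grid[flip, 0])
print("inlier-count threshold:", round(float(threshold), 1))
```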

IROS Conference 2013 Conference Paper

A 4-point algorithm for relative pose estimation of a calibrated camera with a known relative rotation angle

  • Bo Li 0018
  • Lionel Heng
  • Gim Hee Lee
  • Marc Pollefeys

We propose an algorithm to estimate the relative camera pose using four feature correspondences and one relative rotation angle measurement. The algorithm can be used for relative pose estimation of a rigid body equipped with a camera and a relative rotation angle sensor which can be either an odometer, an IMU or a GPS/INS system. This algorithm exploits the fact that the relative rotation angles of both the camera and relative rotation angle sensor are the same as the camera and sensor are rigidly mounted to a rigid body. Therefore, knowledge of the extrinsic calibration between the camera and sensor is not required. We carry out a quantitative comparison of our algorithm with the well-known 5-point and 1-point algorithms, and show that our algorithm exhibits the highest level of accuracy.

IROS Conference 2013 Conference Paper

Robust pose-graph loop-closures with expectation-maximization

  • Gim Hee Lee
  • Friedrich Fraundorfer
  • Marc Pollefeys

In this paper, we model the robust loop-closure pose-graph SLAM problem as a Bayesian network and show that it can be solved with the Classification Expectation-Maximization (EM) algorithm. In particular, we express our robust pose-graph SLAM as a Bayesian network where the robot poses and constraints are the latent and observed variables, respectively. An additional set of latent variables is introduced as weights for the loop-constraints. We show that the weights can be chosen as the Cauchy function, which are iteratively computed from the errors between the predicted robot poses and observed loop-closure constraints in the Expectation step, and used to weight the cost functions from the pose-graph loop-closure constraints in the Maximization step. As a result, outlier loop-closure constraints are assigned low weights and exert less influence in the pose-graph optimization within the EM iterations. To prevent the EM algorithm from getting stuck at local minima, we perform the EM algorithm multiple times, where loop constraints with very low weights are removed after each EM process. This is repeated until there are no more changes to the weights. We show proofs of the conceptual similarity between our EM algorithm and the M-Estimator. Specifically, we show that the weight function in our EM algorithm is equivalent to the robust residual function in the M-Estimator. We verify our proposed algorithm with experimental results from multiple simulated and real-world datasets, and comparisons with other existing works.
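
The E-step weight described above is the Cauchy function w = c^2 / (c^2 + r^2) of the current residual r with scale c. The toy sketch below alternates this weighting with a weighted re-fit on a single scalar "pose", which is a drastic simplification of a pose graph but shows how outlier constraints end up with near-zero weight; all values are illustrative.

```python
import numpy as np

# Toy illustration of the EM weighting: the E-step computes a Cauchy weight
# w = c^2 / (c^2 + r^2) from each constraint's current residual, and the
# M-step re-solves a weighted least-squares problem. Here the "pose" is a
# single scalar and the measurements with outliers are illustrative.

rng = np.random.default_rng(1)
good = 5.0 + 0.1 * rng.standard_normal(30)      # consistent loop constraints
bad = rng.uniform(-20, 20, size=6)              # outlier loop constraints
z = np.concatenate([good, bad])

c = 0.5                                          # Cauchy scale parameter
x = z.mean()                                     # initial (non-robust) estimate
for _ in range(20):
    r = z - x                                    # E-step: residuals ...
    w = c ** 2 / (c ** 2 + r ** 2)               # ... give Cauchy weights
    x = np.sum(w * z) / np.sum(w)                # M-step: weighted re-fit
print(x)   # close to 5.0; the outliers receive near-zero weight
```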

IROS Conference 2013 Conference Paper

Structureless pose-graph loop-closure with a multi-camera system on a self-driving car

  • Gim Hee Lee
  • Friedrich Fraundorfer
  • Marc Pollefeys

In this paper, we propose a method to compute the pose-graph loop-closure constraints using multiple cameras with non-overlapping or minimally overlapping fields of view mounted rigidly on a self-driving car, without the need to reconstruct any 3D scene points. In particular, we show that the relative pose with metric scale between two loop-closing pose-graph vertices can be obtained directly from the epipolar geometry of the multi-camera system. As a result, we avoid the additional time complexity and uncertainty from the reconstruction of 3D scene points that is needed by standard monocular and stereo approaches. In addition, there is greater flexibility in choosing a configuration for the multi-camera system to cover a wider field-of-view so as to avoid missing out on any loop-closure opportunities. We show that by expressing the point correspondences between two frames as Plücker lines and enforcing the planar motion constraint on the car, we are able to use the multiple cameras as one and formulate the relative pose problem for loop-closure as a minimal problem which requires 3-point correspondences and yields up to six real solutions. The RANSAC algorithm is used to determine the correct solution and for robust estimation. We verify our method with results from multiple large-scale real-world datasets.

IROS Conference 2012 Conference Paper

SFly: Swarm of micro flying robots

  • Markus W. Achtelik
  • Michael Achtelik
  • Yorick Brunet
  • Margarita Chli
  • Savvas A. Chatzichristofis
  • Jean-Dominique Decotignie
  • Klaus-Michael Doth
  • Friedrich Fraundorfer

The SFly project is an EU-funded project with the goal of creating a swarm of autonomous vision-controlled micro aerial vehicles. The mission in mind is that a swarm of MAVs autonomously maps out an unknown environment, computes optimal surveillance positions, places the MAVs there, and then locates radio beacons in this environment. The scope of the work includes contributions on multiple different levels ranging from theoretical foundations to hardware design and embedded programming. One of the contributions is the development of a new MAV, a hexacopter, equipped with enough processing power for onboard computer vision. A major contribution is the development of monocular visual SLAM that runs in real-time onboard the MAV. The visual SLAM results are fused with IMU measurements and are used to stabilize and control the MAV. This enables autonomous flight of the MAV, without the need of a data link to a ground station. Within this scope, novel analytical solutions for fusing IMU and vision measurements have been derived. In addition to the real-time local SLAM, an offline dense mapping process has been developed. For this, the MAVs are equipped with a stereo camera system as payload. The dense environment map is used to compute optimal surveillance positions for a swarm of MAVs. For this, an optimization technique based on cognitive adaptive optimization has been developed. Finally, the MAVs have been equipped with radio transceivers and a method has been developed to locate radio beacons in the observed environment.

IROS Conference 2012 Conference Paper

Vision-based autonomous mapping and exploration using a quadrotor MAV

  • Friedrich Fraundorfer
  • Lionel Heng
  • Dominik Honegger
  • Gim Hee Lee
  • Lorenz Meier
  • Petri Tanskanen
  • Marc Pollefeys

In this paper, we describe our autonomous vision-based quadrotor MAV system which maps and explores unknown environments. All algorithms necessary for autonomous mapping and exploration run on-board the MAV. Using a front-looking stereo camera as the main exteroceptive sensor, our quadrotor achieves these capabilities with both the Vector Field Histogram+ (VFH+) algorithm for local navigation, and the frontier-based exploration algorithm. In addition, we implement the Bug algorithm for autonomous wall-following which could optionally be selected as the substitute exploration algorithm in sparse environments where the frontier-based exploration under-performs. We incrementally build a 3D global occupancy map on-board the MAV. The map is used by the VFH+ and frontier-based exploration in dense environments, and the Bug algorithm for wall-following in sparse environments. During the exploration phase, images from the front-looking camera are transmitted over Wi-Fi to the ground station. These images are input to a large-scale visual SLAM process running off-board on the ground station. SLAM is carried out with pose-graph optimization and loop closure detection using a vocabulary tree. We improve the robustness of the pose estimation by fusing optical flow and visual odometry. Optical flow data is provided by a customized downward-looking camera integrated with a microcontroller while visual odometry measurements are derived from the front-looking stereo camera. We verify our approaches with experimental results.

ICRA Conference 2011 Conference Paper

MAV visual SLAM with plane constraint

  • Gim Hee Lee
  • Friedrich Fraundorfer
  • Marc Pollefeys

Bundle adjustment (BA), which produces highly accurate results for visual Simultaneous Localization and Mapping (SLAM), cannot be used on Micro-Aerial Vehicles (MAVs) with limited processing power because of its O(N³) complexity. We observe that a consistent ground plane often exists for MAVs flying in both indoor and outdoor urban environments. Therefore, in this paper, we propose a visual SLAM algorithm that makes use of the plane constraint to reduce the complexity of BA. The reduction of complexity is achieved by refining only the current camera pose and the most recent map points with a BA that minimizes the reprojection errors and the perpendicular distances between the most recent map points and the best-fit plane through all the pre-existing map points. As a result, our algorithm is approximately constant time since the number of current camera poses and most recent map points remains approximately constant. In addition, the minimization of the perpendicular distances between the plane and the map points enforces consistency between the reconstructed map points and the actual ground plane.
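
The extra cost term can be sketched as follows: fit the ground plane to the pre-existing map points with an SVD and penalize the perpendicular distances of the most recent map points to that plane, alongside the usual reprojection error. The synthetic map points below are illustrative placeholders.

```python
import numpy as np

# Sketch of the plane constraint added to bundle adjustment: fit the ground
# plane to the pre-existing map points with an SVD, then penalize the
# perpendicular distances of the most recent map points to that plane.
# The synthetic map points below are illustrative.

def fit_plane(points):
    centroid = points.mean(axis=0)
    _, _, vt = np.linalg.svd(points - centroid)
    normal = vt[-1]                               # direction of least variance
    return normal, centroid

rng = np.random.default_rng(0)
old_points = np.c_[rng.uniform(-5, 5, (200, 2)), 0.02 * rng.standard_normal(200)]
new_points = np.c_[rng.uniform(-5, 5, (20, 2)), 0.05 * rng.standard_normal(20)]

n, c = fit_plane(old_points)
plane_residuals = (new_points - c) @ n            # signed perpendicular distances
plane_cost = np.sum(plane_residuals ** 2)         # added to the reprojection cost
print(plane_cost)
```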

IROS Conference 2011 Conference Paper

Real-time photo-realistic 3D mapping for micro aerial vehicles

  • Lionel Heng
  • Gim Hee Lee
  • Friedrich Fraundorfer
  • Marc Pollefeys

In this paper, we proposed a method to recognize complex human daily activities including body activities and hand gestures simultaneously in an indoor environment. Three wearable motion sensors are attached to the right thigh, the waist, and the right hand of a person, while an optical motion capture system is used to obtain his/her location information. A three-level dynamic Bayesian network is implemented to model the intra-temporal and inter-temporal constraints among the location, body activity and hand gesture. The body activity and hand gesture are estimated using a Bayesian filter and the short-time Viterbi algorithm, which reduces the storage memory and the computational complexity. We conducted experiments in a mock apartment environment and the obtained results showed the effectiveness and accuracy of our algorithms.

IROS Conference 2011 Conference Paper

RS-SLAM: RANSAC sampling for visual FastSLAM

  • Gim Hee Lee
  • Friedrich Fraundorfer
  • Marc Pollefeys

In this paper, we present our RS-SLAM algorithm for a monocular camera, where the proposal distribution is derived from the 5-point RANSAC algorithm and image feature measurement uncertainties instead of the easily violated constant velocity model. We propose to do another RANSAC sampling within all the inliers that have the best RANSAC score to check for inlier misclassifications in the original correspondences, and use all the hypotheses generated from these consensus sets in the proposal distribution. This mitigates data association errors (inlier misclassifications) caused by the observation that the consensus set from RANSAC that yields the highest score might not, in practice, contain all the true inliers due to noise in the feature measurements. Hypotheses which are less probable are eventually eliminated in the particle filter resampling process. We also show in this paper that our monocular approach can be easily extended to a stereo camera. Experimental results validate the potential of our approach.