Arrow Research

Author name cluster

Hujun Bao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers
2 author rows

Possible papers (36)

AAAI Conference 2026 Conference Paper

One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion

  • Yitong Dong
  • Qi Zhang
  • Minchao Jiang
  • Zhiqiang Wu
  • Qingnan Fan
  • Ying Feng
  • Huaqi Zhang
  • Hujun Bao

We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations in recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained by low-resolution inputs due to computational costs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, resulting in inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone, and endows Gaussians with additional features to store high-frequency details. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process. We introduce a unified training strategy that enables joint optimization of the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method can maintain superior generation quality across multiple datasets.

AAAI Conference 2026 Conference Paper

StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

  • Yifan Yang
  • Zhi Cen
  • Sida Peng
  • Xiangwei Chen
  • Yifu Deng
  • Xinyu Zhu
  • Fan Jia
  • Xiaowei Zhou

This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that outputs facial motions in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides a lightweight diffusion head to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Experiments conducted on public datasets demonstrate that our approach outperforms recent baseline methods.
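
The streaming design reduces to a simple generation loop: keep a fixed-size buffer of past motion frames, fuse it with the current audio feature into a dynamic condition, and run a few denoising steps per frame. Below is a minimal sketch of that loop, assuming a caller-supplied denoising head; all names and shapes are illustrative, not the authors' code.

```python
from collections import deque
import torch

HISTORY = 8        # past motion frames kept as context (assumed size)
DENOISE_STEPS = 4  # a lightweight diffusion head needs only a few steps

def stream_facial_motion(audio_frames, denoise_head, motion_dim=64):
    """Emit one facial-motion frame per incoming audio feature."""
    history = deque(maxlen=HISTORY)
    for audio_feat in audio_frames:                     # streaming input
        past = torch.zeros(HISTORY, motion_dim)         # zero-padded context
        for i, frame in enumerate(history):
            past[i] = frame
        cond = torch.cat([past.flatten(), audio_feat])  # dynamic condition
        x = torch.randn(motion_dim)                     # start from noise
        for t in reversed(range(DENOISE_STEPS)):        # iterative denoising
            x = denoise_head(x, t, cond)
        history.append(x.detach())                      # becomes future context
        yield x                                         # emit immediately
```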

NeurIPS Conference 2025 Conference Paper

AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians

  • Xiyu Zhang
  • Chong Bao
  • Yipeng Chen
  • Hongjia Zhai
  • Yitong Dong
  • Hujun Bao
  • Zhaopeng Cui
  • Guofeng Zhang

3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure accurate surface reconstruction in low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for globally accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.

IROS Conference 2025 Conference Paper

DW-VIO: Deep Weighted Visual-Inertial Odometry

  • Guyuan Chen
  • Xiyue Guo
  • Xiaokun Pan
  • Yujun Shen
  • Guofeng Zhang 0001
  • Hujun Bao
  • Zhaopeng Cui

Visual-inertial odometry (VIO) has made significant progress in various applications. However, one of the key challenges in VIO is the efficient and robust fusion of visual and inertial measurements, particularly while mitigating the impact of sensor failures. To address this challenge, we propose a new learning-based VIO system, i.e., DW-VIO, which is able to integrate multiple sensors and provide robust state estimations. To this end, we design a novel deep learning-based data-fusion approach that dynamically associates information from multiple sensors to predict sensor weights for optimization. Moreover, in order to improve the efficiency, we present several real-time optimization techniques including a fast patch graph constructor and an efficient GPU-accelerated multi-factor bundle adjustment layer. Experimental results show that DW-VIO outperforms most state-of-the-art (SOTA) methods on the EuRoC MAV, ETH3D-SLAM, and KITTI-360 benchmarks across various challenging sequences. Additionally, it maintains a minimum of 20 frames per second (FPS) on a single RTX 3060 GPU with high-resolution input, highlighting its efficiency.
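
The fusion idea, network-predicted per-residual weights entering a weighted least-squares optimization, can be written compactly. Below is a sketch of one weighted Gauss-Newton step under that reading; the weight-prediction network is abstracted away and all names are hypothetical.

```python
import numpy as np

def weighted_gauss_newton_step(x, residual_fns, jacobian_fns, weights):
    """One step of min_x sum_i w_i * ||r_i(x)||^2, where the weights w_i
    would come from the learned data-fusion network."""
    H = np.zeros((x.size, x.size))
    b = np.zeros(x.size)
    for r_fn, J_fn, w in zip(residual_fns, jacobian_fns, weights):
        r, J = r_fn(x), J_fn(x)            # residual and its Jacobian
        H += w * J.T @ J                   # weighted normal equations
        b += w * J.T @ r
    return x - np.linalg.solve(H + 1e-8 * np.eye(x.size), b)
```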

IROS Conference 2025 Conference Paper

ETO+: Revisit the Refinement Stage in Efficient Feature Matching

  • Junjie Ni
  • Yichen Shen 0004
  • Yijin Li
  • Hongjia Zhai
  • Hujun Bao
  • Guofeng Zhang 0001

Recent feature matching approaches like ETO have focused on developing lightweight matching algorithms for real-time applications. However, their lack of cross-image feature interaction and sufficient refinement often leads to a decline in matching accuracy. To address these challenges, we propose ETO+, a novel and accurate feature matching algorithm that incorporates a lightweight yet efficient bidirectional interaction module and multi-stage refinement. Specifically, we introduce Trans-CNN, a bidirectional feature interaction module that integrates CNN- and transformer-based techniques to enhance both intra-image feature refinement and inter-image feature fusion, all while maintaining a comparable computational cost. Furthermore, by leveraging the inherent sparsity of local feature matching, we propose an efficient strategy to adaptively reallocate computational resources within the network. Additionally, we design an adaptive loss function that mitigates the impact of large matching errors, thereby improving overall robustness. Extensive experiments on widely used datasets demonstrate that our approach achieves a strong balance between accuracy and computational efficiency. It outperforms ETO by 7.9 in AUC@5 on MegaDepth while being about 40% faster than E-LoFTR.

AAAI Conference 2025 Conference Paper

GURecon: Learning Detailed 3D Geometric Uncertainties for Neural Surface Reconstruction

  • Zesong Yang
  • Ru Zhang
  • Jiale Shi
  • Zixiang Ai
  • Boming Zhao
  • Hujun Bao
  • Luwei Yang
  • Zhaopeng Cui

Neural surface representation has demonstrated remarkable success in the areas of novel view synthesis and 3D reconstruction. However, assessing the geometric quality of 3D reconstructions in the absence of ground truth mesh remains a significant challenge, due to its rendering-based optimization process and entangled learning of appearance and geometry with photometric losses. In this paper, we present a novel framework, i.e., GURecon, which establishes a geometric uncertainty field for the neural surface based on geometric consistency. Different from existing methods that rely on rendering-based measurement, GURecon models a continuous 3D uncertainty field for the reconstructed surface, and is learned by an online distillation approach without introducing real geometric information for supervision. Moreover, in order to mitigate the interference of illumination on geometric consistency, a decoupled field is learned and exploited to finetune the uncertainty field. Experiments on various datasets demonstrate the superiority of GURecon in modeling 3D geometric uncertainty, as well as its plug-and-play extension to various neural surface representations and improvement on downstream tasks such as incremental reconstruction.

ICLR Conference 2025 Conference Paper

ND-SDF: Learning Normal Deflection Fields for High-Fidelity Indoor Reconstruction

  • Ziyu Tang
  • Weicai Ye
  • Yifan Wang 0025
  • Di Huang
  • Hujun Bao
  • Tong He 0001
  • Guofeng Zhang 0001

Neural implicit reconstruction via volume rendering has demonstrated its effectiveness in recovering dense 3D surfaces. However, it is non-trivial to simultaneously recover meticulous geometry and preserve smoothness across regions with differing characteristics. To address this issue, previous methods typically employ geometric priors, which are often constrained by the performance of the prior models. In this paper, we propose ND-SDF, which learns a Normal Deflection field to represent the angular deviation between the scene normal and the prior normal. Unlike previous methods that uniformly apply geometric priors on all samples, introducing significant bias in accuracy, our proposed normal deflection field dynamically learns and adapts the utilization of samples based on their specific characteristics, thereby improving both the accuracy and effectiveness of the model. Our method not only recovers smooth, weakly textured regions such as walls and floors but also preserves the geometric details of complex structures. In addition, we introduce a novel ray sampling strategy based on the deflection angle to facilitate the unbiased rendering process, which significantly improves the quality and accuracy of intricate surfaces, especially on thin structures. Consistent improvements on various challenging datasets demonstrate the superiority of our method.

ICRA Conference 2025 Conference Paper

Neuraloc: Visual Localization in Neural Implicit Map With Dual Complementary Features

  • Hongjia Zhai
  • Boming Zhao
  • Hai Li
  • Xiaokun Pan
  • Yijia He
  • Zhaopeng Cui
  • Hujun Bao
  • Guofeng Zhang 0001

Recently, neural radiance fields (NeRF) have gained significant attention in the field of visual localization. However, existing NeRF-based approaches either lack geometric constraints or require extensive storage for feature matching, limiting their practical applications. To address these challenges, we propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field, avoiding the need to explicitly store point-wise features. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields, which enhance the quality and reliability of 2D-3D correspondences. Besides, we propose descriptor similarity distribution alignment to minimize the domain gap between 2D and 3D feature spaces during matching. Finally, we construct the matching graph using both complementary descriptors and contextual features to establish accurate 2D-3D correspondences for 6-DoF pose estimation. Compared with the recent NeRF-based approaches, our method achieves a 3× faster training speed and a 45× reduction in model storage. Extensive experiments on two widely used datasets demonstrate that our approach outperforms or is highly competitive with other state-of-the-art NeRF-based visual localization methods. Project page: https://zju3dv.github.io/neuraloc

ICLR Conference 2025 Conference Paper

Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation

  • Zhi Cen
  • Huaijin Pi
  • Sida Peng
  • Qing Shuai
  • Yujun Shen
  • Hujun Bao
  • Xiaowei Zhou 0001
  • Ruizhen Hu

This paper addresses the task of generating two-character online interactions. Previously, two main settings existed for two-character interaction generation: (1) generating one's motions based on the counterpart's complete motion sequence, and (2) jointly generating two-character motions based on specific conditions. We argue that these settings fail to model the process of real-life two-character interactions, where humans will react to their counterparts in real time and act as independent individuals. In contrast, we propose an online reaction policy, called Ready-to-React, to generate the next character pose based on past observed motions. Each character has its own reaction policy as its "brain", enabling them to interact like real humans in a streaming manner. Our policy is implemented by incorporating a diffusion head into an auto-regressive model, which can dynamically respond to the counterpart's motions while effectively mitigating the error accumulation throughout the generation process. We conduct comprehensive experiments using the challenging boxing task. Experimental results demonstrate that our method outperforms existing baselines and can generate extended motion sequences. Additionally, we show that our approach can be controlled by sparse signals, making it well-suited for VR and other online interactive environments. Code and data will be made publicly available.
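
The online policy structure amounts to a plain two-agent loop in which each character samples its next pose from its own policy, conditioned on everything observed so far. A minimal sketch with the diffusion-head sampler abstracted as a callable (names are illustrative):

```python
def simulate_interaction(policy_a, policy_b, init_a, init_b, steps=120):
    """Two independent reaction policies stepping in lock-step; each sees
    only the past motions of itself and its counterpart."""
    hist_a, hist_b = [init_a], [init_b]
    for _ in range(steps):
        # each "brain" reacts online to what it has observed so far
        next_a = policy_a(own=hist_a, other=hist_b)  # diffusion-head sample
        next_b = policy_b(own=hist_b, other=hist_a)
        hist_a.append(next_a)
        hist_b.append(next_b)
    return hist_a, hist_b
```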

ICRA Conference 2025 Conference Paper

Scalable Multi-Session Visual SLAM in Large-Scale Scenes with Subgraph Optimization

  • Xiaokun Pan
  • Zhenzhe Li
  • Tianxing Fan
  • Hongjia Zhai
  • Hujun Bao
  • Guofeng Zhang 0001

Multi-session visual SLAM systems enable 6-DoF camera localization along with long-term maintenance and expansion of the global map, by utilizing image data from different sessions. However, in large-scale environments, these systems often suffer from severe scale drift. While modern SLAM systems attempt to maintain global map consistency through loop detection and correction, they still face challenges in terms of convergence and accuracy. In this paper, we propose a robust large-scale multi-session SLAM system for long-term localization and mapping that achieves global consistency. Furthermore, to address the backend optimization problem in large-scale environments, we introduce a hierarchical optimization strategy based on the graph structure. More specifically, a subgraph structure is introduced to reduce the size of the problem while effectively propagating scale correction information. In addition, a hierarchical strategy enables coarse-to-fine updates of the graph states. Experimental results not only demonstrate that our method efficiently optimizes the pose graph and maintains map consistency in large-scale environments, but also highlight the effectiveness and scalability of the proposed approach.

ICLR Conference 2025 Conference Paper

UniRestore3D: A Scalable Framework For General Shape Restoration

  • Yuang Wang
  • Yujian Zhang
  • Sida Peng
  • Xingyi He
  • Haoyu Guo
  • Yujun Shen
  • Hujun Bao
  • Xiaowei Zhou 0001

Shape restoration aims to recover intact 3D shapes from defective ones, such as those that are incomplete, noisy, and low-resolution. Previous works have achieved impressive results in shape restoration subtasks thanks to advanced generative models. While effective for specific shape defects, they are less applicable in real-world scenarios involving multiple defect types simultaneously. Additionally, training on limited subsets of defective shapes hinders knowledge transfer across restoration types and thus affects generalization. In this paper, we address the task of general shape restoration, which restores shapes with various types of defects through a unified model, thereby naturally improving the applicability and scalability. Our approach first standardizes the data representation across different restoration subtasks using high-resolution TSDF grids and constructs a large-scale dataset with diverse types of shape defects. Next, we design an efficient hierarchical shape generation model and a noise-robust defective shape encoder that enables effective impaired shape understanding and intact shape generation. Moreover, we propose a scalable training strategy for efficient model training. The capabilities of our proposed method are demonstrated across multiple shape restoration subtasks and validated on various datasets, including Objaverse, ShapeNet, GSO, and ABO.

NeurIPS Conference 2024 Conference Paper

A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding

  • Yitong Dong
  • Yijin Li
  • Zhaoyang Huang
  • Weikang Bian
  • Jingbo Liu
  • Hujun Bao
  • Zhaopeng Cui
  • Hongsheng Li

In this paper, we propose a novel multi-view stereo (MVS) framework that gets rid of the depth range prior. Unlike recent prior-free MVS methods that work in a pair-wise manner, our method simultaneously considers all the source images. Specifically, we introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information within and across multi-view images. Considering the asymmetry of the epipolar disparity flow, the key to our method lies in accurately modeling multi-view geometric constraints. We integrate pose embedding to encapsulate information such as multi-view camera poses, providing implicit geometric constraints for multi-view disparity feature fusion dominated by attention. Additionally, we construct corresponding hidden states for each source image due to significant differences in the observation quality of the same pixel in the reference frame across multiple source frames. We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image and dynamically update hidden states through the uncertainty estimation module. Extensive results on the DTU dataset and Tanks & Temples benchmark demonstrate the effectiveness of our method.

IJCAI Conference 2024 Conference Paper

Error-aware Sampling in Adaptive Shells for Neural Surface Reconstruction

  • Qi Wang
  • Yuchi Huo
  • Qi Ye
  • Rui Wang
  • Hujun Bao

Neural implicit surfaces with signed distance functions (SDFs) achieve superior quality in 3D geometry reconstruction. However, training SDFs is time-consuming because it requires a great number of samples to calculate accurate weight distributions and a considerable number of samples drawn from those distributions to integrate the rendering results. Existing sampling strategies address this problem but assume a spatially consistent convergence speed of the kernel size during training, and thus still suffer from slow convergence or errors. Instead, we introduce an error-aware sampling method based on thin intervals of valid weight distributions, dubbed adaptive shells, to reduce the number of samples while still maintaining the reconstruction accuracy. To this end, we first extend Laplace-based neural implicit surfaces with learned spatially-varying kernel sizes which indicate the range of valid weight distributions. Then, the adaptive shell for each ray is determined by an efficient double-clipping strategy with spatially-varying SDF values and kernel sizes, fitting larger kernel sizes to wider shells. Finally, we calculate the error-bounded cumulative distribution functions (CDFs) of shells to conduct efficient importance sampling, achieving low-variance rendering with fewer calculations. Extensive results in various scenes demonstrate the superiority of our sampling technique, including significantly reducing sample counts and training time, even improving the reconstruction quality. The code is available at https://github.com/erernan/ESampling.
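
The shell-restricted sampling boils down to inverse-transform sampling of a cumulative distribution built only inside the clipped interval. A numpy sketch, assuming a Laplace-style weight on the SDF with a learned kernel size beta (function and parameter names are illustrative):

```python
import numpy as np

def sample_in_shell(sdf_fn, t0, t1, beta, n_coarse=32, n_fine=16):
    """Importance-sample ray points inside an adaptive shell [t0, t1];
    sdf_fn must accept an array of ray depths."""
    t = np.linspace(t0, t1, n_coarse)
    w = np.exp(-np.abs(sdf_fn(t)) / beta)   # weight peaks near the surface
    cdf = np.cumsum(w)
    cdf /= cdf[-1]                          # normalized CDF over the shell
    u = np.random.rand(n_fine)
    return np.interp(u, cdf, t)             # inverse-transform sampling
```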

NeurIPS Conference 2024 Conference Paper

ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses

  • Junjie Ni
  • Guofeng Zhang
  • Guanglin Li
  • Yijin Li
  • Xinyang Liu
  • Zhaoyang Huang
  • Hujun Bao

We tackle the efficiency problem of learning local feature matching. Recent advancements have given rise to purely CNN-based and transformer-based approaches, each augmented with deep learning techniques. While CNN-based methods often excel in matching speed, transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. This technique is built on constructing multiple homography hypotheses to approximate the continuous correspondence in the real world and uni-directional cross-attention to accelerate the refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while the inference speed is boosted by a factor of four, even outperforming the CNN-based methods. Comprehensive evaluations on other open datasets such as MegaDepth, ScanNet, and HPatches demonstrate our method's efficacy, highlighting its potential to significantly enhance a wide array of downstream applications.

ICRA Conference 2024 Conference Paper

From Satellite to Ground: Satellite Assisted Visual Localization with Cross-view Semantic Matching

  • Xiyue Guo
  • Haocheng Peng
  • Junjie Hu 0003
  • Hujun Bao
  • Guofeng Zhang 0001

One of the key challenges of visual Simultaneous Localization and Mapping (SLAM) in large-scale environments is how to effectively use global localization to correct the cumulative errors from long-term tracking. This challenge presents itself in two main aspects: first, the difficulty for robots in revisiting previous locations to perform loop closure, and second, the considerable memory resources required to maintain point-cloud-based global maps. Recent solutions have resorted to neural networks, using satellite images as the references for ground-level localization. However, most of these methods merely provide cross-view patch-matching results, which makes them infeasible to integrate with a SLAM system. To address these issues, we present a semantic-based cross-view localization method. This approach combines semantic information with a reward and penalty mechanism, enabling us to obtain a global probability map and achieve precise 3-degree-of-freedom (3-DoF) localization. Based on that, we develop a SLAM system that capitalizes on satellite imagery for global localization. This strategy effectively bridges the gap between SLAM and real-world coordinates while also substantially reducing accumulated errors. Our experimental results demonstrate that our global localization method significantly outperforms existing satellite-based systems. Moreover, in scenarios where the robot struggles to find loop closures, employing our localization method improves the SLAM accuracy.

ICRA Conference 2024 Conference Paper

Omnidirectional Dense SLAM for Back-to-back Fisheye Cameras

  • Weijian Xie
  • Guanyi Chu
  • Quanhao Qian
  • Yihao Yu
  • Shangjin Zhai
  • Danpeng Chen
  • Nan Wang 0020
  • Hujun Bao

We propose a real-time visual-inertial dense SLAM system that utilizes the online data streams from a back-to-back dual-fisheye camera setup, providing 360° coverage of the environment. Firstly, we employ a sliding-window-based front-end to estimate real-time poses from the binocular fisheye images and IMU data. Then, we implement a lightweight panoramic depth completion network based on a multi-basis depth representation. The network takes panoramic images (obtained by stitching dual-fisheye images with extrinsic and intrinsic parameters) and sparse depths (generated by the front-end local tracking) as input and predicts multiple depth bases along with corresponding confidences as output. The final dense depth is the linear combination of the multiple depth bases. Thanks to the multi-basis depth representation, we can continuously optimize the 360° depth with a traditional optimizer to achieve higher global consistency in depth. We conducted experiments on both simulated and real-world datasets to evaluate our method. The results demonstrate that the proposed method outperforms SoTA methods in terms of depth prediction and 3D reconstruction. In addition, we develop a demo that can run on a mobile device to demonstrate the real-time capabilities of our method.
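
Since the final dense depth is a linear combination of predicted depth bases, the decoding step is a few lines. A sketch that assumes the confidences are normalized with a per-pixel softmax, which the abstract does not spell out:

```python
import numpy as np

def combine_depth_bases(bases, confidences):
    """bases: (K, H, W) depth bases; confidences: (K, H, W) logits.
    Returns the per-pixel confidence-weighted combination, (H, W)."""
    c = np.exp(confidences - confidences.max(axis=0, keepdims=True))
    c /= c.sum(axis=0, keepdims=True)   # per-pixel softmax (assumption)
    return (c * bases).sum(axis=0)      # dense depth map
```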

AAAI Conference 2024 Conference Paper

PNeRFLoc: Visual Localization with Point-Based Neural Radiance Fields

  • Boming Zhao
  • Luwei Yang
  • Mao Mao
  • Hujun Bao
  • Zhaopeng Cui

Due to the ability to synthesize high-quality novel views, Neural Radiance Fields (NeRF) has been recently exploited to improve visual localization in a known environment. However, the existing methods mostly utilize NeRF for data augmentation to improve the regression model training, and their performances on novel viewpoints and appearances are still limited due to the lack of geometric constraints. In this paper, we propose a novel visual localization framework, i.e., PNeRFLoc, based on a unified point-based representation. On one hand, PNeRFLoc supports the initial pose estimation by matching 2D and 3D feature points as traditional structure-based methods; on the other hand, it also enables pose refinement with novel view synthesis using rendering-based optimization. Specifically, we propose a novel feature adaption module to close the gaps between the features for visual localization and neural rendering. To improve the efficacy and efficiency of neural rendering-based optimization, we also developed an efficient rendering-based framework with a warping loss function. Extensive experiments demonstrate that PNeRFLoc performs the best on the synthetic dataset when the 3D NeRF model can be well learned, and significantly outperforms all the NeRF-boosted localization methods with on-par SOTA performance on the real-world benchmark localization datasets. Project webpage: https://zju3dv.github.io/PNeRFLoc/.

IROS Conference 2023 Conference Paper

BlinkFlow: A Dataset to Push the Limits of Event-Based Optical Flow Estimation

  • Yijin Li
  • Zhaoyang Huang
  • Shuo Chen
  • Xiaoyu Shi 0002
  • Hongsheng Li 0001
  • Hujun Bao
  • Zhaopeng Cui
  • Guofeng Zhang 0001

Event cameras provide high temporal precision, low data rates, and high dynamic range visual perception, which are well-suited for optical flow estimation. While data-driven optical flow estimation has obtained great success in RGB cameras, its generalization performance is seriously hindered in event cameras mainly due to the limited and biased training data. In this paper, we present a novel simulator, BlinkSim, for the fast generation of large-scale data for event-based optical flow. BlinkSim incorporates a configurable rendering engine alongside an event simulation suite. By leveraging the wealth of current 3D assets, the rendering engine enables us to automatically build up thousands of scenes with different objects, textures, and motion patterns and render very high-frequency images for realistic event data simulation. Based on BlinkSim, we construct a large training dataset and evaluation benchmark BlinkFlow that contains sufficient, diversiform, and challenging event data with optical flow ground truth. Experiments show that BlinkFlow improves the generalization performance of state-of-the-art methods by more than 40% on average and up to 90%. Moreover, we further propose an Event-based optical Flow transFormer (E-FlowFormer) architecture. Powered by our BlinkFlow, E-FlowFormer outperforms the SOTA methods by up to 91% on the MVSEC dataset and 14% on the DSEC dataset and presents the best generalization performance. The source code and data are available at https://zju3dv.github.io/blinkflow/.

NeurIPS Conference 2023 Conference Paper

Compact Neural Volumetric Video Representations with Dynamic Codebooks

  • Haoyu Guo
  • Sida Peng
  • Yunzhi Yan
  • Linzhan Mou
  • Yujun Shen
  • Hujun Bao
  • Xiaowei Zhou

This paper addresses the challenge of representing high-fidelity volumetric videos with low storage cost. Some recent feature grid-based methods have shown superior performance of fast learning implicit neural representations from input 2D images. However, such explicit representations easily lead to large model sizes when modeling dynamic scenes. To solve this problem, our key idea is reducing the spatial and temporal redundancy of feature grids, which intrinsically exist due to the self-similarity of scenes. To this end, we propose a novel neural representation, named dynamic codebook, which first merges similar features for the model compression and then compensates for the potential decline in rendering quality by a set of dynamic codes. Experiments on the NHR and DyNeRF datasets demonstrate that the proposed approach achieves state-of-the-art rendering quality while being more storage-efficient. The source code is available at https://github.com/zju3dv/compact_vv.

NeurIPS Conference 2023 Conference Paper

CP-SLAM: Collaborative Neural Point-based SLAM System

  • Jiarui Hu
  • Mao Mao
  • Hujun Bao
  • Guofeng Zhang
  • Zhaopeng Cui

This paper presents a collaborative implicit neural simultaneous localization and mapping (SLAM) system with RGB-D image sequences, which consists of complete front-end and back-end modules including odometry, loop detection, sub-map fusion, and global refinement. In order to enable all these modules in a unified framework, we propose a novel neural point based 3D scene representation in which each point maintains a learnable neural feature for scene encoding and is associated with a certain keyframe. Moreover, a distributed-to-centralized learning strategy is proposed for the collaborative implicit SLAM to improve consistency and cooperation. A novel global optimization framework is also proposed to improve the system accuracy like traditional bundle adjustment. Experiments on various datasets demonstrate the superiority of the proposed method in both camera tracking and mapping.

ICRA Conference 2023 Conference Paper

Descriptor Distillation for Efficient Multi-Robot SLAM

  • Xiyue Guo
  • Junjie Hu 0003
  • Hujun Bao
  • Guofeng Zhang 0001

Performing accurate localization while maintaining a low communication bandwidth is an essential challenge of multi-robot simultaneous localization and mapping (MR-SLAM). In this paper, we tackle this problem by generating a compact yet discriminative feature descriptor with minimum inference time. We propose descriptor distillation that formulates the descriptor generation into a learning problem under the teacher-student framework. To achieve real-time descriptor generation, we design a compact student network and learn it by transferring the knowledge from a pre-trained large teacher model. To reduce the descriptor dimensions from the teacher to the student, we propose a novel loss function that enables the knowledge transfer between two different dimensional descriptors. The experimental results demonstrate that our model is 30% lighter than the state-of-the-art model and produces better descriptors in patch matching. Moreover, we build an MR-SLAM system based on the proposed method and show that our descriptor distillation can achieve higher localization performance for MR-SLAM with lower bandwidth.
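
The loss must bridge descriptors of different dimensions; one dimension-agnostic way to do that is relational distillation, matching the pairwise similarity structure of a batch instead of raw values. The sketch below illustrates that idea and is not necessarily the authors' exact loss:

```python
import torch
import torch.nn.functional as F

def relational_distill_loss(student_desc, teacher_desc):
    """student_desc: (B, 64), teacher_desc: (B, 128) -- shapes illustrative.
    Aligning the two cosine-similarity matrices transfers knowledge even
    though the descriptor dimensions differ."""
    s = F.normalize(student_desc, dim=1)
    t = F.normalize(teacher_desc, dim=1)
    return F.mse_loss(s @ s.T, t @ t.T)
```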

ICRA Conference 2023 Conference Paper

Perceiving Unseen 3D Objects by Poking the Objects

  • Linghao Chen
  • Yunzhou Song
  • Hujun Bao
  • Xiaowei Zhou 0001

We present a novel approach to interactive 3D object perception for robots. Unlike previous perception algorithms that rely on known object models or a large amount of annotated training data, we propose a poking-based approach that automatically discovers and reconstructs 3D objects. The poking process not only enables the robot to discover unseen 3D objects but also produces multi-view observations for 3D reconstruction of the objects. The reconstructed objects are then memorized by neural networks with regular supervised learning and can be recognized in new test images. The experiments on real-world data show that our approach can discover and reconstruct unseen 3D objects with high quality without supervision, and facilitate real-world applications such as robotic grasping. The code and supplementary materials are available at the project page: https://zju3dv.github.io/poking_perception/.

AAAI Conference 2022 Conference Paper

Active Boundary Loss for Semantic Segmentation

  • Chi Wang
  • Yunke Zhang
  • Miaomiao Cui
  • Peiran Ren
  • Yin Yang
  • Xuansong Xie
  • Xian-Sheng Hua
  • Hujun Bao

This paper proposes a novel active boundary loss for semantic segmentation. It can progressively encourage the alignment between predicted boundaries and ground-truth boundaries during end-to-end training, which is not explicitly enforced in commonly used cross-entropy loss. Based on the predicted boundaries detected from the segmentation results using current network parameters, we formulate the boundary alignment problem as a differentiable direction vector prediction problem to guide the movement of predicted boundaries in each iteration. Our loss is model-agnostic and can be plugged into the training of segmentation networks to improve the boundary details. Experimental results show that training with the active boundary loss can effectively improve the boundary F-score and mean Intersection-over-Union on challenging image and video object segmentation datasets.
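
One plausible reading of the direction-vector formulation: at every predicted-boundary pixel, the supervision is the 8-neighbour move that most reduces the distance to the ground-truth boundary, trained with cross-entropy. The sketch below follows that reading; the paper's exact construction of the direction logits may differ.

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

# 8-neighbour offsets serving as direction classes
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def active_boundary_loss(pred_boundary, gt_boundary, dir_logits):
    """pred_boundary, gt_boundary: (H, W) bool; dir_logits: (8, H, W)."""
    dist = torch.from_numpy(
        distance_transform_edt(~gt_boundary.numpy())).float()
    moved = torch.stack([torch.roll(dist, (-dy, -dx), dims=(0, 1))
                         for dy, dx in OFFSETS])  # distance after each move
    labels = moved.argmin(dim=0)                  # best direction per pixel
    ce = F.cross_entropy(dir_logits.unsqueeze(0), labels.unsqueeze(0),
                         reduction='none')[0]
    return ce[pred_boundary].mean()               # predicted-boundary pixels only
```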

ICRA Conference 2022 Conference Paper

Crossview Mapping with Graph-based Geolocalization on City-Scale Street Maps

  • Zhichao Ye
  • Chong Bao
  • Xinyang Liu
  • Hujun Bao
  • Zhaopeng Cui
  • Guofeng Zhang 0001

3D environment mapping has been actively studied recently with the development of autonomous driving and augmented reality. Although many image-based methods have been proposed due to their convenience and flexibility compared to other complex sensors, few works focus on fixing the inherent scale ambiguity of image-based methods and registering the reconstructed structure to the real-world 3D map, which is very important for autonomous driving. This paper presents a low-cost mapping solution that is able to refine and align the monocular reconstructed point cloud given a public street map. Specifically, we first find the association between the street map and the reconstructed point cloud structure by a novel graph-based geolocalization method. Then, optimized with the corresponding relationship, the map accuracy is significantly improved. The rich environment information can also be associated with the point cloud by the geographical location. Experiments show that our geolocalization algorithm can locate the scene on a gigantic city-scale map (173.46 km²) in two minutes and support 3D map reconstruction with absolute scale and rich environmental information from Internet videos.

NeurIPS Conference 2022 Conference Paper

Geometry-aware Two-scale PIFu Representation for Human Reconstruction

  • Zheng Dong
  • Ke Xu
  • Ziheng Duan
  • Hujun Bao
  • Weiwei Xu
  • Rynson Lau

Although PIFu-based 3D human reconstruction methods are popular, the quality of recovered details is still unsatisfactory. In a sparse (e.g., 3 RGBD sensors) capture setting, the depth noise is typically amplified in the PIFu representation, resulting in flat facial surfaces and geometry-fallible bodies. In this paper, we propose a novel geometry-aware two-scale PIFu for 3D human reconstruction from sparse, noisy inputs. Our key idea is to exploit the complementary properties of depth denoising and 3D reconstruction, for learning a two-scale PIFu representation to reconstruct high-frequency facial details and consistent bodies separately. To this end, we first formulate depth denoising and 3D reconstruction as a multi-task learning problem. The depth denoising process enriches the local geometry information of the reconstruction features, while the reconstruction process enhances depth denoising with global topology information. We then propose to learn the two-scale PIFu representation using two MLPs based on the denoised depth and geometry-aware features. Extensive experiments demonstrate the effectiveness of our approach in reconstructing facial details and bodies of different poses and its superiority over state-of-the-art methods.

NeurIPS Conference 2022 Conference Paper

OnePose++: Keypoint-Free One-Shot Object Pose Estimation without CAD Models

  • Xingyi He
  • Jiaming Sun
  • Yuang Wang
  • Di Huang
  • Hujun Bao
  • Xiaowei Zhou

We propose a new method for object pose estimation without CAD models. The previous feature-matching-based method OnePose has shown promising results under a one-shot setting which eliminates the need for CAD models or object-specific training. However, OnePose relies on detecting repeatable image keypoints and is thus prone to failure on low-textured objects. We propose a keypoint-free pose estimation pipeline to remove the need for repeatable keypoint detection. Built upon the detector-free feature matching method LoFTR, we devise a new keypoint-free SfM method to reconstruct a semi-dense point-cloud model for the object. Given a query image for object pose estimation, a 2D-3D matching network directly establishes 2D-3D correspondences between the query image and the reconstructed point-cloud model without first detecting keypoints in the image. Experiments show that the proposed pipeline outperforms existing one-shot CAD-model-free methods by a large margin and is comparable to CAD-model-based methods on LINEMOD even for low-textured objects. We also collect a new dataset composed of 80 sequences of 40 low-textured objects to facilitate future research on one-shot object pose estimation. The supplementary material, code and dataset are available on the project page: https://zju3dv.github.io/onepose_plus_plus/.

NeurIPS Conference 2022 Conference Paper

TotalSelfScan: Learning Full-body Avatars from Self-Portrait Videos of Faces, Hands, and Bodies

  • Junting Dong
  • Qi Fang
  • Yudong Guo
  • Sida Peng
  • Qing Shuai
  • Xiaowei Zhou
  • Hujun Bao

Recent advances in implicit neural representations make it possible to reconstruct a human-body model from a monocular self-rotation video. While previous works present impressive results of human body reconstruction, the quality of the reconstructed face and hands is relatively low. The main reason is that the image region occupied by these parts is very small compared to the body. To solve this problem, we propose a new approach named TotalSelfScan, which reconstructs the full-body model from several monocular self-rotation videos that focus on the face, hands, and body, respectively. Compared to recording a single video, this setting has almost no additional cost but provides more details of essential parts. To learn the full-body model, instead of encoding the whole body in a single network, we propose a multi-part representation to model separate parts and then fuse the part-specific observations into a single unified human model. Once learned, the full-body model enables rendering photorealistic free-viewpoint videos under novel human poses. Experiments show that TotalSelfScan can significantly improve the reconstruction and rendering quality on the face and hands compared to the existing methods. The code is available at https://zju3dv.github.io/TotalSelfScan.

ICRA Conference 2022 Conference Paper

VIP-SLAM: An Efficient Tightly-Coupled RGB-D Visual Inertial Planar SLAM

  • Danpeng Chen
  • Shuai Wang
  • Weijian Xie
  • Shangjin Zhai
  • Nan Wang 0020
  • Hujun Bao
  • Guofeng Zhang 0001

In this paper, we propose a tightly-coupled SLAM system fused with RGB, depth, IMU and structured plane information. Traditional sparse-point-based SLAM systems always maintain a mass of map points to model the environment. A huge number of map points brings high computational complexity, making such systems difficult to deploy on mobile devices. On the other hand, planes are common structures in man-made environments, especially indoors. We can usually use a small number of planes to represent a large scene, so the main purpose of this article is to decrease the high complexity of sparse-point-based SLAM. We build a lightweight back-end map which consists of a few planes and map points to achieve efficient bundle adjustment (BA) with equal or better accuracy. We use homography constraints to eliminate the parameters of numerous plane points in the optimization and reduce the complexity of BA. We separate the parameters and measurements in the homography and point-to-plane constraints and compress the measurements part to further improve the speed of BA. We also integrate the plane information into the whole system to realize robust planar feature extraction, data association, and globally consistent planar reconstruction. Finally, we perform an ablation study and compare our method with similar methods on simulated and real-environment data. Our system achieves obvious advantages in accuracy and efficiency. Even if the plane parameters are involved in the optimization, we effectively simplify the back-end map by using planar structures. The global bundle adjustment is nearly two times faster than the sparse-point-based SLAM algorithm.
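
The homography constraint that eliminates per-point parameters can be shown directly: a pixel known to lie on a plane is mapped into the other view by the plane-induced homography, so its reprojection residual involves only the pose and plane parameters. A numpy sketch using the standard decomposition for the plane n^T X = d in the reference frame (names illustrative):

```python
import numpy as np

def plane_point_residual(K, R, t, n, d, p_ref, p_obs):
    """Reprojection residual of a reference pixel p_ref lying on the
    plane n^T X = d; no per-point 3D parameter is required."""
    H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)
    q = H @ np.array([p_ref[0], p_ref[1], 1.0])   # map via the homography
    return q[:2] / q[2] - np.asarray(p_obs)       # 2D reprojection error
```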

IROS Conference 2021 Conference Paper

Coxgraph: Multi-Robot Collaborative, Globally Consistent, Online Dense Reconstruction System

  • Xiangyu Liu
  • Weicai Ye
  • Chaoran Tian
  • Zhaopeng Cui
  • Hujun Bao
  • Guofeng Zhang 0001

Real-time dense reconstruction has been extensively studied for its wide applications in computer vision and robotics, while much effort has also been devoted to multi-robot systems, which play an irreplaceable role in complicated but time-critical scenarios, e.g., search and rescue tasks. In this paper, we propose an efficient system named Coxgraph for multi-robot collaborative dense reconstruction in real-time. In our system, each client performs volumetric mapping in a producer-consumer manner. To facilitate transmission, we propose a compact 3D representation which transforms the SDF submap to mesh packs. During the recovery of submaps from mesh packs, the system can perform loop closure outlier rejection based on geometry consistency, trajectory collision and fitness checks. Then we develop a robust map fusion method through joint optimization of trajectories and submaps. Extensive experiments demonstrate that our system can produce a globally consistent dense map in real-time with less transmission load, and it is available as open-source software.

ICRA Conference 2020 Conference Paper

Efficient Covisibility-based Image Matching for Large-Scale SfM

  • Zhichao Ye
  • Guofeng Zhang 0001
  • Hujun Bao

Obtaining accurate and sufficient feature matches is crucial for robust large-scale Structure-from-Motion. For unordered image collections, a traditional feature matching method with geometric verification requires a huge cost to find sufficient feature matches. Although several methods have been proposed to speed up this stage, none of them makes full use of existing matches. In this paper, we propose a novel efficient image matching method by using the transitivity of region covisibility. Overlapping image pairs can be efficiently found with an iterative matching strategy even when only a few inlier feature matches exist. The experimental results on unordered image datasets demonstrate that the proposed method is three times faster than the state-of-the-art and the matching result is of sufficiently high quality for robust SfM.
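
The transitivity idea can be sketched as a queue-driven expansion of a covisibility graph: every verified pair proposes its neighbours' partners as cheap new candidates. Illustrative only; the match function and inlier threshold are placeholders.

```python
from collections import deque

def covisibility_matching(images, match_fn, seed_pairs, min_inliers=15):
    """Expand matching through covisibility transitivity: if (a, b) and
    (b, c) overlap, (a, c) becomes a new candidate pair."""
    neighbors = {i: set() for i in range(len(images))}
    tried, queue = set(), deque(seed_pairs)
    while queue:
        a, b = queue.popleft()
        pair = (min(a, b), max(a, b))
        if a == b or pair in tried:
            continue
        tried.add(pair)
        if match_fn(images[a], images[b]) >= min_inliers:  # verified overlap
            neighbors[a].add(b)
            neighbors[b].add(a)
            for c in neighbors[b] - neighbors[a] - {a}:
                queue.append((a, c))            # transitive candidates
            for c in neighbors[a] - neighbors[b] - {b}:
                queue.append((b, c))
    return neighbors                            # covisibility graph
```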

NeurIPS Conference 2019 Conference Paper

GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs

  • Yuan Liu
  • Zehong Shen
  • Zhixuan Lin
  • Sida Peng
  • Hujun Bao
  • Xiaowei Zhou

Finding local correspondences between images with different viewpoints requires local descriptors that are robust against geometric transformations. An approach for transformation invariance is to integrate out the transformations by pooling the features extracted from transformed versions of an image. However, the feature pooling may sacrifice the distinctiveness of the resulting descriptors. In this paper, we introduce a novel visual descriptor named Group Invariant Feature Transform (GIFT), which is both discriminative and robust to geometric transformations. The key idea is that the features extracted from the transformed versions of an image can be viewed as a function defined on the group of the transformations. Instead of feature pooling, we use group convolutions to exploit underlying structures of the extracted features on the group, resulting in descriptors that are both discriminative and provably invariant to the group of transformations. Extensive experiments show that GIFT outperforms state-of-the-art methods on several benchmark datasets and practically improves the performance of relative pose estimation.

IROS Conference 2019 Conference Paper

Rapid and Robust Monocular Visual-Inertial Initialization with Gravity Estimation via Vertical Edges

  • Jinyu Li 0002
  • Hujun Bao
  • Guofeng Zhang 0001

Monocular visual-inertial tracking without good initialization easily fails due to its non-linear nature. Rapid and accurate metric initialization is crucial. In this paper, we propose a novel monocular visual-inertial initialization method which can initialize the IMU states, camera poses, and scale in a rapid and robust way. To avoid mixing gravity and accelerometer bias, we propose to use the detected vertical edges to estimate a better gravity. This improves the observability of the underlying problem even without sufficient movement, so we can solve all the states crucial for a good initialization. We evaluate our approach on the EuRoC dataset and compare with existing state-of-the-art methods. The experimental results demonstrate the effectiveness of the proposed method.

ICRA Conference 2017 Conference Paper

Robust stereo matching with surface normal prediction

  • Shuangli Zhang
  • Weijian Xie
  • Guofeng Zhang 0001
  • Hujun Bao
  • Michael Kaess

Traditional stereo matching approaches generally have problems in handling textureless regions, strong occlusions and reflective regions that do not satisfy a Lambertian surface assumption. In this paper, we propose to combine surface normals predicted by deep learning to overcome these inherent difficulties in stereo matching. With reliable disparities selected from the stereo matching method and an effective edge fusion strategy, we can faithfully convert the predicted surface normal map to a disparity map by solving a least squares system which maintains discontinuity on object boundaries and continuity on other regions. Then we refine the disparity map iteratively by bilateral filtering-based completion and edge feature refinement. Experimental results on the Middlebury dataset and our own captured stereo sequences demonstrate the effectiveness of the proposed approach.
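
The normal-to-disparity conversion is a linear least-squares problem: the predicted normals constrain disparity gradients (roughly gx ≈ -nx/nz up to calibration), while the selected reliable disparities anchor absolute values. A small scipy sketch of that system, with dense loops for clarity rather than speed (all names illustrative):

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def disparity_from_gradients(gx, gy, seeds, lam=10.0):
    """gx, gy: (H, W) target disparity gradients derived from normals;
    seeds: {(y, x): disparity} reliable values from stereo matching."""
    H, W = gx.shape
    idx = lambda y, x: y * W + x
    A = lil_matrix((2 * H * W + len(seeds), H * W))
    rhs, r = [], 0
    for y in range(H):                       # horizontal gradient terms
        for x in range(W - 1):
            A[r, idx(y, x + 1)], A[r, idx(y, x)] = 1, -1
            rhs.append(gx[y, x]); r += 1
    for y in range(H - 1):                   # vertical gradient terms
        for x in range(W):
            A[r, idx(y + 1, x)], A[r, idx(y, x)] = 1, -1
            rhs.append(gy[y, x]); r += 1
    for (y, x), d in seeds.items():          # strong data terms on seeds
        A[r, idx(y, x)] = lam
        rhs.append(lam * d); r += 1
    sol = lsqr(A.tocsr()[:r], np.asarray(rhs))[0]
    return sol.reshape(H, W)
```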

IJCAI Conference 2009 Conference Paper

Locality Preserving Non-negative Matrix Factorization

  • Deng Cai
  • Xiaofei He
  • Xuanhui Wang
  • Hujun Bao
  • Jiawei Han

Matrix factorization techniques have been frequently applied in information processing tasks. Among them, Non-negative Matrix Factorization (NMF) has received considerable attention due to its psychological and physiological interpretation of naturally occurring data whose representation may be parts-based in the human brain. On the other hand, from a geometric perspective the data is usually sampled from a low-dimensional manifold embedded in a high-dimensional ambient space. One hopes then to find a compact representation which uncovers the hidden topics and simultaneously respects the intrinsic geometric structure. In this paper, we propose a novel algorithm, called Locality Preserving Non-negative Matrix Factorization (LPNMF), for this purpose. For two data points, we use KL-divergence to evaluate their similarity on the hidden topics. The optimal maps are obtained such that the feature values on hidden topics are restricted to be non-negative and vary smoothly along the geodesics of the data manifold. Our empirical study shows the encouraging results of the proposed algorithm in comparison to the state-of-the-art algorithms on two large high-dimensional databases.
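
The paper derives KL-divergence-based updates; as an illustrative analogue, the sketch below gives the standard multiplicative updates for Frobenius-norm NMF with the same kind of locality-preserving graph regularizer (a sketch, not LPNMF's exact algorithm):

```python
import numpy as np

def graph_regularized_nmf(X, W, k, lam=1.0, iters=200, eps=1e-9):
    """X: (m, n) nonnegative data; W: (n, n) symmetric affinity graph.
    Multiplicative updates for min ||X - U V^T||_F^2 + lam * Tr(V^T L V)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    U, V = rng.random((m, k)), rng.random((n, k))
    D = np.diag(W.sum(axis=1))               # graph Laplacian L = D - W
    for _ in range(iters):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V
```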

IJCAI Conference 2009 Conference Paper

Graph Embedding with Constraints

  • Xiaofei He
  • Ming Ji
  • Hujun Bao

Recently, graph-based dimensionality reduction has received a lot of interest in many fields of information processing. Central to it is a graph structure which models the geometrical and discriminant structure of the data manifold. When label information is available, it is usually incorporated into the graph structure by modifying the weights between data points. In this paper, we propose a novel dimensionality reduction algorithm, called Constrained Graph Embedding, which considers the label information as additional constraints. Specifically, we constrain the solution space that we explore to contain only embedding results that are consistent with the labels. Experimental results on two real-life data sets illustrate the effectiveness of our proposed method.

IJCAI Conference 2007 Conference Paper

Locality Sensitive Discriminant Analysis

  • Deng Cai
  • Xiaofei He
  • Kun Zhou
  • Jiawei Han
  • Hujun Bao

Linear Discriminant Analysis (LDA) is a popular data-analytic tool for studying the class relationship between data points. A major disadvantage of LDA is that it fails to discover the local geometrical structure of the data manifold. In this paper, we introduce a novel linear algorithm for discriminant analysis, called Locality Sensitive Discriminant Analysis (LSDA). When there are not sufficient training samples, local structure is generally more important than global structure for discriminant analysis. By discovering the local manifold structure, LSDA finds a projection which maximizes the margin between data points from different classes at each local area. Specifically, the data points are mapped into a subspace in which the nearby points with the same label are close to each other while the nearby points with different labels are far apart. Experiments carried out on several standard face databases show a clear improvement over the results of LDA-based recognition.
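
The LSDA construction reduces to a generalized eigenproblem over two k-NN graphs, one linking same-label neighbours and one linking different-label neighbours. A compact numpy sketch of that formulation, with brute-force neighbour search and hypothetical parameter names:

```python
import numpy as np
from scipy.linalg import eigh

def lsda(X, y, dim=2, k=5, alpha=0.5):
    """X: (n, d) data; y: (n,) labels. Returns a (d, dim) projection."""
    n = X.shape[0]
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)     # pairwise sq. distances
    Ww, Wb = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:       # k nearest neighbours
            (Ww if y[i] == y[j] else Wb)[i, j] = 1
    Ww, Wb = np.maximum(Ww, Ww.T), np.maximum(Wb, Wb.T)
    Dw, Db = np.diag(Ww.sum(1)), np.diag(Wb.sum(1))
    Lb = Db - Wb                                   # between-class Laplacian
    M = X.T @ (alpha * Lb + (1 - alpha) * Ww) @ X  # local margin to maximize
    C = X.T @ Dw @ X + 1e-6 * np.eye(X.shape[1])   # within-class constraint
    vals, vecs = eigh(M, C)                        # generalized eigenproblem
    return vecs[:, -dim:]                          # top eigenvectors
```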