Arrow Research search

Author name cluster

Yu-Shen Liu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers
2 author rows

Possible papers (36)

AAAI Conference 2026 Conference Paper

VGGS: VGGT-guided Gaussian Splatting for Efficient and Faithful Sparse-View Surface Reconstruction

  • Peng Xiang
  • Liang Han
  • Hui Zhang
  • Yu-Shen Liu
  • Zhizhong Han

Reconstructing a faithful geometric surface from sparse images remains a fundamental challenge in 3D computer vision. While recent methods have achieved remarkable progress, they still struggle to recover reliable geometry due to the lack of multi-view geometric cues, particularly in non-overlapping regions. To address this issue, we introduce VGGS, a Gaussian Splatting (GS) method that exploits multi-view geometric priors from VGGT for efficient and high-fidelity sparse-view surface reconstruction. Our primary contribution is an anchor-calibrated depth estimation scheme, which yields accurate depth maps. The insight is to align the VGGT depth prior to the underlying surface with a sparse set of multi-view consistent anchors, then infer depth for unreliable regions by relative depth estimation. Furthermore, to mitigate misalignment in complex scenes, we propose a relative depth consistency loss that penalizes the rendered depth if its relative depth relationships in local regions are inconsistent with the multi-view prior. Extensive experiments on widely used benchmarks show that VGGS surpasses state-of-the-art methods in both accuracy and efficiency, delivering 4–7× faster optimization while reducing memory consumption compared to previous GS-based approaches.
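
The relative depth consistency loss is only described qualitatively above; a minimal PyTorch sketch of one plausible form (a pairwise ranking hinge over local windows — our own illustration, not the authors' released code) looks like this:

```python
import torch

def relative_depth_consistency_loss(rendered, prior, num_pairs=4096,
                                    window=8, margin=1e-3):
    """Hinge penalty when the rendered depth ordering of two nearby pixels
    contradicts the ordering given by a prior depth map.
    rendered, prior: (H, W) tensors; rendered requires grad."""
    H, W = rendered.shape
    ys = torch.randint(0, H, (num_pairs,))
    xs = torch.randint(0, W, (num_pairs,))
    dy = torch.randint(-window, window + 1, (num_pairs,))
    dx = torch.randint(-window, window + 1, (num_pairs,))
    ys2 = (ys + dy).clamp(0, H - 1)
    xs2 = (xs + dx).clamp(0, W - 1)
    # Ordering prescribed by the prior: +1, -1, or 0 (ties are ignored).
    sign = torch.sign(prior[ys, xs] - prior[ys2, xs2])
    diff = rendered[ys, xs] - rendered[ys2, xs2]
    # Penalize pairs whose rendered ordering disagrees with the prior.
    return torch.relu(margin - sign * diff)[sign != 0].mean()

# Toy usage: a learnable depth map supervised by a fixed prior.
prior = torch.rand(64, 64)
rendered = torch.rand(64, 64, requires_grad=True)
loss = relative_depth_consistency_loss(rendered, prior)
loss.backward()
```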

AAAI Conference 2025 Conference Paper

FatesGS: Fast and Accurate Sparse-View Surface Reconstruction Using Gaussian Splatting with Depth-Feature Consistency

  • Han Huang
  • Yulun Wu
  • Chao Deng
  • Ge Gao
  • Ming Gu
  • Yu-Shen Liu

Recently, Gaussian Splatting has sparked a new trend in the field of computer vision. Apart from novel view synthesis, it has also been extended to the area of multi-view reconstruction. The latest methods facilitate complete, detailed surface reconstruction while ensuring fast training speed. However, these methods still require dense input views, and their output quality significantly degrades with sparse views. We observe that the Gaussian primitives tend to overfit the few training views, leading to noisy floaters and incomplete reconstructed surfaces. In this paper, we present an innovative sparse-view reconstruction framework that leverages intra-view depth and multi-view feature consistency to achieve remarkably accurate surface reconstruction. Specifically, we utilize monocular depth ranking information to supervise the consistency of the depth distribution within patches and employ a smoothness loss to enhance the continuity of the distribution. To achieve finer surface reconstruction, we optimize the absolute position of depth through multi-view projection features. Extensive experiments on DTU and BlendedMVS demonstrate that our method outperforms state-of-the-art methods with a speedup of 60x to 200x, achieving swift and fine-grained mesh reconstruction without the need for costly pre-training.

NeurIPS Conference 2025 Conference Paper

MaterialRefGS: Reflective Gaussian Splatting with Multi-view Consistent Material Inference

  • Wenyuan Zhang
  • Jimin Tang
  • Weiqi Zhang
  • Yi Fang
  • Yu-Shen Liu
  • Zhizhong Han

Modeling reflections from 2D images is essential for photorealistic rendering and novel view synthesis. Recent approaches enhance Gaussian primitives with reflection-related material attributes to enable physically based rendering (PBR) with Gaussian Splatting. However, the material inference often lacks sufficient constraints, especially under limited environment modeling, resulting in illumination aliasing and reduced generalization. In this work, we revisit the problem from a multi-view perspective and show that multi-view consistent material inference with more physically-based environment modeling is key to learning accurate reflections with Gaussian Splatting. To this end, we enforce 2D Gaussians to produce multi-view consistent material maps during deferred shading. We also track photometric variations across views to identify highly reflective regions, which serve as strong priors for reflection strength terms. To handle indirect illumination caused by inter-object occlusions, we further introduce an environment modeling strategy through ray tracing with 2DGS, enabling photorealistic rendering of indirect radiance. Experiments on widely used benchmarks show that our method faithfully recovers both illumination and geometry, achieving state-of-the-art rendering quality in novel view synthesis. Project Page: https://wen-yuan-zhang.github.io/MaterialRefGS.

AAAI Conference 2025 Conference Paper

Sharpening Neural Implicit Functions with Frequency Consolidation Priors

  • Chao Chen
  • Yu-Shen Liu
  • Zhizhong Han

Signed Distance Functions (SDFs) are vital implicit representations for representing high fidelity 3D surfaces. Current methods mainly leverage a neural network to learn an SDF from various supervisions including signed distances, 3D point clouds, or multi-view images. However, due to various reasons, including the bias of neural networks toward low frequency content, 3D-unaware sampling, sparsity in point clouds, or low resolutions of images, neural implicit representations still struggle to represent geometries with high frequency components like sharp structures, especially for the ones learned from images or point clouds. To overcome this challenge, we introduce a method to sharpen a low frequency SDF observation by recovering its high frequency components, pursuing a sharper and more complete surface. Our key idea is to learn a mapping from a low frequency observation to a full frequency coverage in a data-driven manner, leading to a prior knowledge of shape consolidation in the frequency domain, dubbed frequency consolidation priors. To better generalize a learned prior to unseen shapes, we propose representing frequency components as embeddings and disentangling the embedding of the low frequency component from the embedding of the full frequency component. This disentanglement allows the prior to generalize to an unseen low frequency observation by simply recovering its full frequency embedding through a test-time self-reconstruction. Our evaluations on widely used benchmarks and real scenes show that our method can recover high frequency components and produce more accurate surfaces than the latest methods.

IROS Conference 2025 Conference Paper

Socially-Aware Robot Navigation Enhanced by Bidirectional Natural Language Conversations Using Large Language Models

  • Congcong Wen
  • Yifan Liu
  • Geeta Chandra Raju Bethala
  • Shuaihang Yuan
  • Hao Huang 0003
  • Yu Hao
  • Mengyu Wang 0001
  • Yu-Shen Liu

Robotic navigation plays a pivotal role in a wide range of real-world applications. While traditional navigation systems focus on efficiency and obstacle avoidance, their inability to model complex human behaviors in shared spaces has underscored the growing need for socially aware navigation. In this work, we explore a novel paradigm of socially aware robot navigation empowered by large language models (LLMs), and propose HSAC-LLM, a hybrid framework that seamlessly integrates deep reinforcement learning with the reasoning and communication capabilities of LLMs. Unlike prior approaches that passively predict pedestrian trajectories or issue pre-scripted alerts, HSAC-LLM enables bidirectional natural language interaction, allowing robots to proactively engage in dialogue with pedestrians to resolve potential conflicts and negotiate path decisions. Extensive evaluations across 2D simulations, Gazebo environments, and real-world deployments demonstrate that HSAC-LLM consistently outperforms state-of-the-art DRL baselines under our proposed socially aware navigation metric, which covers safety, efficiency, and human comfort. By bridging linguistic reasoning and interactive motion planning, our results highlight the potential of LLM-augmented agents for robust, adaptive, and human-aligned navigation in real-world settings. Project page: https://hsacllm.github.io/.

AAAI Conference 2025 Conference Paper

Sparis: Neural Implicit Surface Reconstruction of Indoor Scenes from Sparse Views

  • Yulun Wu
  • Han Huang
  • Wenyuan Zhang
  • Chao Deng
  • Ge Gao
  • Ming Gu
  • Yu-Shen Liu

In recent years, reconstructing indoor scene geometry from multi-view images has achieved encouraging accomplishments. Current methods incorporate monocular priors into neural implicit surface models to achieve high-quality reconstructions. However, these methods require hundreds of images for scene reconstruction. When only a limited number of views are available as input, the performance of monocular priors deteriorates due to scale ambiguity, leading to the collapse of the reconstructed scene geometry. In this paper, we propose a new method, named Sparis, for indoor surface reconstruction from sparse views. Specifically, we investigate the impact of monocular priors on sparse scene reconstruction, introducing a novel prior based on inter-image matching information. Our prior offers more accurate depth information while ensuring cross-view matching consistency. Additionally, we employ an angular filter strategy and an epipolar matching weight function, aiming to reduce errors due to view matching inaccuracies, thereby refining the inter-image prior for improved reconstruction accuracy. The experiments conducted on widely used benchmarks demonstrate superior performance in sparse-view scene reconstruction.

NeurIPS Conference 2025 Conference Paper

U-CAN: Unsupervised Point Cloud Denoising with Consistency-Aware Noise2Noise Matching

  • Junsheng Zhou
  • XingYu Shi
  • Haichuan Song
  • Yi Fang
  • Yu-Shen Liu
  • Zhizhong Han

Point clouds captured by scanning sensors are often perturbed by noise, which has a highly negative impact on downstream tasks (e.g., surface reconstruction and shape understanding). Previous works mostly focus on training neural networks with noisy-clean point cloud pairs for learning denoising priors, which requires extensive manual effort. In this work, we introduce U-CAN, an Unsupervised framework for point cloud denoising with Consistency-Aware Noise2Noise matching. Specifically, we leverage a neural network to infer a multi-step denoising path for each point of a shape or scene with a noise to noise matching schema. We achieve this with a novel loss which enables statistical reasoning on multiple noisy point cloud observations. We further introduce a novel constraint on the denoised geometry consistency for learning consistency-aware denoising patterns. We justify that the proposed constraint is a general term which is not limited to the 3D domain and can also contribute to the area of 2D image denoising. Our evaluations under the widely used benchmarks in point cloud denoising, upsampling and image denoising show significant improvement over the state-of-the-art unsupervised methods, where U-CAN also produces results comparable with the supervised methods. Project page: https://gloriasze.github.io/U-CAN/.

NeurIPS Conference 2025 Conference Paper

VA-GS: Enhancing the Geometric Representation of Gaussian Splatting via View Alignment

  • Qing Li
  • Huifang Feng
  • Xun Gong
  • Yu-Shen Liu

3D Gaussian Splatting has recently emerged as an efficient solution for high-quality and real-time novel view synthesis. However, its capability for accurate surface reconstruction remains underexplored. Due to the discrete and unstructured nature of Gaussians, supervision based solely on image rendering loss often leads to inaccurate geometry and inconsistent multi-view alignment. In this work, we propose a novel method that enhances the geometric representation of 3D Gaussians through view alignment (VA). Specifically, we incorporate edge-aware image cues into the rendering loss to improve surface boundary delineation. To enforce geometric consistency across views, we introduce a visibility-aware photometric alignment loss that models occlusions and encourages accurate spatial relationships among Gaussians. To further mitigate ambiguities caused by lighting variations, we incorporate normal-based constraints to refine the spatial orientation of Gaussians and improve local surface estimation. Additionally, we leverage deep image feature embeddings to enforce cross-view consistency, enhancing the robustness of the learned geometry under varying viewpoints and illumination. Extensive experiments on standard benchmarks demonstrate that our method achieves state-of-the-art performance in both surface reconstruction and novel view synthesis. The source code is available at https://github.com/LeoQLi/VA-GS.
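
The abstract's "edge-aware image cues" are not spelled out; the sketch below shows the standard edge-aware smoothness term that phrase usually denotes, with the depth map and exponential weighting as our assumptions rather than the paper's definition:

```python
import torch

def edge_aware_smoothness(depth, image):
    """Classic edge-aware regularizer: penalize depth gradients except
    where the image itself has strong gradients (likely true boundaries).
    depth: (B, 1, H, W), image: (B, 3, H, W)."""
    dd_x = (depth[..., :, 1:] - depth[..., :, :-1]).abs()
    dd_y = (depth[..., 1:, :] - depth[..., :-1, :]).abs()
    di_x = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    di_y = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    # Depth edges are cheap where image edges are strong, costly elsewhere.
    return (dd_x * torch.exp(-di_x)).mean() + (dd_y * torch.exp(-di_y)).mean()

loss = edge_aware_smoothness(torch.rand(1, 1, 32, 32, requires_grad=True),
                             torch.rand(1, 3, 32, 32))
```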

ICRA Conference 2024 Conference Paper

3D-OAE: Occlusion Auto-Encoders for Self-Supervised Learning on Point Clouds

  • Junsheng Zhou
  • Xin Wen 0003
  • Baorui Ma
  • Yu-Shen Liu
  • Yue Gao 0002
  • Yi Fang 0006
  • Zhizhong Han

Manual annotation for large-scale point clouds is still tedious and unavailable for many harsh real-world tasks. Self-supervised learning, which is used on raw and unlabeled data to pre-train deep neural networks, is a promising approach to address this issue. Existing works usually rely on auto-encoders to establish self-supervision through a self-reconstruction schema. However, previous auto-encoders merely focus on global shapes and do not distinguish local from global geometric features. To address this problem, we present a novel and efficient self-supervised point cloud representation learning framework, named 3D Occlusion Auto-Encoder (3D-OAE), to facilitate the detailed supervision inherent in local regions and global shapes. We propose to randomly occlude some local patches of point clouds and establish the supervision via inpainting the occluded patches using the remaining ones. Specifically, we design an asymmetrical encoder-decoder architecture based on the standard Transformer, where the encoder operates only on the visible subset of patches to learn local patterns, and a lightweight decoder is designed to leverage these visible patterns to infer the missing geometries via self-attention. We find that occluding a very high proportion of the input point cloud (e.g., 75%) still yields nontrivial self-supervisory performance, which enables 3-4 times faster training while also improving accuracy. Experimental results show that our approach outperforms the state-of-the-art on a diverse range of downstream discriminative and generative tasks. Code is available at https://github.com/junshengzhou/3D-OAE.
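
A minimal sketch of the occlusion step, assuming farthest-point-sampled patch centers and kNN grouping (the usual recipe for point patches; not the released 3D-OAE code):

```python
import torch

def fps(points, num_centers):
    """Greedy farthest point sampling. points: (N, 3)."""
    N = points.shape[0]
    idx = torch.zeros(num_centers, dtype=torch.long)
    dist = torch.full((N,), float("inf"))
    for i in range(1, num_centers):
        dist = torch.minimum(dist, (points - points[idx[i - 1]]).pow(2).sum(-1))
        idx[i] = dist.argmax()
    return idx

def occlude_patches(points, num_patches=64, patch_size=32, mask_ratio=0.75):
    """Group a cloud into kNN patches around FPS centers, then hide a high
    ratio of them, as in masked/occlusion auto-encoding."""
    centers = points[fps(points, num_patches)]                   # (P, 3)
    knn = torch.cdist(centers, points).topk(patch_size, largest=False).indices
    patches = points[knn]                                        # (P, K, 3)
    perm = torch.randperm(num_patches)
    num_visible = int(num_patches * (1 - mask_ratio))
    visible, masked = perm[:num_visible], perm[num_visible:]
    return patches[visible], patches[masked]

visible, masked = occlude_patches(torch.rand(2048, 3))
print(visible.shape, masked.shape)  # (16, 32, 3) visible, (48, 32, 3) masked
```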

NeurIPS Conference 2024 Conference Paper

Binocular-Guided 3D Gaussian Splatting with View Consistency for Sparse View Synthesis

  • Liang Han
  • Junsheng Zhou
  • Yu-Shen Liu
  • Zhizhong Han

Novel view synthesis from sparse inputs is a vital yet challenging task in 3D computer vision. Previous methods explore 3D Gaussian Splatting with neural priors (e.g., depth priors) as additional supervision, demonstrating promising quality and efficiency compared to NeRF-based methods. However, the neural priors from 2D pretrained models are often noisy and blurry, and struggle to precisely guide the learning of radiance fields. In this paper, we propose a novel method for synthesizing novel views from sparse views with Gaussian Splatting that does not require external priors as supervision. Our key idea lies in exploring the self-supervision inherent in the binocular stereo consistency between each pair of binocular images constructed with disparity-guided image warping. To this end, we additionally introduce a Gaussian opacity constraint which regularizes the Gaussian locations and avoids Gaussian redundancy, improving the robustness and efficiency of inferring 3D Gaussians from sparse views. Extensive experiments on the LLFF, DTU, and Blender datasets demonstrate that our method significantly outperforms the state-of-the-art methods.
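
Disparity-guided image warping can be sketched with torch.nn.functional.grid_sample; the function below (our simplified illustration under a rectified-stereo assumption, not the paper's implementation) warps one binocular image toward the other so a photometric consistency term can be computed:

```python
import torch
import torch.nn.functional as F

def warp_by_disparity(right, disparity):
    """Warp the right image toward the left view by sampling each left
    pixel (x, y) at (x - d(x, y), y) in the right image.
    right: (B, C, H, W), disparity: (B, 1, H, W) in pixels."""
    B, _, H, W = right.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    xs = xs.float()[None] - disparity[:, 0]           # shift by disparity
    ys = ys.float()[None].expand(B, H, W)
    # Normalize pixel coordinates to [-1, 1] for grid_sample.
    grid = torch.stack([2 * xs / (W - 1) - 1, 2 * ys / (H - 1) - 1], dim=-1)
    return F.grid_sample(right, grid, align_corners=True)

left = torch.rand(1, 3, 32, 48)
right = torch.rand(1, 3, 32, 48)
disp = torch.rand(1, 1, 32, 48) * 4
photometric = (warp_by_disparity(right, disp) - left).abs().mean()
```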

NeurIPS Conference 2024 Conference Paper

DiffGS: Functional Gaussian Splatting Diffusion

  • Junsheng Zhou
  • Weiqi Zhang
  • Yu-Shen Liu

3D Gaussian Splatting (3DGS) has shown convincing performance in rendering speed and fidelity, yet the generation of Gaussian Splatting remains a challenge due to its discreteness and unstructured nature. In this work, we propose DiffGS, a general Gaussian generator based on latent diffusion models. DiffGS is a powerful and efficient 3D generative model which is capable of generating Gaussian primitives at arbitrary numbers for high-fidelity rendering with rasterization. The key insight is to represent Gaussian Splatting in a disentangled manner via three novel functions to model Gaussian probabilities, colors and transforms. Through the novel disentanglement of 3DGS, we represent the discrete and unstructured 3DGS with continuous Gaussian Splatting functions, where we then train a latent diffusion model with the target of generating these Gaussian Splatting functions both unconditionally and conditionally. Meanwhile, we introduce a discretization algorithm to extract Gaussians at arbitrary numbers from the generated functions via octree-guided sampling and optimization. We explore DiffGS for various tasks, including unconditional generation, conditional generation from text, image, and partial 3DGS, as well as Point-to-Gaussian generation. We believe that DiffGS provides a new direction for flexibly modeling and generating Gaussian Splatting. Project page: https://junshengzhou.github.io/DiffGS.

IROS Conference 2024 Conference Paper

EMBOSR: Embodied Spatial Reasoning for Enhanced Situated Question Answering in 3D Scenes

  • Yu Hao
  • Fan Yang
  • Nicholas Fang
  • Yu-Shen Liu

3D Embodied Spatial Reasoning, emphasizing an agent’s interaction with its surroundings for spatial information inference, is adeptly facilitated by the process of Situated Question Answering in 3D Scenes (SQA3D). SQA3D requires an agent to comprehend its position and orientation within a 3D scene based on a textual situation and then utilize this understanding to answer questions about the surrounding environment in that context. Previous methods in this field face substantial challenges, including a dependency on constant retraining on limited datasets, which leads to poor performance in unseen scenarios, limited expandability, and inadequate generalization. To address these challenges, we present a new embodied spatial reasoning paradigm for enhanced SQA3D, fusing the capabilities of foundation models with the chain of thought methodology. This approach is designed to elevate adaptability and scalability in a wide array of 3D environments. A new aspect of our model is the integration of a chain of thought reasoning process, which significantly augments the model’s capability for spatial reasoning and complex query handling in diverse 3D environments. In our structured experiments, we compare our approach against other methods with varying architectures, demonstrating its efficacy in multiple tasks including SQA3D and 3D captioning. We also assess the informativeness contained in the generated answers for complex queries. Ablation studies further delineate the individual contributions of our method to its overall performance. The results consistently affirm our proposed method’s effectiveness and efficiency.

AAAI Conference 2024 Conference Paper

GridFormer: Point-Grid Transformer for Surface Reconstruction

  • Shengtao Li
  • Ge Gao
  • Yudong Liu
  • Yu-Shen Liu
  • Ming Gu

Implicit neural networks have emerged as a crucial technology in 3D surface reconstruction. To reconstruct continuous surfaces from discrete point clouds, encoding the input points into regular grid features (plane or volume) has been commonly employed in existing approaches. However, these methods typically use the grid as an index for uniformly scattering point features. Compared with the irregular point features, the regular grid features may sacrifice some reconstruction details but improve efficiency. To take full advantage of these two types of features, we introduce a novel and high-efficiency attention mechanism between the grid and point features named Point-Grid Transformer (GridFormer). This mechanism treats the grid as a transfer point connecting the space and point cloud. Our method maximizes the spatial expressiveness of grid features and maintains computational efficiency. Furthermore, optimizing predictions over the entire space could potentially result in blurred boundaries. To address this issue, we further propose a boundary optimization strategy incorporating margin binary cross-entropy loss and boundary sampling. This approach enables us to achieve a more precise representation of the object structure. Our experiments validate that our method is effective and outperforms the state-of-the-art approaches under widely used benchmarks by producing more precise geometry reconstructions. The code is available at https://github.com/list17/GridFormer.

NeurIPS Conference 2024 Conference Paper

Inferring Neural Signed Distance Functions by Overfitting on Single Noisy Point Clouds through Finetuning Data-Driven based Priors

  • Chao Chen
  • Yu-Shen Liu
  • Zhizhong Han

It is important to estimate an accurate signed distance function (SDF) from a point cloud in many computer vision applications. The latest methods learn neural SDFs using either a data-driven or an overfitting-based strategy. However, these two kinds of methods suffer from either poor generalization or slow convergence, which limits their capability under challenging scenarios like highly noisy point clouds. To resolve this issue, we propose a method that combines the pros of both data-driven and overfitting-based methods for better generalization, faster inference, and higher accuracy in learning neural SDFs. We introduce a novel statistical reasoning algorithm in local regions which is able to finetune data-driven priors without signed distance supervision, clean point clouds, or point normals. This helps our method start with a good initialization and converge to a minimum much faster. Our numerical and visual comparisons with the state-of-the-art methods show our superiority over these methods in surface reconstruction and point cloud denoising on widely used shape and scene benchmarks. The code is available at https://github.com/chenchao15/LocalN2NM.

AAAI Conference 2024 Conference Paper

Learning Continuous Implicit Field with Local Distance Indicator for Arbitrary-Scale Point Cloud Upsampling

  • Shujuan Li
  • Junsheng Zhou
  • Baorui Ma
  • Yu-Shen Liu
  • Zhizhong Han

Point cloud upsampling aims to generate dense and uniformly distributed point sets from a sparse point cloud, which plays a critical role in 3D computer vision. Previous methods typically split a sparse point cloud into several local patches, upsample patch points, and merge all upsampled patches. However, these methods often produce holes, outliers or non-uniformity due to the splitting and merging process which does not maintain consistency among local patches. To address these issues, we propose a novel approach that learns an unsigned distance field guided by local priors for point cloud upsampling. Specifically, we train a local distance indicator (LDI) that predicts the unsigned distance from a query point to a local implicit surface. Utilizing the learned LDI, we learn an unsigned distance field to represent the sparse point cloud with patch consistency. At inference time, we randomly sample queries around the sparse point cloud, and project these query points onto the zero-level set of the learned implicit field to generate a dense point cloud. We justify that the implicit field is naturally continuous, which inherently enables the application of arbitrary-scale upsampling without necessarily retraining for various scales. We conduct comprehensive experiments on both synthetic data and real scans, and report state-of-the-art results under widely used benchmarks. Project page: https://lisj575.github.io/APU-LDI
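
The projection step at inference time follows directly from the definition of an unsigned distance field: move a query against the field's gradient by the predicted distance. A toy sketch, with TinyUDF as a hypothetical stand-in for the learned field:

```python
import torch

class TinyUDF(torch.nn.Module):
    """Stand-in for a learned unsigned distance field (illustrative only)."""
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(3, 64), torch.nn.ReLU(),
            torch.nn.Linear(64, 1), torch.nn.Softplus())  # non-negative output

    def forward(self, q):
        return self.net(q).squeeze(-1)

def project_to_surface(udf, queries, steps=3):
    """Move each query along the negative gradient of the UDF by the
    predicted distance: q <- q - d(q) * grad d(q) / ||grad d(q)||."""
    q = queries
    for _ in range(steps):
        q = q.detach().requires_grad_(True)
        d = udf(q)
        (grad,) = torch.autograd.grad(d.sum(), q)
        q = q - d.unsqueeze(-1) * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    return q.detach()

dense = project_to_surface(TinyUDF(), torch.rand(4096, 3))
```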

NeurIPS Conference 2024 Conference Paper

MultiPull: Detailing Signed Distance Functions by Pulling Multi-Level Queries at Multi-Step

  • Takeshi Noda
  • Chao Chen
  • Weiqi Zhang
  • Xinhai Liu
  • Yu-Shen Liu
  • Zhizhong Han

Reconstructing a continuous surface from a raw 3D point cloud is a challenging task. Latest methods employ supervised learning or pretrained priors to learn a signed distance function (SDF). However, neural networks tend to smooth local details due to the lack of ground truth signed distances or normals, which limits the performance of learning-based methods in reconstruction tasks. To resolve this issue, we propose a novel method, named MultiPull, to learn multi-scale implicit fields from raw point clouds to optimize accurate SDFs from coarse to fine. We achieve this by mapping 3D query points into a set of frequency features, which makes it possible to leverage multi-level features during optimization. Meanwhile, we introduce optimization constraints from the perspective of spatial distance and normal consistency, which play a key role in point cloud reconstruction based on multi-scale optimization strategies. Our experiments on widely used object and scene benchmarks demonstrate that our method outperforms the state-of-the-art methods in surface reconstruction.

NeurIPS Conference 2024 Conference Paper

Neural Signed Distance Function Inference through Splatting 3D Gaussians Pulled on Zero-Level Set

  • Wenyuan Zhang
  • Yu-Shen Liu
  • Zhizhong Han

It is vital to infer a signed distance function (SDF) for multi-view based surface reconstruction. 3D Gaussian splatting (3DGS) provides a novel perspective for volume rendering, and shows advantages in rendering efficiency and quality. Although 3DGS provides a promising neural rendering option, it is still hard to infer SDFs for surface reconstruction with 3DGS due to the discreteness, the sparseness, and the off-surface drift of 3D Gaussians. To resolve these issues, we propose a method that seamlessly merges 3DGS with the learning of neural SDFs. Our key idea is to more effectively constrain the SDF inference with multi-view consistency. To this end, we dynamically align 3D Gaussians on the zero-level set of the neural SDF, and then render the aligned 3D Gaussians through the differentiable rasterization. Meanwhile, we update the neural SDF by pulling neighboring space to the pulled 3D Gaussians, which progressively refines the signed distance field near the surface. With both differentiable pulling and splatting, we jointly optimize 3D Gaussians and the neural SDF with both RGB and geometry constraints, which recovers more accurate, smooth, and complete surfaces with more geometry details. Our numerical and visual comparisons show our superiority over the state-of-the-art results on the widely used benchmarks.

AAAI Conference 2024 Conference Paper

NeuSurf: On-Surface Priors for Neural Surface Reconstruction from Sparse Input Views

  • Han Huang
  • Yulun Wu
  • Junsheng Zhou
  • Ge Gao
  • Ming Gu
  • Yu-Shen Liu

Recently, neural implicit functions have demonstrated remarkable results in the field of multi-view reconstruction. However, most existing methods are tailored for dense views and exhibit unsatisfactory performance when dealing with sparse views. Several latest methods have been proposed for generalizing implicit reconstruction to address the sparse view reconstruction task, but they still suffer from high training costs and are merely valid under carefully selected perspectives. In this paper, we propose a novel sparse view reconstruction framework that leverages on-surface priors to achieve highly faithful surface reconstruction. Specifically, we design several constraints on global geometry alignment and local geometry refinement for jointly optimizing coarse shapes and fine details. To achieve this, we train a neural network to learn a global implicit field from the on-surface points obtained from SfM and then leverage it as a coarse geometric constraint. To exploit local geometric consistency, we project on-surface points onto seen and unseen views, treating the consistent loss of projected features as a fine geometric constraint. The experimental results with DTU and BlendedMVS datasets in two prevalent sparse settings demonstrate significant improvements over the state-of-the-art methods.

ICLR Conference 2024 Conference Paper

Uni3D: Exploring Unified 3D Representation at Scale

  • Junsheng Zhou
  • Jinsheng Wang
  • Baorui Ma
  • Yu-Shen Liu
  • Tiejun Huang 0003
  • Xinlong Wang

Scaling up representations for images or text has been extensively investigated in the past few years and has led to revolutions in learning vision and language. However, scalable representation for 3D objects and scenes is relatively unexplored. In this work, we present Uni3D, a 3D foundation model to explore the unified 3D representation at scale. Uni3D uses a 2D initialized ViT end-to-end pretrained to align the 3D point cloud features with the image-text aligned features. Via the simple architecture and pretext task, Uni3D can leverage abundant 2D pretrained models as initialization and image-text aligned models as the target, unlocking the great potential of 2D model zoos and scaling-up strategies to the 3D world. We efficiently scale up Uni3D to one billion parameters, and set new records on a broad range of 3D tasks, such as zero-shot classification, few-shot classification, open-world understanding and zero-shot part segmentation. We show that the strong Uni3D representation also enables applications such as 3D painting and retrieval in the wild. We believe that Uni3D provides a new direction for exploring both scaling up and efficiency of the representation in the 3D domain.

NeurIPS Conference 2024 Conference Paper

Zero-Shot Scene Reconstruction from Single Images with Deep Prior Assembly

  • Junsheng Zhou
  • Yu-Shen Liu
  • Zhizhong Han

Large language and vision models have been leading a revolution in visual computing. By greatly scaling up sizes of data and model parameters, the large models learn deep priors which lead to remarkable performance in various tasks. In this work, we present deep prior assembly, a novel framework that assembles diverse deep priors from large models for scene reconstruction from single images in a zero-shot manner. We show that this challenging task can be done without extra knowledge but just simply generalizing one deep prior in one sub-task. To this end, we introduce novel methods related to poses, scales, and occlusion parsing which are keys to enable deep priors to work together in a robust way. Deep prior assembly does not require any 3D or 2D data-driven training in the task and demonstrates superior performance in generalizing priors to open-world scenes. We conduct evaluations on various datasets, and report analysis, numerical and visual comparisons with the latest methods to show our superiority. Project page: https://junshengzhou.github.io/DeepPriorAssembly.

NeurIPS Conference 2023 Conference Paper

Differentiable Registration of Images and LiDAR Point Clouds with VoxelPoint-to-Pixel Matching

  • Junsheng Zhou
  • Baorui Ma
  • Wenyuan Zhang
  • Yi Fang
  • Yu-Shen Liu
  • Zhizhong Han

Cross-modality registration between 2D images captured by cameras and 3D point clouds from LiDARs is a crucial task in computer vision and robotics. Previous methods estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks, and use Perspective-n-Point (PnP) to estimate rigid transformation during post-processing. However, these methods struggle to map points and pixels to a shared latent space robustly since points and pixels have very different characteristics with patterns learned in different manners (MLP and CNN), and they also fail to construct supervision directly on the transformation since PnP is non-differentiable, which leads to unstable registration results. To address these problems, we propose to learn a structured cross-modality latent space to represent pixel features and 3D features via a differentiable probabilistic PnP solver. Specifically, we design a triplet network to learn VoxelPoint-to-Pixel matching, where we represent 3D elements using both voxels and points to learn the cross-modality latent space with pixels. We design both the voxel and pixel branch based on CNNs to operate convolutions on voxels/pixels represented in grids, and integrate an additional point branch to regain the information lost during voxelization. We train our framework end-to-end by imposing supervision directly on the predicted pose distribution with a probabilistic PnP solver. To explore distinctive patterns of cross-modality features, we design a novel loss with adaptive-weighted optimization for cross-modality feature description. The experimental results on KITTI and nuScenes datasets show significant improvements over the state-of-the-art methods.

AAAI Conference 2023 Conference Paper

KT-Net: Knowledge Transfer for Unpaired 3D Shape Completion

  • Zhen Cao
  • Wenxiao Zhang
  • Xin Wen
  • Zhen Dong
  • Yu-Shen Liu
  • Xiongwu Xiao
  • Bisheng Yang

Unpaired 3D object completion aims to predict a complete 3D shape from an incomplete input without knowing the correspondence between the complete and incomplete shapes. In this paper, we propose the novel KT-Net to solve this task from the new perspective of knowledge transfer. KT-Net elaborates a teacher-assistant-student network to establish multiple knowledge transfer processes. Specifically, the teacher network takes a complete shape as input and learns the knowledge of complete shapes. The student network takes the incomplete one as input and restores the corresponding complete shape. And the assistant modules not only help to transfer the knowledge of complete shapes from the teacher to the student, but also judge the learning effect of the student network. As a result, KT-Net makes use of a more comprehensive understanding to establish the geometric correspondence between complete and incomplete shapes from a perspective of knowledge transfer, which enables more detailed geometric inference for generating high-quality complete shapes. We conduct comprehensive experiments on several datasets, and the results show that our method outperforms previous methods of unpaired point cloud completion by a large margin. Code is available at https://github.com/a4152684/KT-Net.

ICML Conference 2023 Conference Paper

Learning Signed Distance Functions from Noisy 3D Point Clouds via Noise to Noise Mapping

  • Baorui Ma
  • Yu-Shen Liu
  • Zhizhong Han

Learning signed distance functions (SDFs) from 3D point clouds is an important task in 3D computer vision. However, without ground truth signed distances, point normals or clean point clouds, current methods still struggle to learn SDFs from noisy point clouds. To overcome this challenge, we propose to learn SDFs via a noise to noise mapping, which does not require any clean point cloud or ground truth supervision for training. Our novelty lies in the noise to noise mapping which can infer a highly accurate SDF of a single object or scene from its multiple or even single noisy point cloud observations. Our novel learning manner is supported by modern LiDAR systems which capture multiple noisy observations per second. We achieve this by a novel loss which enables statistical reasoning on point clouds and maintains geometric consistency although point clouds are irregular, unordered and have no point correspondence among noisy observations. Our evaluation under the widely used benchmarks demonstrates our superiority over the state-of-the-art methods in surface reconstruction, point cloud denoising and upsampling. Our code, data, and pre-trained models are available at https://github.com/mabaorui/Noise2NoiseMapping/.
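
For small point sets, the earth mover's distance that typically underlies such statistical reasoning can be computed exactly with a bipartite matching; the sketch below is our simplification (using SciPy's Hungarian solver, not the paper's formulation) of matching a denoised cloud against a second noisy observation of the same surface:

```python
import torch
from scipy.optimize import linear_sum_assignment

def emd_loss(pred, target):
    """Earth mover's distance between two equally sized point sets via an
    optimal bipartite matching (exact, but only practical for small clouds)."""
    cost = torch.cdist(pred, target)                           # (N, N)
    row, col = linear_sum_assignment(cost.detach().numpy())
    return cost[torch.as_tensor(row), torch.as_tensor(col)].mean()

# Noise2noise schema: denoise one noisy observation and match it against
# *another* noisy observation of the same surface; the minimizer of the
# expected matching cost concentrates on the underlying surface.
clean = torch.rand(256, 3)
obs_a = clean + 0.01 * torch.randn(256, 3)
obs_b = clean + 0.01 * torch.randn(256, 3)
denoised = obs_a.clone().requires_grad_(True)   # stand-in for a network output
loss = emd_loss(denoised, obs_b)
loss.backward()
```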

AAAI Conference 2023 Conference Paper

NeAF: Learning Neural Angle Fields for Point Normal Estimation

  • Shujuan Li
  • Junsheng Zhou
  • Baorui Ma
  • Yu-Shen Liu
  • Zhizhong Han

Normal estimation for unstructured point clouds is an important task in 3D computer vision. Current methods achieve encouraging results by mapping local patches to normal vectors or learning local surface fitting using neural networks. However, these methods do not generalize well to unseen scenarios and are sensitive to parameter settings. To resolve these issues, we propose an implicit function to learn an angle field around the normal of each point in the spherical coordinate system, dubbed Neural Angle Fields (NeAF). Instead of directly predicting the normal of an input point, we predict the angle offset between the ground truth normal and a randomly sampled query normal. This strategy pushes the network to observe more diverse samples, which leads to higher prediction accuracy in a more robust manner. To predict normals from the learned angle fields at inference time, we randomly sample query vectors in a unit spherical space and take the vectors with minimal angle values as the predicted normals. To further leverage the prior learned by NeAF, we propose to refine the predicted normal vectors by minimizing the angle offsets. The experimental results with synthetic data and real scans show significant improvements over the state-of-the-art under widely used benchmarks. Project page: https://lisj575.github.io/NeAF/.
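
The inference procedure described above is concrete enough to sketch: sample query normals on the unit sphere, score each with the learned angle field, and keep the minimizer. Here angle_net is a hypothetical stand-in for the trained network and patch_feat for the encoded local patch:

```python
import torch

def predict_normal(angle_net, patch_feat, num_queries=512):
    """NeAF-style inference sketch: sample random unit query normals, let
    the network score each with an angle offset, keep the smallest.
    angle_net maps [feature, query] -> predicted angle offset."""
    q = torch.randn(num_queries, 3)
    q = q / q.norm(dim=-1, keepdim=True)          # samples on the unit sphere
    feat = patch_feat.expand(num_queries, -1)
    offsets = angle_net(torch.cat([feat, q], dim=-1)).squeeze(-1)
    return q[offsets.argmin()]                    # query with the minimal angle

# Hypothetical stand-in for a trained angle-field network (64-d features).
angle_net = torch.nn.Sequential(torch.nn.Linear(64 + 3, 128),
                                torch.nn.ReLU(), torch.nn.Linear(128, 1))
normal = predict_normal(angle_net, torch.rand(1, 64))
```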

NeurIPS Conference 2023 Conference Paper

NeuralGF: Unsupervised Point Normal Estimation by Learning Neural Gradient Function

  • Qing Li
  • Huifang Feng
  • Kanle Shi
  • Yue Gao
  • Yi Fang
  • Yu-Shen Liu
  • Zhizhong Han

Normal estimation for 3D point clouds is a fundamental task in 3D geometry processing. The state-of-the-art methods rely on priors of fitting local surfaces learned from normal supervision. However, normal supervision in benchmarks comes from synthetic shapes and is usually not available from real scans, thereby limiting the learned priors of these methods. In addition, normal orientation consistency across shapes remains difficult to achieve without a separate post-processing procedure. To resolve these issues, we propose a novel method for estimating oriented normals directly from point clouds without using ground truth normals as supervision. We achieve this by introducing a new paradigm for learning neural gradient functions, which encourages the neural network to fit the input point clouds and yield unit-norm gradients at the points. Specifically, we introduce loss functions to facilitate query points to iteratively reach the moving targets and aggregate onto the approximated surface, thereby learning a global surface representation of the data. Meanwhile, we incorporate gradients into the surface approximation to measure the minimum signed deviation of queries, resulting in a consistent gradient field associated with the surface. These techniques lead to our deep unsupervised oriented normal estimator that is robust to noise, outliers and density variations. Our excellent results on widely used benchmarks demonstrate that our method can learn more accurate normals for both unoriented and oriented normal estimation tasks than the latest methods. The source code and pre-trained model are publicly available.
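
The core objective can be approximated by two terms: the network should vanish on the input points, and its gradients should have unit norm so that the normalized gradient at each point can be read off as an oriented normal. A sketch under those assumptions (the paper's full method adds moving-target and signed-deviation terms that we omit):

```python
import torch

def gradient_field_loss(f, points, queries):
    """Sketch: f should vanish on the input points while its gradient keeps
    unit norm (eikonal-style), so grad f / ||grad f|| acts as a normal."""
    p = points.detach().requires_grad_(True)
    fp = f(p)
    (grad_p,) = torch.autograd.grad(fp.sum(), p, create_graph=True)
    fit = fp.abs().mean()                              # points lie on the level set
    q = queries.detach().requires_grad_(True)
    (grad_q,) = torch.autograd.grad(f(q).sum(), q, create_graph=True)
    eikonal = (grad_q.norm(dim=-1) - 1).pow(2).mean()  # unit-norm gradients
    normals = grad_p / (grad_p.norm(dim=-1, keepdim=True) + 1e-8)
    return fit + 0.1 * eikonal, normals

net = torch.nn.Sequential(torch.nn.Linear(3, 128), torch.nn.Softplus(beta=100),
                          torch.nn.Linear(128, 1))
loss, normals = gradient_field_loss(lambda x: net(x).squeeze(-1),
                                    torch.rand(512, 3), torch.rand(512, 3))
loss.backward()
```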

ICRA Conference 2023 Conference Paper

Pyramid Learnable Tokens for 3D LiDAR Place Recognition

  • Congcong Wen
  • Hao Huang 0003
  • Yu-Shen Liu
  • Yi Fang 0006

3D LiDAR place recognition plays a vital role in various robot applications, including robotic navigation, autonomous driving, and simultaneous localization and mapping. However, most previous studies evaluated their models on accumulated 2D scans instead of real-world 3D LiDAR scans with a larger number of points, which limits their application in real scenarios. To address this limitation, we propose a point transformer network with pyramid learnable tokens (PTNet-PLT) to learn global descriptors for place recognition on actual 3D LiDAR scans. Specifically, we first present a novel shifted cube attention module that consists of a self-attention module for local feature extraction and a cross-attention module for regional feature aggregation. The self-attention module constrains attention computation on a locally partitioned cube and builds connections across cubes based on the shifted cube scheme. In addition, the cross-attention module introduces several learnable tokens to separately aggregate features of points with similar features but spatially distant into an arbitrarily shaped region, which enables the model to capture long-term dependencies of the points. Next, we build a pyramid architecture network to learn multi-scale features and involve a decreasing number of tokens at each layer to aggregate features over a larger region. Finally, we obtain the global descriptor by concatenating learned region tokens of all layers. Experiments on three datasets, including USyd Campus, Oxford RobotCar, and KITTI, demonstrate the effectiveness and generalization of the proposed model for large-scale 3D LiDAR place recognition.

NeurIPS Conference 2022 Conference Paper

HSurf-Net: Normal Estimation for 3D Point Clouds by Learning Hyper Surfaces

  • Qing Li
  • Yu-Shen Liu
  • Jin-San Cheng
  • Cheng Wang
  • Yi Fang
  • Zhizhong Han

We propose a novel normal estimation method called HSurf-Net, which can accurately predict normals from point clouds with noise and density variations. Previous methods focus on learning point weights to fit neighborhoods into a geometric surface approximated by a polynomial function with a predefined order, based on which normals are estimated. However, fitting surfaces explicitly from raw point clouds suffers from overfitting or underfitting issues caused by inappropriate polynomial orders and outliers, which significantly limits the performance of existing methods. To address these issues, we introduce hyper surface fitting to implicitly learn hyper surfaces, which are represented by multi-layer perceptron (MLP) layers that take point features as input and output surface patterns in a high dimensional feature space. We introduce a novel space transformation module, which consists of a sequence of local aggregation layers and global shift layers, to learn an optimal feature space, and a relative position encoding module to effectively convert point clouds into the learned feature space. Our model learns hyper surfaces from the noise-less features and directly predicts normal vectors. We jointly optimize the MLP weights and module parameters in a data-driven manner to make the model adaptively find the most suitable surface pattern for various points. Experimental results show that our HSurf-Net achieves the state-of-the-art performance on the synthetic shape dataset and the real-world indoor and outdoor scene datasets. The code, data and pretrained models are publicly available.

NeurIPS Conference 2022 Conference Paper

Learning Consistency-Aware Unsigned Distance Functions Progressively from Raw Point Clouds

  • Junsheng Zhou
  • Baorui Ma
  • Yu-Shen Liu
  • Yi Fang
  • Zhizhong Han

Surface reconstruction for point clouds is an important task in 3D computer vision. Most of the latest methods resolve this problem by learning signed distance functions (SDFs) from point clouds, which are limited to reconstructing shapes or scenes with closed surfaces. Other methods try to represent shapes or scenes with open surfaces using unsigned distance functions (UDFs) learned from large-scale ground truth unsigned distances. However, the learned UDF struggles to provide smooth distance fields near the surface due to the non-continuous character of point clouds. In this paper, we propose a novel method to learn consistency-aware unsigned distance functions directly from raw point clouds. We achieve this by learning to move 3D queries to reach the surface with a field consistency constraint, which also enables progressively estimating a more accurate surface. Specifically, we train a neural network to gradually infer the relationship between 3D queries and the approximated surface by searching for the moving target of queries in a dynamic way, which results in a consistent field around the surface. Meanwhile, we introduce a polygonization algorithm to extract surfaces directly from the gradient field of the learned UDF. The experimental results in surface reconstruction for synthetic and real scan data show significant improvements over the state-of-the-art under the widely used benchmarks.

ICML Conference 2021 Conference Paper

Neural-Pull: Learning Signed Distance Function from Point clouds by Learning to Pull Space onto Surface

  • Baorui Ma
  • Zhizhong Han
  • Yu-Shen Liu
  • Matthias Zwicker

Reconstructing continuous surfaces from 3D point clouds is a fundamental operation in 3D geometry processing. Several recent state-of-the-art methods address this problem using neural networks to learn signed distance functions (SDFs). In this paper, we introduce Neural-Pull, a new approach that is simple and leads to high quality SDFs. Specifically, we train a neural network to pull query 3D locations to their closest points on the surface using the predicted signed distance values and the gradient at the query locations, both of which are computed by the network itself. The pulling operation moves each query location with a stride given by the distance predicted by the network. Based on the sign of the distance, this may move the query location along or against the direction of the gradient of the SDF. This is a differentiable operation that allows us to update the signed distance value and the gradient simultaneously during training. Our results under widely used benchmarks demonstrate that we can learn SDFs more accurately and flexibly for surface reconstruction and single image reconstruction than the state-of-the-art methods. Our code and data are available at https://github.com/mabaorui/NeuralPull.
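
The pulling operation is stated explicitly above, so it translates almost line-for-line into PyTorch; the sketch below is our paraphrase of that operation, not the released code:

```python
import torch

def pull_loss(sdf, queries, surface_points):
    """Move each query by the predicted signed distance along the normalized
    SDF gradient, and ask the moved query to land on its nearest surface point."""
    q = queries.detach().requires_grad_(True)
    d = sdf(q)
    (grad,) = torch.autograd.grad(d.sum(), q, create_graph=True)
    moved = q - d.unsqueeze(-1) * grad / (grad.norm(dim=-1, keepdim=True) + 1e-8)
    nearest = surface_points[torch.cdist(q, surface_points).argmin(dim=-1)]
    return (moved - nearest).pow(2).sum(-1).mean()

sdf_net = torch.nn.Sequential(torch.nn.Linear(3, 128), torch.nn.Softplus(beta=100),
                              torch.nn.Linear(128, 1))
loss = pull_loss(lambda x: sdf_net(x).squeeze(-1),
                 torch.rand(1024, 3) * 2 - 1,      # query locations
                 torch.rand(2048, 3) * 2 - 1)      # raw point cloud
loss.backward()
```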

ICML Conference 2020 Conference Paper

DRWR: A Differentiable Renderer without Rendering for Unsupervised 3D Structure Learning from Silhouette Images

  • Zhizhong Han
  • Chao Chen
  • Yu-Shen Liu
  • Matthias Zwicker

Differentiable renderers have been used successfully for unsupervised 3D structure learning from 2D images because they can bridge the gap between 3D and 2D. To optimize 3D shape parameters, current renderers rely on pixel-wise losses between rendered images of 3D reconstructions and ground truth images from corresponding viewpoints. Hence they require interpolation of the recovered 3D structure at each pixel, visibility handling, and optionally evaluating a shading model. In contrast, here we propose a Differentiable Renderer Without Rendering (DRWR) that omits these steps. DRWR only relies on a simple but effective loss that evaluates how well the projections of reconstructed 3D point clouds cover the ground truth object silhouette. Specifically, DRWR employs a smooth silhouette loss to pull the projection of each individual 3D point inside the object silhouette, and a structure-aware repulsion loss to push each pair of projections that fall inside the silhouette far away from each other. Although we omit surface interpolation, visibility handling, and shading, our results demonstrate that DRWR achieves state-of-the-art accuracies under widely used benchmarks, outperforming previous methods both qualitatively and quantitatively. In addition, our training times are significantly lower due to the simplicity of DRWR.
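
The two loss terms can be illustrated geometrically with a non-differentiable NumPy toy (the paper's actual losses are smooth and differentiable; this only shows what each term measures):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def drwr_style_losses(proj, silhouette):
    """Toy version of the two DRWR terms: a pull that moves projected points
    toward the silhouette interior (via a distance transform of the background)
    and a repulsion that spreads points already inside.
    proj: (N, 2) pixel coordinates, silhouette: (H, W) binary mask."""
    # Distance from every pixel to the nearest silhouette pixel.
    dist_to_fg = distance_transform_edt(1 - silhouette)
    ij = np.clip(np.round(proj).astype(int), 0, np.array(silhouette.shape) - 1)
    d = dist_to_fg[ij[:, 0], ij[:, 1]]
    pull = d.mean()                                   # zero once all points are inside
    inside = proj[d == 0]
    if len(inside) < 2:
        return pull, 0.0
    pair = np.linalg.norm(inside[:, None] - inside[None, :], axis=-1)
    repulsion = np.exp(-pair[np.triu_indices(len(inside), k=1)]).mean()
    return pull, repulsion

sil = np.zeros((64, 64)); sil[16:48, 16:48] = 1
pull, rep = drwr_style_losses(np.random.rand(100, 2) * 63, sil)
```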

IJCAI Conference 2019 Conference Paper

3DViewGraph: Learning Global Features for 3D Shapes from A Graph of Unordered Views with Attention

  • Zhizhong Han
  • Xiyang Wang
  • Chi Man Vong
  • Yu-Shen Liu
  • Matthias Zwicker
  • C. L. Philip Chen

Learning global features by aggregating information over multiple views has been shown to be effective for 3D shape analysis. For view aggregation in deep learning models, pooling has been applied extensively. However, pooling leads to a loss of the content within views, and the spatial relationship among views, which limits the discriminability of learned features. We propose 3DViewGraph to resolve this issue, which learns 3D global features by more effectively aggregating unordered views with attention. Specifically, unordered views taken around a shape are regarded as view nodes on a view graph. 3DViewGraph first learns a novel latent semantic mapping to project low-level view features into meaningful latent semantic embeddings in a lower dimensional space, which is spanned by latent semantic patterns. Then, the content and spatial information of each pair of view nodes are encoded by a novel spatial pattern correlation, where the correlation is computed among latent semantic patterns. Finally, all spatial pattern correlations are integrated with attention weights learned by a novel attention mechanism. This further increases the discriminability of learned features by highlighting the unordered view nodes with distinctive characteristics and suppressing the ones with appearance ambiguity. We show that 3DViewGraph outperforms state-of-the-art methods under three large-scale benchmarks.

NeurIPS Conference 2019 Conference Paper

Fast Low-rank Metric Learning for Large-scale and High-dimensional Data

  • Han Liu
  • Zhizhong Han
  • Yu-Shen Liu
  • Ming Gu

Low-rank metric learning aims to learn better discrimination of data subject to low-rank constraints. It keeps the intrinsic low-rank structure of datasets and reduces the time cost and memory usage in metric learning. However, it is still a challenge for current methods to handle datasets with both high dimensions and large numbers of samples. To address this issue, we present a novel fast low-rank metric learning (FLRML) method. FLRML casts the low-rank metric learning problem into an unconstrained optimization on the Stiefel manifold, which can be efficiently solved by searching along the descent curves of the manifold. FLRML significantly reduces the complexity and memory usage in optimization, which makes the method scalable to both high dimensions and large numbers of samples. Furthermore, we introduce a mini-batch version of FLRML to make the method scalable to larger datasets which are hard to load and decompose in limited memory. Experimental results show that our method achieves high accuracy and is much faster than the state-of-the-art methods on several benchmarks with large numbers of high-dimensional data. Code has been made available at https://github.com/highan911/FLRML.
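
"Unconstrained optimization on the Stiefel manifold" means updating an orthonormal matrix while staying on the manifold. A generic sketch (QR retraction on a toy objective; FLRML's actual descent-curve search is more elaborate than this):

```python
import numpy as np

def stiefel_descent_step(X, grad, lr=0.1):
    """One Riemannian gradient step on the Stiefel manifold {X : X^T X = I}:
    project the Euclidean gradient onto the tangent space, step, then pull
    the result back onto the manifold with a QR retraction."""
    G = grad - X @ (X.T @ grad + grad.T @ X) / 2      # tangent-space projection
    Q, R = np.linalg.qr(X - lr * G)
    return Q * np.sign(np.diag(R))                    # sign-fix the retraction

# Toy objective: align an orthonormal basis X (d x k) with a matrix A,
# i.e. minimize -||A^T X||_F^2 / 2 over the Stiefel manifold.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 5))
X = np.linalg.qr(rng.standard_normal((50, 5)))[0]
for _ in range(100):
    grad = -A @ (A.T @ X)                             # Euclidean gradient
    X = stiefel_descent_step(X, grad)
print(np.allclose(X.T @ X, np.eye(5), atol=1e-8))     # stays on the manifold
```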

IJCAI Conference 2019 Conference Paper

Parts4Feature: Learning 3D Global Features from Generally Semantic Parts in Multiple Views

  • Zhizhong Han
  • Xinhai Liu
  • Yu-Shen Liu
  • Matthias Zwicker

Deep learning has achieved remarkable results in 3D shape analysis by learning global shape features from the pixel-level over multiple views. Previous methods, however, compute low-level features for entire views without considering part-level information. In contrast, we propose a deep neural network, called Parts4Feature, to learn 3D global features from part-level information in multiple views. We introduce a novel definition of generally semantic parts, which Parts4Feature learns to detect in multiple views from different 3D shape segmentation benchmarks. A key idea of our architecture is that it transfers the ability to detect semantically meaningful parts in multiple views to learn 3D global features. Parts4Feature achieves this by combining a local part detection branch and a global feature learning branch with a shared region proposal module. The global feature learning branch aggregates the detected parts in terms of learned part patterns with a novel multi-attention mechanism, while the region proposal module enables locally and globally discriminative information to be promoted by each other. We demonstrate that Parts4Feature outperforms the state-of-the-art under three large-scale 3D shape benchmarks.

AAAI Conference 2019 Conference Paper

Point2Sequence: Learning the Shape Representation of 3D Point Clouds with an Attention-Based Sequence to Sequence Network

  • Xinhai Liu
  • Zhizhong Han
  • Yu-Shen Liu
  • Matthias Zwicker

Exploring contextual information in the local region is important for shape understanding and analysis. Existing studies often employ hand-crafted or explicit ways to encode contextual information of local regions. However, it is hard to capture fine-grained contextual information in hand-crafted or explicit manners, such as the correlation between different areas in a local region, which limits the discriminative ability of learned features. To resolve this issue, we propose a novel deep learning model for 3D point clouds, named Point2Sequence, to learn 3D shape features by capturing fine-grained contextual information in a novel implicit way. Point2Sequence employs a novel sequence learning model for point clouds to capture the correlations by aggregating multi-scale areas of each local region with attention. Specifically, Point2Sequence first learns the feature of each area scale in a local region. Then, it captures the correlation between area scales in the process of aggregating all area scales using a recurrent neural network (RNN) based encoder-decoder structure, where an attention mechanism is proposed to highlight the importance of different area scales. Experimental results show that Point2Sequence achieves state-of-the-art performance in shape classification and segmentation tasks.

AAAI Conference 2019 Conference Paper

View Inter-Prediction GAN: Unsupervised Representation Learning for 3D Shapes by Learning Global Shape Memories to Support Local View Predictions

  • Zhizhong Han
  • Mingyang Shang
  • Yu-Shen Liu
  • Matthias Zwicker

In this paper, we present a novel unsupervised representation learning approach for 3D shapes, which is an important research challenge as it avoids the manual effort required for collecting supervised data. Our method trains an RNN-based neural network architecture to solve multiple view inter-prediction tasks for each shape. Given several nearby views of a shape, we define view inter-prediction as the task of predicting the center view between the input views, and reconstructing the input views in a low-level feature space. The key idea of our approach is to implement the shape representation as a shape-specific global memory that is shared between all local view inter-predictions for each shape. Intuitively, this memory enables the system to aggregate information that is useful to better solve the view inter-prediction tasks for each shape, and to leverage the memory as a view-independent shape representation. Our approach obtains the best results using a combination of L2 and adversarial losses for the view inter-prediction task. We show that VIP-GAN outperforms state-of-the-art methods in unsupervised 3D feature learning on three large-scale 3D shape benchmarks.

AAAI Conference 2019 Conference Paper

Y2Seq2Seq: Cross-Modal Representation Learning for 3D Shape and Text by Joint Reconstruction and Prediction of View and Word Sequences

  • Zhizhong Han
  • Mingyang Shang
  • Xiyang Wang
  • Yu-Shen Liu
  • Matthias Zwicker

Jointly learning representations of 3D shapes and text is crucial to support tasks such as cross-modal retrieval or shape captioning. A recent method employs 3D voxels to represent 3D shapes, but this limits the approach to low resolutions due to the computational cost caused by the cubic complexity of 3D voxels. Hence the method suffers from a lack of detailed geometry. To resolve this issue, we propose Y2Seq2Seq, a view-based model, to learn cross-modal representations by joint reconstruction and prediction of view and word sequences. Specifically, the network architecture of Y2Seq2Seq bridges the semantic meaning embedded in the two modalities by two coupled “Y”-like sequence-to-sequence (Seq2Seq) structures. In addition, our novel hierarchical constraints further increase the discriminability of the cross-modal representations by employing more detailed discriminative information. Experimental results on cross-modal retrieval and 3D shape captioning show that Y2Seq2Seq outperforms the state-of-the-art methods.