Arrow Research

Author name cluster

Hujun Bao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers
2 author rows

Possible papers (36)

AAAI Conference 2026 Conference Paper

One-Shot Refiner: Boosting Feed-forward Novel View Synthesis via One-Step Diffusion

  • Yitong Dong
  • Qi Zhang
  • Minchao Jiang
  • Zhiqiang Wu
  • Qingnan Fan
  • Ying Feng
  • Huaqi Zhang
  • Hujun Bao

We present a novel framework for high-fidelity novel view synthesis (NVS) from sparse images, addressing key limitations in recent feed-forward 3D Gaussian Splatting (3DGS) methods built on Vision Transformer (ViT) backbones. While ViT-based pipelines offer strong geometric priors, they are often constrained by low-resolution inputs due to computational costs. Moreover, existing generative enhancement methods tend to be 3D-agnostic, resulting in inconsistent structures across views, especially in unseen regions. To overcome these challenges, we design a Dual-Domain Detail Perception Module, which enables handling high-resolution images without being limited by the ViT backbone, and endows Gaussians with additional features to store high-frequency details. We develop a feature-guided diffusion network, which can preserve high-frequency details during the restoration process. We introduce a unified training strategy that enables joint optimization of the ViT-based geometric backbone and the diffusion-based refinement module. Experiments demonstrate that our method can maintain superior generation quality across multiple datasets.

AAAI Conference 2026 Conference Paper

StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

  • Yifan Yang
  • Zhi Cen
  • Sida Peng
  • Xiangwei Chen
  • Yifu Deng
  • Xinyu Zhu
  • Fan Jia
  • Xiaowei Zhou

This paper focuses on the task of speech-driven 3D facial animation, which aims to generate realistic and synchronized facial motions driven by speech inputs. Recent methods have employed audio-conditioned diffusion models for 3D facial animation, achieving impressive results in generating expressive and natural animations. However, these methods process the whole audio sequences in a single pass, which poses two major challenges: they tend to perform poorly when handling audio sequences that exceed the training horizon and will suffer from significant latency when processing long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that outputs facial motions in a streaming manner. This design ensures flexibility with varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides a lightweight diffusion head to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Experiments conducted on public datasets demonstrate that our approach outperforms recent baseline methods.
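
The streaming design reduces to a simple generation loop: keep a fixed-size buffer of past motion frames, fuse it with the current audio feature into a dynamic condition, and run a few denoising steps per frame. Below is a minimal sketch of that loop, assuming a caller-supplied denoising head; all names and shapes are illustrative, not the authors' code.

```python
from collections import deque
import torch

HISTORY = 8        # past motion frames kept as context (assumed size)
DENOISE_STEPS = 4  # a lightweight diffusion head needs only a few steps

def stream_facial_motion(audio_frames, denoise_head, motion_dim=64):
    """Emit one facial-motion frame per incoming audio feature."""
    history = deque(maxlen=HISTORY)
    for audio_feat in audio_frames:                     # streaming input
        past = torch.zeros(HISTORY, motion_dim)         # zero-padded context
        for i, frame in enumerate(history):
            past[i] = frame
        cond = torch.cat([past.flatten(), audio_feat])  # dynamic condition
        x = torch.randn(motion_dim)                     # start from noise
        for t in reversed(range(DENOISE_STEPS)):        # iterative denoising
            x = denoise_head(x, t, cond)
        history.append(x.detach())                      # becomes future context
        yield x                                         # emit immediately
```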

NeurIPS Conference 2025 Conference Paper

AtlasGS: Atlanta-world Guided Surface Reconstruction with Implicit Structured Gaussians

  • Xiyu Zhang
  • Chong Bao
  • Yipeng Chen
  • Hongjia Zhai
  • Yitong Dong
  • Hujun Bao
  • Zhaopeng Cui
  • Guofeng Zhang

3D reconstruction of indoor and urban environments is a prominent research topic with various downstream applications. However, existing geometric priors for addressing low-texture regions in indoor and urban settings often lack global consistency. Moreover, Gaussian Splatting and implicit SDF fields often suffer from discontinuities or exhibit computational inefficiencies, resulting in a loss of detail. To address these issues, we propose an Atlanta-world guided implicit-structured Gaussian Splatting that achieves smooth indoor and urban scene reconstruction while preserving high-frequency details and rendering efficiency. By leveraging the Atlanta-world model, we ensure accurate surface reconstruction in low-texture regions, while the proposed novel implicit-structured GS representations provide smoothness without sacrificing efficiency and high-frequency details. Specifically, we propose a semantic GS representation to predict the probability of all semantic regions and deploy a structure plane regularization with learnable plane indicators for globally accurate surface reconstruction. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches in both indoor and urban scenes, delivering superior surface reconstruction quality.

IROS Conference 2025 Conference Paper

DW-VIO: Deep Weighted Visual-Inertial Odometry

  • Guyuan Chen
  • Xiyue Guo
  • Xiaokun Pan
  • Yujun Shen
  • Guofeng Zhang 0001
  • Hujun Bao
  • Zhaopeng Cui

Visual-inertial odometry (VIO) has made significant progress in various applications. However, one of the key challenges in VIO is the efficient and robust fusion of visual and inertial measurements, particularly while mitigating the impact of sensor failures. To address this challenge, we propose a new learning-based VIO system, i.e., DW-VIO, which is able to integrate multiple sensors and provide robust state estimations. To this end, we design a novel deep learning-based data-fusion approach that dynamically associates information from multiple sensors to predict sensor weights for optimization. Moreover, in order to improve the efficiency, we present several real-time optimization techniques including a fast patch graph constructor and an efficient GPU-accelerated multi-factor bundle adjustment layer. Experimental results show that DW-VIO outperforms most state-of-the-art (SOTA) methods on the EuRoC MAV, ETH3D-SLAM, and KITTI-360 benchmarks across various challenging sequences. Additionally, it maintains a minimum of 20 frames per second (FPS) on a single RTX 3060 GPU with high-resolution input, highlighting its efficiency.
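
The fusion idea, network-predicted per-residual weights entering a weighted least-squares optimization, can be written compactly. Below is a sketch of one weighted Gauss-Newton step under that reading; the weight-prediction network is abstracted away and all names are hypothetical.

```python
import numpy as np

def weighted_gauss_newton_step(x, residual_fns, jacobian_fns, weights):
    """One step of min_x sum_i w_i * ||r_i(x)||^2, where the weights w_i
    would come from the learned data-fusion network."""
    H = np.zeros((x.size, x.size))
    b = np.zeros(x.size)
    for r_fn, J_fn, w in zip(residual_fns, jacobian_fns, weights):
        r, J = r_fn(x), J_fn(x)            # residual and its Jacobian
        H += w * J.T @ J                   # weighted normal equations
        b += w * J.T @ r
    return x - np.linalg.solve(H + 1e-8 * np.eye(x.size), b)
```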

IROS Conference 2025 Conference Paper

ETO+: Revisit the Refinement Stage in Efficient Feature Matching

  • Junjie Ni
  • Yichen Shen 0004
  • Yijin Li
  • Hongjia Zhai
  • Hujun Bao
  • Guofeng Zhang 0001

Recent feature matching approaches like ETO have focused on developing lightweight matching algorithms for real-time applications. However, their lack of cross-image feature interaction and sufficient refinement often leads to a decline in matching accuracy. To address these challenges, we propose ETO+, a novel and accurate feature matching algorithm that incorporates a lightweight yet efficient bidirectional interaction module and multi-stage refinement. Specifically, we introduce Trans-CNN, a bidirectional feature interaction module that integrates CNN- and transformer-based techniques to enhance both intra-image feature refinement and inter-image feature fusion, all while maintaining a comparable computational cost. Furthermore, by leveraging the inherent sparsity of local feature matching, we propose an efficient strategy to adaptively reallocate computational resources within the network. Additionally, we design an adaptive loss function that mitigates the impact of large matching errors, thereby improving overall robustness. Extensive experiments on widely used datasets demonstrate that our approach achieves a strong balance between accuracy and computational efficiency. It outperforms ETO by 7.9 in AUC@5 on MegaDepth while being about 40% faster than E-LoFTR.

AAAI Conference 2025 Conference Paper

GURecon: Learning Detailed 3D Geometric Uncertainties for Neural Surface Reconstruction

  • Zesong Yang
  • Ru Zhang
  • Jiale Shi
  • Zixiang Ai
  • Boming Zhao
  • Hujun Bao
  • Luwei Yang
  • Zhaopeng Cui

Neural surface representation has demonstrated remarkable success in the areas of novel view synthesis and 3D reconstruction. However, assessing the geometric quality of 3D reconstructions in the absence of ground truth mesh remains a significant challenge, due to its rendering-based optimization process and entangled learning of appearance and geometry with photometric losses. In this paper, we present a novel framework, i.e., GURecon, which establishes a geometric uncertainty field for the neural surface based on geometric consistency. Different from existing methods that rely on rendering-based measurement, GURecon models a continuous 3D uncertainty field for the reconstructed surface, and is learned by an online distillation approach without introducing real geometric information for supervision. Moreover, in order to mitigate the interference of illumination on geometric consistency, a decoupled field is learned and exploited to finetune the uncertainty field. Experiments on various datasets demonstrate the superiority of GURecon in modeling 3D geometric uncertainty, as well as its plug-and-play extension to various neural surface representations and improvement on downstream tasks such as incremental reconstruction.

ICLR Conference 2025 Conference Paper

ND-SDF: Learning Normal Deflection Fields for High-Fidelity Indoor Reconstruction

  • Ziyu Tang
  • Weicai Ye
  • Yifan Wang 0025
  • Di Huang
  • Hujun Bao
  • Tong He 0001
  • Guofeng Zhang 0001

Neural implicit reconstruction via volume rendering has demonstrated its effectiveness in recovering dense 3D surfaces. However, it is non-trivial to simultaneously recover meticulous geometry and preserve smoothness across regions with differing characteristics. To address this issue, previous methods typically employ geometric priors, which are often constrained by the performance of the prior models. In this paper, we propose ND-SDF, which learns a Normal Deflection field to represent the angular deviation between the scene normal and the prior normal. Unlike previous methods that uniformly apply geometric priors on all samples, introducing significant bias in accuracy, our proposed normal deflection field dynamically learns and adapts the utilization of samples based on their specific characteristics, thereby improving both the accuracy and effectiveness of the model. Our method not only recovers smooth, weakly textured regions such as walls and floors but also preserves the geometric details of complex structures. In addition, we introduce a novel ray sampling strategy based on the deflection angle to facilitate the unbiased rendering process, which significantly improves the quality and accuracy of intricate surfaces, especially on thin structures. Consistent improvements on various challenging datasets demonstrate the superiority of our method.

ICRA Conference 2025 Conference Paper

Neuraloc: Visual Localization in Neural Implicit Map With Dual Complementary Features

  • Hongjia Zhai
  • Boming Zhao
  • Hai Li
  • Xiaokun Pan
  • Yijia He
  • Zhaopeng Cui
  • Hujun Bao
  • Guofeng Zhang 0001

Recently, neural radiance fields (NeRF) have gained significant attention in the field of visual localization. However, existing NeRF-based approaches either lack geometric constraints or require extensive storage for feature matching, limiting their practical applications. To address these challenges, we propose an efficient and novel visual localization approach based on the neural implicit map with complementary features. Specifically, to enforce geometric constraints and reduce storage requirements, we implicitly learn a 3D keypoint descriptor field, avoiding the need to explicitly store point-wise features. To further address the semantic ambiguity of descriptors, we introduce additional semantic contextual feature fields, which enhance the quality and reliability of 2D-3D correspondences. Besides, we propose descriptor similarity distribution alignment to minimize the domain gap between 2D and 3D feature spaces during matching. Finally, we construct the matching graph using both complementary descriptors and contextual features to establish accurate 2D-3D correspondences for 6-DoF pose estimation. Compared with the recent NeRF-based approaches, our method achieves a 3× faster training speed and a 45× reduction in model storage. Extensive experiments on two widely used datasets demonstrate that our approach outperforms or is highly competitive with other state-of-the-art NeRF-based visual localization methods. Project page: https://zju3dv.github.io/neuraloc

ICLR Conference 2025 Conference Paper

Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation

  • Zhi Cen
  • Huaijin Pi
  • Sida Peng
  • Qing Shuai
  • Yujun Shen
  • Hujun Bao
  • Xiaowei Zhou 0001
  • Ruizhen Hu

This paper addresses the task of generating two-character online interactions. Previously, two main settings existed for two-character interaction generation: (1) generating one's motions based on the counterpart's complete motion sequence, and (2) jointly generating two-character motions based on specific conditions. We argue that these settings fail to model the process of real-life two-character interactions, where humans will react to their counterparts in real time and act as independent individuals. In contrast, we propose an online reaction policy, called Ready-to-React, to generate the next character pose based on past observed motions. Each character has its own reaction policy as its "brain", enabling them to interact like real humans in a streaming manner. Our policy is implemented by incorporating a diffusion head into an auto-regressive model, which can dynamically respond to the counterpart's motions while effectively mitigating the error accumulation throughout the generation process. We conduct comprehensive experiments using the challenging boxing task. Experimental results demonstrate that our method outperforms existing baselines and can generate extended motion sequences. Additionally, we show that our approach can be controlled by sparse signals, making it well-suited for VR and other online interactive environments. Code and data will be made publicly available.
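
The online policy structure amounts to a plain two-agent loop in which each character samples its next pose from its own policy, conditioned on everything observed so far. A minimal sketch with the diffusion-head sampler abstracted as a callable (names are illustrative):

```python
def simulate_interaction(policy_a, policy_b, init_a, init_b, steps=120):
    """Two independent reaction policies stepping in lock-step; each sees
    only the past motions of itself and its counterpart."""
    hist_a, hist_b = [init_a], [init_b]
    for _ in range(steps):
        # each "brain" reacts online to what it has observed so far
        next_a = policy_a(own=hist_a, other=hist_b)  # diffusion-head sample
        next_b = policy_b(own=hist_b, other=hist_a)
        hist_a.append(next_a)
        hist_b.append(next_b)
    return hist_a, hist_b
```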

ICRA Conference 2025 Conference Paper

Scalable Multi-Session Visual SLAM in Large-Scale Scenes with Subgraph Optimization

  • Xiaokun Pan
  • Zhenzhe Li
  • Tianxing Fan
  • Hongjia Zhai
  • Hujun Bao
  • Guofeng Zhang 0001

Multi-session visual SLAM systems enable 6-DoF camera localization along with long-term maintenance and expansion of the global map, by utilizing image data from different sessions. However, in large-scale environments, these systems often suffer from severe scale drift. While modern SLAM systems attempt to maintain global map consistency through loop detection and correction, they still face challenges in terms of convergence and accuracy. In this paper, we propose a robust large-scale multi-session SLAM system for long-term localization and mapping that achieves global consistency. Furthermore, to address the backend optimization problem in large-scale environments, we introduce a hierarchical optimization strategy based on the graph structure. More specifically, a subgraph structure is introduced to reduce the size of the problem while effectively propagating scale correction information. In addition, a hierarchical strategy enables coarse-to-fine updates of the graph states. Experimental results not only demonstrate that our method efficiently optimizes the pose graph and maintains map consistency in large-scale environments, but also highlight the effectiveness and scalability of the proposed approach.

ICLR Conference 2025 Conference Paper

UniRestore3D: A Scalable Framework For General Shape Restoration

  • Yuang Wang
  • Yujian Zhang
  • Sida Peng
  • Xingyi He
  • Haoyu Guo
  • Yujun Shen
  • Hujun Bao
  • Xiaowei Zhou 0001

Shape restoration aims to recover intact 3D shapes from defective ones, such as those that are incomplete, noisy, and low-resolution. Previous works have achieved impressive results in shape restoration subtasks thanks to advanced generative models. While effective for specific shape defects, they are less applicable in real-world scenarios involving multiple defect types simultaneously. Additionally, training on limited subsets of defective shapes hinders knowledge transfer across restoration types and thus affects generalization. In this paper, we address the task of general shape restoration, which restores shapes with various types of defects through a unified model, thereby naturally improving the applicability and scalability. Our approach first standardizes the data representation across different restoration subtasks using high-resolution TSDF grids and constructs a large-scale dataset with diverse types of shape defects. Next, we design an efficient hierarchical shape generation model and a noise-robust defective shape encoder that enables effective impaired shape understanding and intact shape generation. Moreover, we propose a scalable training strategy for efficient model training. The capabilities of our proposed method are demonstrated across multiple shape restoration subtasks and validated on various datasets, including Objaverse, ShapeNet, GSO, and ABO.

NeurIPS Conference 2024 Conference Paper

A Global Depth-Range-Free Multi-View Stereo Transformer Network with Pose Embedding

  • Yitong Dong
  • Yijin Li
  • Zhaoyang Huang
  • Weikang Bian
  • Jingbo Liu
  • Hujun Bao
  • Zhaopeng Cui
  • Hongsheng Li

In this paper, we propose a novel multi-view stereo (MVS) framework that gets rid of the depth range prior. Unlike recent prior-free MVS methods that work in a pair-wise manner, our method simultaneously considers all the source images. Specifically, we introduce a Multi-view Disparity Attention (MDA) module to aggregate long-range context information within and across multi-view images. Considering the asymmetry of the epipolar disparity flow, the key to our method lies in accurately modeling multi-view geometric constraints. We integrate pose embedding to encapsulate information such as multi-view camera poses, providing implicit geometric constraints for multi-view disparity feature fusion dominated by attention. Additionally, we construct corresponding hidden states for each source image due to significant differences in the observation quality of the same pixel in the reference frame across multiple source frames. We explicitly estimate the quality of the current pixel corresponding to sampled points on the epipolar line of the source image and dynamically update hidden states through the uncertainty estimation module. Extensive results on the DTU dataset and Tanks & Temples benchmark demonstrate the effectiveness of our method.

IJCAI Conference 2024 Conference Paper

Error-aware Sampling in Adaptive Shells for Neural Surface Reconstruction

  • Qi Wang
  • Yuchi Huo
  • Qi Ye
  • Rui Wang
  • Hujun Bao

Neural implicit surfaces with signed distance functions (SDFs) achieve superior quality in 3D geometry reconstruction. However, training SDFs is time-consuming because it requires a great number of samples to calculate accurate weight distributions and a considerable number of samples drawn from those distributions to integrate the rendering results. Existing sampling strategies address this problem but assume a spatially consistent convergence speed of the kernel size during training, and thus still suffer from slow convergence or errors. Instead, we introduce an error-aware sampling method based on thin intervals of valid weight distributions, dubbed adaptive shells, to reduce the number of samples while still maintaining the reconstruction accuracy. To this end, we first extend Laplace-based neural implicit surfaces with learned spatially-varying kernel sizes which indicate the range of valid weight distributions. Then, the adaptive shell for each ray is determined by an efficient double-clipping strategy with spatially-varying SDF values and kernel sizes, fitting larger kernel sizes to wider shells. Finally, we calculate the error-bounded cumulative distribution functions (CDFs) of shells to conduct efficient importance sampling, achieving low-variance rendering with fewer calculations. Extensive results in various scenes demonstrate the superiority of our sampling technique, including significantly reducing sample counts and training time, even improving the reconstruction quality. The code is available at https://github.com/erernan/ESampling.
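
The shell-restricted sampling boils down to inverse-transform sampling of a cumulative distribution built only inside the clipped interval. A numpy sketch, assuming a Laplace-style weight on the SDF with a learned kernel size beta (function and parameter names are illustrative):

```python
import numpy as np

def sample_in_shell(sdf_fn, t0, t1, beta, n_coarse=32, n_fine=16):
    """Importance-sample ray points inside an adaptive shell [t0, t1];
    sdf_fn must accept an array of ray depths."""
    t = np.linspace(t0, t1, n_coarse)
    w = np.exp(-np.abs(sdf_fn(t)) / beta)   # weight peaks near the surface
    cdf = np.cumsum(w)
    cdf /= cdf[-1]                          # normalized CDF over the shell
    u = np.random.rand(n_fine)
    return np.interp(u, cdf, t)             # inverse-transform sampling
```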

NeurIPS Conference 2024 Conference Paper

ETO: Efficient Transformer-based Local Feature Matching by Organizing Multiple Homography Hypotheses

  • Junjie Ni
  • Guofeng Zhang
  • Guanglin Li
  • Yijin Li
  • Xinyang Liu
  • Zhaoyang Huang
  • Hujun Bao

We tackle the efficiency problem of learning local feature matching. Recent advancements have given rise to purely CNN-based and transformer-based approaches, each augmented with deep learning techniques. While CNN-based methods often excel in matching speed, transformer-based methods tend to provide more accurate matches. We propose an efficient transformer-based network architecture for local feature matching. This technique is built on constructing multiple homography hypotheses to approximate the continuous correspondence in the real world and uni-directional cross-attention to accelerate the refinement. On the YFCC100M dataset, our matching accuracy is competitive with LoFTR, a state-of-the-art transformer-based architecture, while the inference speed is boosted by a factor of four, even outperforming the CNN-based methods. Comprehensive evaluations on other open datasets such as MegaDepth, ScanNet, and HPatches demonstrate our method's efficacy, highlighting its potential to significantly enhance a wide array of downstream applications.

ICRA Conference 2024 Conference Paper

From Satellite to Ground: Satellite Assisted Visual Localization with Cross-view Semantic Matching

  • Xiyue Guo
  • Haocheng Peng
  • Junjie Hu 0003
  • Hujun Bao
  • Guofeng Zhang 0001

One of the key challenges of visual Simultaneous Localization and Mapping (SLAM) in large-scale environments is how to effectively use global localization to correct the cumulative errors from long-term tracking. This challenge presents itself in two main aspects: first, the difficulty for robots in revisiting previous locations to perform loop closure, and second, the considerable memory resources required to maintain point-cloud-based global maps. Recent solutions have resorted to neural networks, using satellite images as the references for ground-level localization. However, most of these methods merely provide cross-view patch-matching results, which makes them infeasible to integrate with a SLAM system. To address these issues, we present a semantic-based cross-view localization method. This approach combines semantic information with a reward and penalty mechanism, enabling us to obtain a global probability map and achieve precise 3-degree-of-freedom (3-DoF) localization. Based on that, we develop a SLAM system that capitalizes on satellite imagery for global localization. This strategy effectively bridges the gap between SLAM and real-world coordinates while also substantially reducing accumulated errors. Our experimental results demonstrate that our global localization method significantly outperforms existing satellite-based systems. Moreover, in scenarios where the robot struggles to find loop closures, employing our localization method improves the SLAM accuracy.

ICRA Conference 2024 Conference Paper

Omnidirectional Dense SLAM for Back-to-back Fisheye Cameras

  • Weijian Xie
  • Guanyi Chu
  • Quanhao Qian
  • Yihao Yu
  • Shangjin Zhai
  • Danpeng Chen
  • Nan Wang 0020
  • Hujun Bao

We propose a real-time visual-inertial dense SLAM system that utilizes the online data streams from a back-to-back dual-fisheye camera setup, providing 360° coverage of the environment. Firstly, we employ a sliding-window-based front-end to estimate real-time poses from the binocular fisheye images and IMU data. Then, we implement a lightweight panoramic depth completion network based on a multi-basis depth representation. The network takes panoramic images (obtained by stitching dual-fisheye images with extrinsic and intrinsic parameters) and sparse depths (generated by the front-end local tracking) as input and predicts multiple depth bases along with corresponding confidences as output. The final dense depth is the linear combination of the multiple depth bases. Thanks to the multi-basis depth representation, we can continuously optimize the 360° depth with a traditional optimizer to achieve higher global consistency in depth. We conducted experiments on both simulated and real-world datasets to evaluate our method. The results demonstrate that the proposed method outperforms SoTA methods in terms of depth prediction and 3D reconstruction. In addition, we develop a demo that can run on a mobile device to demonstrate the real-time capabilities of our method.
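
Since the final dense depth is a linear combination of predicted depth bases, the decoding step is a few lines. A sketch that assumes the confidences are normalized with a per-pixel softmax, which the abstract does not spell out:

```python
import numpy as np

def combine_depth_bases(bases, confidences):
    """bases: (K, H, W) depth bases; confidences: (K, H, W) logits.
    Returns the per-pixel confidence-weighted combination, (H, W)."""
    c = np.exp(confidences - confidences.max(axis=0, keepdims=True))
    c /= c.sum(axis=0, keepdims=True)   # per-pixel softmax (assumption)
    return (c * bases).sum(axis=0)      # dense depth map
```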

AAAI Conference 2024 Conference Paper

PNeRFLoc: Visual Localization with Point-Based Neural Radiance Fields

  • Boming Zhao
  • Luwei Yang
  • Mao Mao
  • Hujun Bao
  • Zhaopeng Cui

Due to the ability to synthesize high-quality novel views, Neural Radiance Fields (NeRF) has been recently exploited to improve visual localization in a known environment. However, the existing methods mostly utilize NeRF for data augmentation to improve the regression model training, and their performances on novel viewpoints and appearances are still limited due to the lack of geometric constraints. In this paper, we propose a novel visual localization framework, i.e., PNeRFLoc, based on a unified point-based representation. On one hand, PNeRFLoc supports the initial pose estimation by matching 2D and 3D feature points as traditional structure-based methods; on the other hand, it also enables pose refinement with novel view synthesis using rendering-based optimization. Specifically, we propose a novel feature adaption module to close the gaps between the features for visual localization and neural rendering. To improve the efficacy and efficiency of neural rendering-based optimization, we also developed an efficient rendering-based framework with a warping loss function. Extensive experiments demonstrate that PNeRFLoc performs the best on the synthetic dataset when the 3D NeRF model can be well learned, and significantly outperforms all the NeRF-boosted localization methods with on-par SOTA performance on the real-world benchmark localization datasets. Project webpage: https://zju3dv.github.io/PNeRFLoc/.

IROS Conference 2023 Conference Paper

BlinkFlow: A Dataset to Push the Limits of Event-Based Optical Flow Estimation

  • Yijin Li
  • Zhaoyang Huang
  • Shuo Chen
  • Xiaoyu Shi 0002
  • Hongsheng Li 0001
  • Hujun Bao
  • Zhaopeng Cui
  • Guofeng Zhang 0001

Event cameras provide high temporal precision, low data rates, and high dynamic range visual perception, which are well-suited for optical flow estimation. While data-driven optical flow estimation has obtained great success in RGB cameras, its generalization performance is seriously hindered in event cameras mainly due to the limited and biased training data. In this paper, we present a novel simulator, BlinkSim, for the fast generation of large-scale data for event-based optical flow. BlinkSim incorporates a configurable rendering engine alongside an event simulation suite. By leveraging the wealth of current 3D assets, the rendering engine enables us to automatically build up thousands of scenes with different objects, textures, and motion patterns and render very high-frequency images for realistic event data simulation. Based on BlinkSim, we construct a large training dataset and evaluation benchmark BlinkFlow that contains sufficient, diversiform, and challenging event data with optical flow ground truth. Experiments show that BlinkFlow improves the generalization performance of state-of-the-art methods by more than 40% on average and up to 90%. Moreover, we further propose an Event-based optical Flow transFormer (E-FlowFormer) architecture. Powered by our BlinkFlow, E-FlowFormer outperforms the SOTA methods by up to 91% on the MVSEC dataset and 14% on the DSEC dataset and presents the best generalization performance. The source code and data are available at https://zju3dv.github.io/blinkflow/.

NeurIPS Conference 2023 Conference Paper

Compact Neural Volumetric Video Representations with Dynamic Codebooks

  • Haoyu Guo
  • Sida Peng
  • Yunzhi Yan
  • Linzhan Mou
  • Yujun Shen
  • Hujun Bao
  • Xiaowei Zhou

This paper addresses the challenge of representing high-fidelity volumetric videos with low storage cost. Some recent feature grid-based methods have shown superior performance of fast learning implicit neural representations from input 2D images. However, such explicit representations easily lead to large model sizes when modeling dynamic scenes. To solve this problem, our key idea is reducing the spatial and temporal redundancy of feature grids, which intrinsically exist due to the self-similarity of scenes. To this end, we propose a novel neural representation, named dynamic codebook, which first merges similar features for the model compression and then compensates for the potential decline in rendering quality by a set of dynamic codes. Experiments on the NHR and DyNeRF datasets demonstrate that the proposed approach achieves state-of-the-art rendering quality while being more storage-efficient. The source code is available at https://github.com/zju3dv/compact_vv.

NeurIPS Conference 2023 Conference Paper

CP-SLAM: Collaborative Neural Point-based SLAM System

  • Jiarui Hu
  • Mao Mao
  • Hujun Bao
  • Guofeng Zhang
  • Zhaopeng Cui

This paper presents a collaborative implicit neural simultaneous localization and mapping (SLAM) system with RGB-D image sequences, which consists of complete front-end and back-end modules including odometry, loop detection, sub-map fusion, and global refinement. In order to enable all these modules in a unified framework, we propose a novel neural point based 3D scene representation in which each point maintains a learnable neural feature for scene encoding and is associated with a certain keyframe. Moreover, a distributed-to-centralized learning strategy is proposed for the collaborative implicit SLAM to improve consistency and cooperation. A novel global optimization framework is also proposed to improve the system accuracy like traditional bundle adjustment. Experiments on various datasets demonstrate the superiority of the proposed method in both camera tracking and mapping.

ICRA Conference 2023 Conference Paper

Descriptor Distillation for Efficient Multi-Robot SLAM

  • Xiyue Guo
  • Junjie Hu 0003
  • Hujun Bao
  • Guofeng Zhang 0001

Performing accurate localization while maintaining a low communication bandwidth is an essential challenge of multi-robot simultaneous localization and mapping (MR-SLAM). In this paper, we tackle this problem by generating a compact yet discriminative feature descriptor with minimum inference time. We propose descriptor distillation that formulates the descriptor generation into a learning problem under the teacher-student framework. To achieve real-time descriptor generation, we design a compact student network and learn it by transferring the knowledge from a pre-trained large teacher model. To reduce the descriptor dimensions from the teacher to the student, we propose a novel loss function that enables the knowledge transfer between two different dimensional descriptors. The experimental results demonstrate that our model is 30% lighter than the state-of-the-art model and produces better descriptors in patch matching. Moreover, we build an MR-SLAM system based on the proposed method and show that our descriptor distillation can achieve higher localization performance for MR-SLAM with lower bandwidth.
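
The loss must bridge descriptors of different dimensions; one dimension-agnostic way to do that is relational distillation, matching the pairwise similarity structure of a batch instead of raw values. The sketch below illustrates that idea and is not necessarily the authors' exact loss:

```python
import torch
import torch.nn.functional as F

def relational_distill_loss(student_desc, teacher_desc):
    """student_desc: (B, 64), teacher_desc: (B, 128) -- shapes illustrative.
    Aligning the two cosine-similarity matrices transfers knowledge even
    though the descriptor dimensions differ."""
    s = F.normalize(student_desc, dim=1)
    t = F.normalize(teacher_desc, dim=1)
    return F.mse_loss(s @ s.T, t @ t.T)
```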

ICRA Conference 2023 Conference Paper

Perceiving Unseen 3D Objects by Poking the Objects

  • Linghao Chen
  • Yunzhou Song
  • Hujun Bao
  • Xiaowei Zhou 0001

We present a novel approach to interactive 3D object perception for robots. Unlike previous perception algorithms that rely on known object models or a large amount of annotated training data, we propose a poking-based approach that automatically discovers and reconstructs 3D objects. The poking process not only enables the robot to discover unseen 3D objects but also produces multi-view observations for 3D reconstruction of the objects. The reconstructed objects are then memorized by neural networks with regular supervised learning and can be recognized in new test images. The experiments on real-world data show that our approach can discover and reconstruct unseen 3D objects with high quality without supervision, and facilitate real-world applications such as robotic grasping. The code and supplementary materials are available at the project page: https://zju3dv.github.io/poking_perception/.

AAAI Conference 2022 Conference Paper

Active Boundary Loss for Semantic Segmentation

  • Chi Wang
  • Yunke Zhang
  • Miaomiao Cui
  • Peiran Ren
  • Yin Yang
  • Xuansong Xie
  • Xian-Sheng Hua
  • Hujun Bao

This paper proposes a novel active boundary loss for semantic segmentation. It can progressively encourage the alignment between predicted boundaries and ground-truth boundaries during end-to-end training, which is not explicitly enforced in commonly used cross-entropy loss. Based on the predicted boundaries detected from the segmentation results using current network parameters, we formulate the boundary alignment problem as a differentiable direction vector prediction problem to guide the movement of predicted boundaries in each iteration. Our loss is model-agnostic and can be plugged into the training of segmentation networks to improve the boundary details. Experimental results show that training with the active boundary loss can effectively improve the boundary F-score and mean Intersection-over-Union on challenging image and video object segmentation datasets.
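
One plausible reading of the direction-vector formulation: at every predicted-boundary pixel, the supervision is the 8-neighbour move that most reduces the distance to the ground-truth boundary, trained with cross-entropy. The sketch below follows that reading; the paper's exact construction of the direction logits may differ.

```python
import torch
import torch.nn.functional as F
from scipy.ndimage import distance_transform_edt

# 8-neighbour offsets serving as direction classes
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def active_boundary_loss(pred_boundary, gt_boundary, dir_logits):
    """pred_boundary, gt_boundary: (H, W) bool; dir_logits: (8, H, W)."""
    dist = torch.from_numpy(
        distance_transform_edt(~gt_boundary.numpy())).float()
    moved = torch.stack([torch.roll(dist, (-dy, -dx), dims=(0, 1))
                         for dy, dx in OFFSETS])  # distance after each move
    labels = moved.argmin(dim=0)                  # best direction per pixel
    ce = F.cross_entropy(dir_logits.unsqueeze(0), labels.unsqueeze(0),
                         reduction='none')[0]
    return ce[pred_boundary].mean()               # predicted-boundary pixels only
```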

ICRA Conference 2022 Conference Paper

Crossview Mapping with Graph-based Geolocalization on City-Scale Street Maps

  • Zhichao Ye
  • Chong Bao
  • Xinyang Liu
  • Hujun Bao
  • Zhaopeng Cui
  • Guofeng Zhang 0001

3D environment mapping has been actively studied recently with the development of autonomous driving and augmented reality. Although many image-based methods have been proposed due to their convenience and flexibility compared to other complex sensors, few works focus on fixing the inherent scale ambiguity of image-based methods and registering the reconstructed structure to the real-world 3D map, which is very important for autonomous driving. This paper presents a low-cost mapping solution that is able to refine and align the monocular reconstructed point cloud given a public street map. Specifically, we first find the association between the street map and the reconstructed point cloud structure by a novel graph-based geolocalization method. Then, optimized with the corresponding relationship, the map accuracy is significantly improved. The rich environment information can also be associated with the point cloud by the geographical location. Experiments show that our geolocalization algorithm can locate the scene on a gigantic city-scale map (173.46 km²) in two minutes and support 3D map reconstruction with absolute scale and rich environmental information from Internet videos.

NeurIPS Conference 2022 Conference Paper

Geometry-aware Two-scale PIFu Representation for Human Reconstruction

  • Zheng Dong
  • Ke Xu
  • Ziheng Duan
  • Hujun Bao
  • Weiwei Xu
  • Rynson Lau

Although PIFu-based 3D human reconstruction methods are popular, the quality of recovered details is still unsatisfactory. In a sparse (e.g., 3 RGBD sensors) capture setting, the depth noise is typically amplified in the PIFu representation, resulting in flat facial surfaces and geometry-fallible bodies. In this paper, we propose a novel geometry-aware two-scale PIFu for 3D human reconstruction from sparse, noisy inputs. Our key idea is to exploit the complementary properties of depth denoising and 3D reconstruction, for learning a two-scale PIFu representation to reconstruct high-frequency facial details and consistent bodies separately. To this end, we first formulate depth denoising and 3D reconstruction as a multi-task learning problem. The depth denoising process enriches the local geometry information of the reconstruction features, while the reconstruction process enhances depth denoising with global topology information. We then propose to learn the two-scale PIFu representation using two MLPs based on the denoised depth and geometry-aware features. Extensive experiments demonstrate the effectiveness of our approach in reconstructing facial details and bodies of different poses and its superiority over state-of-the-art methods.

NeurIPS Conference 2022 Conference Paper

OnePose++: Keypoint-Free One-Shot Object Pose Estimation without CAD Models

  • Xingyi He
  • Jiaming Sun
  • Yuang Wang
  • Di Huang
  • Hujun Bao
  • Xiaowei Zhou

We propose a new method for object pose estimation without CAD models. The previous feature-matching-based method OnePose has shown promising results under a one-shot setting which eliminates the need for CAD models or object-specific training. However, OnePose relies on detecting repeatable image keypoints and is thus prone to failure on low-textured objects. We propose a keypoint-free pose estimation pipeline to remove the need for repeatable keypoint detection. Built upon the detector-free feature matching method LoFTR, we devise a new keypoint-free SfM method to reconstruct a semi-dense point-cloud model for the object. Given a query image for object pose estimation, a 2D-3D matching network directly establishes 2D-3D correspondences between the query image and the reconstructed point-cloud model without first detecting keypoints in the image. Experiments show that the proposed pipeline outperforms existing one-shot CAD-model-free methods by a large margin and is comparable to CAD-model-based methods on LINEMOD even for low-textured objects. We also collect a new dataset composed of 80 sequences of 40 low-textured objects to facilitate future research on one-shot object pose estimation. The supplementary material, code and dataset are available on the project page: https://zju3dv.github.io/onepose_plus_plus/.

NeurIPS Conference 2022 Conference Paper

TotalSelfScan: Learning Full-body Avatars from Self-Portrait Videos of Faces, Hands, and Bodies

  • Junting Dong
  • Qi Fang
  • Yudong Guo
  • Sida Peng
  • Qing Shuai
  • Xiaowei Zhou
  • Hujun Bao

Recent advances in implicit neural representations make it possible to reconstruct a human-body model from a monocular self-rotation video. While previous works present impressive results of human body reconstruction, the quality of the reconstructed face and hands is relatively low. The main reason is that the image region occupied by these parts is very small compared to the body. To solve this problem, we propose a new approach named TotalSelfScan, which reconstructs the full-body model from several monocular self-rotation videos that focus on the face, hands, and body, respectively. Compared to recording a single video, this setting has almost no additional cost but provides more details of essential parts. To learn the full-body model, instead of encoding the whole body in a single network, we propose a multi-part representation to model separate parts and then fuse the part-specific observations into a single unified human model. Once learned, the full-body model enables rendering photorealistic free-viewpoint videos under novel human poses. Experiments show that TotalSelfScan can significantly improve the reconstruction and rendering quality on the face and hands compared to the existing methods. The code is available at https://zju3dv.github.io/TotalSelfScan.

ICRA Conference 2022 Conference Paper

VIP-SLAM: An Efficient Tightly-Coupled RGB-D Visual Inertial Planar SLAM

  • Danpeng Chen
  • Shuai Wang
  • Weijian Xie
  • Shangjin Zhai
  • Nan Wang 0020
  • Hujun Bao
  • Guofeng Zhang 0001

In this paper, we propose a tightly-coupled SLAM system fused with RGB, depth, IMU and structured plane information. Traditional sparse-point-based SLAM systems always maintain a mass of map points to model the environment. A huge number of map points brings high computational complexity, making such systems difficult to deploy on mobile devices. On the other hand, planes are common structures in man-made environments, especially indoors. We can usually use a small number of planes to represent a large scene, so the main purpose of this article is to decrease the high complexity of sparse-point-based SLAM. We build a lightweight back-end map which consists of a few planes and map points to achieve efficient bundle adjustment (BA) with equal or better accuracy. We use homography constraints to eliminate the parameters of numerous plane points in the optimization and reduce the complexity of BA. We separate the parameters and measurements in the homography and point-to-plane constraints and compress the measurements part to further improve the speed of BA. We also integrate the plane information into the whole system to realize robust planar feature extraction, data association, and globally consistent planar reconstruction. Finally, we perform an ablation study and compare our method with similar methods on simulated and real-environment data. Our system achieves obvious advantages in accuracy and efficiency. Even if the plane parameters are involved in the optimization, we effectively simplify the back-end map by using planar structures. The global bundle adjustment is nearly two times faster than the sparse-point-based SLAM algorithm.
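
The homography constraint that eliminates per-point parameters can be shown directly: a pixel known to lie on a plane is mapped into the other view by the plane-induced homography, so its reprojection residual involves only the pose and plane parameters. A numpy sketch using the standard decomposition for the plane n^T X = d in the reference frame (names illustrative):

```python
import numpy as np

def plane_point_residual(K, R, t, n, d, p_ref, p_obs):
    """Reprojection residual of a reference pixel p_ref lying on the
    plane n^T X = d; no per-point 3D parameter is required."""
    H = K @ (R - np.outer(t, n) / d) @ np.linalg.inv(K)
    q = H @ np.array([p_ref[0], p_ref[1], 1.0])   # map via the homography
    return q[:2] / q[2] - np.asarray(p_obs)       # 2D reprojection error
```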

IROS Conference 2021 Conference Paper

Coxgraph: Multi-Robot Collaborative, Globally Consistent, Online Dense Reconstruction System

  • Xiangyu Liu
  • Weicai Ye
  • Chaoran Tian
  • Zhaopeng Cui
  • Hujun Bao
  • Guofeng Zhang 0001

Real-time dense reconstruction has been extensively studied for its wide applications in computer vision and robotics, while much effort has also been devoted to multi-robot systems, which play an irreplaceable role in complicated but time-critical scenarios, e.g., search and rescue tasks. In this paper, we propose an efficient system named Coxgraph for multi-robot collaborative dense reconstruction in real-time. In our system, each client performs volumetric mapping in a producer-consumer manner. To facilitate transmission, we propose a compact 3D representation which transforms the SDF submap to mesh packs. During the recovery of submaps from mesh packs, the system can perform loop closure outlier rejection based on geometry consistency, trajectory collision and fitness checks. Then we develop a robust map fusion method through joint optimization of trajectories and submaps. Extensive experiments demonstrate that our system can produce a globally consistent dense map in real-time with less transmission load, and it is available as open-source software.

ICRA Conference 2020 Conference Paper

Efficient Covisibility-based Image Matching for Large-Scale SfM

  • Zhichao Ye
  • Guofeng Zhang 0001
  • Hujun Bao

Obtaining accurate and sufficient feature matches is crucial for robust large-scale Structure-from-Motion. For unordered image collections, a traditional feature matching method with geometric verification requires a huge cost to find sufficient feature matches. Although several methods have been proposed to speed up this stage, none of them makes full use of existing matches. In this paper, we propose a novel efficient image matching method by using the transitivity of region covisibility. Overlapping image pairs can be efficiently found with an iterative matching strategy even when only a few inlier feature matches exist. The experimental results on unordered image datasets demonstrate that the proposed method is three times faster than the state-of-the-art and the matching result is of sufficiently high quality for robust SfM.
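
The transitivity idea can be sketched as a queue-driven expansion of a covisibility graph: every verified pair proposes its neighbours' partners as cheap new candidates. Illustrative only; the match function and inlier threshold are placeholders.

```python
from collections import deque

def covisibility_matching(images, match_fn, seed_pairs, min_inliers=15):
    """Expand matching through covisibility transitivity: if (a, b) and
    (b, c) overlap, (a, c) becomes a new candidate pair."""
    neighbors = {i: set() for i in range(len(images))}
    tried, queue = set(), deque(seed_pairs)
    while queue:
        a, b = queue.popleft()
        pair = (min(a, b), max(a, b))
        if a == b or pair in tried:
            continue
        tried.add(pair)
        if match_fn(images[a], images[b]) >= min_inliers:  # verified overlap
            neighbors[a].add(b)
            neighbors[b].add(a)
            for c in neighbors[b] - neighbors[a] - {a}:
                queue.append((a, c))            # transitive candidates
            for c in neighbors[a] - neighbors[b] - {b}:
                queue.append((b, c))
    return neighbors                            # covisibility graph
```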

NeurIPS Conference 2019 Conference Paper

GIFT: Learning Transformation-Invariant Dense Visual Descriptors via Group CNNs

  • Yuan Liu
  • Zehong Shen
  • Zhixuan Lin
  • Sida Peng
  • Hujun Bao
  • Xiaowei Zhou

Finding local correspondences between images with different viewpoints requires local descriptors that are robust against geometric transformations. An approach for transformation invariance is to integrate out the transformations by pooling the features extracted from transformed versions of an image. However, the feature pooling may sacrifice the distinctiveness of the resulting descriptors. In this paper, we introduce a novel visual descriptor named Group Invariant Feature Transform (GIFT), which is both discriminative and robust to geometric transformations. The key idea is that the features extracted from the transformed versions of an image can be viewed as a function defined on the group of the transformations. Instead of feature pooling, we use group convolutions to exploit underlying structures of the extracted features on the group, resulting in descriptors that are both discriminative and provably invariant to the group of transformations. Extensive experiments show that GIFT outperforms state-of-the-art methods on several benchmark datasets and practically improves the performance of relative pose estimation.

IROS Conference 2019 Conference Paper

Rapid and Robust Monocular Visual-Inertial Initialization with Gravity Estimation via Vertical Edges

  • Jinyu Li 0002
  • Hujun Bao
  • Guofeng Zhang 0001

Monocular visual-inertial tracking without good initialization easily fails due to its non-linear nature. Rapid and accurate metric initialization is crucial. In this paper, we propose a novel monocular visual-inertial initialization method which can initialize the IMU states, camera poses, and scale in a rapid and robust way. To avoid mixing gravity and accelerometer bias, we propose to use the detected vertical edges to estimate a better gravity. This improves the observability of the underlying problem even without sufficient movement, so we can solve all the states crucial for a good initialization. We evaluate our approach on the EuRoC dataset and compare with existing state-of-the-art methods. The experimental results demonstrate the effectiveness of the proposed method.

ICRA Conference 2017 Conference Paper

Robust stereo matching with surface normal prediction

  • Shuangli Zhang
  • Weijian Xie
  • Guofeng Zhang 0001
  • Hujun Bao
  • Michael Kaess

Traditional stereo matching approaches generally have problems in handling textureless regions, strong occlusions and reflective regions that do not satisfy a Lambertian surface assumption. In this paper, we propose to combine surface normals predicted by deep learning to overcome these inherent difficulties in stereo matching. With reliable disparities selected from the stereo matching method and an effective edge fusion strategy, we can faithfully convert the predicted surface normal map to a disparity map by solving a least squares system which maintains discontinuity on object boundaries and continuity on other regions. Then we refine the disparity map iteratively by bilateral filtering-based completion and edge feature refinement. Experimental results on the Middlebury dataset and our own captured stereo sequences demonstrate the effectiveness of the proposed approach.
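
The normal-to-disparity conversion is a linear least-squares problem: the predicted normals constrain disparity gradients (roughly gx ≈ -nx/nz up to calibration), while the selected reliable disparities anchor absolute values. A small scipy sketch of that system, with dense loops for clarity rather than speed (all names illustrative):

```python
import numpy as np
from scipy.sparse import lil_matrix
from scipy.sparse.linalg import lsqr

def disparity_from_gradients(gx, gy, seeds, lam=10.0):
    """gx, gy: (H, W) target disparity gradients derived from normals;
    seeds: {(y, x): disparity} reliable values from stereo matching."""
    H, W = gx.shape
    idx = lambda y, x: y * W + x
    A = lil_matrix((2 * H * W + len(seeds), H * W))
    rhs, r = [], 0
    for y in range(H):                       # horizontal gradient terms
        for x in range(W - 1):
            A[r, idx(y, x + 1)], A[r, idx(y, x)] = 1, -1
            rhs.append(gx[y, x]); r += 1
    for y in range(H - 1):                   # vertical gradient terms
        for x in range(W):
            A[r, idx(y + 1, x)], A[r, idx(y, x)] = 1, -1
            rhs.append(gy[y, x]); r += 1
    for (y, x), d in seeds.items():          # strong data terms on seeds
        A[r, idx(y, x)] = lam
        rhs.append(lam * d); r += 1
    sol = lsqr(A.tocsr()[:r], np.asarray(rhs))[0]
    return sol.reshape(H, W)
```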

IJCAI Conference 2009 Conference Paper

Locality Preserving Non-negative Matrix Factorization

  • Deng Cai
  • Xiaofei He
  • Xuanhui Wang
  • Hujun Bao
  • Jiawei Han

Matrix factorization techniques have been frequently applied in information processing tasks. Among them, Non-negative Matrix Factorization (NMF) has received considerable attention due to its psychological and physiological interpretation of naturally occurring data whose representation may be parts-based in the human brain. On the other hand, from a geometric perspective the data is usually sampled from a low-dimensional manifold embedded in a high-dimensional ambient space. One hopes then to find a compact representation which uncovers the hidden topics and simultaneously respects the intrinsic geometric structure. In this paper, we propose a novel algorithm, called Locality Preserving Non-negative Matrix Factorization (LPNMF), for this purpose. For two data points, we use KL-divergence to evaluate their similarity on the hidden topics. The optimal maps are obtained such that the feature values on hidden topics are restricted to be non-negative and vary smoothly along the geodesics of the data manifold. Our empirical study shows the encouraging results of the proposed algorithm in comparison to the state-of-the-art algorithms on two large high-dimensional databases.
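
The paper derives KL-divergence-based updates; as an illustrative analogue, the sketch below gives the standard multiplicative updates for Frobenius-norm NMF with the same kind of locality-preserving graph regularizer (a sketch, not LPNMF's exact algorithm):

```python
import numpy as np

def graph_regularized_nmf(X, W, k, lam=1.0, iters=200, eps=1e-9):
    """X: (m, n) nonnegative data; W: (n, n) symmetric affinity graph.
    Multiplicative updates for min ||X - U V^T||_F^2 + lam * Tr(V^T L V)."""
    m, n = X.shape
    rng = np.random.default_rng(0)
    U, V = rng.random((m, k)), rng.random((n, k))
    D = np.diag(W.sum(axis=1))               # graph Laplacian L = D - W
    for _ in range(iters):
        U *= (X @ V) / (U @ (V.T @ V) + eps)
        V *= (X.T @ U + lam * (W @ V)) / (V @ (U.T @ U) + lam * (D @ V) + eps)
    return U, V
```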

IJCAI Conference 2009 Conference Paper

Graph Embedding with Constraints

  • Xiaofei He
  • Ming Ji
  • Hujun Bao

Recently, graph-based dimensionality reduction has received a lot of interest in many fields of information processing. Central to it is a graph structure which models the geometrical and discriminant structure of the data manifold. When label information is available, it is usually incorporated into the graph structure by modifying the weights between data points. In this paper, we propose a novel dimensionality reduction algorithm, called Constrained Graph Embedding, which considers the label information as additional constraints. Specifically, we constrain the solution space that we explore to contain only embedding results that are consistent with the labels. Experimental results on two real-life data sets illustrate the effectiveness of our proposed method.

IJCAI Conference 2007 Conference Paper

Locality Sensitive Discriminant Analysis

  • Deng Cai
  • Xiaofei He
  • Kun Zhou
  • Jiawei Han
  • Hujun Bao

Linear Discriminant Analysis (LDA) is a popular data-analytic tool for studying the class relationship between data points. A major disadvantage of LDA is that it fails to discover the local geometrical structure of the data manifold. In this paper, we introduce a novel linear algorithm for discriminant analysis, called Locality Sensitive Discriminant Analysis (LSDA). When there are not sufficient training samples, local structure is generally more important than global structure for discriminant analysis. By discovering the local manifold structure, LSDA finds a projection which maximizes the margin between data points from different classes at each local area. Specifically, the data points are mapped into a subspace in which the nearby points with the same label are close to each other while the nearby points with different labels are far apart. Experiments carried out on several standard face databases show a clear improvement over the results of LDA-based recognition.
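
The LSDA construction reduces to a generalized eigenproblem over two k-NN graphs, one linking same-label neighbours and one linking different-label neighbours. A compact numpy sketch of that formulation, with brute-force neighbour search and hypothetical parameter names:

```python
import numpy as np
from scipy.linalg import eigh

def lsda(X, y, dim=2, k=5, alpha=0.5):
    """X: (n, d) data; y: (n,) labels. Returns a (d, dim) projection."""
    n = X.shape[0]
    d2 = ((X[:, None] - X[None]) ** 2).sum(-1)     # pairwise sq. distances
    Ww, Wb = np.zeros((n, n)), np.zeros((n, n))
    for i in range(n):
        for j in np.argsort(d2[i])[1:k + 1]:       # k nearest neighbours
            (Ww if y[i] == y[j] else Wb)[i, j] = 1
    Ww, Wb = np.maximum(Ww, Ww.T), np.maximum(Wb, Wb.T)
    Dw, Db = np.diag(Ww.sum(1)), np.diag(Wb.sum(1))
    Lb = Db - Wb                                   # between-class Laplacian
    M = X.T @ (alpha * Lb + (1 - alpha) * Ww) @ X  # local margin to maximize
    C = X.T @ Dw @ X + 1e-6 * np.eye(X.shape[1])   # within-class constraint
    vals, vecs = eigh(M, C)                        # generalized eigenproblem
    return vecs[:, -dim:]                          # top eigenvectors
```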