Arrow Research

Author name cluster

Chenhang He

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers (5)

AAAI 2026 Conference Paper

BEVDilation: LiDAR-Centric Multi-Modal Fusion for 3D Object Detection

  • Guowen Zhang
  • Chenhang He
  • Liyi Chen
  • Lei Zhang

Integrating LiDAR and camera information in the bird's eye view (BEV) representation has demonstrated its effectiveness in 3D object detection. However, because of the fundamental disparity in geometric accuracy between these sensors, the indiscriminate fusion used in previous methods often degrades performance. In this paper, we propose BEVDilation, a novel LiDAR-centric framework that prioritizes LiDAR information during fusion. By formulating image BEV features as implicit guidance rather than naively concatenating them, our strategy effectively alleviates the spatial misalignment caused by image depth-estimation errors. The image guidance also helps the LiDAR-centric paradigm address the sparsity and semantic limitations of point clouds. Specifically, we propose a Sparse Voxel Dilation Block that mitigates the inherent point sparsity by densifying foreground voxels through image priors, and we introduce a Semantic-Guided BEV Dilation Block that enhances the LiDAR feature diffusion process with image semantic guidance and long-range context capture. On the challenging nuScenes benchmark, BEVDilation achieves better performance than state-of-the-art methods while maintaining competitive computational efficiency. Importantly, our LiDAR-centric strategy demonstrates greater robustness to depth noise than naive fusion.
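
The abstract turns on one design choice: image BEV features guide the LiDAR features instead of being concatenated with them, so depth errors in the image branch cannot overwrite LiDAR geometry. As a rough illustration of that idea only, here is a minimal sketch in which a hypothetical `GuidedBEVFusion` module lets the image branch gate and refine, but never replace, the LiDAR BEV map; all module names, shapes, and layer choices are assumptions, not the paper's Sparse Voxel Dilation or Semantic-Guided BEV Dilation blocks.

```python
# Minimal sketch of a LiDAR-centric BEV fusion step: image BEV features
# modulate LiDAR BEV features instead of being concatenated with them.
# Everything here is an illustrative assumption, not BEVDilation itself.
import torch
import torch.nn as nn

class GuidedBEVFusion(nn.Module):
    def __init__(self, lidar_ch: int, img_ch: int):
        super().__init__()
        # Predict a per-location guidance gate from image BEV features.
        self.guidance = nn.Sequential(
            nn.Conv2d(img_ch, lidar_ch, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )
        # Light refinement of the guided LiDAR features.
        self.refine = nn.Conv2d(lidar_ch, lidar_ch, kernel_size=3, padding=1)

    def forward(self, lidar_bev: torch.Tensor, img_bev: torch.Tensor) -> torch.Tensor:
        # LiDAR stays the primary signal; the image branch only modulates it,
        # so depth-estimation errors in img_bev cannot overwrite geometry.
        gate = self.guidance(img_bev)
        return lidar_bev + self.refine(lidar_bev * gate)

fusion = GuidedBEVFusion(lidar_ch=128, img_ch=64)
lidar_bev = torch.randn(1, 128, 180, 180)  # (B, C, H, W) LiDAR BEV map
img_bev = torch.randn(1, 64, 180, 180)     # image features lifted to BEV
out = fusion(lidar_bev, img_bev)           # same shape as lidar_bev
```

The residual form keeps the LiDAR pathway intact even when the gate is uninformative, which is one plausible reading of "LiDAR-centric" in this context.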

AAAI 2026 Conference Paper

PUFM: Efficient Point Cloud Upsampling via Flow Matching

  • Zhi-Song Liu
  • Chenhang He
  • Yakun Ju
  • Lei Li

Diffusion models have recently been adopted for point cloud upsampling due to their effectiveness in solving ill-posed problems. However, existing upsampling methods often suffer from inefficiency, as they generate dense point clouds by mapping Gaussian noise to data, overlooking the geometric information already present in the sparse input. To address this, we propose PUFM, a novel method for Point cloud Upsampling via Flow Matching, which learns to directly transform sparse point clouds into their high-fidelity dense counterparts. Our approach first applies midpoint interpolation to densify the sparse input. We then construct a continuous interpolant between the sparse and dense point clouds and train a neural network to estimate the velocity field for flow matching. Given the unordered nature of point clouds, we introduce a pre-alignment step based on Earth Mover's Distance (EMD) optimization to ensure coherent and meaningful interpolation between the sparse and dense representations, yielding a more stable and efficient learning trajectory during flow matching. Experiments on synthetic benchmarks demonstrate that our method delivers superior upsampling quality with fewer sampling steps. Further experiments on ScanNet and KITTI show that our approach generalizes well to real-world RGB-D and LiDAR point clouds, making it practical for real-world applications.
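
The training recipe the abstract outlines (interpolate between sparse and dense clouds, regress the velocity field) has a compact flow-matching form. The toy sketch below assumes the EMD pre-alignment has already matched the densified input x0 point-wise to the dense target x1, and substitutes a small MLP for the paper's actual network; everything here is illustrative, not PUFM's implementation.

```python
# Toy flow-matching objective: build a straight-line interpolant between a
# midpoint-densified sparse cloud x0 and its dense counterpart x1, then
# regress the constant velocity x1 - x0. Shapes and the MLP are assumptions.
import torch
import torch.nn as nn

def midpoint_densify(points: torch.Tensor) -> torch.Tensor:
    # Crude stand-in for the paper's midpoint interpolation: insert the
    # midpoint between each pair of consecutive points.
    mids = 0.5 * (points[:, :-1] + points[:, 1:])
    return torch.cat([points, mids], dim=1)

class VelocityField(nn.Module):
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )
    def forward(self, x: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # Condition each point on the flow time t.
        t = t.expand(x.shape[0], x.shape[1], 1)
        return self.net(torch.cat([x, t], dim=-1))

model = VelocityField()
x0 = midpoint_densify(torch.randn(4, 512, 3))  # densified sparse input, 1023 pts
x1 = torch.randn(4, 1023, 3)                   # dense target, assumed EMD-aligned to x0

t = torch.rand(4, 1, 1)                        # random flow time in [0, 1]
xt = (1 - t) * x0 + t * x1                     # linear interpolant
target_v = x1 - x0                             # velocity of the straight path
loss = ((model(xt, t) - target_v) ** 2).mean() # flow-matching regression loss
```

Because the path starts at the densified input rather than at Gaussian noise, few integration steps are needed at inference time, which is the efficiency argument the abstract makes.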

ICLR 2025 Conference Paper

Continuous Autoregressive Modeling with Stochastic Monotonic Alignment for Speech Synthesis

  • Weiwei Lin 0002
  • Chenhang He

We propose a novel autoregressive modeling approach for speech synthesis, combining a variational autoencoder (VAE) with a multi-modal latent space and an autoregressive model that uses Gaussian Mixture Models (GMMs) as the conditional probability distribution. Unlike previous methods that rely on residual vector quantization, our model leverages continuous speech representations from the VAE's latent space, greatly simplifying the training and inference pipelines. We also introduce a stochastic monotonic alignment mechanism to enforce strict monotonic alignments. Our approach significantly outperforms the state-of-the-art autoregressive model VALL-E in both subjective and objective evaluations, achieving these results with only 10.3% of VALL-E's parameters. This demonstrates the potential of continuous speech language models as a more efficient alternative to existing quantization-based speech language models. Sample audio can be found at https://tinyurl.com/gmm-lm-tts.
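
The core modeling idea, predicting a Gaussian mixture over the next continuous VAE latent frame rather than discrete codec tokens, can be sketched as a mixture-density output head. The `GMMHead` name, all dimensions, and the backbone-free setup below are assumptions for illustration; the stochastic monotonic alignment mechanism is not shown.

```python
# Hedged sketch: an output head that parameterizes a Gaussian mixture over
# the next continuous latent frame, trained by negative log-likelihood.
import torch
import torch.nn as nn
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

class GMMHead(nn.Module):
    def __init__(self, d_model: int, d_latent: int, n_mix: int = 8):
        super().__init__()
        self.n_mix, self.d_latent = n_mix, d_latent
        # Per mixture component: 1 weight logit + mean and std per latent dim.
        self.proj = nn.Linear(d_model, n_mix * (1 + 2 * d_latent))

    def forward(self, h: torch.Tensor) -> MixtureSameFamily:
        # h: (B, T, d_model) hidden states from an autoregressive backbone.
        p = self.proj(h).view(*h.shape[:-1], self.n_mix, 1 + 2 * self.d_latent)
        logits = p[..., 0]                          # mixture weights
        mean = p[..., 1:1 + self.d_latent]          # component means
        std = p[..., 1 + self.d_latent:].exp()      # positive component stds
        comp = Independent(Normal(mean, std), 1)    # diagonal Gaussians
        return MixtureSameFamily(Categorical(logits=logits), comp)

head = GMMHead(d_model=256, d_latent=32)
h = torch.randn(2, 100, 256)             # decoder states for 100 frames
z_next = torch.randn(2, 100, 32)         # teacher-forced next latent frames
nll = -head(h).log_prob(z_next).mean()   # training objective
```

Sampling from the returned distribution at each step yields the next continuous latent frame directly, with no codebook lookup, which is where the simplified pipeline claimed in the abstract comes from.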

NeurIPS 2024 Conference Paper

Voxel Mamba: Group-Free State Space Models for Point Cloud based 3D Object Detection

  • Guowen Zhang
  • Lue Fan
  • Chenhang He
  • Zhen Lei
  • Zhaoxiang Zhang
  • Lei Zhang

Serialization-based methods, which serialize 3D voxels and group them into multiple sequences before inputting them to Transformers, have demonstrated their effectiveness in 3D object detection. However, serializing 3D voxels into 1D sequences inevitably sacrifices voxel spatial proximity. This issue is hard to address by enlarging the group size in existing serialization-based methods, due to the quadratic complexity of Transformers with respect to sequence length. Inspired by recent advances in state space models (SSMs), we present a Voxel SSM, termed Voxel Mamba, which employs a group-free strategy to serialize the whole space of voxels into a single sequence. The linear complexity of SSMs makes this group-free design feasible, alleviating the loss of spatial proximity among voxels. To further enhance spatial proximity, we propose a Dual-scale SSM Block to establish a hierarchical structure, enabling a larger receptive field in the 1D serialization curve as well as more complete local regions in 3D space. Moreover, we implicitly apply window partitioning under the group-free framework through positional encoding, which further enhances spatial proximity by encoding voxel positional information. Our experiments on the Waymo Open Dataset and the nuScenes dataset show that Voxel Mamba not only achieves higher accuracy than state-of-the-art methods but also demonstrates significant advantages in computational efficiency. The source code is available at https://github.com/gwenzhang/Voxel-Mamba.
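
The group-free idea above reduces to ordering every voxel in the scene by a single space-filling-curve key and handing the whole sequence to a linear-cost SSM, with no grouping step. The sketch below uses a Morton (Z-order) key purely as a stand-in curve; the paper's actual serialization curve and Dual-scale SSM Block are not reproduced here, and all names are illustrative.

```python
# Sketch of group-free serialization: one space-filling-curve key orders
# all sparse voxels, producing a single sequence for a linear-time model.
import torch

def morton_key(coords: torch.Tensor, bits: int = 10) -> torch.Tensor:
    # coords: (N, 3) integer voxel coordinates in [0, 2**bits).
    # Interleave the bits of x, y, z so that nearby voxels tend to get
    # nearby keys (a Z-order curve).
    key = torch.zeros(coords.shape[0], dtype=torch.long)
    for b in range(bits):
        for axis in range(3):
            key |= ((coords[:, axis] >> b) & 1) << (3 * b + axis)
    return key

coords = torch.randint(0, 1024, (5000, 3))  # sparse voxel coordinates
feats = torch.randn(5000, 128)              # per-voxel features
order = torch.argsort(morton_key(coords))   # one curve over the whole space
sequence = feats[order]                     # single (N, C) sequence, no groups
```

Because the SSM's cost is linear in N, the entire scene can be consumed as one sequence; a Transformer would pay quadratic cost on the same input, which is why prior methods had to split voxels into groups in the first place.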

ICML 2023 Conference Paper

Self-supervised Neural Factor Analysis for Disentangling Utterance-level Speech Representations

  • Weiwei Lin 0002
  • Chenhang He
  • Man-Wai Mak
  • Youzhi Tu

Self-supervised learning (SSL) speech models such as wav2vec and HuBERT have demonstrated state-of-the-art performance on automatic speech recognition (ASR) and have proved extremely useful in settings with limited labeled data. However, the success of SSL models has yet to transfer to utterance-level tasks such as speaker, emotion, and language recognition, which still require supervised fine-tuning of the SSL models to obtain good performance. We argue that the problem is caused by the lack of disentangled representations and of an utterance-level learning objective for these tasks. Inspired by how HuBERT uses clustering to discover hidden acoustic units, we formulate a factor analysis (FA) model that uses the discovered hidden acoustic units to align the SSL features. The underlying utterance-level representations are disentangled using probabilistic inference on the aligned features. Furthermore, the variational lower bound derived from the FA model provides an utterance-level objective, allowing error gradients to be backpropagated to the Transformer layers to learn highly discriminative acoustic units. When used in conjunction with HuBERT's masked-prediction training, our models outperform the current best model, WavLM, on all utterance-level non-semantic tasks on the SUPERB benchmark with only 20% of the labeled data.
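
One way to read the abstract's factor-analysis model is as a linear-Gaussian generative model: each unit-aligned frame is a unit-dependent mean plus a shared utterance-level factor, and the disentangled utterance representation is that factor's Gaussian posterior. The sketch below works out only this inference step, with assumed dimensions and randomly initialized parameters standing in for learned ones; it is not the authors' training code.

```python
# Linear-Gaussian FA sketch: x_t ~ m[c_t] + T @ w + noise, where c_t is the
# frame's acoustic unit and w is a shared utterance-level factor. We compute
# the Gaussian posterior of w in closed form. All values are illustrative.
import torch

D, K, Q, Tlen = 64, 50, 16, 200          # feature dim, #units, factor dim, frames
x = torch.randn(Tlen, D)                 # unit-aligned SSL frame features
units = torch.randint(0, K, (Tlen,))     # hidden acoustic unit per frame
m = torch.randn(K, D)                    # per-unit means (learned in practice)
T = 0.1 * torch.randn(D, Q)              # loading matrix (learned in practice)
noise_prec = torch.full((D,), 4.0)       # diagonal noise precision Lambda

# Posterior of w given all frames of the utterance:
#   precision = I + sum_t T' Lambda T
#   mean      = precision^{-1} @ sum_t T' Lambda (x_t - m[c_t])
centered = x - m[units]                       # remove unit-dependent means
TtLam = T.t() * noise_prec                    # (Q, D) = T' Lambda
post_prec = torch.eye(Q) + Tlen * (TtLam @ T) # every frame contributes T' Lambda T
post_mean = torch.linalg.solve(post_prec, TtLam @ centered.sum(0))
# post_mean is the disentangled utterance-level representation; the ELBO of
# this model supplies the utterance-level objective described in the abstract.
```

Aligning frames to discovered units before inference is what removes unit (content) variability from the posterior, leaving factors such as speaker identity, which matches the disentanglement argument in the abstract.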