Arrow Research search

Author name cluster

Kaichen Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers

13

AAAI Conference 2026 Conference Paper

Empowering Sparse-Input Neural Radiance Fields with Dual-Level Semantic Guidance from Dense Novel Views

  • Yingji Zhong
  • Kaichen Zhou
  • Zhihao Li
  • Lanqing Hong
  • Zhenguo Li
  • Dan Xu

Neural Radiance Fields (NeRF) have shown remarkable capabilities for photorealistic novel view synthesis. One major deficiency of NeRF is that dense inputs are typically required, and the rendering quality will drop drastically given sparse inputs. In this paper, we highlight the effectiveness of rendered semantics from dense novel views, and show that rendered semantics can be treated as a more robust form of augmented data than rendered RGB. Our method enhances NeRF’s performance by incorporating guidance derived from the rendered semantics. The rendered semantic guidance encompasses two levels: the supervision level and the feature level. The supervision-level guidance incorporates a bi-directional verification module that decides the validity of each rendered semantic label, while the feature-level guidance integrates a learnable codebook that encodes semantic-aware information, which is queried by each point via the attention mechanism to obtain semantic-relevant predictions. The overall semantic guidance is embedded into a self-improved pipeline. We also introduce a more challenging sparse-input indoor benchmark, where the number of inputs is limited to as few as 6. Experiments demonstrate the effectiveness of our method, which exhibits superior performance compared to existing approaches.
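A minimal sketch of the feature-level guidance idea described in the abstract, assuming a PyTorch-style implementation: each sampled point's feature queries a learnable codebook through scaled dot-product attention to obtain a semantic-relevant feature. All module names, dimensions, and shapes are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn

class SemanticCodebookQuery(nn.Module):
    def __init__(self, feat_dim=128, num_codes=64):
        super().__init__()
        # learnable codebook encoding semantic-aware information
        self.codebook = nn.Parameter(torch.randn(num_codes, feat_dim))
        self.to_q = nn.Linear(feat_dim, feat_dim)
        self.to_k = nn.Linear(feat_dim, feat_dim)
        self.to_v = nn.Linear(feat_dim, feat_dim)

    def forward(self, point_feats):              # (N_points, feat_dim)
        q = self.to_q(point_feats)               # queries from point features
        k = self.to_k(self.codebook)             # keys from codebook entries
        v = self.to_v(self.codebook)             # values from codebook entries
        attn = torch.softmax(q @ k.t() / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                          # semantic-relevant features

feats = torch.randn(1024, 128)                   # features of 1024 sampled points
out = SemanticCodebookQuery()(feats)             # (1024, 128)
```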

IROS Conference 2025 Conference Paper

SR3D: Unleashing Single-view 3D Reconstruction for Transparent and Specular Object Grasping

  • Mingxu Zhang
  • Xiaoqi Li 0020
  • Jiahui Xu
  • Kaichen Zhou
  • Hojin Bae
  • Yan Shen 0035
  • Chuyan Xiong
  • Hao Dong 0003

Recent advancements in 3D robotic manipulation have improved grasping of everyday objects, but transparent and specular materials remain challenging due to depth sensing limitations. While several 3D reconstruction and depth completion approaches address these challenges, they suffer from setup complexity or limited use of the available observations. To address this, leveraging the power of single-view 3D object reconstruction approaches, we propose SR3D, a training-free framework that enables robotic grasping of transparent and specular objects from a single-view observation. Specifically, given single-view RGB and depth images, SR3D first uses external visual models to generate a 3D reconstructed object mesh from the RGB image. The key idea is then to determine the 3D object’s pose and scale in order to accurately localize the reconstructed object back into its original, depth-corrupted 3D scene. To this end, we propose view matching and keypoint matching mechanisms, which leverage the semantic and geometric information inherent in both the 2D and 3D observations to determine the object’s 3D state within the scene, thereby reconstructing an accurate 3D depth map for effective grasp detection. Experiments in both simulation and the real world show the reconstruction effectiveness of SR3D. More demonstrations can be found at: https://sites.google.com/view/sr3dtech/
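One generic way the pose and scale of a reconstructed mesh could be recovered from matched 3D keypoints is a closed-form similarity alignment (Umeyama); this is a hedged sketch of that standard technique, not SR3D's actual matching pipeline, and all names are illustrative.

```python
import numpy as np

def similarity_from_keypoints(src, dst):
    """Estimate scale s, rotation R, translation t with dst ≈ s * R @ src + t
    (Umeyama alignment) from matched 3D keypoints of shape (N, 3)."""
    mu_src, mu_dst = src.mean(0), dst.mean(0)
    x, y = src - mu_src, dst - mu_dst
    cov = y.T @ x / len(src)                    # cross-covariance of matches
    U, S, Vt = np.linalg.svd(cov)
    d = np.sign(np.linalg.det(U @ Vt))          # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / x.var(axis=0).sum()
    t = mu_dst - s * R @ mu_src
    return s, R, t

# sanity check: recover a known scale and translation
src = np.random.rand(20, 3)
dst = 2.0 * src + np.array([0.1, -0.2, 0.3])
s, R, t = similarity_from_keypoints(src, dst)   # s ≈ 2.0, R ≈ I, t ≈ [0.1, -0.2, 0.3]
```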

ICRA Conference 2025 Conference Paper

TransDiff: Diffusion-Based Method for Manipulating Transparent Objects Using a Single RGB-D Image

  • Haoxiao Wang
  • Kaichen Zhou
  • Binrui Gu
  • Zhiyuan Feng
  • Weijie Wang
  • Peilin Sun
  • Yicheng Xiao
  • Jianhua Zhang

Manipulating transparent objects presents significant challenges due to the complexities introduced by their reflection and refraction properties, which considerably hinder the accurate estimation of their 3D shapes. To address these challenges, we propose TransDiff, a single-view RGB-D-based depth completion framework that leverages Denoising Diffusion Probabilistic Models (DDPM) to achieve material-agnostic object grasping in desktop scenarios. Specifically, we leverage features extracted from RGB images, including semantic segmentation, edge maps, and normal maps, to condition the depth map generation process. Our method learns an iterative denoising process that transforms a random depth distribution into a depth map, guided by initially refined depth information, ensuring more accurate depth estimation in scenarios involving transparent objects. Additionally, we propose a novel training method to better align the noisy depth and RGB image features, which are used as conditions to refine the depth estimate step by step. Finally, we use an improved inference process to accelerate the denoising procedure. Through comprehensive experimental validation, we demonstrate that our method significantly outperforms the baselines on both synthetic and real-world benchmarks with acceptable inference time. A demo of our method can be found at: https://wang-haoxiao.github.io/TransDiff/
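For readers unfamiliar with DDPM-based generation, the sketch below shows one standard ancestral reverse step conditioned on RGB-derived features. It is a generic DDPM update under textbook assumptions, not TransDiff's refined inference process; `eps_model` is a stand-in noise predictor.

```python
import torch

def ddpm_reverse_step(x_t, t, cond, eps_model, betas):
    """One ancestral sampling step x_t -> x_{t-1} (standard DDPM update)."""
    beta_t = betas[t]
    alpha_t = 1.0 - beta_t
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = eps_model(x_t, t, cond)                               # predicted noise, conditioned on RGB features
    mean = (x_t - beta_t / torch.sqrt(1 - alpha_bar) * eps) / torch.sqrt(alpha_t)
    if t == 0:
        return mean
    noise = torch.randn_like(x_t)
    return mean + torch.sqrt(beta_t) * noise                    # simple sigma_t^2 = beta_t choice

betas = torch.linspace(1e-4, 0.02, 1000)                        # linear noise schedule
dummy_model = lambda x, t, c: torch.zeros_like(x)               # placeholder noise predictor
x = torch.randn(1, 1, 64, 64)                                   # noisy depth map
cond = torch.randn(1, 8, 64, 64)                                # RGB-derived condition features
x_prev = ddpm_reverse_step(x, 999, cond, dummy_model, betas)
```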

ICRA Conference 2024 Conference Paper

Dusk Till Dawn: Self-supervised Nighttime Stereo Depth Estimation using Visual Foundation Models

  • Madhu Vankadari
  • Samuel Hodgson
  • Sangyun Shin
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

Self-supervised depth estimation algorithms rely heavily on frame-warping relationships, exhibiting substantial performance degradation when applied in challenging circumstances, such as low-visibility and nighttime scenarios with varying illumination conditions. Addressing this challenge, we introduce an algorithm designed to achieve accurate self-supervised stereo depth estimation, focusing on nighttime conditions. Specifically, we use pretrained visual foundation models to extract generalised features across challenging scenes and present an efficient method for matching and integrating these features from stereo frames. Moreover, to prevent pixels that violate the photometric consistency assumption from negatively affecting the depth predictions, we propose a novel masking approach designed to filter out such pixels. Lastly, addressing weaknesses in the evaluation of current depth estimation algorithms, we present novel evaluation metrics. Our experiments, conducted on challenging datasets including Oxford RobotCar and MultiSpectral Stereo, demonstrate the robust improvements realized by our approach.
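As an illustration of the masking idea (not the paper's exact rule), a common proxy for photometric-consistency violations is to keep a pixel only when warping the other view explains it better than simply copying it. A minimal sketch, with all names and shapes assumed:

```python
import torch

def consistency_mask(target, warped_source, source, eps=1e-6):
    """target, warped_source, source: (B, 3, H, W) images in [0, 1]."""
    reproj_err = (target - warped_source).abs().mean(1)    # error after warping
    identity_err = (target - source).abs().mean(1)          # error with no warping
    return (reproj_err < identity_err + eps).float()        # 1 = trust this pixel in the loss

mask = consistency_mask(torch.rand(1, 3, 64, 64),
                        torch.rand(1, 3, 64, 64),
                        torch.rand(1, 3, 64, 64))
```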

IROS Conference 2024 Conference Paper

Learning Generalizable Manipulation Policy with Adapter-Based Parameter Fine-Tuning

  • Kai Lu 0003
  • Kim Tien Ly
  • William Hebberd
  • Kaichen Zhou
  • Ioannis Havoutis
  • Andrew Markham

This study investigates the use of adapters in reinforcement learning for robotic skill generalization across multiple robots and tasks. Traditional methods are typically reliant on robot-specific retraining and face challenges such as efficiency and adaptability, particularly when scaling to robots with varying kinematics. We propose an alternative approach where a disembodied (virtual) hand manipulator learns a task (i.e., an abstract skill) and then transfers it to various robots with different kinematic constraints without retraining the entire model (i.e., the concrete, physical implementation of the skill). Whilst adapters are commonly used in other domains with strong supervision available, we show how weaker feedback from robotic control can be used to optimize task execution by preserving the abstract skill dynamics whilst adapting to new robotic domains. We demonstrate the effectiveness of our method with experiments conducted in the SAPIEN ManiSkill environment, showing improvements in generalization and task success rates. All code, data, and additional videos are at this GitHub link: https://kl-research.github.io/genrob.
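A minimal bottleneck-adapter sketch of the general idea: a small residual module is inserted alongside a frozen policy network so that only the adapter weights are trained per robot, preserving the frozen "abstract skill". Dimensions, placement, and module names are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim, bottleneck=32):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)           # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual adapter update

backbone = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 64))
for p in backbone.parameters():
    p.requires_grad = False                       # abstract skill stays frozen
adapter = Adapter(64)                             # per-robot trainable parameters
out = adapter(backbone(torch.randn(8, 64)))
```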

NeurIPS Conference 2024 Conference Paper

RoboMamba: Efficient Vision-Language-Action Model for Robotic Reasoning and Manipulation

  • Jiaming Liu
  • Mengzhen Liu
  • Zhenyu Wang
  • Pengju An
  • Xiaoqi Li
  • Kaichen Zhou
  • Senqiao Yang
  • Renrui Zhang

A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing Vision-Language-Action (VLA) models for robots can handle a range of basic tasks, they still face challenges in two areas: (1) insufficient reasoning ability to tackle complex tasks, and (2) high computational costs for VLA model fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic VLA model that leverages Mamba to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual tokens with language embedding through co-training, empowering our model with visual common sense and robotic-related reasoning. To further equip RoboMamba with SE(3) pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time. In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 3 times faster than existing VLA models.
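A hedged sketch of the "minimal trainable parameters" idea only: freeze a pretrained multimodal backbone and train a small head that maps its output feature to an SE(3) pose (3D translation plus a 6D rotation representation). The backbone below is a placeholder linear layer standing in for the model trunk; nothing here reflects RoboMamba's actual layers.

```python
import torch
import torch.nn as nn

backbone = nn.Linear(512, 512)                    # placeholder for the pretrained trunk
for p in backbone.parameters():
    p.requires_grad = False                       # reasoning backbone stays frozen

pose_head = nn.Sequential(                        # the only trainable part
    nn.Linear(512, 256), nn.ReLU(),
    nn.Linear(256, 3 + 6),                        # translation + 6D rotation parameters
)

feat = backbone(torch.randn(4, 512))
pose = pose_head(feat)                            # (4, 9) predicted poses
trainable = sum(p.numel() for p in pose_head.parameters())   # tiny fraction of total
```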

IROS Conference 2024 Conference Paper

SCANet: Correcting LEGO Assembly Errors with Self-Correct Assembly Network

  • Yuxuan Wan
  • Kaichen Zhou
  • Jinhong Chen
  • Hao Dong 0003

Autonomous assembly in robotics and 3D vision presents significant challenges, particularly in ensuring assembly correctness. Presently, predominant methods such as MEPNet focus on assembling components based on manually provided images. However, these approaches often fall short in achieving satisfactory results for tasks requiring long-term planning. Concurrently, we observe that integrating a self-correction module can partially alleviate such issues. Motivated by this concern, we introduce the Single-Step Assembly Error Correction Task, which involves identifying and rectifying misassembled components. To support research in this area, we present the LEGO Error Correction Assembly Dataset (LEGO-ECA), comprising manual images for assembly steps and instances of assembly failures. Additionally, we propose the Self-Correct Assembly Network (SCANet), a novel method to address this task. SCANet treats assembled components as queries, determining their correctness in manual images and providing corrections when necessary. Finally, we utilize SCANet to correct the assembly results of MEPNet. Experimental results demonstrate that SCANet can identify and correct MEPNet's misassembled results, significantly improving the correctness of assembly. Our code and dataset are available at https://github.com/Yaser-wyx/SCANet.
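To illustrate the "components as queries" idea in generic terms (this is not SCANet itself, and all names are assumptions): component features can attend to manual-image features via cross-attention, after which a small head scores whether each component appears correctly placed.

```python
import torch
import torch.nn as nn

class ComponentChecker(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.score = nn.Linear(dim, 1)            # probability of correct placement

    def forward(self, component_feats, manual_feats):
        # component_feats: (B, N_components, dim); manual_feats: (B, N_patches, dim)
        attended, _ = self.cross_attn(component_feats, manual_feats, manual_feats)
        return torch.sigmoid(self.score(attended)).squeeze(-1)   # (B, N_components)

checker = ComponentChecker()
probs = checker(torch.randn(2, 5, 128), torch.randn(2, 49, 128))
```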

IROS Conference 2024 Conference Paper

WSCLoc: Weakly-Supervised Sparse-View Camera Relocalization via Radiance Field

  • Jialu Wang
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

Despite the advancements in deep learning for camera relocalization tasks, obtaining the ground truth pose labels required for the training process remains a costly endeavor. While current weakly supervised methods excel in lightweight label generation, their performance notably declines in scenarios with sparse views. In response to this challenge, we introduce WSCLoc, a system that can be customized to various deep learning-based relocalization models to enhance their performance under weakly-supervised and sparse-view conditions. This is realized in two stages. In the initial stage, WSCLoc employs a multilayer perceptron-based structure called WFT-NeRF to co-optimize image reconstruction quality and initial pose information. To ensure a stable learning process, we incorporate temporal information as input. Furthermore, instead of optimizing SE(3), we opt for Sim(3) optimization to explicitly enforce a scale constraint. In the second stage, we co-optimize the pre-trained WFT-NeRF and WFT-Pose. This optimization is enhanced by Time-Encoding based Random View Synthesis and supervised by inter-frame geometric constraints that consider pose, depth, and RGB information. We validate our approaches on two publicly available datasets, one outdoor and one indoor. Our experimental results demonstrate that our weakly-supervised relocalization solutions achieve superior pose estimation accuracy in sparse-view scenarios, comparable to state-of-the-art camera relocalization methods. We will make our code publicly available.
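A rough sketch of what optimizing a Sim(3) pose (as opposed to SE(3)) can look like: a log-scale is optimized alongside an axis-angle rotation and a translation, making the scale constraint explicit. The parameterization below is a generic textbook form, not WSCLoc's implementation.

```python
import torch

def sim3_apply(points, log_scale, rotvec, trans):
    """points: (N, 3). Applies s * R @ p + t, with R built from the axis-angle rotvec."""
    theta = rotvec.norm() + 1e-12
    k = rotvec / theta
    K = torch.zeros(3, 3)                         # skew-symmetric matrix of the axis
    K[0, 1], K[0, 2] = -k[2], k[1]
    K[1, 0], K[1, 2] = k[2], -k[0]
    K[2, 0], K[2, 1] = -k[1], k[0]
    R = torch.eye(3) + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)
    return torch.exp(log_scale) * points @ R.t() + trans

pose = [torch.zeros(1, requires_grad=True),       # log-scale: the explicit scale term
        torch.zeros(3, requires_grad=True),       # rotation (axis-angle)
        torch.zeros(3, requires_grad=True)]       # translation
pts = sim3_apply(torch.randn(100, 3), *pose)      # differentiable w.r.t. all pose terms
```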

NeurIPS Conference 2023 Conference Paper

DynPoint: Dynamic Neural Point For View Synthesis

  • Kaichen Zhou
  • Jia-Xing Zhong
  • Sangyun Shin
  • Kai Lu
  • Yiyuan Yang
  • Andrew Markham
  • Niki Trigoni

The introduction of neural radiance fields has greatly improved the effectiveness of view synthesis for monocular videos. However, existing algorithms face difficulties when dealing with uncontrolled or lengthy scenarios, and require extensive training time specific to each new scenario. To tackle these limitations, we propose DynPoint, an algorithm designed to facilitate the rapid synthesis of novel views for unconstrained monocular videos. Rather than encoding the entirety of the scenario information into a latent representation, DynPoint concentrates on predicting the explicit 3D correspondence between neighboring frames to realize information aggregation. Specifically, this correspondence prediction is achieved through the estimation of consistent depth and scene flow information across frames. Subsequently, the acquired correspondence is used to aggregate information from multiple reference frames to a target frame by constructing hierarchical neural point clouds. The resulting framework enables swift and accurate view synthesis for desired views of target frames. The experimental results demonstrate the considerable acceleration of training time achieved by our proposed method, typically an order of magnitude, while yielding outcomes comparable to prior approaches. Furthermore, our method exhibits strong robustness in handling long-duration videos without learning a canonical representation of video content.
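The correspondence idea can be illustrated with standard pinhole geometry: back-project a reference frame's depth to 3D, move each point by its predicted scene flow, and project into the target view. This is a generic sketch under an assumed pinhole camera model, not DynPoint's code; all tensors and names are illustrative.

```python
import torch

def warp_reference_to_target(depth_ref, flow_3d, K, T_ref_to_tgt):
    """depth_ref: (H, W), flow_3d: (H, W, 3), K: (3, 3), T_ref_to_tgt: (4, 4)."""
    H, W = depth_ref.shape
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1).float()    # homogeneous pixels
    pts_ref = (pix @ torch.inverse(K).t()) * depth_ref[..., None]    # back-project with depth
    pts_ref = pts_ref + flow_3d                                      # move with scene flow
    pts_h = torch.cat([pts_ref, torch.ones(H, W, 1)], dim=-1)
    pts_tgt = pts_h @ T_ref_to_tgt.t()                               # into the target camera frame
    proj = pts_tgt[..., :3] @ K.t()
    return proj[..., :2] / proj[..., 2:3].clamp(min=1e-6)            # corresponding target pixels

K = torch.tensor([[500., 0., 320.], [0., 500., 240.], [0., 0., 1.]])
# identity transform and zero flow: each pixel maps back to itself
uv = warp_reference_to_target(torch.ones(480, 640), torch.zeros(480, 640, 3), K, torch.eye(4))
```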

NeurIPS Conference 2023 Conference Paper

Multi-body SE(3) Equivariance for Unsupervised Rigid Segmentation and Motion Estimation

  • Jia-Xing Zhong
  • Ta-Ying Cheng
  • Yuhang He
  • Kai Lu
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

A truly generalizable approach to rigid segmentation and motion estimation is fundamental to 3D understanding of articulated objects and moving scenes. In view of the closely intertwined relationship between segmentation and motion estimates, we present an SE(3) equivariant architecture and a training strategy to tackle this task in an unsupervised manner. Our architecture is composed of two interconnected, lightweight heads. These heads predict segmentation masks using point-level invariant features and estimate motion from SE(3) equivariant features, all without the need for category information. Our training strategy is unified and can be implemented online, which jointly optimizes the predicted segmentation and motion by leveraging the interrelationships among scene flow, segmentation mask, and rigid transformations. We conduct experiments on four datasets to demonstrate the superiority of our method. The results show that our method excels in both model performance and computational efficiency, with only 0.25M parameters and 0.92G FLOPs. To the best of our knowledge, this is the first work designed for category-agnostic part-level SE(3) equivariance in dynamic point clouds.

ICRA Conference 2023 Conference Paper

Sample, Crop, Track: Self-Supervised Mobile 3D Object Detection for Urban Driving LiDAR

  • Sangyun Shin
  • Stuart Golodetz
  • Madhu Vankadari
  • Kaichen Zhou
  • Andrew Markham
  • Niki Trigoni

Deep learning has led to great progress in the detection of mobile (i.e. movement-capable) objects in urban driving scenes in recent years. Supervised approaches typically require the annotation of large training sets; there has thus been great interest in leveraging weakly, semi-, or self-supervised methods to avoid this, with much success. Whilst weakly and semi-supervised methods require some annotation, self-supervised methods have used cues such as motion to relieve the need for annotation altogether. However, a complete absence of annotation typically degrades their performance, and ambiguities that arise during motion grouping can inhibit their ability to find accurate object boundaries. In this paper, we propose a new self-supervised mobile object detection approach called SCT. This uses both motion cues and expected object sizes to improve detection performance, and predicts a dense grid of 3D oriented bounding boxes to improve object discovery. We significantly outperform the state-of-the-art self-supervised mobile object detection method TCR on the KITTI tracking benchmark, and achieve performance that is within 30% of the fully supervised PV-RCNN++ method for IoUs ≤ 0.5. Our source code will be made available online.

TMLR Journal 2022 Journal Article

DHA: End-to-End Joint Optimization of Data Augmentation Policy, Hyper-parameter and Architecture

  • Kaichen Zhou
  • Lanqing Hong
  • Shoukang Hu
  • Fengwei Zhou
  • Binxin Ru
  • Jiashi Feng
  • Zhenguo Li

Automated machine learning (AutoML) usually involves several crucial components, such as Data Augmentation (DA) policy, Hyper-Parameter Optimization (HPO), and Neural Architecture Search (NAS). Although many strategies have been developed for automating these components separately, joint optimization of these components remains challenging due to the greatly increased search dimension and the differing input types of each component. In parallel to this, the common practice in NAS of searching for the optimal architecture first and then retraining it before deployment often suffers from low performance correlation between the searching and retraining stages. An end-to-end solution that integrates the AutoML components and returns a ready-to-use model at the end of the search is desirable. In view of these issues, we propose DHA, which achieves joint optimization of Data augmentation policy, Hyper-parameter, and Architecture. Specifically, end-to-end NAS is achieved in a differentiable manner by optimizing a compressed lower-dimensional feature space, while the DA policy and HPO are regarded as dynamic schedulers, which adapt themselves to the update of network parameters and network architecture at the same time. Experiments show that DHA achieves state-of-the-art (SOTA) results on various datasets and search spaces. To the best of our knowledge, we are the first to efficiently and jointly optimize DA policy, NAS, and HPO in an end-to-end manner without retraining.

AAAI Conference 2021 Conference Paper

VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization

  • Kaichen Zhou
  • Changhao Chen
  • Bing Wang
  • Muhamad Risqi U. Saputra
  • Niki Trigoni
  • Andrew Markham

Recent learning-based approaches have achieved impressive results in the field of single-shot camera localization. However, how best to fuse multiple modalities (e.g., image and depth) and to deal with degraded or missing input are less well studied. In particular, we note that previous approaches towards deep fusion do not perform significantly better than models employing a single modality. We conjecture that this is because of the naive approaches to feature space fusion through summation or concatenation, which do not take into account the different strengths of each modality. To address this, we propose an end-to-end framework, termed VMLoc, to fuse different sensor inputs into a common latent space through a variational Product-of-Experts (PoE) followed by attention-based fusion. Unlike previous multimodal variational works that directly adapt the objective function of the vanilla variational auto-encoder, we show how camera localization can be accurately estimated through an unbiased objective function based on importance weighting. Our model is extensively evaluated on RGB-D datasets and the results prove the efficacy of our model. The source code is available at https://github.com/Zalex97/VMLoc.
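A minimal sketch of a Gaussian Product-of-Experts fusion, the general mechanism named in the abstract: each modality encoder outputs a mean and log-variance, and the fused latent combines them by precision weighting. Encoder outputs here are stand-ins, and this is the textbook PoE form rather than VMLoc's exact objective.

```python
import torch

def product_of_experts(mus, logvars):
    """mus, logvars: lists of (B, D) tensors, one pair per available modality."""
    precisions = [torch.exp(-lv) for lv in logvars]           # 1 / sigma^2 per expert
    prec_sum = torch.stack(precisions).sum(0)
    mu = torch.stack([m * p for m, p in zip(mus, precisions)]).sum(0) / prec_sum
    var = 1.0 / prec_sum                                      # fused variance
    return mu, torch.log(var)

# e.g. fusing an image expert and a depth expert; a missing modality is simply omitted
mu_rgb, lv_rgb = torch.zeros(2, 16), torch.zeros(2, 16)
mu_depth, lv_depth = torch.ones(2, 16), torch.zeros(2, 16)
mu_fused, lv_fused = product_of_experts([mu_rgb, mu_depth], [lv_rgb, lv_depth])
```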