Arrow Research

Author name cluster

Cheng Chi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

AAAI 2026 · Conference Paper

GOMPSNR: Reflourish the Signal-to-Noise Ratio Metric for Audio Generation Tasks

  • Lingling Dai
  • Andong Li
  • Cheng Chi
  • Yifan Liang
  • Xiaodong Li
  • Chengshi Zheng

In the field of audio generation, the signal-to-noise ratio (SNR) has long served as an objective metric for evaluating audio quality. Nevertheless, recent studies have shown that SNR and its variants are not always highly correlated with human perception, prompting two questions: why does SNR fail to measure audio quality, and how can its reliability as an objective metric be improved? In this paper, we identify the inadequate measurement of phase distance as a pivotal factor and propose to reformulate SNR with specially designed phase-distance terms, yielding an improved metric named GOMPSNR. We further extend the newly proposed formulation to derive two novel categories of loss functions, corresponding to magnitude-guided phase refinement and joint magnitude-phase optimization, respectively. In addition, extensive experiments are conducted to find an optimal combination of the different loss functions. Experimental results on advanced neural vocoders demonstrate that the proposed GOMPSNR provides more reliable error measurement than SNR. Meanwhile, the proposed loss functions yield substantial improvements in model performance, and a well-chosen combination of them further improves overall model capability.
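For orientation, the standard time-domain SNR that the abstract critiques is

\[
\mathrm{SNR}(s, \hat{s}) = 10 \log_{10} \frac{\sum_{n} s(n)^2}{\sum_{n} \bigl(s(n) - \hat{s}(n)\bigr)^2}.
\]

The abstract does not state GOMPSNR's exact form; purely as a schematic guess at what adding phase-distance terms could look like, with $S$ a time-frequency representation, $d_\phi$ some phase distance, and $\lambda$ a weight (all hypothetical here):

% hypothetical sketch only, not the paper's actual definition
\[
\mathrm{GOMPSNR}(S, \hat{S}) \approx 10 \log_{10} \frac{\sum_{t,f} |S_{t,f}|^2}{\sum_{t,f} |S_{t,f} - \hat{S}_{t,f}|^2 + \lambda \sum_{t,f} d_\phi\bigl(\angle S_{t,f}, \angle \hat{S}_{t,f}\bigr)}.
\]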

NeurIPS 2025 · Conference Paper

RoboRefer: Towards Spatial Referring with Reasoning in Vision-Language Models for Robotics

  • Enshen Zhou
  • Jingkun An
  • Cheng Chi
  • Yi Han
  • Shanyu Rong
  • Chi Zhang
  • Pengwei Wang
  • Zhongyuan Wang

Spatial referring is a fundamental capability of embodied robots interacting with the 3D physical world. However, even with powerful pretrained VLMs, recent approaches still struggle to accurately understand complex 3D scenes and to dynamically reason about the instruction-indicated locations for interaction. To this end, we propose RoboRefer, a 3D-aware vision-language model (VLM) that first achieves precise spatial understanding by integrating a disentangled but dedicated depth encoder via supervised fine-tuning (SFT). Moreover, RoboRefer advances generalized multi-step spatial reasoning via reinforcement fine-tuning (RFT), with metric-sensitive process reward functions tailored for spatial referring tasks. To support SFT and RFT training, we introduce RefSpatial, a large-scale dataset of 20M QA pairs (2x prior), covering 31 spatial relations (vs. 15 prior) and supporting complex reasoning processes (up to 5 steps). In addition, we introduce RefSpatial-Bench, a challenging benchmark that fills the gap in evaluating spatial referring with multi-step reasoning. Experiments show that SFT-trained RoboRefer achieves state-of-the-art spatial understanding, with an average success rate of 89.6%. RFT-trained RoboRefer further outperforms all other baselines by a large margin, even surpassing Gemini-2.5-Pro by 12.4% in average accuracy on RefSpatial-Bench. Notably, RoboRefer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (e.g., UR5, G1 humanoid) in cluttered real-world scenes.
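The abstract does not specify the RFT reward. As a minimal sketch of what "metric-sensitive" could mean for spatial referring, assuming the model outputs a 3D point and a ground-truth point is available (the function name, distance scale, and exponential shaping are all assumptions, not the paper's design):

import numpy as np

def spatial_referring_reward(pred_point: np.ndarray,
                             gt_point: np.ndarray,
                             scale: float = 0.1) -> float:
    # Hypothetical reward: decays smoothly with the Euclidean distance
    # (in meters) between the predicted and ground-truth 3D points, so
    # the signal is sensitive to metric error rather than exact match.
    dist = float(np.linalg.norm(pred_point - gt_point))
    return float(np.exp(-dist / scale))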

ICLR 2023 · Conference Paper

Recursive Time Series Data Augmentation

  • Amine Mohamed Aboussalah
  • Min-Jae Kwon
  • Raj G. Patel
  • Cheng Chi
  • Chi-Guhn Lee

Time series observations can be seen as realizations of an underlying dynamical system governed by rules that we typically do not know. For time series learning tasks, we build models from the available data. Training on the available realizations, where data is limited, often induces severe over-fitting, thereby preventing generalization. To address this issue, we introduce a general recursive framework for time series augmentation, which we call the Recursive Interpolation Method (RIM). New augmented time series are generated from the original time series using a recursive interpolation function and used in training. We perform theoretical analysis to characterize the proposed RIM and to guarantee its performance under certain conditions. We apply RIM to diverse synthetic and real-world time series and achieve strong performance over non-augmented data on a variety of learning tasks. Our method is also computationally more efficient and leads to better performance than state-of-the-art time series data augmentation methods.
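The abstract does not define the interpolation function itself. A minimal Python sketch of the recursive idea, assuming each level blends the series with a simple interpolated reference (the endpoint-line reference, alpha, and depth are assumptions, not the paper's construction):

import numpy as np

def recursive_interpolation_augment(x: np.ndarray,
                                    alpha: float = 0.3,
                                    depth: int = 3) -> np.ndarray:
    # Hypothetical sketch: at each recursion level, interpolate between
    # the current series and a smoothed reference derived from it.
    out = x.astype(float)
    t = np.arange(len(out))
    for _ in range(depth):
        # reference here: the straight line between the two endpoints
        ref = np.interp(t, [0, len(out) - 1], [out[0], out[-1]])
        out = (1.0 - alpha) * out + alpha * ref
    return out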

NeurIPS 2022 · Conference Paper

A Deep Reinforcement Learning Framework for Column Generation

  • Cheng Chi
  • Amine Aboussalah
  • Elias Khalil
  • Juyoung Wang
  • Zoha Sherkat-Masoumi

Column Generation (CG) is an iterative algorithm for solving linear programs (LPs) with an extremely large number of variables (columns). CG is the workhorse for tackling large-scale integer linear programs, which rely on CG to solve LP relaxations within a branch-and-bound algorithm. Two canonical applications are the Cutting Stock Problem (CSP) and the Vehicle Routing Problem with Time Windows (VRPTW). In VRPTW, for example, each binary variable represents the decision to include or exclude a route, of which there are exponentially many; CG incrementally grows the subset of columns being used, ultimately converging to an optimal solution. We propose RLCG, the first Reinforcement Learning (RL) approach for CG. Unlike typical column selection rules, which myopically select a column based on local information at each iteration, we treat CG as a sequential decision-making problem, as the column selected in an iteration affects subsequent iterations of the algorithm. This perspective lends itself to a Deep Reinforcement Learning approach that uses Graph Neural Networks (GNNs) to represent the variable-constraint structure in the LP of interest. We perform an extensive set of experiments using the publicly available BPPLIB benchmark for CSP and the Solomon benchmark for VRPTW. RLCG converges faster and reduces the number of CG iterations by 22.4% for CSP and 40.9% for VRPTW on average, compared to a commonly used greedy policy.
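For context, the greedy baseline the abstract compares against is conventionally the rule that picks the candidate column with the most negative reduced cost at each iteration; a minimal sketch of that rule, with the learned replacement indicated in a comment (the exact baseline and the RLCG interface are not detailed in the abstract):

import numpy as np

def greedy_column_selection(reduced_costs: np.ndarray) -> int:
    # Myopic baseline: select the candidate column whose reduced cost
    # is most negative at the current CG iteration.
    return int(np.argmin(reduced_costs))

# RLCG instead scores candidates with a GNN over the LP's
# variable-constraint bipartite graph, so a learned policy would
# replace this argmin with a selection over GNN-predicted values.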

IROS 2022 · Conference Paper

PUA-MOS: End-to-End Point-wise Uncertainty Weighted Aggregation for Moving Object Segmentation

  • Cheng Chi
  • Peiliang Li 0001
  • Xiaozhi Chen
  • Xin Yang 0008

Segmenting moving objects in the 3D LiDAR point cloud can provide important guidance for localization, mapping and decision-making in self-driving vehicles. Conventional approaches to point cloud segmentation rely on semantic-level information, which makes long-tail problems inevitable, as there are always unseen types of objects on the road. To segment moving objects without relying on object categories, this paper identifies point motion by fully exploring and aggregating point-level geometric consistency in sequential point clouds. More specifically, an end-to-end point-wise uncertainty weighted aggregation approach, PUA-MOS, is proposed to segment the moving points in 3D LiDAR data. Our method simultaneously estimates a point-wise moving mask, scene flow and rigid-body transformations in a coarse-to-fine network, where the relations between the predictions are implicitly learned. To explicitly model the inner and inter relations across these predictions among all points, the point-wise estimates and the average value over points sharing the same motion are aggregated according to a predicted uncertainty. The aggregated estimate is then fed into the next-level fusion, where the points are re-segmented using the aggregated mask from the previous level. Through iterative joint aggregation, PUA-MOS significantly outperforms previous methods on both the KITTI [4] and Waymo [26] datasets. The code to generate the moving segmentation labels on both datasets will be provided for reproduction.
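The abstract does not give the exact weighting. A minimal sketch of uncertainty-weighted aggregation, assuming inverse-uncertainty weights over per-point motion estimates (an assumption for illustration, not the paper's formula):

import numpy as np

def uncertainty_weighted_aggregate(point_estimates: np.ndarray,
                                   uncertainties: np.ndarray,
                                   eps: float = 1e-6) -> np.ndarray:
    # point_estimates: (N, D) per-point motion estimates
    # uncertainties:   (N,) predicted per-point uncertainties
    # Confident points (low uncertainty) dominate the shared estimate.
    w = 1.0 / (uncertainties + eps)
    w = w / w.sum()
    return (w[:, None] * point_estimates).sum(axis=0)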

AAAI 2020 · Conference Paper

PedHunter: Occlusion Robust Pedestrian Detector in Crowded Scenes

  • Cheng Chi
  • Shifeng Zhang
  • Junliang Xing
  • Zhen Lei
  • Stan Z. Li
  • Xudong Zou

Pedestrian detection in crowded scenes is a challenging problem, because occlusion happens frequently among different pedestrians. In this paper, we propose an effective and efficient detection network to hunt pedestrians in crowded scenes. The proposed method, namely PedHunter, introduces strong occlusion handling ability to existing region-based detection networks without adding extra computation at inference. Specifically, we design a mask-guided module that leverages head information to enhance the feature representation learning of the backbone network. Moreover, we develop a strict classification criterion by improving the quality of positive samples during training, eliminating common false positives of pedestrian detection in crowded scenes. Besides, we present an occlusion-simulated data augmentation that enriches the pattern and quantity of occlusion samples to improve occlusion robustness. As a consequence, we achieve state-of-the-art results on three pedestrian detection datasets: CityPersons, Caltech-USA and CrowdHuman. To facilitate further studies on occluded pedestrian detection in surveillance scenes, we release a new pedestrian dataset, called SUR-PED, with a total of over 162k high-quality manually labeled instances in 10k images. The proposed dataset, source code and trained models are available at https://github.com/ChiCheng123/PedHunter.
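The abstract leaves the augmentation details to the paper. A minimal sketch of what an occlusion-simulated augmentation could look like, assuming a random constant-filled occluder is pasted inside a pedestrian box (the box format, size range, and fill value are all assumptions):

import numpy as np

def simulate_occlusion(image: np.ndarray, box: tuple,
                       max_frac: float = 0.5, rng=None) -> np.ndarray:
    # Hypothetical sketch: blank out a random sub-rectangle of a
    # pedestrian box (x1, y1, x2, y2 in pixels) so the detector is
    # trained on partially occluded examples.
    rng = rng or np.random.default_rng()
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    ow = int(w * rng.uniform(0.1, max_frac))
    oh = int(h * rng.uniform(0.1, max_frac))
    ox = x1 + rng.integers(0, max(1, w - ow))
    oy = y1 + rng.integers(0, max(1, h - oh))
    out = image.copy()
    out[oy:oy + oh, ox:ox + ow] = 0  # constant-filled occluder
    return out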

AAAI 2020 · Conference Paper

Relational Learning for Joint Head and Human Detection

  • Cheng Chi
  • Shifeng Zhang
  • Junliang Xing
  • Zhen Lei
  • Stan Z. Li
  • Xudong Zou

Head and human detection have been rapidly improved with the development of deep convolutional neural networks. However, these two tasks are often studied separately without considering their inherent correlation, so that 1) head detection often suffers from more false positives, and 2) the performance of human detectors frequently drops dramatically in crowded scenes. To handle these two issues, we present a novel joint head and human detection network, namely JointDet, which effectively detects heads and human bodies simultaneously. Moreover, we design a head-body relationship discriminating module to perform relational learning between heads and human bodies, and leverage this learned relationship to regain suppressed human detections and reduce head false positives. To verify the effectiveness of the proposed method, we annotate head bounding boxes for the CityPersons and Caltech-USA datasets, and conduct extensive experiments on the CrowdHuman, CityPersons and Caltech-USA datasets. As a consequence, the proposed JointDet detector achieves state-of-the-art performance on these three benchmarks. To facilitate further studies on the head and human detection problem, all new annotations, source code and trained models are available at https://github.com/ChiCheng123/JointDet.
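The discriminating module itself is learned, so only the rescue step can be pictured from the abstract. A hedged illustration of how a head-body relation could regain suppressed detections (the pair_score function, detection fields, and threshold are hypothetical):

def regain_suppressed_bodies(heads, bodies, pair_score, keep_thresh=0.5):
    # Hypothetical sketch: a body box that NMS suppressed is restored
    # if some detected head pairs with it above a threshold.
    restored = []
    for body in bodies:
        if body["suppressed"] and any(
            pair_score(head, body) > keep_thresh for head in heads
        ):
            body["suppressed"] = False
            restored.append(body)
    return restored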

NeurIPS 2020 · Conference Paper

RelationNet++: Bridging Visual Representations for Object Detection via Transformer Decoder

  • Cheng Chi
  • Fangyun Wei
  • Han Hu

Existing object detection frameworks are usually built on a single format of object/part representation, i.e., anchor/proposal rectangle boxes in RetinaNet and Faster R-CNN, center points in FCOS and RepPoints, and corner points in CornerNet. While these different representations usually drive the frameworks to perform well in different aspects, e.g., better classification or finer localization, it is in general difficult to combine them in a single framework to exploit each strength, due to the heterogeneous or non-grid feature extraction of the different representations. This paper presents an attention-based decoder module, similar to that in the Transformer (Vaswani et al., 2017), to bridge other representations into a typical object detector built on a single representation format, in an end-to-end fashion. The other representations act as a set of key instances to strengthen the main query representation features in the vanilla detectors. Novel techniques are proposed for efficient computation of the decoder module, including a key sampling approach and a shared location embedding approach. The proposed module is named bridging visual representations (BVR). It can be applied in-place, and we demonstrate its broad effectiveness in bridging other representations into prevalent object detection frameworks, including RetinaNet, Faster R-CNN, FCOS and ATSS, where improvements of about 1.5 to 3.0 AP are achieved. In particular, we improve a state-of-the-art framework with a strong backbone by about 2.0 AP, reaching 52.7 AP on COCO test-dev. The resulting network is named RelationNet++. The code is available at https://github.com/microsoft/RelationNet2.
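Since the abstract describes the module as standard Transformer-style attention with the main representation as queries and the others as keys, a minimal NumPy sketch of the cross-attention at its core can illustrate the bridging; BVR's key sampling and shared location embedding are omitted, and the feature shapes and residual add are assumptions:

import numpy as np

def bridge_attention(query_feats: np.ndarray,
                     key_feats: np.ndarray) -> np.ndarray:
    # query_feats: (Nq, d) features of the detector's main representation
    # key_feats:   (Nk, d) features of the auxiliary representations,
    #              used as both keys and values
    d = query_feats.shape[-1]
    scores = query_feats @ key_feats.T / np.sqrt(d)      # (Nq, Nk)
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)              # row-wise softmax
    return query_feats + attn @ key_feats                # strengthened queries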

AAAI 2019 · Conference Paper

Selective Refinement Network for High Performance Face Detection

  • Cheng Chi
  • Shifeng Zhang
  • Junliang Xing
  • Zhen Lei
  • Stan Z. Li
  • Xudong Zou

High performance face detection remains a very challenging problem, especially when many tiny faces are present. This paper presents a novel single-shot face detector, named Selective Refinement Network (SRN), which selectively introduces novel two-step classification and regression operations into an anchor-based face detector to reduce false positives and improve localization accuracy simultaneously. In particular, the SRN consists of two modules: the Selective Two-step Classification (STC) module and the Selective Two-step Regression (STR) module. The STC aims to filter out most simple negative anchors from low-level detection layers to reduce the search space for the subsequent classifier, while the STR is designed to coarsely adjust the locations and sizes of anchors from high-level detection layers to provide better initialization for the subsequent regressor. Moreover, we design a Receptive Field Enhancement (RFE) block to provide more diverse receptive fields, which helps to better capture faces in extreme poses. As a consequence, the proposed SRN detector achieves state-of-the-art performance on all the widely used face detection benchmarks, including the AFW, PASCAL face, FDDB, and WIDER FACE datasets. Code will be released to facilitate further studies on the face detection problem.
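The STC's filtering step can be pictured as a simple score threshold applied before the second classifier runs; a minimal sketch under that assumption (the threshold value and interface are hypothetical, not SRN's actual design):

import numpy as np

def selective_two_step_classification(anchor_scores: np.ndarray,
                                      neg_thresh: float = 0.01) -> np.ndarray:
    # Hypothetical sketch: a cheap first-step classifier scores every
    # anchor; anchors scoring below the threshold are discarded as easy
    # negatives, and only the survivors reach the second-step classifier.
    keep = anchor_scores >= neg_thresh
    return np.nonzero(keep)[0]  # indices of anchors passed to step two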