Arrow Research

Author name cluster

Hao Shi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers (8)

AAAI Conference 2026 · Conference Paper

SpatialActor: Exploring Disentangled Spatial Representations for Robust Robotic Manipulation

  • Hao Shi
  • Bin Xie
  • Yingfei Liu
  • Yang Yue
  • Tiancai Wang
  • Haoqiang Fan
  • Xiangyu Zhang
  • Gao Huang

Robotic manipulation requires precise spatial understanding to interact with objects in the real world. Point-based methods suffer from sparse sampling, leading to the loss of fine-grained semantics. Image-based methods typically feed RGB and depth into 2D backbones pre-trained on 3D auxiliary tasks, but their entangled semantics and geometry are sensitive to the depth noise inherent in real-world data, which disrupts semantic understanding. Moreover, these methods focus on high-level geometry while overlooking low-level spatial cues essential for precise interaction. We propose SpatialActor, a disentangled framework for robust robotic manipulation that explicitly decouples semantics and geometry. The Semantic-guided Geometric Module adaptively fuses two complementary sources of geometry, one from noisy depth and one from semantic-guided expert priors. In addition, a Spatial Transformer leverages low-level spatial cues for accurate 2D-3D mapping and enables interaction among spatial features. We evaluate SpatialActor on multiple simulation and real-world scenarios across 50+ tasks. It achieves state-of-the-art performance with 87.4% on RLBench and improves by 13.9% to 19.4% under varying noise conditions, showing strong robustness. Moreover, it significantly enhances few-shot generalization to new tasks and maintains robustness under various spatial perturbations.
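
A minimal sketch (not the authors' code) of the adaptive fusion idea in this abstract: semantic features gate between geometric features extracted from noisy raw depth and from a semantic-guided depth prior. Module and tensor names are hypothetical.

```python
# Illustrative only: semantic features predict a per-location gate that mixes
# two geometric feature maps (from raw depth and from a depth prior).
import torch
import torch.nn as nn

class SemanticGuidedGeometricFusion(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        # Gate in [0, 1] predicted from semantic features.
        self.gate = nn.Sequential(nn.Conv2d(dim, dim, 1), nn.Sigmoid())

    def forward(self, sem, geo_raw_depth, geo_prior):
        # sem, geo_raw_depth, geo_prior: (B, C, H, W) feature maps.
        w = self.gate(sem)
        # Lean on the prior branch wherever the gate down-weights noisy raw depth.
        return w * geo_raw_depth + (1.0 - w) * geo_prior

fused = SemanticGuidedGeometricFusion()(torch.randn(1, 64, 32, 32),
                                         torch.randn(1, 64, 32, 32),
                                         torch.randn(1, 64, 32, 32))
print(fused.shape)  # torch.Size([1, 64, 32, 32])
```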

AAAI Conference 2025 · Conference Paper

AlphaForge: A Framework to Mine and Dynamically Combine Formulaic Alpha Factors

  • Hao Shi
  • Weili Song
  • Xinting Zhang
  • Jiahe Shi
  • Cuicui Luo
  • Xiang Ao
  • Hamid Arian
  • Luis Angel Seco

The complexity of financial data, characterized by its variability and low signal-to-noise ratio, necessitates advanced methods in quantitative investment that prioritize both performance and interpretability. Transitioning from early manual extraction to genetic programming, the most advanced approaches in alpha factor mining currently employ reinforcement learning to mine a set of combination factors with fixed weights. However, the performance of the resulting alpha factors is inconsistent, and the inflexibility of fixed factor weights is insufficient for adapting to the dynamic nature of financial markets. To address this issue, this paper proposes AlphaForge, a two-stage framework for formulaic alpha factor mining and factor combination. The framework employs a generative-predictive neural network to generate factors, leveraging the robust spatial exploration capabilities of deep learning while preserving diversity. The combination model within the framework incorporates the temporal performance of factors for selection and dynamically adjusts the weights assigned to each component alpha factor. Experiments conducted on real-world datasets demonstrate that the proposed model outperforms contemporary benchmarks in formulaic alpha factor mining. Furthermore, the model yields a notable enhancement in portfolio returns in both quantitative investment and real-money investment settings.
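
A toy sketch of the dynamic factor-combination stage described above, assuming a rolling rank-correlation ("rank IC") criterion against forward returns; the window length and weighting rule are illustrative assumptions, not the paper's combination model.

```python
# Illustrative only: re-weight component alpha factors each period based on
# their recent predictive performance (rolling Spearman rank IC).
import numpy as np
from scipy.stats import spearmanr

def dynamic_weights(factor_values, forward_returns, window=20):
    """factor_values: (T, N, K) factor scores for T days, N assets, K factors.
    forward_returns: (T, N). Returns (T, K) weights; zero until enough history."""
    T, N, K = factor_values.shape
    weights = np.zeros((T, K))
    for t in range(window, T):
        ic = np.zeros(K)
        for k in range(K):
            # Average rank IC of factor k over the trailing window.
            ics = [spearmanr(factor_values[s, :, k], forward_returns[s])[0]
                   for s in range(t - window, t)]
            ic[k] = np.nanmean(ics)
        pos = np.clip(ic, 0, None)  # keep only positively predictive factors
        weights[t] = pos / pos.sum() if pos.sum() > 0 else np.full(K, 1.0 / K)
    return weights
```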

ICLR Conference 2025 · Conference Paper

DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding

  • Henry Zheng
  • Hao Shi
  • Qihang Peng
  • Yong Xien Chng
  • Rui Huang 0012
  • Yepeng Weng
  • Zhongchao Shi
  • Gao Huang 0001

Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based on verbal descriptions. However, this task faces two significant challenges: (1) loss of fine-grained visual semantics due to sparse fusion of point clouds with ego-centric multi-view images, and (2) limited textual semantic context due to arbitrary language descriptions. We propose DenseGrounding, a novel approach designed to address these issues by enhancing both visual and textual semantics. For visual features, we introduce the Hierarchical Scene Semantic Enhancer, which retains dense semantics by capturing fine-grained global scene features and facilitating cross-modal alignment. For text descriptions, we propose a Language Semantic Enhancer that leverages large language models to provide rich and diverse language descriptions with additional context during model training. Extensive experiments show that DenseGrounding significantly outperforms existing methods in overall accuracy, achieving improvements of 5.81% and 7.56% when trained on the comprehensive full training dataset and the smaller mini subset, respectively, further advancing the SOTA in ego-centric 3D visual grounding. Our method also achieved 1st place and received the Innovation Award in the 2024 Autonomous Grand Challenge Multi-view 3D Visual Grounding Track, validating its effectiveness and robustness.
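
A hedged sketch of the language-enrichment idea: the grounding query is expanded with scene context by a language model before training. The prompt wording and the generic llm callable are assumptions rather than the paper's actual pipeline.

```python
# Illustrative only: produce several enriched paraphrases of a grounding query,
# conditioned on a list of objects known to be in the scene.
from typing import Callable, List

def enrich_description(query: str, scene_objects: List[str],
                       llm: Callable[[str], str], n_variants: int = 3) -> List[str]:
    prompt = (
        "Rewrite the following 3D visual grounding query in a more detailed way, "
        f"using this list of objects visible in the scene as context: {', '.join(scene_objects)}.\n"
        f"Query: {query}\nRewritten query:"
    )
    # One call per variant; a training loader would sample among the variants.
    return [llm(prompt) for _ in range(n_variants)]
```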

NeurIPS Conference 2025 · Conference Paper

Improving Monte Carlo Tree Search for Symbolic Regression

  • Zhengyao Huang
  • Daniel Huang
  • Tiannan Xiao
  • Dina Ma
  • Zhenyu Ming
  • Hao Shi
  • Yuanhui Wen

Symbolic regression aims to discover concise, interpretable mathematical expressions that satisfy desired objectives, such as fitting data, which poses a highly combinatorial optimization problem. While genetic programming has been the dominant approach, recent efforts have explored reinforcement learning methods to improve search efficiency. Monte Carlo Tree Search (MCTS), with its ability to balance exploration and exploitation through guided search, has emerged as a promising technique for symbolic expression discovery. However, its traditional bandit strategies and sequential symbol construction often limit performance. In this work, we propose an improved MCTS framework for symbolic regression that addresses these limitations through two key innovations: (1) an extreme bandit allocation strategy tailored for identifying globally optimal expressions, with finite-time performance guarantees under polynomial reward decay assumptions; and (2) evolution-inspired state-jumping actions such as mutation and crossover, which enable non-local transitions to promising regions of the search space. These state-jumping actions also reshape the reward landscape during the search process, improving both robustness and efficiency. We conduct a thorough numerical study of the impact of these improvements and benchmark our approach against existing symbolic regression methods on a variety of datasets, including both ground-truth and black-box datasets. Our approach achieves competitive performance with state-of-the-art libraries in terms of recovery rate and attains favorable positions on the Pareto frontier of accuracy versus model complexity.
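
An illustrative sketch of an extreme-bandit selection rule in the spirit described above: children are scored by the best reward observed in their subtree rather than the mean, plus the usual exploration bonus. The exact rule and constants here are assumptions, not the authors' formulation.

```python
# Illustrative only: UCT-style child selection that exploits the maximum
# reward seen below each child instead of the running average.
import math
from dataclasses import dataclass, field
from typing import List

@dataclass
class Node:
    visits: int = 1
    max_reward: float = 0.0            # best reward observed in this subtree
    children: List["Node"] = field(default_factory=list)

def extreme_bandit_score(child: Node, parent_visits: int, c: float = 1.0) -> float:
    # Max-reward exploitation term plus a logarithmic exploration bonus.
    return child.max_reward + c * math.sqrt(math.log(parent_visits) / child.visits)

def select_child(node: Node, c: float = 1.0) -> Node:
    return max(node.children, key=lambda ch: extreme_bandit_score(ch, node.visits, c))
```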

NeurIPS Conference 2025 · Conference Paper

mmWalk: Towards Multi-modal Multi-view Walking Assistance

  • Kedi Ying
  • Ruiping Liu
  • Chongyan Chen
  • Mingzhe Tao
  • Hao Shi
  • Kailun Yang
  • Jiaming Zhang
  • Rainer Stiefelhagen

Walking assistance in extreme or complex environments remains a significant challenge for people with blindness or low vision (BLV), largely due to the lack of a holistic scene understanding. Motivated by the real-world needs of the BLV community, we build mmWalk, a simulated multi-modal dataset that integrates multi-view sensor and accessibility-oriented features for safe outdoor navigation. Our dataset comprises 120 manually controlled, scenario-categorized walking trajectories with 62k synchronized frames. It contains over 559k panoramic images across RGB, depth, and semantic modalities. Furthermore, to emphasize real-world relevance, each trajectory involves outdoor corner cases and accessibility-specific landmarks for BLV users. Additionally, we generate mmWalkVQA, a VQA benchmark with over 69k visual question-answer triplets across 9 categories tailored for safe and informed walking assistance. We evaluate state-of-the-art Vision-Language Models (VLMs) in zero- and few-shot settings and find that they struggle with our risk assessment and navigational tasks. We validate our mmWalk-finetuned model on real-world datasets and show the effectiveness of our dataset for advancing multi-modal walking assistance.
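
A purely illustrative zero-shot evaluation loop over question-answer triplets like those described above; the field names and exact-match metric are assumptions, not the benchmark's actual schema or protocol.

```python
# Illustrative only: per-category exact-match accuracy for a generic VLM
# callable applied to (image, question, answer, category) records.
from collections import defaultdict
from typing import Callable, Dict, Iterable

def zero_shot_accuracy(samples: Iterable[dict],
                       vlm: Callable[[object, str], str]) -> Dict[str, float]:
    hits, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        pred = vlm(s["image"], s["question"])
        totals[s["category"]] += 1
        if pred.strip().lower() == s["answer"].strip().lower():
            hits[s["category"]] += 1
    return {cat: hits[cat] / totals[cat] for cat in totals}
```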

IJCAI Conference 2024 · Conference Paper

Label-efficient Semantic Scene Completion with Scribble Annotations

  • Song Wang
  • Jiawei Yu
  • Wentong Li
  • Hao Shi
  • Kailun Yang
  • Junbo Chen
  • Jianke Zhu

Semantic scene completion aims to infer 3D geometric structures with semantic classes from camera or LiDAR data, which provide essential occupancy information in autonomous driving. Prior endeavors concentrate on constructing networks or benchmarks in a fully supervised manner, yet the dense occupancy grids require point-wise semantic annotations, which incur expensive and tedious labeling costs. In this paper, we build a new label-efficient benchmark, named ScribbleSC, where sparse scribble-based semantic labels are combined with dense geometric labels for semantic scene completion. In particular, we propose a simple yet effective approach called Scribble2Scene, which bridges the gap between sparse scribble annotations and full supervision. Our method consists of geometric-aware auto-labeler construction and online model training, with an offline-to-online distillation module to enhance performance. Experiments on SemanticKITTI demonstrate that Scribble2Scene achieves competitive performance against fully-supervised counterparts, reaching 99% of the performance of fully-supervised models with only 13.5% of voxels labeled. Both the ScribbleSC annotations and our full implementation are available at https://github.com/songw-zju/Scribble2Scene.
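
A rough sketch of the kind of training signal suggested by this abstract: sparse scribble labels supervise the annotated voxels directly while an offline teacher supervises all voxels through a distillation term. The loss form and weighting are assumptions, not the released implementation.

```python
# Illustrative only: scribble cross-entropy on labeled voxels plus KL
# distillation from an offline teacher over all voxels.
import torch
import torch.nn.functional as F

def scribble_plus_distill_loss(student_logits, teacher_logits, scribble_labels,
                               ignore_index=255, distill_weight=1.0):
    # student_logits, teacher_logits: (V, C) per-voxel class scores
    # scribble_labels: (V,) class ids, ignore_index marks unlabeled voxels
    ce = F.cross_entropy(student_logits, scribble_labels, ignore_index=ignore_index)
    kl = F.kl_div(F.log_softmax(student_logits, dim=-1),
                  F.softmax(teacher_logits, dim=-1), reduction="batchmean")
    return ce + distill_weight * kl
```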

EAAI Journal 2023 · Journal Article

Detection of outlying patterns from sparse and irregularly sampled electronic health records data

  • Xiaokang Wang
  • Chengjian Li
  • Hao Shi
  • Congshan Wu
  • Chao Liu

Within the intensive care unit (ICU), vital signs such as arterial blood pressure (ABP) collected from electronic health records (EHRs) are typically recorded at different and uneven sampling frequencies and are often infrequently measured due to the nature of the medical treatment. Furthermore, from a temporal trajectory perspective, EHR data are likely to be corrupted by outlying patterns that deviate from normal samples in terms of the curves' magnitude and shape. In this work, we propose a two-stage outlier detection approach for sparse and irregularly sampled (SiS) temporal data using functional data analysis (FDA) tools. In the first stage, an outlier identification measure is defined by a max–min statistic, and a clean subset that contains non-outliers is obtained. In the second stage, a multiple hypothesis testing problem is formulated based on the asymptotic distribution of the proposed measure. A simulation-based study shows that the proposed method is robust to different types of shape and magnitude outliers. The detection results are more accurate than those of widely used functional depth methods, especially in extremely sparse settings where the proportion of observed data points over the entire time series is approximately 10%. Extensive experiments are also conducted on the real-world MIMIC-II dataset, which demonstrate that the method effectively detects clinically meaningful outlying patterns.
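
A purely illustrative two-stage scheme in the spirit of this abstract (this is not the paper's max–min statistic or its asymptotic test): stage one scores each sparsely sampled curve against a pointwise reference and keeps a clean subset, and stage two flags curves whose scores are extreme relative to that subset.

```python
# Illustrative only: worst-case deviation from a median reference curve,
# clean-subset selection, then an empirical cutoff from the clean scores.
import numpy as np

def two_stage_flags(curves, clean_fraction=0.5, alpha=0.05):
    """curves: list of (times, values) arrays; times sorted ascending on [0, 1]."""
    grid = np.linspace(0, 1, 50)
    # Interpolate every curve onto a shared grid and take the pointwise median.
    interp = np.array([np.interp(grid, t, v) for t, v in curves])
    ref = np.median(interp, axis=0)
    scores = np.max(np.abs(interp - ref), axis=1)           # worst-case deviation
    clean = scores <= np.quantile(scores, clean_fraction)   # stage 1: clean subset
    thresh = np.quantile(scores[clean], 1 - alpha)           # stage 2: empirical cutoff
    return scores > thresh
```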

NeurIPS Conference 2023 · Conference Paper

Open Compound Domain Adaptation with Object Style Compensation for Semantic Segmentation

  • Tingliang Feng
  • Hao Shi
  • Xueyang Liu
  • Wei Feng
  • Liang Wan
  • Yanlin Zhou
  • Di Lin

Many methods for semantic image segmentation have benefited from the success of open compound domain adaptation. They minimize the style gap between images of the source and target domains, making it easier to predict accurate pseudo-annotations for the target-domain images that train the segmentation network. Existing methods globally adapt the scene style of the images, whereas the object styles of different categories or instances are adapted improperly. This paper proposes Object Style Compensation, where we construct an Object-Level Discrepancy Memory with multiple sets of discrepancy features. The discrepancy features in a set capture the style changes of the same category's object instances adapted from the target to the source domain. We learn the discrepancy features from images of the source and target domains and store them in the memory. With this memory, we select appropriate discrepancy features to compensate the style information of object instances of various categories, adapting the object styles to a unified style of the source domain. Our method enables a more accurate computation of the pseudo-annotations for target-domain images, thus yielding state-of-the-art results on different datasets.
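
A hedged sketch of the memory look-up idea described above; the class, slot layout, and nearest-neighbor selection are assumptions, not the paper's exact design.

```python
# Illustrative only: per-category banks of "discrepancy" vectors are queried
# with an instance feature, and the closest entry compensates its style.
import torch

class ObjectDiscrepancyMemory:
    def __init__(self, num_classes: int, slots: int, dim: int):
        # One small bank of discrepancy features per semantic category.
        self.banks = torch.zeros(num_classes, slots, dim)

    def compensate(self, feat: torch.Tensor, cls_id: int) -> torch.Tensor:
        bank = self.banks[cls_id]                        # (slots, dim)
        idx = torch.cdist(feat[None], bank).argmin()     # nearest stored discrepancy
        return feat + bank[idx]                          # shift towards source style

mem = ObjectDiscrepancyMemory(num_classes=19, slots=8, dim=256)
out = mem.compensate(torch.randn(256), cls_id=3)
```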