Arrow Research

Author name cluster

Yihao Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers
2 author rows

Possible papers (13)

ICLR Conference 2025 Conference Paper

Any-step Dynamics Model Improves Future Predictions for Online and Offline Reinforcement Learning

  • Haoxin Lin
  • Yu-Yan Xu
  • Yihao Sun
  • Zhilong Zhang
  • Yi-Chen Li 0001
  • Chengxing Jia
  • Junyin Ye
  • Jiaji Zhang

Model-based methods in reinforcement learning offer a promising approach to enhance data efficiency by facilitating policy exploration within a dynamics model. However, accurately predicting sequential steps in the dynamics model remains a challenge due to bootstrapping prediction, in which the next state is predicted from the model's own prediction of the current state. This leads to accumulated errors during model roll-out. In this paper, we propose the Any-step Dynamics Model (ADM) to mitigate the compounding error by reducing bootstrapping prediction to direct prediction. ADM allows for the use of variable-length plans as inputs for predicting future states without frequent bootstrapping. We design two algorithms, ADMPO-ON and ADMPO-OFF, which apply ADM in online and offline model-based frameworks, respectively. In the online setting, ADMPO-ON demonstrates improved sample efficiency compared to previous state-of-the-art methods. In the offline setting, ADMPO-OFF not only demonstrates superior performance compared to recent state-of-the-art offline approaches but also offers better quantification of model uncertainty using only a single ADM.
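
The abstract's core idea, predicting a future state directly from a variable-length action plan instead of bootstrapping one step at a time, can be illustrated with a small sketch. Everything below (network shape, zero-padding of short plans, the step-count input) is an assumption for illustration, not the paper's architecture.

```python
# Hypothetical sketch of an any-step dynamics model: instead of rolling the
# model forward one step at a time (bootstrapping on its own predictions),
# it maps (s_t, a_t, ..., a_{t+k-1}) directly to s_{t+k} for variable k.
import torch
import torch.nn as nn

class AnyStepDynamicsModel(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, max_steps: int, hidden: int = 256):
        super().__init__()
        self.max_steps = max_steps
        # Plans shorter than max_steps are zero-padded; an extra scalar input
        # tells the network how many actions in the plan are real.
        in_dim = state_dim + max_steps * action_dim + 1
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, state_dim),
        )

    def forward(self, state: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
        # state: (B, state_dim); actions: (B, k, action_dim) with k <= max_steps.
        b, k, a_dim = actions.shape
        pad = actions.new_zeros(b, self.max_steps - k, a_dim)
        flat = torch.cat([actions, pad], dim=1).reshape(b, -1)
        count = actions.new_full((b, 1), float(k))
        return self.net(torch.cat([state, flat, count], dim=-1))

model = AnyStepDynamicsModel(state_dim=4, action_dim=2, max_steps=5)
s, plan = torch.randn(8, 4), torch.randn(8, 3, 2)   # a 3-step action plan
s_future = model(s, plan)                            # predicts s_{t+3} directly
```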

AAAI Conference 2025 Conference Paper

Column-Oriented Datalog on the GPU

  • Yihao Sun
  • Sidharth Kumar
  • Thomas Gilray
  • Kristopher Micinski

Datalog is a logic programming language widely used in knowledge representation and reasoning (KRR), program analysis, and social media mining due to its expressiveness and high performance. Traditionally, Datalog engines use either row-oriented or column-oriented storage. Engines like VLog and Nemo favor column-oriented storage for efficiency on limited-resource machines, while row-oriented engines like Soufflé use advanced data structures with locking to perform better on multi-core CPUs. The advent of modern datacenter GPUs, such as the NVIDIA H100 with its ability to run over 16k threads simultaneously and high memory bandwidth, has reopened the debate on which storage layout is more effective. This paper presents the first column-oriented Datalog engines tailored to the strengths of modern GPUs. We present VFLog, a CUDA-based Datalog runtime library with a column-oriented GPU data structure that supports all necessary relational algebra operations. Our results demonstrate over 200x performance gains over SOTA CPU-based column-oriented Datalog engines and a 2.5x speedup over GPU Datalog engines in various workloads, including KRR.
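
As a rough intuition for the column-oriented layout the abstract describes, here is a minimal CPU-only sketch of semi-naive Datalog evaluation over per-column arrays, using a toy transitive-closure program; VFLog's actual CUDA data structure and kernels are far more involved and are not reproduced here.

```python
# CPU-only sketch of column-oriented semi-naive Datalog evaluation for the toy
# program  path(x, z) :- edge(x, y), path(y, z).  Each relation is stored as
# one list per column rather than a list of rows.

def hash_join(r_cols, s_cols):
    """Join r(a, b) with s(b, c) on the shared variable b; return (a, c) columns."""
    r_a, r_b = r_cols
    s_b, s_c = s_cols
    index = {}                       # build side, keyed on the join column
    for i, b in enumerate(s_b):
        index.setdefault(b, []).append(s_c[i])
    out_a, out_c = [], []
    for i, b in enumerate(r_b):      # probe side
        for c in index.get(b, ()):
            out_a.append(r_a[i])
            out_c.append(c)
    return out_a, out_c

def transitive_closure(edge_x, edge_y):
    path = set(zip(edge_x, edge_y))
    delta = path                     # semi-naive: join only the new tuples
    while delta:
        d_y = [t[0] for t in delta]
        d_z = [t[1] for t in delta]
        new_a, new_c = hash_join((edge_x, edge_y), (d_y, d_z))
        delta = set(zip(new_a, new_c)) - path
        path |= delta
    return path

print(transitive_closure([0, 1, 2], [1, 2, 3]))   # derives (0,2), (1,3), (0,3)
```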

IROS Conference 2025 Conference Paper

Emergent Cooperative Strategies for Pursuit-Evasion in Cluttered Environments: A Knowledge-Enhanced Multi-Agent Deep Reinforcement Learning Approach

  • Yihao Sun
  • Chao Yan
  • Han Zhou
  • Xiaojia Xiang
  • Jie Jiang

Deep reinforcement learning (DRL) has recently emerged as a promising tool for tackling pursuit-evasion tasks. However, most existing DRL-based pursuit approaches still rely on individual rewards and struggle with complex scenarios. To address these challenges, we propose a knowledge-enhanced DRL approach for multi-agent pursuit-evasion in complex environments. Specifically, the cooperative pursuit problem is modeled as a decentralized partially observable Markov decision process from each pursuer's perspective, where the team reward function is elaborately designed to encourage collaborative behavior and enhance team coordination. Then, a novel knowledge-enhanced multi-agent twin delayed deep deterministic policy gradient (KE-MATD3) algorithm is presented to efficiently learn the cooperative pursuit policy. By integrating a knowledge enhancement mechanism that extracts effective information from an improved artificial potential field method, the cooperative pursuit policy achieves more robust convergence, mitigating the local optima that typically arise from individual reward-based learning. Finally, extensive numerical simulations and real-world experiments validate the efficiency and superiority of the proposed approach, demonstrating emergent cooperative behaviors among the pursuers.
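
A hedged sketch of the knowledge-enhancement idea: the abstract says effective information is extracted from an improved artificial potential field (APF) method, so one plausible reading is an APF prior blended with the learned action during exploration. The blending rule, gains, and shapes below are guesses for illustration, not KE-MATD3's actual mechanism.

```python
# Hedged sketch: an artificial potential field supplies a prior pursuit
# direction (attraction to the evader, repulsion from obstacles), which is
# blended with the learned policy action.  All constants are assumptions.
import numpy as np

def apf_action(pursuer, evader, obstacles, k_att=1.0, k_rep=0.5, safe_dist=2.0):
    """Attractive force toward the evader plus repulsive forces from obstacles."""
    force = k_att * (evader - pursuer)
    for obs in obstacles:
        diff = pursuer - obs
        d = np.linalg.norm(diff) + 1e-8
        if d < safe_dist:
            force += k_rep * (1.0 / d - 1.0 / safe_dist) * diff / d**3
    n = np.linalg.norm(force)
    return force / n if n > 0 else force

def knowledge_enhanced_action(policy_action, prior_action, beta=0.3):
    # Convex blend of learned action and APF prior; beta would presumably be
    # annealed toward 0 as the learned policy improves.
    return (1.0 - beta) * policy_action + beta * prior_action

pursuer, evader = np.array([0.0, 0.0]), np.array([5.0, 3.0])
obstacles = [np.array([2.0, 1.0])]
prior = apf_action(pursuer, evader, obstacles)
act = knowledge_enhanced_action(np.array([0.2, 0.9]), prior)
```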

ICML Conference 2025 Conference Paper

Improving Reward Model Generalization from Adversarial Process Enhanced Preferences

  • Zhilong Zhang
  • Tian Xu 0003
  • Xinghao Du
  • Xingchen Cao
  • Yihao Sun
  • Yang Yu 0001

In sequential decision-making, the reward function serves as the primary supervision signal, guiding agents to acquire the desired behaviors. Traditional reward modeling methods rely heavily on human expertise, limiting their scalability. Automated preference generation from suboptimal demonstrations has emerged as a promising alternative to address this limitation. This approach first generates preference data from suboptimal demonstrations and then trains reward models based on these preferences. Despite its potential, existing methods often struggle to generate preference data with sufficient coverage, limiting the accuracy and generalizability of the resulting reward models. To overcome this limitation, we propose APEC (Automated Preference generation with Enhanced Coverage), a novel method that improves the coverage of preference data. APEC achieves this by selecting policy pairs with significantly different iteration indices from the whole adversarial imitation learning process. We provide a theoretical analysis to validate that the selected policy pairs provably hold preference relationships. Experimental results demonstrate that APEC consistently outperforms baseline methods in generating preferences with broader coverage across both vector-based and pixel-based control tasks. Consequently, the reward models trained with APEC align more closely with ground-truth rewards, yielding improved policy performance.
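
The selection rule the abstract describes, pairing policies whose iteration indices in the adversarial imitation learning run differ significantly, can be sketched in a few lines. The gap threshold, data layout, and labeling convention below are illustrative assumptions.

```python
# Minimal sketch of APEC-style pair selection: snapshots from an adversarial
# imitation learning (AIL) run are paired so their iteration indices differ
# by at least min_gap, and the later snapshot's trajectory is labeled as
# preferred.  Data layout and threshold are assumptions for illustration.
import random

def select_preference_pairs(snapshots, min_gap, n_pairs):
    """snapshots: list of (iteration_index, trajectory) from AIL training."""
    pairs = []
    while len(pairs) < n_pairs:
        (i, traj_i), (j, traj_j) = random.sample(snapshots, 2)
        if abs(i - j) >= min_gap:
            # Later AIL iterates are closer to the expert, so the trajectory
            # from the higher-index policy gets the preferred label.
            worse, better = (traj_i, traj_j) if i < j else (traj_j, traj_i)
            pairs.append((worse, better))
    return pairs

snapshots = [(k, f"trajectory_from_iter_{k}") for k in range(0, 100, 10)]
prefs = select_preference_pairs(snapshots, min_gap=50, n_pairs=5)
```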

YNIMG Journal 2025 Journal Article

Relationship between cognitive impairment and hippocampal iron overload: A quantitative susceptibility mapping study of a rat model

  • Xi Deng
  • Meiru Bu
  • Jiali Liang
  • Yihao Sun
  • Liyan Li
  • Heishu Zheng
  • Zisan Zeng
  • Muliang Jiang

BACKGROUND: The aim of this study was to establish an iron overload rat model to simulate the elevated iron levels in patients with thalassemia and to investigate the potential association between hippocampal iron deposition and cognition. METHODS: Two groups of iron overloaded rats and one group of control rats were used for this study. The Morris water maze (MWM) was used to test spatial reference memory, indicated by escape latency time and number of MWM platform crossings. The magnetic susceptibility value of the hippocampal tissue, a measure of iron deposition, was assessed by quantitative susceptibility mapping (QSM) and was correlated with spatial reference memory performance. The iron content in hippocampal tissue sections of the rats was assessed using diaminobenzidine (DAB)-enhanced Perls' Prussian blue (PPB) staining. RESULTS: The iron-overloaded rat groups, Group H and Group L, had higher hippocampal magnetic susceptibility values than the control rat group, i.e., Group D. In addition, the iron-overloaded groups had longer MWM escape latency and a reduced number of MWM platform crossings compared with the control group. There was a positive correlation between the mean escape latency and the mean hippocampal magnetic susceptibility value, a negative correlation between the number of platform crossings and the mean hippocampal magnetic susceptibility value, and a negative correlation between the number of platform crossings and the escape latency time in Group H and Group L. CONCLUSION: In this rat model simulating iron overload in thalassemia, hippocampal iron overload was associated with impairment of spatial reference memory. QSM could be used to quantify brain iron overload in vivo, highlighting its potential clinical application for assessing cognitive impairment in patients with thalassemia.
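
For readers unfamiliar with the analysis pattern, the reported correlations amount to Pearson tests between QSM susceptibility and maze performance; the sketch below uses entirely synthetic numbers, not the study's data.

```python
# Illustrative (synthetic) version of the correlation analysis described:
# Pearson correlations between hippocampal susceptibility (QSM) and Morris
# water maze performance.  All values below are made up.
import numpy as np
from scipy.stats import pearsonr

susceptibility = np.array([0.12, 0.15, 0.18, 0.21, 0.25, 0.28])  # ppm (synthetic)
escape_latency = np.array([22.0, 25.0, 31.0, 34.0, 40.0, 44.0])  # seconds
crossings      = np.array([6, 5, 4, 4, 2, 1])                    # platform crossings

r_lat, p_lat = pearsonr(susceptibility, escape_latency)      # expected positive
r_cross, p_cross = pearsonr(susceptibility, crossings)       # expected negative
print(f"latency:   r={r_lat:.2f}, p={p_lat:.3f}")
print(f"crossings: r={r_cross:.2f}, p={p_cross:.3f}")
```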

NeurIPS Conference 2024 Conference Paper

Assemblage: Automatic Binary Dataset Construction for Machine Learning

  • Chang Liu
  • Rebecca Saul
  • Yihao Sun
  • Edward Raff
  • Maya Fuchs
  • Townsend Southard Pantano
  • James Holt
  • Kristopher Micinski

Binary code is pervasive, and binary analysis is a key task in reverse engineering, malware classification, and vulnerability discovery. Unfortunately, while there exist large corpuses of malicious binaries, obtaining high-quality corpuses of benign binaries for modern systems has proven challenging (e.g., due to licensing issues). Consequently, machine learning based pipelines for binary analysis utilize either costly commercial corpuses (e.g., VirusTotal) or open-source binaries (e.g., coreutils) available in limited quantities. To address these issues, we present Assemblage: an extensible cloud-based distributed system that crawls, configures, and builds Windows PE binaries to obtain high-quality binary corpuses suitable for training state-of-the-art models in binary analysis. We have run Assemblage on AWS over the past year, producing 890k Windows PE and 428k Linux ELF binaries across 29 configurations. Assemblage is designed to be both reproducible and extensible, enabling users to publish "recipes" for their datasets, and facilitating the extraction of a wide array of features. We evaluated Assemblage by using its data to train modern learning-based pipelines for compiler provenance and binary function similarity. Our results illustrate the practical need for robust corpuses of high-quality Windows PE binaries in training modern learning-based binary analyses.
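
The abstract's notion of a publishable dataset "recipe" might look roughly like the following sketch; all field names and values are invented for illustration, and the real recipe format is defined by the Assemblage project itself.

```python
# Hypothetical sketch of an Assemblage-style "recipe": a declarative record of
# what to crawl and which build configurations to sweep, so a dataset can be
# reproduced.  Every field name here is invented for illustration.
from dataclasses import dataclass, field
from itertools import product

@dataclass
class Recipe:
    source_query: str                 # e.g. a GitHub crawl filter
    compilers: list = field(default_factory=lambda: ["msvc-19.29", "clang-14"])
    optimizations: list = field(default_factory=lambda: ["/O1", "/O2"])
    platforms: list = field(default_factory=lambda: ["x64", "x86"])

    def configurations(self):
        # One build job per (compiler, optimization, platform) combination.
        return list(product(self.compilers, self.optimizations, self.platforms))

recipe = Recipe(source_query="language:C++ stars:>50 license:mit")
for cfg in recipe.configurations():
    print("build with", cfg)   # each job would be dispatched to a cloud worker
```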

AAAI Conference 2024 Conference Paper

Episodic Return Decomposition by Difference of Implicitly Assigned Sub-trajectory Reward

  • Haoxin Lin
  • Hongqiu Wu
  • Jiaji Zhang
  • Yihao Sun
  • Junyin Ye
  • Yang Yu

Real-world decision-making problems are usually accompanied by delayed rewards, which affect the sample efficiency of Reinforcement Learning, especially in the extremely delayed case where the only feedback is the episodic reward obtained at the end of an episode. Episodic return decomposition is a promising way to deal with the episodic-reward setting. Several corresponding algorithms have shown the remarkable effectiveness of step-wise proxy rewards learned through return decomposition. However, these existing methods lack either attribution or representation capacity, leading to inefficient decomposition in the case of long-term episodes. In this paper, we propose a novel episodic return decomposition method called Diaster (Difference of implicitly assigned sub-trajectory reward). Diaster decomposes any episodic reward into credits of two divided sub-trajectories at any cut point, and the step-wise proxy rewards come from differences in expectation. We theoretically and empirically verify that the decomposed proxy reward function can guide the policy to be nearly optimal. Experimental results show that our method outperforms previous state-of-the-art methods in terms of both sample efficiency and performance. The code is available at https://github.com/HxLyn3/Diaster.
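
The difference construction named in the title can be made concrete with a toy sketch: a credit network scores a sub-trajectory, and the step-wise proxy reward is the difference of credits at consecutive cut points. The mean-pooled encoding and shapes below are simplifications, not the paper's model.

```python
# Toy sketch of the Diaster idea: a network f assigns a credit to a
# sub-trajectory, and the step-wise proxy reward at step t is the difference
# f(tau[0:t+1]) - f(tau[0:t]).  Mean-pooling the states is a simplification.
import torch
import torch.nn as nn

class SubTrajectoryCredit(nn.Module):
    def __init__(self, state_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (T, state_dim); mean-pool as a crude sub-trajectory encoding.
        return self.net(states.mean(dim=0, keepdim=True)).squeeze()

def proxy_rewards(f: SubTrajectoryCredit, states: torch.Tensor) -> torch.Tensor:
    # r_hat_t = f(tau[0:t+1]) - f(tau[0:t]), with f(empty) defined as 0.
    credits = [torch.tensor(0.0)]
    for t in range(1, states.shape[0] + 1):
        credits.append(f(states[:t]))
    return torch.stack([credits[t + 1] - credits[t] for t in range(states.shape[0])])

f = SubTrajectoryCredit(state_dim=4)
episode = torch.randn(10, 4)
print(proxy_rewards(f, episode))   # 10 step-wise proxy rewards
```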

ICLR Conference 2024 Conference Paper

Flow to Better: Offline Preference-based Reinforcement Learning via Preferred Trajectory Generation

  • Zhilong Zhang
  • Yihao Sun
  • Junyin Ye
  • Tian-Shuo Liu
  • Jiaji Zhang
  • Yang Yu 0001

Offline preference-based reinforcement learning (PbRL) offers an effective solution to overcome the challenges associated with designing rewards and the high costs of online interactions. In offline PbRL, agents are provided with a fixed dataset containing human preferences between pairs of trajectories. Previous studies mainly focus on recovering the rewards from the preferences, followed by policy optimization with an off-the-shelf offline RL algorithm. However, given that preference labels in PbRL are inherently trajectory-based, accurately learning transition-wise rewards from such labels can be challenging, potentially leading to misguidance during subsequent offline RL training. To address this issue, we introduce our method named $\textit{Flow-to-Better (FTB)}$, which leverages the pairwise preference relationship to guide a generative model in producing preferred trajectories, avoiding Temporal Difference (TD) learning with inaccurate rewards. Conditioning on a low-preference trajectory, $\textit{FTB}$ uses a diffusion model to generate a better one with a higher preference, achieving high-fidelity full-horizon trajectory improvement. During diffusion training, we propose a technique called $\textit{Preference Augmentation}$ to alleviate the problem of insufficient preference data. As a result, we surprisingly find that the model-generated trajectories not only exhibit increased preference and consistency with the real transition but also introduce elements of $\textit{novelty}$ and $\textit{diversity}$, from which we can derive a desirable policy through imitation learning. Experimental results on D4RL benchmarks demonstrate that FTB achieves a remarkable improvement compared to state-of-the-art offline PbRL methods. Furthermore, we show that FTB can also serve as an effective data augmentation method for offline RL.
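
The control flow the abstract describes, repeatedly mapping a low-preference trajectory to a higher-preference one and then imitating the result, is sketched below with a stub in place of the conditional diffusion model; nothing here reflects FTB's actual generator.

```python
# Sketch of the Flow-to-Better control flow: a conditional generator maps a
# low-preference trajectory to a higher-preference one, is applied repeatedly
# to "flow" the dataset toward better behavior, and the final trajectories are
# distilled into a policy by imitation learning.  The generator is a stub.
import numpy as np

def improve(trajectory: np.ndarray, rng) -> np.ndarray:
    # Placeholder for the conditional diffusion model p(tau_better | tau_worse);
    # here we merely perturb the input so the loop is runnable.
    return trajectory + 0.01 * rng.standard_normal(trajectory.shape)

def flow_to_better(dataset, n_rounds: int, seed: int = 0):
    rng = np.random.default_rng(seed)
    trajs = list(dataset)
    for _ in range(n_rounds):          # full-horizon improvement, no TD learning
        trajs = [improve(t, rng) for t in trajs]
    return trajs                       # fed to behavior cloning downstream

dataset = [np.zeros((50, 6)) for _ in range(4)]   # 4 trajectories, horizon 50
improved = flow_to_better(dataset, n_rounds=3)
```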

ICML Conference 2024 Conference Paper

Policy-conditioned Environment Models are More Generalizable

  • Ruifeng Chen 0003
  • Xiong-Hui Chen
  • Yihao Sun
  • Siyuan Xiao
  • Minhui Li
  • Yang Yu 0001

In reinforcement learning, it is crucial to have an accurate environment dynamics model to evaluate the value of different policies in downstream tasks like offline policy optimization and policy evaluation. However, the learned model is known to be inaccurate in predictions when evaluating target policies different from data-collection policies. In this work, we found that utilizing policy representation for model learning, called policy-conditioned model (PCM) learning, is useful to mitigate the problem, especially when the offline dataset is collected from diversified behavior policies. The reason is that, in this case, PCM becomes a meta-dynamics model trained to be aware of and focus on the evaluation policies, adjusting the model on the fly to suit each evaluation policy's state-action distribution and thus improving prediction accuracy. Based on that intuition, we propose an easy-to-implement yet effective PCM algorithm for accurate model learning. We also give a theoretical analysis and experimental evidence to demonstrate the feasibility of reducing value gaps by adapting the dynamics model under different policies. Experiment results show that PCM outperforms the existing SOTA off-policy evaluation methods in the DOPE benchmark by a large margin, and derives significantly better policies in offline policy selection and model predictive control compared with the standard model learning method.
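
A minimal sketch of what a policy-conditioned model can look like: a policy encoder embeds transitions produced by the policy under evaluation, and the dynamics network conditions on that embedding. The encoder design and shapes are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of a policy-conditioned model (PCM): an encoder embeds (s, a)
# pairs sampled from the policy being evaluated, and the dynamics model
# conditions its next-state prediction on that embedding, so one model can
# adapt to different evaluation policies.  Shapes are illustrative.
import torch
import torch.nn as nn

class PolicyConditionedModel(nn.Module):
    def __init__(self, s_dim: int, a_dim: int, z_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(s_dim + a_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, z_dim))
        self.dynamics = nn.Sequential(nn.Linear(s_dim + a_dim + z_dim, hidden),
                                      nn.ReLU(), nn.Linear(hidden, s_dim))

    def forward(self, s, a, policy_rollout):
        # policy_rollout: (N, s_dim + a_dim) transitions from the target policy;
        # its mean-pooled embedding conditions the dynamics prediction.
        z = self.encoder(policy_rollout).mean(dim=0, keepdim=True)
        z = z.expand(s.shape[0], -1)
        return self.dynamics(torch.cat([s, a, z], dim=-1))

model = PolicyConditionedModel(s_dim=4, a_dim=2)
s, a = torch.randn(8, 4), torch.randn(8, 2)
rollout = torch.randn(20, 6)            # 20 (s, a) pairs from the target policy
s_next = model(s, a, rollout)
```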

NeurIPS Conference 2024 Conference Paper

Provably and Practically Efficient Adversarial Imitation Learning with General Function Approximation

  • Tian Xu
  • Zhilong Zhang
  • Ruishuo Chen
  • Yihao Sun
  • Yang Yu

As a prominent category of imitation learning methods, adversarial imitation learning (AIL) has garnered significant practical success powered by neural network approximation. However, existing theoretical studies on AIL are primarily limited to simplified scenarios such as tabular and linear function approximation and involve complex algorithmic designs that hinder practical implementation, highlighting a gap between theory and practice. In this paper, we explore the theoretical underpinnings of online AIL with general function approximation. We introduce a new method called optimization-based AIL (OPT-AIL), which centers on performing online optimization for reward functions and optimism-regularized Bellman error minimization for Q-value functions. Theoretically, we prove that OPT-AIL achieves polynomial expert sample complexity and interaction complexity for learning near-expert policies. To the best of our knowledge, OPT-AIL is the first provably efficient AIL method with general function approximation. Practically, OPT-AIL only requires the approximate optimization of two objectives, thereby facilitating practical implementation. Empirical studies demonstrate that OPT-AIL outperforms previous state-of-the-art deep AIL methods in several challenging tasks.
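
The two objectives named in the abstract, online reward optimization and optimism-regularized Bellman error minimization, can be sketched as follows; the specific loss forms are plausible stand-ins rather than OPT-AIL's exact objectives.

```python
# Skeleton of the two objectives the abstract attributes to OPT-AIL:
# (1) an online update of the reward function to separate expert from agent
# data, and (2) Bellman error minimization for Q with an optimism bonus.
# These loss forms are stand-ins, not the paper's exact objectives.
import torch
import torch.nn as nn

def reward_loss(reward_net, expert_sa, agent_sa):
    # Push reward up on expert pairs and down on agent pairs (AIL-style).
    return reward_net(agent_sa).mean() - reward_net(expert_sa).mean()

def q_loss(q_net, reward_net, batch, gamma=0.99, optimism=0.1):
    s_a, next_s_a = batch
    with torch.no_grad():
        target = reward_net(s_a) + gamma * q_net(next_s_a)
    bellman_error = (q_net(s_a) - target).pow(2).mean()
    # Optimism regularization: encourage larger Q-values to drive exploration.
    return bellman_error - optimism * q_net(s_a).mean()

dim = 6  # state_dim + action_dim in this toy setup
reward_net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
q_net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(), nn.Linear(64, 1))
expert_sa, agent_sa = torch.randn(32, dim), torch.randn(32, dim)
loss_r = reward_loss(reward_net, expert_sa, agent_sa)
loss_q = q_loss(q_net, reward_net, (agent_sa, torch.randn(32, dim)))
```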

ECAI Conference 2023 Conference Paper

Model-Based Reinforcement Learning with Multi-Step Plan Value Estimation

  • Haoxin Lin
  • Yihao Sun
  • Jiaji Zhang
  • Yang Yu 0001

A promising way to improve the sample efficiency of reinforcement learning is model-based methods, in which many explorations and evaluations can happen in the learned models to save real-world samples. However, when the learned model has a non-negligible model error, sequential steps in the model are hard to evaluate accurately, limiting the model's utilization. This paper proposes to alleviate this issue by introducing multi-step plans into policy optimization for model-based RL. We employ the multi-step plan value estimation, which evaluates the expected discounted return after executing a sequence of action plans at a given state, and updates the policy by directly computing the multi-step policy gradient via plan value estimation. The new model-based reinforcement learning algorithm MPPVE (Model-based Planning Policy Learning with Multi-step Plan Value Estimation) shows better utilization of the learned model and achieves better sample efficiency than state-of-the-art model-based RL approaches. The code is available at https://github.com/HxLyn3/MPPVE.
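
The plan value idea can be shown in a few lines: a critic scores a state together with a k-step action plan, and the policy gradient flows through the entire plan. The architecture below is an illustrative guess, not MPPVE's implementation.

```python
# Sketch of multi-step plan value estimation: a critic scores a state plus a
# k-step action plan, and the policy gradient flows through the whole plan
# rather than a single action.  The architecture is an illustrative guess.
import torch
import torch.nn as nn

k, s_dim, a_dim = 3, 4, 2

plan_value = nn.Sequential(                 # Q(s, a_1, ..., a_k)
    nn.Linear(s_dim + k * a_dim, 128), nn.ReLU(), nn.Linear(128, 1))
policy = nn.Sequential(                     # emits a k-step plan in one shot
    nn.Linear(s_dim, 128), nn.ReLU(), nn.Linear(128, k * a_dim), nn.Tanh())

states = torch.randn(16, s_dim)
plans = policy(states)
# Multi-step policy gradient: ascend the plan value directly.
policy_loss = -plan_value(torch.cat([states, plans], dim=-1)).mean()
policy_loss.backward()
```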

ICML Conference 2023 Conference Paper

Model-Bellman Inconsistency for Model-based Offline Reinforcement Learning

  • Yihao Sun
  • Jiaji Zhang
  • Chengxing Jia
  • Haoxin Lin
  • Junyin Ye
  • Yang Yu 0001

For offline reinforcement learning (RL), model-based methods are expected to be data-efficient as they incorporate dynamics models to generate more data. However, due to inevitable model errors, straightforwardly learning a policy in the model typically fails in the offline setting. Previous studies have incorporated conservatism to prevent out-of-distribution exploration. For example, MOPO penalizes rewards through uncertainty measures from predicting the next states, which we have discovered are loose bounds of the ideal uncertainty, i.e., the Bellman error. In this work, we propose MOdel-Bellman Inconsistency penalized offLinE Policy Optimization (MOBILE), a novel uncertainty-driven offline RL algorithm. MOBILE conducts uncertainty quantification through the inconsistency of Bellman estimations under an ensemble of learned dynamics models, which can be a better approximator to the true Bellman error, and penalizes the Bellman estimation based on this uncertainty. Empirically we have verified that our proposed uncertainty quantification can be significantly closer to the true Bellman error than the compared methods. Consequently, MOBILE outperforms prior offline RL approaches on most tasks of D4RL and NeoRL benchmarks.
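
The abstract is explicit about the mechanism: Bellman estimates are computed under each member of a dynamics-model ensemble, and their disagreement penalizes the target. A minimal sketch, with all networks stubbed by small MLPs:

```python
# Sketch of a MOBILE-style penalty: each dynamics model in an ensemble yields
# its own Bellman estimate, their standard deviation quantifies uncertainty,
# and the Bellman target is penalized by that spread.  Internals are stubs.
import torch
import torch.nn as nn

def mobile_target(dynamics_ensemble, reward_fn, q_net, policy, s, a,
                  gamma=0.99, beta=1.5):
    estimates = []
    for model in dynamics_ensemble:
        s_next = model(torch.cat([s, a], dim=-1))      # each model's next state
        a_next = policy(s_next)
        q_next = q_net(torch.cat([s_next, a_next], dim=-1))
        estimates.append(reward_fn(s, a) + gamma * q_next)
    est = torch.stack(estimates)                        # (n_models, B, 1)
    # Model-Bellman inconsistency: disagreement of the Bellman estimates.
    return est.mean(dim=0) - beta * est.std(dim=0)

s_dim, a_dim = 4, 2
make = lambda i, o: nn.Sequential(nn.Linear(i, 64), nn.ReLU(), nn.Linear(64, o))
ensemble = [make(s_dim + a_dim, s_dim) for _ in range(5)]
q_net, policy = make(s_dim + a_dim, 1), make(s_dim, a_dim)
reward_fn = lambda s, a: torch.zeros(s.shape[0], 1)     # stub reward model
target = mobile_target(ensemble, reward_fn, q_net, policy,
                       torch.randn(8, s_dim), torch.randn(8, a_dim))
```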

AIIM Journal 2022 Journal Article

ISSMF: Integrated semantic and spatial information of multi-level features for automatic segmentation in prenatal ultrasound images

  • Yihao Sun
  • Hongjian Yang
  • Jiliu Zhou
  • Yan Wang

Ultrasound (US) imaging has been widely used as an effective tool for routine prenatal diagnosis. Biometrics obtained from fetal segmentation shed light on fetal health monitoring. However, segmentation in US images demands high accuracy from sonographers, making the task time-consuming and tedious. In this paper, we use DeepLabv3+ as the backbone and propose an Integrated Semantic and Spatial Information of Multi-level Features (ISSMF) based network to achieve automatic and accurate segmentation of four parts of the fetus in US images, whereas most previous works segment only one or two parts. Our contributions are threefold. First, to incorporate semantic information of high-level features and spatial information of low-level features of US images, we introduce a multi-level feature fusion module to integrate the features at different scales. Second, we propose to leverage the content-aware reassembly of features (CARAFE) upsampler to deeply explore the semantic and spatial information of multi-level features. Third, in order to alleviate performance degradation caused by batch normalization (BN) when the batch size is small, we use group normalization (GN) instead. Experiments on four parts of the fetus in US images show that our method outperforms U-Net, DeepLabv3+, and U-Net++, and the biometric measurements based on our segmentation results are close to those derived from sonographers with ten years of work experience.
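
A rough sketch of the three ingredients the abstract names: multi-level feature fusion, content-aware upsampling (CARAFE in the paper; plain bilinear interpolation stands in here), and GroupNorm in place of BatchNorm. Channel sizes are illustrative.

```python
# Rough sketch of the ISSMF ingredients named in the abstract: fuse high-level
# semantic features with low-level spatial features, upsample (bilinear here,
# CARAFE in the paper), and use GroupNorm instead of BatchNorm for small
# batches.  Channel sizes below are illustrative, not the paper's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLevelFusion(nn.Module):
    def __init__(self, low_ch: int, high_ch: int, out_ch: int = 64):
        super().__init__()
        self.low_proj = nn.Conv2d(low_ch, out_ch, kernel_size=1)
        self.high_proj = nn.Conv2d(high_ch, out_ch, kernel_size=1)
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * out_ch, out_ch, kernel_size=3, padding=1),
            nn.GroupNorm(num_groups=8, num_channels=out_ch),  # GN, not BN
            nn.ReLU(),
        )

    def forward(self, low_feat, high_feat):
        # Upsample semantic features to the spatial resolution of low_feat;
        # the paper uses CARAFE for this content-aware reassembly step.
        high_up = F.interpolate(self.high_proj(high_feat),
                                size=low_feat.shape[-2:], mode="bilinear",
                                align_corners=False)
        return self.fuse(torch.cat([self.low_proj(low_feat), high_up], dim=1))

fusion = MultiLevelFusion(low_ch=256, high_ch=512)
low = torch.randn(1, 256, 64, 64)    # low-level, high resolution
high = torch.randn(1, 512, 16, 16)   # high-level, low resolution
fused = fusion(low, high)            # (1, 64, 64, 64)
```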