Arrow Research search

Author name cluster

Zichong Li

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers

7

ICRA Conference 2025 Conference Paper

MARLadona - Towards Cooperative Team Play Using Multi-Agent Reinforcement Learning

  • Zichong Li
  • Filip Bjelonic
  • Victor Klemm
  • Marco Hutter 0001

Robot soccer, in its full complexity, poses an unsolved research challenge. Current solutions heavily rely on engineered heuristic strategies, which lack robustness and adaptability. Deep reinforcement learning has gained significant traction in various complex robotics tasks such as locomotion, manipulation, and competitive games (e. g. , AlphaZero, OpenAI Five), making it a promising solution to the robot soccer problem. This paper introduces MARLadona. A decentralized multi-agent reinforcement learning (MARL) training pipeline capable of producing agents with sophisticated team play behavior, bridging the shortcomings of heuristic methods. Furthermore, we created an open-source multi-agent soccer environment. Utilizing our MARL framework and a modified global entity encoder (GEE) as our core architecture, our approach achieves a 66. 8 % win rate against HELIOS agent, which employs a state-of-the-art heuristic strategy. In addition, we provided an in-depth analysis of the policy behavior and interpreted the agent's intention using the critic network.

NeurIPS Conference 2024 Conference Paper

Adaptive Preference Scaling for Reinforcement Learning with Human Feedback

  • Ilgee Hong
  • Zichong Li
  • Alexander Bukharin
  • Yixiao Li
  • Haoming Jiang
  • Tianbao Yang
  • Tuo Zhao

Reinforcement learning from human feedback (RLHF) is a prevalent approach to align AI systems with human values by learning rewards from human preference data. Due to various reasons, however, such data typically takes the form of rankings over pairs of trajectory segments, which fails to capture the varying strengths of preferences across different pairs. In this paper, we propose a novel adaptive preference loss, underpinned by distributionally robust optimization (DRO), designed to address this uncertainty in preference strength. By incorporating an adaptive scaling parameter into the loss for each pair, our method increases the flexibility of the reward function. Specifically, it assigns small scaling parameters to pairs with ambiguous preferences, leading to more comparable rewards, and large scaling parameters to those with clear preferences for more distinct rewards. Computationally, our proposed loss function is strictly convex and univariate with respect to each scaling parameter, enabling its efficient optimization through a simple second-order algorithm. Our method is versatile and can be readily adapted to various preference optimization frameworks, including direct preference optimization (DPO). Our experiments with robotic control and natural language generation with large language models (LLMs) show that our method not only improves policy performance but also aligns reward function selection more closely with policy optimization, simplifying the hyperparameter tuning process.

ICML Conference 2024 Conference Paper

Beyond Point Prediction: Score Matching-based Pseudolikelihood Estimation of Neural Marked Spatio-Temporal Point Process

  • Zichong Li
  • Qunzhi Xu
  • Zhenghao Xu
  • Yajun Mei
  • Tuo Zhao
  • Hongyuan Zha

Spatio-temporal point processes (STPPs) are potent mathematical tools for modeling and predicting events with both temporal and spatial features. Despite their versatility, most existing methods for learning STPPs either assume a restricted form of the spatio-temporal distribution, or suffer from inaccurate approximations of the intractable integral in the likelihood training objective. These issues typically arise from the normalization term of the probability density function. Moreover, existing works only provide point prediction for events without quantifying their uncertainty, such as confidence intervals for the event’s arrival time and confidence regions for the event’s location, which is crucial given the considerable randomness of the data. To tackle these challenges, we introduce SMASH: a Score MAtching-based pSeudolikeliHood estimator for learning marked STPPs. Specifically, our framework adopts a normalization-free objective by estimating the pseudolikelihood of marked STPPs through score-matching and predicts confidence intervals/regions for event time and location by generating samples through a score-based sampling algorithm. The superior performance of our proposed framework is demonstrated through extensive experiments on both point and confidence interval/region prediction of events.

NeurIPS Conference 2024 Conference Paper

Robust Reinforcement Learning from Corrupted Human Feedback

  • Alexander Bukharin
  • Ilgee Hong
  • Haoming Jiang
  • Zichong Li
  • Qingru Zhang
  • Zixuan Zhang
  • Tuo Zhao

Reinforcement learning from human feedback (RLHF) provides a principled framework for aligning AI systems with human preference data. For various reasons, e. g. , personal bias, context ambiguity, lack of training, etc, human annotators may give incorrect or inconsistent preference labels. To tackle this challenge, we propose a robust RLHF approach -- $R^3M$, which models the potentially corrupted preference label as sparse outliers. Accordingly, we formulate the robust reward learning as an $\ell_1$-regularized maximum likelihood estimation problem. Computationally, we develop an efficient alternating optimization algorithm, which only incurs negligible computational overhead compared with the standard RLHF approach. Theoretically, we prove that under proper regularity conditions, $R^3M$ can consistently learn the underlying reward and identify outliers, provided that the number of outlier labels scales sublinearly with the preference sample size. Furthermore, we remark that $R^3M$ is versatile and can be extended to various preference optimization methods, including direct preference optimization (DPO). Our experiments on robotic control and natural language generation with large language models (LLMs) show that $R^3M$ improves robustness of the reward against several types of perturbations to the preference data.

ICML Conference 2023 Conference Paper

SMURF-THP: Score Matching-based UnceRtainty quantiFication for Transformer Hawkes Process

  • Zichong Li
  • Yanbo Xu
  • Simiao Zuo
  • Haoming Jiang
  • Chao Zhang
  • Tuo Zhao
  • Hongyuan Zha

Transformer Hawkes process models have shown to be successful in modeling event sequence data. However, most of the existing training methods rely on maximizing the likelihood of event sequences, which involves calculating some intractable integral. Moreover, the existing methods fail to provide uncertainty quantification for model predictions, e. g. , confidence interval for the predicted event’s arrival time. To address these issues, we propose SMURF-THP, a score-based method for learning Transformer Hawkes process and quantifying prediction uncertainty. Specifically, SMURF-THP learns the score function of the event’s arrival time based on a score-matching objective that avoids the intractable computation. With such a learnt score function, we can sample arrival time of events from the predictive distribution. This naturally allows for the quantification of uncertainty by computing confidence intervals over the generated samples. We conduct extensive experiments in both event type prediction and uncertainty quantification on time of arrival. In all the experiments, SMURF-THP outperforms existing likelihood-based methods in confidence calibration while exhibiting comparable prediction accuracy.

AAAI Conference 2022 Conference Paper

Zeroth-Order Optimization for Composite Problems with Functional Constraints

  • Zichong Li
  • Pin-Yu Chen
  • Sijia Liu
  • Songtao Lu
  • Yangyang Xu

In many real-world problems, first-order (FO) derivative evaluations are too expensive or even inaccessible. For solving these problems, zeroth-order (ZO) methods that only need function evaluations are often more efficient than FO methods or sometimes the only options. In this paper, we propose a novel zeroth-order inexact augmented Lagrangian method (ZO-iALM) to solve black-box optimization problems, which involve a composite (i. e. , smooth+nonsmooth) objective and functional constraints. This appears to be the first work that develops an iALM-based ZO method for functional constrained optimization and meanwhile achieves query complexity results matching the best-known FO complexity results up to a factor of variable dimension. With an extensive experimental study, we show the effectiveness of our method. The applications of our method span from classical optimization problems to practical machine learning examples such as resource allocation in sensor networks and adversarial example generation.

ICML Conference 2020 Conference Paper

Transformer Hawkes Process

  • Simiao Zuo
  • Haoming Jiang
  • Zichong Li
  • Tuo Zhao
  • Hongyuan Zha

Modern data acquisition routinely produce massive amounts of event sequence data in various domains, such as social media, healthcare, and financial markets. These data often exhibit complicated short-term and long-term temporal dependencies. However, most of the existing recurrent neural network based point process models fail to capture such dependencies, and yield unreliable prediction performance. To address this issue, we propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies and meanwhile enjoys computational efficiency. Numerical experiments on various datasets show that THP outperforms existing models in terms of both likelihood and event prediction accuracy by a notable margin. Moreover, THP is quite general and can incorporate additional structural knowledge. We provide a concrete example, where THP achieves improved prediction performance for learning multiple point processes when incorporating their relational information.