Arrow Research · Search

Author name cluster

Yue Jin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

AAAI 2026 · Conference Paper

Partial Action Replacement: Tackling Distribution Shift in Offline MARL

  • Yue Jin
  • Giovanni Montana

Offline multi-agent reinforcement learning (MARL) is severely hampered by the challenge of evaluating out-of-distribution (OOD) joint actions. Our core finding is that when the behavior policy is factorized (a common scenario where agents act fully or partially independently during data collection), a strategy of partial action replacement (PAR) can significantly mitigate this challenge. PAR updates the actions of a single agent or a subset of agents while the others remain fixed to the behavioral data, reducing distribution shift compared to full joint-action updates. Based on this insight, we develop Soft-Partial Conservative Q-Learning (SPaCQL), which uses PAR to mitigate the OOD issue and dynamically weights different PAR strategies based on the uncertainty of value estimation. We provide a rigorous theoretical foundation for this approach, proving that under factorized behavior policies, the induced distribution shift scales linearly with the number of deviating agents rather than exponentially with the joint-action space. This yields a provably tighter value error bound for this important class of offline MARL problems. Our theoretical results also indicate that SPaCQL adaptively addresses distribution shift using uncertainty-informed weights. Our empirical results demonstrate that SPaCQL enables more effective policy learning and substantially outperforms baseline algorithms when the offline dataset exhibits this independence structure.
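
As a rough illustration of the PAR idea in this abstract, the sketch below builds a Q-target in which only one agent's dataset action is replaced by its current policy, and single-agent replacements are weighted by ensemble uncertainty. This is a minimal sketch under assumed interfaces (q_ensemble, policies), not the paper's released SPaCQL implementation.

```python
# Hypothetical sketch of partial action replacement (PAR) for a Q-target.
# All names (q_ensemble, policies) are illustrative assumptions.
import torch

def par_joint_action(batch_actions, policies, obs, replace_idx):
    """Replace the actions of the agents in `replace_idx` with their current
    policy's action; keep the remaining agents' actions from the dataset."""
    new_actions = batch_actions.clone()              # (B, n_agents, act_dim)
    for i in replace_idx:
        new_actions[:, i] = policies[i](obs[:, i])   # policy proposes an action
    return new_actions

def par_weighted_target(q_ensemble, obs, batch_actions, policies, n_agents):
    """Weight single-agent PAR targets by (negative) ensemble uncertainty,
    so low-uncertainty replacements dominate the bootstrapped value."""
    targets, neg_unc = [], []
    for i in range(n_agents):
        a = par_joint_action(batch_actions, policies, obs, [i])
        qs = torch.stack([q(obs, a) for q in q_ensemble])  # (E, B)
        targets.append(qs.mean(0))
        neg_unc.append(-qs.std(0))                   # high std -> low weight
    targets = torch.stack(targets)                   # (n_agents, B)
    weights = torch.softmax(torch.stack(neg_unc), dim=0)
    return (weights * targets).sum(0)                # (B,)
```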

ICLR 2025 · Conference Paper

Learning on One Mode: Addressing Multi-modality in Offline Reinforcement Learning

  • Mianchu Wang
  • Yue Jin
  • Giovanni Montana

Offline reinforcement learning (RL) seeks to learn optimal policies from static datasets without interacting with the environment. A common challenge is handling multi-modal action distributions, where multiple behaviours are represented in the data. Existing methods often assume unimodal behaviour policies, leading to suboptimal performance when this assumption is violated. We propose weighted imitation Learning on One Mode (LOM), a novel approach that focuses on learning from a single, promising mode of the behaviour policy. By using a Gaussian mixture model to identify modes and selecting the best mode based on expected returns, LOM avoids the pitfalls of averaging over conflicting actions. Theoretically, we show that LOM improves performance while maintaining simplicity in policy learning. Empirically, LOM outperforms existing methods on standard D4RL benchmarks and demonstrates its effectiveness in complex, multi-modal scenarios.
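
For intuition, here is a minimal sketch of the mode-selection step described above: fit a Gaussian mixture to dataset actions, score each mode by the mean value of its actions, and keep only the best mode for imitation. All names are illustrative; this is not the authors' LOM code.

```python
# Illustrative mode selection in the spirit of LOM; not the paper's code.
import numpy as np
from sklearn.mixture import GaussianMixture

def best_mode_actions(actions, q_values, n_modes=3, seed=0):
    """Fit a GMM to the dataset actions, score each mode by the mean Q-value
    of the actions assigned to it, and return only the best mode's actions."""
    gmm = GaussianMixture(n_components=n_modes, random_state=seed)
    labels = gmm.fit_predict(actions)                # (N,) mode assignment
    mode_scores = [q_values[labels == k].mean() for k in range(n_modes)]
    best = int(np.argmax(mode_scores))
    return actions[labels == best]                   # imitate this mode only

# Example: two conflicting behaviours; averaging them would be suboptimal.
rng = np.random.default_rng(0)
acts = np.concatenate([rng.normal(-1, 0.1, (50, 2)), rng.normal(1, 0.1, (50, 2))])
qs = np.concatenate([np.full(50, 0.2), np.full(50, 0.9)])  # right mode is better
print(best_mode_actions(acts, qs, n_modes=2).mean(axis=0))  # ~[1, 1]
```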

TMLR 2025 · Journal Article

State-Constrained Offline Reinforcement Learning

  • Charles Alexander Hepburn
  • Yue Jin
  • Giovanni Montana

Traditional offline reinforcement learning (RL) methods predominantly operate in a batch-constrained setting. This confines the algorithms to a specific state-action distribution present in the dataset, reducing the effects of distributional shift but restricting the policy to seen actions. In this paper, we alleviate this limitation by introducing state-constrained offline RL, a novel framework that focuses solely on the dataset's state distribution. This approach allows the policy to take high-quality out-of-distribution actions that lead to in-distribution states, significantly enhancing learning potential. The proposed setting not only broadens the learning horizon but also improves the ability to combine different trajectories from the dataset effectively, a desirable property inherent in offline RL. Our research is underpinned by theoretical findings that pave the way for subsequent advancements in this area. Additionally, we introduce StaCQ, a deep learning algorithm that achieves state-of-the-art performance on the D4RL benchmark datasets and aligns with our theoretical propositions. StaCQ establishes a strong baseline for forthcoming explorations in this domain.
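
A minimal sketch of the state-constrained idea follows, assuming a learned dynamics model and a nearest-neighbour test as a crude proxy for "in-distribution"; this is an assumption-laden illustration, not the paper's StaCQ algorithm.

```python
# Sketch: accept an out-of-distribution action only if the predicted next
# state stays close to the dataset's state distribution. `dynamics` and
# `eps` are assumed, illustrative interfaces.
import numpy as np

def state_constrained_filter(state, candidate_actions, dynamics, dataset_states, eps):
    """Keep candidate actions whose predicted next state has a near neighbour
    among dataset states (a crude proxy for being in-distribution)."""
    kept = []
    for a in candidate_actions:
        next_state = dynamics(state, a)          # learned or known model
        d = np.linalg.norm(dataset_states - next_state, axis=1).min()
        if d <= eps:                             # next state is near the data
            kept.append(a)
    return kept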

TMLR 2024 · Journal Article

Mitigating Relative Over-Generalization in Multi-Agent Reinforcement Learning

  • Ting Zhu
  • Yue Jin
  • Jeremie Houssineau
  • Giovanni Montana

In decentralized multi-agent reinforcement learning, agents learning in isolation can lead to relative over-generalization (RO), where optimal joint actions are undervalued in favor of suboptimal ones. This hinders effective coordination in cooperative tasks, as agents tend to choose actions that are individually rational but collectively suboptimal. To address this issue, we introduce MaxMax Q-Learning (MMQ), which employs an iterative process of sampling and evaluating potential next states, selecting those with maximal Q-values for learning. This approach refines approximations of ideal state transitions, aligning more closely with the optimal joint policy of collaborating agents. We provide theoretical analysis supporting MMQ's potential and present empirical evaluations across various environments susceptible to RO. Our results demonstrate that MMQ frequently outperforms existing baselines, exhibiting enhanced convergence and sample efficiency.
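
The max-max target described above could look roughly like the sketch below, which samples candidate next states and bootstraps from the best Q-value over both the sampled states and the actions; next_state_model and q_values are assumed interfaces, not the paper's implementation.

```python
# Rough sketch of a max-max bootstrapped target; names are illustrative.

def mmq_target(reward, state, action, next_state_model, q_values,
               gamma=0.99, n_samples=8):
    """Sample candidate next states, take the best achievable Q over both
    the sampled states and the actions, and bootstrap from that optimum."""
    candidates = [next_state_model(state, action) for _ in range(n_samples)]
    best_next = max(q_values(s).max() for s in candidates)  # max over states AND actions
    return reward + gamma * best_next
```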

AAMAS 2022 · Conference Paper

Learning to Advise and Learning from Advice in Cooperative Multiagent Reinforcement Learning

  • Yue Jin
  • Shuangqing Wei
  • Jian Yuan
  • Xudong Zhang

We propose a novel policy-level generative adversarial learning framework to enhance cooperative multiagent reinforcement learning (MARL), consisting of a centralized advisor, MARL agents, and discriminators. The advisor is realized as a dual graph convolutional network (DualGCN) that advises agents from a global perspective by fusing decision information, resolving spatial conflicts, and maintaining temporal continuity. Each discriminator is trained to distinguish between the policies of the advisor and an agent. Leveraging the discriminator's judgment, each agent learns to match the advised policy in addition to learning through its own exploration, which accelerates learning and enhances policy performance. Additionally, we propose an advisor-boosting method that incorporates the discriminators' feedback into the training of DualGCN to further improve the MARL agents. We validate our methods in cooperative navigation tasks. Results demonstrate that our method outperforms baseline methods in terms of both learning efficiency and policy efficacy.
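
As a hedged sketch of the adversarial matching term, the snippet below trains a discriminator to separate advisor (observation, action) pairs from agent pairs, and gives the agent an auxiliary loss for fooling it. The network shape and names are assumptions; the paper's system additionally uses DualGCN with graph-structured inputs.

```python
# Illustrative discriminator-guided imitation term; not the paper's system.
import torch
import torch.nn as nn

PAIR_DIM = 8  # assumed size of a concatenated (observation, action) pair
disc = nn.Sequential(nn.Linear(PAIR_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

def discriminator_loss(advisor_pairs, agent_pairs):
    """Train the discriminator to label advisor pairs 1 and agent pairs 0."""
    bce = nn.BCEWithLogitsLoss()
    return (bce(disc(advisor_pairs), torch.ones(len(advisor_pairs), 1)) +
            bce(disc(agent_pairs), torch.zeros(len(agent_pairs), 1)))

def agent_matching_loss(agent_pairs):
    """Auxiliary agent loss: be mistaken for the advisor, added on top of
    the agent's ordinary RL objective."""
    return nn.BCEWithLogitsLoss()(disc(agent_pairs),
                                  torch.ones(len(agent_pairs), 1))
```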

ICML 2022 · Conference Paper

Supervised Off-Policy Ranking

  • Yue Jin
  • Yue Zhang
  • Tao Qin 0001
  • Xudong Zhang 0001
  • Jian Yuan
  • Houqiang Li
  • Tie-Yan Liu

Off-policy evaluation (OPE) aims to evaluate a target policy with data generated by other policies. Most previous OPE methods focus on precisely estimating the true performance of a policy. We observe that in many applications, (1) the end goal of OPE is to compare two or more candidate policies and choose a good one, which is a much simpler task than precisely evaluating their true performance; and (2) there are usually multiple policies that have been deployed to serve users in real-world systems, so the true performance of these policies is known. Inspired by these two observations, we study a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of target policies via supervised learning by leveraging off-policy data and policies with known performance. We propose a method that learns a policy scoring model by minimizing a ranking loss over the training policies rather than estimating precise policy performance. The scoring model, a hierarchical Transformer-based model, maps a set of state-action pairs to a score, where the state of each pair comes from the off-policy data and the action is taken by a target policy on that state in an offline manner. Extensive experiments on public datasets show that our method outperforms baseline methods in terms of rank correlation, regret value, and stability. Our code is publicly available on GitHub.
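
A minimal sketch of the supervised ranking objective, assuming a scoring model that already maps each policy to a scalar score; the paper's hierarchical Transformer is replaced here by a generic pairwise hinge loss over training policies with known performance.

```python
# Sketch of a pairwise ranking loss over policy scores; illustrative only.
import torch

def pairwise_ranking_loss(scores, true_performance):
    """Hinge-style loss on every ordered pair of training policies:
    penalise the model when a worse policy outscores a better one."""
    loss = scores.new_zeros(())
    n = len(scores)
    for i in range(n):
        for j in range(n):
            if true_performance[i] > true_performance[j]:
                loss = loss + torch.relu(1.0 - (scores[i] - scores[j]))
    return loss / max(n * (n - 1) / 2, 1)
```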

ICRA 2011 · Conference Paper

A novel optimal calibration algorithm on a dexterous 6 DOF serial robot – with the optimization of measurement poses number

  • Tian Li
  • Kui Sun
  • Yue Jin
  • Hong Liu 0002

It is commonly believed that using more measurement poses in a robot calibration process yields a more accurate result. However, the accuracy improvement plateaus after a certain number of measurement poses. Moreover, robot calibration is a time-consuming process, and too many poses seriously complicate it and considerably increase the time required. In this paper, we propose a method for searching for the optimal number of measurement poses, improving the time efficiency of calibration. Optimal robot poses are added to an initial pose set one by one to build candidate pose sets for calibration. The root mean square (RMS) of the end-effector pose errors after calibration with each pose set is computed, and the pose-set size yielding the lowest RMS error is then selected. This algorithm achieves higher end-effector accuracy while consuming less time. A simulation on a serial robot manipulator with 24 unknown kinematic parameters shows that the end-effector pose accuracy after calibration with the optimal pose set is much better than before calibration, and better than calibration with a random pose set.
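
Schematically, the pose-count search reads like the sketch below, where calibrate and rms_pose_error stand in for the paper's kinematic-calibration and error-evaluation steps and are assumed, illustrative interfaces.

```python
# Schematic greedy pose-count search; not the paper's implementation.
import numpy as np

def optimal_pose_count(initial_poses, candidate_poses, calibrate, rms_pose_error):
    """Grow the pose set one candidate at a time, calibrate with each set,
    and return the set whose calibration gives the lowest RMS pose error."""
    pose_set = list(initial_poses)
    best_set, best_rms = list(pose_set), np.inf
    for pose in candidate_poses:
        pose_set.append(pose)
        params = calibrate(pose_set)         # estimate kinematic parameters
        rms = rms_pose_error(params)         # evaluate end-effector error
        if rms < best_rms:
            best_set, best_rms = list(pose_set), rms
    return best_set, best_rms
```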