Arrow Research search

Author name cluster

Xue Yan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
2 author rows

Possible papers

8

ICLR 2025 Conference Paper

Efficient Reinforcement Learning with Large Language Model Priors

  • Xue Yan
  • Yan Song
  • Xidong Feng
  • Mengyue Yang
  • Haifeng Zhang
  • Haitham Bou-Ammar
  • Jun Wang

In sequential decision-making (SDM) tasks, methods like reinforcement learning (RL) and heuristic search have made notable advances in specific cases. However, they often require extensive exploration and face challenges in generalizing across diverse environments due to their limited grasp of the underlying decision dynamics. In contrast, large language models (LLMs) have recently emerged as powerful general-purpose tools, due to their capacity to maintain vast amounts of domain-specific knowledge. To harness this rich prior knowledge for efficiently solving complex SDM tasks, we propose treating LLMs as prior action distributions and integrating them into RL frameworks through Bayesian inference methods, making use of variational inference and direct posterior sampling. The proposed approaches facilitate the seamless incorporation of fixed LLM priors into both policy-based and value-based RL frameworks. Our experiments show that incorporating LLM-based action priors significantly reduces exploration and optimization complexity, substantially improving sample efficiency compared to traditional RL techniques, e.g., using LLM priors decreases the number of required samples by over 90% in offline learning scenarios.
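As a rough sketch of the value-based variant the abstract describes, an agent might sample actions from a posterior proportional to the fixed LLM prior times a Boltzmann term over learned Q-values. All names and the temperature parameter below are illustrative assumptions, not the paper's actual formulation:

```python
import math
import random

def llm_prior_posterior_sample(prior, q_values, temperature=1.0, rng=random):
    """Sample an action from a posterior proportional to prior * exp(Q / tau).

    `prior` holds LLM-derived action probabilities; `q_values` holds the
    learned Q-values for the same actions. Both are hypothetical inputs
    used only to illustrate direct posterior sampling.
    """
    weights = [p * math.exp(q / temperature) for p, q in zip(prior, q_values)]
    total = sum(weights)
    probs = [w / total for w in weights]
    r, acc = rng.random(), 0.0
    for action, p in enumerate(probs):
        acc += p
        if r < acc:
            return action, probs
    return len(probs) - 1, probs
```

With a uniform prior and equal Q-values this degenerates to uniform sampling; a peaked LLM prior concentrates exploration on the actions the prior favours.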

AAMAS 2025 Conference Paper

Mean Field Correlated Imitation Learning

  • Zhiyu Zhao
  • Chengdong Ma
  • Qirui Mi
  • Ning Yang
  • Xue Yan
  • Mengyue Yang
  • Haifeng Zhang
  • Jun Wang

Modeling the behaviors of many-agent games is crucial for capturing the dynamics of large-scale complex systems. This is typically achieved by recovering policies from demonstrations within the Mean Field Game Imitation Learning (MFGIL) framework. However, most MFGIL methods assume that demonstrations are collected from Mean Field Nash Equilibrium (MFNE), implying that agents make decisions independently. When directly applied to situations where agents' decisions are coordinated, such as publicly routed traffic networks, these techniques often fall short. In this paper, we propose the Adaptive Mean Field Correlated Equilibrium (AMFCE), which introduces a generalized assumption that effectively integrates the correlated behaviors common in real-world systems. We prove the existence of AMFCE under mild conditions and theoretically show that MFNE is a special case of AMFCE. Building upon this, we introduce a new Mean Field Correlated Imitation Learning (MFCIL) algorithm, which recovers the expert policy more accurately in scenarios where agents' decisions are coordinated. We also provide a theoretical upper bound for the error in recovering the expert policy, which is tighter than that of existing methods. Empirical results on real-world traffic flow prediction and large-scale economic simulations demonstrate that MFCIL significantly improves the predictive performance of large populations' behaviors compared to existing MFGIL baselines. This improvement highlights the potential of MFCIL to model real-world multi-agent systems.

NeurIPS 2025 Conference Paper

Self-Verifying Reflection Helps Transformers with CoT Reasoning

  • Zhongwei Yu
  • Wannian Xia
  • Xue Yan
  • Bo Xu
  • Haifeng Zhang
  • Yali Du
  • Jun Wang

Advanced large language models (LLMs) frequently reflect in their reasoning chains of thought (CoTs), where they self-verify the correctness of current solutions and explore alternatives. However, given recent findings that LLMs detect limited errors in CoTs, how reflection contributes to empirical improvements remains unclear. To analyze this issue, we present a minimalistic reasoning framework that supports basic self-verifying reflection for small transformers without natural language, which ensures analytic clarity and reduces the cost of comprehensive experiments. Theoretically, we prove that self-verifying reflection guarantees improvements if verification errors are properly bounded. Experimentally, we show that tiny transformers, with only a few million parameters, benefit from self-verification in both training and reflective execution, reaching remarkable LLM-level performance in integer multiplication and Sudoku. Similar to LLM results, we find that reinforcement learning (RL) improves in-distribution performance and incentivizes frequent reflection for tiny transformers, yet RL mainly optimizes shallow statistical patterns without faithfully reducing verification errors. In conclusion, integrating generative transformers with discriminative verification inherently facilitates CoT reasoning, regardless of scaling and natural language.
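The generate-then-self-verify loop the abstract describes can be sketched generically. This is a minimal illustration with hypothetical callables, not the paper's framework: `generate(prev)` proposes a solution (optionally conditioned on a rejected previous attempt) and `verify` is the discriminative self-check:

```python
def reflective_solve(generate, verify, max_reflections=3):
    """Self-verifying reflection loop (illustrative sketch).

    Propose a solution; while self-verification rejects it, reflect by
    regenerating conditioned on the failed attempt, up to a budget.
    """
    solution = generate(None)
    for _ in range(max_reflections):
        if verify(solution):
            return solution
        solution = generate(solution)  # reflect on the rejected attempt
    return solution
```

The paper's theoretical result can be read in these terms: the loop helps only insofar as `verify` has bounded error, since a verifier that accepts wrong solutions (or rejects right ones) short-circuits the benefit of reflection.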

NeurIPS 2024 Conference Paper

Large Language Models Play StarCraft II: Benchmarks and A Chain of Summarization Approach

  • Weiyu Ma
  • Qirui Mi
  • Yongcheng Zeng
  • Xue Yan
  • Yuqiao Wu
  • Runji Lin
  • Haifeng Zhang
  • Jun Wang

With the continued advancement of Large Language Model (LLM) agents in reasoning, planning, and decision-making, benchmarks have become crucial in evaluating these skills. However, there is a notable gap in benchmarks for real-time strategic decision-making. StarCraft II (SC2), with its complex and dynamic nature, serves as an ideal setting for such evaluations. To this end, we have developed TextStarCraft II, a specialized environment for assessing LLMs in real-time strategic scenarios within SC2. Addressing the limitations of traditional Chain of Thought (CoT) methods, we introduce the Chain of Summarization (CoS) method, enhancing LLMs' capabilities in rapid and effective decision-making. Our key experiments included: 1. LLM Evaluation: Tested 10 LLMs in TextStarCraft II, most of which defeated the level-5 built-in AI, showcasing effective strategy skills. 2. Commercial Model Knowledge: Evaluated four commercial models on SC2 knowledge; GPT-4 ranked highest by Grandmaster-level experts. 3. Human-AI Matches: Experimental results showed that fine-tuned LLMs performed on par with Gold-level players in real-time matches, demonstrating comparable strategic abilities. All code and data from this study have been made publicly available at https://github.com/histmeisah/Large-Language-Models-play-StarCraftII

NeurIPS 2023 Conference Paper

An Efficient End-to-End Training Approach for Zero-Shot Human-AI Coordination

  • Xue Yan
  • Jiaxian Guo
  • Xingzhou Lou
  • Jun Wang
  • Haifeng Zhang
  • Yali Du

The goal of zero-shot human-AI coordination is to develop an agent that can collaborate with humans without relying on human data. Prevailing two-stage population-based methods require a diverse population of mutually distinct policies to simulate diverse human behaviors. The necessity of such populations severely limits their computational efficiency. To address this issue, we propose E3T, an Efficient End-to-End Training approach for zero-shot human-AI coordination. E3T employs a mixture of the ego policy and a random policy to construct the partner policy, making it both coordination-skilled and diverse. In this way, the ego agent is end-to-end trained with this mixture policy without the need for a pre-trained population, thus significantly improving the training efficiency. In addition, a partner modeling module is proposed to predict the partner's action from historical information. With the predicted partner's action, the ego policy is able to adapt its policy and take actions accordingly when collaborating with humans of different behavior patterns. Empirical results on the Overcooked environment show that our method significantly improves the training efficiency while achieving comparable or superior performance to the population-based baselines. Demo videos are available at https://sites.google.com/view/e3t-overcooked.
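The mixture partner policy described above admits a very small sketch. This is an illustrative guess at the construction (the mixing parameter and function names are assumptions, not the paper's notation): with some probability the partner acts uniformly at random (providing diversity), otherwise it follows the ego policy (providing coordination skill):

```python
import random

def mixture_partner_action(ego_policy, n_actions, state, epsilon=0.5, rng=random):
    """Partner policy as a mixture of the ego policy and a random policy.

    With probability `epsilon`, act uniformly at random; otherwise defer
    to the ego policy. Purely a sketch of the mixture idea.
    """
    if rng.random() < epsilon:
        return rng.randrange(n_actions)
    return ego_policy(state)
```

Because the partner is derived from the ego policy itself, no separately pre-trained population is needed, which is the source of the claimed efficiency gain.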

EAAI 2023 Journal Article

Covid-19 epidemic and regional carbon emissions: A study based on metabolic multivariate grey model with new information priority

  • Pingping Xiong
  • Xiaojie Wu
  • Xiaosu Zeng
  • Lingshan Hu
  • Xue Yan

The COVID-19 epidemic has had an unexpected impact on global carbon emissions. In this context, carbon emission prediction is very important for policy formulation and implementation. It is worth pondering how the grey model deals with the prediction of unexpected events. Considering the spatial dependence of carbon emissions, this paper improves the multivariate grey model, gives new information a higher reference weight, and constructs a metabolic multivariate grey model, MMGM(1, m | λ), which takes new information as the priority. An optimization algorithm is constructed with minimum error as the goal to determine the value of the weight adjustment parameter λ. Then we simulate and predict the carbon emissions of three regions: China's Yangtze River Delta (Shanghai, Jiangsu, Zhejiang, and Anhui), the North American Free Trade Area (Canada, Mexico, and the US), and West Europe (the UK and France). These three cases cover different regions, different numbers of behavioural variables, different degrees of influence by COVID-19, and different trends, making them comparable. The results show that the new model has a good fitting effect and remains applicable in dealing with unexpected events. The new model can reflect regional carbon emission changes more systematically and accurately. Finally, based on the cases and discussions, we draw some management insights to promote regional environmental collaborative governance.

AAAI 2022 Conference Paper

Learning to Identify Top Elo Ratings: A Dueling Bandits Approach

  • Xue Yan
  • Yali Du
  • Binxin Ru
  • Jun Wang
  • Haifeng Zhang
  • Xu Chen

The Elo rating system is widely adopted to evaluate the skills of game (e.g., chess) and sports players. Recently it has also been integrated into machine learning algorithms to evaluate the performance of computerised AI agents. However, an accurate estimation of the Elo rating (for the top players) often requires many rounds of competitions, which can be expensive to carry out. In this paper, to improve the sample efficiency of the Elo evaluation (for top players), we propose an efficient online match scheduling algorithm. Specifically, we identify and match the top players through a dueling bandits framework and tailor the bandit algorithm to the gradient-based update of Elo. We show that it reduces the per-step memory and time complexity to constant, compared to the traditional likelihood maximization approaches requiring O(t) time. Our algorithm has a regret guarantee of Õ(√T), sublinear in the number of competition rounds, and has been extended to multidimensional Elo ratings for handling intransitive games. We empirically demonstrate that our method achieves superior convergence speed and time efficiency on a variety of gaming tasks.
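The constant per-step cost the abstract mentions comes from the standard gradient-style Elo update, which touches only the two players in the scheduled match. A minimal sketch of that classic update (the K-factor value is a conventional choice, not from the paper):

```python
def elo_update(r_a, r_b, score_a, k=32.0):
    """One Elo update after a match between players A and B.

    `score_a` is 1.0 for an A win, 0.5 for a draw, 0.0 for a loss.
    Expected score uses the standard base-10 logistic model with a
    400-point scale; only two ratings change, so the step is O(1).
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new
```

The scheduling question the paper addresses is *which* pair to feed into this update next, which the dueling bandits framework answers so that top ratings converge with few matches.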

ICML 2021 Conference Paper

Estimating α-Rank from A Few Entries with Low Rank Matrix Completion

  • Yali Du 0001
  • Xue Yan
  • Xu Chen 0017
  • Jun Wang 0012
  • Haifeng Zhang 0002

Multi-agent evaluation aims at the assessment of an agent's strategy on the basis of interaction with others. Typically, existing methods such as α-rank and its approximations still require exhaustively comparing all pairs of joint strategies for an accurate ranking, which in practice is computationally expensive. In this paper, we aim to reduce the number of pairwise comparisons in recovering a satisfying ranking for n strategies in two-player meta-games, by exploiting the fact that agents with similar skills may achieve similar payoffs against others. Two situations are considered: the first is when we can obtain the true payoffs; the other is when we can only access noisy payoffs. Based on these formulations, we leverage low-rank matrix completion and design two novel algorithms for noise-free and noisy evaluations respectively. For both settings, we show that O(nr log n) payoff entries (where n is the number of agents and r is the rank of the payoff matrix) are required to achieve sufficiently good strategy-evaluation performance. Empirical results on evaluating the strategies in three synthetic games and twelve real-world games demonstrate that strategy evaluation from a few entries can lead to comparable performance to algorithms with full knowledge of the payoff matrix.
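The intuition behind the O(nr log n) sample bound shows up most cleanly in the rank-1 case, where one observed row and one observed column (about 2n entries) determine the entire n × n payoff matrix. The toy routine below illustrates that special case only; it is not the paper's algorithm, which handles general rank r and noisy entries:

```python
def complete_rank1(row0, col0):
    """Complete a rank-1 matrix from its first row and first column.

    For rank-1 M, every entry factors as M[i][j] = M[i][0] * M[0][j] / M[0][0],
    so ~2n observed entries determine all n^2. Requires col0[0] == row0[0] != 0.
    """
    pivot = row0[0]
    return [[col0[i] * row0[j] / pivot for j in range(len(row0))]
            for i in range(len(col0))]
```

For higher rank the same idea generalises: a rank-r matrix has only O(nr) degrees of freedom, so on the order of nr log n randomly observed payoffs suffice for completion, far fewer than the n² required by exhaustive pairwise evaluation.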