Arrow Research search

Author name cluster

Qiang Fu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

39 papers
2 author rows

Possible papers (39)

AAAI Conference 2026 Conference Paper

Deep (Predictive) Discounted Counterfactual Regret Minimization

  • Hang Xu
  • Kai Li
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing
  • Jian Cheng

Counterfactual regret minimization (CFR) is a family of algorithms for effectively solving imperfect-information games. To enhance CFR's applicability in large games, researchers use neural networks to approximate its behavior. However, existing methods are mainly based on vanilla CFR and struggle to effectively integrate more advanced CFR variants. In this work, we propose an efficient model-free neural CFR algorithm, overcoming the limitations of existing methods in approximating advanced CFR variants. At each iteration, it collects variance-reduced sampled advantages based on a value network, fits cumulative advantages by bootstrapping, and applies discounting and clipping operations to simulate the update mechanisms of advanced CFR variants. Experimental results show that, compared with model-free neural algorithms, it exhibits faster convergence in typical imperfect-information games and demonstrates stronger adversarial performance in a large poker game.
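
As a rough illustration of the update mechanism this paper trains a network to simulate, the sketch below applies DCFR-style discounting and RM+-style clipping to a vector of cumulative advantages. The hyperparameters and function names are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch: DCFR-style discounting plus regret-matching clipping,
# the kind of advanced-CFR update the neural algorithm above approximates.
import numpy as np

def discount_and_clip(cum_adv: np.ndarray, t: int,
                      alpha: float = 1.5, beta: float = 0.0) -> np.ndarray:
    """Shrink positive cumulative advantages by t^alpha / (t^alpha + 1)
    and negative ones by t^beta / (t^beta + 1), as in Discounted CFR."""
    pos_w = t**alpha / (t**alpha + 1.0)
    neg_w = t**beta / (t**beta + 1.0)
    return np.where(cum_adv > 0, cum_adv * pos_w, cum_adv * neg_w)

def regret_matching(cum_adv: np.ndarray) -> np.ndarray:
    """Clip to the positive part and normalize into a strategy."""
    pos = np.maximum(cum_adv, 0.0)
    total = pos.sum()
    return pos / total if total > 0 else np.full_like(pos, 1.0 / len(pos))
```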

AAAI Conference 2026 Conference Paper

FGD-Align: Pluralistic Alignment for Large Language Models via Fuzzy Group Decision-Making

  • Weihang Pan
  • Zhengxu Yu
  • Yong Wu
  • Xun Liang
  • Zhongming Jin
  • Qiang Fu
  • Penghui Shang
  • Binbin Lin

Ensuring alignment with human values is essential for modern large language models (LLMs), especially amid growing concerns around AI safety and social impact. Yet achieving such alignment remains challenging due to the limited, noisy, and often conflicting nature of human feedback from diverse annotators. Most existing approaches, such as Direct Preference Optimization (DPO), assume consistent and conflict-free supervision, overlooking the ambiguity, inconsistency, and value trade-offs inherent in real-world preferences—often leading to reduced robustness and exclusion of minority views. To address this, we propose FGD-Align, a novel pluralistic alignment framework grounded in Fuzzy Group Decision-Making theory. Our approach rigorously models and aggregates human preferences while retaining the complexity of real-world value trade-offs. Unlike traditional methods that rely on coarse-grained preference pairs, FGD-Align introduces fuzzy preference modeling via triangular fuzzy numbers to capture nuanced, multi-criteria human judgments. We further develop a new training objective, Probabilistic Fuzzy DPO, which incorporates fuzzy preference strength as adaptive loss weights and gradient filters, enhancing robustness to ambiguity and inconsistency in feedback. Comprehensive experiments demonstrate that FGD-Align consistently outperforms both DPO variants and advanced preference aggregation methods in terms of preference accuracy and robustness to ambiguity. It achieves superior alignment stability and better preserves minority preferences, all with minimal computational overhead. Our work bridges the gap between algorithmic tractability and the nuanced landscape of human values, enabling more scalable, inclusive, and socially-aware AI alignment.
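
To make the core objective concrete, here is a minimal sketch, under stated assumptions, of a DPO loss re-weighted by a defuzzified preference strength. The centroid defuzzification and the per-pair weighting are illustrative choices, not the paper's exact Probabilistic Fuzzy DPO.

```python
# Hedged sketch: fuzzy preference strength as an adaptive weight on DPO.
import torch.nn.functional as F

def defuzzify(l, m, u):
    """Centroid of a triangular fuzzy number (l <= m <= u), giving a
    scalar preference strength in [0, 1]."""
    return (l + m + u) / 3.0

def weighted_dpo_loss(logp_chosen, logp_rejected,   # policy log-probs
                      ref_chosen, ref_rejected,     # frozen reference log-probs
                      strength, beta: float = 0.1):
    """Standard DPO logits, re-weighted per pair by fuzzy strength."""
    logits = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    per_pair = -F.logsigmoid(logits)
    return (strength * per_pair).mean()  # ambiguous (low-strength) pairs count less
```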

EAAI Journal 2025 Journal Article

A global linear attention incorporated video transformer for robust sintering condition recognition

  • Leyuan Wu
  • Junlin Wu
  • Dingxiang Wang
  • Qiang Fu

Robust and accurate sintering condition recognition is a fundamental yet critical issue in the design of image-based intelligent combustion control systems. However, owing to the weak texture and fast-changing characteristics of flame videos, capturing the condition indicator using existing gradient-based methods is challenging. To address this issue, we propose a global linear attention incorporated video transformer model for sintering condition recognition. First, to reduce the prediction error and uncertainty, the spatial-temporal features are extracted to describe the dynamic characteristics of the flame video streams based on the video shifted window (Swin) Transformer architecture. Next, to address the problem that the local attention strategy used in the Video Swin Transformer is insufficient for global flame feature extraction, we propose a Video Linear Attention block that obtains the global attention as a supplement. Extensive experiments conducted on a real-world rotary kiln sintering dataset demonstrate the effectiveness of our approach, achieving an overall accuracy of 97.76% and an F1-score of 95.30%. Compared to the Video Swin Transformer model, these results represent improvements of 2.00% in accuracy and 4.96% in F1-score, respectively. This research is particularly significant in the context of real-time identification of combustion process conditions, optimization of control parameters, and realization of more stable and efficient combustion process control.
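
For readers unfamiliar with linear attention, the generic mechanism such a block builds on can be sketched as follows: softmax(QK^T)V, which is quadratic in sequence length, is replaced by phi(Q)(phi(K)^T V), which is linear. The elu-plus-one feature map is a common choice and an assumption here; the paper's spatio-temporal block is more elaborate.

```python
# Minimal sketch of (non-causal) linear attention; cost is O(seq_len * dim^2).
import torch

def linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: tensors of shape (batch, seq_len, dim)."""
    phi_q = torch.nn.functional.elu(q) + 1.0
    phi_k = torch.nn.functional.elu(k) + 1.0
    kv = torch.einsum("bsd,bse->bde", phi_k, v)           # (batch, dim, dim)
    z = torch.einsum("bsd,bd->bs", phi_q, phi_k.sum(1))   # normalizer
    return torch.einsum("bsd,bde->bse", phi_q, kv) / (z.unsqueeze(-1) + eps)
```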

EAAI Journal 2025 Journal Article

A novel condition monitoring approach using hybrid lightweighted adaptive models for complex machinery

  • Yingqian Liu
  • Rongyong Zhang
  • Luigi Grossi
  • Zhipin Ye
  • Huairui Li
  • Rongsheng Zhu
  • Qiang Fu

Condition monitoring of complex industrial systems is critical for ensuring operational reliability. Data-driven methods using artificial intelligence have advanced anomaly detection (AD) and fault diagnosis (FD), but existing approaches often treat them separately, focus on known faults, and struggle with previously unseen or rare conditions in multi-modal scenarios. This study proposes a novel condition monitoring framework that integrates AD and FD within a distributed architecture. Lightweight models—including kernel principal component analysis, support vector machines, and one-dimensional convolutional neural networks—enable efficient and scalable processing. A multilevel information fusion strategy ensures consistent detection and diagnosis while facilitating the isolation of previously unknown faults. Module test results demonstrate the effectiveness and robustness of the proposed feature extraction and adaptive modeling approaches. The overall test results for previously unknown faults vary across channels and modules. For samples with misalignment and inner blade wear, channel-level detection accuracy ranges from 0.007 to 0.989, with unknown recognition rates up to 0.933 and diagnosis probabilities from 0.508 to 0.933. For strong misalignment and fan-end inner race faults, nearly all channels achieve 100% detection accuracy, with some diagnosis probabilities above 0.9, while unknown recognition remains minimal (mostly below 0.05). Importantly, the proposed framework integrates detection and diagnostic outputs across channels, effectively mapping previously unseen faults to similar known categories or to an unknown category. Overall, the proposed framework offers a reference solution for condition monitoring of industrial systems like pumps, turbines, and compressors, and lays the foundation for future improvements incorporating domain knowledge and model-driven interpretability.

AAAI Conference 2025 Conference Paper

An Open-Ended Learning Framework for Opponent Modeling

  • Yuheng Jing
  • Kai Li
  • Bingyun Liu
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing
  • Jian Cheng

Opponent Modeling (OM) aims to enhance decision-making by modeling other agents in multi-agent environments. Existing works typically learn opponent models against a pre-designated fixed set of opponents during training. However, this will cause poor generalization when facing unknown opponents during testing, as previously unseen opponents can exhibit out-of-distribution (OOD) behaviors that the learned opponent models cannot handle. To tackle this problem, we introduce a novel Open-Ended Opponent Modeling (OEOM) framework, which continuously generates opponents with diverse strengths and styles to reduce the possibility of OOD situations occurring during testing. Founded on population-based training and information-theoretic trajectory space diversity regularization, OEOM generates a dynamic set of opponents. This set is then fed to any OM approaches to train a potentially generalizable opponent model. Upon this, we further propose a simple yet effective OM approach that naturally fits within the OEOM framework. This approach is based on in-context reinforcement learning and learns a Transformer that dynamically recognizes and responds to opponents based on their trajectories. Extensive experiments in cooperative, competitive, and mixed environments demonstrate that OEOM is an approach-agnostic framework that improves generalizability compared to training against a fixed set of opponents, regardless of OM approaches or testing opponent settings. The results also indicate that our proposed approach generally outperforms existing OM baselines.

NeurIPS Conference 2025 Conference Paper

Hamiltonian Descent Algorithms for Optimization: Accelerated Rates via Randomized Integration Time

  • Qiang Fu
  • Andre Wibisono

We study the Hamiltonian flow for optimization (HF-opt), which simulates the Hamiltonian dynamics for some integration time and resets the velocity to zero to decrease the objective function; this is the optimization analogue of the Hamiltonian Monte Carlo algorithm for sampling. For short integration time, HF-opt has the same convergence rates as gradient descent for minimizing strongly and weakly convex functions. We show that by randomizing the integration time in HF-opt, the resulting randomized Hamiltonian flow (RHF) achieves accelerated convergence rates in continuous time, similar to the rates for accelerated gradient flow. We study a discrete-time implementation of RHF as the randomized Hamiltonian gradient descent (RHGD) algorithm. We prove that RHGD achieves the same accelerated convergence rates as Nesterov's accelerated gradient descent (AGD) for minimizing smooth strongly and weakly convex functions. We provide numerical experiments to demonstrate that RHGD is competitive with classical accelerated methods such as AGD across all settings and outperforms them in certain regimes.
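
A minimal sketch of the RHGD idea, assuming a leapfrog integrator and a uniform distribution over integration times (both illustrative choices, as is the step size):

```python
# Hedged sketch: run Hamiltonian dynamics for a random integration time,
# then reset the velocity to zero, and repeat.
import numpy as np

def rhgd(grad_f, x0, n_outer: int = 100, eta: float = 0.1,
         max_time: float = 1.0, seed: int = 0):
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_outer):
        v = np.zeros_like(x)                  # velocity reset between rounds
        t = rng.uniform(0.0, max_time)        # randomized integration time
        n_steps = max(1, int(t / eta))
        v -= 0.5 * eta * grad_f(x)            # leapfrog: initial half step
        for _ in range(n_steps - 1):
            x += eta * v
            v -= eta * grad_f(x)
        x += eta * v
        v -= 0.5 * eta * grad_f(x)            # final half step (v discarded)
    return x

# Example: minimize f(x) = 0.5 * ||x||^2, whose gradient is x.
x_min = rhgd(lambda x: x, x0=np.ones(3))
```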

ICLR Conference 2025 Conference Paper

Online-to-Offline RL for Agent Alignment

  • Xu Liu
  • Haobo Fu
  • Stefano V. Albrecht
  • Qiang Fu
  • Shuai Li

Reinforcement learning (RL) has shown remarkable success in training agents to achieve high-performing policies, particularly in domains like Game AI where simulation environments enable efficient interactions. However, despite their success in maximizing returns, such online-trained policies often fail to align with human preferences concerning actions, styles, and values. The challenge lies in efficiently adapting these online-trained policies to align with human preferences, given the scarcity and high cost of collecting human behavior data. In this work, we formalize the problem as *online-to-offline* RL and propose ALIGNment of Game AI to Preferences (ALIGN-GAP), an innovative approach for the alignment of well-trained game agents to human preferences. Our method features a carefully designed reward model that encodes human preferences from limited offline data and incorporates curriculum-based preference learning to align RL agents with targeted human preferences. Experiments across diverse environments and preference types demonstrate the performance of ALIGN-GAP, achieving effective alignment with human preferences.

NeurIPS Conference 2025 Conference Paper

TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs

  • Yuxiang Zhang
  • Zhengxu Yu
  • Weihang Pan
  • Zhongming Jin
  • Qiang Fu
  • Deng Cai
  • Binbin Lin
  • Jieping Ye

Emerging reasoning LLMs such as OpenAI-o1 and DeepSeek-R1 have achieved strong performance on complex reasoning tasks by generating long chain-of-thought (CoT) traces. However, these long CoTs result in increased token usage, leading to higher inference latency and memory consumption. As a result, balancing accuracy and reasoning efficiency has become essential for deploying reasoning LLMs in practical applications. Existing long-to-short (Long2Short) methods aim to reduce inference length but often sacrifice accuracy, revealing a need for an approach that maintains performance while lowering token costs. To address this efficiency-accuracy tradeoff, we propose TokenSqueeze, a novel Long2Short method that condenses reasoning paths while preserving performance and relying exclusively on self-generated data. First, to prevent performance degradation caused by excessive compression of reasoning depth, we propose to select self-generated samples whose reasoning depth is adaptively matched to the complexity of the problem. To further optimize the linguistic expression without altering the underlying reasoning paths, we introduce a distribution-aligned linguistic refinement method that enhances the clarity and conciseness of the reasoning path while preserving its logical integrity. Comprehensive experimental results demonstrate the effectiveness of TokenSqueeze in reducing token usage while maintaining accuracy. Notably, DeepSeek-R1-Distill-Qwen-7B fine-tuned using our proposed method achieved a 50% average token reduction while preserving accuracy on the MATH500 benchmark. TokenSqueeze exclusively utilizes the model's self-generated data, enabling efficient and high-fidelity reasoning without relying on manually curated short-answer datasets across diverse applications. Our code is available at https://github.com/zhangyx1122/TokenSqueeze.
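
The adaptive sample-selection idea can be illustrated with a hedged sketch: among several self-generated solutions, keep the shortest one that remains correct, so the retained reasoning depth tracks problem difficulty. `generate` and `is_correct` are hypothetical stand-ins, and the paper's actual selection criterion is more refined than length alone.

```python
# Hedged sketch: depth-adaptive selection of self-generated training samples.
def select_adaptive_sample(generate, is_correct, problem, k: int = 8):
    """Sample k candidate solutions and keep the shortest correct one."""
    candidates = [generate(problem) for _ in range(k)]
    correct = [c for c in candidates if is_correct(problem, c)]
    return min(correct, key=len) if correct else None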

TMLR Journal 2024 Journal Article

Affordable Generative Agents

  • Yangbin Yu
  • Qin Zhang
  • Junyou Li
  • Qiang Fu
  • Deheng Ye

The emergence of large language models (LLMs) has significantly advanced the simulation of believable interactive agents. However, the substantial cost of maintaining prolonged agent interactions poses a challenge to the deployment of believable LLM-based agents. Therefore, in this paper, we develop Affordable Generative Agents (AGA), a framework for enabling the generation of believable and low-cost interactions at both the agent-environment and inter-agent levels. Specifically, for agent-environment interactions, we substitute repetitive LLM inferences with learned policies; for inter-agent interactions, we model the social relationships between agents and compress auxiliary dialogue information. Extensive experiments on multiple environments show the effectiveness and efficiency of our proposed framework. We also delve into the mechanisms of emergent believable behaviors in LLM agents, demonstrating that agents can only generate finite behaviors in fixed environments, based on which we identify ways to facilitate emergent interaction behaviors. Our code is publicly available at: https://github.com/AffordableGenerativeAgents/Affordable-Generative-Agents.

AIJ Journal 2024 Journal Article

Automatically designing counterfactual regret minimization algorithms for solving imperfect-information games

  • Kai Li
  • Hang Xu
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing

Strategic decision-making in imperfect-information games is an important problem in artificial intelligence. Counterfactual regret minimization (CFR), a family of iterative algorithms, has been the workhorse for solving these types of games since its inception. In recent years, a series of novel CFR variants have been proposed, significantly improving the convergence rate of vanilla CFR. However, most of these new variants are hand-designed by researchers through trial and error, often based on different motivations, which generally requires a tremendous amount of effort and insight. This work proposes AutoCFR, a systematic framework that meta-learns novel CFR algorithms through evolution, easing the burden of manual algorithm design. We first design a search language that is rich enough to represent various CFR variants. We then exploit a scalable regularized evolution algorithm with a set of acceleration techniques to efficiently search over the combinatorial space of algorithms defined by this language. The learned novel CFR algorithm can generalize to new imperfect-information games not seen during training and performs on par with or better than existing state-of-the-art CFR variants. In addition to superior empirical performance, we also theoretically show that the learned algorithm converges to an approximate Nash equilibrium. Extensive experiments across diverse imperfect-information games highlight the scalability, extensibility, and generalizability of AutoCFR, establishing it as a general-purpose framework for solving imperfect-information games.
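
The search loop at AutoCFR's core is regularized (aging) evolution. A generic sketch of that loop follows, with the CFR-specific program representation, mutation, and evaluation left abstract as user-supplied callables:

```python
# Generic regularized-evolution search loop (aging evolution), as a sketch of
# the kind of search AutoCFR runs over its algorithm space.
import random
from collections import deque

def regularized_evolution(random_program, mutate, evaluate,
                          pop_size: int = 100, sample_size: int = 10,
                          n_cycles: int = 1000):
    population = deque(maxlen=pop_size)       # oldest member dies automatically
    for _ in range(pop_size):
        prog = random_program()
        population.append((prog, evaluate(prog)))
    for _ in range(n_cycles):
        parent = max(random.sample(list(population), sample_size),
                     key=lambda pf: pf[1])    # tournament selection
        child = mutate(parent[0])
        population.append((child, evaluate(child)))
    return max(population, key=lambda pf: pf[1])[0]
```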

NeurIPS Conference 2024 Conference Paper

Efficient Multi-task Reinforcement Learning with Cross-Task Policy Guidance

  • Jinmin He
  • Kai Li
  • Yifan Zang
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing
  • Jian Cheng

Multi-task reinforcement learning endeavors to efficiently leverage shared information across various tasks, facilitating the simultaneous learning of multiple tasks. Existing approaches primarily focus on parameter sharing with carefully designed network structures or tailored optimization procedures. However, they overlook a direct and complementary way to exploit cross-task similarities: the control policies of tasks already proficient in some skills can provide explicit guidance for unmastered tasks to accelerate skill acquisition. To this end, we present a novel framework called Cross-Task Policy Guidance (CTPG), which trains a guide policy for each task to select the behavior policy interacting with the environment from all tasks' control policies, generating better training trajectories. In addition, we propose two gating mechanisms to improve the learning efficiency of CTPG: one gate filters out control policies that are not beneficial for guidance, while the other gate blocks tasks that do not necessitate guidance. CTPG is a general framework adaptable to existing parameter sharing approaches. Empirical evaluations demonstrate that incorporating CTPG with these approaches significantly enhances performance in manipulation and locomotion benchmarks.

IJCAI Conference 2024 Conference Paper

Minimizing Weighted Counterfactual Regret with Optimistic Online Mirror Descent

  • Hang Xu
  • Kai Li
  • Bingyun Liu
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing
  • Jian Cheng

Counterfactual regret minimization (CFR) is a family of algorithms for effectively solving imperfect-information games. It decomposes the total regret into counterfactual regrets, utilizing local regret minimization algorithms, such as Regret Matching (RM) or RM+, to minimize them. Recent research establishes a connection between Online Mirror Descent (OMD) and RM+, paving the way for an optimistic variant PRM+ and its extension PCFR+. However, PCFR+ assigns uniform weights for each iteration when determining regrets, leading to substantial regrets when facing dominated actions. This work explores minimizing weighted counterfactual regret with optimistic OMD, resulting in a novel CFR variant PDCFR+. It integrates PCFR+ and Discounted CFR (DCFR) in a principled manner, swiftly mitigating negative effects of dominated actions and consistently leveraging predictions to accelerate convergence. Theoretical analyses prove that PDCFR+ converges to a Nash equilibrium, particularly under distinct weighting schemes for regrets and average strategies. Experimental results demonstrate PDCFR+'s fast convergence in common imperfect-information games. The code is available at https://github.com/rpSebastian/PDCFRPlus.
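
To convey the flavor of the combination, here is a rough per-infoset sketch mixing RM+-style clipping, DCFR-style discounting, and an optimistic prediction step. The weighting details are simplified assumptions rather than PDCFR+'s exact scheme.

```python
# Hedged sketch: one discounted, predictive regret-matching+ update.
import numpy as np

def pdcfr_plus_step(cum_regret: np.ndarray, last_regret: np.ndarray,
                    t: int, alpha: float = 1.5):
    w = t**alpha / (t**alpha + 1.0)                   # discount past regrets
    cum_regret = np.maximum(w * cum_regret + last_regret, 0.0)   # RM+ clip
    predicted = np.maximum(cum_regret + last_regret, 0.0)        # optimism
    total = predicted.sum()
    strategy = (predicted / total if total > 0
                else np.full_like(predicted, 1.0 / len(predicted)))
    return cum_regret, strategy
```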

TMLR Journal 2024 Journal Article

More Agents Is All You Need

  • Junyou Li
  • Qin Zhang
  • Yangbin Yu
  • Qiang Fu
  • Deheng Ye

We find that, simply via a sampling-and-voting method, the performance of large language models (LLMs) scales with the number of agents instantiated. Also, this method, termed Agent Forest, is orthogonal to existing complicated methods that further enhance LLMs, while the degree of enhancement is correlated with task difficulty. We conduct comprehensive experiments on a wide range of LLM benchmarks to verify our finding, and to study the properties that can facilitate its occurrence. Our code is publicly available at: https://github.com/MoreAgentsIsAllYouNeed/AgentForest
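
The sampling-and-voting method is simple enough to state directly. In this minimal sketch, `query_llm` is a hypothetical stand-in for whatever generation API is in use, and answers are assumed to be comparable strings:

```python
# Sketch of sampling-and-voting: query the same model n times, majority-vote.
from collections import Counter

def sample_and_vote(query_llm, prompt: str, n_agents: int = 10) -> str:
    answers = [query_llm(prompt) for _ in range(n_agents)]
    return Counter(answers).most_common(1)[0][0]
```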

AAAI Conference 2024 Conference Paper

Not All Tasks Are Equally Difficult: Multi-Task Deep Reinforcement Learning with Dynamic Depth Routing

  • Jinmin He
  • Kai Li
  • Yifan Zang
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing
  • Jian Cheng

Multi-task reinforcement learning endeavors to accomplish a set of different tasks with a single policy. To enhance data efficiency by sharing parameters across multiple tasks, a common practice segments the network into distinct modules and trains a routing network to recombine these modules into task-specific policies. However, existing routing approaches employ a fixed number of modules for all tasks, neglecting that tasks with varying difficulties commonly require varying amounts of knowledge. This work presents a Dynamic Depth Routing (D2R) framework, which learns strategic skipping of certain intermediate modules, thereby flexibly choosing different numbers of modules for each task. Under this framework, we further introduce a ResRouting method to address the issue of disparate routing paths between behavior and target policies during off-policy training. In addition, we design an automatic route-balancing mechanism to encourage continued routing exploration for unmastered tasks without disturbing the routing of mastered ones. We conduct extensive experiments on various robotics manipulation tasks in the Meta-World benchmark, where D2R achieves state-of-the-art performance with significantly improved learning efficiency.

NeurIPS Conference 2024 Conference Paper

Opponent Modeling with In-context Search

  • Yuheng Jing
  • Bingyun Liu
  • Kai Li
  • Yifan Zang
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing
  • Jian Cheng

Opponent modeling is a longstanding research topic aimed at enhancing decision-making by modeling information about opponents in multi-agent environments. However, existing approaches often face challenges such as difficulty generalizing to unknown opponent policies and unstable performance. To tackle these challenges, we propose a novel approach based on in-context learning and decision-time search named Opponent Modeling with In-context Search (OMIS). OMIS leverages in-context learning-based pretraining to train a Transformer model for decision-making. It consists of three in-context components: an actor learning best responses to opponent policies, an opponent imitator mimicking opponent actions, and a critic estimating state values. When testing in an environment that features unknown non-stationary opponent agents, OMIS uses pretrained in-context components for decision-time search to refine the actor's policy. Theoretically, we prove that under reasonable assumptions, OMIS without search converges in opponent policy recognition and has good generalization properties; with search, OMIS provides improvement guarantees, exhibiting performance stability. Empirically, in competitive, cooperative, and mixed environments, OMIS demonstrates more effective and stable adaptation to opponents than other approaches. See our project website at https://sites.google.com/view/nips2024-omis.

AAAI Conference 2024 Conference Paper

Text-to-Image Generation for Abstract Concepts

  • Jiayi Liao
  • Xu Chen
  • Qiang Fu
  • Lun Du
  • Xiangnan He
  • Xiang Wang
  • Shi Han
  • Dongmei Zhang

Recent years have witnessed the substantial progress of large-scale models across various domains, such as natural language processing and computer vision, facilitating the expression of concrete concepts. Unlike concrete concepts that are usually directly associated with physical objects, expressing abstract concepts through natural language requires considerable effort, since they are characterized by intricate semantics and connotations. An alternative approach is to leverage images to convey rich visual information as a supplement. Nevertheless, existing Text-to-Image (T2I) models are primarily trained on concrete physical objects and often struggle to visualize abstract concepts. Inspired by the three-layer artwork theory, which identifies the critical factors of intent, object, and form in artistic creation, we propose a framework of Text-to-Image generation for Abstract Concepts (TIAC). The abstract concept is clarified into a clear intent with a detailed definition to avoid ambiguity. LLMs then transform it into semantic-related physical objects, and the concept-dependent form is retrieved from an LLM-extracted form pattern set. Information from these three aspects is integrated to generate prompts for T2I models via an LLM. Evaluation results from human assessments and our newly designed metric, concept score, demonstrate the effectiveness of our framework in creating images that can sufficiently express abstract concepts.

NeurIPS Conference 2023 Conference Paper

A Robust and Opponent-Aware League Training Method for StarCraft II

  • Ruozi Huang
  • Xipeng Wu
  • Hongsheng Yu
  • Zhong Fan
  • Haobo Fu
  • Qiang Fu
  • Wei Yang

It is extremely difficult to train a superhuman Artificial Intelligence (AI) for games of similar size to StarCraft II. AlphaStar is the first AI that beat human professionals in the full game of StarCraft II, using a league training framework that is inspired by a game-theoretic approach. In this paper, we improve AlphaStar's league training in two significant aspects. We train goal-conditioned exploiters, whose abilities of spotting weaknesses in the main agent and the entire league are greatly improved compared to the unconditioned exploiters in AlphaStar. In addition, we endow the agents in the league with the new ability of opponent modeling, which makes the agent more responsive to the opponent's real-time strategy. Based on these improvements, we train a stronger, superhuman AI with orders of magnitude fewer resources than AlphaStar (see Table 1 for a full comparison). Considering the iconic role of StarCraft II in game AI research, we believe our method and results on StarCraft II provide valuable design principles on how one would utilize the general league training framework for obtaining a least-exploitable strategy in various large-scale, real-world games.

NeurIPS Conference 2023 Conference Paper

Automatic Grouping for Efficient Cooperative Multi-Agent Reinforcement Learning

  • Yifan Zang
  • Jinmin He
  • Kai Li
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing
  • Jian Cheng

Grouping is ubiquitous in natural systems and is essential for promoting efficiency in team coordination. This paper proposes a novel formulation of Group-oriented Multi-Agent Reinforcement Learning (GoMARL), which learns automatic grouping without domain knowledge for efficient cooperation. In contrast to existing approaches that attempt to directly learn the complex relationship between the joint action-values and individual utilities, we empower subgroups as a bridge to model the connection between small sets of agents and encourage cooperation among them, thereby improving the learning efficiency of the whole team. In particular, we factorize the joint action-values as a combination of group-wise values, which guide agents to improve their policies in a fine-grained fashion. We present an automatic grouping mechanism to generate dynamic groups and group action-values. We further introduce a hierarchical control for policy learning that drives the agents in the same group to specialize in similar policies and possess diverse strategies for various groups. Experiments on the StarCraft II micromanagement tasks and Google Research Football scenarios verify our method's effectiveness. Extensive component studies show how grouping works and enhances performance.

IJCAI Conference 2023 Conference Paper

Causal-Based Supervision of Attention in Graph Neural Network: A Better and Simpler Choice towards Powerful Attention

  • Hongjun Wang
  • Jiyuan Chen
  • Lun Du
  • Qiang Fu
  • Shi Han
  • Xuan Song

Recent years have witnessed the great potential of attention mechanisms in graph representation learning. However, while variants of attention-based GNNs are setting new benchmarks for numerous real-world datasets, recent works have pointed out that their induced attentions are less robust and generalizable against noisy graphs due to a lack of direct supervision. In this paper, we present a new framework which utilizes the tool of causality to provide a powerful supervision signal for the learning process of attention functions. Specifically, we estimate the direct causal effect of attention on the final prediction, and then maximize such effect to guide attention toward more meaningful neighbors. Our method can serve as a plug-and-play module for any canonical attention-based GNNs in an end-to-end fashion. Extensive experiments on a wide range of benchmark datasets illustrate that, by directly supervising attention functions, the model is able to converge faster with a clearer decision boundary, and thus yields better performance.

NeurIPS Conference 2023 Conference Paper

Hokoff: Real Game Dataset from Honor of Kings and its Offline Reinforcement Learning Benchmarks

  • Yun Qu
  • Boyuan Wang
  • Jianzhun Shao
  • Yuhang Jiang
  • Chen Chen
  • Zhenbin Ye
  • Liu Linc
  • Yang Feng

The advancement of Offline Reinforcement Learning (RL) and Offline Multi-Agent Reinforcement Learning (MARL) critically depends on the availability of high-quality, pre-collected offline datasets that represent real-world complexities and practical applications. However, existing datasets are often overly simplistic and lack realism. To address this gap, we propose Hokoff, a comprehensive set of pre-collected datasets that covers both offline RL and offline MARL, accompanied by a robust framework, to facilitate further research. This data is derived from Honor of Kings, a recognized Multiplayer Online Battle Arena (MOBA) game known for its intricate nature, closely resembling real-life situations. Utilizing this framework, we benchmark a variety of offline RL and offline MARL algorithms. We also introduce a novel baseline algorithm tailored for the inherent hierarchical action space of the game. We reveal the shortcomings of current offline RL approaches in handling task complexity, generalization, and multi-task learning.

IJCAI Conference 2023 Conference Paper

Multi-objective Optimization-based Selection for Quality-Diversity by Non-surrounded-dominated Sorting

  • Ren-Jian Wang
  • Ke Xue
  • Haopu Shang
  • Chao Qian
  • Haobo Fu
  • Qiang Fu

Quality-Diversity (QD) algorithms, a subset of evolutionary algorithms, maintain an archive (i.e., a set of solutions) and simulate the natural evolution process through iterative selection and reproduction, with the goal of generating a set of high-quality and diverse solutions. Though they have found many successful applications in reinforcement learning, QD algorithms often select the parent solutions uniformly at random, which lacks selection pressure and may limit performance. Recent studies have treated each type of behavior of a solution as an objective and selected the parent solutions based on Multi-objective Optimization (MO), which is a natural idea but has not led to satisfactory performance as expected. This paper gives the reason for the first time, and then proposes a new MO-based selection method by non-surrounded-dominated sorting (NSS), which considers all possible directions of the behaviors, and thus can generate diverse solutions over the whole behavior space. By combining NSS with the most widespread QD algorithm, MAP-Elites, we perform experiments on synthetic functions and several complex tasks (i.e., QDGym, robotic arm, and Mario environment generation), showing that NSS achieves better performance than not only other MO-based selection methods but also state-of-the-art selection methods in QD.

NeurIPS Conference 2023 Conference Paper

Policy Space Diversity for Non-Transitive Games

  • Jian Yao
  • Weiming Liu
  • Haobo Fu
  • Yaodong Yang
  • Stephen McAleer
  • Qiang Fu
  • Wei Yang

Policy-Space Response Oracles (PSRO) is an influential algorithm framework for approximating a Nash Equilibrium (NE) in multi-agent non-transitive games. Many previous studies have tried to promote policy diversity in PSRO. A major weakness of existing diversity metrics is that a more diverse (according to their diversity metrics) population does not necessarily mean (as we prove in the paper) a better approximation to an NE. To alleviate this problem, we propose a new diversity metric, the improvement of which guarantees a better approximation to an NE. Meanwhile, we develop a practical and well-justified method to optimize our diversity metric using only state-action samples. By incorporating our diversity regularization into the best response solving of PSRO, we obtain a new PSRO variant, Policy Space Diversity PSRO (PSD-PSRO). We present the convergence property of PSD-PSRO. Empirically, extensive experiments on single-state games, Leduc, and Goofspiel demonstrate that PSD-PSRO is more effective in producing significantly less exploitable policies than state-of-the-art PSRO variants.

AAAI Conference 2023 Conference Paper

RLogist: Fast Observation Strategy on Whole-Slide Images with Deep Reinforcement Learning

  • Boxuan Zhao
  • Jun Zhang
  • Deheng Ye
  • Jian Cao
  • Xiao Han
  • Qiang Fu
  • Wei Yang

Whole-slide images (WSI) in computational pathology have high resolution with gigapixel size, but generally have sparse regions of interest, which leads to weak diagnostic relevance and data inefficiency for each area in the slide. Most of the existing methods rely on a multiple instance learning framework that requires densely sampling local patches at high magnification. The limitation is evident in the application stage, as the heavy computation for extracting patch-level features is inevitable. In this paper, we develop RLogist, a benchmarking deep reinforcement learning (DRL) method for fast observation strategy on WSIs. Imitating the diagnostic logic of human pathologists, our RL agent learns how to find regions of observation value and obtain representative features across multiple resolution levels, without having to analyze each part of the WSI at high magnification. We benchmark our method on two whole-slide-level classification tasks, including detection of metastases in WSIs of lymph node sections and subtyping of lung cancer. Experimental results demonstrate that RLogist achieves competitive classification performance compared to typical multiple instance learning algorithms, while having a significantly shorter observation path. In addition, the observation path given by RLogist provides good decision-making interpretability, and its reading-path navigation ability can potentially be used by pathologists for educational/assistive purposes. Our code is available at: https://github.com/tencent-ailab/RLogist.

TMLR Journal 2023 Journal Article

RLTF: Reinforcement Learning from Unit Test Feedback

  • Jiate Liu
  • Yiqin Zhu
  • Kaiwen Xiao
  • Qiang Fu
  • Xiao Han
  • Yang Wei
  • Deheng Ye

The goal of program synthesis, or code generation, is to generate executable code based on given descriptions. Recently, there has been an increasing number of studies employing reinforcement learning (RL) to improve the performance of large language models (LLMs) for code. However, some of the current representative RL methods use only offline frameworks, limiting the exploration of new sample spaces. Additionally, the utilization of unit test signals is limited, not accounting for specific error locations within the code. To address these issues, we propose RLTF, i.e., Reinforcement Learning from Unit Test Feedback, a novel online RL framework with unit test feedback of multi-granularity for refining code LLMs. Our approach generates data in real-time during training and simultaneously utilizes fine-grained feedback signals to guide the model towards producing higher-quality code. Extensive experiments show that RLTF achieves state-of-the-art performance on the APPS and MBPP benchmarks. Our code is available at: https://github.com/Zyq-scut/RLTF.
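
A hedged sketch of the multi-granularity feedback described: a coarse reward for passing the unit tests, plus a localized error signal that enables targeted penalties. The reward values and the `run_tests` interface are illustrative assumptions, not the paper's exact scheme.

```python
# Sketch: coarse pass/fail reward plus a fine-grained error location.
def unit_test_reward(run_tests, code: str):
    """`run_tests` is assumed to return (passed, error_line_or_None)."""
    passed, error_line = run_tests(code)
    if passed:
        return 1.0, None
    # Fine-grained signal: surface the failing location so the penalty
    # can be focused on the offending span rather than the whole program.
    return -0.3, error_line
```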

AAMAS Conference 2023 Conference Paper

Sequential Cooperative Multi-Agent Reinforcement Learning

  • Yifan Zang
  • Jinmin He
  • Kai Li
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing

Cooperative multi-agent reinforcement learning (MARL) aims to coordinate the actions of multiple agents via a shared team reward. The complex interactions among agents make this problem extremely difficult. Mainstream MARL methods often implicitly learn an inexplicable value decomposition from the shared reward into individual utilities, failing to give insights into how well each agent acts and lacking direct policy optimization guidance. This paper presents a sequential MARL framework that factorizes and simplifies the complex interaction analysis into a sequential evaluation process for more effective and efficient learning. We explicitly formulate this factorization via a novel sequential advantage function to evaluate each agent's actions, which achieves an explicable credit assignment and substantially facilitates policy optimization. We realize the sequential credit assignment (SeCA) by dynamically adjusting the sequence in light of agents' contributions to the team. Extensive experimental validations on a challenging set of StarCraft II micromanagement tasks verify SeCA's effectiveness.

IROS Conference 2022 Conference Paper

A Novel Robot with Rolling and Climbing Modes for Power Transmission Line Inspection

  • Qiang Fu
  • Yisheng Guan
  • Haifei Zhu

Power transmission line inspection is demanding high-altitude work that robots are increasingly expected to perform in place of human workers. A variety of robots have been developed to this end, with basic locomotion and inspection implemented on the lines. However, most current line inspection robots (LIRs) are merely mobile platforms with complex structures and large weights, lacking sufficiently dexterous locomotion on lines, especially for obstacle overcoming and line transition. Moreover, with sensors fixed on the platform, the inspection range is largely limited. For higher mobility and a larger inspection range, this paper proposes Climbot-L, a novel biped robot that can roll and climb on a power transmission line for inspection. While the rolling mode has the advantage of high locomotion efficiency, the biped climbing mode makes it possible to easily overcome obstacles on the line, transition to adjacent cables, and perform multi-view detection. In this paper, the design of this novel robot is first introduced, the working principle of the wheel-gripper modules is then analyzed, and obstacle-overcoming gaits are described. The effectiveness and high maneuverability of the presented robot are verified by a series of experiments.

AAAI Conference 2022 Conference Paper

AutoCFR: Learning to Design Counterfactual Regret Minimization Algorithms

  • Hang Xu
  • Kai Li
  • Haobo Fu
  • Qiang Fu
  • Junliang Xing

Counterfactual regret minimization (CFR) is the most commonly used algorithm for approximately solving two-player zero-sum imperfect-information games (IIGs). In recent years, a series of novel CFR variants such as CFR+, Linear CFR, and DCFR have been proposed and have significantly improved the convergence rate of vanilla CFR. However, most of these new variants are hand-designed by researchers through trial and error based on different motivations, which generally requires a tremendous amount of effort and insight. This work proposes to meta-learn novel CFR algorithms through evolution to ease the burden of manual algorithm design. We first design a search language that is rich enough to represent many existing hand-designed CFR variants. We then exploit a scalable regularized evolution algorithm with a bag of acceleration techniques to efficiently search over the combinatorial space of algorithms defined by this language. The learned novel CFR algorithm can generalize to new IIGs not seen during training and performs on par with or better than existing state-of-the-art CFR variants. The code is available at https://github.com/rpSebastian/AutoCFR.

ICML Conference 2022 Conference Paper

BabelTower: Learning to Auto-parallelized Program Translation

  • Yuanbo Wen 0001
  • Qi Guo 0001
  • Qiang Fu
  • Xiaqing Li
  • Jianxing Xu
  • Yanlin Tang
  • Yongwei Zhao 0001
  • Xing Hu 0001

GPUs have become the dominant computing platforms for many applications, while programming GPUs with the widely-used CUDA parallel programming model is difficult. As sequential C code is relatively easy to obtain either from legacy repositories or by manual implementation, automatically translating C to its parallel CUDA counterpart is promising to relieve the burden of GPU programming. However, because of huge differences between the sequential C and the parallel CUDA programming models, existing approaches fail to conduct this challenging auto-parallelized program translation. In this paper, we propose a learning-based framework, i.e., BabelTower, to address this problem. We first create a large-scale dataset consisting of compute-intensive function-level monolingual corpora. We further propose using back-translation with a discriminative reranker to cope with unpaired corpora and parallel semantic conversion. Experimental results show that BabelTower outperforms the state of the art by 1.79, 6.09, and 9.39 in terms of BLEU, CodeBLEU, and the specifically designed ParaBLEU, respectively. The CUDA code generated by BabelTower attains a speedup of up to 347x over the sequential C code, and developer productivity is improved by at most 3.8x.

NeurIPS Conference 2022 Conference Paper

Honor of Kings Arena: an Environment for Generalization in Competitive Reinforcement Learning

  • Hua Wei
  • Jingxiao Chen
  • Xiyang Ji
  • Hongyang Qin
  • Minwen Deng
  • Siqin Li
  • Liang Wang
  • Weinan Zhang

This paper introduces Honor of Kings Arena, a reinforcement learning (RL) environment based on Honor of Kings, one of the world's most popular games at present. Compared to other environments studied in most previous work, ours presents new generalization challenges for competitive reinforcement learning. It is a multi-agent problem with one agent competing against its opponent, and it requires generalization ability as it has diverse targets to control and diverse opponents to compete with. We describe the observation, action, and reward specifications for the Honor of Kings domain and provide an open-source Python-based interface for communicating with the game engine. We provide twenty target heroes with a variety of tasks in Honor of Kings Arena and present initial baseline results for RL-based methods with feasible computing resources. Finally, we showcase the generalization challenges imposed by Honor of Kings Arena and possible remedies to the challenges. All of the software, including the environment class, is publicly available.
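
The released interface's actual API may differ, so the gym-style interaction loop below is only a hypothetical illustration of the control problem described (one agent competing against an opponent over an episode):

```python
# Hypothetical gym-style loop; `env` and `agent` interfaces are assumptions.
def run_episode(env, agent):
    obs = env.reset()
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(obs)                    # pick an action for the hero
        obs, reward, done, info = env.step(action)  # advance the game engine
        total_reward += reward
    return total_reward
```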

IJCAI Conference 2022 Conference Paper

JueWu-MC: Playing Minecraft with Sample-efficient Hierarchical Reinforcement Learning

  • Zichuan Lin
  • Junyou Li
  • Jianing Shi
  • Deheng Ye
  • Qiang Fu
  • Wei Yang

Learning rational behaviors in open-world games like Minecraft remains challenging for Reinforcement Learning (RL) research due to the compound challenge of partial observability, high-dimensional visual perception, and delayed reward. To address this, we propose JueWu-MC, a sample-efficient hierarchical RL approach equipped with representation learning and imitation learning to deal with perception and exploration. Specifically, our approach includes two levels of hierarchy, where the high-level controller learns a policy over options and the low-level workers learn to solve each sub-task. To boost the learning of sub-tasks, we propose a combination of techniques including 1) action-aware representation learning which captures underlying relations between action and representation, 2) discriminator-based self-imitation learning for efficient exploration, and 3) ensemble behavior cloning with consistency filtering for policy robustness. Extensive experiments show that JueWu-MC significantly improves sample efficiency and outperforms a set of baselines by a large margin. Notably, we won the championship of the NeurIPS MineRL 2021 research competition and achieved the highest performance score ever.

NeurIPS Conference 2022 Conference Paper

Neuron with Steady Response Leads to Better Generalization

  • Qiang Fu
  • Lun Du
  • Haitao Mao
  • Xu Chen
  • Wei Fang
  • Shi Han
  • Dongmei Zhang

Regularization can mitigate the generalization gap between training and inference by introducing inductive bias. Existing works have already proposed various inductive biases from diverse perspectives. However, none of them explores inductive bias from the perspective of class-dependent response distribution of individual neurons. In this paper, we conduct a substantial analysis of the characteristics of such distribution. Based on the analysis results, we articulate the Neuron Steadiness Hypothesis: the neuron with similar responses to instances of the same class leads to better generalization. Accordingly, we propose a new regularization method called Neuron Steadiness Regularization (NSR) to reduce neuron intra-class response variance. Based on the Complexity Measure, we theoretically guarantee the effectiveness of NSR for improving generalization. We conduct extensive experiments on Multilayer Perceptron, Convolutional Neural Networks, and Graph Neural Networks with popular benchmark datasets of diverse domains, which show that our Neuron Steadiness Regularization consistently outperforms the vanilla version of models with significant gain and low additional computational overhead.
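
A minimal sketch of the regularizer the hypothesis motivates: for a chosen layer, penalize each neuron's response variance computed within each class of a batch. The layer choice and loss weighting are assumptions for illustration.

```python
# Sketch: intra-class response-variance penalty for one layer's activations.
import torch

def neuron_steadiness_penalty(activations: torch.Tensor,
                              labels: torch.Tensor) -> torch.Tensor:
    """activations: (batch, n_neurons); labels: (batch,). Returns a scalar."""
    penalty = activations.new_zeros(())
    for c in labels.unique():
        group = activations[labels == c]          # responses within one class
        if group.shape[0] > 1:
            penalty = penalty + group.var(dim=0, unbiased=False).mean()
    return penalty

# Used as: loss = task_loss + lambda_nsr * neuron_steadiness_penalty(h, y)
```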

IJCAI Conference 2021 Conference Paper

Boosting Offline Reinforcement Learning with Residual Generative Modeling

  • Hua Wei
  • Deheng Ye
  • Zhao Liu
  • Hao Wu
  • Bo Yuan
  • Qiang Fu
  • Wei Yang
  • Zhenhui Li

Offline reinforcement learning (RL) tries to learn a near-optimal policy from recorded offline experience without online exploration. Current offline RL research includes: 1) generative modeling, i.e., approximating a policy using fixed data; and 2) learning the state-action value function. While most research focuses on the state-action function part through reducing the bootstrapping error in value function approximation induced by the distribution shift of training data, the effects of error propagation in generative modeling have been neglected. In this paper, we analyze the error in generative modeling. We propose AQL (action-conditioned Q-learning), a residual generative model to reduce policy approximation error for offline RL. We show that our method can learn more accurate policy approximations on different benchmark datasets. In addition, we show that the proposed offline RL method can learn more competitive AI agents in complex control tasks in the multiplayer online battle arena (MOBA) game Honor of Kings.

IJCAI Conference 2021 Conference Paper

Combining Tree Search and Action Prediction for State-of-the-Art Performance in DouDiZhu

  • Yunsheng Zhang
  • Dong Yan
  • Bei Shi
  • Haobo Fu
  • Qiang Fu
  • Hang Su
  • Jun Zhu
  • Ning Chen

AlphaZero has achieved superhuman performance on various perfect-information games, such as chess, shogi, and Go. However, directly applying AlphaZero to imperfect-information games (IIG) is infeasible, due to the fact that traditional MCTS methods cannot handle missing information of other players. Meanwhile, there have been several extensions of MCTS for IIGs, by implicitly or explicitly sampling a state of other players. But, due to the inability to handle private and public information well, the performance of these methods is not satisfactory. In this paper, we extend AlphaZero to multiplayer IIGs by developing a new MCTS method, Action-Prediction MCTS (AP-MCTS). In contrast to traditional MCTS extensions for IIGs, AP-MCTS first builds the search tree based on public information, adopts the policy-value network to generalize between hidden states, and finally predicts other players' actions directly. This design bypasses the inefficiency of sampling and the difficulty of predicting the state of other players. We conduct extensive experiments on the popular 3-player poker game DouDiZhu to evaluate the performance of AP-MCTS within the AlphaZero framework. When playing against experienced human players, AP-MCTS achieved a 65.65% winning rate, almost twice the human players' winning rate. Compared with state-of-the-art DouDiZhu AIs, the Elo rating of AP-MCTS is 50 to 200 points higher. The ablation study shows that accurate action prediction is the key to AP-MCTS winning.

NeurIPS Conference 2021 Conference Paper

Learning Diverse Policies in MOBA Games via Macro-Goals

  • Yiming Gao
  • Bei Shi
  • Xueying Du
  • Liang Wang
  • Guangwei Chen
  • Zhenjie Lian
  • Fuhao Qiu
  • Guoan Han

Recently, many researchers have made successful progress in building the AI systems for MOBA-game-playing with deep reinforcement learning, such as on Dota 2 and Honor of Kings. Even though these AI systems have achieved or even exceeded human-level performance, they still suffer from the lack of policy diversity. In this paper, we propose a novel Macro-Goals Guided framework, called MGG, to learn diverse policies in MOBA games. MGG abstracts strategies as macro-goals from human demonstrations and trains a Meta-Controller to predict these macro-goals. To enhance policy diversity, MGG samples macro-goals from the Meta-Controller prediction and guides the training process towards these goals. Experimental results on the typical MOBA game Honor of Kings demonstrate that MGG can execute diverse policies in different matches and lineups, and also outperform the state-of-the-art methods over 102 heroes.

IJCAI Conference 2021 Conference Paper

MapGo: Model-Assisted Policy Optimization for Goal-Oriented Tasks

  • Menghui Zhu
  • Minghuan Liu
  • Jian Shen
  • Zhicheng Zhang
  • Sheng Chen
  • Weinan Zhang
  • Deheng Ye
  • Yong Yu

In goal-oriented reinforcement learning, relabeling the raw goals in past experience to provide agents with hindsight ability is a major solution to the reward sparsity problem. In this paper, to enhance the diversity of relabeled goals, we develop FGI (Foresight Goal Inference), a new relabeling strategy that relabels the goals by looking into the future with a learned dynamics model. Besides, to improve sample efficiency, we propose to use the dynamics model to generate simulated trajectories for policy training. By integrating these two improvements, we introduce the MapGo framework (Model-Assisted Policy optimization for Goal-oriented tasks). In our experiments, we first show the effectiveness of the FGI strategy compared with the hindsight one, and then show that the MapGo framework achieves higher sample efficiency when compared to model-free baselines on a set of complicated tasks.
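
A hedged sketch of FGI relabeling: instead of picking a goal achieved later in the same trajectory (hindsight), roll the learned dynamics model forward and relabel with the goal the imagined future state achieves. All names and the horizon are illustrative assumptions.

```python
# Sketch: foresight relabeling via an imagined model rollout.
def foresight_relabel(transition, dynamics_model, policy, achieved_goal_fn,
                      horizon: int = 5):
    """`transition` is assumed to carry .next_state and .goal attributes."""
    state = transition.next_state
    for _ in range(horizon):                      # imagined rollout
        action = policy(state, transition.goal)
        state = dynamics_model(state, action)
    transition.goal = achieved_goal_fn(state)     # relabel with foreseen goal
    return transition
```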

AAAI Conference 2020 Conference Paper

Mastering Complex Control in MOBA Games with Deep Reinforcement Learning

  • Deheng Ye
  • Zhao Liu
  • Mingfei Sun
  • Bei Shi
  • Peilin Zhao
  • Hao Wu
  • Hongsheng Yu
  • Shaojie Yang

We study the reinforcement learning problem of complex action control in Multi-player Online Battle Arena (MOBA) 1v1 games. This problem involves far more complicated state and action spaces than those of traditional 1v1 games, such as Go and Atari series, which makes it very difficult to find any policy with human-level performance. In this paper, we present a deep reinforcement learning framework to tackle this problem from the perspectives of both system and algorithm. Our system is of low coupling and high scalability, which enables efficient explorations at large scale. Our algorithm includes several novel strategies, including control dependency decoupling, action mask, target attention, and dual-clip PPO, with which our proposed actor-critic network can be effectively trained in our system. Tested on the MOBA game Honor of Kings, the trained AI agents can defeat top professional human players in full 1v1 games.
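
Of the listed strategies, dual-clip PPO is the most self-contained: for negative advantages, a second clip at c * A floors the objective so very large importance ratios cannot blow it up. A minimal sketch of the per-sample objective (to be maximized), with illustrative hyperparameter values, follows.

```python
# Sketch of the dual-clip PPO objective.
import torch

def dual_clip_ppo_objective(ratio: torch.Tensor, advantage: torch.Tensor,
                            eps: float = 0.2, c: float = 3.0) -> torch.Tensor:
    """ratio = pi_new(a|s) / pi_old(a|s); advantage = A(s, a)."""
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    standard = torch.min(ratio * advantage, clipped * advantage)  # vanilla PPO
    dual = torch.max(standard, c * advantage)      # extra floor when A < 0
    return torch.where(advantage < 0, dual, standard).mean()
```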

NeurIPS Conference 2020 Conference Paper

Towards Playing Full MOBA Games with Deep Reinforcement Learning

  • Deheng Ye
  • Guibin Chen
  • Wen Zhang
  • Sheng Chen
  • Bo Yuan
  • Bo Liu
  • Jia Chen
  • Zhao Liu

MOBA games, e.g., Honor of Kings, League of Legends, and Dota 2, pose grand challenges to AI systems, such as multi-agent coordination, enormous state-action spaces, and complex action control. Developing AI for playing MOBA games has raised much attention accordingly. However, existing work falls short in handling the raw game complexity caused by the explosion of agent combinations, i.e., lineups, when expanding the hero pool; OpenAI's Dota AI, for instance, limits play to a pool of only 17 heroes. As a result, full MOBA games without restrictions are far from being mastered by any existing AI system. In this paper, we propose a MOBA AI learning paradigm that methodologically enables playing full MOBA games with deep reinforcement learning. Specifically, we develop a combination of novel and existing learning techniques, including off-policy adaption, multi-head value estimation, curriculum self-play learning, policy distillation, and Monte Carlo tree search, to train and play a large pool of heroes while skillfully addressing the scalability issue. Tested on Honor of Kings, a popular MOBA game, we show how to build superhuman AI agents that can defeat top esports players. The superiority of our AI is demonstrated by the first large-scale performance test of a MOBA AI agent in the literature.

IJCAI Conference 2018 Conference Paper

Complementary Binary Quantization for Joint Multiple Indexing

  • Qiang Fu
  • Xu Han
  • Xianglong Liu
  • Jingkuan Song
  • Cheng Deng

Building multiple hash tables has been proven a successful technique for indexing massive databases, which can guarantee a desired level of overall performance. However, existing hash-based multi-indexing methods suffer from heavy redundancy, without strong table complementarity and effective hash code learning. To address these problems, this paper proposes a complementary binary quantization (CBQ) method to jointly learn multiple hash tables. It exploits the power of incomplete binary coding based on prototypes to align the original space and the Hamming space, and further utilizes the nature of multi-indexing search to jointly reduce the quantization loss based on the prototype-based hash function. Our alternating optimization adaptively discovers the complementary prototype sets and the corresponding code sets of a varying size in an efficient way, which together robustly approximate the data relations. Our method can be naturally generalized to the product space for long hash codes. Extensive experiments carried out on two popular large-scale tasks, including Euclidean and semantic nearest neighbor search, demonstrate that the proposed CBQ method enjoys strong table complementarity and significantly outperforms the state of the art, with relative performance gains of up to 57.76%.

JBHI Journal 2017 Journal Article

Validation of an Adaptive Transfer Function Method to Estimate the Aortic Pressure Waveform

  • Yang Yao
  • Lisheng Xu
  • Yingxian Sun
  • Qiang Fu
  • Shuran Zhou
  • Dianning He
  • Yahui Zhang
  • Liang Guo

Aortic pulse wave reflects cardiovascular status but, unlike the peripheral pulse wave, is difficult to measure reliably using noninvasive techniques. Thus, estimating the aortic pulse wave from peripheral ones is of great significance. This study proposed an adaptive transfer function (ATF) method to estimate the aortic pulse wave from the brachial pulse wave. Aortic and brachial pulse waves were derived from 26 patients who underwent cardiac catheterization. Generalized transfer functions (GTF) were derived based on the autoregressive exogenous model. The GTF was then adapted by its peak resonance frequency, and the optimal peak resonance frequency for an individual was determined by regression formulas using brachial systolic blood pressure. The method was validated using leave-one-out cross validation. Compared with previous studies, the ATF method showed better performance in estimating the aortic pulse wave and predicting the feature parameters. The prediction errors of the aortic systolic blood pressure and pulse pressure were 0.2 ± 3.1 and -0.9 ± 3.1 mmHg, respectively. The percentage errors of augmentation index, percentage notch amplitude, and ejection duration were -2.1 ± 32.7%, 12.4 ± 9.2%, and -2.4 ± 3.3%, respectively.