Author name cluster

Yang Gao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

108 papers

2 author rows

AAAI Conference 2026 Conference Paper

Causality-Aware Efficient Exploration for Cooperative Multi-Agent Reinforcement Learning

Hongye Cao
Tianpei Yang
Fan Feng
Hammadi Rafik Ouariachi
Yali Du
Meng Fang
Jing Huo
Yang Gao

Exploration is critical for cooperative multi-agent reinforcement learning (MARL) to improve sample efficiency. However, existing intrinsic-motivation-based exploration strategies in MARL overlook the causal relationships among agents, global states, and rewards, suffering from interference by irrelevant factors and resulting in sample inefficiency. To address this issue, we propose Causality-aware Efficient Exploration (CEE), a novel framework that enhances sample efficiency by inferring causal relationships between agents, global states with respect to rewards, thereby enabling causality-guided exploration. Specifically, CEE operates through two components. First, CEE identifies causal relationships between global states and rewards, filtering out causally irrelevant state features that do not have a high impact on rewards to keep decision-critical state information. Second, CEE discovers causal relationships between agents' behaviors and rewards to quantify each agent's contribution to collective performance. To achieve this, we introduce a causal entropy objective that promotes exploration aligned with decision-critical aspects of the underlying causal structure. We provide comprehensive validation through experiments on 21 challenging tasks spanning SMAC, SMAC v2, and Google Research Football (GRF) environments. Our results demonstrate that CEE achieves superior performance in terms of sample efficiency and asymptotic performance compared to existing MARL methods.

EAAI Journal 2026 Journal Article

Cooperative decision-making of unmanned aerial vehicles: A multi-agent reinforcement learning approach

Ziyi Wang
Guoliang Ma
Jian Guo
Chen Qian
Yang Gao
Zhuo Huang

Cooperative decision-making of unmanned aerial vehicles (UAVs) for military missions is a crucial research topic. However, the ability constraints of heterogeneous UAVs in real-world scenarios bring significant challenges to the cooperative decision-making process. To address these issues, this paper proposes a multi-agent proximal policy optimization (MAPPO) algorithm with a flexible observation feature encoding (FOFE) mechanism and a Mamba-based memory structure. Firstly, the cooperative decision-making problem for reconnaissance-strike integrated fixed-wing UAV (RSUAV) swarms is formulated as a distributed partially observable Markov decision process (Dec-POMDP). Secondly, to address the variability and incompleteness in observation inputs, an FOFE strategy is introduced. This allows the network to process multi-channel and variable-length data effectively. Furthermore, the Mamba model is incorporated to capture temporal dependencies in historical observations. This enhances decision-making in prolonged missions. Under this multi-agent reinforcement learning (MARL) framework, each RSUAV can make autonomous decisions in a decentralized manner. The simulation results show that the proposed algorithm improves the completion ratio (> 10.1%), survival ratio (> 14.8%), and reduces completion time (> 21.5%) compared to baselines. It also exhibits strong generalization capability and holds practical feasibility for deployment on edge computing devices. Therefore, this approach enables effective cooperative decision-making under the ability constraints of heterogeneous RSUAVs.

AAAI Conference 2026 Conference Paper

Faster Game Solving via Asymmetry of Step Sizes

Linjian Meng
Tianpei Yang
Youzhi Zhang
Zhenxing Ge
Yang Gao

Counterfactual Regret Minimization (CFR) algorithms are widely used to compute a Nash equilibrium (NE) in two-player zero-sum imperfect-information extensive-form games (IIGs). Among them, Predictive CFR+ (PCFR+) is particularly powerful, achieving an exceptionally fast empirical convergence rate via the prediction in many games. However, the empirical convergence rate of PCFR+ would significantly degrade if the prediction is inaccurate, leading to unstable performance on certain IIGs. To enhance the robustness of PCFR+, we propose Asymmetric PCFR+ (APCFR+), which employs an adaptive asymmetry of step sizes between the updates of implicit and explicit accumulated counterfactual regrets to mitigate the impact of the prediction inaccuracy on convergence. We present a theoretical analysis demonstrating why APCFR+ can enhance the robustness. To the best of our knowledge, we are the first to propose the asymmetry of step sizes, a simple yet novel technique that effectively improves the robustness of PCFR+. Then, to reduce the difficulty of implementing APCFR+ caused by the adaptive asymmetry, we propose a simplified version of APCFR+ called Simple APCFR+ (SAPCFR+), which uses a fixed asymmetry of step sizes to enable only a single-line modification compared to original PCFR+. Experimental results on five standard IIG benchmarks and two heads-up no-limit Texas Hold’em (HUNL) subgames show that (i) both APCFR+ and SAPCFR+ outperform PCFR+ in most of the tested games, (ii) SAPCFR+ achieves a comparable empirical convergence rate with APCFR+, and (iii) our approach can be generalized to improve other CFR algorithms, e.g., Discount CFR (DCFR).

AAAI Conference 2026 Conference Paper

Identifying and Analyzing Performance-Critical Tokens in Large Language Models

Yu Bai
Heyan Huang
Cesare Spinoso-Di Piano
Sanxing Chen
Marc-Antoine Rondeau
Yang Gao
Jackie Chi Kit Cheung

In-context learning (ICL) has emerged as an effective solution for few-shot learning with large language models (LLMs). However, how LLMs leverage demonstrations to specify a task and learn a corresponding computational function through ICL is underexplored. Drawing from the way humans learn from content-label mappings in demonstrations, we categorize the tokens in an ICL prompt into content, stopword, and template tokens. Our goal is to identify the types of tokens whose representations directly influence LLM's performance, a property we refer to as being performance-critical. By ablating representations from the attention of the test example, we find that the representations of informative content tokens have less influence on performance compared to template and stopword tokens, which contrasts with the human attention to informative words. We give evidence that the representations of performance-critical tokens aggregate information from the content tokens. Moreover, we demonstrate experimentally that lexical meaning, repetition, and structural cues are the main distinguishing characteristics of these tokens. Our work sheds light on how LLMs learn to perform tasks from demonstrations and deepens our understanding of the roles different types of tokens play in LLMs.

AAAI Conference 2026 Conference Paper

ManiLong-Shot: Interaction-Aware One-Shot Imitation Learning for Long-Horizon Manipulation

Zixuan Chen
Chongkai Gao
Lin Shao
Jieqi Shi
Jing Huo
Yang Gao

One-shot imitation learning (OSIL) offers a promising way to teach robots new skills without large-scale data collection. However, current OSIL methods are primarily limited to short-horizon tasks, thus limiting their applicability to complex, long-horizon manipulations. To address this limitation, we propose ManiLong-Shot, a novel framework that enables effective OSIL for long-horizon prehensile manipulation tasks. ManiLong-Shot structures long-horizon tasks around physical interaction events, reframing the problem as sequencing interaction-aware primitives instead of directly imitating continuous trajectories. This primitive decomposition can be driven by high-level reasoning from a vision-language model (VLM) or by rule-based heuristics derived from robot state changes. For each primitive, ManiLong-Shot predicts invariant regions critical to the interaction, establishes correspondences between the demonstration and the current observation, and computes the target end-effector pose, enabling effective task execution. Extensive simulation experiments show that ManiLong-Shot, trained on only 10 short-horizon tasks, generalizes to 20 unseen long-horizon tasks across three difficulty levels via one-shot imitation, achieving a 22.8% relative improvement over the SOTA. Additionally, real-robot experiments validate ManiLong-Shot’s ability to robustly execute three long-horizon manipulation tasks via OSIL, confirming its practical applicability.

AAAI Conference 2026 Conference Paper

Simulated Rewards, Skewed Strategies: Tracing the Acquired Preference Bias in LLM-Based Dialogue Planners

Heyan Huang
Yizhe Yang
Huashan Sun
Jiawei Li
Yang Gao

Large language models have enabled sophisticated dialogue planning policy, but their reliance on LLM-generated simulation and feedback for policy optimization may introduce systematic preference bias. We present the first comprehensive analysis of preference bias in LLM-based dialogue planners, evaluating four state-of-the-art planning policies across three dialogue domains using multiple LLM families at varying scales. Our investigation reveals that all tested planners exhibit significant preference bias, systematically favoring narrow strategy sets rather than maintaining balanced distributions. User simulation emerges as the primary bias driver, while diverse persona simulation fails as an effective mitigation strategy. Most concerning, preference bias drives planners toward ethically problematic strategies that achieve short-term success while undermining real-world effectiveness and ethical standards. Our findings establish fundamental challenges for responsible deployment of LLM-based dialogue systems and provide crucial insights for developing more reliable and ethically-aligned planning approaches.

YNIMG Journal 2026 Journal Article

VSSI-TBM: A variational sparse source imaging method based on time basis matrix

Tianyu Gao
Jin Ding
Wen Li
Fulong Wang
Yujie Ma
Ruonan Wang
Yang Gao
Xiaolin Ning

Source imaging algorithms have been widely used to localize functional and lesion areas. Brain source reconstruction is limited by complex experimental environments (noise interference, distributed brain activity, acquisition systems, etc.), and range estimation is not accurate. This study proposes a variational sparse source imaging method based on the time basis matrix (VSSI-TBM) algorithm. VSSI-TBM permits the source spatial signal to consist of several temporal basis functions by using low-rank decomposition to extract effective signals. In a compressed space, mixed-norm constraints and a cortical source variation operator ensure spatial sparsity and smoothness. In clinical examinations or research, other a priori information regarding brain activity may be available. VSSI-TBM using lead field guide constraints can further enhance the reconstruction results. The simulation results demonstrate the robust performance of VSSI-TBM in environments with a low signal-to-noise ratio (SNR), large sources (> 11 cm²), and multiple sources. Additionally, integrating prior information enhances the imaging performance in complex environments. The algorithm is evaluated using an open-source dataset and an optically pumped magnetometer-based magnetoencephalography (OPM-MEG) system with a noisy 30-channel uniform layout. The results reveal a strong robustness of the spatial range reconstruction. Moreover, the combination of prior information effectively improves the imaging performance of the OPM-MEG system.

NeurIPS Conference 2025 Conference Paper

Association-Focused Path Aggregation for Graph Fraud Detection

Tian Qiu
Wenda Li
Zunlei Feng
Jie Lei
Tao Wang
Yi Gao
Mingli Song
Yang Gao

Fraudulent activities have caused substantial negative social impacts and are exhibiting emerging characteristics such as intelligence and industrialization, posing challenges of high-order interactions, intricate dependencies, and the sparse yet concealed nature of fraudulent entities. Existing graph fraud detectors are limited by their narrow "receptive fields", as they focus only on the relations between an entity and its neighbors while neglecting longer-range structural associations hidden between entities. To address this issue, we propose a novel fraud detector based on Graph Path Aggregation (GPA). It operates through variable-length path sampling, semantic-associated path encoding, path interaction and aggregation, and aggregation-enhanced fraud detection. To further facilitate interpretable association analysis, we synthesize G-Internet, the first benchmark dataset in the field of internet fraud detection. Extensive experiments across datasets in multiple fraud scenarios demonstrate that the proposed GPA outperforms mainstream fraud detectors by up to +15% in Average Precision (AP). Additionally, GPA exhibits enhanced robustness to noisy labels and provides excellent interpretability by uncovering implicit fraudulent patterns across broader contexts. Code is available at https://github.com/horrible-dong/GPA.