Arrow Research search

Author name cluster

Yang Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

160 papers
2 author rows

Possible papers (160)

AAAI Conference 2026 Conference Paper

An LLM-based Quantitative Framework for Evaluating High-Stealthy Backdoor Risks in OSS Supply Chains

  • Zihe Yan
  • Kai Luo
  • Haoyu Yang
  • Yang Yu
  • Zhuosheng Zhang
  • Guancheng Li

In modern software development workflows, the open-source software supply chain significantly contributes to efficient and convenient engineering practices. With increasing system complexity, it has become a common practice to use open-source software as third-party dependencies. However, due to the lack of maintenance for underlying dependencies and insufficient community auditing, ensuring the security of source code and the legitimacy of repository maintainers has become a challenge, particularly in the context of high-stealth backdoor attacks such as the XZ-Util incident. To address these problems, we propose a fine-grained project evaluation framework for backdoor risk assessment in open-source software. Our evaluation framework models highly stealthy backdoor attacks from the attacker’s perspective and defines targeted metrics for each attack stage. Moreover, to overcome the limitations of static analysis in assessing the reliability of repository maintenance activities, such as irregular committer privilege escalation and insufficient review participation, we employ large language models (LLMs) to perform semantic evaluation of code repositories while avoiding reliance on manually crafted patterns. The effectiveness of our framework is validated on 66 high-priority packages in the Debian ecosystem, and the experimental results reveal that the current open-source software supply chain is exposed to a series of security risks.

AAAI Conference 2026 Conference Paper

Exploring Reliable Spatiotemporal Dependencies for Efficient Visual Tracking

  • Junze Shi
  • Yang Yu
  • Jian Shi
  • Haibo Luo

Recent advances in transformer-based lightweight object tracking have established new standards across benchmarks, leveraging the global receptive field and powerful feature extraction capabilities of attention mechanisms. Despite these achievements, existing methods universally employ sparse sampling during training—utilizing only one template and one search image per sequence—which fails to comprehensively explore spatiotemporal information in videos. This limitation constrains performance and widens the gap between lightweight and high-performance trackers. To bridge this divide while maintaining real-time efficiency, we propose STDTrack, a framework that pioneers the integration of reliable spatiotemporal dependencies into lightweight trackers. Our approach implements dense video sampling to maximize spatiotemporal information utilization. We introduce a temporally propagating spatiotemporal token to guide per-frame feature extraction. To ensure comprehensive target state representation, we design the Multi-frame Information Fusion Module (MFIFM), which augments current dependencies using historical context. The MFIFM operates on features stored in our constructed Spatiotemporal Token Maintainer (STM), where a quality-based update mechanism ensures information reliability. Considering the scale variation among tracking targets, we develop a multi-scale prediction head to dynamically adapt to objects of different sizes. Extensive experiments demonstrate state-of-the-art results across six benchmarks. Notably, on GOT-10k, STDTrack rivals certain high-performance non-real-time trackers (e.g., MixFormer) while operating at 192 FPS (GPU) and 41 FPS (CPU).

AAAI Conference 2026 Conference Paper

HEV Generative Sandbox: A Framework for Assessing Domain-Specific Social Risks Through Human-LLM Simulation

  • Yiran Liu
  • Zhiyi Hou
  • Xiaoang Xu
  • Shuo Wang
  • Huijia Wu
  • Kaicheng Yu
  • Yang Yu
  • ChengXiang Zhai

Deploying Large Language Models (LLMs) in specialized domains introduces significant societal and compliance risks, including bias amplification, misinformation propagation, and privacy violations. These risks predominantly emerge from the dynamic interactions between LLMs and humans in specific contexts. Different domains face unique distributions of hazards, and varying interaction modalities introduce distinct levels of exposure and vulnerability. However, current risk assessment frameworks lack a systematic methodology to capture this dynamic interplay. In this work, we introduce the HEV Generative Sandbox, a novel risk evaluation framework that simulates human-LLM behavior to quantify domain-contextual risks across three interdependent dimensions: 1) Hazard (H): Domain-specific threats inherent to a given context; 2) Exposure (E): The extent to which the LLM and its users are subjected to hazardous scenarios; 3) Vulnerability (V): The susceptibility of the system to risk due to human interaction or model weaknesses. Our approach pioneers "domain-rooted scenario generation", wherein we sample contextual distributions from domain-specific corpora and simulate diverse inputs. By unifying dynamic scenario simulation, causal risk decomposition, and closed-loop evaluation, the HEV Generative Sandbox provides a scalable, domain-sensitive methodology for responsible LLM deployment. This work contributes to advancing the safe deployment of LLMs by providing a comprehensive and automated risk evaluation framework.

AAAI Conference 2026 Conference Paper

Multi-agent In-context Coordination via Decentralized Memory Retrieval

  • Tao Jiang
  • Zichuan Lin
  • Lihe Li
  • Yi-Chen Li
  • Cong Guan
  • Lei Yuan
  • Zongzhang Zhang
  • Yang Yu

Large transformer models, trained on diverse datasets, have demonstrated impressive few-shot performance on previously unseen tasks without requiring parameter updates. This capability has also been explored in Reinforcement Learning (RL), where agents interact with the environment to retrieve context and maximize cumulative rewards, showcasing strong adaptability in complex settings. However, in cooperative Multi-Agent Reinforcement Learning (MARL), where agents must coordinate toward a shared goal, decentralized policy deployment can lead to mismatches in task alignment and reward assignment, limiting the efficiency of policy adaptation. To address this challenge, we introduce Multi-agent In-context Coordination via Decentralized Memory Retrieval (MAICC), a novel approach designed to enhance coordination by fast adaptation. Our method involves training a centralized embedding model to capture fine-grained trajectory representations, followed by decentralized models that approximate the centralized one to obtain team-level task information. Based on the learned embeddings, relevant trajectories are retrieved as context, which, combined with the agents' current sub-trajectories, inform decision-making. During decentralized execution, we introduce a novel memory mechanism that effectively balances test-time online data with offline memory. Based on the constructed memory, we propose a hybrid utility score that incorporates both individual- and team-level returns, ensuring credit assignment across agents. Extensive experiments on cooperative MARL benchmarks, including Level-Based Foraging (LBF) and SMAC (v1/v2), show that MAICC enables faster adaptation to unseen tasks compared to existing methods.

JBHI Journal 2026 Journal Article

Point-Supervised Coronary Semantic Segmentation in X-Ray Angiographic Images

  • Ying Chen
  • Danni Ai
  • Jianyu Du
  • Yuanyuan Wang
  • Tianyu Fu
  • Deqiang Xiao
  • Yucong Lin
  • Long Shao

Coronary semantic segmentation in X-ray angiography is essential for computer-aided diagnosis and treatment planning of coronary artery disease (CAD). Despite its importance, this task remains highly challenging due to the complex and interconnected vascular topology, as well as the similar visual characteristics among different branches, making dense pixel-level manual annotation difficult and labor-intensive. To alleviate this burden, we propose a point-supervised coronary semantic segmentation framework that significantly reduces annotation effort without compromising segmentation accuracy. The primary challenge of point-label-based supervision lies in the model's tendency to overfit sparse point labels, leading to limited generalization to pixel-level predictions. To enrich the supervision signals and stabilize the training process with the sparse point labels, we propose an adaptive foreground mask generation module and a region regularization strategy to ensure accurate semantic guidance while maximizing meaningful coverage of the vascular structures. To enhance coronary topology perception and branch differentiation, we propose a multi-task learning framework that jointly performs keypoint detection and coronary semantic segmentation through a shared feature extraction encoder and two task-specific decoders. The experimental results demonstrate that our point-supervised model achieves performance comparable to the fully supervised model, and outperforms the existing state-of-the-art point-supervised semantic segmentation methods.

AAAI Conference 2026 Conference Paper

Reward Model Evaluation via Automatically-Ranked Policy Alignment

  • Aoran Wang
  • Lei Ou
  • Yang Yu
  • Zongzhang Zhang

Evaluating reward models is a fundamental challenge in Reinforcement Learning (RL), particularly in settings where the reward model is learned or manually designed. The standard paradigm for Reward Model Evaluation (RME) involves training an optimal policy via RL on the given reward model and assessing model quality through the performance of the resulting policy. However, this approach conflates the quality of the reward model with the effectiveness of RL training, and is computationally expensive due to the need for policy optimization. Recent RME methods attempt to circumvent this issue by evaluating reward models directly, without RL, but often rely on impractical assumptions such as access to a ground-truth reward or fail to utilize available supervision in a fine-grained manner. To overcome these limitations, we propose the Policy Preference Alignment Coefficient (PPAC), a novel metric for RME that requires neither RL training nor ground-truth rewards. PPAC first generates a sequence of automatically ranked policy preferences that guarantee monotonic improvement in the policy value, and then quantifies the alignment between these generated preferences and those implied by the candidate reward model. Experimental results across gridworld and continuous control tasks demonstrate that PPAC yields preference sequences with consistently increasing policy values and outperforms existing metrics in evaluating reward model quality.

JBHI Journal 2026 Journal Article

RIHA: Report-Image Hierarchical Alignment for Radiology Report Generation

  • Yucheng Chen
  • Yang Yu
  • Yufei Shi
  • Conghao Xiong
  • Xulei Yang
  • Si Yong Yeo

Radiology report generation (RRG) has emerged as a promising approach to alleviate radiologists' workload and reduce human errors by automatically generating diagnostic reports from medical images. A key challenge in RRG is achieving fine-grained alignment between complex visual features and the hierarchical structure of long-form radiology reports. Although recent methods have improved image-text representation learning, they often treat reports as flat sequences, overlooking their structured sections and semantic hierarchies. This simplification hinders precise cross-modal alignment and weakens RRG accuracy. To address this challenge, we propose RIHA (Report-Image Hierarchical Alignment Transformer), a novel end-to-end framework that performs multi-level alignment between radiological images and their corresponding reports across paragraph, sentence, and word levels. This hierarchical alignment enables more precise cross-modal mapping, essential for capturing the nuanced semantics embedded in clinical narratives. Specifically, RIHA introduces a Visual Feature Pyramid (VFP) to extract multi-scale visual features and a Text Feature Pyramid (TFP) to represent multi-granularity textual structures. These components are integrated through a Cross-modal Hierarchical Alignment (CHA) module, leveraging optimal transport to effectively align visual and textual features across various levels. Furthermore, we incorporate Relative Positional Encoding (RPE) into the decoder to model spatial and semantic relationships among tokens, enhancing the token-level alignment between visual features and generated text. Extensive experiments on two benchmark chest X-ray datasets, IU-Xray and MIMIC-CXR, demonstrate that RIHA outperforms existing state-of-the-art models in both natural language generation and clinical efficacy metrics.

JBHI Journal 2026 Journal Article

Simultaneous Decoding of Wrist Angles and Grasp Forces Based on Channel-Wise Cumulative Spike Trains

  • Yang Yu
  • Yang Xu
  • Jiamin Zhao
  • Dongxuan Li
  • Weichao Guo
  • Xinjun Sheng
  • Xiangyang Zhu

Understanding the underlying mechanism of the neuromuscular system in motion/force generation is essential for human-machine interfacing. However, simultaneous decoding of wrist angles and grasp forces from neural signals remains an open challenge in the field of neural interfacing. In this study, we proposed a scheme leveraging channel-wise cumulative spike trains (cw-CSTs) of motor units to simultaneously decode wrist angles and grasp forces. Specifically, a spatial spike detection method was utilized to detect cw-CSTs from surface electromyography, observing as many motor unit activities as possible. Accordingly, we extracted three neural features to drive the decoders: a twitch force model-based feature (cw-MUdrive) and a discharge rate-based feature (DR-cwCST), both derived from cw-CSTs, and the discharge rate of motor units (DR-MUST) decomposed by a conventional blind source separation algorithm. Wrist- and hand-specific decoders were built to estimate wrist angles and grasp forces via Gaussian process regression. Experiments were conducted with ten subjects, in which they activated wrist motions and grasp forces concurrently. We evaluated the performance with both accuracy and output stability. Results demonstrated that the cwCST-based neural features outperformed the conventional DR-MUST features with both higher accuracy and stability metrics. Additionally, cw-MUdrive performed better than DR-cwCST in grasp force estimation and comparably to DR-cwCST in wrist angle estimation. The outcome provides an effective solution for simultaneously decoding wrist movements and hand grasp forces, promoting the development of natural control in neural interfaces.

AAAI Conference 2026 Conference Paper

SurgPub-Video: A Comprehensive Surgical Video Framework for Enhanced Surgical Intelligence in Vision-Language Model

  • Yaoqian Li
  • Xikai Yang
  • Dunyuan Xu
  • Yang Yu
  • Litao Zhao
  • Xiaowei Hu
  • Jinpeng Li
  • Pheng-Ann Heng

Vision-Language Models (VLMs) have shown significant potential in surgical scene analysis, yet existing models are limited by frame-level datasets and lack high-quality video data with procedural surgical knowledge. To address these challenges, we make the following contributions: (i) SurgPub-Video, a comprehensive dataset of over 3,000 surgical videos and 25 million annotated frames across 11 specialities, sourced from peer-reviewed clinical journals, (ii) SurgLLaVA-Video, a specialized VLM for surgical video understanding, built upon the TinyLLaVA-Video architecture that supports both video-level and frame-level inputs, and (iii) a video-level surgical Visual Question Answering (VQA) benchmark, covering 11 diverse surgical specialities, such as vascular, cardiology, and thoracic. Extensive experiments, conducted on the proposed benchmark and three additional surgical downstream tasks (action recognition, skill assessment, and triplet recognition), show that SurgLLaVA-Video significantly outperforms both general-purpose and surgical-specific VLMs with only three billion parameters.

NeurIPS Conference 2025 Conference Paper

Adaptable Safe Policy Learning from Multi-task Data with Constraint Prioritized Decision Transformer

  • Ruiqi Xue
  • Ziqian Zhang
  • Lihe Li
  • Cong Guan
  • Lei Yuan
  • Yang Yu

Learning safe reinforcement learning (RL) policies from offline multi-task datasets without direct environmental interaction is crucial for efficient and reliable deployment of RL agents. Benefiting from their scalability and strong in-context learning capabilities, recent approaches attempt to utilize Decision Transformer (DT) architectures for offline safe RL, demonstrating promising adaptability across varying safety budgets. However, these methods primarily focus on single-constraint scenarios and struggle with diverse constraint configurations across multiple tasks. Additionally, their reliance on heuristically defined Return-To-Go (RTG) inputs limits flexibility and reduces learning efficiency, particularly in complex multi-task environments. To address these limitations, we propose CoPDT, a novel DT-based framework designed to enhance adaptability to diverse constraints and varying safety budgets. Specifically, CoPDT introduces a constraint prioritized prompt encoder, which leverages sparse binary cost signals to accurately identify constraints, and a constraint prioritized Return-To-Go (CPRTG) token mechanism, which dynamically generates RTGs based on identified constraints and corresponding safety budgets. Extensive experiments on the OSRL benchmark demonstrate that CoPDT achieves superior efficiency and significantly enhanced safety compliance across diverse multi-task scenarios, surpassing state-of-the-art DT-based methods by satisfying safety constraints in more than twice as many tasks.

JBHI Journal 2025 Journal Article

Cognitive Load Prediction From Multimodal Physiological Signals Using Multiview Learning

  • Yingxin Liu
  • Yang Yu
  • Hong Tao
  • Zeqi Ye
  • Si Wang
  • Hao Li
  • Dewen Hu
  • Zongtan Zhou

Predicting cognitive load is a crucial issue in the emerging field of human-computer interaction and holds significant practical value, particularly in flight scenarios. Although previous studies have realized efficient cognitive load classification, new research is still needed to adapt the current state-of-the-art multimodal fusion methods. Here, we proposed a feature selection framework based on multiview learning to address the challenges of information redundancy and reveal the common physiological mechanisms underlying cognitive load. Specifically, the multimodal signal features [electroencephalogram (EEG), electrodermal activity (EDA), electrocardiogram (ECG), electrooculogram (EOG), & eye movements] at three cognitive load levels were estimated during multiattribute task battery (MATB) tasks performed by 22 healthy participants and fed into a feature selection-multiview classification with cohesion and diversity (FS-MCCD) framework. The optimized feature set was extracted from the original feature set by integrating the weight of each view and the feature weights to formulate the ranking criteria. The cognitive load prediction model, evaluated using real-time classification results, achieved an average accuracy of 81.08% and an average F1-score of 80.94% for three-class classification among 22 participants. Furthermore, the weights of the physiological signal features revealed the physiological mechanisms related to cognitive load. Specifically, heightened cognitive load was linked to amplified δ and θ power in the frontal lobe, reduced α power in the parietal lobe, and an increase in pupil diameter. Thus, the proposed multimodal feature fusion framework emphasizes the effectiveness and efficiency of using these features to predict cognitive load.

ICRA Conference 2025 Conference Paper

Dual-AEB: Synergizing Rule-Based and Multimodal Large Language Models for Effective Emergency Braking

  • Wei Zhang 0012
  • Pengfei Li 0007
  • Junli Wang
  • Bingchuan Sun
  • Qihao Jin
  • Guangjun Bao
  • Shibo Rui
  • Yang Yu

Automatic Emergency Braking (AEB) systems are a crucial component in ensuring the safety of passengers in autonomous vehicles. Conventional AEB systems primarily rely on closed-set perception modules to recognize traffic conditions and assess collision risks. To enhance the adaptability of AEB systems in open scenarios, we propose Dual-AEB, a system that combines an advanced multimodal large language model (MLLM) for comprehensive scene understanding and a conventional rule-based rapid AEB to ensure quick response times. To the best of our knowledge, Dual-AEB is the first method to incorporate MLLMs within AEB systems. Through extensive experimentation, we have validated the effectiveness of our method. Codes will be publicly available at https://github.com/ChipsICU/Dual-AEB.

TMLR Journal 2025 Journal Article

Efficient Multi-Agent Cooperation Learning through Teammate Lookahead

  • Feng Chen
  • Xinwei Chen
  • Rong-Jun Qin
  • Cong Guan
  • Lei Yuan
  • Zongzhang Zhang
  • Yang Yu

Cooperative Multi-Agent Reinforcement Learning (MARL) is a rapidly growing research field that has achieved outstanding results across a variety of challenging cooperation tasks. However, existing MARL algorithms typically overlook the concurrent updates of teammate agents. An agent learns from data gathered while cooperating with one set of (current) teammates, but then practices with another set of (updated) teammates. This phenomenon, termed "teammate delay", leads to a discrepancy between the agent's learning objective and the actual evaluation scenario, which can degrade learning stability and efficiency. In this paper, we tackle this challenge by introducing a lookahead strategy that enables agents to learn to cooperate with predicted future teammates, allowing the explicit awareness of concurrent teammate updates. This lookahead strategy is designed to seamlessly integrate with existing policy-gradient-based MARL methods, enhancing their performance without significant modifications to their underlying structures. The extensive experiments demonstrate the effectiveness of this approach, showing that the lookahead strategy can enhance the cooperation learning efficiency and achieve superior performance over the state-of-the-art MARL algorithms.

NeurIPS Conference 2025 Conference Paper

Focus-Then-Reuse: Fast Adaptation in Visual Perturbation Environments

  • Jiahui Wang
  • Chao Chen
  • Jiacheng Xu
  • Zongzhang Zhang
  • Yang Yu

Visual reinforcement learning has shown promise in various real-world applications. However, deploying policies in complex real-world environments with visual perturbations remains a significant challenge. We notice that humans tend to filter information at the object level prior to decision-making, facilitating efficient skill transfer across different contexts. Inspired by this, we introduce Focus-Then-Reuse (FTR), a method utilizing a novel object selection mechanism to focus on task-relevant objects, and directly reuse the simulation-trained policy on them. The training of the object selection mechanism integrates prior knowledge from a vision-language model and feedback from the environment. Experimental results on challenging tasks based on DeepMind Control Suite and Franka Emika Robotics demonstrate that FTR enables rapid adaptation in visual perturbation environments and achieves state-of-the-art performance. The source code is available at https://github.com/LAMDA-RL/FTR.

JBHI Journal 2025 Journal Article

Fourier-Based Frequency Space Disentanglement and Augmentation for Generalizable Face Anti-Spoofing

  • Yang Yu
  • Zhekai Du
  • Heng Luo
  • Chengwei Xiao
  • Jiang Hu

Generalizing face anti-spoofing (FAS) models to unseen distributions is challenging due to domain shifts. Previous domain generalization (DG) based FAS methods focus on learning invariant features across domains in the spatial space, which may be ineffective in detecting subtle spoof patterns. In this paper, we propose a novel approach called Frequency Space Disentanglement and Augmentation (FSDA) for generalizable FAS. Specifically, we leverage Fourier transformation to analyze face images in the frequency space, where the amplitude spectrum captures low-level texture information that forms distinct visual appearances, and the phase spectrum corresponds to the content information. We hypothesize that the liveness of a face is more related to these low-level patterns rather than high-level content information. To locate spoof traces, we disentangle the amplitude spectrum into domain-related and spoof-related components using either empirical or learnable strategies. We then propose a frequency space augmentation technique that mixes the disentangled components of two images to synthesize new variations. By imposing a distillation loss and a consistency loss on the augmented samples, our model learns to capture spoof patterns that are robust to both domain and spoof type variations. Extensive experiments on four FAS datasets demonstrate the superiority of our method in improving the generalization ability of FAS models in various unseen scenarios.

NeurIPS Conference 2025 Conference Paper

Geometric Mixture Models for Electrolyte Conductivity Prediction

  • Anyi Li
  • Jiacheng Cen
  • Songyou Li
  • Mingze Li
  • Yang Yu
  • Wenbing Huang

Accurate prediction of ionic conductivity in electrolyte systems is crucial for advancing numerous scientific and technological applications. While significant progress has been made, current research faces two fundamental challenges: (1) the lack of high-quality standardized benchmarks, and (2) inadequate modeling of geometric structure and intermolecular interactions in mixture systems. To address these limitations, we first reorganize and enhance the CALiSol and DiffMix electrolyte datasets by incorporating geometric graph representations of molecules. We then propose GeoMix, a novel geometry-aware framework that preserves Set-SE(3) equivariance—an essential but challenging property for mixture systems. At the heart of GeoMix lies the Geometric Interaction Network (GIN), an equivariant module specifically designed for intermolecular geometric message passing. Comprehensive experiments demonstrate that GeoMix consistently outperforms diverse baselines (including MLPs, GNNs, and geometric GNNs) across both datasets, validating the importance of cross-molecular geometric interactions and equivariant message passing for accurate property prediction. This work not only establishes new benchmarks for electrolyte research but also provides a general geometric learning framework that advances modeling of mixture systems in energy materials, pharmaceutical development, and beyond.

AAAI Conference 2025 Conference Paper

GRAIN: Multi-Granular and Implicit Information Aggregation Graph Neural Network for Heterophilous Graphs

  • Songwei Zhao
  • Yuan Jiang
  • Zijing Zhang
  • Yang Yu
  • Hechang Chen

Graph neural networks (GNNs) have shown significant success in learning graph representations. However, recent studies reveal that GNNs often fail to outperform simple MLPs on heterophilous graph tasks, where connected nodes may differ in features or labels, challenging the homophily assumption. Existing methods addressing this issue often overlook the importance of information granularity and rarely consider implicit relationships between distant nodes. To overcome these limitations, we propose the Granular and Implicit Graph Network (GRAIN), a novel GNN model specifically designed for heterophilous graphs. GRAIN enhances node embeddings by aggregating multi-view information at various granularity levels and incorporating implicit data from distant, non-neighboring nodes. This approach effectively integrates local and global information, resulting in smoother, more accurate node representations. We also introduce an adaptive graph information aggregator that efficiently combines multi-granularity and implicit data, significantly improving node representation quality, as shown by experiments on 13 datasets covering varying homophily and heterophily. GRAIN consistently outperforms 12 state-of-the-art models, excelling on both homophilous and heterophilous graphs.

AAAI Conference 2025 Conference Paper

GuideNER: Annotation Guidelines Are Better than Examples for In-Context Named Entity Recognition

  • Shizhou Huang
  • Bo Xu
  • Yang Yu
  • Changqun Li
  • Xin Alex Lin

Large language models (LLMs) demonstrate impressive performance on downstream tasks through in-context learning (ICL). However, there is a significant gap between their performance on Named Entity Recognition (NER) and that of fine-tuning methods. We believe this discrepancy is due to inconsistencies in labeling definitions in NER. In addition, recent research indicates that LLMs do not learn the specific input-label mappings from the demonstrations. Therefore, we argue that using examples to implicitly capture the mapping between inputs and labels in in-context learning is not suitable for NER. Instead, it requires explicitly informing the model of the range of entities contained in the labels, such as annotation guidelines. In this paper, we propose GuideNER, which uses LLMs to summarize concise annotation guidelines as contextual information in ICL. We have conducted experiments on widely used NER datasets, and the experimental results indicate that our method can consistently and significantly outperform state-of-the-art methods, while using shorter prompts. Especially on the GENIA dataset, our model outperforms the previous state-of-the-art model by 12.63 F1 points.

AAMAS Conference 2025 Conference Paper

InCLET: Large Language Model In-context Learning can Improve Embodied Instruction-following

  • Peng-Yuan Wang
  • Jing-Cheng Pang
  • Chen-Yang Wang
  • Xuhui Liu
  • Tian-Shuo Liu
  • Si-Hang Yang
  • Hong Qian
  • Yang Yu

Natural language-conditioned reinforcement learning (NLC-RL) empowers embodied agents to complete various tasks following human instruction. However, the unbounded natural language examples still introduce much complexity for the agent that solves concrete RL tasks, which can distract policy learning from completing the task. Consequently, extracting effective task representation from human instruction emerges as the critical component of NLC-RL. While previous methods have attempted to address this issue by learning task-related representation using large language models (LLMs), they highly rely on pre-collected task data and require an extra training procedure. In this study, we uncover the inherent capability of LLMs to generate task representations and present a novel method, in-context learning embedding as task representation (InCLET). InCLET is grounded on a foundational finding that LLM in-context learning using trajectories can greatly help represent tasks. We thus first employ the LLM to imagine task trajectories following the natural language instruction, then use in-context learning of the LLM to generate task representations, and finally aggregate and project them into a compact low-dimensional task representation. This representation is then used to train a human instruction-following agent. We conduct experiments on various embodied control environments and results show that InCLET creates effective task representations. Furthermore, this representation can significantly improve the RL training efficiency, compared to the baseline methods.

TMLR Journal 2025 Journal Article

Interactive Large Language Models for Reliable Answering under Incomplete Context

  • Jing-Cheng Pang
  • Heng-Bo Fan
  • Pengyuan Wang
  • Jia-Hao Xiao
  • Nan Tang
  • Si-Hang Yang
  • Chengxing Jia
  • Ming-Kun Xie

The rise of large language models (LLMs) has revolutionized the way humans interact with artificial intelligence systems. However, their reliability in sensitive applications, such as personal consultations or clinical decision-making, remains limited. A critical shortfall lies in LLMs' inherent lack of interactivity: these models generate responses even when essential context or domain-specific knowledge is absent, risking inaccurate or misleading outputs. A potential approach to mitigate this issue is to enable LLMs to pose clarifying questions, thereby uncovering the missing information required to provide accurate responses. However, previous methods tend to greedily prompt LLMs to ask questions, which burdens the user with potentially irrelevant questions and makes the system less flexible. In this paper, we introduce LaMSeI (Language Model with Selective Interaction), a method that enhances LLMs' ability to judge when interaction is necessary under ambiguous or incomplete contexts. The idea behind LaMSeI is to measure the LLM's uncertainty about the user query and interact with the user only when the uncertainty is high. Additionally, we incorporate active learning techniques to select the most informative questions from a pool of candidates, effectively uncovering the missing context. Our empirical studies across various challenging question-answering benchmarks, where LLMs are posed queries with incomplete context, demonstrate the effectiveness of LaMSeI: it improves answer accuracy from 31.9% to 50.9%, outperforming other leading question-answering frameworks. Moreover, in experiments involving human participants, LaMSeI consistently generates answers superior or comparable to baselines in more than 82% of cases. Finally, we verify the performance of LaMSeI on various LLMs, such as LLAMA2, LLAMA3, Vicuna, and GPT-3.5, highlighting its capability to improve interactive language models.

ICLR Conference 2025 Conference Paper

LLMOPT: Learning to Define and Solve General Optimization Problems from Scratch

  • Caigao Jiang
  • Xiang Shu
  • Hong Qian
  • Xingyu Lu
  • Jun Zhou
  • Aimin Zhou
  • Yang Yu

Optimization problems are prevalent across various scenarios. Formulating and then solving optimization problems described in natural language often requires highly specialized human expertise, which can block the widespread application of optimization-based decision making. To automate problem formulation and solving, leveraging large language models (LLMs) has emerged as a potential approach. However, this kind of approach suffers from the issue of optimization generalization: the accuracy of most current LLM-based methods and the generality of the optimization problem types they can model are still limited. In this paper, we propose a unified learning-based framework called LLMOPT to boost optimization generalization. Starting from natural language descriptions of optimization problems and a pre-trained LLM, LLMOPT constructs the introduced five-element formulation as a universal model for learning to define diverse optimization problem types. LLMOPT then employs multi-instruction tuning to enhance both problem formalization and solver code generation accuracy and generality. After that, to prevent hallucinations in LLMs, such as sacrificing solving accuracy to avoid execution errors, model alignment and a self-correction mechanism are adopted in LLMOPT. We evaluate the optimization generalization ability of LLMOPT and competing methods across six real-world datasets covering roughly 20 fields, such as health, environment, energy, and manufacturing. Extensive experimental results show that LLMOPT is able to model various optimization problem types, such as linear/nonlinear programming, mixed integer programming, and combinatorial optimization, and achieves a notable 11.08% average solving-accuracy improvement over state-of-the-art methods. The code is available at https://github.com/caigaojiang/LLMOPT.

NeurIPS Conference 2025 Conference Paper

Multi-Agent Imitation by Learning and Sampling from Factorized Soft Q-Function

  • Yi-Chen Li
  • Zhongxiang Ling
  • Tao Jiang
  • Fuxiang Zhang
  • Pengyuan Wang
  • Lei Yuan
  • Zongzhang Zhang
  • Yang Yu

Learning from multi-agent expert demonstrations, known as Multi-Agent Imitation Learning (MAIL), provides a promising approach to sequential decision-making. However, existing MAIL methods, including Behavior Cloning (BC) and Adversarial Imitation Learning (AIL), face significant challenges: BC suffers from the compounding-error issue, while the very nature of adversarial optimization makes AIL prone to instability. In this work, we propose Multi-Agent imitation by learning and sampling from FactorIzed Soft Q-function (MAFIS), a novel method that addresses these limitations in both online and offline MAIL settings. Built upon the single-agent IQ-Learn framework, MAFIS introduces a value decomposition network to factorize the imitation objective at the agent level, thus enabling scalable training for multi-agent systems. Moreover, we observe that the soft Q-function implicitly defines the optimal policy as an energy-based model, from which we can sample actions via stochastic gradient Langevin dynamics. This allows us to estimate the gradient of the factorized optimization objective for continuous control tasks, avoiding the adversarial optimization between the soft Q-function and the policy required by prior work. By doing so, we obtain a tractable and non-adversarial objective for both discrete and continuous multi-agent control. Experiments on common benchmarks, including the discrete control tasks StarCraft Multi-Agent Challenge v2 (SMACv2), Gold Miner, and Multi Particle Environments (MPE), as well as the continuous control task Multi-Agent MuJoCo (MaMuJoCo), demonstrate that MAFIS achieves superior performance compared with baselines. Our code is available at https://github.com/LAMDA-RL/MAFIS.

NeurIPS Conference 2025 Conference Paper

TCM-Ladder: A Benchmark for Multimodal Question Answering on Traditional Chinese Medicine

  • Jiacheng Xie
  • Yang Yu
  • Ziyang Zhang
  • Shuai Zeng
  • Jiaxuan He
  • Ayush Vasireddy
  • Xiaoting Tang
  • Congyu Guo

Traditional Chinese Medicine (TCM), as an effective alternative medicine, has been receiving increasing attention. In recent years, the rapid development of large language models (LLMs) tailored for TCM has highlighted the urgent need for an objective and comprehensive evaluation framework to assess their performance on real-world tasks. However, existing evaluation datasets are limited in scope and primarily text-based, lacking a unified and standardized multimodal question-answering (QA) benchmark. To address this issue, we introduce TCM-Ladder, the first comprehensive multimodal QA dataset specifically designed for evaluating large TCM language models. The dataset covers multiple core disciplines of TCM, including fundamental theory, diagnostics, herbal formulas, internal medicine, surgery, pharmacognosy, and pediatrics. In addition to textual content, TCM-Ladder incorporates various modalities such as images and videos. The dataset was constructed using a combination of automated and manual filtering processes and comprises over 52,000 questions, including single-choice, multiple-choice, fill-in-the-blank, diagnostic dialogue, and visual comprehension tasks. We trained a reasoning model on TCM-Ladder and conducted comparative experiments against nine state-of-the-art general-domain and five leading TCM-specific LLMs to evaluate their performance on the dataset. Moreover, we propose Ladder-Score, an evaluation method specifically designed for TCM question answering that effectively assesses answer quality in terms of terminology usage and semantic expression. To the best of our knowledge, this is the first work to systematically evaluate mainstream general-domain and TCM-specific LLMs on a unified multimodal benchmark. The datasets and leaderboard are publicly available at https://tcmladder.com and will be continuously updated. The source code is available at https://github.com/orangeshushu/TCM-Ladder.

NeurIPS Conference 2025 Conference Paper

Uncertainty-Sensitive Privileged Learning

  • Fan-Ming Luo
  • Lei Yuan
  • Yang Yu

Privileged learning efficiently tackles high-dimensional, partially observable decision-making problems by first training a privileged policy (PP) on low-dimensional privileged observations, and then deriving a deployment policy (DP) either by imitating the PP or coupling it with an observation encoder. However, since the DP relies on local and partial observations, a behavioral divergence (BD) often emerges between the DP and the PP, ultimately degrading deployment performance. A promising strategy is to train a PP to learn the optimal behaviors attainable under the DP's observation space by applying reward penalties in regions with large BD. However, producing these behaviors is challenging for the PP because they rely on the DP's information-gathering progress, which is invisible to the PP. In this paper, we quantify the DP's information-gathering progress by estimating the prediction uncertainty of privileged observations reconstructed from partial observations, and accordingly propose the framework of Uncertainty-Sensitive Privileged Learning (USPL). USPL feeds this uncertainty estimation to the PP and combines reward transformation with privileged-observation blurring, driving the PP to choose actions that actively reduce uncertainty and thus gather the necessary information. Experiments across nine tasks demonstrate that USPL significantly reduces the behavioral discrepancies, achieving superior deployment performance compared to baselines. Additional visualization results show that the DP accurately quantifies its uncertainty, and the PP effectively adapts to uncertainty variations. Code is available at https://github.com/FanmingL/USPL.

AAAI Conference 2025 Conference Paper

VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention

  • Jiangning Wei
  • Lixiong Qin
  • Bo Yu
  • Tianjian Zou
  • Chuhan Yan
  • Dandan Xiao
  • Yang Yu
  • Lan Yang

Action recognition is a crucial task in artificial intelligence, with significant implications across various domains. We initially perform a comprehensive analysis of seven prominent action recognition methods across five widely-used datasets. This analysis reveals a critical, yet previously overlooked, observation: as the velocity of actions increases, the performance of these methods variably declines, undermining their robustness. This decline in performance poses significant challenges for their application in real-world scenarios. Building on these findings, we introduce the Velocity-Aware Action Recognition (VA-AR) framework to obtain robust action representations across different velocities. Our principal insight is that rapid actions (e.g., the giant circle backward in uneven bars or a smash in badminton) occur within short time intervals, necessitating smaller temporal attention windows to accurately capture intricate changes. Conversely, slower actions (e.g., drinking water or wiping face) require larger windows to effectively encompass the broader context. VA-AR employs a Mixture of Window Attention (MoWA) strategy, dynamically adjusting its attention window size based on the action's velocity. This adjustment enables VA-AR to obtain a velocity-aware representation, thereby enhancing the accuracy of action recognition. Extensive experiments confirm that VA-AR achieves state-of-the-art performance on the same five datasets, demonstrating VA-AR's effectiveness across a broad spectrum of action recognition scenarios.

AAAI Conference 2024 Conference Paper

ACT: Empowering Decision Transformer with Dynamic Programming via Advantage Conditioning

  • Chen-Xiao Gao
  • Chenyang Wu
  • Mingjun Cao
  • Rui Kong
  • Zongzhang Zhang
  • Yang Yu

Decision Transformer (DT), which employs expressive sequence modeling techniques to perform action generation, has emerged as a promising approach to offline policy optimization. However, DT generates actions conditioned on a desired future return, which is known to bear some weaknesses such as the susceptibility to environmental stochasticity. To overcome DT's weaknesses, we propose to empower DT with dynamic programming. Our method comprises three steps. First, we employ in-sample value iteration to obtain approximated value functions, which involves dynamic programming over the MDP structure. Second, we evaluate action quality in context with estimated advantages. We introduce two types of advantage estimators, IAE and GAE, which are suitable for different tasks. Third, we train an Advantage-Conditioned Transformer (ACT) to generate actions conditioned on the estimated advantages. Finally, during testing, ACT generates actions conditioned on a desired advantage. Our evaluation results validate that, by leveraging the power of dynamic programming, ACT demonstrates effective trajectory stitching and robust action generation in spite of the environmental stochasticity, outperforming baseline methods across various benchmarks. Additionally, we conduct an in-depth analysis of ACT's various design choices through ablation studies. Our code is available at https://github.com/LAMDA-RL/ACT.
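
Of the two advantage estimators the abstract names, GAE is the standard generalized advantage estimator from the RL literature; IAE is specific to the paper and not sketched here. A minimal self-contained GAE sketch (toy rewards and values, not from the paper) is:

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one episode.

    `values` has len(rewards) + 1 entries (bootstrap value last).
    delta_t = r_t + gamma * V(s_{t+1}) - V(s_t); each advantage is a
    discounted sum of deltas with decay gamma * lam, computed backward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

# Toy episode: three steps, made-up rewards and value estimates.
adv = gae([1.0, 0.0, 1.0], [0.5, 0.4, 0.3, 0.0])
print(adv)
```

Conditioning the transformer on these estimated advantages, rather than on a raw return-to-go, is what the abstract describes as the key change from DT.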

IJCAI Conference 2024 Conference Paper

ADMN: Agent-Driven Modular Network for Dynamic Parameter Sharing in Cooperative Multi-Agent Reinforcement Learning

  • Yang Yu
  • Qiyue Yin
  • Junge Zhang
  • Pei Xu
  • Kaiqi Huang

Parameter sharing is a common strategy in multi-agent reinforcement learning (MARL) to make training more efficient and scalable. However, applying parameter sharing among agents indiscriminately hinders the emergence of agent diversity and degrades the final cooperative performance. To better balance parameter sharing and agent diversity, we propose a novel Agent-Driven Modular Network (ADMN), where agents share a base network consisting of multiple specialized modules, and each agent has its own routing to connect these modules. In ADMN, modules are shared among agents to improve training efficiency, while the combination of different modules brings rich diversity. The agent routing at different time steps is learned end-to-end to achieve a dynamic and adaptive balance. We also propose an information-theoretic regularization between agents' routings and their behaviors to further guarantee the identifiability of different routings. We evaluated ADMN in challenging StarCraft micromanagement games and Google Research Football games, and the results demonstrate the superior performance of ADMN, particularly in larger or heterogeneous cooperative tasks.

AAAI Conference 2024 Conference Paper

ANEDL: Adaptive Negative Evidential Deep Learning for Open-Set Semi-supervised Learning

  • Yang Yu
  • Danruo Deng
  • Furui Liu
  • Qi Dou
  • Yueming Jin
  • Guangyong Chen
  • Pheng Ann Heng

Semi-supervised learning (SSL) methods assume that labeled data, unlabeled data, and test data are from the same distribution. Open-set semi-supervised learning (Open-set SSL) considers a more practical scenario, where unlabeled data and test data contain new categories (outliers) not observed in labeled data (inliers). Most previous works focused on outlier detection via binary classifiers, which suffer from insufficient scalability and an inability to distinguish different types of uncertainty. In this paper, we propose a novel framework, Adaptive Negative Evidential Deep Learning (ANEDL), to tackle these limitations. Concretely, we first introduce evidential deep learning (EDL) as an outlier detector to quantify different types of uncertainty, and design different uncertainty metrics for self-training and inference. Furthermore, we propose a novel adaptive negative optimization strategy, making EDL more tailored to the unlabeled dataset containing both inliers and outliers. As demonstrated empirically, our proposed method outperforms existing state-of-the-art methods across four datasets.

NeurIPS Conference 2024 Conference Paper

Bias and Volatility: A Statistical Framework for Evaluating Large Language Model's Stereotypes and the Associated Generation Inconsistency

  • Yiran Liu
  • Ke Yang
  • Zehan Qi
  • Xiao Liu
  • Yang Yu
  • ChengXiang Zhai

We present a novel statistical framework for analyzing stereotypes in large language models (LLMs) by systematically estimating the bias and variation in their generation. Current evaluation metrics in the alignment literature often overlook the randomness of stereotypes caused by the inconsistent generative behavior of LLMs. For example, this inconsistency can result in LLMs displaying contradictory stereotypes, including those related to gender or race, for identical professions across varied contexts. Neglecting such inconsistency can lead to misleading conclusions in alignment evaluations and hinder the accurate assessment of the risk of LLM applications perpetuating or amplifying social stereotypes and unfairness. This work proposes a Bias-Volatility Framework (BVF) that estimates the probability distribution function of LLM stereotypes. Specifically, since the stereotype distribution fully captures an LLM's generation variation, BVF enables the assessment of both the likelihood and the extent to which its outputs are against vulnerable groups, thereby allowing for the quantification of the LLM's aggregated discrimination risk. Furthermore, we introduce a mathematical framework to decompose an LLM's aggregated discrimination risk into two components: bias risk and volatility risk, originating from the mean and variation of the LLM's stereotype distribution, respectively. We apply BVF to assess 12 commonly adopted LLMs and compare their risk levels. Our findings reveal that: i) bias risk is the primary cause of discrimination risk in LLMs; ii) most LLMs exhibit significant pro-male stereotypes for nearly all careers; iii) alignment with reinforcement learning from human feedback lowers discrimination by reducing bias, but increases volatility; iv) discrimination risk in LLMs correlates with key socio-economic factors such as professional salaries.
Finally, we emphasize that BVF can also be used to assess other dimensions of generation inconsistency's impact on LLM behavior beyond stereotypes, such as knowledge mastery.
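
The decomposition of aggregated discrimination risk into bias (from the mean of the stereotype distribution) and volatility (from its variation) can be illustrated with a toy calculation; the scoring scheme and numbers below are hypothetical, not the paper's estimator:

```python
import statistics

def bias_volatility(scores):
    """Toy bias/volatility decomposition of stereotype scores.

    `scores` are hypothetical stereotype scores in [-1, 1] from repeated
    generations for the same prompt (sign encodes direction, e.g.
    pro-male vs. pro-female). Bias risk reflects the distribution's
    mean; volatility risk reflects its sample standard deviation.
    """
    bias = statistics.mean(scores)
    volatility = statistics.stdev(scores)
    return bias, volatility

# Repeated generations for one profession, scored by a hypothetical judge.
samples = [0.6, 0.4, 0.7, 0.5, 0.3]
bias, volatility = bias_volatility(samples)
print(f"bias risk: {bias:.2f}, volatility risk: {volatility:.3f}")
```

A model with near-zero bias but high volatility would still pose risk, since individual generations can land far from the mean, which is why the framework tracks both components.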

ICML Conference 2024 Conference Paper

Causality Based Front-door Defense Against Backdoor Attack on Language Models

  • Yiran Liu
  • Xiaoang Xu
  • Zhiyi Hou
  • Yang Yu

We have developed a new framework based on the theory of causal inference to protect language models against backdoor attacks. Backdoor attackers can poison language models with different types of triggers, such as words, sentences, grammar, and style, enabling them to selectively modify the decision-making of the victim model. However, existing defense approaches are only effective when the backdoor attack form meets specific assumptions, making it difficult to counter diverse backdoor attacks. We propose a new defense framework, Front-door Adjustment for Backdoor Elimination (FABE), based on causal reasoning that does not rely on assumptions about the form of triggers. This method effectively differentiates between spurious and legitimate associations by creating a 'front door' that maps out the actual causal relationships. The term 'front door' refers to a text that retains the semantic equivalence of the initial input, generated by an additional fine-tuned language model, denoted as the defense model. Our defense experiments against various attack methods at the token, sentence, and syntactic levels reduced the attack success rate from 93.63% to 15.12%, improving the defense effect by 2.91 times compared to the best baseline result of 66.61%, achieving state-of-the-art results. Through an ablation study, we analyzed the effect of each module in FABE, demonstrating the importance of complying with the front-door criterion and the front-door adjustment formula, which also explains why previous methods failed. Our code to reproduce the experiments is available at: https://github.com/lyr17/Frontdoor-Adjustment-Backdoor-Elimination.

IJCAI Conference 2024 Conference Paper

Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal

  • Lihe Li
  • Ruotong Chen
  • Ziqian Zhang
  • Zhichao Wu
  • Yi-Chen Li
  • Cong Guan
  • Yang Yu
  • Lei Yuan

Multi-objective reinforcement learning (MORL) approaches address real-world problems with multiple objectives by learning policies maximizing returns weighted by different user preferences. Typical methods assume the objectives remain unchanged throughout the agent's lifetime. However, in some real-world situations, the agent may encounter dynamically changing learning objectives, i.e., different vector-valued reward functions at different learning stages. This issue has not been considered in problem formulation or algorithm design. To address this issue, we formalize the setting as a continual MORL (CMORL) problem for the first time, accounting for the evolution of objectives throughout the learning process. Subsequently, we propose Continual Multi-Objective Reinforcement Learning via Reward Model Rehearsal (CORe3), incorporating a dynamic agent network for rapid adaptation to new objectives. Moreover, we develop a reward model rehearsal technique to recover the reward signals for previous objectives, thus alleviating catastrophic forgetting. Experiments on four CMORL benchmarks showcase that CORe3 effectively learns policies satisfying different preferences on all encountered objectives, and outperforms the best baseline by 171%, highlighting the capability of CORe3 to handle situations with evolving objectives.

AAMAS Conference 2024 Conference Paper

Cost-aware Offline Safe Meta Reinforcement Learning with Robust In-Distribution Online Task Adaptation

  • Cong Guan
  • Ruiqi Xue
  • Ziqian Zhang
  • Lihe Li
  • Yi-Chen Li
  • Lei Yuan
  • Yang Yu

Despite the prominence gained by reinforcement learning (RL) in various domains, ensuring safety in real-world applications remains a significant challenge. Offline safe RL, which learns safe policies from pre-collected data, has emerged to address these concerns. However, existing approaches assume a single constraint mode and lack adaptability to diverse safety constraints. In real-world scenarios, we often find ourselves working with datasets gathered from various tasks, with the aim of developing a generalized policy capable of handling unknown tasks during testing. To deal with this offline safe meta-RL problem, we introduce a novel framework called COSTA, designed to facilitate the learning of a safe generalized policy that can adapt and transfer to unknown tasks during testing. COSTA addresses two key challenges in offline safe meta-RL. First, it develops a cost-aware task inference module using contrastive learning to distinguish tasks based on safety constraints, mitigating the MDP ambiguity problem. Second, COSTA introduces a novel metric, the Safe In-Distribution Score (SIDS), to assess the in-distribution degree of trajectories, considering both reward maximization and cost constraint satisfaction. Trajectories collected with a safe exploration policy are filtered using SIDS for robust online task adaptation. Experimental results in a tailored benchmark suite within the MuJoCo environments demonstrate that COSTA consistently balances safety and reward maximization, outperforming multiple baselines.

AAMAS Conference 2024 Conference Paper

Deep Anomaly Detection via Active Anomaly Search

  • Chao Chen
  • Dawei Wang
  • Feng Mao
  • Jiacheng Xu
  • Zongzhang Zhang
  • Yang Yu

Anomaly detection (AD) holds substantial practical value, and considering the limited labeled data, semi-supervised anomaly detection techniques have garnered increasing attention. We find that previous methods suffer from insufficient exploitation of labeled data and under-exploration of unlabeled data. To tackle this problem, we aim to search for possible anomalies in unlabeled data and use the searched anomalies to enhance performance. We innovatively model this search process as a Markov decision process and utilize a reinforcement learning algorithm to solve it. Our method, Deep Anomaly Detection and Search (DADS), integrates the exploration of unlabeled data and the exploitation of labeled data into one framework. Experimentally, we compare DADS with several state-of-the-art methods on widely used benchmarks, and the results show that DADS can efficiently search anomalies from unlabeled data and learn from them, achieving good performance. Code: https://github.com/LAMDA-RL/DADS

AAMAS Conference 2024 Conference Paper

Disentangling Policy from Offline Task Representation Learning via Adversarial Data Augmentation

  • Chengxing Jia
  • Fuxiang Zhang
  • Yi-Chen Li
  • Chen-Xiao Gao
  • Xu-Hui Liu
  • Lei Yuan
  • Zongzhang Zhang
  • Yang Yu

Offline meta-reinforcement learning (OMRL) proficiently allows an agent to tackle novel tasks while relying solely on a static dataset. For precise and efficient task identification, existing OMRL research suggests learning separate task representations that can be incorporated with the policy input, thus forming a context-based meta-policy. A major approach to training task representations is to adopt contrastive learning using multi-task offline data. The dataset typically encompasses interactions from various policies (i.e., the behavior policies), thus providing a plethora of contextual information regarding different tasks. Nonetheless, amassing data from a substantial number of policies is not only impractical but often unattainable in realistic settings. Instead, we resort to a more constrained yet practical scenario, where multi-task data collection occurs with a limited number of policies. We observe that task representations learned by previous OMRL methods tend to correlate spuriously with the behavior policy instead of reflecting the essential characteristics of the task, resulting in unfavorable out-of-distribution generalization. To alleviate this issue, we introduce a novel algorithm to disentangle the impact of the behavior policy from task representation learning through a process called adversarial data augmentation.
Specifically, the objective of adversarial data augmentation is not merely to generate data analogous to offline data distribution; instead, it aims to create adversarial examples designed to confound learned task representations and lead to incorrect task identification. Our experiments show that learning from such adversarial samples significantly enhances the robustness and effectiveness of the task identification process and realizes satisfactory out-of-distribution generalization. The results in MuJoCo locomotion tasks demonstrate that our approach surpasses other OMRL baselines across various meta-learning task sets.

ICRA Conference 2024 Conference Paper

Distributional Reinforcement Learning with Sample-set Bellman Update

  • Weijian Zhang
  • Jianshu Wang
  • Yang Yu

Distributional Reinforcement Learning (DRL) not only endeavors to optimize expected returns, but also strives to accurately characterize the full distribution of these returns, a key aspect in enhancing risk-aware decision-making. Previous DRL implementations often inappropriately treat statistical estimations as concrete samples, which undermines the integrity of learning. While several studies have addressed this issue, they frequently give rise to new complications, including computational burdens and diminished stochastic behavior. In our work, we present a novel DRL framework that leverages the Gaussian mixture model to adeptly depict the distribution of returns. This approach ensures precise, authentic sampling critical for robust learning, while also preserving computational tractability. Through extensive evaluation on a diverse array of 59 Atari games, our method not only surpasses the efficacy of prior DRL algorithms but also presents formidable competition to contemporary top-tier RL algorithms, signifying a substantial advancement in the field.

NeurIPS Conference 2024 Conference Paper

Efficient Recurrent Off-Policy RL Requires a Context-Encoder-Specific Learning Rate

  • Fan-Ming Luo
  • Zuolin Tu
  • Zefang Huang
  • Yang Yu

Real-world decision-making tasks are usually partially observable Markov decision processes (POMDPs), where the state is not fully observable. Recent progress has demonstrated that recurrent reinforcement learning (RL), which consists of a context encoder based on recurrent neural networks (RNNs) for unobservable state prediction and a multilayer perceptron (MLP) policy for decision making, can mitigate partial observability and serve as a robust baseline for POMDP tasks. However, prior recurrent RL algorithms have faced issues with training instability. In this paper, we find that this instability stems from the autoregressive nature of RNNs, which causes even small changes in RNN parameters to produce large output variations over long trajectories. We therefore propose Recurrent Off-policy RL with Context-Encoder-Specific Learning Rate (RESeL) to tackle this issue. Specifically, RESeL uses a lower learning rate for the context encoder than for the other MLP layers, ensuring the stability of the former while maintaining the training efficiency of the latter. We integrate this technique into existing off-policy RL methods, resulting in the RESeL algorithm. We evaluated RESeL in 18 POMDP tasks, including classic, meta-RL, and credit assignment scenarios, as well as five MDP locomotion tasks. The experiments demonstrate significant improvements in training stability with RESeL. Comparative results show that RESeL achieves notable performance improvements over previous recurrent RL baselines in POMDP tasks, and is competitive with or even surpasses state-of-the-art methods in MDP tasks. Further ablation studies highlight the necessity of applying a distinct learning rate for the context encoder. Code is available at https://github.com/FanmingL/Recurrent-Offpolicy-RL.
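
The core trick, a context-encoder-specific learning rate, amounts to using different step sizes for different parameter groups within the same update. A toy sketch (illustrative names and values, not the paper's implementation):

```python
# Toy illustration of a context-encoder-specific learning rate: the RNN
# context encoder takes a much smaller step than the MLP policy layers
# during the same gradient update. Parameter names and learning rates
# here are hypothetical.

ENCODER_LR = 1e-5   # lower LR to stabilize the autoregressive encoder
MLP_LR = 3e-4       # higher LR to keep the policy layers training fast

params = {"encoder.w": 0.5, "mlp.w": 0.5}
grads = {"encoder.w": 1.0, "mlp.w": 1.0}

def sgd_step(params, grads):
    """One SGD step with a per-group learning rate chosen by name prefix."""
    for name, g in grads.items():
        lr = ENCODER_LR if name.startswith("encoder.") else MLP_LR
        params[name] -= lr * g
    return params

sgd_step(params, grads)
print(params)  # the encoder weight moves far less than the MLP weight
```

In practice this is commonly realized by passing separate parameter groups with different `lr` values to a single optimizer (e.g. PyTorch's per-parameter-group options), rather than hand-rolling the update.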

AAAI Conference 2024 Conference Paper

Episodic Return Decomposition by Difference of Implicitly Assigned Sub-trajectory Reward

  • Haoxin Lin
  • Hongqiu Wu
  • Jiaji Zhang
  • Yihao Sun
  • Junyin Ye
  • Yang Yu

Real-world decision-making problems are usually accompanied by delayed rewards, which affects the sample efficiency of Reinforcement Learning, especially in the extremely delayed case where the only feedback is the episodic reward obtained at the end of an episode. Episodic return decomposition is a promising way to deal with the episodic-reward setting. Several corresponding algorithms have shown remarkable effectiveness of the learned step-wise proxy rewards from return decomposition. However, these existing methods lack either attribution or representation capacity, leading to inefficient decomposition in the case of long-term episodes. In this paper, we propose a novel episodic return decomposition method called Diaster (Difference of implicitly assigned sub-trajectory reward). Diaster decomposes any episodic reward into credits of two divided sub-trajectories at any cut point, and the step-wise proxy rewards come from differences in expectation. We theoretically and empirically verify that the decomposed proxy reward function can guide the policy to be nearly optimal. Experimental results show that our method outperforms previous state-of-the-art methods in terms of both sample efficiency and performance. The code is available at https://github.com/HxLyn3/Diaster.
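
The difference-based decomposition has a convenient telescoping property: if each step-wise proxy reward is the difference of a sub-trajectory reward over successive prefixes, the proxies sum back to the episodic reward. A toy sketch with a stand-in reward function (not the paper's learned model, which works in expectation):

```python
def g(prefix):
    # Stand-in "learned" sub-trajectory reward: any function of a prefix
    # works for illustrating the telescoping identity. g([]) == 0 here.
    return sum(s * s for s in prefix) / (1 + len(prefix))

# Toy trajectory of scalar "states"; values are made up.
trajectory = [1.0, -2.0, 0.5, 3.0]

# Step-wise proxy reward: difference of prefix credits at each step.
proxy = [g(trajectory[: t + 1]) - g(trajectory[:t])
         for t in range(len(trajectory))]

# Telescoping: sum of proxies equals g(full trajectory) - g(empty).
episodic = g(trajectory)
print(sum(proxy), episodic)
```

This identity is what lets a dense proxy reward replace the single end-of-episode signal without changing the total return the policy optimizes.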

AAAI Conference 2024 Conference Paper

Focus-Then-Decide: Segmentation-Assisted Reinforcement Learning

  • Chao Chen
  • Jiacheng Xu
  • Weijian Liao
  • Hao Ding
  • Zongzhang Zhang
  • Yang Yu
  • Rui Zhao

Visual Reinforcement Learning (RL) is a promising approach to achieve human-like intelligence. However, it currently faces challenges in learning efficiently within noisy environments. In contrast, humans can quickly identify task-relevant objects in distraction-filled surroundings by applying previously acquired common knowledge. Recently, foundational models in natural language processing and computer vision have achieved remarkable successes, and the common knowledge within these models can significantly benefit downstream task training. Inspired by these achievements, we aim to incorporate common knowledge from foundational models into visual RL. We propose a novel Focus-Then-Decide (FTD) framework, allowing the agent to make decisions based solely on task-relevant objects. To achieve this, we introduce an attention mechanism to select task-relevant objects from the object set returned by a foundational segmentation model, and only use the task-relevant objects for the subsequent training of the decision module. Additionally, we employ two generic self-supervised objectives to facilitate the rapid learning of this attention mechanism. Experimental results on challenging tasks based on DeepMind Control Suite and Franka Emika Robotics demonstrate that our method can quickly and accurately pinpoint objects of interest in noisy environments. Consequently, it achieves a significant performance improvement over current state-of-the-art algorithms. Project Page: https://www.lamda.nju.edu.cn/chenc/FTD.html Code: https://github.com/LAMDA-RL/FTD

AAMAS Conference 2024 Conference Paper

Foresight Distribution Adjustment for Off-policy Reinforcement Learning

  • Ruifeng Chen
  • Xu-Hui Liu
  • Tian-Shuo Liu
  • Shengyi Jiang
  • Feng Xu
  • Yang Yu

Off-policy reinforcement learning algorithms maintain a replay buffer to utilize samples obtained from earlier policies. The sampling strategy that prioritizes certain data in a buffer to train the value function or the policy has been shown to significantly influence the sample efficiency and the final performance of the algorithm. However, which distribution is the best choice for experience prioritization has not been explored thoroughly. In this paper, we prove that the post-update policy distribution (i.e., the visitation distribution of the policy after the current iteration of update) is the best Q-training distribution to benefit policy improvement. Nevertheless, accessing this "future" distribution is not straightforward. In this work, we find that the current experiences can be modulated by the critic information to simulate the post-update policy distribution. Technically, we derive the gradient of the visitation distribution with respect to the policy parameters and obtain an explicit expression to approximate the post-update policy distribution. The derived method, named Foresight Distribution Adjustment (FoDA), seamlessly integrates with conventional off-policy actor-critic algorithms. Our experiments validate FoDA's ability to closely approximate the post-update policy distribution, and demonstrate its utility in enhancing performance across continuous control task benchmarks.

AAAI Conference 2024 Conference Paper

Generalizable Task Representation Learning for Offline Meta-Reinforcement Learning with Data Limitations

  • Renzhe Zhou
  • Chen-Xiao Gao
  • Zongzhang Zhang
  • Yang Yu

Generalization and sample efficiency have been long-standing issues in reinforcement learning, and thus the field of Offline Meta-Reinforcement Learning (OMRL) has gained increasing attention due to its potential to solve a wide range of problems with static and limited offline data. Existing OMRL methods often assume sufficient training tasks and data coverage to apply contrastive learning to extract task representations. However, such assumptions do not hold in several real-world applications and thus undermine the generalization ability of the representations. In this paper, we consider OMRL with two types of data limitations: limited training tasks and limited behavior diversity, and propose a novel algorithm called GENTLE for learning generalizable task representations in the face of data limitations. GENTLE employs a Task Auto-Encoder (TAE), an encoder-decoder architecture that extracts the characteristics of the tasks. Unlike existing methods, TAE is optimized solely by reconstruction of the state transition and reward, which captures the generative structure of the task models and produces generalizable representations when training tasks are limited. To alleviate the effect of limited behavior diversity, we consistently construct pseudo-transitions to align the data distribution used to train TAE with the data distribution encountered during testing. Empirically, GENTLE significantly outperforms existing OMRL methods on both in-distribution and out-of-distribution tasks across both the given-context protocol and the one-shot protocol.

JBHI Journal 2024 Journal Article

Graph-Driven Simultaneous and Proportional Estimation of Wrist Angle and Grasp Force via High-Density EMG

  • Dongxuan Li
  • Peiqi Kang
  • Yang Yu
  • Peter B. Shull

Myoelectric prostheses are generally unable to accurately control the position and force simultaneously, prohibiting natural and intuitive human-machine interaction. This issue is attributed to the limitations of myoelectric interfaces in effectively decoding multi-degree-of-freedom (multi-DoF) kinematic and kinetic information. We thus propose a novel multi-task, spatial-temporal model driven by graphical high-density electromyography (HD-EMG) for simultaneous and proportional control of wrist angle and grasp force. Twelve subjects were recruited to perform three multi-DoF movements, including wrist pronation/supination, wrist flexion/extension, and wrist abduction/adduction while varying grasp force. Experimental results demonstrated that the proposed model outperformed five baseline models, with normalized root mean square errors of 13.2% and 9.7% and correlation coefficients of 89.6% and 91.9% for wrist angle and grasp force estimation, respectively. In addition, the proposed model still maintained comparable accuracy even with a significant reduction in the number of HD-EMG electrodes. To the best of our knowledge, this is the first study to achieve simultaneous and proportional wrist angle and grasp force control via HD-EMG and has the potential to empower prostheses users to perform a broader range of tasks with greater precision and control, ultimately enhancing their independence and quality of life.

NeurIPS Conference 2024 Conference Paper

KALM: Knowledgeable Agents by Offline Reinforcement Learning from Large Language Model Rollouts

  • Jing-Cheng Pang
  • Si-Hang Yang
  • Kaiyuan Li
  • Xiong-Hui Chen
  • Nan Tang
  • Yang Yu

Reinforcement learning (RL) traditionally trains agents using interaction data, which limits their capabilities to the scope of the training data. To create more knowledgeable agents, leveraging knowledge from large language models (LLMs) has emerged as a promising way. Despite various attempts to combine LLMs with RL, there is commonly a semantic gap between action signals and LLM tokens, which hinders their integration. This paper introduces a novel approach, KALM (Knowledgeable Agents from Language Model Rollouts), to learn knowledgeable agents by bridging this gap. KALM extracts knowledge from LLMs in the form of imaginary rollouts, which agents can learn from through offline RL. To overcome the limitation that LLMs are inherently text-based and may be incompatible with numerical environmental data, KALM fine-tunes the LLM to perform bidirectional translation between textual goals and rollouts. This process enables the LLM to understand the environment better, facilitating the generation of meaningful rollouts. Experiments on robotic manipulation tasks demonstrate that KALM allows agents to rephrase complex goals and tackle novel tasks requiring new optimal behaviors. KALM achieves a 46% success rate in completing 1,400 varied novel goals, significantly outperforming the 26% success rate of baseline methods. Project homepage: https://kalmneurips2024.github.io.

NeurIPS Conference 2024 Conference Paper

Multi-Agent Domain Calibration with a Handful of Offline Data

  • Tao Jiang
  • Lei Yuan
  • Lihe Li
  • Cong Guan
  • Zongzhang Zhang
  • Yang Yu

The shift in dynamics results in significant performance degradation of policies trained in the source domain when deployed in a different target domain, posing a challenge for the practical application of reinforcement learning (RL) in real-world scenarios. Domain transfer methods aim to bridge this dynamics gap through techniques such as domain adaptation or domain calibration. While domain adaptation involves refining the policy through extensive interactions in the target domain, it may not be feasible for sensitive fields like healthcare and autonomous driving. On the other hand, offline domain calibration utilizes only static data from the target domain to adjust the physics parameters of the source domain (e.g., a simulator) to align with the target dynamics, enabling the direct deployment of the trained policy without sacrificing performance, which makes it the most promising approach for policy deployment. However, existing techniques primarily rely on evolutionary algorithms for calibration, resulting in low sample efficiency. To tackle this issue, we propose a novel framework, Madoc (Multi-agent domain calibration). Firstly, we formulate a bandit RL objective to match the target trajectory distribution by learning a couple of classifiers. We then address the challenge of a large domain parameter space by modeling domain calibration as a cooperative multi-agent reinforcement learning (MARL) problem. Specifically, we utilize a Variational Autoencoder (VAE) to automatically cluster physics parameters with similar effects on the dynamics, grouping them into distinct agents. These grouped agents coordinately train calibration policies to adjust multiple parameters using MARL. Our empirical evaluation on 21 offline locomotion tasks in the D4RL and NeoRL benchmarks showcases the superior performance of our method compared to strong existing offline model-based RL, offline domain calibration, and hybrid offline-and-online RL baselines.

TMLR Journal 2024 Journal Article

One by One, Continual Coordinating with Humans via Hyper-Teammate Identification

  • Cong Guan
  • Feng Chen
  • Ke Xue
  • Chunpeng Fan
  • Lichao Zhang
  • Ziqian Zhang
  • Pengyao Zhao
  • Zongzhang Zhang

One of the primary objectives in modern artificial intelligence research is to empower agents to effectively coordinate with diverse teammates, particularly human teammates. Previous studies focused on training agents either with a fixed population of pre-generated teammates or through the co-evolution of distinct populations of agents and teammates. However, it is challenging to enumerate all possible teammates in advance, and it is costly, or even impractical, to maintain a sufficiently diverse population and repeatedly interact with previously encountered teammates. Additional design considerations, such as prioritized sampling, are also required to ensure efficient training. To address these challenges and obtain an efficient human-AI coordination paradigm, we propose a novel approach called Concord. Considering that human participants tend to arrive in a sequential manner, we model the training process with different teammates as a continual learning framework, akin to how humans learn and adapt in the real world. We propose a mechanism based on hyper-teammate identification to prevent catastrophic forgetting while promoting forward knowledge transfer. Concretely, we introduce a teammate recognition module that captures the identification of corresponding teammates. Leveraging this identification, a well-coordinated AI policy can be generated via the hyper-network. The entire framework is trained in a decomposed policy-gradient manner, allowing for effective credit assignment among agents. This approach enables us to train agents with each generated teammate or human one by one, ensuring that agents can coordinate effectively with concurrent teammates without forgetting previous knowledge. Our approach outperforms multiple baselines in various multi-agent benchmarks, either with generated human proxies or real human participants.

NeurIPS Conference 2024 Conference Paper

Policy Learning from Tutorial Books via Understanding, Rehearsing and Introspecting

  • Xiong-Hui Chen
  • Ziyan Wang
  • Yali Du
  • Shengyi Jiang
  • Meng Fang
  • Yang Yu
  • Jun Wang

When humans need to learn a new skill, we can acquire knowledge through written books, including textbooks, tutorials, etc. However, current research on decision-making, like reinforcement learning (RL), has primarily required numerous real interactions with the target environment to learn a skill, failing to utilize the existing knowledge already summarized in text. The success of Large Language Models (LLMs) sheds light on utilizing the knowledge behind such books. In this paper, we discuss a new policy learning problem called Policy Learning from tutorial Books (PLfB), built on the shoulders of LLM systems, which aims to leverage rich resources such as tutorial books to derive a policy network. Inspired by how humans learn from books, we solve the problem via a three-stage framework: Understanding, Rehearsing, and Introspecting (URI). In particular, it first rehearses decision-making trajectories based on the knowledge derived from understanding the books, then introspects on the imaginary dataset to distill a policy network. We build two benchmarks for PLfB based on the Tic-Tac-Toe and Football games. In experiments, URI's policy achieves at least a 44% net win rate against GPT-based agents without any real data; in the Football game, a complex scenario, URI's policy beats the built-in AIs with a 37% winning rate, while the GPT-based agent achieves only a 6% winning rate. The project page: https://plfb-football.github.io.

IJCAI Conference 2024 Conference Paper

Pre-training General User Representation with Multi-type APP Behaviors

  • Yuren Zhang
  • Min Hou
  • Kai Zhang
  • Yuqing Yuan
  • Chao Song
  • Zhihao Ye
  • Enhong Chen
  • Yang Yu

In numerous user-centric services on mobile applications (apps), accurately mining user interests and generating effective user representations are paramount. Traditional approaches, which often involve training task-specific user representations, are becoming increasingly impractical due to their high computational costs and limited adaptability. This paper introduces a novel solution to this challenge: the Multi-type App-usage Fusion Network (MAFN). MAFN innovatively pre-trains universal user representations, leveraging multi-type app behaviors to overcome key limitations in existing methods. We address two primary challenges: 1) the varying frequency of user behaviors (ranging from low-frequency actions like (un)installations to high-frequency yet insightful app launches); and 2) the integration of multi-type behaviors to form a cohesive representation. Our approach involves the creation of novel pre-training tasks that harness self-supervised signals from diverse app behaviors, capturing both long-term and short-term user interests. MAFN's unique fusion approach effectively amalgamates these interests into a unified vector space, facilitating the development of a versatile, general-purpose user representation. With a practical workflow, extensive experiments with three typical downstream tasks on real-world datasets verify the effectiveness of our approach.

NeurIPS Conference 2024 Conference Paper

Provably and Practically Efficient Adversarial Imitation Learning with General Function Approximation

  • Tian Xu
  • Zhilong Zhang
  • Ruishuo Chen
  • Yihao Sun
  • Yang Yu

As a prominent category of imitation learning methods, adversarial imitation learning (AIL) has garnered significant practical success powered by neural network approximation. However, existing theoretical studies on AIL are primarily limited to simplified scenarios such as tabular and linear function approximation and involve complex algorithmic designs that hinder practical implementation, highlighting a gap between theory and practice. In this paper, we explore the theoretical underpinnings of online AIL with general function approximation. We introduce a new method called optimization-based AIL (OPT-AIL), which centers on performing online optimization for reward functions and optimism-regularized Bellman error minimization for Q-value functions. Theoretically, we prove that OPT-AIL achieves polynomial expert sample complexity and interaction complexity for learning near-expert policies. To our best knowledge, OPT-AIL is the first provably efficient AIL method with general function approximation. Practically, OPT-AIL only requires the approximate optimization of two objectives, thereby facilitating practical implementation. Empirical studies demonstrate that OPT-AIL outperforms previous state-of-the-art deep AIL methods in several challenging tasks.

AAAI Conference 2024 Conference Paper

Rethinking the Development of Large Language Models from the Causal Perspective: A Legal Text Prediction Case Study

  • Haotian Chen
  • Lingwei Zhang
  • Yiran Liu
  • Yang Yu

While large language models (LLMs) exhibit impressive performance on a wide range of NLP tasks, most of them fail to learn causality from correlation, which prevents them from learning rationales for prediction. Rethinking the whole development process of LLMs is of great urgency as they are adopted in various critical tasks that need rationales, including legal text prediction (e.g., legal judgment prediction). In this paper, we first explain the underlying theoretical mechanism of their failure and argue that both the data imbalance and the omission of causality in model design and selection render the current training-testing paradigm unable to select the unique causality-based model from correlation-based models. Second, we take the legal text prediction task as the testbed and reconstruct the development process of LLMs by simultaneously infusing causality into model architectures and organizing causality-based adversarial attacks for evaluation. Specifically, we base our reconstruction on our theoretical analysis and propose a causality-aware self-attention mechanism (CASAM), which prevents LLMs from entangling causal and non-causal information by restricting the interaction between causal and non-causal words. Meanwhile, we propose eight kinds of legal-specific attacks to form causality-based model selection. Our extensive experimental results demonstrate that our proposed CASAM achieves state-of-the-art (SOTA) performance and the strongest robustness on three commonly used legal text prediction benchmarks. We make our code publicly available at https://github.com/Carrot-Red/Rethink-LLM-development.

NeurIPS Conference 2024 Conference Paper

Towards an Information Theoretic Framework of Context-Based Offline Meta-Reinforcement Learning

  • Lanqing Li
  • Hai Zhang
  • Xinyu Zhang
  • Shatong Zhu
  • Yang Yu
  • Junqiao Zhao
  • Pheng-Ann Heng

As a marriage between offline RL and meta-RL, the advent of offline meta-reinforcement learning (OMRL) has shown great promise in enabling RL agents to multi-task and quickly adapt while acquiring knowledge safely. Among these, context-based OMRL (COMRL), a popular paradigm, aims to learn a universal policy conditioned on effective task representations. In this work, by examining several key milestones in the field of COMRL, we propose to integrate these seemingly independent methodologies into a unified framework. Most importantly, we show that the pre-existing COMRL algorithms are essentially optimizing the same mutual information objective between the task variable $M$ and its latent representation $Z$ by implementing various approximate bounds. Such theoretical insight offers ample design freedom for novel algorithms. As demonstrations, we propose a supervised and a self-supervised implementation of $I(Z; M)$, and empirically show that the corresponding optimization algorithms exhibit remarkable generalization across a broad spectrum of RL benchmarks, context shift scenarios, data qualities and deep learning architectures. This work lays the information theoretic foundation for COMRL methods, leading to a better understanding of task representation learning in the context of reinforcement learning. Given its generality, we envision our framework as a promising offline pre-training paradigm of foundation models for decision making.

ICRA Conference 2023 Conference Paper

A General Locomotion Approach for a Novel Multi-legged Spherical Robot

  • Dun Yang
  • Yun Fei Liu
  • Yang Yu

As a kind of ground mobile robot, deformable robots have many advantages, such as strong terrain adaptability, light weight, and portability. Among these robots, the radial skeleton robot has better stability and controllability. However, because the friction between foot and ground is hard to predict, the accuracy of the gait generation algorithms studied so far is very low. Furthermore, there is currently no closed-loop control scheme for this kind of robot. We design a 12-legged radial skeleton robot with high-extension-ratio legs, propose a high-precision gait generation algorithm for any multi-legged radial skeleton robot, and present the first closed-loop control scheme for this kind of robot. A dynamic model considering contact friction is established, and the robot offers omnidirectional motion, high-precision trajectory tracking, and motion robustness. Prototype experiments verify that our method achieves the highest accuracy when tracking trajectories and remains robust in unknown environments.

JBHI Journal 2023 Journal Article

A Novel and Efficient Surface Electromyography Decomposition Algorithm Using Local Spatial Information

  • Yang Xu
  • Yang Yu
  • Miaojuan Xia
  • Xinjun Sheng
  • Xiangyang Zhu

Motor unit spike trains (MUSTs) decomposed from surface electromyography (sEMG) have been an emerging solution for neural interfacing, especially for the control of upper limb prosthetics. Accurate and efficient decomposition techniques are essential and desirable. However, most decomposition methods are designed for motor units (MUs) with a global energy maximum in a single or large muscle, while forearm muscles are usually small and slender with low global energy. Thus, we propose a novel approach using local spatial information towards more accurate and efficient sEMG decomposition of forearm muscles. A fast spatial spike detection method is proposed to replace the time-consuming iteration process of blind source separation (BSS) methods. Here, spatial distribution characteristics of motor unit action potentials are leveraged to pre-classify the candidate MUs, and further to create initial MU templates, aiming to avoid repeated convergence to high-energy MUs. The results on both simulated and experimental sEMG signals show that low-energy MUs from small muscles are more easily found compared with the conventional BSS algorithm. Specifically, the proposed method can identify 40% more reliable MUs while requiring only 30% of the computation time. The outcomes provide a novel solution for more efficient sEMG decomposition, potentially paving the way for MUST-based non-invasive neural interfaces.

IJCAI Conference 2023 Conference Paper

A Unified View of Deep Learning for Reaction and Retrosynthesis Prediction: Current Status and Future Challenges

  • Ziqiao Meng
  • Peilin Zhao
  • Yang Yu
  • Irwin King

Reaction and retrosynthesis prediction are two fundamental tasks in computational chemistry. In recent years, these two tasks have attracted great attention from both the machine learning and drug discovery communities. Various deep learning approaches have been proposed to tackle these two problems and achieved initial success. In this survey, we conduct a comprehensive investigation of advanced deep learning-based reaction and retrosynthesis prediction models. We first summarize the design mechanisms, strengths and weaknesses of the state-of-the-art approaches. Then we further discuss limitations of current solutions and open challenges in the problem itself. Last but not least, we present some promising directions to facilitate future research. To the best of our knowledge, this paper is the first comprehensive and systematic survey on a unified understanding of reaction and retrosynthesis prediction.

NeurIPS Conference 2023 Conference Paper

AdaptSSR: Pre-training User Model with Augmentation-Adaptive Self-Supervised Ranking

  • Yang Yu
  • Qi Liu
  • Kai Zhang
  • Yuren Zhang
  • Chao Song
  • Min Hou
  • Yuqing Yuan
  • Zhihao Ye

User modeling, which aims to capture users' characteristics or interests, heavily relies on task-specific labeled data and suffers from the data sparsity issue. Several recent studies tackled this problem by pre-training the user model on massive user behavior sequences with a contrastive learning task. Generally, these methods assume different views of the same behavior sequence constructed via data augmentation are semantically consistent, i.e., reflecting similar characteristics or interests of the user, and thus maximize their agreement in the feature space. However, due to the diverse interests and heavy noise in user behaviors, existing augmentation methods tend to lose certain characteristics of the user or introduce noisy behaviors. Thus, forcing the user model to directly maximize the similarity between the augmented views may result in a negative transfer. To this end, we propose to replace the contrastive learning task with a new pretext task: Augmentation-Adaptive Self-Supervised Ranking (AdaptSSR), which alleviates the requirement of semantic consistency between the augmented views while pre-training a discriminative user model. Specifically, we adopt a multiple pairwise ranking loss which trains the user model to capture the similarity orders between the implicitly augmented view, the explicitly augmented view, and views from other users. We further employ an in-batch hard negative sampling strategy to facilitate model training. Moreover, considering the distinct impacts of data augmentation on different behavior sequences, we design an augmentation-adaptive fusion mechanism to automatically adjust the similarity order constraint applied to each sample based on the estimated similarity between the augmented views. Extensive experiments on both public and industrial datasets with six downstream tasks verify the effectiveness of AdaptSSR.

NeurIPS Conference 2023 Conference Paper

Adversarial Counterfactual Environment Model Learning

  • Xiong-Hui Chen
  • Yang Yu
  • Zhengmao Zhu
  • Zhihua Yu
  • Chen Zhenjun
  • Chenghe Wang
  • Yinan Wu
  • Rong-Jun Qin

An accurate environment dynamics model is crucial for various downstream tasks in sequential decision-making, such as counterfactual prediction, off-policy evaluation, and offline reinforcement learning. Currently, these models are learned through empirical risk minimization (ERM) by step-wise fitting of historical transition data. This approach was previously believed to be unreliable over long-horizon rollouts because of compounding errors, which can lead to uncontrollable inaccuracies in predictions. In this paper, we find that the challenge extends beyond just long-term prediction errors: we reveal that even when planning with one step, learned dynamics models can also perform poorly due to the selection bias of behavior policies during data collection. This issue significantly misleads the policy optimization process even in identifying single-step optimal actions, further leading to greater risk in sequential decision-making scenarios. To tackle this problem, we introduce a novel model-learning objective called adversarial weighted empirical risk minimization (AWRM). AWRM incorporates an adversarial policy that exploits the model to generate a data distribution that weakens the model's prediction accuracy, and subsequently, the model is learned under this adversarial data distribution. We implement a practical algorithm, GALILEO, for AWRM and evaluate it on two synthetic tasks, three continuous-control tasks, and a real-world application. The experiments demonstrate that GALILEO can accurately predict counterfactual actions and improve various downstream tasks, including offline policy evaluation and improvement, as well as online decision-making.

AAAI Conference 2023 Short Paper

Anti-drifting Feature Selection via Deep Reinforcement Learning (Student Abstract)

  • Aoran Wang
  • Hongyang Yang
  • Feng Mao
  • Zongzhang Zhang
  • Yang Yu
  • Xiaoyang Liu

Feature selection (FS) is a crucial procedure in machine learning pipelines for its significant benefits in removing data redundancy and mitigating model overfitting. Since concept drift is a widespread phenomenon in streaming data and could severely affect model performance, effective FS on concept drifting data streams is imminent. However, existing state-of-the-art FS algorithms fail to adjust their selection strategy adaptively when the effective feature subset changes, making them unsuitable for drifting streams. In this paper, we propose a dynamic FS method that selects effective features on concept drifting data streams via deep reinforcement learning. Specifically, we present two novel designs: (i) a skip-mode reinforcement learning environment that shrinks action space size for high-dimensional FS tasks; (ii) a curiosity mechanism that generates intrinsic rewards to address the long-horizon exploration problem. The experiment results show that our proposed method outperforms other FS methods and can dynamically adapt to concept drifts.

NeurIPS Conference 2023 Conference Paper

CMMA: Benchmarking Multi-Affection Detection in Chinese Multi-Modal Conversations

  • Yazhou Zhang
  • Yang Yu
  • Qing Guo
  • Benyou Wang
  • Dongming Zhao
  • Sagar Uprety
  • Dawei Song
  • Qiuchi Li

Human communication has a multi-modal and multi-affection nature. The inter-relatedness of different emotions and sentiments poses a challenge to jointly detecting multiple human affections with multi-modal clues. Recent advances in this field employed multi-task learning paradigms to render the inter-relatedness across tasks, but the scarcity of publicly available resources sets a limit to the potential of such work. To fill this gap, we build the first Chinese Multi-modal Multi-Affection conversation (CMMA) dataset, which contains 3,000 multi-party conversations and 21,795 multi-modal utterances collected from various styles of TV series. CMMA contains a wide variety of affection labels, including sentiment, emotion, sarcasm and humor, as well as novel inter-correlation values between certain pairs of tasks. Moreover, it provides the topic and speaker information in conversations, which promotes better modeling of conversational context. On the dataset, we empirically analyze the influence of different data modalities and conversational contexts on different affection analysis tasks, and exhibit the practical benefit of inter-task correlations. The full dataset is publicly available for research at https://github.com/annoymity2022/Chinese-Dataset.

JBHI Journal 2023 Journal Article

Cumulative Spike Train Estimation for Muscle Excitation Assessment From Surface EMG Using Spatial Spike Detection

  • Yang Xu
  • Yang Yu
  • Zeming Zhao
  • Chen Chen
  • Xinjun Sheng

Estimating the cumulative spike train (CST) of motor units (MUs) from surface electromyography (sEMG) is essential for the effective control of neural interfaces. However, the limited accuracy of existing estimation methods greatly hinders the further development of neural interfaces. This paper proposes a simple but effective approach for identifying the CST based on spatial spike detection from high-density sEMG. Specifically, we use a spatial sliding window to detect spikes according to the spatial propagation characteristics of the motor unit action potential, focusing on the spikes of activated MUs in a local area rather than those of a specific MU. We validated the effectiveness of our proposed method through an experiment involving wrist flexion/extension and pronation/supination, comparing it with a recognized CST estimation method and an MU decomposition-based method. The results demonstrated that the proposed method obtained higher accuracy on multi-DoF wrist torque estimation leveraging the estimated CST compared to the other methods. On average, the correlation coefficient (R) and the normalized root mean square error (nRMSE) between the estimation results and recorded force were 0.96 $\pm$ 0.03 and 10.1% $\pm$ 3.7%, respectively. Moreover, there was extremely high agreement between the CSTs of the proposed method and the MU decomposition method. The outcomes reveal the superiority of the proposed method in identifying CSTs, which can provide promising driving signals for neural interfaces.

AAAI Conference 2023 Short Paper

Deep Anomaly Detection and Search via Reinforcement Learning (Student Abstract)

  • Chao Chen
  • Dawei Wang
  • Feng Mao
  • Zongzhang Zhang
  • Yang Yu

Semi-supervised anomaly detection is a data mining task which aims at learning features from partially-labeled datasets. We propose Deep Anomaly Detection and Search (DADS) with reinforcement learning. During the training process, the agent searches for possible anomalies in the unlabeled dataset to enhance performance. Empirically, we compare DADS with several methods in the settings of leveraging known anomalies to detect both other known and unknown anomalies. Results show that DADS achieves good performance.

IJCAI Conference 2023 Conference Paper

Doubly Stochastic Graph-based Non-autoregressive Reaction Prediction

  • Ziqiao Meng
  • Peilin Zhao
  • Yang Yu
  • Irwin King

Organic reaction prediction is a critical task in drug discovery. Recently, researchers have achieved non-autoregressive reaction prediction by modeling the redistribution of electrons, resulting in state-of-the-art top-1 accuracy, and enabling parallel sampling. However, the current non-autoregressive decoder does not satisfy two essential rules of electron redistribution modeling simultaneously: the electron-counting rule and the symmetry rule. This violation of the physical constraints of chemical reactions impairs model performance. In this work, we propose a new framework called ReactionSink that combines two doubly stochastic self-attention mappings to obtain electron redistribution predictions that follow both constraints. We further extend our solution to a general multi-head attention mechanism with augmented constraints. To achieve this, we apply Sinkhorn's algorithm to iteratively update self-attention mappings, which imposes doubly conservative constraints as additional informative priors on electron redistribution modeling. We theoretically demonstrate that our ReactionSink can simultaneously satisfy both rules, which the current decoder mechanism cannot do. Empirical results show that our approach consistently improves the predictive performance of non-autoregressive models and does not bring an unbearable additional computational cost.
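The core mechanism this abstract relies on, Sinkhorn's algorithm, alternates row and column normalizations until a score matrix becomes (approximately) doubly stochastic. A minimal illustrative sketch in NumPy, not the paper's implementation (the function name, iteration count, and random inputs are assumptions):

```python
import numpy as np

def sinkhorn(logits, n_iters=100):
    """Approximately project a square score matrix onto the set of
    doubly stochastic matrices by alternating row/column normalization
    in log space (Sinkhorn-Knopp iteration)."""
    log_p = logits.astype(float).copy()
    for _ in range(n_iters):
        # normalize rows, then columns, working in log space for stability
        log_p -= np.log(np.exp(log_p).sum(axis=1, keepdims=True))
        log_p -= np.log(np.exp(log_p).sum(axis=0, keepdims=True))
    return np.exp(log_p)

rng = np.random.default_rng(0)
A = sinkhorn(rng.normal(size=(5, 5)))  # rows and columns both sum to ~1
```

Every row and column of the returned matrix sums to roughly one, which mirrors the "doubly conservative" constraints imposed on the self-attention maps above.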

AAMAS Conference 2023 Conference Paper

How To Guide Your Learner: Imitation Learning with Active Adaptive Expert Involvement

  • Xu-Hui Liu
  • Feng Xu
  • Xinyu Zhang
  • Tianyuan Liu
  • Shengyi Jiang
  • Ruifeng Chen
  • Zongzhang Zhang
  • Yang Yu

Imitation learning aims to mimic the behavior of experts without explicit reward signals. Passive imitation learning methods which use static expert datasets typically suffer from compounding error, low sample efficiency, and high hyper-parameter sensitivity. In contrast, active imitation learning methods solicit expert interventions to address these limitations. However, recent active imitation learning methods are designed based on human intuitions or empirical experience without theoretical guarantee. In this paper, we propose a novel active imitation learning framework based on a teacher-student interaction model, in which the teacher's goal is to identify the best teaching behavior and actively affect the student's learning process. By solving the optimization objective of this framework, we propose a practical implementation, which we name AdapMen. Theoretical analysis shows that AdapMen can improve the error bound and avoid compounding error under mild conditions. Experiments on the MetaDrive benchmark and Atari 2600 games validate our theoretical analysis and show that our method achieves near-expert performance with much less expert involvement and fewer total sampling steps than previous methods. The code is available at https://github.com/liuxhym/AdapMen.

NeurIPS Conference 2023 Conference Paper

Imitation Learning from Imperfection: Theoretical Justifications and Algorithms

  • Ziniu Li
  • Tian Xu
  • Zeyu Qin
  • Yang Yu
  • Zhi-Quan Luo

Imitation learning (IL) algorithms excel in acquiring high-quality policies from expert data for sequential decision-making tasks. However, their effectiveness is hampered when expert data is limited. To tackle this challenge, a novel framework called (offline) IL with supplementary data has been proposed, which enhances learning by incorporating an additional yet imperfect dataset obtained inexpensively from sub-optimal policies. Nonetheless, learning becomes challenging due to the potential inclusion of out-of-expert-distribution samples. In this work, we propose a mathematical formalization of this framework, uncovering its limitations. Our theoretical analysis reveals that a naive approach, applying the behavioral cloning (BC) algorithm concept to the combined set of expert and supplementary data, may fall short of vanilla BC, which solely relies on expert data. This deficiency arises due to the distribution shift between the two data sources. To address this issue, we propose a new importance-sampling-based technique for selecting data within the expert distribution. We prove that the proposed method eliminates the gap of the naive approach, highlighting its efficacy when handling imperfect data. Empirical studies demonstrate that our method outperforms previous state-of-the-art methods in tasks including robotic locomotion control, Atari video games, and image classification. Overall, our work underscores the potential of improving IL by leveraging diverse data sources through effective data selection.
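The importance-sampling idea described above can be illustrated with a toy one-dimensional example: supplementary samples are reweighted by the density ratio between the expert distribution and the supplementary distribution, then resampled. Everything below (the Gaussian stand-ins and the closed-form ratio) is an illustrative assumption; in practice the ratio would be estimated from data rather than computed from known densities:

```python
import numpy as np

rng = np.random.default_rng(0)
expert = rng.normal(0.0, 1.0, size=2000)  # stand-in for expert-distribution data
supp = rng.normal(1.5, 1.0, size=2000)    # sub-optimal supplementary data (shifted)

def density_ratio(x):
    """p_expert(x) / p_supp(x) for the two unit-variance Gaussians above."""
    return np.exp((-(x - 0.0) ** 2 + (x - 1.5) ** 2) / 2.0)

# Self-normalized importance weights, then resample the supplementary
# data so it behaves as if drawn from the expert distribution.
w = density_ratio(supp)
w /= w.sum()
resampled = rng.choice(supp, size=2000, replace=True, p=w)
```

After resampling, the supplementary data's mean shifts from about 1.5 toward the expert's mean of 0, which is the kind of distribution-shift correction the theoretical analysis formalizes.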

AAAI Conference 2023 Short Paper

Learning Generalizable Batch Active Learning Strategies via Deep Q-networks (Student Abstract)

  • Yi-Chen Li
  • Wen-Jie Shen
  • Boyu Zhang
  • Feng Mao
  • Zongzhang Zhang
  • Yang Yu

To handle a large amount of unlabeled data, batch active learning (BAL) queries humans for the labels of a batch of the most valuable data points at every round. Most current BAL strategies are based on human-designed heuristics, such as uncertainty sampling or mutual information maximization. However, there exists a disagreement between these heuristics and the ultimate goal of BAL, i.e., optimizing the model's final performance within the query budgets. This disagreement leads to a limited generality of these heuristics. To this end, we formulate BAL as an MDP and propose a data-driven approach based on deep reinforcement learning. Our method learns the BAL strategy by maximizing the model's final performance. Experiments on the UCI benchmark show that our method can achieve competitive performance compared to existing heuristics-based approaches.

NeurIPS Conference 2023 Conference Paper

Learning World Models with Identifiable Factorization

  • Yuren Liu
  • Biwei Huang
  • Zhengmao Zhu
  • Honglong Tian
  • Mingming Gong
  • Yang Yu
  • Kun Zhang

Extracting a stable and compact representation of the environment is crucial for efficient reinforcement learning in high-dimensional, noisy, and non-stationary environments. Different categories of information coexist in such environments -- how to effectively extract and disentangle the information remains a challenging problem. In this paper, we propose IFactor, a general framework to model four distinct categories of latent state variables that capture various aspects of information within the RL system, based on their interactions with actions and rewards. Our analysis establishes block-wise identifiability of these latent variables, which not only provides a stable and compact representation but also discloses that all reward-relevant factors are significant for policy learning. We further present a practical approach to learning the world model with identifiable blocks, ensuring the removal of redundancies but retaining minimal and sufficient information for policy optimization. Experiments in synthetic worlds demonstrate that our method accurately identifies the ground-truth latent variables, substantiating our theoretical findings. Moreover, experiments in variants of the DeepMind Control Suite and RoboDesk showcase the superior performance of our approach over baselines.

AAAI Conference 2023 Short Paper

Model-Based Offline Weighted Policy Optimization (Student Abstract)

  • Renzhe Zhou
  • Zongzhang Zhang
  • Yang Yu

A promising direction for applying reinforcement learning to the real world is learning from offline datasets. Offline reinforcement learning aims to learn policies from pre-collected datasets without online interaction with the environment. Due to the lack of further interaction, offline reinforcement learning faces severe extrapolation error, leading to policy learning failure. In this paper, we investigate the weighted Bellman update in model-based offline reinforcement learning. We explore uncertainty estimation in ensemble dynamics models, then use a variational autoencoder to fit the behavioral prior, and finally propose an algorithm called Model-Based Offline Weighted Policy Optimization (MOWPO), which uses a combination of model confidence and behavioral prior as weights to reduce the impact of inaccurate samples on policy optimization. Experiment results show that MOWPO achieves better performance than state-of-the-art algorithms, and both the model confidence weight and the behavioral prior weight can play an active role in offline policy optimization.

NeurIPS Conference 2023 Conference Paper

Natural Language Instruction-following with Task-related Language Development and Translation

  • Jing-Cheng Pang
  • Xin-Yu Yang
  • Si-Hang Yang
  • Xiong-Hui Chen
  • Yang Yu

Natural language-conditioned reinforcement learning (RL) enables agents to follow human instructions. Previous approaches generally implemented language-conditioned RL by providing the policy with human instructions in natural language (NL) and training the policy to follow instructions. In this outside-in approach, the policy must comprehend the NL and manage the task simultaneously. However, the unbounded NL examples often bring much extra complexity for solving concrete RL tasks, which can distract policy learning from completing the task. To ease the learning burden of the policy, we investigate an inside-out scheme for natural language-conditioned RL by developing a task language (TL) that is task-related and easily understood by the policy. Besides, we employ a translator to translate natural language into the TL, which is used in RL to achieve efficient policy training. We implement this scheme as TALAR (TAsk Language with predicAte Representation) that learns multiple predicates to model object relationships as the TL. Experiments indicate that TALAR not only better comprehends NL instructions but also leads to a better instruction-following policy that significantly improves the success rate over baselines and adapts to unseen expressions of NL instruction. Besides, the TL is also an effective sub-task abstraction compatible with hierarchical RL.

AAAI Conference 2023 Conference Paper

Policy-Independent Behavioral Metric-Based Representation for Deep Reinforcement Learning

  • Weijian Liao
  • Zongzhang Zhang
  • Yang Yu

Behavioral metrics can calculate the distance between states or state-action pairs from differences in rewards and transitions. By virtue of their capability to filter out task-irrelevant information in theory, using them to shape a state embedding space becomes a new trend of representation learning for deep reinforcement learning (RL), especially when there are explicit distracting factors in observation backgrounds. However, due to the tight coupling between the metric and the RL policy, such metric-based methods may result in less informative embedding spaces which can weaken their aid to the baseline RL algorithm and even consume more samples to learn. We resolve this by proposing a new behavioral metric. It decouples the learning of the RL policy and the metric owing to its independence of the RL policy. We theoretically justify its scalability to continuous state and action spaces and design a practical way to incorporate it into an RL procedure as a representation learning target. We evaluate our approach on DeepMind control tasks with default and distracting backgrounds. By statistically reliable evaluation protocols, our experiments demonstrate our approach is superior to previous metric-based methods in terms of sample efficiency and asymptotic performance in both backgrounds.

AAMAS Conference 2023 Conference Paper

Prioritized Tasks Mining for Multi-Task Cooperative Multi-Agent Reinforcement Learning

  • Yang Yu
  • Qiyue Yin
  • Junge Zhang
  • Kaiqi Huang

Multi-task learning improves data efficiency in cooperative multi-agent reinforcement learning, since agents can learn multiple related tasks simultaneously and the cooperation knowledge in a task can be utilized by others. However, existing methods mainly learn multiple cooperation tasks uniformly, regardless of their complexity and significance. In this paper, we propose a new framework called Prioritized Tasks Mining (PTM) for multi-task cooperation problems, which helps agents to identify and mine higher-priority cooperation tasks, so as to learn more effective coordinated strategies for multiple cooperation tasks. Specifically, agents use hindsight during training to identify the priority of different tasks, and explore and exploit higher-priority cooperative tasks to mine more sophisticated coordinated strategies. We evaluate PTM in challenging multi-task StarCraft micromanagement games with different scales, and results demonstrate that our method consistently outperforms all strong baselines.

AAAI Conference 2023 Conference Paper

Robust Multi-Agent Coordination via Evolutionary Generation of Auxiliary Adversarial Attackers

  • Lei Yuan
  • Ziqian Zhang
  • Ke Xue
  • Hao Yin
  • Feng Chen
  • Cong Guan
  • Lihe Li
  • Chao Qian

Cooperative Multi-agent Reinforcement Learning (CMARL) has shown to be promising for many real-world applications. Previous works mainly focus on improving coordination ability via solving MARL-specific challenges (e.g., non-stationarity, credit assignment, scalability), but ignore the policy perturbation issue when testing in a different environment. This issue hasn't been considered in problem formulation or efficient algorithm design. To address this issue, we firstly model the problem as a Limited Policy Adversary Dec-POMDP (LPA-Dec-POMDP), where some coordinators from a team might accidentally and unpredictably encounter a limited number of malicious action attacks, but the regular coordinators still strive for the intended goal. Then, we propose Robust Multi-Agent Coordination via Evolutionary Generation of Auxiliary Adversarial Attackers (ROMANCE), which enables the trained policy to encounter diversified and strong auxiliary adversarial attacks during training, thus achieving high robustness under various policy perturbations. Concretely, to avoid the ego-system overfitting to a specific attacker, we maintain a set of attackers, which is optimized to guarantee the attackers high attacking quality and behavior diversity. The goal of quality is to minimize the ego-system coordination effect, and a novel diversity regularizer based on sparse action is applied to diversify the behaviors among attackers. The ego-system is then paired with a population of attackers selected from the maintained attacker set, and alternately trained against the constantly evolving attackers. Extensive experiments on multiple scenarios from SMAC indicate our ROMANCE provides comparable or better robustness and generalization ability than other baselines.

AAMAS Conference 2023 Conference Paper

Self-Motivated Multi-Agent Exploration

  • Shaowei Zhang
  • Jiahan Cao
  • Lei Yuan
  • Yang Yu
  • De-Chuan Zhan

In cooperative multi-agent reinforcement learning (CMARL), it is critical for agents to achieve a balance between self-exploration and team collaboration. However, agents can hardly accomplish the team task without coordination and they would be trapped in a local optimum where easy cooperation is accessed without enough individual exploration. Recent works mainly concentrate on agents' coordinated exploration, which brings about the exponentially grown exploration of the state space. To address this issue, we propose Self-Motivated Multi-Agent Exploration (SMMAE), which aims to achieve success in team tasks by adaptively finding a trade-off between self-exploration and team cooperation. In SMMAE, we train an independent exploration policy for each agent to maximize their own visited state space. Each agent learns an adjustable exploration probability based on the stability of the joint team policy. The experiments on highly cooperative tasks in the StarCraft II micromanagement benchmark (SMAC) demonstrate that SMMAE can explore task-related states more efficiently, accomplish coordinated behaviours and boost the learning performance.

ICML Conference 2023 Conference Paper

Uncertainty Estimation by Fisher Information-based Evidential Deep Learning

  • Danruo Deng
  • Guangyong Chen
  • Yang Yu
  • Furui Liu
  • Pheng-Ann Heng

Uncertainty estimation is a key factor that makes deep learning reliable in practical applications. Recently proposed evidential neural networks explicitly account for different uncertainties by treating the network's outputs as evidence to parameterize the Dirichlet distribution, and achieve impressive performance in uncertainty estimation. However, for samples with high data uncertainty that are annotated with one-hot labels, the evidence-learning process for the mislabeled classes is over-penalized and remains hindered. To address this problem, we propose a novel method, Fisher Information-based Evidential Deep Learning ($\mathcal{I}$-EDL). In particular, we introduce the Fisher Information Matrix (FIM) to measure the informativeness of evidence carried by each sample, according to which we can dynamically reweight the objective loss terms to make the network focus more on the representation learning of uncertain classes. The generalization ability of our network is further improved by optimizing the PAC-Bayesian bound. As demonstrated empirically, our proposed method consistently outperforms traditional EDL-related algorithms in multiple uncertainty estimation tasks, especially in the more challenging few-shot classification settings.

AAAI Conference 2023 Conference Paper

Untargeted Attack against Federated Recommendation Systems via Poisonous Item Embeddings and the Defense

  • Yang Yu
  • Qi Liu
  • Likang Wu
  • Runlong Yu
  • Sanshi Lei Yu
  • Zaixi Zhang

Federated recommendation (FedRec) can train personalized recommenders without collecting user data, but the decentralized nature makes it susceptible to poisoning attacks. Most previous studies focus on the targeted attack to promote certain items, while the untargeted attack that aims to degrade the overall performance of the FedRec system remains less explored. In fact, untargeted attacks can disrupt the user experience and bring severe financial loss to the service provider. However, existing untargeted attack methods are either inapplicable or ineffective against FedRec systems. In this paper, we delve into the untargeted attack and its defense for FedRec systems. (i) We propose ClusterAttack, a novel untargeted attack method. It uploads poisonous gradients that converge the item embeddings into several dense clusters, which make the recommender generate similar scores for these items in the same cluster and perturb the ranking order. (ii) We propose a uniformity-based defense mechanism (UNION) to protect FedRec systems from such attacks. We design a contrastive learning task that regularizes the item embeddings toward a uniform distribution. Then the server filters out these malicious gradients by estimating the uniformity of updated item embeddings. Experiments on two public datasets show that ClusterAttack can effectively degrade the performance of FedRec systems while circumventing many defense methods, and UNION can improve the resistance of the system against various untargeted attacks, including our ClusterAttack.
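The uniformity estimate that drives the UNION defense can be sketched with a standard log-mean-Gaussian-kernel score over normalized embeddings (in the spirit of Wang and Isola's uniformity loss; the exact loss in the paper may differ, and all names below are illustrative):

```python
import numpy as np

def uniformity(emb, t=2.0):
    """Uniformity score of L2-normalized embeddings: log of the mean
    Gaussian-kernel similarity over all distinct pairs. More negative
    means the embeddings are more uniformly spread on the hypersphere."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sq_dists = ((emb[:, None, :] - emb[None, :, :]) ** 2).sum(axis=-1)
    off_diag = ~np.eye(len(emb), dtype=bool)
    return np.log(np.exp(-t * sq_dists[off_diag]).mean())

rng = np.random.default_rng(0)
spread = rng.normal(size=(200, 16))                            # roughly uniform directions
clustered = rng.normal(size=(1, 16)) + 0.05 * rng.normal(size=(200, 16))
```

Item embeddings driven into dense clusters by an attack score close to 0 under this measure, while benign, well-spread embeddings score much lower, so a server could flag updates that push this score upward.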

AAAI Conference 2022 Conference Paper

Adapt to Environment Sudden Changes by Learning a Context Sensitive Policy

  • Fan-Ming Luo
  • Shengyi Jiang
  • Yang Yu
  • Zongzhang Zhang
  • Yi-Feng Zhang

Dealing with real-world reinforcement learning (RL) tasks, we shall be aware that the environment may have sudden changes. We expect that a robust policy is able to handle such changes and adapt to the new environment rapidly. Context-based meta reinforcement learning aims at learning environment adaptable policies. These methods adopt a context encoder to perceive the environment on-the-fly, following which a contextual policy makes environment adaptive decisions according to the context. However, previous methods show lagged and unstable context extraction, which are hard to handle sudden changes well. This paper proposes an environment sensitive contextual policy learning (ESCP) approach, in order to improve both the sensitivity and the robustness of context encoding. ESCP is composed of three key components: variance minimization that forces a rapid and stable encoding of the environment context, relational matrix determinant maximization that avoids trivial solutions, and a history-truncated recurrent neural network model that avoids old memory interference. We use a grid-world task and 5 locomotion controlling tasks with changing parameters to empirically assess our algorithm. Experiment results show that in environments with both in-distribution and out-of-distribution parameter changes, ESCP can not only better recover the environment encoding, but also adapt more rapidly to the post-change environment (10× faster in the grid-world) while the return performance is kept or improved, compared with state-of-the-art meta RL methods.

NeurIPS Conference 2022 Conference Paper

Bayesian Optimistic Optimization: Optimistic Exploration for Model-based Reinforcement Learning

  • Chenyang Wu
  • Tianci Li
  • Zongzhang Zhang
  • Yang Yu

Reinforcement learning (RL) is a general framework for modeling sequential decision making problems, at the core of which lies the dilemma of exploitation and exploration. An agent failing to explore systematically will inevitably fail to learn efficiently. Optimism in the face of uncertainty (OFU) is a conventionally successful strategy for efficient exploration. An agent following the OFU principle explores actively and efficiently. However, when applied to model-based RL, it involves specifying a confidence set of the underlying model and solving a series of nonlinear constrained optimization, which can be computationally intractable. This paper proposes an algorithm, Bayesian optimistic optimization (BOO), which adopts a dynamic weighting technique for enforcing the constraint rather than explicitly solving a constrained optimization problem. BOO is a general algorithm proved to be sample-efficient for models in a finite-dimensional reproducing kernel Hilbert space. We also develop techniques for effective optimization and show through some simulation experiments that BOO is competitive with the existing algorithms.

JMLR Journal 2022 Journal Article

Distributed Bootstrap for Simultaneous Inference Under High Dimensionality

  • Yang Yu
  • Shih-Kang Chao
  • Guang Cheng

We propose a distributed bootstrap method for simultaneous inference on high-dimensional massive data that are stored and processed with many machines. The method produces an $\ell_\infty$-norm confidence region based on a communication-efficient de-biased lasso, and we propose an efficient cross-validation approach to tune the method at every iteration. We theoretically prove a lower bound on the number of communication rounds $\tau_{\min}$ that warrants the statistical accuracy and efficiency. Furthermore, $\tau_{\min}$ only increases logarithmically with the number of workers and the intrinsic dimensionality, while nearly invariant to the nominal dimensionality. We test our theory by extensive simulation studies, and a variable screening task on a semi-synthetic dataset based on the US Airline On-Time Performance dataset. The code to reproduce the numerical results is available in Supplementary Material.

NeurIPS Conference 2022 Conference Paper

Efficient Multi-agent Communication via Self-supervised Information Aggregation

  • Cong Guan
  • Feng Chen
  • Lei Yuan
  • Chenghe Wang
  • Hao Yin
  • Zongzhang Zhang
  • Yang Yu

Utilizing messages from teammates can improve coordination in cooperative Multi-agent Reinforcement Learning (MARL). To obtain meaningful information for decision-making, previous works typically combine raw messages generated by teammates with local information as inputs for policy. However, neglecting the aggregation of multiple messages poses great inefficiency for policy learning. Motivated by recent advances in representation learning, we argue that efficient message aggregation is essential for good coordination in MARL. In this paper, we propose Multi-Agent communication via Self-supervised Information Aggregation (MASIA), with which agents can aggregate the received messages into compact representations with high relevance to augment the local policy. Specifically, we design a permutation invariant message encoder to generate common information aggregated representation from raw messages and optimize it via reconstructing and shooting future information in a self-supervised manner. Each agent would utilize the most relevant parts of the aggregated representation for decision-making by a novel message extraction mechanism. Empirical results demonstrate that our method significantly outperforms strong baselines on multiple cooperative MARL tasks for various task settings.

IJCAI Conference 2022 Conference Paper

Efficient Multi-Agent Communication via Shapley Message Value

  • Di Xue
  • Lei Yuan
  • Zongzhang Zhang
  • Yang Yu

Utilizing messages from teammates is crucial in cooperative multi-agent tasks due to the partially observable nature of the environment. Naively asking messages from all teammates without pruning may confuse individual agents, hindering the learning process and impairing the whole system's performance. Most previous work either utilizes a gate or employs an attention mechanism to extract relatively important messages. However, they do not explicitly evaluate each message's value, failing to learn an efficient communication protocol in more complex scenarios. To tackle this issue, we model the teammates of an agent as a message coalition and calculate the Shapley Message Value (SMV) of each agent within it. SMV reflects the contribution of each message to an agent and redundant messages can be spotted in this way effectively. On top of that, we design a novel framework named Shapley Message Selector (SMS), which learns to predict the SMVs of teammates for an agent solely based on local information so that the agent can only query those teammates with positive SMVs. Empirically, we demonstrate that our method can prune redundant messages and achieve comparable or better performance in various multi-agent cooperative scenarios than full communication settings and existing strong baselines.

AAAI Conference 2022 Conference Paper

Invariant Action Effect Model for Reinforcement Learning

  • Zheng-Mao Zhu
  • Shengyi Jiang
  • Yu-Ren Liu
  • Yang Yu
  • Kun Zhang

Good representations can help RL agents perform concise modeling of their surroundings, and thus support effective decision-making in complex environments. Previous methods learn good representations by imposing extra constraints on dynamics. However, from the causal perspective, the causation between the action and its effect is not fully considered in those methods, which leads to the ignorance of the underlying relations among the action effects on the transitions. Based on the intuition that the same action always causes similar effects among different states, we induce such causation by taking the invariance of action effects among states as the relation. By explicitly utilizing such invariance, in this paper, we show that a better representation can be learned and potentially improves the sample efficiency and the generalization ability of the learned policy. We propose the Invariant Action Effect Model (IAEM) to capture the invariance in action effects, where the effect of an action is represented as the residual of representations from neighboring states. IAEM is composed of two parts: (1) a new contrastive-based loss to capture the underlying invariance of action effects; (2) an individual action effect module with a self-adapted weighting strategy to tackle the corner cases where the invariance does not hold. The extensive experiments on two benchmarks, i.e., Grid-World and Atari, show that the representations learned by IAEM preserve the invariance of action effects. Moreover, with the invariant action effect, IAEM can accelerate the learning process by 1.6x, rapidly generalize to new environments by fine-tuning on a few components, and outperform other dynamics-based representation methods by 1.4x in limited steps.

IJCAI Conference 2022 Conference Paper

Multi-Agent Concentrative Coordination with Decentralized Task Representation

  • Lei Yuan
  • Chenghe Wang
  • Jianhao Wang
  • Fuxiang Zhang
  • Feng Chen
  • Cong Guan
  • Zongzhang Zhang
  • Chongjie Zhang

Value-based multi-agent reinforcement learning (MARL) methods hold the promise of promoting coordination in cooperative settings. Popular MARL methods mainly focus on the scalability or the representational capacity of value functions. Such a learning paradigm can reduce agents' uncertainties and promote coordination. However, they fail to leverage the task structure decomposability, which generally exists in real-world multi-agent systems (MASs), leading to a significant amount of time exploring the optimal policy in complex scenarios. To address this limitation, we propose a novel framework Multi-Agent Concentrative Coordination (MACC) based on task decomposition, with which an agent can implicitly form local groups to reduce the learning space to facilitate coordination. In MACC, agents first learn representations for subtasks from their local information and then implement an attention mechanism to concentrate on the most relevant ones. Thus, agents can pay targeted attention to specific subtasks and improve coordination. Extensive experiments on various complex multi-agent benchmarks demonstrate that MACC achieves remarkable performance compared to existing methods.

NeurIPS Conference 2022 Conference Paper

Multi-agent Dynamic Algorithm Configuration

  • Ke Xue
  • Jiacheng Xu
  • Lei Yuan
  • Miqing Li
  • Chao Qian
  • Zongzhang Zhang
  • Yang Yu

Automated algorithm configuration relieves users from tedious, trial-and-error tuning tasks. A popular algorithm configuration tuning paradigm is dynamic algorithm configuration (DAC), in which an agent learns dynamic configuration policies across instances by reinforcement learning (RL). However, in many complex algorithms, there may exist different types of configuration hyperparameters, and such heterogeneity may bring difficulties for classic DAC which uses a single-agent RL policy. In this paper, we aim to address this issue and propose multi-agent DAC (MA-DAC), with one agent working for one type of configuration hyperparameter. MA-DAC formulates the dynamic configuration of a complex algorithm with multiple types of hyperparameters as a contextual multi-agent Markov decision process and solves it by a cooperative multi-agent RL (MARL) algorithm. To instantiate, we apply MA-DAC to a well-known optimization algorithm for multi-objective optimization problems. Experimental results show the effectiveness of MA-DAC in not only achieving superior performance compared with other configuration tuning approaches based on heuristic rules, multi-armed bandits, and single-agent RL, but also being capable of generalizing to different problem classes. Furthermore, we release the environments in this paper as a benchmark for testing MARL algorithms, with the hope of facilitating the application of MARL.

AAAI Conference 2022 Conference Paper

Multi-Agent Incentive Communication via Decentralized Teammate Modeling

  • Lei Yuan
  • Jianhao Wang
  • Fuxiang Zhang
  • Chenghe Wang
  • Zongzhang Zhang
  • Yang Yu
  • Chongjie Zhang

Effective communication can improve coordination in cooperative multi-agent reinforcement learning (MARL). One popular communication scheme is exchanging agents’ local observations or latent embeddings and using them to augment individual local policy inputs. Such a communication paradigm can reduce uncertainty for local decision-making and induce implicit coordination. However, it enlarges agents’ local policy spaces and increases learning complexity, leading to poor coordination in complex settings. To handle this limitation, this paper proposes a novel framework named Multi-Agent Incentive Communication (MAIC) that allows each agent to learn to generate incentive messages and bias other agents’ value functions directly, resulting in effective explicit coordination. Our method first learns targeted teammate models, with which each agent can anticipate teammates’ action selections and generate tailored messages for specific agents. We further introduce a novel regularization to leverage interaction sparsity and improve communication efficiency. MAIC is agnostic to specific MARL algorithms and can be flexibly integrated with different value function factorization methods. Empirical results demonstrate that our method significantly outperforms baselines and achieves excellent performance on multiple cooperative MARL tasks.

NeurIPS Conference 2022 Conference Paper

NeoRL: A Near Real-World Benchmark for Offline Reinforcement Learning

  • Rong-Jun Qin
  • Xingyuan Zhang
  • Songyi Gao
  • Xiong-Hui Chen
  • Zewen Li
  • Weinan Zhang
  • Yang Yu

Offline reinforcement learning (RL) aims at learning effective policies from historical data without extra environment interactions. In our experience of applying offline RL, we noticed that previous offline RL benchmarks commonly involve significant reality gaps, which we identify as overly rich and exploratory datasets, degraded baselines, and missing policy validation. In many real-world situations, to ensure system safety, running an overly exploratory policy to collect diverse data is prohibited, so only a narrow data distribution is available. The resulting policy is regarded as effective if it is better than the working behavior policy, and the policy model can be deployed only if it has been well validated, rather than merely having completed training. In this paper, we present a Near real-world offline RL benchmark, named NeoRL, to reflect these properties. NeoRL datasets are collected with a more conservative strategy. Moreover, NeoRL contains an offline training and offline validation pipeline before the online test, corresponding to real-world situations. We then evaluate recent state-of-the-art offline RL algorithms on NeoRL. The empirical results demonstrate that some offline RL algorithms are less competitive than behavior cloning and the deterministic behavior policy, implying that they could be less effective in real-world tasks than in previous benchmarks. We also find that current offline policy evaluation methods can hardly select the best policy. We hope this work will shed some light on future research and on deploying RL in real-world systems.

JBHI Journal 2022 Journal Article

Non-Invasive Analysis of Motor Unit Activation During Simultaneous and Continuous Wrist Movements

  • Chen Chen
  • Yang Yu
  • Xinjun Sheng
  • Xiangyang Zhu

Surface electromyography (EMG) signals have shown promising applications in human-machine interfacing (HMI) systems such as orthotics, prosthetics, and exoskeletons. Nevertheless, existing myoelectric control methods, generally based on time-domain or frequency-domain features, could not directly interpret neural commands. EMG decomposition techniques have become a prevailing solution to decode the motor neuron discharges from the spinal cord, whereas only single degree-of-freedom (DoF) movements are primarily involved in the current neural-based interfaces, resulting in limited intuitiveness and functionality. Here, we propose a non-invasive framework to analyze motor unit activities and estimate wrist torques during simultaneous contractions of multiple DoFs. Motor unit discharges were decoded from surface EMG signals and pooled into groups during sequential wrist movements. Then three neural features were extracted and linearly projected to the torques of multi-DoF tasks. On average, there were 44 ± 13 motor units identified for each motion with a PNR value of 25.8 ± 2.9 dB. The neural features outperformed the classic EMG feature on the estimation accuracy with higher correlation coefficients and smoothness. These results demonstrate the feasibility and superiority of the proposed framework in kinetics estimation of simultaneous movements, extending the potential applications of surface EMG decomposition in human-machine interfaces.

JAIR Journal 2022 Journal Article

On Efficient Reinforcement Learning for Full-length Game of StarCraft II

  • Ruo-Ze Liu
  • Zhen-Jia Pang
  • Zhou-Yu Meng
  • Wenhai Wang
  • Yang Yu
  • Tong Lu

StarCraft II (SC2) poses a grand challenge for reinforcement learning (RL); the main difficulties include a huge state space, a varying action space, and a long time horizon. In this work, we investigate a set of RL techniques for the full-length game of StarCraft II. We investigate a hierarchical RL approach, where the hierarchy involves two levels. One is the extracted macro-actions from experts’ demonstration trajectories, which reduce the action space by an order of magnitude. The other is a hierarchical architecture of neural networks, which is modular and facilitates scaling. We investigate a curriculum transfer training procedure that trains the agent from the simplest level to the hardest level. We train the agent on a single machine with 4 GPUs and 48 CPU threads. On a 64x64 map and using restrictive units, we achieve a win rate of 99% against the difficulty level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat models, we achieve a 93% win rate against the most difficult non-cheating built-in AI (level-7). In this extended version of the paper, we improve our architecture to train the agent against the most difficult cheating-level AIs (level-8, level-9, and level-10). We also test our method on different maps to evaluate the extensibility of our approach. With a final 3-layer hierarchical architecture and significant training tricks, we increase the win rates against level-8, level-9, and level-10 to 96%, 97%, and 94%, respectively. Our codes and models are all open-sourced at https://github.com/liuruoze/HierNet-SC2. To provide a baseline with reference to AlphaStar for our work, as well as for the research and open-source community, we reproduce a scaled-down version of it, mini-AlphaStar (mAS). The latest version of mAS is 1.07; it can be trained using supervised learning and reinforcement learning on the raw action space, which has 564 actions. It is designed to run training on a single common machine, by making the hyper-parameters adjustable and some settings simplified. We can then compare our work with mAS using the same computing resources and training time. The experiment results show that our method is more effective when using limited resources. The inference and training codes of mini-AlphaStar are all open-sourced at https://github.com/liuruoze/mini-AlphaStar. We hope our study could shed some light on future research of efficient reinforcement learning on SC2 and other large-scale games.

NeurIPS Conference 2021 Conference Paper

Adaptive Online Packing-guided Search for POMDPs

  • Chenyang Wu
  • Guoyu Yang
  • Zongzhang Zhang
  • Yang Yu
  • Dong Li
  • Wulong Liu
  • Jianye Hao

The partially observable Markov decision process (POMDP) provides a general framework for modeling an agent's decision process under state uncertainty, and online planning plays a pivotal role in solving it. A belief is a distribution over states representing state uncertainty. Methods for large-scale POMDP problems rely on the same idea of sampling both states and observations: instead of exact belief updating, a collection of sampled states is used to approximate the belief; instead of considering all possible observations, only a set of sampled observations is considered. Inspired by this, we take one step further and propose an online planning algorithm, Adaptive Online Packing-guided Search (AdaOPS), to better approximate beliefs with an adaptive particle filtering technique and to balance estimation bias and variance by fusing similar observation branches. Theoretically, our algorithm is guaranteed to find an $\epsilon$-optimal policy with high probability given enough planning time under some mild assumptions. We evaluate our algorithm on several tricky POMDP domains, and it outperforms the state-of-the-art in all of them.

AAAI Conference 2021 Conference Paper

Circles are like Ellipses, or Ellipses are like Circles? Measuring the Degree of Asymmetry of Static and Contextual Word Embeddings and the Implications to Representation Learning

  • Wei Zhang
  • Murray Campbell
  • Yang Yu
  • Sadhana Kumaravel

Human judgments of word similarity have been a popular method of evaluating the quality of word embeddings, but they fail to measure geometric properties such as asymmetry. For example, it is more natural to say “Ellipses are like Circles” than “Circles are like Ellipses”. Such asymmetry has been observed in word evocation experiments, where one word is used to recall another. These association data have been understudied as a measure of embedding quality. In this paper, we use three well-known evocation datasets for this purpose and study both static embeddings and contextual embeddings such as BERT. To handle the dynamic nature of BERT embeddings, we probe BERT’s conditional probabilities as a language model, using a large number of Wikipedia contexts to derive a theoretically justifiable Bayesian asymmetry score. The results show that asymmetry judgments and similarity judgments disagree, and that the asymmetry judgment aligns with BERT’s strong performance on “extrinsic evaluations”. This is the first time contextual embeddings’ strength has been shown on an intrinsic evaluation, and the asymmetry judgment provides a new perspective for evaluating contextual embeddings and new insights for representation learning.

NeurIPS Conference 2021 Conference Paper

Cross-modal Domain Adaptation for Cost-Efficient Visual Reinforcement Learning

  • Xiong-Hui Chen
  • Shengyi Jiang
  • Feng Xu
  • Zongzhang Zhang
  • Yang Yu

In visual-input sim-to-real scenarios, to overcome the reality gap between images rendered in simulators and those from the real world, domain adaptation, i.e., learning an aligned representation space between simulators and the real world and then training and deploying policies in the aligned representation, is a promising direction. Previous methods focus on same-modal domain adaptation. However, those methods require building and running simulators that render high-quality images, which can be difficult and costly. In this paper, we consider a more cost-efficient setting of visual-input sim-to-real where only low-dimensional states are simulated. We first point out that the objective of learning mapping functions in previous methods that align the representation spaces is ill-posed and prone to yield an incorrect mapping. When the mapping crosses modalities, previous methods fail more easily. Our algorithm, Cross-mOdal Domain Adaptation with Sequential structure (CODAS), mitigates the ill-posedness by utilizing the sequential nature of the data sampling process in RL tasks. Experiments on MuJoCo and Hand Manipulation Suite tasks show that agents deployed with our method achieve performance similar to that in the source domain, while those deployed with previous methods designed for same-modal domain adaptation suffer a larger performance gap.

AAAI Conference 2021 Short Paper

Enhancing Context-Based Meta-Reinforcement Learning Algorithms via An Efficient Task Encoder (Student Abstract)

  • Feng Xu
  • Shengyi Jiang
  • Hao Yin
  • Zongzhang Zhang
  • Yang Yu
  • Ming Li
  • Dong Li
  • Wulong Liu

Meta-Reinforcement Learning (meta-RL) algorithms enable agents to adapt to new tasks from small amounts of exploration, based on the experience of similar tasks. Recent studies have pointed out that a good representation of a task is key to the success of off-policy context-based meta-RL. Inspired by contrastive methods in unsupervised representation learning, we propose a new method to learn the task representation based on the mutual information between transition tuples in a trajectory and the task embedding. We also propose a new estimation of task similarity based on the Q-function, which can be used to form a constraint on the distribution of the encoded task variables, making the encoded task variables more effective on new tasks. Experiments on meta-RL tasks show that the newly proposed method outperforms existing meta-RL algorithms.

IJCAI Conference 2021 Conference Paper

Fast Pareto Optimization for Subset Selection with Dynamic Cost Constraints

  • Chao Bian
  • Chao Qian
  • Frank Neumann
  • Yang Yu

Subset selection with cost constraints is a fundamental problem with various applications such as influence maximization and sensor placement. The goal is to select a subset from a ground set to maximize a monotone objective function such that a monotone cost function is upper bounded by a budget. Previous algorithms with bounded approximation guarantees include the generalized greedy algorithm, POMC and EAMC, all of which can achieve the best known approximation guarantee. In real-world scenarios, the resources often vary, i.e., the budget often changes over time, requiring the algorithms to adapt their solutions quickly. However, when the budget changes dynamically, all three of these algorithms either achieve arbitrarily bad approximation guarantees or require a long running time. In this paper, we propose a new algorithm, FPOMC, that combines the merits of the generalized greedy algorithm and POMC: FPOMC introduces a greedy selection strategy into POMC. We prove that FPOMC efficiently maintains the best known approximation guarantee.
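The generalized greedy baseline that FPOMC borrows its selection strategy from can be sketched as below. This is a minimal illustration, not the paper's FPOMC, and it omits the usual final comparison with the best single item; the coverage instance and names are hypothetical:

```python
def generalized_greedy(ground_set, f, c, budget):
    """Repeatedly add the item with the best marginal-gain /
    marginal-cost ratio while the cost stays within the budget."""
    S, remaining = [], set(ground_set)
    while remaining:
        best, best_ratio = None, float("-inf")
        for v in remaining:
            dc = c(S + [v]) - c(S)  # marginal cost of adding v
            if c(S + [v]) > budget or dc <= 0:
                continue
            ratio = (f(S + [v]) - f(S)) / dc
            if ratio > best_ratio:
                best, best_ratio = v, ratio
        if best is None:  # no feasible item left
            break
        S.append(best)
        remaining.remove(best)
    return S

# Toy maximum-coverage instance: cover {'a','b','c'} with at most 2 sets.
sets = {1: {'a', 'b'}, 2: {'b'}, 3: {'c'}}
f = lambda S: len(set().union(*(sets[i] for i in S)))  # covered elements
c = lambda S: len(S)                                   # unit costs
chosen = generalized_greedy(sets, f, c, budget=2)
```

A dynamic budget is exactly where this fixed-time greedy struggles: when the budget changes, it must restart from scratch, whereas FPOMC maintains a population it can adapt.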

AAAI Conference 2021 Short Paper

Incorporating Bidirection-Interactive Information and Semantic Features for Relational Facts Extraction (Student Abstract)

  • Yang Yu
  • Guohua Wang
  • Haopeng Ren
  • Yi Cai

The interaction between named entity recognition and relation classification is essential for the extraction of relational triplets. However, most joint extraction works consider only unidirectional interaction between the two subtasks, or even neglect the interactive information entirely. To tackle these problems, we propose a novel unified joint extraction model that considers bidirection-interactive information between the two subtasks. Our model consists of two modules. The first module utilizes a Bi-LSTM and a GCN to capture the sequential and structure-semantic features of a sentence. The second module utilizes two layers to capture bidirection-interactive information between the two subtasks and generates relational triplets accordingly. The experimental results show that our proposed model outperforms the state-of-the-art models on two public datasets.

AAAI Conference 2021 Short Paper

LB-DESPOT: Efficient Online POMDP Planning Considering Lower Bound in Action Selection (Student Abstract)

  • Chenyang Wu
  • Rui Kong
  • Guoyu Yang
  • Xianghan Kong
  • Zongzhang Zhang
  • Yang Yu
  • Dong Li
  • Wulong Liu

Partially observable Markov decision process (POMDP) is an extension of MDP. It handles state uncertainty by specifying the probability of receiving a particular observation given the current state. DESPOT is one of the most popular scalable online planning algorithms for POMDPs; it manages to significantly reduce the size of the decision tree while deriving a near-optimal policy by considering only K scenarios. Nevertheless, there is a gap in action selection criteria between planning and execution in DESPOT. During the planning stage, it keeps choosing the action with the highest upper bound, whereas when planning ends, the action with the highest lower bound is chosen for execution. Here, we propose LB-DESPOT, which utilizes the lower bound when selecting an action branch to expand, to alleviate this issue. Empirically, our method attains better performance than DESPOT and POMCP, another state-of-the-art method, on several challenging POMDP benchmark tasks.

NeurIPS Conference 2021 Conference Paper

Offline Model-based Adaptable Policy Learning

  • Xiong-Hui Chen
  • Yang Yu
  • Qingyang Li
  • Fan-Ming Luo
  • Zhiwei Qin
  • Wenjie Shang
  • Jieping Ye

In reinforcement learning, a promising direction to avoid online trial-and-error costs is learning from an offline dataset. Current offline reinforcement learning methods commonly learn in the policy space constrained to in-support regions by the offline dataset, in order to ensure the robustness of the outcome policies. Such constraints, however, also limit the potential of the outcome policies. In this paper, to release the potential of offline policy learning, we investigate the decision-making problems in out-of-support regions directly and propose offline Model-based Adaptable Policy LEarning (MAPLE). By this approach, instead of learning in in-support regions, we learn an adaptable policy that can adapt its behavior in out-of-support regions when deployed. We conduct experiments on MuJoCo controlling tasks with offline datasets. The results show that the proposed method can make robust decisions in out-of-support regions and achieve better performance than SOTA algorithms.

NeurIPS Conference 2021 Conference Paper

Regret Minimization Experience Replay in Off-Policy Reinforcement Learning

  • Xu-Hui Liu
  • Zhenghai Xue
  • Jingcheng Pang
  • Shengyi Jiang
  • Feng Xu
  • Yang Yu

In reinforcement learning, experience replay stores past samples for further reuse. Prioritized sampling is a promising technique to better utilize these samples. Previous criteria of prioritization include TD error, recentness and corrective feedback, which are mostly heuristically designed. In this work, we start from the regret minimization objective, and obtain an optimal prioritization strategy for Bellman update that can directly maximize the return of the policy. The theory suggests that data with higher hindsight TD error, better on-policiness and more accurate Q value should be assigned with higher weights during sampling. Thus most previous criteria only consider this strategy partially. We not only provide theoretical justifications for previous criteria, but also propose two new methods to compute the prioritization weight, namely ReMERN and ReMERT. ReMERN learns an error network, while ReMERT exploits the temporal ordering of states. Both methods outperform previous prioritized sampling algorithms in challenging RL benchmarks, including MuJoCo, Atari and Meta-World.
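The hindsight-TD-error ingredient of this prioritization can be illustrated with a standard proportional-prioritization sketch in the style of prioritized experience replay. This is a toy baseline only; the actual ReMERN/ReMERT weights also incorporate on-policiness and Q-value accuracy, which are omitted here:

```python
import numpy as np

def priority_weights(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: p_i is proportional to
    (|delta_i| + eps) ** alpha, normalized into a distribution."""
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()

def importance_weights(probs, n, beta=0.4):
    """Importance-sampling correction so the prioritized Bellman
    update stays (approximately) unbiased; normalized by the max."""
    w = (n * probs) ** (-beta)
    return w / w.max()

td = np.array([0.1, 2.0, 0.5, 0.0])       # hindsight TD errors of 4 samples
probs = priority_weights(td)              # sampling distribution over the buffer
w = importance_weights(probs, n=len(td))  # per-sample loss weights
```

Transitions with larger hindsight TD error are sampled more often, and the IS weights shrink their per-sample gradient contribution in compensation.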

JBHI Journal 2021 Journal Article

Wrist Torque Estimation via Electromyographic Motor Unit Decomposition and Image Reconstruction

  • Yang Yu
  • Chen Chen
  • Xinjun Sheng
  • Xiangyang Zhu

Neural interfaces using decomposed motor units (MUs) from surface electromyography (sEMG) have allowed non-invasive access to neural control signals and provide a novel approach for intuitive human-machine interaction. However, most of the existing methods based on decomposed MUs merely adopt the discharge rate (DR) as the feature representation, which may lack local information around the discharge instant and ignore the subtle interactions of different MUs. In this study, we proposed an MU-specific image-based scheme for wrist torque estimation. Specifically, the high-density sEMG signals were decoded into motor unit spike trains (MUSTs), and then MU-specific images were reconstructed from the MUSTs and the corresponding motor unit action potentials (MUAPs). A convolutional neural network was used to learn representative features from the MU-specific images automatically, and further to estimate wrist torques. The results demonstrated that the proposed method outperformed three conventional regression approaches and a deep-learning regression approach using DR features, with estimation accuracies (R²) of 0.82 ± 0.09 and 0.89 ± 0.06, and nRMSE of 12.6 ± 2.5% and 11.0 ± 3.1%, for pronation/supination and flexion/extension, respectively. Further, the features extracted from the MU-specific images showed a higher correlation with the recorded torques than DR, indicating the effectiveness of the proposed method. The outcomes of this study provide a novel and promising perspective for the intuitive control of neural interfacing.

AAAI Conference 2020 Conference Paper

An Efficient Evolutionary Algorithm for Subset Selection with General Cost Constraints

  • Chao Bian
  • Chao Feng
  • Chao Qian
  • Yang Yu

In this paper, we study the problem of selecting a subset from a ground set to maximize a monotone objective function f such that a monotone cost function c is bounded by an upper limit. State-of-the-art algorithms include the generalized greedy algorithm and POMC. The former is an efficient fixed-time algorithm, but its performance is limited by its greedy nature. The latter is an anytime algorithm that can find better subsets using more time, but without any polynomial-time approximation guarantee. We propose a new anytime algorithm, EAMC, which employs a simple evolutionary algorithm to optimize a surrogate objective integrating f and c. We prove that EAMC achieves the best known approximation guarantee in polynomial expected running time. Experimental results on the applications of maximum coverage, influence maximization and sensor placement show the excellent performance of EAMC.

NeurIPS Conference 2020 Conference Paper

Error Bounds of Imitating Policies and Environments

  • Tian Xu
  • Ziniu Li
  • Yang Yu

Imitation learning trains a policy by mimicking expert demonstrations. Various imitation methods have been proposed and empirically evaluated; meanwhile, their theoretical understanding needs further study. In this paper, we first analyze the value gap between the expert policy and imitated policies for two imitation methods, behavioral cloning and generative adversarial imitation. The results support that generative adversarial imitation can reduce the compounding errors compared to behavioral cloning, and thus has a better sample complexity. Noting that, by considering the environment transition model as a dual agent, imitation learning can also be used to learn the environment model, we further analyze the performance of imitating environments based on the bounds for imitating policies. The results show that environment models can be more effectively imitated by generative adversarial imitation than by behavioral cloning, suggesting a novel application of adversarial imitation for model-based reinforcement learning. We hope these results could inspire future advances in imitation learning and model-based reinforcement learning.
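In the discounted setting with effective horizon $1/(1-\gamma)$, the headline comparison between the two methods is commonly stated as follows (a standard form of these value-gap bounds, with $\epsilon$ the imitation error on the expert distribution; see the paper for the precise assumptions and constants):

$$\left|V^{\pi_E} - V^{\pi_{\mathrm{BC}}}\right| = O\!\left(\frac{\epsilon}{(1-\gamma)^2}\right), \qquad \left|V^{\pi_E} - V^{\pi_{\mathrm{GAIL}}}\right| = O\!\left(\frac{\epsilon}{1-\gamma}\right),$$

i.e., generative adversarial imitation avoids the quadratic compounding of errors over the effective horizon that behavioral cloning suffers.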

NeurIPS Conference 2020 Conference Paper

Offline Imitation Learning with a Misspecified Simulator

  • Shengyi Jiang
  • Jingcheng Pang
  • Yang Yu

In real-world decision-making tasks, learning an optimal policy without a trial-and-error process is an appealing challenge. When expert demonstrations are available, imitation learning that mimics expert actions can learn a good policy efficiently. Learning in simulators is another commonly adopted approach to avoid real-world trial and error. However, neither sufficient expert demonstrations nor high-fidelity simulators are easy to obtain. In this work, we investigate policy learning under the condition of a few expert demonstrations and a simulator with misspecified dynamics. Under the mild assumption that local states remain partially aligned under a dynamics mismatch, we propose imitation learning with horizon-adaptive inverse dynamics (HIDIL), which matches simulator states with expert states over an $H$-step horizon and accurately recovers actions based on inverse dynamics policies. In the real environment, HIDIL can effectively derive adapted actions from the matched states. Experiments are conducted in four MuJoCo locomotion environments with modified friction, gravity, and density configurations. The results show that HIDIL achieves significant improvement in terms of performance and stability in all of the real environments, compared with imitation learning methods and transfer methods in reinforcement learning.

NeurIPS Conference 2020 Conference Paper

RetroXpert: Decompose Retrosynthesis Prediction Like A Chemist

  • Chaochao Yan
  • Qianggang Ding
  • Peilin Zhao
  • Shuangjia Zheng
  • Jinyu Yang
  • Yang Yu
  • Junzhou Huang

Retrosynthesis is the process of recursively decomposing target molecules into available building blocks. It plays an important role in solving problems in organic synthesis planning. To automate or assist in the retrosynthesis analysis, various retrosynthesis prediction algorithms have been proposed. However, most of them are cumbersome and lack interpretability about their predictions. In this paper, we devise a novel template-free algorithm for automatic retrosynthetic expansion inspired by how chemists approach retrosynthesis prediction. Our method disassembles retrosynthesis into two steps: i) identify the potential reaction center of the target molecule through a novel graph neural network and generate intermediate synthons, and ii) generate the reactants associated with synthons via a robust reactant generation model. While outperforming the state-of-the-art baselines by a significant margin, our model also provides chemically reasonable interpretation.

NeurIPS Conference 2019 Conference Paper

Bridging Machine Learning and Logical Reasoning by Abductive Learning

  • Wang-Zhou Dai
  • Qiuling Xu
  • Yang Yu
  • Zhi-Hua Zhou

Perception and reasoning are two representative abilities of intelligence that are integrated seamlessly during human problem-solving processes. In the area of artificial intelligence (AI), the two abilities are usually realised by machine learning and logic programming, respectively. However, the two categories of techniques were developed separately throughout most of the history of AI. In this paper, we present the abductive learning targeted at unifying the two AI paradigms in a mutually beneficial way, where the machine learning model learns to perceive primitive logic facts from data, while logical reasoning can exploit symbolic domain knowledge and correct the wrongly perceived facts for improving the machine learning models. Furthermore, we propose a novel approach to optimise the machine learning model and the logical reasoning model jointly. We demonstrate that by using abductive learning, machines can learn to recognise numbers and resolve unknown mathematical operations simultaneously from images of simple hand-written equations. Moreover, the learned models can be generalised to longer equations and adapted to different tasks, which is beyond the capability of state-of-the-art deep learning models.

IJCAI Conference 2019 Conference Paper

Cascaded Algorithm-Selection and Hyper-Parameter Optimization with Extreme-Region Upper Confidence Bound Bandit

  • Yi-Qi Hu
  • Yang Yu
  • Jun-Da Liao

An automatic machine learning (AutoML) task is to select the best algorithm and its hyper-parameters simultaneously. Previously, the hyper-parameters of all algorithms were joined into a single search space, which is not only huge but also redundant, because many hyper-parameter dimensions are irrelevant to the selected algorithm. In this paper, we propose a cascaded approach for algorithm selection and hyper-parameter optimization. While a search procedure is employed at the level of hyper-parameter optimization, a bandit strategy runs at the level of algorithm selection to allocate the budget based on search feedback. Since the bandit is required to select the algorithm with the maximum performance, instead of the average performance, we propose the extreme-region upper confidence bound (ER-UCB) strategy, which focuses on the extreme region of the underlying feedback distribution. We show theoretically that ER-UCB has a regret upper bound of O(K ln n) with independent feedbacks, which is as efficient as the classical UCB bandit. We also conduct experiments on a synthetic problem as well as a set of AutoML tasks. The results verify the effectiveness of the proposed method.
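The classical UCB strategy that ER-UCB is compared against can be sketched as below. This implements plain UCB1 only; ER-UCB itself replaces the empirical mean with a statistic of the extreme region of the feedback distribution, which is not reproduced here, and the two-armed example is hypothetical:

```python
import math
import random

def ucb1(pulls, horizon, rng=random.Random(0)):
    """Plain UCB1: play the arm maximizing empirical mean plus an
    exploration bonus; returns per-arm pull counts."""
    K = len(pulls)
    counts, sums = [0] * K, [0.0] * K
    for t in range(horizon):
        if t < K:
            arm = t  # initialize: play each arm once
        else:
            arm = max(range(K),
                      key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t + 1) / counts[i]))
        r = pulls[arm](rng)
        counts[arm] += 1
        sums[arm] += r
    return counts

# Two toy "algorithms" with feedback means 0.25 and 0.75.
arms = [lambda rng: rng.random() * 0.5,
        lambda rng: 0.5 + rng.random() * 0.5]
counts = ucb1(arms, horizon=2000)
```

The AutoML motivation for the extreme-region variant is that the budget should favor the algorithm whose *best* configurations score highest, not the one with the best average feedback, which plain UCB1 targets.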

AAAI Conference 2019 Conference Paper

Multi-Fidelity Automatic Hyper-Parameter Tuning via Transfer Series Expansion

  • Yi-Qi Hu
  • Yang Yu
  • Wei-Wei Tu
  • Qiang Yang
  • Yuqiang Chen
  • Wenyuan Dai

Automatic machine learning (AutoML) aims at automatically choosing the best configuration for machine learning tasks. However, a configuration evaluation can be very time-consuming, particularly on learning tasks with large datasets. This limitation usually restrains derivative-free optimization from releasing its full power for a fine configuration search using many evaluations. To alleviate this limitation, in this paper, we propose a derivative-free optimization framework for AutoML using multi-fidelity evaluations. It uses many low-fidelity evaluations on small data subsets and very few high-fidelity evaluations on the full dataset. However, the low-fidelity evaluations can be badly biased and need to be corrected at only a very low cost. We thus propose Transfer Series Expansion (TSE), which learns the low-fidelity correction predictor efficiently by linearly combining a set of base predictors. The base predictors can be obtained cheaply from down-scaled and experienced tasks. Experimental results on real-world AutoML problems verify that the proposed framework can significantly accelerate derivative-free configuration search by making use of the multi-fidelity evaluations.
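The linear-combination idea behind TSE can be sketched with ordinary least squares on toy data. The base predictors and the synthetic residual below are illustrative assumptions only; in the paper, the base predictors come from down-scaled, experienced tasks:

```python
import numpy as np

# Hypothetical base predictors mapping a low-fidelity score to a
# predicted correction; TSE obtains these cheaply from other tasks.
base_predictors = [lambda x: x, lambda x: x ** 2, lambda x: np.ones_like(x)]

def fit_tse_weights(x_low, residual):
    """Learn the correction predictor as a linear combination of the
    base predictors via least squares on observed residuals."""
    Phi = np.stack([g(x_low) for g in base_predictors], axis=1)
    w, *_ = np.linalg.lstsq(Phi, residual, rcond=None)
    return w

def corrected(x_low, w):
    """Low-fidelity score plus the predicted correction."""
    Phi = np.stack([g(x_low) for g in base_predictors], axis=1)
    return x_low + Phi @ w

# Toy ground truth: high-fidelity score = low-fidelity score + 0.3x - 0.1,
# so only a handful of paired evaluations are needed to fit the weights.
rng = np.random.default_rng(1)
x = rng.uniform(0.0, 1.0, size=20)
residual = 0.3 * x - 0.1
w = fit_tse_weights(x, residual)
```

Because only the few combination weights are fitted, very few expensive high-fidelity evaluations suffice to calibrate the correction.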

AAAI Conference 2019 Conference Paper

On Reinforcement Learning for Full-Length Game of StarCraft

  • Zhen-Jia Pang
  • Ruo-Ze Liu
  • Zhou-Yu Meng
  • Yi Zhang
  • Yang Yu
  • Tong Lu

StarCraft II poses a grand challenge for reinforcement learning. The main difficulties include a huge state space, a varying action space, and a long horizon. In this paper, we investigate a set of reinforcement learning techniques for the full-length game of StarCraft II. We investigate a hierarchical approach, where the hierarchy involves two levels of abstraction. One is the macro-actions extracted from experts' demonstration trajectories, which can reduce the action space by an order of magnitude while remaining effective. The other is a two-layer hierarchical architecture, which is modular and easy to scale. We also investigate a curriculum transfer learning approach that trains the agent from the simplest opponent to harder ones. On a 64×64 map with restrictive units, we train the agent on a single machine with 4 GPUs and 48 CPU threads. We achieve a winning rate of more than 99% against the difficulty level-1 built-in AI. Through the curriculum transfer learning algorithm and a mixture of combat models, we achieve over a 93% winning rate against the most difficult non-cheating built-in AI (level-7) within days. We hope this study can shed some light on future research in large-scale reinforcement learning.

IJCAI Conference 2019 Conference Paper

Reinforcement Learning Experience Reuse with Policy Residual Representation

  • WenJi Zhou
  • Yang Yu
  • Yingfeng Chen
  • Kai Guan
  • Tangjie Lv
  • Changjie Fan
  • Zhi-Hua Zhou

Experience reuse is key to sample-efficient reinforcement learning. One of the critical issues is how the experience is represented and stored. Previously, experience has been stored in the form of features, individual models, or the average model, each lying at a different granularity. However, new tasks may require experience across multiple granularities. In this paper, we propose the policy residual representation (PRR) network, which can extract and store multiple levels of experience. The PRR network is trained on a set of tasks with a multi-level architecture, where a module in each level corresponds to a subset of the tasks. Therefore, the PRR network represents the experience in a spectrum-like way. When training on a new task, PRR can provide different levels of experience to accelerate the learning. We experiment with the PRR network on a set of grid-world navigation tasks, locomotion tasks, and fighting tasks in a video game. The results show that the PRR network leads to better reuse of experience and thus outperforms some state-of-the-art approaches.

AAMAS Conference 2019 Conference Paper

Reinforcement Learning with Derivative-Free Exploration

  • Xiong-Hui Chen
  • Yang Yu

Effective exploration is key to sample-efficient reinforcement learning. While the most popular general approaches to exploration (e.g., ϵ-greedy) are still of low efficiency, derivative-free optimization offers efficient exploration mechanisms for better global search, which reinforcement learning usually desires. In this paper, we introduce a derivative-free exploration method called DFE as a general, efficient exploration method for early-stage reinforcement learning. DFE overcomes the optimization inefficiency and poor scalability of purely derivative-free-optimization-based reinforcement learning methods. Our experiments explore trajectories with DFE in the deterministic off-policy method DDPG and the stochastic off-policy method ACER, applied to Atari and MuJoCo, which represent a high-dimensional discrete-action environment and a continuous control environment; the results show that DFE is an efficient and general exploration method.

AAAI Conference 2019 Conference Paper

Virtual-Taobao: Virtualizing Real-World Online Retail Environment for Reinforcement Learning

  • Jing-Cheng Shi
  • Yang Yu
  • Qing Da
  • Shi-Yong Chen
  • An-Xiang Zeng

Applying reinforcement learning in physical-world tasks is extremely challenging. It is commonly infeasible to sample a large number of trials, as required by current reinforcement learning methods, in a physical environment. This paper reports our project on using reinforcement learning for better commodity search in Taobao, one of the largest online retail platforms and, meanwhile, a physical environment with a high sampling cost. Instead of training reinforcement learning in Taobao directly, we present our environment-building approach: we build Virtual-Taobao, a simulator learned from historical customer behavior data, and then train policies in Virtual-Taobao with no physical sampling cost. To improve the simulation precision, we propose GAN-SD (GAN for Simulating Distributions) for customer feature generation with a better-matched distribution, and MAIL (Multiagent Adversarial Imitation Learning) for generating more generalizable customer actions. To further avoid overfitting the imperfections of the simulator, we propose the ANC (Action Norm Constraint) strategy to regularize the policy model. In experiments, Virtual-Taobao is trained from hundreds of millions of real Taobao customers' records. Compared with the real Taobao, Virtual-Taobao faithfully recovers important properties of the real environment. We further show, through online A/B tests, that policies trained purely in Virtual-Taobao, at zero physical sampling cost, can achieve significantly superior real-world performance to traditional supervised approaches. We hope this work may shed some light on applying reinforcement learning in complex physical environments.

IJCAI Conference 2018 Conference Paper

Approximation Guarantees of Stochastic Greedy Algorithms for Subset Selection

  • Chao Qian
  • Yang Yu
  • Ke Tang

Subset selection is a fundamental problem in many areas, which aims to select the best subset of size at most $k$ from a universe. Greedy algorithms are widely used for subset selection, and have shown good approximation performance in deterministic situations. However, their behavior is stochastic in many realistic situations (e.g., large-scale and noisy ones). For general stochastic greedy algorithms, bounded approximation guarantees were previously obtained only for subset selection with monotone submodular objective functions, while real-world applications often involve non-monotone or non-submodular objective functions and can be subject to constraints more general than a size constraint. This work proves their approximation guarantees in these cases, and thus largely extends the applicability of stochastic greedy algorithms.

IJCAI Conference 2018 Conference Paper

Experienced Optimization with Reusable Directional Model for Hyper-Parameter Search

  • Yi-Qi Hu
  • Yang Yu
  • Zhi-Hua Zhou

Hyper-parameter selection is a crucial yet difficult issue in machine learning. For this problem, derivative-free optimization has been playing an irreplaceable role. However, derivative-free optimization commonly requires many hyper-parameter samples, while each sample can have a high cost due to the costly evaluation of a learning model. To tackle this issue, in this paper we propose an experienced optimization approach, i.e., learning how to optimize better from a set of historical optimization processes. From the historical optimization processes on previous datasets, a directional model is trained to predict the direction toward the next good hyper-parameter. The directional model is then reused to guide the optimization on new datasets. We implement this mechanism within SRacos, a state-of-the-art derivative-free optimization method, and conduct experiments on learning the hyper-parameters of heterogeneous ensembles and neural network architectures. Experimental results verify that the proposed approach can significantly improve the learning accuracy within a limited hyper-parameter sample budget.

IJCAI Conference 2018 Conference Paper

Learning Environmental Calibration Actions for Policy Self-Evolution

  • Chao Zhang
  • Yang Yu
  • Zhi-Hua Zhou

Reinforcement learning in the physical world is often expensive. Simulators are commonly employed to train policies. Due to simulation error, policies trained in a simulator are hard to deploy directly in the physical world. Therefore, how to efficiently reuse these policies in the real environment is a key issue. To address this issue, this paper presents a policy self-evolution process: in the target environment, the agent first executes a few calibration actions to perceive the environment, and then reuses the previous policies according to the observations. In this way, the mission of policy learning in the target environment is reduced to the task of environment identification through the calibration actions, which needs far fewer samples than learning a policy from scratch. We propose the POSEC (POlicy Self-Evolution by Calibration) approach, which learns the most informative calibration actions for policy self-evolution. Taking three robotic arm control tasks as test beds, we show that the proposed method can learn a good policy for a new arm with only a few (e.g., five) samples of the target environment.

IJCAI Conference 2018 Conference Paper

Mixture of GANs for Clustering

  • Yang Yu
  • Wen-Ji Zhou

For data clustering, the Gaussian mixture model (GMM) is a typical method that trains several Gaussian models to capture the data. Each Gaussian model then provides the distribution information of one cluster. For clustering of high-dimensional and complex data, models more flexible than Gaussians are desired. Recently, generative adversarial networks (GANs) have shown effectiveness in capturing complex data distributions. Therefore, a GAN mixture model (GANMM) would be a promising alternative to GMM. However, we notice that the non-flexibility of the Gaussian model is essential to the expectation-maximization procedure for training GMM. A GAN has much higher flexibility, which disables the commonly employed expectation-maximization procedure, since the maximization can no longer change the result of the expectation. In this paper, we propose the epsilon-expectation-maximization procedure for training GANMM. Experiments show that the proposed GANMM performs well on complex data as well as simple data.

NeurIPS Conference 2018 Conference Paper

Multi-Layered Gradient Boosting Decision Trees

  • Ji Feng
  • Yang Yu
  • Zhi-Hua Zhou

Multi-layered distributed representation is believed to be the key ingredient of deep neural networks, especially in cognitive tasks like computer vision. While non-differentiable models such as gradient boosting decision trees (GBDTs) are still the dominant methods for modeling discrete or tabular data, it is hard to equip them with such representation learning ability. In this work, we propose the multi-layered GBDT forest (mGBDTs), with an explicit emphasis on exploring the ability to learn hierarchical distributed representations by stacking several layers of regression GBDTs as its building blocks. The model can be jointly trained by a variant of target propagation across layers, without the need for backpropagation or differentiability. Experiments confirm the effectiveness of the model in terms of performance and representation learning ability.

AAAI Conference 2018 Conference Paper

Noisy Derivative-Free Optimization With Value Suppression

  • Hong Wang
  • Hong Qian
  • Yang Yu

Derivative-free optimization has shown advantages in solving sophisticated problems such as policy search, when the environment is noise-free. Many real-world environments are noisy, where solution evaluations are inaccurate due to the noise. Noisy evaluation can badly injure derivative-free optimization, as it may make a worse solution look better. Sampling is a straightforward way to reduce noise, while previous studies have shown that delaying the noise handling to the comparison time point (i.e., threshold selection) can be helpful for derivative-free optimization. This work further delays the noise handling, and proposes a simple noise-handling mechanism, i.e., value suppression. With value suppression, we do nothing about the noise until the best-so-far solution has not been improved for a period, and then suppress the value of the best-so-far solution and continue the optimization. On synthetic problems as well as reinforcement learning tasks, experiments verify that value suppression can be significantly more effective than previous methods.

IJCAI Conference 2018 Conference Paper

Towards Sample Efficient Reinforcement Learning

  • Yang Yu

Reinforcement learning is a major tool to realize intelligent agents that can be autonomously adaptive to the environment. With deep models, reinforcement learning has shown great potential in complex tasks such as playing games from pixels. However, current reinforcement learning techniques still suffer from requiring a huge amount of interaction data, which can result in unbearable costs in real-world applications. In this article, we share our understanding of the problem, and discuss possible ways to alleviate the sample cost of reinforcement learning, from the aspects of exploration, optimization, environment modeling, experience transfer, and abstraction. We also discuss some challenges in real-world applications, with the hope of inspiring future research.

IJCAI Conference 2017 Conference Paper

AGRA: An Analysis-Generation-Ranking Framework for Automatic Abbreviation from Paper Titles

  • Jianbing Zhang
  • Yixin Sun
  • Shujian Huang
  • Cam-Tu Nguyen
  • Xiaoliang Wang
  • Xinyu Dai
  • Jiajun Chen
  • Yang Yu

People sometimes choose word-like abbreviations to refer to items with long descriptions. These abbreviations usually come from the descriptive text of the item and are easy to remember and pronounce, while preserving the key idea of the item. Coming up with a good abbreviation is not an easy job, even for humans. Previous assistant naming systems compose names by applying hand-written rules, which may not perform well. In this paper, we propose to view the naming task as an artificial intelligence problem and create a dataset in the domain of academic naming. To generate more delicate names, we propose a three-step framework comprising description analysis, candidate generation, and abbreviation ranking, each of which is parameterized and optimizable. We conduct experiments comparing different settings of our framework with several analysis approaches from different perspectives. Compared to online and baseline systems, our framework achieves the best results.

IJCAI Conference 2017 Conference Paper

Binary Linear Compression for Multi-label Classification

  • Wen-Ji Zhou
  • Yang Yu
  • Min-Ling Zhang

In multi-label classification tasks, labels are commonly related to each other. It has been well recognized that utilizing label relationships is essential to multi-label learning. One way to utilize label relationships is to map labels to a lower-dimensional space of uncorrelated labels, where the relationships can be encoded in the mapping. Previous linear mapping methods commonly result in regression subproblems in the lower-dimensional label space. In this paper, we disclose that mapping to a low-dimensional multi-label regression problem can be worse than mapping to a classification problem, since regression requires a more complex model than classification. We then propose the binary linear compression (BILC) method, which results in a binary label space, leading to classification subproblems. Experiments on several multi-label datasets show that employing classification in the embedded space results in much simpler models than regression, leading to smaller structural risk. The proposed method is also shown to be superior to some state-of-the-art approaches.

IJCAI Conference 2017 Conference Paper

Life-Stage Modeling by Customer-Manifold Embedding

  • Jing-Wen Yang
  • Yang Yu
  • Xiao-Peng Zhang

A person experiences different stages throughout life, causing dramatically varying behavior patterns. In applications such as online shopping, it has been observed that customer behaviors are largely affected by their life stages and evolve over time. Although this phenomenon has been recognized previously, very few studies have tried to model the life stage and make use of it. In this paper, we propose to discover a latent space, called the customer-manifold, on which a position corresponds to a customer stage. The customer-manifold allows us to train a static prediction model that captures dynamic customer behavior patterns. We further embed the learned customer-manifold into a neural network model as a hidden-layer output, resulting in an efficient and accurate customer behavior prediction system. We apply this system to online-shopping recommendation. Experiments on real-world data show that taking the customer-manifold into account can improve the performance of the recommender system. Moreover, visualization of the customer-manifold space may also be helpful for understanding evolving customer behaviors.

IJCAI Conference 2017 Conference Paper

On Subset Selection with General Cost Constraints

  • Chao Qian
  • Jing-Cheng Shi
  • Yang Yu
  • Ke Tang

This paper considers the subset selection problem with a monotone objective function and a monotone cost constraint, which relaxes the submodularity assumption of previous studies. We first show that the approximation ratio of the generalized greedy algorithm is $\frac{\alpha}{2}(1 - \frac{1}{e^{\alpha}})$ (where $\alpha$ is the submodularity ratio); we then propose POMC, an anytime randomized iterative approach that can utilize more time to find better solutions than the generalized greedy algorithm. We show that POMC can obtain the same general approximation guarantee as the generalized greedy algorithm, but can achieve better solutions in some cases and applications.
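The generalized greedy baseline that POMC is compared against can be sketched as follows: repeatedly add the affordable element with the best marginal gain per unit of marginal cost. This is a minimal illustration only; the full algorithm analyzed in such work typically also compares the greedy result against the best single feasible element, which is omitted here.

```python
def generalized_greedy(universe, f, cost, budget):
    """Greedy subset selection under a monotone cost budget.

    f and cost each take a set and return a number; both are assumed
    monotone.  Each round adds the element maximizing marginal gain
    divided by marginal cost among those still within budget.
    """
    selected = set()
    while True:
        best, best_ratio = None, 0.0
        for v in universe - selected:
            cand = selected | {v}
            dc = cost(cand) - cost(selected)
            if cost(cand) > budget or dc <= 0:
                continue  # unaffordable or degenerate cost increase
            ratio = (f(cand) - f(selected)) / dc
            if ratio > best_ratio:
                best, best_ratio = v, ratio
        if best is None:
            return selected  # nothing affordable improves f
        selected.add(best)
```

With a cardinality cost ($\mathrm{cost}(S) = |S|$) this reduces to the ordinary greedy algorithm for size-constrained subset selection.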

IJCAI Conference 2017 Conference Paper

Open Category Classification by Adversarial Sample Generation

  • Yang Yu
  • Wei-Yang Qu
  • Nan Li
  • Zimin Guo

In real-world classification tasks, it is difficult to collect training samples from all possible categories of the environment. Therefore, when an instance of an unseen class appears at prediction time, a robust classifier should be able to tell that it is from an unseen class, instead of classifying it into any known category. In this paper, adopting the idea of adversarial learning, we propose the ASG framework for open-category classification. ASG generates positive and negative samples of seen categories in an unsupervised manner via an adversarial learning strategy. With the generated samples, ASG then learns to tell seen from unseen in a supervised manner. Experiments performed on several datasets show the effectiveness of ASG.

IJCAI Conference 2017 Conference Paper

Optimizing Ratio of Monotone Set Functions

  • Chao Qian
  • Jing-Cheng Shi
  • Yang Yu
  • Ke Tang
  • Zhi-Hua Zhou

This paper considers the problem of minimizing the ratio of two set functions, i.e., $f/g$. Previous work assumed both functions to be monotone and submodular, while we consider a more general situation where $g$ is not necessarily submodular. We derive that the greedy approach GreedRatio, as a fixed-time algorithm, achieves a $\frac{|X^*|}{(1+(|X^*| - 1)(1 - \kappa_f))\gamma(g)}$ approximation ratio, which also improves the previous bound for submodular $g$. If more time can be spent, we present the PORM algorithm, an anytime randomized iterative approach minimizing $f$ and $-g$ simultaneously. We show that PORM using reasonable time has the same general approximation guarantee as GreedRatio, but can achieve better solutions in some cases and applications.

AAAI Conference 2017 Conference Paper

Sequential Classification-Based Optimization for Direct Policy Search

  • Yi-Qi Hu
  • Hong Qian
  • Yang Yu

Direct policy search often results in high-quality policies in complex reinforcement learning problems; it employs an optimization algorithm to search the parameters of the policy for maximizing its total reward. Classification-based optimization is a recently developed framework for derivative-free optimization, which has been shown to be effective and efficient for non-convex optimization problems with many local optima, and may provide a powerful optimization tool for direct policy search. However, this framework requires sampling a batch of solutions for every update of the search model, while in reinforcement learning the environment often offers only sequential policy evaluations. Thus classification-based optimization may not be efficient for direct policy search, where solutions have to be sampled sequentially. In this paper, we adapt classification-based optimization to sequentially sampled solutions by forming the sample batch via reusing historical solutions. Experiments on a helicopter hovering task and control tasks in OpenAI Gym show that the new algorithm significantly improves performance over several state-of-the-art derivative-free optimization approaches.

AAAI Conference 2017 Conference Paper

Solving High-Dimensional Multi-Objective Optimization Problems with Low Effective Dimensions

  • Hong Qian
  • Yang Yu

Multi-objective (MO) optimization problems require simultaneously optimizing two or more objective functions. An MO algorithm needs to find solutions that reach different optimal balances of the objective functions, i.e., the optimal Pareto front; therefore, high dimensionality of the solution space can hurt MO optimization much more severely than single-objective optimization, which was little addressed in previous studies. This paper proposes ReMO, a general, theoretically grounded yet simple approach that can scale current derivative-free MO algorithms to high-dimensional non-convex MO functions with low effective dimensions, using random embedding. We prove the conditions under which an MO function has a low effective dimension, and for such functions, we prove that ReMO possesses the desirable properties of optimal Pareto front preservation, time complexity reduction, and rotation perturbation invariance. Experimental results indicate that ReMO is effective for optimizing high-dimensional MO functions with low effective dimensions, and is even effective for high-dimensional MO functions where all dimensions are effective but most have only a small and bounded effect on the function value.

NeurIPS Conference 2017 Conference Paper

Subset Selection under Noise

  • Chao Qian
  • Jing-Cheng Shi
  • Yang Yu
  • Ke Tang
  • Zhi-Hua Zhou

The problem of selecting the best $k$-element subset from a universe is involved in many applications. While previous studies assumed a noise-free environment or a noisy monotone submodular objective function, this paper considers a more realistic and general situation where the evaluation of a subset is a noisy monotone function (not necessarily submodular), with both multiplicative and additive noises. To understand the impact of the noise, we first show the approximation ratios of the greedy algorithm and POSS, two powerful algorithms for noise-free subset selection, in noisy environments. We then propose to incorporate a noise-aware strategy into POSS, resulting in the new PONSS algorithm. We prove that PONSS can achieve a better approximation ratio under some assumptions, such as an i.i.d. noise distribution. Empirical results on influence maximization and sparse regression problems show the superior performance of PONSS.

AAMAS Conference 2016 Conference Paper

Boosting Nonparametric Policies

  • Yang Yu
  • Peng-Fei Hou
  • Qing Da
  • Yu Qian

Learning complex policies is a key step toward real-world applications of reinforcement learning. While boosting approaches have been widely applied in state-of-the-art supervised learning techniques to adaptively learn nonparametric functions, boosting-style approaches have been little investigated in reinforcement learning. Only a few pieces of previous work explored this direction; however, their theoretical properties are still unclear and their empirical performance is quite limited. In this paper, we propose the PolicyBoost method. It optimizes a finite-sample objective function, which leads to maximization of the expected total reward, by employing the GradientBoost approach. Experimental results verify the effectiveness as well as the robustness of PolicyBoost, even without feature engineering.

AAAI Conference 2016 Conference Paper

Decentralized Robust Subspace Clustering

  • Bo Liu
  • Xiao-Tong Yuan
  • Yang Yu
  • Qingshan Liu
  • Dimitris Metaxas

We consider the problem of subspace clustering using the SSC (Sparse Subspace Clustering) approach, which has several desirable theoretical properties and has been shown to be effective in various computer vision applications. We develop a large-scale distributed framework for the computation of SSC via an alternating direction method of multipliers (ADMM) algorithm. The proposed framework solves SSC in column blocks and only involves parallel multivariate Lasso regression subproblems and sample-wise operations. This appealing property allows us to allocate multiple cores/machines for the processing of individual column blocks. We evaluate our algorithm on a shared-memory architecture. Experimental results on real-world datasets confirm that the proposed block-wise ADMM framework is substantially more efficient than its matrix counterpart used by SSC, without sacrificing accuracy. Moreover, our approach is directly applicable to decentralized neighborhood selection for Gaussian graphical model structure estimation.

IJCAI Conference 2016 Conference Paper

Derivative-Free Optimization of High-Dimensional Non-Convex Functions by Sequential Random Embeddings

  • Hong Qian
  • Yi-Qi Hu
  • Yang Yu

Derivative-free optimization methods are suitable for sophisticated optimization problems, but are hard to scale to high dimensionality (e.g., larger than 1,000). Previously, the random embedding technique has been shown to be successful for solving high-dimensional problems with low effective dimensions. However, it is unrealistic to assume a low effective dimension in many applications. This paper studies high-dimensional problems with low optimal epsilon-effective dimensions, which allow all dimensions to be effective but many of them to have only a small, bounded effect. We characterize the properties of random embedding for this kind of problem, and propose sequential random embeddings (SRE) to reduce the embedding gap while running optimization algorithms in the low-dimensional spaces. We apply SRE to several state-of-the-art derivative-free optimization methods, and conduct experiments on synthetic functions as well as non-convex classification tasks with up to 100,000 variables. Experimental results verify the effectiveness of SRE.
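The basic random embedding trick underlying this line of work is simple to state: draw a random Gaussian matrix $A \in \mathbb{R}^{D \times d}$ and optimize $g(y) = f(Ay)$ over the low-dimensional $y$ instead of $f$ over the high-dimensional $x$. A minimal single-embedding sketch is below (SRE itself additionally chains several such embeddings sequentially, which is not shown):

```python
import numpy as np

def random_embedding_objective(f, D, d, rng=None):
    """Wrap a D-dimensional objective f for search in d dimensions.

    A random Gaussian matrix A maps a low-dimensional point y to the
    high-dimensional point A @ y at which f is evaluated; any
    derivative-free optimizer can then run in the d-dimensional space.
    """
    rng = rng or np.random.default_rng(0)
    A = rng.standard_normal((D, d))
    return lambda y: f(A @ y)
```

Any off-the-shelf optimizer can then be pointed at the returned callable with a $d$-dimensional search space, e.g. `g = random_embedding_objective(f, D=100_000, d=10)`.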

AAAI Conference 2016 Conference Paper

Derivative-Free Optimization via Classification

  • Yang Yu
  • Hong Qian
  • Yi-Qi Hu

Many randomized heuristic derivative-free optimization methods share a framework that iteratively learns a model of promising search areas and samples solutions from the model. This paper studies a particular setting of this framework, where the model is implemented by a classification model discriminating good solutions from bad ones. This setting allows a general theoretical characterization, in which factors critical to the optimization are discovered. We also prove that optimization problems with local Lipschitz continuity can be solved in polynomial time by proper configurations of this framework. Following the critical factors, we propose the randomized coordinate shrinking classification algorithm to learn the model, forming the RACOS algorithm, for optimization in continuous and discrete domains. Experiments on testing functions as well as on machine learning tasks, including spectral clustering and classification with the ramp loss, demonstrate the effectiveness of RACOS.

AAAI Conference 2016 Conference Paper

MicroScholar: Mining Scholarly Information from Chinese Microblogs

  • Yang Yu
  • Xiaojun Wan

For many researchers, one of the biggest issues is the lack of an efficient method to keep up with the latest academic progress in related research fields. We notice that many researchers tend to share their research progress or recommend scholarly information on their microblogs. In order to exploit microblogging to benefit scientific research, we build a system called MicroScholar to automatically collect and mine scholarly information from Chinese microblogs. In this paper, we briefly introduce the system framework and focus on the component for scholarly microblog categorization. Several kinds of features are used in this component, and experimental results demonstrate their usefulness.

IJCAI Conference 2016 Conference Paper

Parallel Pareto Optimization for Subset Selection

  • Chao Qian
  • Jing-Cheng Shi
  • Yang Yu
  • Ke Tang
  • Zhi-Hua Zhou

Subset selection, which selects a few variables from a large set, is a fundamental problem in many areas. The recently emerged Pareto Optimization for Subset Selection (POSS) method is a powerful approximation solver for this problem. However, POSS is not readily parallelizable, restricting its large-scale applications on modern computing architectures. In this paper, we propose PPOSS, a parallel version of POSS. Our theoretical analysis shows that PPOSS has good properties for parallelization while preserving the approximation quality: when the number of processors is limited (less than the total number of variables), the running time of PPOSS can be reduced almost linearly with respect to the number of processors; with an increasing number of processors, the running time can be further reduced, eventually to a constant. Empirical studies verify the effectiveness of PPOSS, and moreover suggest that the asynchronous implementation is more efficient, with little quality loss.

AAAI Conference 2016 Conference Paper

Scaling Simultaneous Optimistic Optimization for High-Dimensional Non-Convex Functions with Low Effective Dimensions

  • Hong Qian
  • Yang Yu

Simultaneous optimistic optimization (SOO) is a recently proposed global optimization method with a strong theoretical foundation. Previous studies have shown that SOO performs well in low-dimensional optimization problems; however, its performance is unsatisfactory when the dimensionality is high. This paper adapts random embedding to scale SOO, resulting in the RESOO algorithm. We prove that the simple regret of RESOO depends only on the effective dimension of the problem, while that of SOO depends on the dimension of the solution space. Empirically, on some high-dimensional non-convex testing functions as well as hyper-parameter tuning tasks for multi-class support vector machines, RESOO shows significantly improved performance over SOO.

IJCAI Conference 2015 Conference Paper

On Constrained Boolean Pareto Optimization

  • Chao Qian
  • Yang Yu
  • Zhi-Hua Zhou

Pareto optimization solves a constrained optimization task by reformulating the task as a bi-objective problem. Pareto optimization has been shown to be quite effective in applications; however, it has little theoretical support. This work theoretically compares Pareto optimization with the penalty approach, a common method that transforms a constrained optimization problem into an unconstrained one. We prove that on two large classes of constrained Boolean optimization problems, minimum matroid optimization (P-solvable) and minimum cost coverage (NP-hard), Pareto optimization is more efficient than the penalty function method for obtaining optimal and approximate solutions, respectively. Furthermore, on a minimum cost coverage instance, we also show the advantage of Pareto optimization over a greedy algorithm.

AAAI Conference 2015 Conference Paper

Pareto Ensemble Pruning

  • Chao Qian
  • Yang Yu
  • Zhi-Hua Zhou

Ensemble learning is among the state-of-the-art learning techniques, which trains and combines many base learners. Ensemble pruning removes some of the base learners of an ensemble and has been shown to further improve generalization performance. However, the two goals of ensemble pruning, i.e., maximizing generalization performance and minimizing the number of base learners, can conflict when pushed to the limit. Most previous ensemble pruning approaches optimize objectives that mix the two goals. In this paper, motivated by recent theoretical advances in evolutionary optimization, we investigate solving the two goals explicitly in a bi-objective formulation and propose the PEP (Pareto Ensemble Pruning) approach. We show that PEP not only achieves significantly better performance than state-of-the-art approaches but also gains theoretical support.

NeurIPS Conference 2015 Conference Paper

Subset Selection by Pareto Optimization

  • Chao Qian
  • Yang Yu
  • Zhi-Hua Zhou

Selecting the optimal subset from a large set of variables is a fundamental problem in various learning tasks such as feature selection, sparse regression, and dictionary learning. In this paper, we propose the POSS approach, which employs evolutionary Pareto optimization to find a small-sized subset with good performance. We prove that for sparse regression, POSS is able to efficiently achieve the best-so-far theoretically guaranteed approximation performance. In particular, for the Exponential Decay subclass, POSS is proven to achieve an optimal solution. Empirical studies verify the theoretical results and exhibit the superior performance of POSS over greedy and convex relaxation methods.
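As a concrete illustration of the bi-objective idea, here is a minimal POSS-style loop (a sketch, not the authors' implementation) for maximizing a set function under a size constraint. For simplicity the archive keeps only the best subset seen at each size up to k, whereas the full method also retains somewhat larger intermediate solutions; the example objective is a toy maximum-coverage function.

```python
import random

def poss(f, n, k, iters=5000, seed=0):
    """POSS-style subset selection sketch: treat (maximize f(S),
    minimize |S|) as a bi-objective problem and evolve an archive of
    non-dominated solutions by bit-wise mutation."""
    rng = random.Random(seed)
    front = {0: (frozenset(), f(frozenset()))}   # size -> (subset, value)
    for _ in range(iters):
        s, _ = rng.choice(list(front.values()))
        # flip each of the n membership bits independently with prob 1/n
        s2 = frozenset(i for i in range(n) if (i in s) != (rng.random() < 1.0 / n))
        if len(s2) > k:                          # discard oversized offspring
            continue
        v2 = f(s2)
        if len(s2) not in front or v2 > front[len(s2)][1]:
            front[len(s2)] = (s2, v2)            # offspring is non-dominated
    return max(front.values(), key=lambda p: p[1])

# toy maximum coverage: pick at most k=3 of these sets to cover most elements
sets = [{0, 1, 2}, {2, 3}, {3, 4, 5}, {6, 7}, {7, 8, 9}, {0, 9}]
cover = lambda S: len(set().union(*(sets[i] for i in S)))
best, value = poss(cover, n=len(sets), k=3)
```

On this instance the optimum under the size constraint covers 9 of the 10 elements (covering all 10 requires four sets); the loop reliably finds it because the archive preserves good partial solutions at every size.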

AAAI Conference 2014 Conference Paper

Learning with Augmented Class by Exploiting Unlabeled Data

  • Qing Da
  • Yang Yu
  • Zhi-Hua Zhou

In many real-world applications of learning, the environment is open and changes gradually, which requires the learning system to be able to detect and adapt to the changes. Class-incremental learning (C-IL) is an important and practical problem in which data from unseen augmented classes are fed to the system, but it has not been well studied in the past. In C-IL, the system should beware of predicting instances from augmented classes as belonging to a seen class, and thus faces the challenge that no such instances were observed during the training stage. In this paper, we tackle the challenge by using unlabeled data, which can be cheaply collected in many real-world applications. We propose the LACU framework as well as the LACU-SVM approach to learn the concept of seen classes while incorporating the structure present in the unlabeled data, so that the misclassification risks among the seen classes as well as between the augmented and the seen classes are minimized simultaneously. Experiments on diverse datasets show the effectiveness of the proposed approach.

IJCAI Conference 2013 Conference Paper

On the Approximation Ability of Evolutionary Optimization with Application to Minimum Set Cover: Extended Abstract

  • Yang Yu
  • Xin Yao
  • Zhi-Hua Zhou

Evolutionary algorithms (EAs) are a large family of heuristic optimization algorithms inspired by natural phenomena, and are often used in practice to obtain satisficing instead of optimal solutions. In this work, we investigate a largely underexplored issue: the approximation performance of EAs in terms of how close the obtained solution is to an optimal solution. We study an EA framework named simple EA with isolated population (SEIP) that can be implemented as a single- or multi-objective EA. We present general approximation results for SEIP, and specifically on the minimum set cover problem, we find that SEIP achieves the currently best-achievable approximation ratio. Moreover, on an instance class of the k-set cover problem, we disclose how SEIP can overcome the difficulty that limits the greedy algorithm.

IJCAI Conference 2011 Conference Paper

Diversity Regularized Machine

  • Yang Yu
  • Yu-Feng Li
  • Zhi-Hua Zhou

Ensemble methods, which train multiple learners for a task, are among the state-of-the-art learning approaches. The diversity of the component learners has been recognized as a key to a good ensemble, and existing ensemble methods try different ways to encourage diversity, mostly by heuristics. In this paper, we propose the diversity regularized machine (DRM) in a mathematical programming framework, which efficiently generates an ensemble of diverse support vector machines (SVMs). Theoretical analysis discloses that the diversity constraint used in DRM can lead to an effective reduction on its hypothesis space complexity, implying that the diversity control in ensemble methods indeed plays a role of regularization as in popular statistical learning approaches. Experiments show that DRM can significantly improve generalization ability and is superior to some state-of-the-art SVM ensemble methods.

AAAI Conference 2006 Conference Paper

A New Approach to Estimating the Expected First Hitting Time of Evolutionary Algorithms

  • Yang Yu

The expected first hitting time is an important issue in theoretical analyses of evolutionary algorithms since it implies the average computational time complexity. In this paper, by exploiting the relationship between the convergence rate and the expected first hitting time, a new approach to estimating the expected first hitting time is proposed. This approach is then applied to four evolutionary algorithms which involve operators of mutation, mutation with population, mutation with recombination, and time-variant mutation, respectively. The results show that the proposed approach is helpful for analyzing a broad range of evolutionary algorithms.