Arrow Research search

Author name cluster

Bo An

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

162 papers
1 author row

Possible papers

162

TMLR Journal 2026 Journal Article

A Tighter Bound for Reward Learning in Reinforcement Learning from Human Feedback

  • Guoxi Chen
  • Xing Chen
  • Bo An
  • Ya Zhang

As a key component of reinforcement learning from human feedback (RLHF), reward learning directly influences the final learned policy. Unfortunately, existing theoretical estimation error bounds in reward learning rely on the complexity of the reward function class, unattainable optimal parameters, or non-zero constants independent of sample size, leading to uncomputable bounds that are meaningless for reward function classes with unknown complexity. To address this issue, this paper presents an analysis of parameter estimation for reward learning in RLHF under general function approximation, without imposing restrictions on the complexity of the reward function class. A tighter bound is provided without non-zero terms independent of the sample size. The optimal parameters are eliminated by applying linear approximation around the learned parameters. Additionally, the relationship between the preference dataset and the learned parameters is further examined to demonstrate how to efficiently collect data based on the current learned parameters. Inspired by the theoretical results, a novel offline RLHF algorithm with parameter constraints is proposed, restricting parameters to the valid space defined by the dataset. Furthermore, an online RLHF algorithm is proposed to iteratively optimize parameter learning and improve data collection efficiency. This work provides a tighter bound than previous studies and offers theoretical guidance for online data collection under general function approximation.
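The object whose estimation error is being bounded is the reward model fit to pairwise preferences. A minimal numpy sketch of the standard Bradley–Terry preference loss that underlies reward learning in RLHF (generic, with illustrative linear features and weights; not the paper's analysis):

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Negative log-likelihood of the Bradley-Terry preference model:
    P(chosen > rejected) = sigmoid(r_chosen - r_rejected)."""
    margin = np.asarray(r_chosen) - np.asarray(r_rejected)
    # -log sigmoid(margin), written stably via logaddexp
    return np.mean(np.logaddexp(0.0, -margin))

# Toy linear reward r(x) = w . phi(x); the loss shrinks as the learned
# reward separates preferred from dispreferred responses.
phi_chosen = np.array([[1.0, 0.0], [0.8, 0.2]])
phi_rejected = np.array([[0.0, 1.0], [0.2, 0.8]])
w = np.array([1.0, -1.0])
loss = preference_loss(phi_chosen @ w, phi_rejected @ w)
```

The paper's contribution concerns how tightly the parameters minimizing this kind of loss can be bounded as a function of sample size alone.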

AAAI Conference 2026 Conference Paper

ArchetypeTrader: Reinforcement Learning for Selecting and Refining Learnable Strategic Archetypes in Quantitative Trading

  • Chuqiao Zong
  • Molei Qin
  • Haochong Xia
  • Bo An

Quantitative trading using mathematical models and automated execution to generate trading decisions has been widely applied across financial markets. Recently, reinforcement learning (RL) has emerged as a promising approach for developing profitable trading strategies, especially in highly volatile markets like cryptocurrency. However, existing RL methods for cryptocurrency trading face two critical drawbacks: 1) Prior RL algorithms segment markets using handcrafted indicators (e.g., trend or volatility) to train specialized sub-policies. However, these coarse labels oversimplify market dynamics into rigid categories, biasing policies toward obvious patterns like trend-following and neglecting nuanced but lucrative opportunities. 2) Current RL methods fail to systematically use demonstration data. While some approaches ignore demonstrations altogether, others rely on “optimal” yet overly granular trajectories or human-crafted strategies, both of which can overwhelm learning and introduce significant bias, resulting in high variance and significant profit losses. To address these problems, we propose ArchetypeTrader, a novel reinforcement learning framework that automatically selects and refines data-driven trading archetypes distilled from demonstrations. The framework operates in three phases: 1) We use dynamic programming (DP) to generate representative expert trajectories and train a vector-quantized encoder-decoder architecture to distill these demonstrations into discrete, reusable strategic archetypes through self-supervised learning, capturing nuanced market-behavior patterns without human heuristics. 2) We then train an RL agent to select contextually appropriate archetypes from the learned codebook and reconstruct action sequences for the upcoming horizons, effectively performing demonstration-guided strategy reuse. 3) We finally train a policy adapter that leverages hindsight-informed rewards to dynamically refine the archetype actions based on real-time market observations and performance, enabling more fine-grained decision-making and yielding profitable and robust trading strategies. Extensive experiments on four popular cryptocurrency trading pairs demonstrate that ArchetypeTrader significantly outperforms state-of-the-art approaches in both profit generation and risk management.
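The codebook lookup at the core of phase 2 is ordinary vector quantization: map a context embedding to its nearest archetype. A generic VQ sketch (toy codebook; not ArchetypeTrader's trained encoder):

```python
import numpy as np

def select_archetype(context, codebook):
    """Nearest-codebook-entry lookup: the vector-quantization step that
    maps a market context embedding to a discrete strategic archetype."""
    dists = np.linalg.norm(codebook - context, axis=1)
    return int(dists.argmin())

codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])  # 3 archetypes
idx = select_archetype(np.array([0.9, 0.1]), codebook)
```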

IS Journal 2026 Journal Article

From Symbols to Synapses: The Reemergence of Agency in the Large Language Model Era

  • Bo An

Building intelligent agents capable of autonomously perceiving, reasoning, and acting to achieve goals has been a central pursuit of artificial intelligence (AI) since its inception. For decades, the notion of agency was dominated by symbolic architectures that represent information as formal knowledge and use deliberative reasoning to derive rational actions, offering reliability but at the cost of brittleness and limited generalizability. The recent advancements in large language models (LLMs) and their integration into tool-using, environment-interacting “agentic” systems have reignited interest in AI agents. However, while LLM-based agents provide the flexibility that symbolic systems lacked, they introduce new challenges in reliability and control. We posit that the future of AI agents lies not in indefinitely scaling the model size, but in synthesizing the methods and theories developed by the autonomous agents and multiagent systems community with modern neural architectures to create neuro-symbolic agents capable of trustworthy autonomy.

AAAI Conference 2026 Conference Paper

GDBA Revisited: Unleashing the Power of Guided Local Search for Distributed Constraint Optimization

  • Yanchen Deng
  • Xinrun Wang
  • Bo An

Local search is an important class of incomplete algorithms for solving Distributed Constraint Optimization Problems (DCOPs) but it often converges to poor local optima. While Generalized Distributed Breakout Algorithm (GDBA) provides a comprehensive rule set to escape premature convergence, its empirical benefits remain marginal on general-valued problems. In this work, we systematically examine GDBA and identify three factors that potentially lead to its inferior performance, i.e., over-aggressive constraint violation conditions, unbounded penalty accumulation, and uncoordinated penalty updates. To address these issues, we propose Distributed Guided Local Search (DGLS), a novel GLS framework for DCOPs that incorporates an adaptive violation condition to selectively penalize constraints with high cost, a penalty evaporation mechanism to control the magnitude of penalization, and a synchronization scheme for coordinated penalty updates. We theoretically show that the penalty values are bounded, and agents play a potential game in DGLS. Extensive empirical results on various benchmarks demonstrate the great superiority of DGLS over state-of-the-art baselines. Compared to Damped Max-sum with high damping factors, our DGLS achieves competitive performance on general-valued problems, and outperforms by significant margins on structured problems in terms of anytime results.
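The three fixes (selective penalization, bounded penalties via evaporation, coordinated updates) read naturally as modifications to a plain guided-local-search loop. A minimal single-agent Python sketch on a toy cost landscape (DGLS itself is distributed, with synchronized multi-agent penalty updates; all values below are illustrative):

```python
import random

def guided_local_search(cost, neighbors, start, steps=200,
                        penalty_weight=0.5, evaporation=0.9, seed=0):
    """Toy guided local search: minimize the penalized cost
    cost(x) + w * penalty[x], penalizing the current state each step
    and evaporating all penalties so their magnitude stays bounded."""
    rng = random.Random(seed)
    penalty = {}
    x, best = start, start
    for _ in range(steps):
        cand = neighbors(x, rng)
        aug = lambda y: cost(y) + penalty_weight * penalty.get(y, 0.0)
        x = min(cand + [x], key=aug)
        if cost(x) < cost(best):
            best = x
        # Penalize the current state to push the search off local optima...
        penalty[x] = penalty.get(x, 0.0) + 1.0
        # ...then evaporate every penalty to keep them bounded.
        for k in penalty:
            penalty[k] *= evaporation
    return best

# Landscape with a local optimum (cost 1 at index 2) and a global one
# (cost 0 at index 8); plain hill climbing from 0 stalls at index 2.
costs = [5, 3, 1, 4, 6, 4, 2, 1, 0, 3]
best = guided_local_search(lambda i: costs[i],
                           lambda i, rng: [max(i - 1, 0), min(i + 1, 9)],
                           start=0)
```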

IS Journal 2026 Journal Article

Graph-Augmented Large Language Model Agents: Current Progress and Future Prospects

  • Yixin Liu
  • Guibin Zhang
  • Kun Wang
  • Shiyuan Li
  • Shirui Pan
  • Bo An

Autonomous agents based on large language models (LLMs) have demonstrated impressive capabilities in numerous real-world applications. While most LLMs are limited in several key agentic procedures, graphs can serve as a powerful auxiliary structure to enhance structure, continuity, and coordination in complex agent workflows. Given the rapid growth and fragmentation of research on Graph-augmented LLM Agents (GLA), this article offers a timely and comprehensive overview of recent advances and highlights key directions for future work. Specifically, we categorize existing GLA methods by their primary functions in LLM agent systems, including planning, memory, and tool usage, and then analyze how graphs and graph learning algorithms contribute to each. For multiagent systems, we further discuss how GLA solutions facilitate the orchestration, efficiency optimization, and trustworthiness of MAS. Finally, we highlight key future directions to advance this field, from improving structural adaptability to enabling unified, scalable, and multimodal GLA systems.

TMLR Journal 2026 Journal Article

MEMO: Memory-Guided Diffusion for Expressive Talking Video Generation

  • Longtao Zheng
  • Yifan Zhang
  • Hanzhong Guo
  • Jiachun Pan
  • Zhenxiong Tan
  • Jiahao Lu
  • Chuanxin Tang
  • Bo An

Recent advances in video diffusion models have unlocked new potential for realistic audio-driven talking video generation. However, maintaining long-term identity consistency, achieving seamless lip-audio synchronization, and producing natural, audio-aligned expressions in generated talking videos remain significant challenges. To address these challenges, we propose Memory-guided EMOtion-aware diffusion (MEMO), an end-to-end audio-driven portrait animation approach to generate identity-consistent and expressive talking videos. Our approach is built around two key modules: (1) a memory-guided temporal module, which enhances long-term identity consistency and motion smoothness by developing causal motion memory to store information from an extended past context to guide temporal modeling; and (2) an emotion-aware audio module, which replaces traditional cross attention with multi-modal attention to enhance audio-video interaction, while detecting emotions from audio to refine facial expressions via emotion-adaptive layer norm. Extensive quantitative and qualitative results demonstrate that MEMO generates more realistic talking videos across diverse image and audio types, outperforming state-of-the-art methods in overall quality, lip-audio synchronization, identity consistency, and expression-audio alignment. Our model and video demos are available at https://memoavatar.github.io.

IS Journal 2026 Journal Article

Top 10 Most Influential Papers in Artificial Intelligence Since 2000

  • Bo An
  • Sarit Kraus
  • Michael Wooldridge

To commemorate the 70th anniversary of artificial intelligence (AI), IEEE Intelligent Systems has identified the 10 most impactful articles in AI published since 2000, marking the field’s evolution from its 1956 origins into a foundational pillar of modern science. The selection process combined rigorous quantitative indicators with the informed judgment of a panel of distinguished experts. The resulting list reflects a broad consensus on the milestones that have redefined AI, spanning the diverse and multifaceted landscape of the discipline. Together, these landmark articles laid the foundations for modern AI and continue to influence its evolution.

NeurIPS Conference 2025 Conference Paper

Deciphering the Extremes: A Novel Approach for Pathological Long-tailed Recognition in Scientific Discovery

  • Zhe Zhao
  • Haibin Wen
  • Xianfu Liu
  • Rui Mao
  • Pengkun Wang
  • Liheng Yu
  • Linjiang Chen
  • Bo An

Scientific discovery across diverse fields increasingly grapples with datasets exhibiting pathological long-tailed distributions: a few common phenomena overshadow a multitude of rare yet scientifically critical instances. Unlike standard benchmarks, these scientific datasets often feature extreme imbalance coupled with a modest number of classes and limited overall sample volume, rendering existing long-tailed recognition (LTR) techniques ineffective. Such methods, biased by majority classes or prone to overfitting on scarce tail data, frequently fail to identify the very instances—novel materials, rare disease biomarkers, faint astronomical signals—that drive scientific breakthroughs. This paper introduces a novel, end-to-end framework explicitly designed to address pathological long-tailed recognition in scientific contexts. Our approach synergizes a Balanced Supervised Contrastive Learning (B-SCL) mechanism, which enhances the representation of tail classes by dynamically re-weighting their contributions, with a Smooth Objective Regularization (SOR) strategy that manages the inherent tension between tail-class focus and overall classification performance. We introduce and analyze the real-world ZincFluor chemical dataset ($\mathcal{T}=137.54$) and synthetic benchmarks with controllable extreme imbalances (CIFAR-LT variants). Extensive evaluations demonstrate our method's superior ability to decipher these extremes. Notably, on ZincFluor, our approach achieves a Tail Top-2 accuracy of $66.84\%$, significantly outperforming existing techniques. On CIFAR-10-LT with an imbalance ratio of $1000$ ($\mathcal{T}=100$), our method achieves a tail-class accuracy of $38.99\%$, substantially leading the next best. These results underscore our framework's potential to unlock novel insights from complex, imbalanced scientific datasets, thereby accelerating discovery.
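The basic re-weighting idea behind boosting tail-class contributions can be sketched with inverse-frequency sample weights (a generic sketch, not the paper's exact B-SCL formulation):

```python
import numpy as np

def balanced_weights(labels):
    """Per-sample weights inversely proportional to class frequency,
    so rare (tail) classes contribute as much in aggregate as common
    (head) classes; normalized to sum to 1."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    freq = dict(zip(classes.tolist(), counts.tolist()))
    w = np.array([1.0 / freq[y] for y in labels.tolist()])
    return w / w.sum()

# Four head-class samples (class 0) and one tail-class sample (class 1):
w = balanced_weights([0, 0, 0, 0, 1])
```

With this weighting the single tail sample carries as much total weight as all four head samples combined.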

NeurIPS Conference 2025 Conference Paper

EconGym: A Scalable AI Testbed with Diverse Economic Tasks

  • Qirui Mi
  • Qipeng Yang
  • Zijun Fan
  • Wentian Fan
  • Heyang Ma
  • Chengdong Ma
  • Siyu Xia
  • Bo An

Artificial intelligence (AI) has become a powerful tool for economic research, enabling large-scale simulation and policy optimization. However, applying AI effectively requires simulation platforms for scalable training and evaluation—yet existing environments remain limited to simplified, narrowly scoped tasks, falling short of capturing complex economic challenges such as demographic shifts, multi-government coordination, and large-scale agent interactions. To address this gap, we introduce EconGym, a scalable and modular testbed that connects diverse economic tasks with AI algorithms. Grounded in rigorous economic modeling, EconGym implements 11 heterogeneous role types (e.g., households, firms, banks, governments), their interaction mechanisms, and agent models with well-defined observations, actions, and rewards. Users can flexibly compose economic roles with diverse agent algorithms to simulate rich multi-agent trajectories across 25+ economic tasks for AI-driven policy learning and analysis. Experiments show that EconGym supports diverse and cross-domain tasks—such as coordinating fiscal, pension, and monetary policies—and enables benchmarking across AI, economic methods, and hybrids. Results indicate that richer task composition and algorithm diversity expand the policy space, while AI agents guided by classical economic methods perform best in complex settings. EconGym also scales to 100k agents with high realism and efficiency.

NeurIPS Conference 2025 Conference Paper

Efficient Last-Iterate Convergence in Solving Extensive-Form Games

  • Linjian Meng
  • Tianpei Yang
  • Youzhi Zhang
  • Zhenxing Ge
  • Shangdong Yang
  • Tianyu Ding
  • Wenbin Li
  • Bo An

To establish last-iterate convergence for Counterfactual Regret Minimization (CFR) algorithms in learning a Nash equilibrium (NE) of extensive-form games (EFGs), recent studies reformulate learning an NE of the original EFG as learning the NEs of a sequence of (perturbed) regularized EFGs. Hence, proving last-iterate convergence in solving the original EFG reduces to proving last-iterate convergence in solving (perturbed) regularized EFGs. However, these studies only establish last-iterate convergence for Online Mirror Descent (OMD)-based CFR algorithms instead of Regret Matching (RM)-based CFR algorithms in solving perturbed regularized EFGs, resulting in a poor empirical convergence rate, as RM-based CFR algorithms typically outperform OMD-based CFR algorithms. In addition, as solving multiple perturbed regularized EFGs is required, fine-tuning across multiple perturbed regularized EFGs is infeasible, making parameter-free algorithms highly desirable. This paper shows that CFR$^+$, a classical parameter-free RM-based CFR algorithm, achieves last-iterate convergence in learning an NE of perturbed regularized EFGs. This is the first parameter-free last-iterate convergence for RM-based CFR algorithms in perturbed regularized EFGs. Leveraging CFR$^+$ to solve perturbed regularized EFGs, we obtain Reward Transformation CFR$^+$ (RTCFR$^+$). Importantly, we extend prior work on the parameter-free property of CFR$^+$, enhancing its stability, which is vital for the empirical convergence of RTCFR$^+$. Experiments show that RTCFR$^+$ exhibits a significantly faster empirical convergence rate than existing algorithms that achieve theoretical last-iterate convergence. Interestingly, RTCFR$^+$ shows performance no worse than average-iterate convergence CFR algorithms. It is the first last-iterate convergence algorithm to achieve such performance. Our code is available at https://github.com/menglinjian/NeurIPS-2025-RTCFR.
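The Regret Matching+ update that CFR$^+$ applies at every infoset can be illustrated in normal form: clip cumulative regrets at zero after each update, and play proportionally to the positive regrets. A minimal self-play sketch on a two-player zero-sum matrix game (illustrative game and iteration count):

```python
import numpy as np

def regret_matching_plus(payoff, iters=5000):
    """Self-play with Regret Matching+ on a zero-sum matrix game.
    The '+' is the np.maximum(..., 0.0) clipping of cumulative regrets;
    the returned average strategy approximates a Nash equilibrium."""
    n, m = payoff.shape
    r1, r2 = np.zeros(n), np.zeros(m)
    avg1 = np.zeros(n)
    for _ in range(iters):
        s1 = r1 / r1.sum() if r1.sum() > 0 else np.full(n, 1.0 / n)
        s2 = r2 / r2.sum() if r2.sum() > 0 else np.full(m, 1.0 / m)
        u1 = payoff @ s2          # row player's expected action values
        u2 = -(s1 @ payoff)      # column player's values (zero-sum)
        r1 = np.maximum(r1 + u1 - s1 @ u1, 0.0)  # RM+ clipping
        r2 = np.maximum(r2 + u2 - s2 @ u2, 0.0)
        avg1 += s1
    return avg1 / iters

# Zero-sum game whose unique equilibrium has the row player mix 2/5 : 3/5.
strategy = regret_matching_plus(np.array([[2.0, -1.0], [-1.0, 1.0]]))
```

The paper's results concern the last iterate (here, the final `s1`) rather than this average, which is exactly what makes them stronger than the classical average-iterate guarantee.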

NeurIPS Conference 2025 Conference Paper

Empirical Study on Robustness and Resilience in Cooperative Multi-Agent Reinforcement Learning

  • Simin Li
  • Zihao Mao
  • Hanxiao Li
  • Zonglei Jing
  • Zhuohang Bian
  • Jun Guo
  • Li Wang
  • Zhuoran Han

In cooperative Multi-Agent Reinforcement Learning (MARL), it is a common practice to tune hyperparameters in ideal simulated environments to maximize cooperative performance. However, policies tuned for cooperation often fail to maintain robustness and resilience under real-world uncertainties. Building trustworthy MARL systems requires a deep understanding of \emph{robustness}, which ensures stability under uncertainties, and \emph{resilience}, the ability to recover from disruptions—a concept extensively studied in control systems but largely overlooked in MARL. In this paper, we present a large-scale empirical study comprising over 82,620 experiments to evaluate cooperation, robustness, and resilience in MARL across 4 real-world environments, 13 uncertainty types, and 15 hyperparameters. Our key findings are: (1) Under mild uncertainty, optimizing cooperation improves robustness and resilience, but this link weakens as perturbations intensify. Robustness and resilience also vary by algorithm and uncertainty type. (2) Robustness and resilience do not generalize across uncertainty modalities or agent scopes: policies robust to action noise for all agents may fail under observation noise on a single agent. (3) Hyperparameter tuning is critical for trustworthy MARL: surprisingly, standard practices like parameter sharing, GAE, and PopArt can hurt robustness, while early stopping, high critic learning rates, and Leaky ReLU consistently help. By optimizing hyperparameters only, we observe substantial improvement in cooperation, robustness and resilience across all MARL backbones, with the phenomenon also generalizing to robust MARL methods across these backbones.

AAMAS Conference 2025 Conference Paper

Enhancing Sub-Optimal Trajectory Stitching: Spatial Composition RvS for Offline RL

  • Sheng Zang
  • Zhiguang Cao
  • Bo An
  • Senthilnath Jayavelu
  • Xiaoli Li

Reinforcement learning via supervised learning (RvS) has emerged as a burgeoning paradigm for offline reinforcement learning (RL). While return-conditioned RvS (RvS-R) predominates across a wide range of datasets pertaining to the offline RL tasks, recent findings suggest that goal-conditioned RvS (RvS-G) outperforms in specific sub-optimal datasets where trajectory stitching is crucial for achieving optimal performance. However, the underlying reasons for this superiority remain insufficiently explored. In this paper, employing didactic experiments and theoretical analysis, we reveal that the proficiency of RvS-G in stitching trajectories arises from its adeptness in generalizing to unknown goals during evaluation. Building on this insight, we introduce a novel RvS-G approach, Spatial Composition RvS (SC-RvS), to enhance its ability to generalize to unknown goals. This, in turn, augments the trajectory stitching performance on sub-optimal datasets. Specifically, by harnessing the power of advantage weight and maximum-entropy regularized weight, our approach adeptly balances the promotion of optimistic goal sampling with the preservation of a nuanced level of pessimism in action selection compared to existing RvS-G methods. Extensive experimental results on D4RL benchmarks show that our SC-RvS performed favorably against the baselines in most cases, especially on the sub-optimal datasets that demand trajectory stitching.

NeurIPS Conference 2025 Conference Paper

Establishing Linear Surrogate Regret Bounds for Convex Smooth Losses via Convolutional Fenchel–Young Losses

  • Yuzhou Cao
  • Han Bao
  • Lei Feng
  • Bo An

Surrogate regret bounds, also known as excess risk bounds, bridge the gap between the convergence rates of surrogate and target losses. The regret transfer is lossless if the surrogate regret bound is linear. While convex smooth surrogate losses are appealing in particular due to the efficient estimation and optimization, the community has long believed in a trade-off between loss smoothness and linear regret bounds. In this scenario, the better optimization and estimation properties of convex smooth surrogate losses may inevitably deteriorate after undergoing the regret transfer onto a target loss. We overcome this dilemma for arbitrary discrete target losses by constructing a convex smooth surrogate loss, which entails a linear surrogate regret bound composed with a tailored prediction link. The construction is based on Fenchel--Young losses generated by the convolutional negentropy, which are equivalent to the infimal convolution of a generalized negentropy and the target Bayes risk. Consequently, the infimal convolution enables us to derive a smooth loss while maintaining the surrogate regret bound linear. We additionally benefit from the infimal convolution to have a consistent estimator of the underlying class probability. Our results are overall a novel demonstration of how convex analysis penetrates into optimization and statistical efficiency in risk minimization.
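The central objects can be sketched in standard convex-analysis notation (these are the textbook definitions the abstract builds on, not the paper's exact construction):

```latex
% Fenchel--Young loss generated by a convex regularizer \Omega:
L_{\Omega}(\theta; y) \;=\; \Omega^{*}(\theta) + \Omega(y)
  - \langle \theta, y \rangle \;\ge\; 0.

% Infimal convolution of \Omega with another convex function R:
(\Omega \,\square\, R)(\mu) \;=\;
  \inf_{\nu}\, \big[\, \Omega(\nu) + R(\mu - \nu) \,\big].

% Conjugacy turns infimal convolution into addition:
(\Omega \,\square\, R)^{*} \;=\; \Omega^{*} + R^{*}.
```

The last identity is why convolving the negentropy with the target Bayes risk can yield a smooth surrogate: the conjugate of the convolution inherits regularity from both summands.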

NeurIPS Conference 2025 Conference Paper

Group-in-Group Policy Optimization for LLM Agent Training

  • Lang Feng
  • Zhenghai Xue
  • Tingcong Liu
  • Bo An

Recent advances in group-based reinforcement learning (RL) have driven frontier large language models (LLMs) in single-turn tasks like mathematical reasoning. However, their scalability to multi-turn LLM agent training remains limited. Unlike static tasks, agent-environment interactions unfold over many steps and often yield sparse or delayed rewards, making credit assignment across individual steps significantly more challenging. In this work, we propose Group-in-Group Policy Optimization (GiGPO), a novel RL algorithm that achieves fine-grained credit assignment for LLM agents while preserving the appealing properties of group-based RL: critic-free, low memory, and stable convergence. GiGPO introduces a two-level structure for estimating relative advantage: (i) At the episode-level, GiGPO computes macro relative advantages based on groups of complete trajectories; (ii) At the step-level, GiGPO introduces an anchor state grouping mechanism that retroactively constructs step-level groups by identifying repeated environment states across trajectories. Actions stemming from the same state are grouped together, enabling micro relative advantage estimation. This hierarchical structure effectively captures both global trajectory quality and local step effectiveness without relying on auxiliary models or additional rollouts. We evaluate GiGPO on challenging agent benchmarks, including ALFWorld and WebShop, as well as tool-integrated reasoning on search-augmented QA tasks, using Qwen2.5-1.5B/3B/7B-Instruct. Crucially, GiGPO delivers fine-grained per-step credit signals, achieves performance gains of > 12\% on ALFWorld and > 9\% on WebShop over GRPO, and obtains superior performance on QA tasks (42.1\% on 3B and 47.2\% on 7B): all while maintaining the same GPU memory overhead, identical LLM rollout, and incurring little to no additional time cost.
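The group-relative advantage at both levels is the same computation: normalize each reward against its own group's mean and standard deviation. A minimal sketch (group keys stand in for either whole-episode groups or anchor-state step groups; not GiGPO's full pipeline):

```python
import math
from collections import defaultdict

def group_relative_advantages(items):
    """Normalize each reward against its own group's mean/std (the
    GRPO-style relative advantage). `items` is a list of
    (group_key, reward) pairs; returns advantages in input order."""
    groups = defaultdict(list)
    for key, r in items:
        groups[key].append(r)
    stats = {}
    for key, rs in groups.items():
        mean = sum(rs) / len(rs)
        var = sum((r - mean) ** 2 for r in rs) / len(rs)
        stats[key] = (mean, math.sqrt(var) + 1e-8)  # eps avoids div-by-zero
    return [(r - stats[k][0]) / stats[k][1] for k, r in items]

# Step-level grouping: actions taken from the same anchor state ("s0",
# "s1") are compared only against each other.
adv = group_relative_advantages([("s0", 1.0), ("s0", 0.0),
                                 ("s1", 5.0), ("s1", 5.0)])
```

Actions from a group with identical rewards (here "s1") receive zero advantage, so only within-group reward differences produce a learning signal.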

NeurIPS Conference 2025 Conference Paper

Improving Reward Models with Proximal Policy Exploration for Preference-Based Reinforcement Learning

  • Yiwen Zhu
  • Jinyi Liu
  • Pengjie Gu
  • Yifu Yuan
  • Zhenxing Ge
  • Wenya Wei
  • Zhou Fang
  • Yujing Hu

Reinforcement learning (RL) heavily depends on well-designed reward functions, which are often biased and difficult to design for complex behaviors. Preference-based RL (PbRL) addresses this by learning reward models from human feedback, but its practicality is constrained by a critical dilemma: while existing methods reduce human effort through query optimization, they neglect the preference buffer's restricted coverage — a factor that fundamentally determines the reliability of the reward model. We systematically demonstrate this limitation creates distributional mismatch: reward models trained on static buffers reliably assess in-distribution trajectories but falter with out-of-distribution (OOD) trajectories from policy exploration. Crucially, such failures in policy-proximal regions directly misguide iterative policy updates. To address this, we propose Proximal Policy Exploration (PPE) with two key components: (1) a proximal-policy extension method that expands exploration in undersampled policy-proximal regions, and (2) a mixture distribution query method that balances in-distribution and OOD trajectory sampling. By enhancing buffer coverage while preserving evaluation accuracy in policy-proximal regions, PPE enables more reliable policy updates. Experiments across continuous control tasks demonstrate that PPE enhances preference feedback utilization efficiency and RL sample efficiency over baselines, highlighting the vital role of preference buffer coverage management in PbRL.

NeurIPS Conference 2025 Conference Paper

Incentivizing LLMs to Self-Verify Their Answers

  • Fuxiang Zhang
  • Jiacheng Xu
  • Chaojie Wang
  • Ce Cui
  • Yang Liu
  • Bo An

Large Language Models (LLMs) have demonstrated remarkable progress in complex reasoning tasks through both post-training and test-time scaling laws. While prevalent test-time scaling approaches are often realized by using external reward models to guide the model generation process, we find that only marginal gains can be acquired when scaling a model post-trained on specific reasoning tasks. We identify that the limited improvement stems from distribution discrepancies between the specific post-trained generator and the general reward model. To address this, we propose a framework that incentivizes LLMs to self-verify their own answers. By unifying answer generation and verification within a single reinforcement learning (RL) process, we train models that can effectively assess the correctness of their own solutions. The trained model can further scale its performance at inference time by verifying its generations, without the need for external verifiers. We train our self-verification models based on Qwen2.5-Math-7B and DeepSeek-R1-Distill-Qwen-1.5B, demonstrating their capabilities across varying reasoning context lengths. Experiments on multiple mathematical reasoning benchmarks show that our models can not only improve post-training performance but also enable effective test-time scaling. Our code is available at https://github.com/mansicer/self-verification.

AAAI Conference 2025 Conference Paper

Influence-Based Fair Selection for Sample-Discriminative Backdoor Attack

  • Qi Wei
  • Shuo He
  • Jiahan Zhang
  • Lei Feng
  • Bo An

Backdoor attacks have posed a serious threat in machine learning models, wherein adversaries can poison training samples with maliciously crafted triggers to compromise the victim model. Advanced backdoor attack methods have focused on selectively poisoning more vulnerable training samples, achieving a higher attack success rate (ASR). However, we found that when the manipulation strength of the trigger is constrained to a very small value for imperceptible attacks, they suffer from extremely uneven class-wise ASR due to the unequal selection of instances per class. To solve this issue, we propose a novel backdoor attack method based on Influence-based Fair Selection (IFS), including two objectives: 1) selecting samples that significantly contribute to ASR and 2) ensuring class balance during the selection process. Specifically, we adapt Influence Functions, a classic technique in robust statistics, to evaluate the influence of trigger-embedded training samples on ASR. In this case, training samples contributing to reducing the backdoored test risk could possess higher influence scores. Further, a group-based pruning strategy is designed to avoid calculating the influence on ASR for all training samples, thereby significantly reducing the computational cost. Then, based on the influence score, we design an adaptive thresholding scheme to dynamically select samples with higher influence while maintaining class balance. Extensive experiments on four datasets verify the effectiveness of IFS compared with advanced methods.

NeurIPS Conference 2025 Conference Paper

Let's Revise Step-by-Step: A Unified Local Search Framework for Code Generation with LLMs

  • Zhiyi Lyu
  • Jianguo Huang
  • Yanchen Deng
  • Steven Hoi
  • Bo An

Large Language Models (LLMs) with inference-time scaling techniques show promise for code generation, yet face notable efficiency and scalability challenges. Construction-based tree-search methods suffer from rapid growth in tree size, high token consumption, and lack of anytime property. In contrast, improvement-based methods offer better performance but often struggle with uninformative reward signals and inefficient search strategies. In this work, we propose $\textbf{ReLoc}$, a unified local search framework which effectively performs step-by-step code revision. Specifically, ReLoc explores a series of local revisions through four key algorithmic components: initial code drafting, neighborhood code generation, candidate evaluation, and incumbent code updating, each of which can be instantiated with specific decision rules to realize different local search algorithms such as Hill Climbing (HC) or Genetic Algorithm (GA). Furthermore, we develop a specialized revision reward model that evaluates code quality based on revision distance to produce fine-grained preferences that guide the local search toward more promising candidates. Finally, our extensive experimental results demonstrate that our approach achieves superior performance across diverse code generation tasks, significantly outperforming both construction-based tree search as well as the state-of-the-art improvement-based code generation methods.
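The four components of the framework (drafting, neighborhood generation, candidate evaluation, incumbent updating) form a standard local-search skeleton. A toy hill-climbing instantiation (in ReLoc the neighbors would come from an LLM and the score from the revision reward model; here both are plain callables on integers):

```python
import random

def local_search_revise(draft, neighbors, score, steps=50, seed=0):
    """Hill-climbing skeleton of revise-step-by-step: draft an initial
    candidate, generate neighborhood revisions, evaluate them, and keep
    the incumbent only when a revision scores strictly better."""
    rng = random.Random(seed)
    incumbent, best = draft, score(draft)
    for _ in range(steps):
        for cand in neighbors(incumbent, rng):
            s = score(cand)
            if s > best:                 # incumbent update rule
                incumbent, best = cand, s
    return incumbent, best

# Toy instance: "revise" an integer toward the maximizer of a concave
# score; small edits (+/-1) plus an occasional larger jump (+/-3).
target = 17
best_x, best_s = local_search_revise(
    draft=0,
    neighbors=lambda x, rng: [x - 1, x + 1, x + rng.choice([-3, 3])],
    score=lambda x: -abs(x - target),
)
```

Swapping the decision rules (e.g. keeping a population of incumbents and recombining them) turns the same skeleton into a genetic algorithm, which is exactly the kind of instantiation the abstract describes.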

TMLR Journal 2025 Journal Article

Leveraging Gradients for Unsupervised Accuracy Estimation under Distribution Shift

  • Renchunzi Xie
  • Ambroise Odonnat
  • Vasilii Feofanov
  • Ievgen Redko
  • Jianfeng Zhang
  • Bo An

Estimating the test performance of a model, possibly under distribution shift, without having access to the ground-truth labels is a challenging, yet very important problem for the safe deployment of machine learning algorithms in the wild. Existing works mostly rely on information from either the outputs or the extracted features of neural networks to estimate a score that correlates with the ground-truth test accuracy. In this paper, we investigate -- both empirically and theoretically -- how the information provided by the gradients can be predictive of the ground-truth test accuracy even under distribution shifts. More specifically, we use the norm of classification-layer gradients, backpropagated from the cross-entropy loss after only one gradient step over test data. Our intuition is that these gradients should be of higher magnitude when the model generalizes poorly. We provide the theoretical insights behind our approach and the key ingredients that ensure its empirical success. Extensive experiments conducted with various architectures on diverse distribution shifts demonstrate that our method significantly outperforms current state-of-the-art approaches. The code is available at \url{https://github.com/Renchunzi-Xie/GdScore}.
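The core recipe admits a compact numpy sketch: one softmax classification layer, cross-entropy against the model's own predictions as targets, and the norm of the resulting weight gradient as the score (a simplification for illustration; the released GdScore implementation differs in detail):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def grad_norm_score(features, W):
    """Norm of the classification-layer gradient of the cross-entropy
    loss after a single backward pass over unlabeled test data, using
    the model's own argmax predictions as pseudo-labels. Larger norms
    suggest the model generalizes worse on this data."""
    probs = softmax(features @ W)
    onehot = np.eye(W.shape[1])[probs.argmax(axis=1)]  # pseudo-labels
    grad_W = features.T @ (probs - onehot) / features.shape[0]
    return float(np.linalg.norm(grad_W))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 8))
score_confident = grad_norm_score(X, np.eye(8)[:, :4] * 10.0)  # sharp logits
score_uncertain = grad_norm_score(X, np.eye(8)[:, :4] * 0.1)   # near-uniform
```

A confident classifier's predictions nearly match its own pseudo-labels, so its gradient norm is small; a near-uniform classifier leaves a large residual and scores higher.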

NeurIPS Conference 2025 Conference Paper

MF-LLM: Simulating Population Decision Dynamics via a Mean-Field Large Language Model Framework

  • Qirui Mi
  • Mengyue Yang
  • Xiangning Yu
  • Zhiyu Zhao
  • Cheng Deng
  • Bo An
  • Haifeng Zhang
  • Xu Chen

Simulating collective decision-making involves more than aggregating individual behaviors; it emerges from dynamic interactions among individuals. While large language models (LLMs) offer strong potential for social simulation, achieving quantitative alignment with real-world data remains a key challenge. To bridge this gap, we propose the \textbf{M}ean-\textbf{F}ield \textbf{LLM} (\textbf{MF-LLM}) framework, the first to incorporate mean field theory into LLM-based social simulation. MF-LLM models bidirectional interactions between individuals and the population through an iterative process, generating population signals to guide individual decisions, which in turn update the signals. This interplay produces coherent trajectories of collective behavior. To improve alignment with real-world data, we introduce \textbf{IB-Tune}, a novel fine-tuning method inspired by the \textbf{I}nformation \textbf{B}ottleneck principle, which retains population signals most predictive of future actions while filtering redundant history. Evaluated on a real-world social dataset, MF-LLM reduces KL divergence to human population distributions by \textbf{47\%} compared to non-mean-field baselines, enabling accurate trend forecasting and effective intervention planning. Generalizing across 7 domains and 4 LLM backbones, MF-LLM provides a scalable, high-fidelity foundation for social simulation.
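The individual-population loop can be sketched as alternating between per-agent decisions conditioned on a shared signal and a mean-field update of that signal. Here the signal is simply the mean of actions and `decide` is an arbitrary function; in MF-LLM the decision-maker is an LLM and the signals are learned:

```python
import numpy as np

def mean_field_rollout(decide, init_signal, n_agents=100, steps=10, seed=0):
    """Iterate: agents act given the population signal, then the signal
    is updated from the agents' actions and fed back. Returns the
    trajectory of population signals."""
    rng = np.random.default_rng(seed)
    signal, traj = init_signal, []
    for _ in range(steps):
        actions = np.array([decide(signal, rng) for _ in range(n_agents)])
        signal = actions.mean()      # mean-field update of the signal
        traj.append(signal)
    return traj

# Toy: agents herd toward the current signal with noise; the signal
# converges toward the fixed point s = 0.8*s + 0.2, i.e. s = 1.
traj = mean_field_rollout(lambda s, rng: 0.8 * s + 0.2 + rng.normal(0, 0.01),
                          init_signal=0.0)
```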

NeurIPS Conference 2025 Conference Paper

OPHR: Mastering Volatility Trading with Multi-Agent Deep Reinforcement Learning

  • Zeting Chen
  • Xinyu Cai
  • Molei Qin
  • Bo An

Options markets represent one of the most sophisticated segments of the financial ecosystem, with prices that directly reflect market uncertainty. In this paper, we introduce the first reinforcement learning (RL) framework specifically designed for volatility trading through options, focusing on profiting from the difference between implied volatility and realized volatility. Our multi-agent architecture consists of an Option Position Agent (OP-Agent) responsible for volatility timing by controlling long/short volatility positions, and a Hedger Routing Agent (HR-Agent) that manages risk and maximizes path-dependent profits by selecting optimal hedging strategies with different risk preferences. Evaluating our approach using cryptocurrency options data from 2021-2024, we demonstrate superior performance on BTC and ETH, significantly outperforming traditional strategies and machine learning baselines across all profit and risk-adjusted metrics while exhibiting sophisticated trading behavior. The code framework and sample data of this paper have been released at https://github.com/Edwicn/OPHR-MasteringVolatilityTradingwithMultiAgentDeepReinforcementLearning

AAAI Conference 2024 Conference Paper

EarnHFT: Efficient Hierarchical Reinforcement Learning for High Frequency Trading

  • Molei Qin
  • Shuo Sun
  • Wentao Zhang
  • Haochong Xia
  • Xinrun Wang
  • Bo An

High-frequency trading (HFT) uses computer algorithms to make trading decisions on short time scales (e.g., second-level), and is widely used in the Cryptocurrency (Crypto) market (e.g., Bitcoin). Reinforcement learning (RL) in financial research has shown stellar performance on many quantitative trading tasks. However, most methods focus on low-frequency trading, e.g., day-level, and cannot be directly applied to HFT because of two challenges. First, RL for HFT involves dealing with extremely long trajectories (e.g., 2.4 million steps per month), which are hard to optimize and evaluate. Second, the dramatic price fluctuations and market trend changes of Crypto make existing algorithms fail to maintain satisfactory performance. To tackle these challenges, we propose an Efficient hieArchical Reinforcement learNing method for High Frequency Trading (EarnHFT), a novel three-stage hierarchical RL framework for HFT. In stage I, we compute a Q-teacher, i.e., the optimal action value based on dynamic programming, for enhancing the performance and training efficiency of second-level RL agents. In stage II, we construct a pool of diverse RL agents for different market trends, distinguished by return rates, where hundreds of RL agents are trained with different preferences of return rates and only a tiny fraction of them will be selected into the pool based on their profitability. In stage III, we train a minute-level router which dynamically picks a second-level agent from the pool to achieve stable performance across different markets. Through extensive experiments in various market trends on Crypto markets in a high-fidelity simulation trading environment, we demonstrate that EarnHFT significantly outperforms 6 state-of-the-art baselines on 6 popular financial criteria, exceeding the runner-up by 30% in profitability.
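Stage I's Q-teacher can be illustrated with finite-horizon backward induction on a toy deterministic MDP; the paper's dynamic programming runs over market data, whereas the dynamics here are hypothetical:

```python
import numpy as np

def q_teacher(P, R, horizon):
    """Finite-horizon optimal action values via backward dynamic programming.

    P[s, a] -- next state under toy deterministic dynamics
    R[s, a] -- immediate reward
    Returns Q of shape (horizon, S, A); Q[0] holds optimal values at t=0.
    """
    S, A = R.shape
    Q = np.zeros((horizon, S, A))
    for t in range(horizon - 1, -1, -1):       # backward in time
        for s in range(S):
            for a in range(A):
                nxt = P[s, a]
                future = Q[t + 1, nxt].max() if t + 1 < horizon else 0.0
                Q[t, s, a] = R[s, a] + future
    return Q

# Toy 2-state, 2-action MDP: acting "1" from state 0 moves to state 1,
# where a reward of 2 is reachable next step.
P = np.array([[0, 1], [1, 0]])
R = np.array([[0.0, 1.0], [2.0, 0.0]])
Q = q_teacher(P, R, horizon=2)
```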

AAMAS Conference 2024 Conference Paper

Grasper: A Generalist Pursuer for Pursuit-Evasion Problems

  • Pengdeng Li
  • Shuxin Li
  • Xinrun Wang
  • Jakub Černý
  • Youzhi Zhang
  • Stephen McAleer
  • Hau Chan
  • Bo An

Pursuit-evasion games (PEGs) model interactions between a team of pursuers and an evader in graph-based environments such as urban street networks. Recent advancements have demonstrated the effectiveness of the pre-training and fine-tuning paradigm in Policy-Space Response Oracles (PSRO) to improve scalability in solving large-scale PEGs. However, these methods primarily focus on specific PEGs with fixed initial conditions that may vary substantially in real-world scenarios, which significantly hinders the applicability of the traditional methods. To address this issue, we introduce Grasper, a GeneRAlist purSuer for Pursuit-Evasion pRoblems, capable of efficiently generating pursuer policies tailored to specific PEGs. Our contributions are threefold: First, we present a novel architecture that offers high-quality solutions for diverse PEGs, comprising critical components such as (i) a graph neural network (GNN) to encode PEGs into hidden vectors, and (ii) a hypernetwork to generate pursuer policies based on these hidden vectors. As a second contribution, we develop an efficient three-stage training method involving (i) a pre-pretraining stage for learning robust PEG representations through self-supervised graph learning techniques like graph masked auto-encoder (GraphMAE), (ii) a pre-training stage utilizing heuristic-guided multi-task pre-training (HMP) where heuristic-derived reference policies (e.g., through Dijkstra’s algorithm) regularize pursuer policies, and (iii) a fine-tuning stage that employs PSRO to generate pursuer policies on designated PEGs. Finally, we perform extensive experiments on synthetic and real-world maps, showcasing Grasper’s significant superiority over baselines in terms of solution quality and generalizability. We demonstrate that Grasper provides a versatile approach for solving pursuit-evasion problems across a broad range of scenarios, enabling practical deployment in real-world situations.

IJCAI Conference 2024 Conference Paper

IMM: An Imitative Reinforcement Learning Approach with Predictive Representation Learning for Automatic Market Making

  • Hui Niu
  • Siyuan Li
  • Jiahao Zheng
  • Zhouchi Lin
  • Bo An
  • Jian Li
  • Jian Guo

Market making (MM) via Reinforcement Learning (RL) has attracted significant attention in financial trading. Most existing RL-based MM methods focus on optimizing single-price-level strategies, which suffer from frequent order cancellations and loss of queue priority. By comparison, strategies involving multiple price levels align better with actual trading scenarios. However, since multi-price-level strategies involve a comprehensive trading action space, effectively training RL agents remains a challenge. Inspired by the effective workflow of professional human market makers, we propose Imitative Market Maker (IMM), a novel RL framework leveraging knowledge from both suboptimal signal-based experts and direct policy interactions. Our framework starts by introducing effective state and action formulations that well encode information about multi-price-level orders. Furthermore, IMM integrates a representation learning unit capable of capturing both short- and long-term market trends to mitigate adverse selection risk. Subsequently, IMM designs an expert strategy based on predictive signals, and trains the agent through the integration of RL and imitation learning techniques to achieve efficient learning. Extensive experimental results on four real-world market datasets demonstrate the superiority of IMM against current RL-based MM strategies.

NeurIPS Conference 2024 Conference Paper

MaNo: Exploiting Matrix Norm for Unsupervised Accuracy Estimation Under Distribution Shifts

  • Renchunzi Xie
  • Ambroise Odonnat
  • Vasilii Feofanov
  • Weijian Deng
  • Jianfeng Zhang
  • Bo An

Leveraging the model’s outputs, specifically the logits, is a common approach to estimating the test accuracy of a pre-trained neural network on out-of-distribution (OOD) samples without requiring access to the corresponding ground-truth labels. Despite their ease of implementation and computational efficiency, current logit-based methods are vulnerable to overconfidence issues, leading to prediction bias, especially under natural shift. In this work, we first study the relationship between logits and generalization performance from the view of the low-density separation assumption. Our findings motivate our proposed method MaNo, which \textbf{(1)} applies a data-dependent normalization on the logits to reduce prediction bias, and \textbf{(2)} takes the $L_p$ norm of the matrix of normalized logits as the estimation score. Our theoretical analysis highlights the connection between the provided score and the model's uncertainty. We conduct an extensive empirical study on common unsupervised accuracy estimation benchmarks and demonstrate that MaNo achieves state-of-the-art performance across various architectures in the presence of synthetic, natural, or subpopulation shifts. The code is available at https://github.com/Renchunzi-Xie/MaNo.
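The score's skeleton, normalizing the logits and then taking an entry-wise $L_p$ matrix norm, can be sketched as follows; softmax stands in for the paper's data-dependent normalization, which is an assumption of this sketch:

```python
import numpy as np

def logit_matrix_score(logits, p=4):
    """Entry-wise L_p norm of normalized logits, averaged over entries.

    Softmax is used here only as a stand-in normalization; MaNo's actual
    normalization is different. Sharper (more confident) rows yield a
    higher score than diffuse ones.
    """
    z = logits - logits.max(axis=1, keepdims=True)   # stability shift
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    n, k = probs.shape
    return ((np.abs(probs) ** p).sum() / (n * k)) ** (1 / p)

confident = logit_matrix_score(np.eye(4) * 20.0)   # near one-hot rows
diffuse = logit_matrix_score(np.zeros((4, 4)))     # uniform rows
```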

AAAI Conference 2024 Conference Paper

Market-GAN: Adding Control to Financial Market Data Generation with Semantic Context

  • Haochong Xia
  • Shuo Sun
  • Xinrun Wang
  • Bo An

Financial simulators play an important role in enhancing forecasting accuracy, managing risks, and fostering strategic financial decision-making. Despite the development of financial market simulation methodologies, existing frameworks often struggle with adapting to specialized simulation contexts. We pinpoint the challenges as i) current financial datasets do not contain context labels; ii) current techniques are not designed to generate financial data with context as control, which demands greater precision compared to other modalities; iii) the inherent difficulties in generating context-aligned, high-fidelity data given the non-stationary, noisy nature of financial data. To address these challenges, our contributions are: i) we propose the Contextual Market Dataset with market dynamics, stock ticker, and history state as context, leveraging a market dynamics modeling method that combines linear regression and clustering to extract market dynamics; ii) we present Market-GAN, a novel architecture incorporating a Generative Adversarial Network (GAN) for controllable generation with context, an autoencoder for learning low-dimension features, and supervisors for knowledge transfer; iii) we introduce a two-stage training scheme to ensure that Market-GAN captures the intrinsic market distribution with multiple objectives. In the pretraining stage, with the use of the autoencoder and supervisors, we prepare the generator with a better initialization for the adversarial training stage. We propose a set of holistic evaluation metrics that consider alignment, fidelity, data usability on downstream tasks, and market facts. We evaluate Market-GAN with the Dow Jones Industrial Average data from 2000 to 2023 and showcase superior performance in comparison to 4 state-of-the-art time-series generative models.

IJCAI Conference 2024 Conference Paper

PoRank: A Practical Framework for Learning to Rank Policies

  • Pengjie Gu
  • Mengchen Zhao
  • Xu He
  • Yi Cai
  • Bo An

In many real-world scenarios, we need to select from a set of candidate policies before online deployment. Although existing off-policy evaluation (OPE) methods can be used to estimate online performance, they suffer from high variance. Fortunately, we care only about the ranking of the candidate policies, rather than their exact online rewards. Based on this, we propose a novel framework, PoRank, for learning to rank policies. In practice, learning to rank policies faces two main challenges: 1) generalization over the huge policy space and 2) lack of supervision signals. To overcome the first challenge, PoRank uses a Policy Comparison Transformer (PCT) for learning cross-policy representations, which capture the core discrepancies between policies and generalize well across the whole policy space. The second challenge arises because learning to rank requires online comparisons of policies as ground-truth labels, whereas deploying policies online might be highly expensive. To overcome this, PoRank adopts a crowdsourcing-based learning-to-rank (LTR) framework, where a set of OPE algorithms are employed to provide weak comparison labels. Experimental results show that PoRank not only outperforms baselines when the ground-truth labels are provided, but also achieves competitive performance when the ground-truth labels are unavailable.
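The crowdsourcing idea, with several OPE algorithms acting as weak annotators, can be illustrated by a simple Borda-count aggregation of their rankings. This is illustrative only; PoRank itself trains an LTR model on pairwise comparison labels rather than aggregating full rankings:

```python
def borda_aggregate(rankings):
    """Aggregate rankings from several weak 'annotators' (e.g. OPE methods).

    rankings -- list of rankings, each a list of policy ids, best first.
    A policy at position pos in a ranking of length n earns n-1-pos points.
    Returns policy ids sorted by total Borda score, best first.
    """
    scores = {}
    for r in rankings:
        n = len(r)
        for pos, pid in enumerate(r):
            scores[pid] = scores.get(pid, 0) + (n - 1 - pos)
    return sorted(scores, key=scores.get, reverse=True)

# Three noisy rankings of candidate policies A, B, C:
consensus = borda_aggregate([["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]])
```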

IJCAI Conference 2024 Conference Paper

Reinforcement Learning from Diverse Human Preferences

  • Wanqi Xue
  • Bo An
  • Shuicheng Yan
  • Zhongwen Xu

The complexity of designing reward functions has been a major obstacle to the wide application of deep reinforcement learning (RL) techniques. Describing an agent's desired behaviors and properties can be difficult, even for experts. A new paradigm called reinforcement learning from human preferences (or preference-based RL) has emerged as a promising solution, in which reward functions are learned from human preference labels among behavior trajectories. However, existing methods for preference-based RL are limited by the need for accurate oracle preference labels. This paper addresses this limitation by developing a method for learning from diverse human preferences. The key idea is to stabilize reward learning through regularization and correction in a latent space. To ensure temporal consistency, a strong constraint is imposed on the reward model that forces its latent space to be close to a non-parameterized distribution. Additionally, a confidence-based reward model ensembling method is designed to generate more stable and reliable predictions. The proposed method is tested on a variety of tasks in DMcontrol and Meta-world and has shown consistent and significant improvements over existing preference-based RL algorithms when learning from diverse feedback, paving the way for real-world applications of RL methods.

AAMAS Conference 2024 Conference Paper

Reinforcement Nash Equilibrium Solver

  • Xinrun Wang
  • Chang Yang
  • Shuxin Li
  • Pengdeng Li
  • Xiao Huang
  • Hau Chan
  • Bo An

Nash Equilibrium (NE) is the canonical solution concept of game theory, which provides an elegant tool to understand the rationality of players. Computing NE in two- or multi-player general-sum games is PPAD-Complete. Therefore, in this work, we propose REinforcement Nash Equilibrium Solver (RENES), which trains a single policy to modify games of different sizes and applies existing solvers to the modified games, where the obtained solution is evaluated on the original games. Specifically, our contributions are threefold. i) We represent the games as 𝛼-rank response graphs and leverage a graph neural network (GNN) to handle games of different sizes as inputs; ii) We use tensor decomposition, e.g., canonical polyadic (CP), to fix the dimension of the modifying actions for games of different sizes; iii) We train the modifying strategy with the widely-used proximal policy optimization (PPO) and apply the solvers to the modified games, where the obtained solution is evaluated on the original games. Extensive experiments on large-scale normal-form games show that our method can further improve the approximation of NE of different solvers, i.e., 𝛼-rank, CE, FP and PRD, and can be generalized to unseen games.

IJCAI Conference 2024 Conference Paper

Reinforcement Nash Equilibrium Solver

  • Xinrun Wang
  • Chang Yang
  • Shuxin Li
  • Pengdeng Li
  • Xiao Huang
  • Hau Chan
  • Bo An

Nash Equilibrium (NE) is the canonical solution concept of game theory, which provides an elegant tool to understand the rationality of players. Though a mixed-strategy NE exists in any game with finite players and actions, computing NE in two- or multi-player general-sum games is PPAD-Complete. Various alternative solutions, e.g., Correlated Equilibrium (CE), and learning methods, e.g., fictitious play (FP), have been proposed to approximate NE. For convenience, we refer to these methods as ``inexact solvers'', or ``solvers'' for short. However, the alternative solutions differ from NE and the learning methods generally fail to converge to NE. Therefore, in this work, we propose REinforcement Nash Equilibrium Solver (RENES), which trains a single policy to modify games of different sizes and applies the solvers to the modified games, where the obtained solution is evaluated on the original games. Specifically, our contributions are threefold. i) We represent the games as alpha-rank response graphs and leverage a graph neural network (GNN) to handle games of different sizes as inputs; ii) We use tensor decomposition, e.g., canonical polyadic (CP), to fix the dimension of the modifying actions for games of different sizes; iii) We train the modifying strategy with the widely-used proximal policy optimization (PPO) and apply the solvers to the modified games, where the obtained solution is evaluated on the original games. Extensive experiments on large-scale normal-form games show that our method can further improve the approximation of NE of different solvers, i.e., alpha-rank, CE, FP and PRD, and can be generalized to unseen games.

IJCAI Conference 2024 Conference Paper

Self-adaptive PSRO: Towards an Automatic Population-based Game Solver

  • Pengdeng Li
  • Shuxin Li
  • Chang Yang
  • Xinrun Wang
  • Xiao Huang
  • Hau Chan
  • Bo An

Policy-Space Response Oracles (PSRO) as a general algorithmic framework has achieved state-of-the-art performance in learning equilibrium policies of two-player zero-sum games. However, the hand-crafted hyperparameter value selection in most of the existing works requires extensive domain knowledge, forming the main barrier to applying PSRO to different games. In this work, we make the first attempt to investigate the possibility of self-adaptively determining the optimal hyperparameter values in the PSRO framework. Our contributions are three-fold: (1) Using several hyperparameters, we propose a parametric PSRO that unifies the gradient descent ascent (GDA) and different PSRO variants. (2) We propose the self-adaptive PSRO (SPSRO) by casting the hyperparameter value selection of the parametric PSRO as a hyperparameter optimization (HPO) problem where our objective is to learn an HPO policy that can self-adaptively determine the optimal hyperparameter values during the running of the parametric PSRO. (3) To overcome the poor performance of online HPO methods, we propose a novel offline HPO approach to optimize the HPO policy based on the Transformer architecture. Experiments on various two-player zero-sum games demonstrate the superiority of SPSRO over different baselines.

AAAI Conference 2024 Conference Paper

Transition-Informed Reinforcement Learning for Large-Scale Stackelberg Mean-Field Games

  • Pengdeng Li
  • Runsheng Yu
  • Xinrun Wang
  • Bo An

Many real-world scenarios including fleet management and Ad auctions can be modeled as Stackelberg mean-field games (SMFGs) where a leader aims to incentivize a large number of homogeneous self-interested followers to maximize her utility. Existing works focus on cases with a small number of heterogeneous followers, e.g., 5-10, and suffer from scalability issue when the number of followers increases. There are three major challenges in solving large-scale SMFGs: i) classical methods based on solving differential equations fail as they require exact dynamics parameters, ii) learning by interacting with environment is data-inefficient, and iii) complex interaction between the leader and followers makes the learning performance unstable. We address these challenges through transition-informed reinforcement learning. Our main contributions are threefold: i) we first propose an RL framework, the Stackelberg mean-field update, to learn the leader's policy without priors of the environment, ii) to improve the data efficiency and accelerate the learning process, we then propose the Transition-Informed Reinforcement Learning (TIRL) by leveraging the instantiated empirical Fokker-Planck equation, and iii) we develop a regularized TIRL by employing various regularizers to alleviate the sensitivity of the learning performance to the initialization of the leader's policy. Extensive experiments on fleet management and food gathering demonstrate that our approach can scale up to 100,000 followers and significantly outperform existing baselines.

IJCAI Conference 2024 Conference Paper

vMFER: Von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement

  • Yiwen Zhu
  • Jinyi Liu
  • Wenya Wei
  • Qianyi Fu
  • Yujing Hu
  • Zhou Fang
  • Bo An
  • Jianye Hao

Reinforcement Learning (RL) is a widely employed technique in decision-making problems, encompassing two fundamental operations -- policy evaluation and policy improvement. Enhancing learning efficiency remains a key challenge in RL, with many efforts focused on using ensemble critics to boost policy evaluation efficiency. However, when using multiple critics, the actor in the policy improvement process can obtain different gradients. Previous studies have combined these gradients without considering their disagreements. Therefore, optimizing the policy improvement process is crucial to enhance learning efficiency. This study focuses on investigating the impact of gradient disagreements caused by ensemble critics on policy improvement. We introduce the concept of uncertainty of gradient directions as a means to measure the disagreement among gradients utilized in the policy improvement process. Through measuring the disagreement among gradients, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose a method called von Mises-Fisher Experience Resampling (vMFER), which optimizes the policy improvement process by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments demonstrate that vMFER significantly outperforms the benchmark and is particularly well-suited for ensemble structures in RL.

AAMAS Conference 2024 Conference Paper

vMFER: von Mises-Fisher Experience Resampling Based on Uncertainty of Gradient Directions for Policy Improvement of Actor-Critic Algorithms

  • Yiwen Zhu
  • Jinyi Liu
  • Wenya Wei
  • Qianyi Fu
  • Yujing Hu
  • Zhou Fang
  • Bo An
  • Jianye Hao

Reinforcement Learning (RL) is a widely employed technique in decision-making problems, encompassing two fundamental operations – policy evaluation and policy improvement. Actor-critic algorithms dominate the field of RL, but there is a challenge in improving their learning efficiency. To address this, ensemble critics are often employed to enhance policy evaluation efficiency. However, when using multiple critics, the actor in the policy improvement process can obtain different gradients. Previous studies have combined these gradients without considering their disagreements. Therefore, optimizing the policy improvement process is crucial to enhance the learning efficiency of actor-critic algorithms. This study focuses on investigating the impact of gradient disagreements caused by ensemble critics on policy improvement. We introduce the concept of uncertainty of gradient directions as a means to measure the disagreement among gradients utilized in the policy improvement process. Through measuring the disagreement among gradients, we find that transitions with lower uncertainty of gradient directions are more reliable in the policy improvement process. Building on this analysis, we propose a method called von Mises-Fisher Experience Resampling (vMFER), which optimizes the policy improvement process by resampling transitions and assigning higher confidence to transitions with lower uncertainty of gradient directions. Our experiments on Mujoco robotic control tasks and robotic arm tasks with sparse rewards demonstrate that vMFER significantly outperforms the benchmark and is particularly well-suited for ensemble structures in RL.

AAMAS Conference 2023 Conference Paper

A Learning Approach to Complex Contagion Influence Maximization

  • Haipeng Chen
  • Bryan Wilder
  • Wei Qiu
  • Bo An
  • Eric Rice
  • Milind Tambe

Influence maximization (IM) aims to find a set of seed nodes in a social network that maximizes the influence spread. While most IM problems focus on classical influence cascades (e.g., Independent Cascade and Linear Threshold) which assume individual influence cascade probability is independent of the number of neighbors, recent studies by sociologists show that many influence cascades follow a pattern called complex contagion (CC), where influence cascade probability is much higher when more neighbors are influenced. Nonetheless, there are very limited studies on complex contagion influence maximization (CCIM) problems. This is partly because CC is non-submodular, the solution of which has been an open challenge. In this study, we propose the first reinforcement learning (RL) approach to CCIM. We find that a key obstacle in applying existing RL approaches to CCIM is the reward sparseness issue, which comes from two distinct sources. We then design a new RL algorithm that uses the CCIM problem structure to address the issue. Empirical results show that our approach achieves the state-of-the-art performance on four real-world networks.
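The defining feature of complex contagion, an activation probability that rises with the number of already-influenced neighbors, can be simulated directly. A hedged sketch with a hypothetical threshold-style `prob` function:

```python
import numpy as np

def complex_contagion_spread(adj, seeds, prob, steps=5, seed=0):
    """One random spread run on an adjacency matrix.

    Unlike Independent Cascade, the activation probability prob(k)
    depends on the COUNT k of a node's already-active neighbors.
    Returns the final number of influenced nodes.
    """
    rng = np.random.default_rng(seed)
    active = np.zeros(len(adj), dtype=bool)
    active[list(seeds)] = True
    for _ in range(steps):
        k = adj @ active                          # active-neighbor counts
        p = np.array([prob(int(x)) for x in k])
        active |= (~active) & (rng.random(len(adj)) < p)
    return int(active.sum())

# Complete graph on 4 nodes; a node activates iff >= 2 neighbors are active.
K4 = np.ones((4, 4), dtype=int) - np.eye(4, dtype=int)
spread = complex_contagion_spread(K4, seeds=[0, 1], prob=lambda k: float(k >= 2))
stuck = complex_contagion_spread(K4, seeds=[0, 1], prob=lambda k: float(k >= 3))
```

With two seeds, a threshold of 2 lets the cascade cover the clique, while a threshold of 3 stalls it immediately, illustrating why CC objectives are non-submodular.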

AAAI Conference 2023 Conference Paper

An Efficient Deep Reinforcement Learning Algorithm for Solving Imperfect Information Extensive-Form Games

  • Linjian Meng
  • Zhenxing Ge
  • Pinzhuo Tian
  • Bo An
  • Yang Gao

One of the most popular methods for learning Nash equilibrium (NE) in large-scale imperfect information extensive-form games (IIEFGs) is the neural variants of counterfactual regret minimization (CFR). CFR is a special case of Follow-The-Regularized-Leader (FTRL). At each iteration, the neural variants of CFR update the agent's strategy via the estimated counterfactual regrets. Then, they use neural networks to approximate the new strategy, which incurs an approximation error. These approximation errors will accumulate since the counterfactual regrets at iteration t are estimated using the agent's past approximated strategies. Such accumulated approximation error causes poor performance. To address this accumulated approximation error, we propose a novel FTRL algorithm called FTRL-ORW, which does not utilize the agent's past strategies to pick the next iteration strategy. More importantly, FTRL-ORW can update its strategy via the trajectories sampled from the game, which makes it suitable for solving large-scale IIEFGs since sampling multiple actions for each information set is too expensive in such games. However, it remains unclear which algorithm to use to compute the next iteration strategy for FTRL-ORW when only such sampled trajectories are revealed at iteration t. To address this problem and scale FTRL-ORW to large-scale games, we provide a model-free method called Deep FTRL-ORW, which computes the next iteration strategy using model-free Maximum Entropy Deep Reinforcement Learning. Experimental results on two-player zero-sum IIEFGs show that Deep FTRL-ORW significantly outperforms existing model-free neural methods and OS-MCCFR.

IJCAI Conference 2023 Conference Paper

Complex Contagion Influence Maximization: A Reinforcement Learning Approach

  • Haipeng Chen
  • Bryan Wilder
  • Wei Qiu
  • Bo An
  • Eric Rice
  • Milind Tambe

In influence maximization (IM), the goal is to find a set of seed nodes in a social network that maximizes the influence spread. While most IM problems focus on classical influence cascades (e.g., Independent Cascade and Linear Threshold) which assume individual influence cascade probability is independent of the number of neighbors, recent studies by sociologists show that many influence cascades follow a pattern called complex contagion (CC), where influence cascade probability is much higher when more neighbors are influenced. Nonetheless, there are very limited studies for complex contagion influence maximization (CCIM) problems. This is partly because CC is non-submodular, the solution of which has been an open challenge. In this study, we propose the first reinforcement learning (RL) approach to CCIM. We find that a key obstacle in applying existing RL approaches to CCIM is the reward sparseness issue, which comes from two distinct sources. We then design a new RL algorithm that uses the CCIM problem structure to address the issue. Empirical results show that our approach achieves the state-of-the-art performance on 9 real-world networks.

NeurIPS Conference 2023 Conference Paper

Computing Optimal Nash Equilibria in Multiplayer Games

  • Youzhi Zhang
  • Bo An
  • Venkatramanan Subrahmanian

Designing efficient algorithms to compute a Nash Equilibrium (NE) in multiplayer games is still an open challenge. In this paper, we focus on computing an NE that optimizes a given objective function. For example, when there is a team of players independently playing against an adversary in a game (e.g., several groups in a forest trying to interdict illegal loggers in green security games), these team members may need to find an NE minimizing the adversary’s utility. Finding an optimal NE in multiplayer games can be formulated as a mixed-integer bilinear program by introducing auxiliary variables to represent bilinear terms, leading to a huge number of bilinear terms, making it hard to solve. To overcome this challenge, we first propose a general framework for this formulation based on a set of correlation plans. We then develop a novel algorithm called CRM based on this framework, which uses correlation plans with their relations to strictly reduce the feasible solution space after the convex relaxation of bilinear terms while minimizing the number of correlation plans to significantly reduce the number of bilinear terms. We show that our techniques can significantly reduce the time complexity and CRM can be several orders of magnitude faster than the state-of-the-art baseline.

IS Journal 2023 Journal Article

Effective Interpretable Policy Distillation via Critical Experience Point Identification

  • Xiao Liu
  • Shuyang Liu
  • Bo An
  • Yang Gao
  • Shangdong Yang
  • Wenbin Li

Interpretable policy distillation aims to imitate a deep reinforcement learning (DRL) policy with a self-explainable model. However, the distilled policy usually does not generalize well to complex tasks. To investigate this phenomenon, we examine the experience pools of DRL tasks and find that these interactive experience distributions are heavy-tailed. This critical issue is largely ignored by existing approaches, which therefore fail to fully utilize the less frequent but very critical experience points. To address this issue, we propose characterizing decision boundaries via minimum experience retention to deal with the heavy-tailed experience distributions. Our method identifies critical experience points that are close to the model's decision boundaries; such experience points are more critical because they portray the prerequisites for a model to take an action. As a result, our method distills the DRL policy into a self-explainable model without a neural structure or ambiguous intermediate parameters. Through experiments on six games, we show that our method outperforms the state-of-the-art baselines in cumulative rewards, stability, and faithfulness.

IJCAI Conference 2023 Conference Paper

Exploring Leximin Principle for Fair Core-Selecting Combinatorial Auctions: Payment Rule Design and Implementation

  • Hao Cheng
  • Shufeng Kong
  • Yanchen Deng
  • Caihua Liu
  • Xiaohu Wu
  • Bo An
  • Chongjun Wang

Core-selecting combinatorial auctions (CAs) restrict the auction result to the core, such that no coalition can improve its utility by engaging in collusion. The minimum-revenue-core (MRC) rule is a widely used core-selecting payment rule that maximizes the total utility of all bidders. However, the MRC rule can suffer from severe unfairness since it ignores individuals' utilities. To address this limitation, we propose to explore the leximin principle to achieve fairness in core-selecting CAs, since the leximin principle prefers to maximize the utility of the worst-off. We theoretically analyze the resulting bidder-leximin-optimal (BLO) payment rule and provide an effective algorithm to compute the BLO outcome. Moreover, we conduct extensive experiments to show that our algorithm returns fairer utility distributions and is faster than existing algorithms for core-selecting payment rules.
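The leximin principle above has a compact operational form: compare utility profiles by their ascending-sorted vectors, lexicographically. As a hedged illustration (not the paper's BLO algorithm, which must also respect core constraints), a minimal sketch:

```python
def leximin_key(utilities):
    """Leximin order: sort utilities ascending and compare lexicographically.
    A profile whose worst-off agent (then second-worst, etc.) is better off wins."""
    return tuple(sorted(utilities))

def leximin_best(profiles):
    """Pick the leximin-optimal utility profile among candidates."""
    return max(profiles, key=leximin_key)
```

For example, `leximin_best([[1, 9], [4, 5]])` prefers `[4, 5]`: its worst-off bidder gets 4 rather than 1, even though its total utility is lower.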

NeurIPS Conference 2023 Conference Paper

Few-shot Generation via Recalling Brain-Inspired Episodic-Semantic Memory

  • Zhibin Duan
  • Zhiyi Lv
  • Chaojie Wang
  • Bo Chen
  • Bo An
  • Mingyuan Zhou

Aimed at adapting a generative model to a novel generation task with only a few given data samples, the capability of few-shot generation is crucial for many real-world applications with limited data, e.g., artistic domains. Instead of training from scratch, recent works tend to leverage the prior knowledge stored in previous datasets, which is quite similar to the memory mechanism of human intelligence; however, few of these works directly imitate the memory-recall mechanism that humans make good use of in accomplishing creative tasks, e.g., painting and writing. Inspired by the memory mechanism of the human brain, in this work we carefully design a variational structured memory module (VSM), which can simultaneously store both episodic and semantic memories to help existing generative models efficiently recall these memories during sample generation. Meanwhile, we introduce a bionic memory updating strategy for the conversion between episodic and semantic memories, which can also model the uncertainty during conversion. We then combine the developed VSM with various generative models under the Bayesian framework and evaluate these memory-augmented generative models on few-shot generation tasks, demonstrating the effectiveness of our methods.

AAMAS Conference 2023 Conference Paper

Finding Optimal Nash Equilibria in Multiplayer Games via Correlation Plans

  • Youzhi Zhang
  • Bo An
  • V. S. Subrahmanian

Designing efficient algorithms to compute a Nash Equilibrium (NE) in multiplayer games is still an open challenge. In this paper, we focus on computing an NE that optimizes a given objective function. Finding an optimal NE in multiplayer games can be formulated as a mixed-integer bilinear program by introducing auxiliary variables to represent bilinear terms, which leads to a huge number of bilinear terms and makes the program hard to solve. To overcome this challenge, we propose a novel algorithm called CRM based on a novel mixed-integer bilinear program with correlation plans for some subsets of players, which uses Correlation plans with their Relations to strictly reduce the feasible solution space after the convex relaxation of bilinear terms while Minimizing the number of correlation plans to significantly reduce the number of bilinear terms. Experimental results show that our algorithm can be several orders of magnitude faster than the state-of-the-art baseline.

NeurIPS Conference 2023 Conference Paper

In Defense of Softmax Parametrization for Calibrated and Consistent Learning to Defer

  • Yuzhou Cao
  • Hussein Mozannar
  • Lei Feng
  • Hongxin Wei
  • Bo An

Enabling machine learning classifiers to defer their decisions to a downstream expert when the expert is more accurate will ensure improved safety and performance. This objective can be achieved with the learning-to-defer framework, which aims to jointly learn how to classify and how to defer to the expert. Recent studies have theoretically shown that popular estimators for learning to defer parameterized with softmax provide unbounded estimates for the likelihood of deferring, which makes them uncalibrated. However, it remains unknown whether this is due to the widely used softmax parameterization and whether we can find a softmax-based estimator that is both statistically consistent and yields valid probability estimates. In this work, we first show that the miscalibrated and unbounded estimator in prior literature is due to the symmetric nature of the surrogate losses used, not due to softmax. We then propose a novel statistically consistent asymmetric softmax-based surrogate loss that can produce valid estimates without the issue of unboundedness. We further analyze the non-asymptotic properties of our proposed method and empirically validate its performance and calibration on benchmark datasets.

AAMAS Conference 2023 Conference Paper

Off-Beat Multi-Agent Reinforcement Learning

  • Wei Qiu
  • Weixun Wang
  • Rundong Wang
  • Bo An
  • Yujing Hu
  • Svetlana Obraztsova
  • Zinovi Rabinovich
  • Jianye Hao

We investigate cooperative multi-agent reinforcement learning in environments with off-beat actions, i.e., actions that have execution durations during which environmental changes are not synchronised with action executions. To learn efficient multi-agent coordination in environments with off-beat actions, we propose a novel reward redistribution method built on our novel graph-based episodic memory. We name our method LeGEM. Empirical results on the stag-hunter game show that it significantly boosts multi-agent coordination.

NeurIPS Conference 2023 Conference Paper

Offline RL with Discrete Proxy Representations for Generalizability in POMDPs

  • Pengjie Gu
  • Xinyu Cai
  • Dong Xing
  • Xinrun Wang
  • Mengchen Zhao
  • Bo An

Offline Reinforcement Learning (RL) has demonstrated promising results in various applications by learning policies from previously collected datasets, reducing the need for online exploration and interaction. However, real-world scenarios usually involve partial observability, which brings crucial challenges to the deployment of offline RL methods: i) a policy trained on data with full observability is not robust against masked observations during execution, and ii) which parts of the observations are masked is usually unknown during training. To address these challenges, we present Offline RL with DiscrEte pRoxy representations (ORDER), a probabilistic framework that leverages novel state representations to improve robustness against diverse masked observabilities. Specifically, we propose a discrete representation of the states and use a proxy representation to recover the states from masked, partially observable trajectories. The training of ORDER can be compactly described as three steps: i) learning the discrete state representations on data with full observations, ii) training the decision module based on the discrete representations, and iii) training the proxy discrete representations on data with various partial observations, aligning them with the discrete representations. We conduct extensive experiments to evaluate ORDER, showcasing its effectiveness in offline RL for diverse partially observable scenarios and highlighting the significance of discrete proxy representations for generalization performance. ORDER is a flexible framework that can employ any offline RL algorithm, and we hope it can pave the way for deploying RL policies against various partial observabilities in the real world.

NeurIPS Conference 2023 Conference Paper

On the Importance of Feature Separability in Predicting Out-Of-Distribution Error

  • Renchunzi Xie
  • Hongxin Wei
  • Lei Feng
  • Yuzhou Cao
  • Bo An

Estimating generalization performance on out-of-distribution (OOD) data without ground-truth labels is practically challenging. While previous methods emphasize the connection between distribution difference and OOD accuracy, we show that a large domain gap does not necessarily lead to low test accuracy. In this paper, we investigate this problem from the perspective of feature separability, both empirically and theoretically. Specifically, we propose a dataset-level score based on feature dispersion to estimate test accuracy under distribution shift. Our method is inspired by desirable properties of features in representation learning: high inter-class dispersion and high intra-class compactness. Our analysis shows that inter-class dispersion is strongly correlated with model accuracy, while intra-class compactness does not reflect generalization performance on OOD data. Extensive experiments demonstrate the superiority of our method in both prediction performance and computational efficiency.
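One way to make the inter-class dispersion idea concrete is to measure how far apart class centroids sit in feature space. The sketch below is an illustrative approximation under that assumption, not the paper's exact score:

```python
import numpy as np

def inter_class_dispersion(features, labels):
    """Mean pairwise Euclidean distance between class centroids:
    higher dispersion suggests better-separated features."""
    classes = np.unique(labels)
    centroids = np.stack([features[labels == c].mean(axis=0) for c in classes])
    dists = [np.linalg.norm(centroids[i] - centroids[j])
             for i in range(len(classes))
             for j in range(i + 1, len(classes))]
    return float(np.mean(dists))
```

In practice such a score would be computed on penultimate-layer features of the unlabeled OOD set, with labels replaced by the model's own predictions.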

AAAI Conference 2023 Conference Paper

Partial-Label Regression

  • Xin Cheng
  • Deng-Bao Wang
  • Lei Feng
  • Min-Ling Zhang
  • Bo An

Partial-label learning is a popular weakly supervised learning setting that allows each training example to be annotated with a set of candidate labels. Previous studies on partial-label learning focused only on the classification setting where candidate labels are all discrete, which cannot handle continuous labels with real values. In this paper, we provide the first attempt to investigate partial-label regression, where each training example is annotated with a set of real-valued candidate labels. To solve this problem, we first propose a simple baseline method that takes the average loss incurred by candidate labels as the predictive loss. The drawback of this method is that the loss incurred by the true label may be overwhelmed by other false labels. To overcome this drawback, we propose an identification method that takes the least loss incurred by candidate labels as the predictive loss. We further improve it by proposing a progressive identification method that differentiates candidate labels using progressively updated weights for the incurred losses. We prove that the latter two methods are model-consistent and provide convergence analysis showing the optimal parametric convergence rate. Our proposed methods are theoretically grounded and compatible with any model, optimizer, and loss. Experiments validate the effectiveness of our proposed methods.
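The three predictive losses described above can be sketched directly. This is a hedged illustration of the loss definitions only; the paper's weight-update rule for progressive identification is not reproduced, and `weights` here stands for any normalized weight vector over the candidates:

```python
import numpy as np

def average_loss(pred, candidates):
    """Baseline: mean squared error averaged over all candidate labels."""
    return float(np.mean([(pred - c) ** 2 for c in candidates]))

def least_loss(pred, candidates):
    """Identification: keep only the smallest squared error, treating the
    best-fitting candidate as the (pseudo) true label."""
    return float(np.min([(pred - c) ** 2 for c in candidates]))

def progressive_loss(pred, candidates, weights):
    """Progressive identification: weight each candidate's loss; the weights
    are meant to concentrate on the true label as training proceeds."""
    losses = np.array([(pred - c) ** 2 for c in candidates])
    return float(np.dot(weights, losses))
```

Note how `average_loss` lets a far-away false candidate dominate, while `least_loss` ignores all but the closest candidate, which is exactly the drawback/fix pair the abstract describes.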

TMLR Journal 2023 Journal Article

PRUDEX-Compass: Towards Systematic Evaluation of Reinforcement Learning in Financial Markets

  • Shuo Sun
  • Molei Qin
  • Xinrun Wang
  • Bo An

The financial markets, which involve more than $90 trillion in market capitalization, attract the attention of innumerable investors around the world. Recently, reinforcement learning in financial markets (FinRL) has emerged as a promising direction for training agents to make profitable investment decisions. However, the evaluation of most FinRL methods focuses only on profit-related measures and ignores many critical axes, which is far from satisfactory for financial practitioners seeking to deploy these methods in real-world financial markets. Therefore, we introduce PRUDEX-Compass, which has 6 axes, i.e., Profitability, Risk-control, Universality, Diversity, rEliability, and eXplainability, with a total of 17 measures for systematic evaluation. Specifically, i) we propose AlphaMix+ as a strong FinRL baseline, which leverages mixture-of-experts (MoE) and risk-sensitive approaches to make diversified risk-aware investment decisions; ii) we evaluate 8 FinRL methods on 4 long-term real-world datasets from influential financial markets to demonstrate the usage of PRUDEX-Compass; and iii) we release PRUDEX-Compass together with the 4 real-world datasets, standard implementations of the 8 FinRL methods, and a portfolio management environment as public resources to facilitate the design and comparison of new FinRL methods. We hope that PRUDEX-Compass can not only shed light on future FinRL research to prevent untrustworthy results from stalling FinRL before successful industry deployment, but also provide a new challenging algorithm evaluation scenario for the reinforcement learning (RL) community.

NeurIPS Conference 2023 Conference Paper

Regression with Cost-based Rejection

  • Xin Cheng
  • Yuzhou Cao
  • Haobo Wang
  • Hongxin Wei
  • Bo An
  • Lei Feng

Learning with rejection is an important framework that can refrain from making predictions on difficult examples to avoid critical mispredictions, balancing between prediction and rejection. Previous studies on cost-based rejection focused only on the classification setting, which cannot handle the continuous and infinite target space of the regression setting. In this paper, we investigate a novel problem called regression with cost-based rejection, where the model can refuse to make predictions on some examples given certain rejection costs. To solve this problem, we first formulate the expected risk for this problem and then derive the Bayes optimal solution, which shows that the optimal model should reject examples whose variance is larger than the rejection cost when the mean squared error is used as the evaluation metric. Furthermore, we propose to train the model with a surrogate loss function that treats rejection as binary classification, and we provide conditions for model consistency, which imply that the Bayes optimal solution can be recovered by our proposed surrogate loss. Extensive experiments demonstrate the effectiveness of our proposed method.
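The Bayes-optimal rule stated above (under squared error, reject exactly when the predictive variance exceeds the rejection cost) is simple to state in code. A minimal sketch, assuming the model already outputs a predictive mean and variance:

```python
def predict_with_rejection(mean, variance, cost):
    """Reject (return None) when predictive variance exceeds the rejection
    cost; otherwise predict the mean. Under MSE, rejecting incurs the fixed
    cost while predicting incurs the variance in expectation, so this rule
    always picks the cheaper option."""
    return None if variance > cost else mean
```

For instance, with cost 0.5 an example with variance 0.9 is rejected, while one with variance 0.2 is predicted.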

TIST Journal 2023 Journal Article

Reinforcement Learning for Quantitative Trading

  • Shuo Sun
  • Rundong Wang
  • Bo An

Quantitative trading (QT), which refers to the use of mathematical models and data-driven techniques to analyze the financial market, has been a popular topic in both academia and the financial industry since the 1970s. In the last decade, reinforcement learning (RL) has garnered significant interest in many domains, such as robotics and video games, owing to its outstanding ability to solve complex sequential decision-making problems. RL's impact is pervasive, and it has recently demonstrated the ability to conquer many challenging QT tasks. Exploring the potential of RL techniques on QT tasks is a flourishing research direction. This paper aims to provide a comprehensive survey of research efforts on RL-based methods for QT tasks. More concretely, we devise a taxonomy of RL-based QT models, along with a comprehensive summary of the state of the art. Finally, we discuss current challenges and propose future research directions in this exciting field.

AAAI Conference 2023 Conference Paper

Solving Large-Scale Pursuit-Evasion Games Using Pre-trained Strategies

  • Shuxin Li
  • Xinrun Wang
  • Youzhi Zhang
  • Wanqi Xue
  • Jakub Černý
  • Bo An

Pursuit-evasion games on graphs model the coordination of police forces chasing a fleeing felon in real-world urban settings, using the standard framework of imperfect-information extensive-form games (EFGs). In recent years, solving EFGs has been largely dominated by the Policy-Space Response Oracle (PSRO) methods due to their modularity, scalability, and favorable convergence properties. However, even these methods quickly reach their limits when facing large combinatorial strategy spaces of the pursuit-evasion games. To improve their efficiency, we integrate the pre-training and fine-tuning paradigm into the core module of PSRO -- the repeated computation of the best response. First, we pre-train the pursuer's policy base model against many different strategies of the evader. Then we proceed with the PSRO loop and fine-tune the pre-trained policy to attain the pursuer's best responses. The empirical evaluation shows that our approach significantly outperforms the baselines in terms of speed and scalability, and can solve even games on street maps of megalopolises with tens of thousands of crossroads -- a scale beyond the effective reach of previous methods.

NeurIPS Conference 2023 Conference Paper

State Regularized Policy Optimization on Data with Dynamics Shift

  • Zhenghai Xue
  • Qingpeng Cai
  • Shuchang Liu
  • Dong Zheng
  • Peng Jiang
  • Kun Gai
  • Bo An

In many real-world scenarios, Reinforcement Learning (RL) algorithms are trained on data with dynamics shift, i.e., with different underlying environment dynamics. A majority of current methods address this issue by training context encoders to identify environment parameters. Data with dynamics shift are separated according to their environment parameters to train the corresponding policy. However, these methods can be sample inefficient because data are used ad hoc, and policies trained for one dynamics cannot benefit from data collected in all other environments with different dynamics. In this paper, we find that in many environments with similar structures and different dynamics, optimal policies have similar stationary state distributions. We exploit this property and learn the stationary state distribution from data with dynamics shift for efficient data reuse. This distribution is used to regularize the policy trained in a new environment, leading to the SRPO (State Regularized Policy Optimization) algorithm. For the theoretical analysis, the intuition of similar environment structures is characterized by the notion of homomorphous MDPs. We then demonstrate a lower-bound performance guarantee for policies regularized by the stationary state distribution. In practice, SRPO can be an add-on module to context-based algorithms in both online and offline RL settings. Experimental results show that SRPO can make several context-based algorithms far more data efficient and significantly improve their overall performance.

AAMAS Conference 2023 Conference Paper

Structural Credit Assignment-Guided Coordinated MCTS: An Efficient and Scalable Method for Online Multiagent Planning

  • Qian Che
  • Wanyuan Wang
  • Fengchen Wang
  • Tianchi Qiao
  • Xiang Liu
  • Jiuchuan Jiang
  • Bo An
  • Yichuan Jiang

Online planning has been widely studied in many areas, such as industry chains and collective intelligence. Owing to its trade-off nature of trading computation time for solution quality, Monte-Carlo tree search (MCTS) has shown great success in online planning. However, the exponential growth of the global joint-action space makes it challenging to apply MCTS to online multiagent planning (MAP). Our goal in this paper is to design an efficient and scalable coordinated MCTS method for online MAP. Combined with coordination graphs, the recent Factored Value MCTS (FV-MCTS) has attempted to recover the trade-off property for MCTS-based online MAP. However, FV-MCTS directly uses the global payoff to reward each agent and has difficulty finding coordinated actions in multiagent MCTS settings where other agents are also taking exploratory actions. We overcome this limitation by designing a generalized structural credit assignment (SCA)-guided coordinated MCTS, where SCA is used to promote coordination and MCTS is used to search for promising global joint-actions. Specifically, we use the Shapley value to provide a fair SCA, which can be computed efficiently by exploiting the locality of interaction between agents. Moreover, theoretical analysis shows that the proposed method can bound the bias of the estimated value of the global joint-action under certain conditions. Finally, we conduct extensive experiments in typical sequential multiagent coordination domains, such as multi-robot warehouse patrolling in an industry chain, to validate the efficiency and scalability of the proposed method against other benchmarks.

NeurIPS Conference 2023 Conference Paper

TradeMaster: A Holistic Quantitative Trading Platform Empowered by Reinforcement Learning

  • Shuo Sun
  • Molei Qin
  • Wentao Zhang
  • Haochong Xia
  • Chuqiao Zong
  • Jie Ying
  • Yonggang Xie
  • Lingxuan Zhao

The financial markets, which involve over $90 trillion in market capitalization, attract the attention of innumerable profit-seeking investors globally. The recent explosion of reinforcement learning in financial trading (RLFT) research has shown stellar performance on many quantitative trading tasks. However, it is still challenging to deploy reinforcement learning (RL) methods in real-world financial markets due to the highly composite nature of this domain, which entails design choices and interactions between components that collect financial data, conduct feature engineering, build market environments, make investment decisions, evaluate model behaviors, and offer user interfaces. Despite the availability of abundant financial data and advanced RL techniques, a remarkable gap still exists between the potential and realized utilization of RL in financial trading. In particular, orchestrating an RLFT project lifecycle poses challenges in engineering (hard to build), benchmarking (hard to compare), and usability (hard to optimize, maintain, and use). To overcome these challenges, we introduce TradeMaster, a holistic open-source RLFT platform that serves as i) a software toolkit, ii) an empirical benchmark, and iii) a user interface. Our ultimate goal is to provide infrastructure for transparent and reproducible RLFT research and to facilitate real-world deployment with industry impact. TradeMaster will be updated continuously and welcomes contributions from both the RL and finance communities.

NeurIPS Conference 2022 Conference Paper

Alleviating "Posterior Collapse" in Deep Topic Models via Policy Gradient

  • Yewen Li
  • Chaojie Wang
  • Zhibin Duan
  • Dongsheng Wang
  • Bo Chen
  • Bo An
  • Mingyuan Zhou

Deep topic models have proven to be a promising way to extract hierarchical latent representations from documents represented as high-dimensional bag-of-words vectors. However, the representation capability of existing deep topic models is still limited by the phenomenon of "posterior collapse", widely criticized in deep generative models, which causes the higher-level latent representations to exhibit similar or meaningless patterns. To address this, we first develop a novel deep-coupling generative process for existing deep topic models, which incorporates skip connections into the generation of documents, enforcing strong links between a document and its multi-layer latent representations. After that, utilizing data augmentation techniques, we reformulate the deep-coupling generative process as a Markov decision process and develop a corresponding Policy Gradient (PG) based training algorithm, which can further alleviate the information reduction at higher layers. Extensive experiments demonstrate that our methods can effectively alleviate "posterior collapse" in deep topic models, contributing to higher-quality latent document representations.

IJCAI Conference 2022 Conference Paper

Correlation-Based Algorithm for Team-Maxmin Equilibrium in Multiplayer Extensive-Form Games

  • Youzhi Zhang
  • Bo An
  • V. S. Subrahmanian

Efficient algorithms for computing a Nash equilibrium have been successfully applied to large zero-sum two-player extensive-form games (e.g., poker). However, in multiplayer games, computing a Nash equilibrium is generally hard, and the equilibria are not exchangeable, which forces players to select among many different Nash equilibria. In this paper, we focus on an alternative solution concept in zero-sum multiplayer extensive-form games called the Team-Maxmin Equilibrium (TME). It is a Nash equilibrium that maximizes each team member's utility. As the TME is unique in general, it avoids the equilibrium selection problem. However, it is still difficult (FNP-hard) to find a TME. Computing it can be formulated as a non-convex program, but existing algorithms can solve this program only for very small games. In this paper, we first refine the complexity result for computing a TME by using a correlation plan, showing that a TME can be found in polynomial time in a specific class of games according to our complexity boundary. Second, we propose an efficient correlation-based algorithm to solve the non-convex program for the TME in games not belonging to this class. The algorithm combines two special correlation plans based on McCormick envelopes for convex relaxation and the von Stengel-Forges polytope for correlated equilibria. We show that restricting the feasible solution space to the von Stengel-Forges polytope strictly reduces the feasible solution space after convex relaxation of the nonlinear terms. Finally, experiments show that our algorithm is about four orders of magnitude faster than the prior state of the art and can solve many previously unsolvable games.
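McCormick envelopes, used above for convex relaxation, replace a bilinear term w = x·y with four linear inequalities derived from the box bounds on x and y. A standard textbook sketch of those bounds (not the paper's full relaxation over the von Stengel-Forges polytope):

```python
def mccormick_bounds(x, y, xl, xu, yl, yu):
    """Linear lower/upper envelopes for w = x*y with x in [xl, xu] and
    y in [yl, yu]; any feasible w must lie between them."""
    lower = max(xl * y + x * yl - xl * yl,
                xu * y + x * yu - xu * yu)
    upper = min(xu * y + x * yl - xu * yl,
                xl * y + x * yu - xl * yu)
    return lower, upper
```

On the unit box with x = y = 0.5, the envelope gives [0.0, 0.5], which contains the true product 0.25; the gap left by this relaxation is what restricting solutions to the von Stengel-Forges polytope then tightens.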

NeurIPS Conference 2022 Conference Paper

Deep Attentive Belief Propagation: Integrating Reasoning and Learning for Solving Constraint Optimization Problems

  • Yanchen Deng
  • Shufeng Kong
  • Caihua Liu
  • Bo An

Belief Propagation (BP) is an important message-passing algorithm for various reasoning tasks over graphical models, including solving Constraint Optimization Problems (COPs). It has been shown that BP can achieve state-of-the-art performance on various benchmarks by mixing old and new messages before sending the new one, i.e., damping. However, existing methods for tuning a static damping factor for BP are not only laborious but also harm performance. Moreover, existing BP algorithms treat each variable node's neighbors equally when composing a new message, which also limits their exploration ability. To address these issues, we seamlessly integrate BP, Gated Recurrent Units (GRUs), and Graph Attention Networks (GATs) within the message-passing framework to reason about dynamic weights and damping factors for composing new BP messages. Our model, Deep Attentive Belief Propagation (DABP), takes the factor graph and the BP messages in each iteration as input and infers the optimal weights and damping factors through GRUs and GATs, followed by a multi-head attention layer. Furthermore, unlike existing neural-based BP variants, we propose a novel self-supervised learning algorithm for DABP with a smoothed solution cost, which does not require expensive training labels and also avoids the common out-of-distribution issue through efficient online learning. Extensive experiments show that our model significantly outperforms state-of-the-art baselines.

IS Journal 2022 Journal Article

Deep Reinforcement Learning for Quantitative Trading: Challenges and Opportunities

  • Bo An
  • Shuo Sun
  • Rundong Wang

Quantitative trading (QT) has been a popular topic in both academia and the financial industry since the 1970s. In the last decade, deep reinforcement learning (DRL) has garnered significant research interest with stellar performance in solving complex sequential decision-making problems, such as Go and video games. The impact of DRL is pervasive, recently demonstrating its ability to conquer some challenging QT tasks. In this article, we outline several key challenges and opportunities that manifest in DRL-based QT to shed light on future research in this field.

AAAI Conference 2022 Conference Paper

GearNet: Stepwise Dual Learning for Weakly Supervised Domain Adaptation

  • Renchunzi Xie
  • Hongxin Wei
  • Lei Feng
  • Bo An

This paper studies the weakly supervised domain adaptation (WSDA) problem, where we only have access to a source domain with noisy labels, from which we need to transfer useful information to an unlabeled target domain. Although there have been a few studies on this problem, most of them exploit only unidirectional relationships from the source domain to the target domain. In this paper, we propose a universal paradigm called GearNet to exploit bilateral relationships between the two domains. Specifically, we take the two domains as different inputs to train two models alternately, and a symmetrical Kullback-Leibler loss is used to selectively match the predictions of the two models in the same domain. This interactive learning schema enables implicit label-noise canceling and exploits correlations between the source and target domains. Therefore, GearNet has great potential to boost the performance of a wide range of existing WSDA methods. Comprehensive experimental results show that the performance of existing methods can be significantly improved by equipping them with our GearNet.

NeurIPS Conference 2022 Conference Paper

Generalizing Consistent Multi-Class Classification with Rejection to be Compatible with Arbitrary Losses

  • Yuzhou Cao
  • Tianchi Cai
  • Lei Feng
  • Lihong Gu
  • Jinjie Gu
  • Bo An
  • Gang Niu
  • Masashi Sugiyama

Classification with rejection (CwR) refrains from making a prediction to avoid critical misclassification when encountering test samples that are difficult to classify. Though previous methods for CwR come with theoretical guarantees, they are compatible only with certain loss functions, making them insufficiently flexible when the loss needs to change with the dataset in practice. In this paper, we derive a novel formulation for CwR that can be equipped with arbitrary loss functions while maintaining the theoretical guarantees. First, we show that K-class CwR is equivalent to a (K+1)-class classification problem on the original data distribution with an augmented class, and propose an empirical risk minimization formulation to solve this problem with an estimation error bound. Then, we find necessary and sufficient conditions for the learning consistency of the surrogates constructed from our proposed formulation equipped with any classification-calibrated multi-class loss, where consistency means that surrogate risk minimization implies target risk minimization for CwR. Finally, experiments on benchmark datasets validate the effectiveness of our proposed method.

AAMAS Conference 2022 Conference Paper

Mis-spoke or mis-lead: Achieving Robustness in Multi-Agent Communicative Reinforcement Learning

  • Wanqi Xue
  • Wei Qiu
  • Bo An
  • Zinovi Rabinovich
  • Svetlana Obraztsova
  • Chai Kiat Yeo

Recent studies in multi-agent communicative reinforcement learning (MACRL) have demonstrated that multi-agent coordination can be greatly improved by allowing communication between agents. Meanwhile, adversarial machine learning (ML) has shown that ML models are vulnerable to attacks. Despite the increasing concern about the robustness of ML algorithms, how to achieve robust communication in multi-agent reinforcement learning has been largely neglected. In this paper, we systematically explore the problem of adversarial communication in MACRL. Our main contributions are threefold. First, we propose an effective method to perform attacks in MACRL, by learning a model to generate optimal malicious messages. Second, we develop a defence method based on message reconstruction, to maintain multi-agent coordination under message attacks. Third, we formulate the adversarial communication problem as a two-player zero-sum game and propose a game-theoretical method ℜ-MACRL to improve the worst-case defending performance. Empirical results demonstrate that many state-of-the-art MACRL methods are vulnerable to message attacks, and our method can significantly improve their robustness.

AAAI Conference 2022 Conference Paper

NSGZero: Efficiently Learning Non-exploitable Policy in Large-Scale Network Security Games with Neural Monte Carlo Tree Search

  • Wanqi Xue
  • Bo An
  • Chai Kiat Yeo

How resources are deployed to secure critical targets in networks can be modelled by Network Security Games (NSGs). While recent advances in deep learning (DL) provide a powerful approach to dealing with large-scale NSGs, DL methods such as NSG-NFSP suffer from the problem of data inefficiency. Furthermore, due to centralized control, they cannot scale to scenarios with a large number of resources. In this paper, we propose a novel DL-based method, NSGZero, to learn a non-exploitable policy in NSGs. NSGZero improves data efficiency by performing planning with neural Monte Carlo Tree Search (MCTS). Our main contributions are threefold. First, we design deep neural networks (DNNs) to perform neural MCTS in NSGs. Second, we enable neural MCTS with decentralized control, making NSGZero applicable to NSGs with many resources. Third, we provide an efficient learning paradigm, to achieve joint training of the DNNs in NSGZero. Compared to state-of-the-art algorithms, our method achieves significantly better data efficiency and scalability.

AAMAS Conference 2022 Conference Paper

Online Collective Multiagent Planning by Offline Policy Reuse with Applications to City-Scale Mobility-on-Demand Systems

  • Wanyuan Wang
  • Gerong Wu
  • Weiwei Wu
  • Yichuan Jiang
  • Bo An

The popularity of mobility-on-demand (MoD) systems boosts the need for online collective multiagent planning, where spatially distributed servicing agents are planned to meet dynamically arriving demands. For city-scale MoDs with a population of agents, it is necessary to find a balance between computation time (i.e., running in real time) and solution quality (i.e., the number of demands served). Directly using an offline policy can guarantee real-time performance, but cannot be dynamically adjusted to the actual agent and demand distributions. On the other hand, search-based online planning methods are adaptive. However, they are computationally expensive and cannot scale up. In this paper, we propose a principled online multiagent planning method, which reuses and improves the offline policy in an anytime manner. We first model MoDs as a collective Markov Decision Process (C-MDP) where the historical collective behavior of agents affects the joint reward. We propose a novel state value function to evaluate the policy, and a gradient ascent (GA) technique to improve the policy. We show that GA-based policy iteration (GA-PI) on the local policy can converge. Finally, given real-time information, the offline policy is used as the default plan and GA-PI is used to improve it and generate an online plan. Experimentally, the proposed offline policy reuse method significantly outperforms standard online multiagent planning methods on MoD systems like ride-sharing and security traffic patrolling in terms of computation time and solution quality.

NeurIPS Conference 2022 Conference Paper

Out-of-Distribution Detection with An Adaptive Likelihood Ratio on Informative Hierarchical VAE

  • Yewen Li
  • Chaojie Wang
  • Xiaobo Xia
  • Tongliang Liu
  • Xin Miao
  • Bo An

Unsupervised out-of-distribution (OOD) detection is essential for the reliability of machine learning. In the literature, existing work has shown that higher-level semantics captured by hierarchical VAEs can be used to detect OOD instances. However, we empirically show that the inherent issue of hierarchical VAEs, i.e., "posterior collapse", would seriously limit their capacity for OOD detection. Based on a thorough analysis of "posterior collapse", we propose a novel informative hierarchical VAE to alleviate this issue by enhancing the connections between the data sample and its multi-layer stochastic latent representations during training. Furthermore, we propose a novel score function for unsupervised OOD detection, referred to as Adaptive Likelihood Ratio. With this score function, one can selectively aggregate the semantic information on multiple hidden layers of hierarchical VAEs, leading to a strong separability between in-distribution and OOD samples. Experimental results demonstrate that our method can significantly outperform existing state-of-the-art unsupervised OOD detection approaches.

AAAI Conference 2022 Conference Paper

Pretrained Cost Model for Distributed Constraint Optimization Problems

  • Yanchen Deng
  • Shufeng Kong
  • Bo An

Distributed Constraint Optimization Problems (DCOPs) are an important subclass of combinatorial optimization problems, where information and controls are distributed among multiple autonomous agents. Previously, Machine Learning (ML) has been largely applied to solve combinatorial optimization problems by learning effective heuristics. However, existing ML-based heuristic methods are often not generalizable to different search algorithms. Most importantly, these methods usually require full knowledge about the problems to be solved, which makes them unsuitable for distributed settings where centralization is not realistic due to geographical limitations or privacy concerns. To address the generality issue, we propose a novel directed acyclic graph representation schema for DCOPs and leverage the Graph Attention Networks (GATs) to embed graph representations. Our model, GAT-PCM, is then pretrained with optimally labelled data in an offline manner, so as to construct effective heuristics to boost a broad range of DCOP algorithms where evaluating the quality of a partial assignment is critical, such as local search or backtracking search. Furthermore, to enable decentralized model inference, we propose a distributed embedding schema of GAT-PCM where each agent exchanges only embedded vectors, and show its soundness and complexity. Finally, we demonstrate the effectiveness of our model by combining it with a local search or a backtracking search algorithm. Extensive empirical evaluations indicate that the GAT-PCM-boosted algorithms significantly outperform the state-of-the-art methods in various benchmarks.

TMLR Journal 2022 Journal Article

SemiNLL: A Framework of Noisy-Label Learning by Semi-Supervised Learning

  • Zhuowei Wang
  • Jing Jiang
  • Bo Han
  • Lei Feng
  • Bo An
  • Gang Niu
  • Guodong Long

Deep learning with noisy labels is a challenging task, which has received much attention from the machine learning and computer vision communities. Recent prominent methods that build on a specific sample selection (SS) strategy and a specific semi-supervised learning (SSL) model achieved state-of-the-art performance. Intuitively, better performance could be achieved if stronger SS strategies and SSL models are employed. Following this intuition, one might easily derive various effective noisy-label learning methods using different combinations of SS strategies and SSL models, which is, however, simply reinventing the wheel in essence. To prevent this problem, we propose SemiNLL, a versatile framework that investigates how to naturally combine different SS and SSL components based on their effects and efficiencies. We conduct a systematic and detailed analysis of the combinations of possible components based on our framework. Our framework can absorb various SS strategies and SSL backbones, utilizing their power to achieve promising performance. The instantiations of our framework demonstrate substantial improvements over state-of-the-art methods on benchmark-simulated and real-world datasets with noisy labels.

IJCAI Conference 2021 Conference Paper

CFR-MIX: Solving Imperfect Information Extensive-Form Games with Combinatorial Action Space

  • Shuxin Li
  • Youzhi Zhang
  • Xinrun Wang
  • Wanqi Xue
  • Bo An

In many real-world scenarios, a team of agents must coordinate with each other to compete against an opponent. The challenge of solving this type of game is that the team's joint action space grows exponentially with the number of agents, which results in the inefficiency of the existing algorithms, e.g., Counterfactual Regret Minimization (CFR). To address this problem, we propose a new framework of CFR: CFR-MIX. Firstly, we propose a new strategy representation that represents a joint action strategy using individual strategies of all agents and a consistency relationship to maintain the cooperation between agents. To compute the equilibrium with individual strategies under the CFR framework, we transform the consistency relationship between strategies to the consistency relationship between the cumulative regret values. Furthermore, we propose a novel decomposition method over cumulative regret values to guarantee the consistency relationship between the cumulative regret values. Finally, we introduce our new algorithm CFR-MIX which employs a mixing layer to estimate cumulative regret values of joint actions as a non-linear combination of cumulative regret values of individual actions. Experimental results show that CFR-MIX outperforms existing algorithms on various games significantly.

AAAI Conference 2021 Conference Paper

Commission Fee is not Enough: A Hierarchical Reinforced Framework for Portfolio Management

  • Rundong Wang
  • Hongxin Wei
  • Bo An
  • Zhouyan Feng
  • Jun Yao

Portfolio management via reinforcement learning is at the forefront of fintech research, which explores how to optimally reallocate a fund into different financial assets over the long term by trial-and-error. Existing methods are impractical since they usually assume each reallocation can be finished immediately and thus ignore the price slippage that is part of the trading cost. To address these issues, we propose a hierarchical reinforced stock trading system for portfolio management (HRPM). Concretely, we decompose the trading process into a hierarchy of portfolio management over trade execution and train the corresponding policies. The high-level policy gives portfolio weights at a lower frequency to maximize the long-term profit and invokes the low-level policy to sell or buy the corresponding shares within a short time window at a higher frequency to minimize the trading cost. We train the two levels of policies via a pre-training scheme and an iterative training scheme for data efficiency. Extensive experimental results in the U.S. market and the China market demonstrate that HRPM achieves significant improvement against many state-of-the-art approaches.

AAAI Conference 2021 Conference Paper

Complexity and Algorithms for Exploiting Quantal Opponents in Large Two-Player Games

  • David Milec
  • Jakub Černý
  • Viliam Lisý
  • Bo An

Solution concepts of traditional game theory assume entirely rational players; therefore, their ability to exploit subrational opponents is limited. One type of subrationality that describes human behavior well is the quantal response. While there exist algorithms for computing solutions against quantal opponents, they either do not scale or may provide strategies that are even worse than the entirely-rational Nash strategies. This paper aims to analyze and propose scalable algorithms for computing effective and robust strategies against a quantal opponent in normal-form and extensive-form games. Our contributions are: (1) we define two different solution concepts related to exploiting quantal opponents and analyze their properties; (2) we prove that computing these solutions is computationally hard; (3) therefore, we evaluate several heuristic approximations based on scalable counterfactual regret minimization (CFR); and (4) we identify a CFR variant that exploits the bounded opponents better than the previously used variants while being less exploitable by the worst-case perfectly-rational opponent.
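The quantal response model of subrationality discussed above is usually the logit form: action probabilities proportional to exponentiated utilities, governed by a rationality parameter. A minimal sketch under our own naming (the paper's algorithms build on this model rather than on this code):

```python
import numpy as np

def quantal_response(utilities, lam=1.0):
    """Logit quantal response: choose action i with probability
    proportional to exp(lam * u_i). lam = 0 gives uniform random play;
    lam -> infinity approaches a perfectly rational best response."""
    z = lam * np.asarray(utilities, dtype=float)
    z -= z.max()                      # subtract max for numerical stability
    p = np.exp(z)
    return p / p.sum()
```

Because every action keeps nonzero probability, a quantal opponent "leaks" utility on dominated actions, which is exactly what exploitation algorithms target.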

AAAI Conference 2021 Conference Paper

Computing Ex Ante Coordinated Team-Maxmin Equilibria in Zero-Sum Multiplayer Extensive-Form Games

  • Youzhi Zhang
  • Bo An
  • Jakub Černý

Computational game theory has many applications in the modern world in both adversarial situations and the optimization of social good. While there exist many algorithms for computing solutions in two-player interactions, finding optimal strategies in multiplayer interactions efficiently remains an open challenge. This paper focuses on computing the multiplayer Team-Maxmin Equilibrium with Coordination device (TMECor) in zero-sum extensive-form games. TMECor models scenarios when a team of players coordinates ex ante against an adversary. Such situations can be found in card games (e.g., in Bridge and Poker), when a team works together to beat a target player but communication is prohibited; and also in the real world, e.g., in forest-protection operations, when coordinated groups have limited contact during interdicting illegal loggers. The existing algorithms struggle to find a TMECor efficiently because of their high computational costs. To compute a TMECor in larger games, we make the following key contributions: (1) we propose a hybrid-form strategy representation for the team, which preserves the set of equilibria; (2) we introduce a column-generation algorithm with a guaranteed finite-time convergence in the infinite strategy space based on a novel best-response oracle; (3) we develop an associated-representation technique for the exact representation of the multilinear terms in the best-response oracle; and (4) we experimentally show that our algorithm is several orders of magnitude faster than prior state-of-the-art algorithms in large games.

AAAI Conference 2021 Conference Paper

Computing Quantal Stackelberg Equilibrium in Extensive-Form Games

  • Jakub Černý
  • Viliam Lisý
  • Branislav Bošanský
  • Bo An

Deployments of game-theoretic solution concepts in the real world have highlighted the necessity to consider human opponents’ boundedly rational behavior. If subrationality is not addressed, the system can face significant losses in terms of expected utility. While there exist algorithms for computing optimal strategies to commit to when facing subrational decision-makers in one-shot interactions, these algorithms cannot be generalized for solving sequential scenarios because of the inherent curse of strategy-space dimensionality in sequential games and because humans act subrationally at each decision point separately. We study optimal strategies to commit to against subrational opponents in sequential games for the first time and make the following key contributions: (1) we prove the problem is NP-hard in general; (2) to enable further analysis, we introduce a non-fractional reformulation of the direct non-concave representation of the equilibrium; (3) we identify conditions under which the problem can be approximated in polynomial time in the size of the representation; (4) we show how an MILP can approximate the reformulation with a guaranteed bounded error, and (5) we experimentally demonstrate that our algorithm provides higher quality results several orders of magnitude faster than a baseline method for general non-linear optimization.

IS Journal 2021 Journal Article

Embedding-Augmented Generalized Matrix Factorization for Recommendation With Implicit Feedback

  • Lei Feng
  • Hongxin Wei
  • Qingyu Guo
  • Zhuoyi Lin
  • Bo An

Learning effective representations of users and items is crucially important to recommendation with implicit feedback. Matrix factorization is the basic idea to derive the representations of users and items by decomposing the given interaction matrix. However, existing matrix factorization based approaches share the limitation in that the interaction between user embedding and item embedding is only weakly enforced by fitting the given individual rating value, which may lose potentially useful information. In this article, we propose a novel augmented generalized matrix factorization approach that is able to incorporate the historical interaction information of users and items for learning effective representations of users and items. Despite the simplicity of our proposed approach, extensive experiments on four public implicit feedback datasets demonstrate that our approach outperforms state-of-the-art counterparts. Furthermore, the ablation study demonstrates that by using the historical interactions to enrich user embedding and item embedding for generalized matrix factorization, better performance, faster convergence, and lower training loss can be achieved.
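As a toy illustration of the generalized-matrix-factorization scoring idea (an element-wise product of user and item embeddings passed through a learned weighting and a sigmoid), here is a sketch under our own simplified notation. The proposed model additionally enriches the embeddings with historical interactions, which is not shown:

```python
import numpy as np

def gmf_score(user_emb, item_emb, weights, bias=0.0):
    """Generalized matrix factorization: score a user-item pair via a
    weighted element-wise product of embeddings through a sigmoid.
    With weights all ones and zero bias, this reduces to a plain
    dot-product matrix factorization inside a sigmoid."""
    z = float(np.dot(weights, user_emb * item_emb)) + bias
    return 1.0 / (1.0 + np.exp(-z))

u = np.array([1.0, 2.0])   # hypothetical user embedding
v = np.array([0.5, 0.5])   # hypothetical item embedding
score = gmf_score(u, v, np.ones(2))
```

The learned per-dimension weights are what make the interaction "generalized" relative to vanilla matrix factorization, where every latent dimension contributes equally.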

IJCAI Conference 2021 Conference Paper

Neural Regret-Matching for Distributed Constraint Optimization Problems

  • Yanchen Deng
  • Runsheng Yu
  • Xinrun Wang
  • Bo An

Distributed constraint optimization problems (DCOPs) are a powerful model for multi-agent coordination and optimization, where information and controls are distributed among multiple agents by nature. Sampling-based algorithms are important incomplete techniques for solving medium-scale DCOPs. However, they use tables to exactly store all the information (e.g., costs, confidence bounds) to facilitate sampling, which limits their scalability. This paper tackles the limitation by incorporating deep neural networks in solving DCOPs for the first time and presents a neural-based sampling scheme built upon regret-matching. In the algorithm, each agent trains a neural network to approximate the regret related to its local problem and performs sampling according to the estimated regret. Furthermore, to ensure exploration, we propose a regret rounding scheme that rounds small regret values to positive numbers. We theoretically show the regret bound of our algorithm and extensive evaluations indicate that our algorithm can scale up to large-scale DCOPs and significantly outperform the state-of-the-art methods.
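The regret-based sampling with rounding can be sketched roughly as follows. The exact rounding rule in the paper may differ; treat the floor `eps` and the clipping as our illustrative assumptions:

```python
import numpy as np

def sampling_distribution(regrets, eps=1e-3):
    """Regret-matching-style sampling: probabilities proportional to
    regrets, with small or negative regrets rounded up to a positive
    floor eps so that every assignment keeps a nonzero probability
    (the exploration guarantee mentioned above)."""
    r = np.maximum(np.asarray(regrets, dtype=float), eps)
    return r / r.sum()
```

Without the rounding, an assignment whose estimated regret collapses to zero would never be sampled again, even if the neural estimate is wrong.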

NeurIPS Conference 2021 Conference Paper

Open-set Label Noise Can Improve Robustness Against Inherent Label Noise

  • Hongxin Wei
  • Lue Tao
  • Renchunzi Xie
  • Bo An

Learning with noisy labels is a practically challenging problem in weakly supervised learning. In the existing literature, open-set noises are always considered to be poisonous for generalization, similar to closed-set noises. In this paper, we empirically show that open-set noisy labels can be non-toxic and even benefit the robustness against inherent noisy labels. Inspired by the observations, we propose a simple yet effective regularization by introducing Open-set samples with Dynamic Noisy Labels (ODNL) into training. With ODNL, the extra capacity of the neural network can be largely consumed in a way that does not interfere with learning patterns from clean data. Through the lens of SGD noise, we show that the noises induced by our method are random-direction, conflict-free and biased, which may help the model converge to a flat minimum with superior stability and enforce the model to produce conservative predictions on Out-of-Distribution instances. Extensive experimental results on benchmark datasets with various types of noisy labels demonstrate that the proposed method not only enhances the performance of many existing robust algorithms but also achieves significant improvement on Out-of-Distribution detection tasks even in the label noise setting.
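The core mechanic of the ODNL regularizer, as described, is to relabel auxiliary open-set samples with fresh random labels each epoch. A schematic sketch (the function name and per-epoch usage are our assumptions, not the authors' code):

```python
import numpy as np

def draw_dynamic_labels(num_open_samples, num_classes, rng):
    """Assign each open-set auxiliary sample a uniformly random label.
    Redrawn every epoch, so the same sample receives a different
    (dynamic) noisy label over the course of training."""
    return rng.integers(0, num_classes, size=num_open_samples)

rng = np.random.default_rng(0)
labels = draw_dynamic_labels(8, 10, rng)  # one epoch's labels for 8 open-set samples
```

Because the labels keep changing, the network cannot memorize the open-set samples, which is how the extra capacity gets consumed without interfering with the clean data.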

AAAI Conference 2021 Conference Paper

Personalized Adaptive Meta Learning for Cold-start User Preference Prediction

  • Runsheng Yu
  • Yu Gong
  • Xu He
  • Yu Zhu
  • Qingwen Liu
  • Wenwu Ou
  • Bo An

A common challenge in personalized user preference prediction is the cold-start problem. Due to the lack of user-item interactions, directly learning from the new users’ log data causes a serious over-fitting problem. Recently, many existing studies regard the cold-start personalized preference prediction as a few-shot learning problem, where each user is the task and recommended items are the classes, and the gradient-based meta learning method (MAML) is leveraged to address this challenge. However, in real-world applications, the users are not uniformly distributed (i.e., different users may have different browsing histories, recommended items, and user profiles. We define the major users as the users in the groups with large numbers of users sharing similar user information, and other users are the minor users), and existing MAML approaches tend to fit the major users and ignore the minor users. To address this cold-start task-overfitting problem, we propose a novel personalized adaptive meta learning approach to consider both the major and the minor users with three key contributions: 1) We are the first to present a personalized adaptive learning rate meta-learning approach to improve the performance of MAML by focusing on both the major and minor users. 2) To provide better personalized learning rates for each user, we introduce a similarity-based method to find similar users as a reference and a tree-based method to store users’ features for fast search. 3) To reduce the memory usage, we design a memory-agnostic regularizer to further reduce the space complexity to constant while maintaining the performance. Experiments on MovieLens, BookCrossing, and real-world production datasets reveal that our method outperforms the state-of-the-art methods dramatically for both the minor and major users.

NeurIPS Conference 2021 Conference Paper

RMIX: Learning Risk-Sensitive Policies for Cooperative Reinforcement Learning Agents

  • Wei Qiu
  • Xinrun Wang
  • Runsheng Yu
  • Rundong Wang
  • Xu He
  • Bo An
  • Svetlana Obraztsova
  • Zinovi Rabinovich

Current value-based multi-agent reinforcement learning methods optimize individual Q values to guide individuals' behaviours via centralized training with decentralized execution (CTDE). However, such an expected, i.e., risk-neutral, Q value is not sufficient even with CTDE due to the randomness of rewards and the uncertainty in environments, which causes these methods to fail to train coordinating agents in complex environments. To address these issues, we propose RMIX, a novel cooperative MARL method with the Conditional Value at Risk (CVaR) measure over the learned distributions of individuals' Q values. Specifically, we first learn the return distributions of individuals to analytically calculate CVaR for decentralized execution. Then, to handle the temporal nature of the stochastic outcomes during executions, we propose a dynamic risk level predictor for risk level tuning. Finally, we optimize the CVaR policies with CVaR values used to estimate the target in TD error during centralized training, and the CVaR values are used as auxiliary local rewards to update the local distribution via Quantile Regression loss. Empirically, we show that our method outperforms many state-of-the-art methods on various multi-agent risk-sensitive navigation scenarios and challenging StarCraft II cooperative tasks, demonstrating enhanced coordination and revealing improved sample efficiency.
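The CVaR risk measure at the heart of RMIX can be illustrated on sampled returns. This is an empirical sketch with our own naming; RMIX itself computes CVaR analytically from learned quantile distributions:

```python
import numpy as np

def cvar(returns, alpha=0.1):
    """Empirical Conditional Value at Risk at level alpha: the mean of
    the worst alpha-fraction of sampled returns (the lower tail)."""
    r = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(r))))  # number of tail samples
    return float(r[:k].mean())
```

A risk-neutral agent optimizes the mean of `returns`; optimizing `cvar` instead makes the policy sensitive to the worst outcomes, which is the point of replacing expected Q values.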

IJCAI Conference 2021 Conference Paper

Solving Large-Scale Extensive-Form Network Security Games via Neural Fictitious Self-Play

  • Wanqi Xue
  • Youzhi Zhang
  • Shuxin Li
  • Xinrun Wang
  • Bo An
  • Chai Kiat Yeo

Securing networked infrastructures is important in the real world. The problem of deploying security resources to protect against an attacker in networked domains can be modeled as Network Security Games (NSGs). Unfortunately, existing approaches, including the deep learning-based approaches, are inefficient for solving large-scale extensive-form NSGs. In this paper, we propose a novel learning paradigm, NSG-NFSP, to solve large-scale extensive-form NSGs based on Neural Fictitious Self-Play (NFSP). Our main contributions include: i) reforming the best response (BR) policy network in NFSP to be a mapping from action-state pair to action-value, to make the calculation of BR possible in NSGs; ii) converting the average policy network of an NFSP agent into a metric-based classifier, helping the agent to assign distributions only on legal actions rather than all actions; iii) enabling NFSP with high-level actions, which can benefit training efficiency and stability in NSGs; and iv) leveraging information contained in graphs of NSGs by learning efficient graph node embeddings. Our algorithm significantly outperforms state-of-the-art algorithms in both scalability and solution quality.

JAAMAS Journal 2021 Journal Article

Utility distribution matters: enabling fast belief propagation for multi-agent optimization with dense local utility function

  • Yanchen Deng
  • Bo An

Belief propagation algorithms including Max-sum and its variants are important methods for multi-agent optimization. However, they face a significant scalability challenge as the computational overhead grows exponentially with respect to the arity of each utility function. To date, a number of acceleration algorithms for belief propagation algorithms have been proposed. These algorithms maintain a lower bound on total utility and employ either a domain pruning technique or branch and bound to reduce the search space. However, these algorithms still suffer from low-quality bounds and the inability to filter out suboptimal tied entries. In this paper, we first show that these issues are exacerbated and can considerably degenerate the performance of the state-of-the-art methods when dealing with problems with dense utility functions, which widely exist in many real-world domains. Built on this observation, we then develop several novel acceleration algorithms that alleviate the effect of densely distributed local utility values from the perspectives of both bound quality and search space organization. Specifically, we build a search tree for each distinct local utility value to enable efficient branch and bound on tied entries and tighten a running lower bound to perform dynamic domain pruning. That is, we integrate both search and pruning to iteratively reduce the search space. Besides, we propose a discretization mechanism to offer a tradeoff between the reconstruction overhead and the pruning efficiency. Finally, a K-depth partial tree-sorting scheme with different sorting criteria is proposed to reduce the memory consumption. We demonstrate the superiority of our algorithms over the state-of-the-art acceleration algorithms from both theoretical and experimental perspectives.

IJCAI Conference 2020 Conference Paper

Can Cross Entropy Loss Be Robust to Label Noise?

  • Lei Feng
  • Senlin Shu
  • Zhuoyi Lin
  • Fengmao Lv
  • Li Li
  • Bo An

Trained with the standard cross entropy loss, deep neural networks can achieve great performance on correctly labeled data. However, if the training data is corrupted with label noise, deep models tend to overfit the noisy labels, thereby achieving poor generalization performance. To remedy this issue, several loss functions have been proposed and demonstrated to be robust to label noise. Although most of the robust loss functions stem from Categorical Cross Entropy (CCE) loss, they fail to embody the intrinsic relationships between CCE and other loss functions. In this paper, we propose a general framework dubbed Taylor cross entropy loss to train deep models in the presence of label noise. Specifically, our framework enables us to weight the extent of fitting the training labels by controlling the order of the Taylor Series for CCE, hence it can be robust to label noise. In addition, our framework clearly reveals the intrinsic relationships between CCE and other loss functions, such as Mean Absolute Error (MAE) and Mean Squared Error (MSE). Moreover, we present a detailed theoretical analysis to certify the robustness of this framework. Extensive experimental results on benchmark datasets demonstrate that our proposed approach significantly outperforms the state-of-the-art counterparts.
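The Taylor-expansion idea can be sketched directly: truncating the series for -log(p) around p = 1 interpolates between an MAE-like loss (order 1) and full cross entropy (order → ∞). A minimal sketch with our own naming:

```python
def taylor_cross_entropy(p_true_class, order):
    """Order-t Taylor truncation of -log(p) around p = 1:
    -log(p) = sum_{k=1}^{t} (1 - p)^k / k + O((1 - p)^{t+1}).
    order=1 yields the MAE-like loss (1 - p); increasing the order
    recovers categorical cross entropy in the limit."""
    p = float(p_true_class)
    return sum((1.0 - p) ** k / k for k in range(1, order + 1))
```

Low orders flatten the penalty on confidently wrong samples (which are often the noisily labeled ones), which is the mechanism behind the claimed robustness.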

AAAI Conference 2020 Conference Paper

Computing Team-Maxmin Equilibria in Zero-Sum Multiplayer Extensive-Form Games

  • Youzhi Zhang
  • Bo An

The study of finding the equilibrium for multiplayer games is challenging. This paper focuses on computing Team-Maxmin Equilibria (TMEs) in zero-sum multiplayer Extensive-Form Games (EFGs), which describes the optimal strategies for a team of players who share the same goal but take actions independently against an adversary. TMEs can capture many realistic scenarios, including: 1) a team of players plays against a target player in poker games; and 2) defense resources schedule and patrol independently in security games. However, the study of efficiently finding TMEs within any given accuracy in EFGs is almost completely unexplored. To fill this gap, we first study the inefficiency caused by computing the equilibrium where team players correlate their strategies and then transforming it into the mixed strategy profile of the team, and show that this inefficiency can be arbitrarily large. Second, to efficiently solve the non-convex program for finding TMEs directly, we develop the Associated Recursive Asynchronous Multiparametric Disaggregation Technique (ARAMDT) to approximate multilinear terms in the program with two novel techniques: 1) an asynchronous precision method to reduce the number of constraints and variables for approximation by using different precision levels to approximate these terms; and 2) an associated constraint method to reduce the feasible solution space of the mixed-integer linear program resulting from ARAMDT by exploiting the relation between these terms. Third, we develop a novel iterative algorithm to efficiently compute TMEs within any given accuracy based on ARAMDT. Our algorithm is orders of magnitude faster than baselines in the experimental evaluation.

IJCAI Conference 2020 Conference Paper

Dinkelbach-Type Algorithm for Computing Quantal Stackelberg Equilibrium

  • Jakub Cerny
  • Viliam Lisý
  • Branislav Bošanský
  • Bo An

Stackelberg security games (SSGs) have been deployed in many real-world situations to optimally allocate scarce resources to protect targets against attackers. However, actual human attackers are not perfectly rational and there are several behavior models that attempt to predict subrational behavior. Quantal response is among the most commonly used such models and Quantal Stackelberg Equilibrium (QSE) describes the optimal strategy to commit to when facing a subrational opponent. Non-concavity makes computing QSE computationally challenging and while there exist algorithms for computing QSE for SSGs, they cannot be directly used for solving an arbitrary game in the normal form. We (1) present a transformation of the primal problem for computing QSE using Dinkelbach's method for any general-sum normal-form game, (2) provide a gradient-based and a MILP-based algorithm, give the convergence criteria, and bound their error, and finally (3) experimentally demonstrate that using our novel transformation, a QSE can be closely approximated several orders of magnitude faster.
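Dinkelbach's method itself, independent of the game-theoretic transformation above, turns a ratio maximization into a sequence of parametric subproblems. A generic sketch over a finite candidate set (the paper instead works with continuous strategy spaces and gradient/MILP oracles):

```python
def dinkelbach_argmax_ratio(candidates, f, g, tol=1e-9, max_iter=100):
    """Dinkelbach's method for max f(x)/g(x) over a finite candidate set,
    assuming g > 0 on all candidates. Each iteration solves the parametric
    problem max_x f(x) - lam * g(x) and updates lam to the achieved ratio;
    lam increases monotonically to the optimal ratio, and the method stops
    when the parametric optimum reaches zero."""
    lam = 0.0
    for _ in range(max_iter):
        x = max(candidates, key=lambda c: f(c) - lam * g(c))
        value = f(x) - lam * g(x)   # parametric objective at the old lam
        lam = f(x) / g(x)           # update lam to the achieved ratio
        if abs(value) < tol:
            break
    return x, lam
```

Each subproblem is linear in f and g, which is what makes the otherwise non-concave ratio objective tractable.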

JAAMAS Journal 2020 Journal Article

Electric vehicle charging strategy study and the application on charging station placement

  • Yanhai Xiong
  • Bo An
  • Sarit Kraus

Optimal placement of charging stations for electric vehicles (EVs) is critical for providing convenient charging service to EV owners and promoting public acceptance of EVs. There has been a lot of work on EV charging station placement, yet EV drivers' charging strategy, which plays an important role in determining charging stations' performance, is missing. EV drivers choose among charging stations according to various factors, including the distance, the charging fare, and the queuing conditions at different stations. In turn, some factors, like queuing conditions, are greatly influenced by EV drivers' choices: as more EVs visit the same station, longer queuing durations should be expected. This work first proposes a behavior model to capture the decision making of EV drivers in choosing charging stations, based on which an optimal charging station placement model is presented to minimize the social cost (defined as the congestion in charging stations suffered by all EV drivers). By analyzing EV drivers' decision making in the charging process, we propose a k-Level nested Quantal Response Equilibrium charging behavior model inspired by the Quantal Response Equilibrium model and the level-k thinking model. We then design a set of user studies to simulate charging scenarios and collect data from human players to learn the parameters of different behavior models. Experimental results show that our charging behavior model better captures the bounded rationality of human players in the charging activity compared with state-of-the-art behavior models. Furthermore, to evaluate the proposed charging behavior model, we formulate the charging station placement problem with it and design an algorithm to solve the problem. Our approach obtains placements with significantly better performance, especially when the budget is limited and relatively low.
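The quantal response model underlying both the QRE and its nested k-level variant assigns choice probabilities proportional to exponentiated utilities. A minimal sketch of that standard choice rule (illustrative only; in the paper the precision parameter λ is learned from human-player data, and the function name here is hypothetical):

```python
import math

def quantal_response(utilities, lam=1.0):
    """Softmax choice rule: P(i) is proportional to exp(lam * u_i).

    lam = 0 yields uniform random choice; as lam grows, the rule
    approaches a perfectly rational best response.
    """
    m = max(utilities)  # shift by the max for numerical stability
    weights = [math.exp(lam * (u - m)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]
```

With equal utilities the choice is uniform; raising λ concentrates probability on the highest-utility option, which is how such models capture bounded rationality.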

IJCAI Conference 2020 Conference Paper

I²HRL: Interactive Influence-based Hierarchical Reinforcement Learning

  • Rundong Wang
  • Runsheng Yu
  • Bo An
  • Zinovi Rabinovich

Hierarchical reinforcement learning (HRL) is a promising approach to solving tasks with long time horizons and sparse rewards. It is often implemented as a high-level policy assigning subgoals to a low-level policy. However, it suffers from the high-level non-stationarity problem, since the low-level policy is constantly changing. The non-stationarity also leads to a data efficiency problem: policies need more data at non-stationary states to stabilize training. To address these issues, we propose a novel HRL method: Interactive Influence-based Hierarchical Reinforcement Learning (I²HRL). First, inspired by agent modeling, we enable interaction between the low-level and high-level policies to stabilize high-level policy training. The high-level policy makes decisions conditioned on the received low-level policy representation as well as the state of the environment. Second, we further stabilize the high-level policy via an information-theoretic regularization with minimal dependence on the changing low-level policy. Third, we propose influence-based exploration to more frequently visit the non-stationary states where more transition data is needed. We experimentally validate the effectiveness of the proposed solution on several tasks in MuJoCo domains, demonstrating that our approach can significantly boost learning performance and accelerate learning compared with state-of-the-art HRL methods.

NeurIPS Conference 2020 Conference Paper

Provably Consistent Partial-Label Learning

  • Lei Feng
  • Jiaqi Lv
  • Bo Han
  • Miao Xu
  • Gang Niu
  • Xin Geng
  • Bo An
  • Masashi Sugiyama

Partial-label learning (PLL) is a multi-class classification problem where each training example is associated with a set of candidate labels. Even though many practical PLL methods have been proposed in the last two decades, a theoretical understanding of their consistency is lacking: none of the PLL methods hitherto possesses a generation process of candidate label sets, and it is thus unclear why such a method works on a specific dataset and when it may fail given a different dataset. In this paper, we propose the first generation model of candidate label sets and develop two PLL methods that are guaranteed to be provably consistent, i.e., one is risk-consistent and the other is classifier-consistent. Our methods are advantageous since they are compatible with any deep network or stochastic optimizer. Furthermore, thanks to the generation model, we are able to answer the two questions above by testing whether the generation model matches given candidate label sets. Experiments on benchmark and real-world datasets validate the effectiveness of the proposed generation model and the two PLL methods.

IJCAI Conference 2020 Conference Paper

Speeding Up Incomplete GDL-based Algorithms for Multi-agent Optimization with Dense Local Utilities

  • Yanchen Deng
  • Bo An

Incomplete GDL-based algorithms, including Max-sum and its variants, are important methods for multi-agent optimization. However, they face a significant scalability challenge, as the computational overhead grows exponentially with the arity of each utility function. The Generic Domain Pruning (GDP) technique reduces the computational effort by performing a one-shot pruning to filter out suboptimal entries. Unfortunately, GDP can perform poorly when dealing with dense local utilities and ties, which widely exist in many domains. In this paper, we present several novel sorting-based acceleration algorithms that alleviate the effect of densely distributed local utilities. Specifically, instead of the one-shot pruning in GDP, we propose to integrate search and pruning to iteratively reduce the search space. Besides, we cope with utility ties by organizing the search space of tied utilities into AND/OR trees to enable branch-and-bound. Finally, we propose a discretization mechanism that offers a tradeoff between reconstruction overhead and pruning efficiency. We demonstrate the superiority of our algorithms over the state of the art from both theoretical and experimental perspectives.

AAAI Conference 2019 Conference Paper

A Memetic Approach for Sequential Security Games on a Plane with Moving Targets

  • Jan Karwowski
  • Jacek Mańdziuk
  • Adam Żychowski
  • Filip Grajek
  • Bo An

This paper introduces a new type of Security Game (SG) played on a plane with targets moving along predefined straight-line trajectories, together with its Mixed Integer Linear Programming (MILP) formulation. Three approaches for solving the game are proposed and experimentally evaluated: application of an MILP solver to finding exact solutions for small-size games, an MILP-based extension of a recently published zero-sum SG approach to general-sum games for finding approximate solutions of medium-size games, and the use of a Memetic Algorithm (MA) for medium-size and large-size game instances, which are beyond MILP's scalability. Utilization of MA is, to the best of our knowledge, a new idea in the field of SG. The novelty of the proposed solution lies specifically in an efficient chromosome-based game encoding and dedicated local improvement heuristics. In the vast majority of test cases with known equilibrium profiles, the method leads to optimal solutions with high stability and approximately linear time scalability. Another advantage is the iteration-based construction of the system, which makes the approach essentially an anytime method. This property is of paramount importance in the case of restrictive time limits, which could hinder the possibility of calculating an exact solution. On a general note, we believe that MA-based methods may offer a viable alternative to MILP solvers for complex games that require approximate solving methods.

AAAI Conference 2019 Conference Paper

Collaboration Based Multi-Label Learning

  • Lei Feng
  • Bo An
  • Shuo He

It is well-known that exploiting label correlations is crucially important to multi-label learning. Most of the existing approaches take label correlations as prior knowledge, which may not correctly characterize the real relationships among labels. Besides, label correlations are normally used to regularize the hypothesis space, while the final predictions are not explicitly correlated. In this paper, we suggest that for each individual label, the final prediction involves the collaboration between its own prediction and the predictions of other labels. Based on this assumption, we first propose a novel method to learn the label correlations via sparse reconstruction in the label space. Then, by seamlessly integrating the learned label correlations into model training, we propose a novel multi-label learning approach that aims to explicitly account for the correlated predictions of labels while training the desired model simultaneously. Extensive experimental results show that our approach outperforms the state-of-the-art counterparts.

AAMAS Conference 2019 Conference Paper

Competitive Bridge Bidding with Deep Neural Networks

  • Jiang Rong
  • Tao Qin
  • Bo An

The game of bridge consists of two stages: bidding and playing. While playing has proved to be relatively easy for computer programs, bidding is very challenging. During the bidding stage, each player, knowing only his/her own cards, needs to exchange information with his/her partner while interfering with the opponents at the same time. Existing methods for solving perfect-information games cannot be directly applied to bidding. Most bridge programs are based on human-designed rules, which, however, cannot cover all situations and are often ambiguous or even conflicting with one another. In this paper, we propose, for the first time, a competitive bidding system based on deep learning techniques, which exhibits two novelties. First, we design a compact representation to encode the private and public information available to a player for bidding. Second, based on an analysis of the impact of other players' unknown cards on one's final rewards, we design two neural networks to deal with imperfect information: the first infers the cards of the partner, and the second takes the outputs of the first as part of its input to select a bid. Experimental results show that our bidding system outperforms the top rule-based program.

IJCAI Conference 2019 Conference Paper

Dynamic Electronic Toll Collection via Multi-Agent Deep Reinforcement Learning with Edge-Based Graph Convolutional Networks

  • Wei Qiu
  • Haipeng Chen
  • Bo An

Over the past decades, Electronic Toll Collection (ETC) systems have proved capable of alleviating traffic congestion in urban areas. Dynamic Electronic Toll Collection (DETC) was recently proposed to further improve the efficiency of ETC, where tolls are dynamically set based on traffic dynamics. However, computing the optimal DETC scheme is computationally difficult, and existing approaches are limited to small-scale or partial road networks, which significantly restricts the adoption of DETC. To this end, we propose a novel multi-agent reinforcement learning (RL) approach for DETC. We make several key contributions: i) an enhancement over the state-of-the-art RL-based method, with a deep neural network representation of the policy and value functions and a temporal-difference learning framework to accelerate the update of target values; ii) a novel edge-based graph convolutional neural network (eGCN) to extract the spatio-temporal correlations of the road network state features; iii) a novel cooperative multi-agent reinforcement learning (MARL) scheme that divides the whole road network into partitions according to their geographic and economic characteristics and trains a tolling agent for each partition. Experimental results show that our approach scales up to realistic-sized problems with robust performance and significantly outperforms the state-of-the-art method.

AAMAS Conference 2019 Conference Paper

Efficient City-Scale Patrolling Using Decomposition and Grafting

  • Wanyuan Wang
  • Zichen Dong
  • Bo An
  • Yichuan Jiang

This paper uses an integer program (IP) to formulate the city-scale patrolling (CSP) problem, with the objective of maximizing the police visibility rate (PVR) and the constraint of incident response time guarantee. We decompose the original CSP into two subproblems: minimizing police problem (MinP) and maximizing PVR (MaxP) problem. A polynomial time approximation algorithm is proposed for MinP, and a polynomial time optimal algorithm is proposed for MaxP. We conduct experiments to demonstrate the efficiency of the proposed algorithm.

NeurIPS Conference 2019 Conference Paper

Manipulating a Learning Defender and Ways to Counteract

  • Jiarui Gan
  • Qingyu Guo
  • Long Tran-Thanh
  • Bo An
  • Michael Wooldridge

In Stackelberg security games, when information about the attacker's payoffs is uncertain, algorithms have been proposed to learn the optimal defender commitment by interacting with the attacker and observing their best responses. In this paper, we show, however, that these algorithms can be easily manipulated if the attacker responds untruthfully. As a key finding, attacker manipulation normally leads the defender to learn a maximin strategy, which effectively renders the learning attempt meaningless, since computing a maximin strategy requires no additional information about the other player at all. We then apply a game-theoretic framework at a higher level to counteract such manipulation, in which the defender commits to a policy that specifies her strategy commitment according to the learned information. We provide a polynomial-time algorithm to compute the optimal such policy, and in addition, a heuristic approach that applies even when the attacker's payoff space is infinite or completely unknown. Empirical evaluation shows that our approaches can improve the defender's utility significantly compared to the situation where attacker manipulation is ignored.

AAAI Conference 2019 Conference Paper

On the Inducibility of Stackelberg Equilibrium for Security Games

  • Qingyu Guo
  • Jiarui Gan
  • Fei Fang
  • Long Tran-Thanh
  • Milind Tambe
  • Bo An

Strong Stackelberg equilibrium (SSE) is the standard solution concept for Stackelberg security games. As opposed to the weak Stackelberg equilibrium (WSE), the SSE assumes that the follower breaks ties in favor of the leader; this is widely acknowledged and justified by the assertion that the defender can often induce the attacker to choose a preferred action by making an infinitesimal adjustment to her strategy. Unfortunately, in security games with resource assignment constraints, the assertion might not be valid: it is possible that the defender cannot induce the desired outcome. As a result, many results claimed in the literature may be overly optimistic. To remedy this, we first formally define the utility guarantee of a defender strategy and provide examples showing that the utility of the SSE can be higher than its utility guarantee. Second, inspired by the analysis of the leader's payoff by Von Stengel and Zamir (2004), we propose a solution concept called the inducible Stackelberg equilibrium (ISE), which has the highest utility guarantee and always exists. Third, we show the conditions under which the ISE coincides with the SSE, and that in the general case the SSE can be far worse with respect to the utility guarantee. Moreover, introducing the ISE does not invalidate existing algorithmic results, as the problem of computing an ISE polynomially reduces to that of computing an SSE. We also provide an algorithmic implementation for computing the ISE, with which our experiments unveil the empirical advantage of the ISE over the SSE.

AAAI Conference 2019 Conference Paper

Optimal Interdiction of Urban Criminals with the Aid of Real-Time Information

  • Youzhi Zhang
  • Qingyu Guo
  • Bo An
  • Long Tran-Thanh
  • Nicholas R. Jennings

Most violent crimes happen in urban and suburban cities. With emerging tracking techniques, law enforcement officers can have real-time location information of the escaping criminals and dynamically adjust the security resource allocation to interdict them. Unfortunately, existing work on urban network security games largely ignores such information. This paper addresses this omission. First, we show that ignoring the real-time information can cause an arbitrarily large loss of efficiency. To mitigate this loss, we propose a novel NEtwork purSuiT game (NEST) model that captures the interaction between an escaping adversary and a defender with multiple resources and real-time information available. Second, solving NEST is proven to be NP-hard. Third, after transforming the non-convex program of solving NEST to a linear program, we propose our incremental strategy generation algorithm, including: (i) novel pruning techniques in our best response oracle; and (ii) novel techniques for mapping strategies between subgames and adding multiple best response strategies at one iteration to solve extremely large problems. Finally, extensive experiments show the effectiveness of our approach, which scales up to realistic problem sizes with hundreds of nodes on networks including the real network of Manhattan.

IJCAI Conference 2019 Conference Paper

Partial Label Learning by Semantic Difference Maximization

  • Lei Feng
  • Bo An

Partial label learning is a weakly supervised learning framework, in which each instance is provided with multiple candidate labels while only one of them is correct. Most of the existing approaches focus on leveraging the instance relationships to disambiguate the given noisy label space, while it is still unclear whether we can exploit potentially useful information in label space to alleviate the label ambiguities. This paper gives a positive answer to this question for the first time. Specifically, if two instances do not share any common candidate labels, they cannot have the same ground-truth label. By exploiting such dissimilarity relationships from label space, we propose a novel approach that aims to maximize the latent semantic differences of the two instances whose ground-truth labels are definitely different, while training the desired model simultaneously, thereby continually enlarging the gap of label confidences between two instances of different classes. Extensive experiments on artificial and real-world partial label datasets show that our approach significantly outperforms state-of-the-art counterparts.

AAAI Conference 2019 Conference Paper

Partial Label Learning with Self-Guided Retraining

  • Lei Feng
  • Bo An

Partial label learning deals with the problem where each training instance is assigned a set of candidate labels, only one of which is correct. This paper provides the first attempt to leverage the idea of self-training for dealing with partially labeled examples. Specifically, we propose a unified formulation with proper constraints to train the desired model and perform pseudo-labeling jointly. For pseudo-labeling, unlike traditional self-training that manually selects the ground-truth label with sufficiently high confidence, we introduce the maximum infinity norm regularization on the model outputs to automatically achieve this desideratum, which results in a convex-concave optimization problem. We show that optimizing this convex-concave problem is equivalent to solving a set of quadratic programming (QP) problems. By proposing an upper-bound surrogate objective function, we turn to solving only one QP problem, improving the optimization efficiency. Extensive experiments on synthesized and real-world datasets demonstrate that the proposed approach significantly outperforms state-of-the-art partial label learning approaches.

IJCAI Conference 2019 Conference Paper

Who Should Pay the Cost: A Game-theoretic Model for Government Subsidized Investments to Improve National Cybersecurity

  • Xinrun Wang
  • Bo An
  • Hau Chan

Due to recent cyber attacks, cybersecurity is becoming more critical in modern society. A single attack (e.g., the WannaCry ransomware attack) can cause as much as $4 billion in damage. However, cybersecurity investment by companies is far from satisfactory. Therefore, governments (e.g., in the UK) launch grants and subsidies to help companies boost their cybersecurity and create a safer national cyber environment. The allocation problem is hard due to limited subsidies, the interdependence between self-interested companies, and the presence of a strategic cyber attacker. To tackle the government's allocation problem, we introduce a Stackelberg game-theoretic model in which the government first commits to an allocation, and the companies/users and the attacker then simultaneously determine their protection and attack (pure or mixed) strategies, respectively. For the pure-strategy case, while there may not be a feasible allocation in general, we prove that computing an optimal allocation is NP-hard and propose a linear reverse convex program for the case where the attacker can attack all users. For the mixed-strategy case, we show that there is a polynomial-time algorithm to find an optimal allocation when the attacker has single-attack capability. We then provide a heuristic algorithm, based on best-response-gradient dynamics, to find an effective allocation in the general setting. Experimentally, we show that our heuristic is effective and outperforms other baselines on synthetic and real data.

IS Journal 2018 Journal Article

Camera Placement Based on Vehicle Traffic for Better City Security Surveillance

  • Xiaobo Ma
  • Yihui He
  • Xiapu Luo
  • Jianfeng Li
  • Mengchen Zhao
  • Bo An
  • Xiaohong Guan

Security surveillance is important in smart cities. Deploying numerous cameras is a common approach. Given the importance of vehicles in a metropolis, using vehicle traffic patterns to strategically place cameras could potentially facilitate security surveillance. This article constitutes the first effort toward building the link between vehicle traffic and camera placement for better security surveillance.

AAAI Conference 2018 Conference Paper

Catching Captain Jack: Efficient Time and Space Dependent Patrols to Combat Oil-Siphoning in International Waters

  • Xinrun Wang
  • Bo An
  • Martin Strobel
  • Fookwai Kong

Pirate syndicates capturing tankers to siphon oil, at an estimated cost of $5 billion a year, have become a serious security issue for maritime traffic. In response to the threat, coast guards and navies deploy patrol boats to protect international oil trade. However, given the vast area of the sea and the highly time- and space-dependent behaviors of both players, it remains a significant challenge to find efficient ways to deploy patrol resources. In this paper, we address these research challenges and provide four key contributions. First, we construct a Stackelberg model of the oil-siphoning problem based on incident reports of actual attacks. Second, we propose a compact formulation and a constraint-generation algorithm, which tackle the exponential growth of the defender's and attacker's strategy spaces, respectively, to compute efficient strategies for security agencies. Third, to further improve scalability, we propose an abstraction method, which exploits the intrinsic similarity of the defender's strategy space, to solve extremely large-scale games. Finally, we evaluate our approaches through extensive simulations and a detailed case study with real ship traffic data. The results demonstrate that our approach achieves a dramatic improvement in scalability with modest influence on solution quality and can scale up to realistic-sized problems.

AAAI Conference 2018 Conference Paper

Data Poisoning Attacks on Multi-Task Relationship Learning

  • Mengchen Zhao
  • Bo An
  • Yaodong Yu
  • Sulin Liu
  • Sinno Pan

Multi-task learning (MTL) is a machine learning paradigm that improves the performance of each task by exploiting useful information contained in multiple related tasks. However, the relatedness of tasks can be exploited by attackers to launch data poisoning attacks, which have been demonstrated to be a serious threat to single-task learning. In this paper, we provide the first study of the vulnerability of MTL. Specifically, we focus on multi-task relationship learning (MTRL) models, a popular subclass of MTL models in which task relationships are quantized and learned directly from training data. We formulate the problem of computing optimal poisoning attacks on MTRL as a bilevel program that is adaptive to an arbitrary choice of target tasks and attacking tasks. We propose an efficient algorithm called PATOM for computing optimal attack strategies. PATOM leverages the optimality conditions of the subproblem of MTRL to compute the implicit gradients of the upper-level objective function. Experimental results on real-world datasets show that MTRL models are very sensitive to poisoning attacks and that the attacker can significantly degrade the performance of the target tasks, either by directly poisoning the target tasks or by indirectly poisoning related tasks, exploiting the task relatedness. We also found that the tasks being attacked are always strongly correlated, which provides a clue for defending against such attacks.

AAAI Conference 2018 Conference Paper

DyETC: Dynamic Electronic Toll Collection for Traffic Congestion Alleviation

  • Haipeng Chen
  • Bo An
  • Guni Sharon
  • Josiah Hanna
  • Peter Stone
  • Chunyan Miao
  • Yeng Soh

To alleviate traffic congestion in urban areas, electronic toll collection (ETC) systems are deployed all over the world. Despite their merits, tolls are usually pre-determined and fixed from day to day, which fails to account for traffic dynamics and thus has limited regulation effect when traffic conditions are abnormal. In this paper, we propose a novel dynamic ETC (DyETC) scheme that adjusts tolls to traffic conditions in real time. The DyETC problem is formulated as a Markov decision process (MDP), the solution of which is very challenging due to its 1) multi-dimensional state space, 2) multi-dimensional, continuous, and bounded action space, and 3) time-dependent state and action values. Due to the complexity of the formulated MDP, existing methods cannot be applied to our problem. Therefore, we develop a novel algorithm, PG-β, which makes three improvements to the traditional policy gradient method: 1) time-dependent value and policy functions, 2) a Beta-distribution policy function, and 3) state abstraction. Experimental results show that, compared with existing ETC schemes, DyETC increases traffic volume by around 8% and reduces travel time by around 14.6% during rush hour. Considering the total traffic volume in a traffic network, this contributes a substantial increase to social welfare.
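A Beta-distribution policy is a standard device for bounded continuous action spaces, since the Beta distribution's support is [0, 1] and can be rescaled to any toll range. A minimal sketch of the sampling step only (illustrative; PG-β additionally learns time-dependent policy parameters, and all names here are hypothetical):

```python
import random

def sample_bounded_action(alpha, beta, low, high):
    """Draw an action from a Beta(alpha, beta) policy rescaled to [low, high].

    Unlike a Gaussian policy, every sample respects the action bounds
    by construction, so no clipping is needed.
    """
    x = random.betavariate(alpha, beta)  # x always lies in [0, 1]
    return low + (high - low) * x
```

The affine rescaling preserves the distribution's shape while mapping its support onto the bounded action interval.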

AAAI Conference 2018 Conference Paper

Dynamic Pricing for Reusable Resources in Competitive Market With Stochastic Demand

  • Jiang Rong
  • Tao Qin
  • Bo An

The market for reusable products (e.g., car rental, cloud services, and network access resources) has grown rapidly over the last few years, with service providers maximizing their revenues by setting optimal prices. While there has been much research on pricing optimization, existing works often ignore the dynamic nature of demand and the competition among providers; thus, existing pricing solutions can be far from optimal in realistic markets. This paper provides the first study of service providers' dynamic pricing in consideration of market competition and makes three key contributions along this line. First, we propose a comprehensive model that takes into account dynamic demand and the interaction among providers, and we formulate the optimal pricing policy in a competitive market as an equilibrium. Second, we propose an approximate Nash equilibrium to describe providers' behaviors and design an efficient algorithm to compute the equilibrium, which is guaranteed to converge. Third, we derive many properties of the model without any further constraints on demand functions, which can reduce the search space of policies in the algorithm. Finally, we conduct extensive experiments with different parameter settings, showing that the approximate equilibrium is very close to the Nash equilibrium and that our proposed pricing policy outperforms existing strategies.

AAMAS Conference 2018 Conference Paper

Equilibrium Refinement in Security Games with Arbitrary Scheduling Constraints

  • Kai Wang
  • Qingyu Guo
  • Phebe Vayanos
  • Milind Tambe
  • Bo An

Significant research effort in security games has focused on devising strategies that perform well even when the attacker deviates from optimal (rational) behavior. In most of these frameworks, a price must be paid to ensure robustness against this unpredictability. However, equilibrium refinement is an attractive alternative that boosts solution robustness at no cost, even though it has not received as much attention in the security game literature. In this framework, resources are strategically allocated to secure an optimal outcome against a rational adversary while simultaneously protecting other targets to ensure good outcomes against boundedly rational or constrained attackers. Unfortunately, existing approaches for equilibrium refinement in security games cannot effectively address scheduling constraints, which arise frequently in real-world applications. In this paper, we aim to fill this gap and make several key contributions. First, we show that existing approaches for equilibrium refinement can fail in the presence of scheduling constraints. Second, we investigate the properties of the attacker's best response. Third, we leverage these properties to devise novel iterative algorithms that compute the optimally refined equilibrium with polynomially many calls to an LP oracle for zero-sum games. Finally, we conduct extensive experimental evaluations that showcase i) the superior performance of our approach in the face of a boundedly rational attacker and ii) the attractive scalability of our algorithm, which can solve realistic-sized instances.

AAAI Conference 2018 Conference Paper

HogRider: Champion Agent of Microsoft Malmo Collaborative AI Challenge

  • Yanhai Xiong
  • Haipeng Chen
  • Mengchen Zhao
  • Bo An

It has been an open challenge for self-interested agents to make optimal sequential decisions in complex multiagent systems, where agents might achieve higher utility via collaboration. The Microsoft Malmo Collaborative AI Challenge (MCAC), designed to encourage research on various problems in collaborative AI, takes the form of a Minecraft mini-game where players might work together to catch a pig or deviate from cooperation, pursuing high scores to win the challenge. Various characteristics, such as complex interactions among agents, uncertainty, sequential decision making, and limited learning trials, all make it extremely challenging to find effective strategies. We present HogRider, the champion agent of MCAC in 2017 out of 81 teams from 26 countries. One key innovation of HogRider is a generalized agent-type hypothesis framework for identifying the behavior model of the other agents, which is demonstrated to be robust to observation uncertainty. On top of that, a second key innovation is a novel Q-learning approach for learning effective policies against each type of collaborating agent. Various ideas are proposed to adapt traditional Q-learning to the complexities of the challenge, including state-action abstraction to reduce problem scale, a warm-start approach using human reasoning to address limited learning trials, and an active greedy strategy to balance exploitation and exploration. Challenge results show that HogRider outperforms all other teams by a significant edge, in terms of both optimality and stability.
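The traditional Q-learning that HogRider adapts (over abstracted state-action pairs) centers on the standard tabular temporal-difference update. A minimal sketch of that textbook rule (illustrative only; function and variable names are hypothetical, not the paper's code):

```python
def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the TD target
    r + gamma * max_b Q(s_next, b). Missing entries default to 0."""
    best_next = max(Q.get((s_next, b), 0.0) for b in actions)
    old = Q.get((s, a), 0.0)
    Q[(s, a)] = old + alpha * (r + gamma * best_next - old)
    return Q[(s, a)]
```

State-action abstraction, as described in the abstract, amounts to choosing a coarse encoding for `s` and `a` so this table stays small enough to learn from limited trials.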

IJCAI Conference 2018 Conference Paper

Impression Allocation for Combating Fraud in E-commerce Via Deep Reinforcement Learning with Action Norm Penalty

  • Mengchen Zhao
  • Zhao Li
  • Bo An
  • Haifeng Lu
  • Yifan Yang
  • Chen Chu

Conducting fraud transactions has become popular among e-commerce sellers to make their products favorable to the platform and buyers, which decreases the utilization efficiency of buyer impressions and jeopardizes the business environment. Fraud detection techniques are necessary but not enough for the platform since it is impossible to recognize all the fraud transactions. In this paper, we focus on improving the platform's impression allocation mechanism to maximize its profit and reduce the sellers' fraudulent behaviors simultaneously. First, we learn a seller behavior model to predict the sellers' fraudulent behaviors from the real-world data provided by one of the largest e-commerce companies in the world. Then, we formulate the platform's impression allocation problem as a continuous Markov Decision Process (MDP) with unbounded action space. In order to make the action executable in practice and facilitate learning, we propose a novel deep reinforcement learning algorithm DDPG-ANP that introduces an action norm penalty to the reward function. Experimental results show that our algorithm significantly outperforms existing baselines in terms of scalability and solution quality.
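The action norm penalty mentioned in the abstract can be illustrated with a minimal sketch. This is only an illustration of the reward-shaping idea, not the paper's actual DDPG-ANP implementation; the function name and the penalty weight `lam` are hypothetical:

```python
import numpy as np

def penalized_reward(base_reward, action, lam=0.1):
    """Subtract a penalty proportional to the norm of the action, so that
    unrealistically large allocation actions are discouraged during
    learning. `lam` is a hypothetical penalty weight."""
    return base_reward - lam * np.linalg.norm(action)

# A moderate allocation keeps most of the reward; an extreme one is penalized.
small = penalized_reward(1.0, np.array([0.1, 0.2]))
large = penalized_reward(1.0, np.array([3.0, 4.0]))  # norm 5 -> reward 0.5
```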

AAMAS Conference 2018 Conference Paper

Inducible Equilibrium for Security Games

  • Qingyu Guo
  • Jiarui Gan
  • Fei Fang
  • Long Tran-Thanh
  • Milind Tambe
  • Bo An

Strong Stackelberg equilibrium (SSE) is the standard solution concept of Stackelberg security games. The SSE assumes that the follower breaks ties in favor of the leader, an assumption widely acknowledged and justified by the assertion that the defender can often induce the attacker to choose a preferred action by making an infinitesimal adjustment to her strategy. Unfortunately, in security games with resource assignment constraints, this assertion might not be valid. To overcome this issue, inspired by the notion of inducibility and the pessimistic Stackelberg equilibrium [20, 21], this paper presents the inducible Stackelberg equilibrium (ISE), which is guaranteed to exist and avoids overoptimism, as the outcome can always be induced with infinitesimal strategy deviation. Experimental evaluation unveils the significant overoptimism and sub-optimality of the SSE and thus verifies the advantage of the ISE as an alternative solution concept.

IJCAI Conference 2018 Conference Paper

Leveraging Latent Label Distributions for Partial Label Learning

  • Lei Feng
  • Bo An

In partial label learning, each training example is assigned a set of candidate labels, only one of which is the ground-truth label. Existing partial label learning frameworks either assume that each candidate label has equal confidence or consider the ground-truth label as a latent variable hidden in the indiscriminate candidate label set, while the different labeling confidence levels of the candidate labels are regrettably ignored. In this paper, we formalize the different labeling confidence levels as latent label distributions, and propose a novel unified framework to estimate the latent label distributions while simultaneously training the model. Specifically, we present a biconvex formulation with constrained local consistency and adopt an alternating method to solve this optimization problem. The process of alternating optimization exactly facilitates the mutual adaptation of the model training and the constrained label propagation. Extensive experimental results on controlled UCI datasets as well as real-world datasets clearly show the effectiveness of the proposed approach.

AIJ Journal 2018 Journal Article

Optimal defense against election control by deleting voter groups

  • Yue Yin
  • Yevgeniy Vorobeychik
  • Bo An
  • Noam Hazon

Election control encompasses attempts from an external agent to alter the structure of an election in order to change its outcome. This problem is both a fundamental theoretical problem in social choice, and a major practical concern for democratic institutions. Consequently, this issue has received considerable attention, particularly as it pertains to different voting rules. In contrast, the problem of how election control can be prevented or deterred has been largely ignored. We introduce the problem of optimal defense against election control, including destructive and constructive control, where manipulation is allowed at the granularity of groups of voters (e.g., voting locations) through a denial-of-service attack, and the defender allocates limited protection resources to prevent control. We consider plurality voting, and show that it is computationally hard to prevent both types of control, though destructive control itself can be performed in polynomial time. For defense against destructive control, we present a double-oracle framework for computing an optimal prevention strategy. We show that both defender and attacker best response subproblems are NP-complete, and develop exact mixed-integer linear programming approaches for solving these, as well as fast heuristic methods. We then extend this general approach to develop effective algorithmic solutions for defense against constructive control. Finally, we generalize the model and algorithmic approaches to consider uncertainty about voter preferences. Experiments conducted on both synthetic and real data demonstrate that the proposed computational framework can scale to realistic problem instances.

AAAI Conference 2018 Conference Paper

Optimal Spot-Checking for Improving Evaluation Accuracy of Peer Grading Systems

  • Wanyuan Wang
  • Bo An
  • Yichuan Jiang

Peer grading, allowing students/peers to evaluate others' assignments, offers a promising solution for scaling evaluation and learning to large-scale educational systems. A key challenge in peer grading is motivating peers to grade diligently. While existing spot-checking (SC) mechanisms can prevent peer collusion where peers coordinate to report uninformative grades, they unrealistically assume that peers have the same grading reliability and cost. This paper studies the general Optimal Spot-Checking (OptSC) problem of determining the probability that each assignment needs to be checked to maximize the assignments' evaluation accuracy aggregated from peers, and takes into consideration 1) peers' heterogeneous characteristics, and 2) peers' strategic grading behaviors to maximize their own utility. We prove that the bilevel OptSC is NP-hard to solve. By exploiting peers' grading behaviors, we first formulate a single-level relaxation to approximate OptSC. By further exploiting structural properties of the relaxed problem, we propose an efficient algorithm for that relaxation, which also gives a good approximation of the original OptSC. Extensive experiments on both synthetic and real datasets show significant advantages of the proposed algorithm over existing approaches.

IJCAI Conference 2018 Conference Paper

Stackelberg Security Games: Looking Beyond a Decade of Success

  • Arunesh Sinha
  • Fei Fang
  • Bo An
  • Christopher Kiekintveld
  • Milind Tambe

The Stackelberg Security Game (SSG) model has been immensely influential in security research since it was introduced roughly a decade ago. Furthermore, deployed SSG-based applications are one of the most successful examples of game theory applications in the real world. We present a broad survey of recent technical advances in SSG and related literature, and then look to the future by highlighting new potential applications and open research problems in SSG.

TAAS Journal 2018 Journal Article

Understanding Crowdsourcing Systems from a Multiagent Perspective and Approach

  • Jiuchuan Jiang
  • Bo An
  • Yichuan Jiang
  • Donghui Lin
  • Zhan Bu
  • Jie Cao
  • Zhifeng Hao

Crowdsourcing has recently been extensively explored. Although related surveys have been conducted regarding this subject, each has mainly consisted of a review of a single aspect of crowdsourcing systems or of the application of crowdsourcing in a specific application domain. A crowdsourcing system is a comprehensive set of multiple entities, including various elements and processes. Multiagent computing has already been widely envisioned as a powerful paradigm for modeling autonomous multi-entity systems with adaptation to dynamic environments. Therefore, this article presents a novel multiagent perspective and approach to understanding crowdsourcing systems, which can be used to correlate the research on crowdsourcing and multiagent systems and inspire possible interdisciplinary research between the two areas. This article mainly discusses the following two aspects: (1) the multiagent perspective can be used for conducting a comprehensive survey on the state of the art of crowdsourcing, and (2) the multiagent approach can bring about concrete enhancements for crowdsourcing technology and inspire future research directions that enable crowdsourcing research to overcome the typical challenges in crowdsourcing technology. Finally, this article discusses the advantages and disadvantages of the multiagent perspective by comparing it with two other popular perspectives on crowdsourcing: the business perspective and the technical perspective.

IJCAI Conference 2017 Conference Paper

Comparing Strategic Secrecy and Stackelberg Commitment in Security Games

  • Qingyu Guo
  • Bo An
  • Branislav Bošanský
  • Christopher Kiekintveld

The Strong Stackelberg Equilibrium (SSE) has drawn extensive attention recently in several security domains. However, the SSE concept neglects the advantage of the defender's strategic revelation of her private information, and overestimates the observation ability of the adversaries. In this paper, we overcome these restrictions and analyze the tradeoff between strategic secrecy and commitment in security games. We propose a Disguised-resource Security Game (DSG) where the defender strategically disguises some of her resources. We compare strategic information revelation with public commitment and formally show that they have different advantages depending on the payoff structure. To compute the Perfect Bayesian Equilibrium (PBE), several novel approaches are provided, including a novel algorithm based on support set enumeration, and an approximation algorithm for ε-PBE. Extensive experimental evaluation shows that both strategic secrecy and Stackelberg commitment are critical measures in security domains, and that our approaches can efficiently solve PBEs for realistic-sized problems.

TIST Journal 2017 Journal Article

Data-Driven Frequency-Based Airline Profit Maximization

  • Bo An
  • Haipeng Chen
  • Noseong Park
  • V. S. Subrahmanian

Although numerous traditional models predict market share and demand along airline routes, the prediction of existing models is not precise enough, and to the best of our knowledge, there is no use of data mining-based forecasting techniques for improving airline profitability. We propose the maximizing airline profits (MAP) architecture designed to help airlines and make two key contributions in airline market share and route demand prediction and prediction-based airline profit optimization. Compared to past methods used to forecast market share and demand along airline routes, we introduce a novel ensemble forecasting (MAP-EF) approach considering two new classes of features: (i) features derived from clusters of similar routes and (ii) features based on equilibrium pricing. We show that MAP-EF achieves much better Pearson correlation coefficients (greater than 0.95 vs. 0.82 for market share, 0.98 vs. 0.77 for demand) and R²-values compared to three state-of-the-art works for forecasting market share and demand, while showing much lower variance. Using the results of MAP-EF, we develop MAP-bilevel branch and bound (MAP-BBB) and MAP-greedy (MAP-G) algorithms to optimally allocate flight frequencies over multiple routes to maximize an airline's profit. We also study two extensions of the profit maximization problem considering frequency constraints and long-term profits. Furthermore, we develop algorithms for computing Nash equilibrium frequencies when there are multiple strategic airlines. Experimental results show that airlines can increase profits by a significant margin. All experiments were conducted with data aggregated from four sources: the U.S. Bureau of Transportation Statistics (BTS), the U.S. Bureau of Economic Analysis (BEA), the National Transportation Safety Board (NTSB), and the U.S. Census Bureau (CB).

IJCAI Conference 2017 Conference Paper

Defending Against Man-In-The-Middle Attack in Repeated Games

  • Shuxin Li
  • Xiaohong Li
  • Jianye Hao
  • Bo An
  • Zhiyong Feng
  • Kangjie Chen
  • Chengwei Zhang

The Man-in-the-Middle (MITM) attack has become widespread in networks nowadays. The MITM attack can cause serious information leakage and result in tremendous loss to users. Previous work applies game theory to analyze the MITM attack-defense problem and computes the optimal defense strategy to minimize the total loss. It assumes that all defenders are cooperative and that the attacker knows the defenders' strategies beforehand. However, each individual defender is rational and may not have the incentive to cooperate. Furthermore, the attacker can hardly know the defenders' strategies in advance in practice. To this end, we assume that all defenders are self-interested and model the MITM attack-defense scenario as a simultaneous-move game. Nash equilibrium is adopted as the solution concept, which is proved to be always unique. Given the impracticability of computing the Nash equilibrium directly, we propose practical adaptive algorithms for the defenders and the attacker to learn towards the unique Nash equilibrium through repeated interactions. Simulation results show that the algorithms are able to converge to the Nash equilibrium strategy efficiently.

IJCAI Conference 2017 Conference Paper

Efficient Label Contamination Attacks Against Black-Box Learning Models

  • Mengchen Zhao
  • Bo An
  • Wei Gao
  • Teng Zhang

Label contamination attack (LCA) is an important type of data poisoning attack where an attacker manipulates the labels of training data to make the learned model beneficial to him. Existing work on LCA assumes that the attacker has full knowledge of the victim learning model, whereas the victim model is usually a black-box to the attacker. In this paper, we develop a Projected Gradient Ascent (PGA) algorithm to compute LCAs on a family of empirical risk minimizations and show that an attack on one victim model can also be effective on other victim models. This makes it possible for the attacker to design an attack against a substitute model and transfer it to a black-box victim model. Based on the observed transferability, we develop a defense algorithm to identify the data points that are most likely to be attacked. Empirical studies show that PGA significantly outperforms existing baselines and that linear learning models are better substitute models than nonlinear ones.
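As a rough, self-contained sketch of projected gradient ascent on training labels (not the paper's exact formulation): here the victim is a ridge regressor with closed-form weights, a simplification of the empirical risk minimization family the abstract mentions, and every name, constant, and the attacker objective are illustrative assumptions:

```python
import numpy as np

def pga_label_attack(X, y, x_target, lam=1.0, eta=0.5, steps=50):
    """Projected gradient ascent on (relaxed, continuous) training labels.
    The victim is a ridge regressor with closed-form weights
    w(y) = (X^T X + lam*I)^{-1} X^T y; the attacker perturbs the labels
    to push the prediction on `x_target` upward, projecting back onto
    the box [-1, 1]^n after every ascent step."""
    A = np.linalg.inv(X.T @ X + lam * np.eye(X.shape[1]))
    grad = X @ A @ x_target              # gradient of w(y).x_target w.r.t. y
    y_adv = y.astype(float).copy()
    for _ in range(steps):
        y_adv = np.clip(y_adv + eta * grad, -1.0, 1.0)   # ascent + projection
    return y_adv

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = np.sign(X @ np.array([1.0, -1.0, 0.5]))   # clean +/-1 labels
x_t = np.array([1.0, 0.0, 0.0])
y_adv = pga_label_attack(X, y, x_t)
```

Because the victim here retrains in closed form, the attacker's gradient is exact; against a black-box victim, the abstract's transferability observation is what makes attacking such a substitute worthwhile.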

IJCAI Conference 2017 Conference Paper

Game Theoretic Analysis of Security and Sustainability

  • Bo An

Computational game theory has become a powerful tool to address critical issues in security and sustainability. Casting the security resource allocation problem as a Stackelberg game, novel algorithms have been developed to provide randomized security resource allocations. These algorithms have led to deployed security-game based decision aids for many real-world security domains, including infrastructure security and wildlife protection. We contribute to this community by addressing several major research challenges in complex security resource allocation, including dynamic payoffs, uncertainty, protection externality, games on networks, and strategic secrecy. We also analyze optimal security resource allocation in many potential application domains, including cyber security. Furthermore, we apply game theory to reasoning about optimal policies for taxi pricing schemes and EV charging placement and pricing.

IS Journal 2017 Journal Article

Game-Theoretic Considerations for Optimizing Taxi System Efficiency

  • Jiarui Gan
  • Bo An

Taxi service is an indispensable part of public transport in modern cities. To support its unique features, a taxi system adopts a decentralized operation mode in which thousands of taxis freely decide their working schedules and routes. Taxis compete with each other for individual profits regardless of system-level efficiency, making the taxi system inefficient and hard to optimize. Most research into the management and economics of taxi markets has focused on modeling, from a macro level, the effects of and relationships between various market factors. Less has been done regarding a more important component: drivers' strategic behavior under the decentralized operation mode. The authors propose looking at the problem from a game-theoretic perspective. Combining game-theoretic solution concepts with existing models of taxi markets, they model taxi drivers' strategy-making process as a game and transform the problem of optimizing taxi system efficiency into finding a market policy that leads to the desired equilibrium.

AIJ Journal 2017 Journal Article

Human–computer negotiation in a three player market setting

  • Galit Haim
  • Ya'akov (Kobi) Gal
  • Bo An
  • Sarit Kraus

This paper proposes a novel agent-design for a three-player game involving human players and computer agents. The game is analogous to settings in which participants repeatedly negotiate over contracts, such as cell-phone and credit card plans. The game comprises three players: two service providers who compete to sign contracts with a single customer player. The service providers compete to make repeated contract offers to the customer consisting of resource exchanges in the game. Customers can join and leave contracts at will. We computed sub-game perfect equilibrium strategies for all players that were based on making contracts involving commitments between the customer player and one of the service provider players. We conducted extensive empirical studies (spanning over 500 participants) comparing the performance of computer agents using different types of equilibrium strategies with that of people in three different countries, the U.S., Israel, and China, which are characterized by cultural differences in how people make contracts in the game. Two human participants played against a single computer agent in various role configurations in the game. For the customer role, agents using equilibrium strategies were able to obtain a higher score than people playing the same role in all three countries. For the service provider role, agents using equilibrium strategies that reasoned about possibly irrational behavior were able to obtain higher scores than people (as well as agents that did not reason about irrational behavior). This work shows that for particular market settings involving competition between service providers, equilibrium strategies can be a successful design paradigm for computer agents without relying on data-driven approaches.

IJCAI Conference 2017 Conference Paper

Optimal Escape Interdiction on Transportation Networks

  • Youzhi Zhang
  • Bo An
  • Long Tran-Thanh
  • Zhen Wang
  • Jiarui Gan
  • Nicholas R. Jennings

Preventing crimes or terrorist attacks in urban areas is challenging. Law enforcement officers need to respond quickly to catch the attacker on his escape route, which is subject to time-dependent traffic conditions on transportation networks. The attacker can strategically choose his escape path and driving speed to avoid being captured. Existing work on security resource allocation has not considered such scenarios with time-dependent strategies for both players. Therefore, in this paper, we study the problem of efficiently scheduling security resources for interdicting the escaping attacker. We propose: 1) a new defender-attacker security game model for escape interdiction on transportation networks; and 2) an efficient double oracle algorithm to compute the optimal defender strategy, which combines mixed-integer linear programming formulations for best response problems and effective approximation algorithms for improving the scalability of the algorithms. Experimental evaluation shows that our approach significantly outperforms baselines in solution quality and scales up to realistic-sized transportation networks with hundreds of intersections.

AAAI Conference 2017 Conference Paper

Optimal Personalized Defense Strategy Against Man-In-The-Middle Attack

  • Xiaohong Li
  • Shuxin Li
  • Jianye Hao
  • Zhiyong Feng
  • Bo An

The Man-In-The-Middle (MITM) attack is one of the most common attacks employed in the network hacking. MITM attackers can successfully invoke attacks such as denial of service (DoS) and port stealing, and lead to surprisingly harmful consequences for users in terms of both financial loss and security issues. The conventional defense approaches mainly consider how to detect and eliminate those attacks or how to prevent those attacks from being launched in the first place. This paper proposes a game-theoretic defense strategy from a different perspective, which aims at minimizing the loss that the whole system sustains given that the MITM attacks are inevitable. We model the interaction between the attacker and the defender as a Stackelberg security game and adopt the Strong Stackelberg Equilibrium (SSE) as the defender’s strategy. Since the defender’s strategy space is infinite in our model, we employ a novel method to reduce the searching space of computing the optimal defense strategy. Finally, we empirically evaluate our optimal defense strategy by comparing it with non-strategic defense strategies. The results indicate that our game-theoretic defense strategy significantly outperforms other non-strategic defense strategies in terms of decreasing the total losses against MITM attacks.

IJCAI Conference 2017 Conference Paper

Playing Repeated Network Interdiction Games with Semi-Bandit Feedback

  • Qingyu Guo
  • Bo An
  • Long Tran-Thanh

We study repeated network interdiction games with no prior knowledge of the adversary and the environment, which can model many real-world network security domains. Existing works often require plenty of available information for the defender and neglect the frequent interactions between both players, which is unrealistic and impractical, and thus not suitable for our settings. As such, we provide the first defender strategy that enjoys nice theoretical and practical performance guarantees, by applying the adversarial online learning approach. In particular, we model the repeated network interdiction game with no prior knowledge as an online linear optimization problem, for which a novel and efficient online learning algorithm, SBGA, is proposed, which exploits the unique semi-bandit feedback in network security domains. We prove that SBGA achieves sublinear regret against an adaptive adversary, compared with both the best fixed strategy in hindsight and a near-optimal adaptive strategy. Extensive experiments also show that SBGA significantly outperforms existing approaches with a fast convergence rate.

AAAI Conference 2017 Conference Paper

POI2Vec: Geographical Latent Representation for Predicting Future Visitors

  • Shanshan Feng
  • Gao Cong
  • Bo An
  • Yeow Meng Chee

With the increasing popularity of location-aware social media applications, Point-of-Interest (POI) recommendation has recently been extensively studied. However, most of the existing studies explore from the users' perspective, namely recommending POIs for users. In contrast, we consider a new research problem of predicting users who will visit a given POI in a given future period. The challenge of the problem lies in the difficulty to effectively learn POI sequential transition and user preference, and integrate them for prediction. In this work, we propose a new latent representation model POI2Vec that is able to incorporate the geographical influence, which has been shown to be very important in modeling user mobility behavior. Note that existing representation models fail to incorporate the geographical influence. We further propose a method to jointly model the user preference and POI sequential transition influence for predicting potential visitors for a given POI. We conduct experiments on two real-world datasets to demonstrate the superiority of our proposed approach over the state-of-the-art algorithms for both next POI prediction and future user prediction.

AAAI Conference 2017 Conference Paper

Revenue Maximization for Finitely Repeated Ad Auctions

  • Jiang Rong
  • Tao Qin
  • Bo An
  • Tie-Yan Liu

Reserve price is an effective tool for revenue maximization in ad auctions. The optimal reserve price depends on bidders’ value distributions, which, however, are generally unknown to auctioneers. A common practice for auctioneers is to first collect information about the value distributions by a sampling procedure and then apply the reserve price estimated with the sampled bids to the following auctions. In order to maximize the total revenue over finite auctions, it is important for the auctioneer to find a proper sample size to trade off between the cost of the sampling procedure and the optimality of the estimated reserve price. We investigate the sample size optimization problem for Generalized Second Price auctions, which is the most widely-used mechanism in ad auctions, and make three main contributions along this line. First, we bound the revenue losses in the form of competitive ratio during and after sampling. Second, we formulate the problem of finding the optimal sample size as a non-convex mixed integer optimization problem. Then we characterize the properties of the problem and prove the uniqueness of the optimal sample size. Third, we relax the integer optimization problem to a continuous form and develop an efficient algorithm based on the properties to solve it. Experimental results show that our approach can significantly improve the revenue for the auctioneer in finitely repeated ad auctions.

AAAI Conference 2017 Conference Paper

Security Games on a Plane

  • Jiarui Gan
  • Bo An
  • Yevgeniy Vorobeychik
  • Brian Gauch

Most existing models of Stackelberg security games ignore the underlying topology of the space in which targets and defense resources are located. As a result, allocation of resources is restricted to a discrete collection of exogenously defined targets. However, in many practical security settings, defense resources can be located on a continuous plane. Better defense solutions could therefore be potentially achieved by placing resources in a space outside of actual targets (e.g., between targets). To address this limitation, we propose a model called Security Game on a Plane (SGP) in which targets are distributed on a 2-dimensional plane, and security resources, to be allocated on the same plane, protect targets within a certain effective distance. We investigate the algorithmic aspects of SGP. We find that computing a strong Stackelberg equilibrium of an SGP is NP-hard even for zero-sum games, and these are inapproximable in general. On the positive side, we find an exact solution technique for general SGPs based on an existing approach, and develop a PTAS (polynomial-time approximation scheme) for zero-sum SGP to more fundamentally overcome the computational obstacle. Our experiments demonstrate the value of considering SGP and the effectiveness of our algorithms.

AAMAS Conference 2017 Conference Paper

Stop Nuclear Smuggling Through Efficient Container Inspection

  • Xinrun Wang
  • Qingyu Guo
  • Bo An

Since 2003, the U.S. government has spent $850 million on the Megaport Initiative, which aims at stopping nuclear smuggling in international container shipping through advanced inspection facilities, including Non-Intrusive Inspection (NII) and Mobile Radiation Detection and Identification System (MRDIS). Unfortunately, it remains a significant challenge to efficiently inspect more than 11.7 million containers imported to the U.S. due to the limited inspection resources. Moreover, existing work in container inspection neglects the sophisticated behavior of the smuggler, who can surveil the inspector's strategy and decide the optimal (sequential) smuggling plan. This paper is the first to tackle this challenging container inspection problem: a novel Container Inspection Model (CIM) is proposed, which models the interaction between the inspector and the smuggler as a leader-follower Stackelberg game and formulates the smuggler's sequential decision behavior as a Markov Decision Process (MDP). The special structure of the CIM results in a non-convex optimization problem, which cannot be addressed by existing approaches. We make several key contributions, including: i) a linear relaxation approximation with a guarantee of solution quality, which reformulates the model as a bilinear optimization problem; ii) an algorithm inspired by the Multiple-parametric Disaggregation Technique (MDT) to solve the reformulated bilinear optimization; and iii) a novel iterative algorithm to further improve the scalability. Extensive experimental evaluation shows that our approach can scale up to realistic-sized problems with robust enough solutions, outperforming heuristic baselines significantly.

AAMAS Conference 2016 Conference Paper

Coalitional Security Games

  • Qingyu Guo
  • Bo An
  • Yevgeniy Vorobeychik
  • Long Tran-Thanh
  • Jiarui Gan
  • Chunyan Miao

Game theoretic models of security, and associated computational methods, have emerged as critical components of security posture across a broad array of domains, including airport security and coast guard. These approaches consider terrorists as motivated but independent entities. There is, however, increasing evidence that attackers, be it terrorists or cyber attackers, communicate extensively and form coalitions that can dramatically increase their ability to achieve malicious goals. To date, such cooperative decision making among attackers has been ignored in the security games literature. To address the issue of cooperation among attackers, we introduce a novel coalitional security game (CSG) model. A CSG consists of a set of attackers connected by a (communication or trust) network who can form coalitions as connected subgraphs of this network so as to attack a collection of targets. A defender in a CSG can delete a set of edges, incurring a cost for deleting each edge, with the goal of optimally limiting the attackers’ ability to form effective coalitions (in terms of successfully attacking high value targets). We first show that a CSG is, in general, hard to approximate. Nevertheless, we develop a novel branch and price algorithm, leveraging a combination of column generation, relaxation, greedy approximation, and stabilization methods to enable scalable high-quality approximations of CSG solutions on realistic problem instances.

AAAI Conference 2016 Conference Paper

Computing Optimal Monitoring Strategy for Detecting Terrorist Plots

  • Zhen Wang
  • Yue Yin
  • Bo An

In recent years, terrorist organizations (e.g., ISIS or al-Qaeda) are increasingly directing terrorists to launch coordinated attacks in their home countries. One example is the Paris shootings on January 7, 2015. By monitoring potential terrorists, security agencies are able to detect and stop terrorist plots at their planning stage. Although security agencies may have knowledge about potential terrorists (e.g., who they are, how they interact), they usually have limited resources and cannot monitor all terrorists. Moreover, a terrorist planner may strategically choose to arouse terrorists considering the security agency's monitoring strategy. This paper makes five key contributions toward the challenging problem of computing optimal monitoring strategies: 1) A new Stackelberg game model for terrorist plot detection; 2) A modified double oracle framework for computing the optimal strategy effectively; 3) Complexity results for both defender and attacker oracle problems; 4) Novel mixed-integer linear programming (MILP) formulations for best response problems of both players; and 5) Effective approximation algorithms for generating suboptimal responses for both players. Experimental evaluation shows that our approach can obtain a robust enough solution outperforming widely-used centrality based heuristics significantly and scale up to realistic-sized problems.
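Several of the abstracts on this page rely on a double oracle framework. A minimal zero-sum sketch of the template (not any paper's specific algorithm: fictitious play stands in for an exact LP equilibrium oracle, and brute-force search over the full action sets stands in for the MILP best-response oracles; all names are illustrative):

```python
def solve_zero_sum_fp(payoff, iters=4000):
    """Approximate maximin strategies of a small zero-sum matrix game
    (row player maximizes) via fictitious play; this stands in for an
    exact LP equilibrium oracle."""
    m, n = len(payoff), len(payoff[0])
    row_counts, col_counts = [0] * m, [0] * n
    r, c = 0, 0
    for _ in range(iters):
        row_counts[r] += 1
        col_counts[c] += 1
        # Each player best-responds to the opponent's empirical mixture.
        r = max(range(m), key=lambda i: sum(payoff[i][j] * col_counts[j] for j in range(n)))
        c = min(range(n), key=lambda j: sum(payoff[i][j] * row_counts[i] for i in range(m)))
    total = sum(row_counts)
    return [v / total for v in row_counts], [v / total for v in col_counts]

def double_oracle(payoff):
    """Generic double oracle loop: solve the restricted game, then let each
    player's best-response oracle (here brute force over the full action
    set) add a new pure strategy until neither player can improve."""
    rows, cols = [0], [0]
    while True:
        sub = [[payoff[i][j] for j in cols] for i in rows]
        x, y = solve_zero_sum_fp(sub)
        br_row = max(range(len(payoff)),
                     key=lambda i: sum(payoff[i][cols[k]] * y[k] for k in range(len(cols))))
        br_col = min(range(len(payoff[0])),
                     key=lambda j: sum(payoff[rows[k]][j] * x[k] for k in range(len(rows))))
        grew = False
        if br_row not in rows:
            rows.append(br_row)
            grew = True
        if br_col not in cols:
            cols.append(br_col)
            grew = True
        if not grew:
            return rows, cols, x, y

# Matching pennies: the loop must grow both strategy sets to reach the
# fully mixed equilibrium (0.5, 0.5).
rows, cols, x, y = double_oracle([[1, -1], [-1, 1]])
```

The appeal of the template is that the restricted game stays small even when the full strategy spaces are huge, which is why the papers above pair it with MILP best-response oracles.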

AAAI Conference 2016 Conference Paper

Deploying PAWS to Combat Poaching: Game-Theoretic Patrolling in Areas with Complex Terrain (Demonstration)

  • Fei Fang
  • Thanh Nguyen
  • Rob Pickles
  • Wai Lam
  • Gopalasamy Clements
  • Bo An
  • Amandeep Singh
  • Milind Tambe

The conservation of key wildlife species such as tigers and elephants is threatened by poaching activities. In many conservation areas, foot patrols are conducted to prevent poaching, but they may not be well-planned to make the best use of the limited patrolling resources. While prior work has introduced PAWS (Protection Assistant for Wildlife Security) as a game-theoretic decision aid to design effective foot patrol strategies to protect wildlife, the patrol routes generated by PAWS may be difficult to follow in areas with complex terrain. Subsequent research has worked on the significant evolution of PAWS, from an emerging application to regularly deployed software. A key advance of the deployed version of PAWS is that it incorporates the complex terrain information and generates a strategy consisting of easy-to-follow routes. In this demonstration, we provide 1) a video introducing the PAWS system; 2) an interactive visualization of the patrol routes generated by PAWS in an example area with complex terrain; and 3) a machine-human competition in designing patrol strategy given complex terrain and animal distribution.

AAAI Conference 2016 Conference Paper

Efficient Average Reward Reinforcement Learning Using Constant Shifting Values

  • Shangdong Yang
  • Yang Gao
  • Bo An
  • Hao Wang
  • Xingguo Chen

There are two classes of average reward reinforcement learning (RL) algorithms: model-based ones that explicitly maintain MDP models and model-free ones that do not learn such models. Though model-free algorithms are known to be more efficient, they often cannot converge to optimal policies due to the perturbation of parameters. In this paper, a novel model-free algorithm is proposed, which makes use of constant shifting values (CSVs) estimated from prior knowledge. To encourage exploration during the learning process, the algorithm constantly subtracts the CSV from the rewards. A terminating condition is proposed to handle the unboundedness of Q-values caused by such subtraction. The convergence of the proposed algorithm is proved under very mild assumptions. Furthermore, linear function approximation is investigated to generalize our method to handle large-scale tasks. Extensive experiments on representative MDPs and the popular game Tetris show that the proposed algorithms significantly outperform the state-of-the-art ones.

IJCAI Conference 2016 Conference Paper

Efficient Resource Allocation for Protecting Coral Reef Ecosystems

  • Yue Yin
  • Bo An

Coral reefs are valuable and fragile ecosystems which are under threat from human activities like coral mining. Many countries have built marine protected areas (MPAs) and protect their ecosystems through boat patrol. However, it remains a significant challenge to efficiently patrol the MPAs given the limited patrol resources of the protection agency and potential destructors' strategic actions. In this paper, we view the problem of efficiently patrolling for protecting coral reef ecosystems from a game-theoretic perspective and propose 1) a new Stackelberg game model to formulate the problem of protecting MPAs, 2) two algorithms to compute the efficient protection agency's strategies: CLP in which the protection agency's strategies are compactly represented as fractional flows in a network, and CDOG which combines the techniques of compactly representing defender strategies and incrementally generating strategies. Experimental results show that our approach leads to significantly better solution quality than that of previous works.

AAMAS Conference 2016 Conference Paper

Measuring the Distance Between Finite Markov Decision Processes

  • Jinhua Song
  • Yang Gao
  • Hao Wang
  • Bo An

Markov decision processes (MDPs) have been studied for many decades. Recent research in using transfer learning methods to solve MDPs has shown that knowledge learned from one MDP may be used to solve a similar MDP better. In this paper, we propose two metrics for measuring the distance between finite MDPs. Our metrics are based on the Hausdorff metric, which measures the distance between two subsets of a metric space, and the Kantorovich metric for measuring the distance between probability distributions. Our metrics can be used to compute the distance between reinforcement learning tasks that are modeled as MDPs. The second contribution of this paper is that we apply the metrics to direct transfer learning by finding similar source tasks. Our third contribution is that we propose two knowledge transfer methods which transfer value functions of the selected source tasks to the target task. Extensive experimental results show that our metrics are effective in finding similar tasks and significantly improve the performance of transfer learning with the transfer methods.

IJCAI Conference 2016 Conference Paper

Optimal Interdiction of Illegal Network Flow

  • Qingyu Guo
  • Bo An
  • Yair Zick
  • Chunyan Miao

Large scale smuggling of illegal goods is a long-standing problem, with $1.4b and thousands of agents assigned to protect the borders from such activity on the US-Mexico border alone. Illegal smuggling activities are usually blocked via inspection stations or ad-hoc checkpoints/roadblocks. Security resources are insufficient to man all stations at all times; furthermore, smugglers regularly conduct surveillance activities. This paper makes several contributions toward the challenging task of optimally interdicting an illegal network flow: i) A new Stackelberg game model for network flow interdiction; ii) A novel Column and Constraint Generation approach for computing the optimal defender strategy; iii) Complexity analysis of the column generation subproblem; iv) Compact convex nonlinear programs for solving the subproblems; v) Novel greedy and heuristic approaches for subproblems with good approximation guarantee. Experimental evaluation shows that our approach can obtain a robust solution that significantly outperforms the existing methods and heuristic baselines and scales up to realistic-sized problems.

AAMAS Conference 2016 Conference Paper

Optimal Pricing for Efficient Electric Vehicle Charging Station Management

  • Yanhai Xiong
  • Jiarui Gan
  • Bo An
  • Chunyan Miao
  • Yeng Chai Soh

The rapid development of Electric Vehicles (EVs) seen in recent years has drawn increasing attention from the public, markets, decision-makers, and academia. Notwithstanding the progress, issues still remain. Because of the widely criticized disadvantages of limited battery capacity and long charging time, charging convenience has become a top concern that greatly hinders the adoption of EVs. Specialized EV charging stations, which provide more than 10 times faster charging than domestic charging, are therefore a critical element for successful EV promotion. While most existing research focuses on optimizing the spatial placement of charging stations, such approaches are inflexible and inefficient against rapidly changing urban structure and traffic patterns. Therefore, this paper approaches the management of EV charging stations from the pricing perspective as a more flexible and adaptive complement to established charging station placement. In this paper, we build a realistic pricing model in consideration of residential travel patterns and EV drivers’ self-interested charging behavior, traffic congestion, and the operating expenses of charging stations. We formulate the pricing problem as a mixed integer non-convex optimization problem and propose a scalable algorithm to solve it. Experiments on both mock and real data are also conducted, which show the scalability of our algorithm as well as our solution’s significant improvement over existing approaches.

AAMAS Conference 2016 Conference Paper

Optimal Sample Size for Adword Auctions (Extended Abstract)

  • Jiang Rong
  • Tao Qin
  • Bo An
  • Tie-Yan Liu

The Generalized Second Price (GSP) mechanism is widely used in ad auctions, and the reserve price is an effective tool for revenue maximization. The optimal reserve price depends on bidders’ value distribution, which, however, is generally unknown to auctioneers. A common practice for auctioneers is to first collect information about the value distribution by a sampling procedure and then apply the reserve price estimated with the sampled bids to the following auctions. In order to maximize his/her total revenue over finite GSP ad auctions, it is important for the auctioneer to find a proper sample size to trade off between the cost of the sampling procedure and the optimality of the estimated reserve price. We first propose revenue bounds during and after sampling. We then formulate the problem of finding the optimal sample size that maximizes the auctioneer’s worst-case total revenue as a constrained optimization problem, the solution of which is independent of the value distribution.

IJCAI Conference 2016 Conference Paper

Optimally Protecting Elections

  • Yue Yin
  • Yevgeniy Vorobeychik
  • Bo An
  • Noam Hazon

Election control encompasses attempts from an external agent to alter the structure of an election in order to change its outcome. This problem is both a fundamental theoretical problem in social choice, and a major practical concern for democratic institutions. Consequently, this issue has received considerable attention, particularly as it pertains to different voting rules. In contrast, the problem of how election control can be prevented or deterred has been largely ignored. We introduce the problem of optimal protection against election control, where manipulation is allowed at the granularity of groups of voters (e.g., voting locations), through a denial-of-service attack, and the defender allocates limited protection resources to prevent control. We show that for plurality voting, election control through group deletion to prevent a candidate from winning is in P, while it is NP-Hard to prevent such control. We then present a double-oracle framework for computing an optimal prevention strategy, developing exact mixed-integer linear programming formulations for both the defender and attacker oracles (we show both of these subproblems to be NP-Hard), as well as heuristic oracles. Experiments conducted on both synthetic and real data demonstrate that the proposed computational framework can scale to realistic problem instances.

AAAI Conference 2016 Conference Paper

Optimizing Personalized Email Filtering Thresholds to Mitigate Sequential Spear Phishing Attacks

  • Mengchen Zhao
  • Bo An
  • Christopher Kiekintveld

Highly targeted spear phishing attacks are increasingly common, and have been implicated in many major security breaches. Email filtering systems are the first line of defense against such attacks. These filters are typically configured with uniform thresholds for deciding whether or not to allow a message to be delivered to a user. However, users have very significant differences in both their susceptibility to phishing attacks as well as their access to critical information and credentials that can cause damage. Recent work has considered setting personalized thresholds for individual users based on a Stackelberg game model. We consider two important extensions of the previous model. First, in our model user values can be substitutable, modeling cases where multiple users provide access to the same information or credential. Second, we consider attackers who make sequential attack plans based on the outcome of previous attacks. Our analysis starts from scenarios where there is only one credential and then extends to more general scenarios with multiple credentials. For single-credential scenarios, we demonstrate that the optimal defense strategy can be found by solving a binary combinatorial optimization problem called PEDS. For multiple-credential scenarios, we formulate it as a bilevel optimization problem for finding the optimal defense strategy and then reduce it to a single level optimization problem called PEMS using complementary slackness conditions. Experimental results show that both PEDS and PEMS lead to significantly higher defender utilities than two existing benchmarks in different parameter settings. Also, both PEDS and PEMS are more robust than the existing benchmarks under uncertainty.

IJCAI Conference 2015 Conference Paper

Computing Optimal Mixed Strategies for Security Games with Dynamic Payoffs

  • Yue Yin
  • Haifeng Xu
  • Jiarui Gan
  • Bo An
  • Albert Xin Jiang

Security agencies in the real world often need to protect targets with time-dependent values, e.g., tourist sites where the number of travelers changes over time. Since the values of different targets often change asynchronously, the defender can relocate security resources among targets dynamically to make the best use of limited resources. We propose a game-theoretic scheme to develop dynamic, randomized security strategies in consideration of adversary’s surveillance capability. This differs from previous studies on security games by considering varying target values and continuous strategy spaces of the security agency and the adversary. The main challenge lies in the computational intensiveness due to the continuous, hence infinite strategy spaces. We propose an optimal algorithm and an arbitrarily near-optimal algorithm to compute security strategies under different conditions. Experimental results show that both algorithms significantly outperform existing approaches.

IJCAI Conference 2015 Conference Paper

Optimal Electric Vehicle Charging Station Placement

  • Yanhai Xiong
  • Jiarui Gan
  • Bo An
  • Chunyan Miao
  • Ana L. C. Bazzan

Many countries like Singapore are planning to introduce Electric Vehicles (EVs) to replace traditional vehicles to reduce air pollution and improve energy efficiency. The rapid development of EVs calls for efficient deployment of charging stations, both for the convenience of EVs and for maintaining the efficiency of the road network. Unfortunately, existing work makes unrealistic assumptions about EV drivers’ charging behaviors and focuses on the limited mobility of EVs. This paper studies the Charging Station PLacement (CSPL) problem, and takes into consideration 1) EV drivers’ strategic behaviors to minimize their charging cost, and 2) the mutual impact of EV drivers’ strategies on the traffic conditions of the road network and service quality of charging stations. We first formulate the CSPL problem as a bilevel optimization problem, which is subsequently converted to a single-level optimization problem by exploiting structures of the EV charging game. Properties of the CSPL problem are analyzed and an algorithm called OCEAN is proposed to compute the optimal allocation of charging stations. We further propose a heuristic algorithm OCEAN-C to speed up OCEAN. Experimental results show that the proposed algorithms significantly outperform baseline methods.

AAAI Conference 2015 Conference Paper

Security Games with Protection Externalities

  • Jiarui Gan
  • Bo An
  • Yevgeniy Vorobeychik

Stackelberg security games have been widely deployed in recent years to schedule security resources. An assumption in most existing security game models is that one security resource assigned to a target only protects that target. However, in many important real-world security scenarios, when a resource is assigned to a target, it exhibits protection externalities: that is, it also protects other “neighbouring” targets. We investigate such Security Games with Protection Externalities (SPEs). First, we demonstrate that computing a strong Stackelberg equilibrium for an SPE is NP-hard, in contrast with traditional Stackelberg security games which can be solved in polynomial time. On the positive side, we propose a novel column generation based approach—CLASPE—to solve SPEs. CLASPE features the following novelties: 1) a novel mixed-integer linear programming formulation for the slave problem; 2) an extended greedy approach with a constant-factor approximation ratio to speed up the slave problem; and 3) a linear-scale linear program that efficiently calculates the upper bounds of target-defined subproblems for pruning. Our experimental evaluation demonstrates that CLASPE enables us to scale to realistic-sized SPE problem instances.

AAAI Conference 2014 Conference Paper

Game-Theoretic Resource Allocation for Protecting Large Public Events

  • Yue Yin
  • Bo An
  • Manish Jain

High profile large scale public events are attractive targets for terrorist attacks. The recent Boston Marathon bombings on April 15, 2013 have further emphasized the importance of protecting public events. The security challenge is exacerbated by the dynamic nature of such events: e.g., the impact of an attack at different locations changes over time as the Boston marathon participants and spectators move along the race track. In addition, the defender can relocate security resources among potential attack targets at any time and the attacker may act at any time during the event. This paper focuses on developing efficient patrolling algorithms for such dynamic domains with continuous strategy spaces for both the defender and the attacker. We propose SCOUT-A, which makes assumptions on relocation cost, exploits payoff representation and computes optimal solutions efficiently. We also propose SCOUT-C to compute the exact optimal defender strategy for general cases despite the continuous strategy spaces. SCOUT-C computes the optimal defender strategy by constructing an equivalent game with discrete defender strategy space, then solving the constructed game. Experimental results show that both SCOUT-A and SCOUT-C significantly outperform other existing strategies.

AAAI Conference 2014 Conference Paper

Regret-Based Optimization and Preference Elicitation for Stackelberg Security Games with Uncertainty

  • Thanh Nguyen
  • Amulya Yadav
  • Bo An
  • Milind Tambe
  • Craig Boutilier

Stackelberg security games (SSGs) have been deployed in a number of real-world domains. One key challenge in these applications is the assessment of attacker payoffs, which may not be perfectly known. Previous work has studied SSGs with uncertain payoffs modeled by interval uncertainty and provided maximin-based robust solutions. In contrast, in this work we propose the use of the less conservative minimax regret decision criterion for such payoff-uncertain SSGs and present the first algorithms for computing minimax regret for SSGs. We also address the challenge of preference elicitation, using minimax regret to develop the first elicitation strategies for SSGs. Experimental results validate the effectiveness of our approaches.

IJCAI Conference 2013 Conference Paper

A Reputation Management Approach for Resource Constrained Trustee Agents

  • Han Yu
  • Chunyan Miao
  • Bo An
  • Cyril Leung
  • Victor R. Lesser

Trust is an important mechanism enabling agents to self-police open and dynamic multi-agent systems (ODMASs). Trusters evaluate the reputation of trustees based on their past observed performance, and use this information to guide their future interaction decisions. Existing trust models tend to concentrate trusters’ interactions on a small number of highly reputable trustees to minimize risk exposure. When a trustee’s servicing capacity is limited, such an approach may cause long delays for trusters and subsequently damage the reputation of trustees. To mitigate this problem, we propose a reputation management approach for trustee agents based on distributed constraint optimization. It helps a trustee to make situation-aware decisions on which incoming requests to serve and prevent the resulting reputation score from being affected by factors out of the trustee’s control. The approach is evaluated through theoretical analysis and within a simulated, highly dynamic multi-agent environment. The results show that it can achieve close to optimally efficient utilization of the trustee agents’ collective capacity in an ODMAS, promotes fair treatment of trustee agents based on their behavior, and significantly outperforms related work in enhancing social welfare.

AAMAS Conference 2013 Conference Paper

A Reputation-aware Decision-making Approach for Improving the Efficiency of Crowdsourcing Systems

  • Han Yu
  • Zhiqi Shen
  • Chunyan Miao
  • Bo An

A crowdsourcing system is a useful platform for utilizing the intelligence and skills of the masses. Nevertheless, like any open system that involves the exchange of things of value, selfish and malicious behaviors exist in crowdsourcing systems and need to be mitigated. Trust management has been proven to be a viable solution in many systems. However, a major difference between crowdsourcing systems and existing trust models designed for multi-agent systems is that human trustees have limited task processing capacity per unit time compared to an intelligent agent program. This paper recognizes a problem in current trust-aware decision-making methods for task assignment in crowdsourcing platforms. On the one hand, trust-based methods over-assign tasks to trusted workers, while on the other hand, workload-based solutions do not give sufficient guarantees on the quality of work. The proposed solution, the social welfare optimizing reputation-aware decision-making (SWORD) approach, strikes a balance between the two and is shown through extensive simulations to significantly improve social welfare of crowdsourcing platforms compared to related work.

IJCAI Conference 2013 Conference Paper

Optimal Pricing for Improving Efficiency of Taxi Systems

  • Jiarui Gan
  • Bo An
  • Haizhong Wang
  • Xiaoming Sun
  • Zhongzhi Shi

In Beijing, most taxi drivers intentionally avoid working during peak hours despite the huge customer demand within these peak periods. This dilemma is mainly due to the fact that taxi drivers’ congestion costs are not reflected in the current taxi fare structure. To resolve this problem, we propose a new pricing scheme to provide taxi drivers with extra incentives to work during peak hours. This differs from previous studies of the taxi market by considering market variance over multiple periods, taxi drivers’ profit-driven decisions, and their scheduling constraints regarding the interdependence among different periods. The major challenge of this research is the computational intensiveness of identifying the optimal strategy due to the exponentially large size of a taxi driver’s strategy space and the scheduling constraints. We develop an atom schedule method to overcome these issues. It reduces the magnitude of the problem while satisfying the constraints to filter out infeasible pure strategies. Simulation results based on real data show the effectiveness of the proposed methods, which opens up a new door to improving the efficiency of the taxi market in megacities (e.g., Beijing).

AAMAS Conference 2012 Conference Paper

Adversarial Patrolling Games

  • Yevgeniy Vorobeychik
  • Bo An
  • Milind Tambe

Defender-Attacker Stackelberg games are the foundations of tools deployed for computing optimal patrolling strategies in adversarial domains such as the United States Federal Air Marshals Service and the United States Coast Guard, among others. In Stackelberg game models of these systems the attacker knows only the probability that each target is covered by the defender, but is oblivious to the detailed timing of the coverage schedule. In many real-world situations, however, the attacker can observe the current location of the defender and can exploit this knowledge to reason about the defender’s future moves. We study Stackelberg security games in which the defender sequentially moves between targets, with moves constrained by an exogenously specified graph, while the attacker can observe the defender’s current location and his (stochastic) policy concerning future moves.

JAAMAS Journal 2012 Journal Article

An extended study on multi-objective security games

  • Matthew Brown
  • Bo An
  • Milind Tambe

The burgeoning area of security games has focused on real-world domains where security agencies protect critical infrastructure from a diverse set of adaptive adversaries. In such domains, decision makers have multiple competing objectives they must consider which may take different forms that are not readily comparable including safety, cost, and public perception. Thus, it can be difficult to know how to weigh the different objectives when deciding on a security strategy. To address the challenges of these domains, we propose a fundamentally different solution concept, multi-objective security games (MOSGs). Instead of a single optimal solution, MOSGs have a set of Pareto optimal (non-dominated) solutions referred to as the Pareto frontier, which can be generated by solving a sequence of constrained single-objective optimization problems (CSOPs). The Pareto frontier allows the decision maker to analyze the tradeoffs that exist between the multiple objectives. Our contributions include: (i) an algorithm, Iterative-ε-Constraints, for generating the sequence of CSOPs; (ii) an exact approach for solving a mixed-integer linear program (MILP) formulation of a CSOP; (iii) heuristics that achieve speedup by exploiting the structure of security games to further constrain the MILP; (iv) an approximate approach for solving a CSOP built off those same heuristics, increasing the scalability of our approach with quality guarantees. Additional contributions of this paper include proofs on the level of approximation, detailed experimental evaluation of the proposed approaches and heuristics, as well as a discussion on techniques for visualizing the Pareto frontier.

JAAMAS Journal 2012 Journal Article

Bilateral bargaining with one-sided uncertain reserve prices

  • Bo An
  • Nicola Gatti
  • Victor Lesser

The problem of finding agents’ rational strategies in bargaining with incomplete information is well known to be challenging. The literature provides a collection of results for very narrow uncertainty settings, but no generally applicable algorithm. This lack has led researchers to develop heuristic approaches in an attempt to find outcomes that, even if not being of equilibrium, are mutually satisfactory. In the present paper, we focus on the principal bargaining protocol (i.e., the alternating-offers protocol) where there is uncertainty regarding one agent’s reserve price. We provide an algorithm based on the combination of game theoretic analysis and search techniques which finds pure strategy sequential equilibria when they exist. Our approach is sound, complete and, in principle, can be applied to other uncertainty settings, e.g., uncertain discount factors, and uncertain weights of negotiation issues in multi-issue negotiation. We experimentally evaluate our algorithm with a number of case studies showing that the average computational time is less than 30 seconds and at least one pure strategy equilibrium exists in almost all (about 99.7%) the bilateral bargaining scenarios we have looked at in the paper.

AAMAS Conference 2012 Conference Paper

Multi-Objective Optimization for Security Games

  • Matthew Brown
  • Bo An
  • Christopher Kiekintveld
  • Fernando Ordóñez
  • Milind Tambe

The burgeoning area of security games has focused on real-world domains where security agencies protect critical infrastructure from a diverse set of adaptive adversaries. There are security domains where the payoffs for preventing the different types of adversaries may take different forms (seized money, reduced crime, saved lives, etc) which are not readily comparable. Thus, it can be difficult to know how to weigh the different payoffs when deciding on a security strategy. To address the challenges of these domains, we propose a fundamentally different solution concept, multi-objective security games (MOSG), which combines security games and multi-objective optimization. Instead of a single optimal solution, MOSGs have a set of Pareto optimal (non-dominated) solutions referred to as the Pareto frontier. The Pareto frontier can be generated by solving a sequence of constrained single-objective optimization problems (CSOP), where one objective is selected to be maximized while lower bounds are specified for the other objectives. Our contributions include: (i) an algorithm, Iterative ε-Constraints, for generating the sequence of CSOPs; (ii) an exact approach for solving an MILP formulation of a CSOP (which also applies to multi-objective optimization in more general Stackelberg games); (iii) heuristics that achieve speedup by exploiting the structure of security games to further constrain a CSOP; (iv) an approximate approach for solving an algorithmic formulation of a CSOP, increasing the scalability of our approach with quality guarantees. Additional contributions of this paper include proofs on the level of approximation and detailed experimental evaluation of the proposed approaches.

AAMAS Conference 2012 Conference Paper

PROTECT: A Deployed Game Theoretic System to Protect the Ports of the United States

  • Eric Shieh
  • Bo An
  • Rong Yang
  • Milind Tambe
  • Craig Baldwin
  • Joseph DiRenzo
  • Ben Maule
  • Garrett Meyer

While three deployed applications of game theory for security have recently been reported at AAMAS, we as a community remain in the early stages of these deployments; there is a continuing need to understand the core principles for innovative security applications of game theory. Towards that end, this paper presents PROTECT, a game-theoretic system deployed by the United States Coast Guard (USCG) in the port of Boston for scheduling their patrols. USCG has termed the deployment of PROTECT in Boston a success, and efforts are underway to test it in the port of New York, with the potential for nationwide deployment. PROTECT is premised on an attacker-defender Stackelberg game model and offers five key innovations. First, this system is a departure from the assumption of perfect adversary rationality noted in previous work, relying instead on a quantal response (QR) model of the adversary's behavior --- to the best of our knowledge, this is the first real-world deployment of the QR model. Second, to improve PROTECT's efficiency, we generate a compact representation of the defender's strategy space, exploiting equivalence and dominance. Third, we show how to practically model a real maritime patrolling problem as a Stackelberg game. Fourth, our experimental results illustrate that PROTECT's QR model more robustly handles real-world uncertainties than a perfect rationality model. Finally, in evaluating PROTECT, this paper for the first time provides real-world data: (i) comparison of human-generated vs PROTECT security schedules, and (ii) results from an Adversarial Perspective Team's (human mock attackers) analysis.

AAAI Conference 2012 Conference Paper

PROTECT: An Application of Computational Game Theory for the Security of the Ports of the United States

  • Eric Shieh
  • Bo An
  • Rong Yang
  • Milind Tambe
  • Craig Baldwin
  • Joseph DiRenzo
  • Ben Maule
  • Garrett Meyer

Building upon previous security applications of computational game theory, this paper presents PROTECT, a game-theoretic system deployed by the United States Coast Guard (USCG) in the port of Boston for scheduling their patrols. USCG has termed the deployment of PROTECT in Boston a success, and efforts are underway to test it in the port of New York, with the potential for nationwide deployment. PROTECT is premised on an attacker-defender Stackelberg game model and offers five key innovations. First, this system is a departure from the assumption of perfect adversary rationality noted in previous work, relying instead on a quantal response (QR) model of the adversary’s behavior — to the best of our knowledge, this is the first real-world deployment of the QR model. Second, to improve PROTECT’s efficiency, we generate a compact representation of the defender’s strategy space, exploiting equivalence and dominance. Third, we show how to practically model a real maritime patrolling problem as a Stackelberg game. Fourth, our experimental results illustrate that PROTECT’s QR model more robustly handles real-world uncertainties than a perfect rationality model. Finally, in evaluating PROTECT, this paper provides real-world data: (i) comparison of human-generated vs PROTECT security schedules, and (ii) results from an Adversarial Perspective Team’s (human mock attackers) analysis.

AAAI Conference 2012 Conference Paper

Security Games with Limited Surveillance

  • Bo An
  • David Kempe
  • Christopher Kiekintveld
  • Eric Shieh
  • Satinder Singh
  • Milind Tambe
  • Yevgeniy Vorobeychik

Randomized first-mover strategies of Stackelberg games are used in several deployed applications to allocate limited resources for the protection of critical infrastructure. Stackelberg games model the fact that a strategic attacker can surveil and exploit the defender’s strategy, and randomization guards against the worst effects by making the defender less predictable. In accordance with the standard game-theoretic model of Stackelberg games, past work has typically assumed that the attacker has perfect knowledge of the defender’s randomized strategy and will react correspondingly. In light of the fact that surveillance is costly, risky, and delays an attack, this assumption is clearly simplistic: attackers will usually act on partial knowledge of the defender’s strategies. The attacker’s imperfect estimate could present opportunities and possibly also threats to a strategic defender. In this paper, we therefore begin a systematic study of security games with limited surveillance. We propose a natural model wherein an attacker forms or updates a belief based on observed actions, and chooses an optimal response. We investigate the model both theoretically and experimentally. In particular, we give mathematical programs to compute optimal attacker and defender strategies for a fixed observation duration, and show how to use them to estimate the attacker’s observation durations. Our experimental results show that the defender can achieve significant improvement in expected utility by taking the attacker’s limited surveillance into account, validating the motivation of our work.
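The limited-surveillance setting above can be illustrated with a minimal sketch of an attacker forming a belief from a finite number of observations and best-responding to it. The Dirichlet-smoothed estimate and the payoff structure here are illustrative assumptions, not the paper's actual belief model.

```python
from collections import Counter

def attacker_estimate(observations, num_targets, prior=1.0):
    """Attacker's estimate of the defender's coverage after limited
    surveillance: a pseudo-count `prior` per target, incremented once
    for each observed patrol of that target."""
    counts = Counter(observations)
    total = len(observations) + prior * num_targets
    return [(counts[t] + prior) / total for t in range(num_targets)]

def best_response(coverage, rewards, penalties):
    """Attack the target maximizing expected utility
    c * penalty + (1 - c) * reward under estimated coverage c."""
    utils = [c * p + (1 - c) * r
             for c, r, p in zip(coverage, rewards, penalties)]
    return max(range(len(utils)), key=utils.__getitem__)
```

With few observations the estimate can differ sharply from the defender's true mixed strategy, which is exactly the gap a strategic defender can account for when computing its own strategy.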

AAMAS Conference 2011 Conference Paper

Agent-Mediated Multi-Step Optimization for Resource Allocation in Distributed Sensor Networks

  • Bo An
  • Victor Lesser
  • David Westbrook
  • Michael Zink

Distributed collaborative adaptive sensing (DCAS) of the atmosphere is a new paradigm for detecting and predicting hazardous weather using a large dense network of short-range, low-powered radars to sense the lowest few kilometers of the earth's atmosphere. In DCAS, radars are controlled by a collection of Meteorological Command and Control (MC&C) agents that instruct them where to scan based on emerging weather conditions. Within this context, this work concentrates on designing efficient approaches for allocating sensing resources to cope with restricted real-time requirements and limited computational resources. We have developed a new approach based on explicit goals that can span multiple system heartbeats. This allows us to reason ahead about sensor allocations based on expected requirements of goals as they project forward in time. Each goal explicitly specifies end-users' preferences as well as a prediction of how a phenomenon will move. We use a genetic algorithm to generate scanning strategies of each single MC&C and a distributed negotiation model to coordinate multiple MC&Cs' scanning strategies over multiple heartbeats. Simulation results show that, compared to simpler variants of our approach, the proposed distributed model achieved the highest social welfare. Our approach has also exhibited very good performance in an operational radar testbed that is deployed in Oklahoma to observe severe weather events.

AAMAS Conference 2011 Conference Paper

Negotiation Over Decommitment Penalty

  • Bo An
  • Victor Lesser

We consider the role of negotiation in deciding decommitment penalties. In our model, agents simultaneously negotiate over both the contract price and decommitment penalty in the contracting game and then decide whether to decommit from contracts in the decommitment game. Experimental results show that setting penalties through negotiation achieved higher social welfare than other exogenous penalty setting mechanisms.

AAAI Conference 2011 Conference Paper

Refinement of Strong Stackelberg Equilibria in Security Games

  • Bo An
  • Milind Tambe
  • Fernando Ordonez
  • Eric Shieh
  • Christopher Kiekintveld

Given the real-world deployments of attacker-defender Stackelberg security games, robustness to deviations from expected attacker behaviors has now emerged as a critically important issue. This paper provides four key contributions in this context. First, it identifies a fundamentally problematic aspect of current algorithms for security games. It shows that there are many situations where these algorithms face multiple equilibria, and they arbitrarily select one that may hand the defender a significant disadvantage, particularly if the attacker deviates from its equilibrium strategies due to unknown constraints. Second, for important subclasses of security games, it identifies situations where we will face such multiple equilibria. Third, to address these problematic situations, it presents two equilibrium refinement algorithms that can optimize the defender’s utility if the attacker deviates from equilibrium strategies. Finally, it experimentally illustrates that the refinement approach achieved significant robustness in consideration of attackers’ deviation due to unknown constraints.

AAMAS Conference 2010 Conference Paper

Automated Negotiation with Decommitment for Dynamic Resource Allocation in Cloud Computing

  • Bo An
  • Victor Lesser
  • David Irwin
  • Michael Zink

We consider the problem of allocating networked resources in dynamic environments, such as cloud computing platforms, where providers strategically price resources to maximize their utility. Resource allocation in these environments, where both providers and consumers are selfish agents, presents numerous challenges since the number of consumers and their resource demand is highly dynamic. While numerous auction-based approaches have been proposed in the literature, this paper explores an alternative approach where providers and consumers automatically negotiate resource leasing contracts. Since resource demand and supply can be dynamic and uncertain, we propose a distributed negotiation mechanism where agents negotiate over both a contract price and a decommitment penalty, which allows agents to decommit from contracts at a cost. We compare our approach experimentally, using representative scenarios and workloads, to both combinatorial auctions and the fixed-price model used by Amazon's Elastic Compute Cloud, and show that the negotiation model achieves a higher social welfare.

AAMAS Conference 2010 Conference Paper

Searching for Pure Strategy Equilibria in Bilateral Bargaining With One-sided Uncertainty

  • Bo An
  • Nicola Gatti
  • Victor Lesser

The problem of finding agents' rational strategies in bargaining with incomplete information is well known to be challenging. The literature provides a collection of results for very narrow uncertainty settings, but no generally applicable algorithm. In this paper, we focus on the alternating-offers finite horizon bargaining protocol with one-sided uncertainty regarding agents' reserve prices. We provide an algorithm based on the combination of game theoretic analysis and search techniques which finds agents' pure-strategy equilibria when they exist. Our approach is sound, complete and, in principle, can be applied to other uncertainty settings.

JAAMAS Journal 2010 Journal Article

Strategic agents for multi-resource negotiation

  • Bo An
  • Victor Lesser
  • Kwang Mong Sim

In electronic commerce markets where selfish agents behave individually, agents often have to acquire multiple resources in order to accomplish a high level task, with each resource acquisition requiring negotiations with multiple resource providers. Thus, it is crucial to efficiently coordinate these interrelated negotiations. This paper presents the design and implementation of agents that concurrently negotiate with other entities for acquiring multiple resources. Negotiation agents in this paper are designed to adjust (1) the number of tentative agreements for each resource and (2) the amount of concession they are willing to make in response to changing market conditions and negotiation situations. In our approach, agents utilize a time-dependent negotiation strategy in which the reserve price of each resource is dynamically determined by (1) the likelihood that negotiation will not be successfully completed (conflict probability), (2) the expected agreement price of the resource, and (3) the expected number of final agreements. The negotiation deadline of each resource is determined by its relative scarcity. Agents are permitted to decommit from agreements by paying a time-dependent penalty, and a buyer can make more than one tentative agreement for each resource. The maximum number of tentative agreements for each resource made by an agent is constrained by the market situation. Experimental results show that our negotiation strategy achieved significantly more utilities than simpler strategies.
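The time-dependent negotiation strategy family referenced in this abstract (and the following one) can be sketched with the classic polynomial concession rule. This is a generic illustration of the family, with a hypothetical concession exponent `beta`, not the paper's exact reserve-price computation.

```python
def proposal(t, deadline, initial_price, reserve_price, beta=1.0):
    """Time-dependent concession: the offer moves from initial_price
    toward reserve_price as the deadline approaches. beta < 1 concedes
    late (Boulware), beta > 1 concedes early (Conceder), beta = 1 is
    linear concession."""
    frac = min(t / deadline, 1.0) ** beta
    return initial_price + frac * (reserve_price - initial_price)
```

In the paper's setting the reserve price fed into such a rule is itself dynamic, recomputed from the conflict probability, the expected agreement price, and the expected number of final agreements as market conditions change.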

AAMAS Conference 2008 Conference Paper

Decommitment in Multi-resource Negotiation

  • Bo An
  • Victor Lesser
  • Kwang Mong Sim

This paper presents the design and implementation of negotiation agents that negotiate with other entities for acquiring multiple resources. In our approach, agents utilize a time-dependent negotiation strategy in which the reserve price of each negotiation issue is dynamically determined by 1) the likelihood that negotiation will not be successfully completed (conflict probability), 2) the expected agreement price of the issue, and 3) the expected number of final agreements. Results from a series of experiments indicate that on average, our negotiation strategy achieved higher average utility than traditional negotiation strategies.

AAMAS Conference 2008 Conference Paper

Heuristics for Negotiation Schedules in Multi-plan Optimization

  • Bo An
  • Fred Douglis
  • Fan Ye

In cooperating systems such as grids [4] and collaborative streaming analysis [2], autonomous sites can establish "agreements" to arrange access to remote resources for a period of time [1]. Which resources to reserve to accomplish a task need not be known a priori, because there exist multiple plans for accomplishing the same task and they may require access to different resources [3]. While these plans can be functionally equivalent, they may have different performance/cost tradeoffs and may use a variety of resources, both local and belonging to other sites. The negotiation schedule, i.e., the order in which remote resources are negotiated, determines how quickly one plan can be selected and deployed; it also decides the utility for running the plan. This paper studies the problem of optimizing negotiation schedules in cooperative systems with multiple plans. We first provide a voting-based heuristic that reduces the complexity O(n!) of the exhaustive search to O(mnq). We also present a weight-based heuristic that further reduces the complexity to O(mn). Experimental results show that, on average, 1) the voting-based approach achieved 6% higher utility than the weight-based approach, but the voting-based approach has a much higher computation cost than the weight-based approach; 2) the two proposed approaches achieved almost 50% higher utility than a randomized approach; and 3) the average utility produced by the two proposed approaches is within almost 90% of that of the optimal results with reasonable plan sizes.

IJCAI Conference 2007 Conference Paper

  • Bo An
  • Chunyan Miao
  • Zhiqi Shen

Although there are some research efforts toward resource allocation in multi-agent systems (MAS), most of these works assume that each agent has complete information about other agents. This research investigates interactions among selfish, rational, and autonomous agents in resource allocation, each with incomplete information about other entities, and each seeking to maximize its expected utility. This paper presents a proportional resource allocation mechanism and gives a game theoretical analysis of the optimal strategies; the analysis shows the existence of equilibrium in the incomplete information setting. By augmenting the resource allocation mechanism with a deal optimization mechanism, trading agents can be programmed to optimize resource allocation results by updating beliefs and resubmitting bids. Experimental results showed that by having a deal optimization stage, the resource allocation mechanism produced generally optimistic outcomes (close to market equilibrium).
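The proportional allocation mechanism named in the abstract can be sketched in its simplest textbook form: each agent receives a share of the resource proportional to its bid. This is an illustration of the mechanism family only; the paper's actual bidding and payment rules are not reproduced here.

```python
def proportional_allocation(bids, capacity):
    """Split a divisible resource of size `capacity` among bidders in
    proportion to their bids (the classic proportional-share rule)."""
    total = sum(bids)
    return [capacity * b / total for b in bids]
```

Under such a rule each agent's share depends on everyone else's bids, which is what makes the optimal bid a game-theoretic (equilibrium) object rather than a simple valuation report, especially when agents hold incomplete information about one another.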