Arrow Research search

Author name cluster

Yi Wu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

52 papers
2 author rows

Possible papers (52)

AAAI Conference 2026 Conference Paper

Learning Knowledge from Textual Descriptions for 3D Human Pose Estimation

  • Yi Wu
  • Jingtian Li
  • Shangfei Wang
  • Guoming Li
  • Meng Mao
  • Linxiang Tan

Mainstream 3D human pose estimation methods directly predict the 3D coordinates of joints from 2D keypoints and therefore suffer from severe depth ambiguity. Pose textual descriptions contain abundant semantic information, which helps the model learn the spatial relationships among different body parts and partially alleviates this issue. Leveraging this insight, we propose a 3D human pose estimation method assisted by textual descriptions. Specifically, we utilize an automatic captioning pipeline to generate textual descriptions of 3D poses based on spatial relations among joints. These descriptions include details regarding angles, distances, relative positions, pitch & roll, and ground contacts. Subsequently, text features are extracted from these descriptions using a language model, while a 3D human pose estimation model extracts pose features. Aligning the pose features with the text features allows for a more targeted optimization of the estimation model. To this end, we systematically introduce three alignment approaches to effectively align features extracted by two models operating in entirely different domains. Our method incorporates prior knowledge derived from the textual descriptions into the estimation model and can be seamlessly applied to various existing frameworks. Experimental results on the Human3.6M and MPI-INF-3DHP datasets demonstrate that our method surpasses state-of-the-art methods.
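
The three alignment approaches are not spelled out in the abstract, so as a purely hypothetical illustration, the sketch below aligns pose and text embeddings with a symmetric InfoNCE-style contrastive loss; `pose_feats`, `text_feats`, and the temperature value are assumptions, not the paper's method.

```python
# Hypothetical sketch: contrastive alignment between pose and text features.
# `pose_feats` / `text_feats` stand in for the two encoders' outputs; the
# paper's three specific alignment approaches are not reproduced here.
import torch
import torch.nn.functional as F

def alignment_loss(pose_feats: torch.Tensor, text_feats: torch.Tensor,
                   temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling each pose embedding toward the text
    embedding of the same sample and away from other samples in the batch."""
    pose = F.normalize(pose_feats, dim=-1)   # (B, D)
    text = F.normalize(text_feats, dim=-1)   # (B, D)
    logits = pose @ text.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(pose.size(0), device=pose.device)
    # Symmetric cross-entropy over both matching directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = alignment_loss(torch.randn(8, 256), torch.randn(8, 256))
```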

JBHI Journal 2025 Journal Article

A Multi-Modality Attention Network for Driver Fatigue Detection Based on Frontal EEG, EDA and PPG Signals

  • Yuanru Guo
  • Kunping Yang
  • Yi Wu

Fatigue driving is a common issue that often leads to traffic accidents, which has motivated numerous automatic driving fatigue detection methods based on various sources, especially reliable physiological signals. However, these methods still face challenges in accuracy, robustness, and practicality, especially for cross-subject detection. Fusing multi-modality data can improve the estimation of driving fatigue. In this work, we take advantage of user-friendly, multi-modality signals to build a Multi-Modality Attention Network (MMA-Net) for driver fatigue detection based on a hybrid of frontal electroencephalography (EEG), electrodermal activity (EDA), and photoplethysmography (PPG) signals. Specifically, a signal adaptive coding module (SAC-M) is constructed to fully exploit the spatial-temporal information of the signals, combined with an attention-based feature dissimilation module (AFD-M) to further obtain key comprehensive features. In addition, the performances of baseline models and state-of-the-art methods on signal sources with different window lengths are compared. The cross-subject experiment is performed on two groups of 14 participants in a driving simulation experiment, and the experimental results demonstrate the superiority of our proposed method. MMA-Net thus makes driver fatigue detection with user-friendly multi-modality signals, such as our selected frontal EEG, EDA, and PPG, feasible in real-world applications.

NeurIPS Conference 2025 Conference Paper

AREAL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

  • Wei Fu
  • Jiaxuan Gao
  • Xujie Shen
  • Chen Zhu
  • Zhiyu Mei
  • Chuyi He
  • Shusheng Xu
  • Guo Wei

Reinforcement learning (RL) has become a trending paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where the rollouts in each training batch are generated by the same (or latest) model. This stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before the model update, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77x training speedup compared to synchronous systems with the same number of GPUs, with matched or even improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
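
As a rough illustration of the decoupling the abstract describes, here is a toy producer/consumer sketch in which a rollout worker generates continuously while a trainer enforces a bound on policy-version staleness. All names, the staleness bound, and the drop-stale-samples rule are illustrative assumptions; AReaL's actual mechanisms (including its staleness-enhanced PPO) are more involved.

```python
# Toy sketch of the asynchronous pattern the abstract describes: rollout
# workers generate continuously while a trainer consumes batches, with a
# staleness bound on the policy-version gap. All names are illustrative.
import queue, threading

MAX_STALENESS = 2          # max allowed policy-version gap for a sample
BATCH_SIZE = 4
buffer = queue.Queue(maxsize=64)
policy_version = 0
stop = threading.Event()

def rollout_worker():
    while not stop.is_set():
        sample = {"version": policy_version, "tokens": "..."}  # fake rollout
        buffer.put(sample)

def trainer(num_steps: int = 10):
    global policy_version
    for _ in range(num_steps):
        batch = []
        while len(batch) < BATCH_SIZE:
            s = buffer.get()
            if policy_version - s["version"] <= MAX_STALENESS:
                batch.append(s)      # fresh enough: train on it
            # else: drop the overly stale sample (a staleness-aware PPO
            # variant would reweight it instead of discarding it)
        policy_version += 1          # one optimizer step = new policy version
    stop.set()

threading.Thread(target=rollout_worker, daemon=True).start()
trainer()
```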

JMLR Journal 2025 Journal Article

BitNet: 1-bit Pre-training for Large Language Models

  • Hongyu Wang
  • Shuming Ma
  • Lingxiao Ma
  • Lei Wang
  • Wenhui Wang
  • Li Dong
  • Shaohan Huang
  • Huaijie Wang

The increasing size of large language models (LLMs) has posed challenges for deployment and raised concerns about environmental impact due to high energy consumption. Previous research typically applies quantization after pre-training. While these methods avoid the need for model retraining, they often cause notable accuracy loss at extremely low bit-widths. In this work, we explore the feasibility and scalability of 1-bit pre-training. We introduce BitNet b1 and BitNet b1.58, scalable and stable 1-bit Transformer architectures designed for LLMs. Specifically, we introduce BitLinear as a drop-in replacement for the nn.Linear layer in order to train 1-bit weights from scratch. Experimental results show that BitNet b1 achieves competitive performance compared to state-of-the-art 8-bit quantization methods and FP16 Transformer baselines. With ternary weights, BitNet b1.58 matches the half-precision Transformer LLM with the same model size and training tokens in terms of both perplexity and end-task performance, while being significantly more cost-effective in terms of latency, memory, throughput, and energy consumption. More profoundly, BitNet defines a new scaling law and recipe for training new generations of LLMs that are both high-performance and cost-effective. It enables a new computation paradigm and opens the door for designing specific hardware optimized for 1-bit LLMs.
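
A minimal sketch of what a BitLinear-style layer could look like, assuming the b1.58 recipe of absmean scaling to ternary {-1, 0, 1} weights with a straight-through estimator; activation quantization and the paper's other details are omitted, so treat this as illustrative rather than the official implementation.

```python
# Minimal sketch of a BitLinear-style layer (ternary weights with a
# straight-through estimator), assuming absmean scaling as in b1.58.
import torch
import torch.nn as nn

class BitLinearSketch(nn.Linear):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Scale weights by their mean absolute value, then round to {-1, 0, 1}.
        scale = self.weight.abs().mean().clamp(min=1e-5)
        w_q = (self.weight / scale).round().clamp(-1, 1) * scale
        # Straight-through estimator: quantized values in the forward pass,
        # full-precision gradients to the latent weights in the backward pass.
        w = self.weight + (w_q - self.weight).detach()
        return nn.functional.linear(x, w, self.bias)

layer = BitLinearSketch(16, 8)
out = layer(torch.randn(2, 16))
out.sum().backward()   # gradients flow to the full-precision latent weights
```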

ICRA Conference 2025 Conference Paper

Fuel-Optimal Operational Speed Planning for Autonomous Trucking on Highways

  • Wei Li 0111
  • Bin Wu
  • Jiahao Xiang
  • Jiaping Ren
  • Yi Wu
  • Ruigang Yang

The rapid advancement of autonomous driving technology, particularly in autonomous trucking on highways, shows great value for enhancing efficiency and reducing costs in the logistics industry. In this work, we define the full-trip speed planning problem for autonomous trucks under delivery time and fuel consumption constraints, referred to as the Operational Speed Planning (OSP) problem. To support and accelerate research on the OSP problem, we have developed a comprehensive dataset using a fleet of over 400 trucks. The dataset contains rich, diverse information covering more than 22 million kilometers of real-world highway driving data. In addition to this static dataset, we have developed a closed-loop simulator that allows for the interactive evaluation of OSP solutions, enabling researchers to test speed planning strategies in a realistic environment. Furthermore, we provide an OSP baseline method based on dynamic programming to optimize speed planning, balancing the delivery time requirements and fuel consumption. Our extensive experiments demonstrate both the accuracy of the simulation and the effectiveness of the OSP baseline in planning optimal speeds, proving its capability to meet time constraints while improving fuel efficiency. The dataset, simulator, and baseline will be made publicly available to foster further research and innovation in this area.

NeurIPS Conference 2025 Conference Paper

How Far Are We from Optimal Reasoning Efficiency?

  • Jiaxuan Gao
  • Shu Yan
  • Qixin Tan
  • Lu Yang
  • Shusheng Xu
  • Wei Fu
  • Zhiyu Mei
  • Kaifeng Lyu

Large Reasoning Models (LRMs) demonstrate remarkable problem-solving capabilities through extended Chain-of-Thought (CoT) reasoning but often produce excessively verbose and redundant reasoning traces. This inefficiency incurs high inference costs and limits practical deployment. While existing fine-tuning methods aim to improve reasoning efficiency, assessing their efficiency gains remains challenging due to inconsistent evaluations. In this work, we introduce the reasoning efficiency frontiers, empirical upper bounds derived from fine-tuning a base LRM (DeepSeek-R1-Distill-Qwen-1.5B/7B) across diverse approaches and training configurations. Based on these frontiers, we propose the Reasoning Efficiency Gap (REG), a unified metric quantifying deviations of any fine-tuned LRMs from these frontiers. Systematic evaluation on challenging mathematical benchmarks, AMC23, AIME24, and AIME25, reveals significant gaps in current methods: they either sacrifice accuracy for short length or use excessive tokens to achieve sub-optimal accuracies despite high overall accuracy. To reduce the efficiency gap, we propose REO-RL, a Reinforcement Learning algorithm that optimizes reasoning efficiency by targeting a sparse set of token budgets. Leveraging numerical integration over strategically selected budgets, REO-RL approximates the full efficiency objective with low error using a small set of token budgets. Experiments show that, compared to vanilla RL with outcome reward, REO-RL reduces the reasoning efficiency gap by 74.5% and 64.2% in the 1.5B and 7B settings. The 7B LRM fine-tuned with REO-RL achieves reasoning conciseness surpassing frontier LRMs like Qwen3 and Claude Sonnet 3.7. Ablation studies confirm the efficacy of our token budget strategy and highlight REO-RL’s flexibility across design choices. This work establishes a systematic framework for evaluating and optimizing reasoning efficiency in LRMs. We will release the related code, data, and models to support future research on efficient reasoning in LRMs.
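
The abstract's "numerical integration over strategically selected budgets" suggests weighting per-budget rewards by quadrature weights. Below is a hypothetical sketch using trapezoidal-rule weights over a sparse budget set; the budget values and the `acc` numbers are made up for illustration.

```python
# Hypothetical sketch of "numerical integration over token budgets": the
# efficiency objective, an integral over budgets, is approximated with
# trapezoidal weights at a small set of budgets. acc is a stand-in for the
# per-budget reward signal.
import numpy as np

def budget_weights(budgets: np.ndarray) -> np.ndarray:
    """Trapezoidal-rule weights for integrating a function sampled at an
    increasing set of budget points."""
    w = np.zeros_like(budgets, dtype=float)
    w[1:] += np.diff(budgets) / 2.0
    w[:-1] += np.diff(budgets) / 2.0
    return w

budgets = np.array([512, 1024, 2048, 4096, 8192], dtype=float)
acc = np.array([0.30, 0.45, 0.55, 0.60, 0.62])   # fake accuracy per budget
approx_objective = float(budget_weights(budgets) @ acc)
```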

JMLR Journal 2025 Journal Article

Learning Global Nash Equilibrium in Team Competitive Games with Generalized Fictitious Cross-Play

  • Zelai Xu
  • Chao Yu
  • Yancheng Liang
  • Yi Wu
  • Yu Wang

Self-play (SP) is a popular multi-agent reinforcement learning framework for competitive games. Despite the empirical success, the theoretical properties of SP are limited to two-player settings. For team competitive games where two teams of cooperative agents compete with each other, we show a counter-example where SP cannot converge to a global Nash equilibrium (NE) with high probability. Policy-Space Response Oracles (PSRO) is an alternative framework that finds NEs by iteratively learning the best response (BR) to previous policies. PSRO can be directly extended to team competitive games with unchanged convergence properties by learning team BRs, but its repeated training from scratch makes it hard to scale to complex games. In this work, we propose Generalized Fictitious Cross-Play (GFXP), a novel algorithm that inherits benefits from both frameworks. GFXP simultaneously trains an SP-based main policy and a counter population. The main policy is trained by fictitious self-play and cross-play against the counter population, while the counter policies are trained as the BRs to the main policy's checkpoints. We evaluate GFXP in matrix games and gridworld domains where GFXP achieves the lowest exploitabilities. We further conduct experiments in a challenging football game where GFXP defeats SOTA models with over 94% win rate.

NeurIPS Conference 2025 Conference Paper

Reasoning Is Not a Race: When Stopping Early Beats Going Deeper

  • Mohan Zhang
  • Jiaxuan Gao
  • Shusheng Xu
  • Yi Wu

We study the use of Process Reward Models (PRMs) for guiding Long Chain-of-Thought (CoT) reasoning in large language models. Although PRMs deliver fine-grained feedback in standard tasks, PRM-guided beam search does not consistently outperform PRM-free approaches in long CoT reasoning. We trace this shortfall to a "step quality degradation": the expected step quality shows concave behavior, yielding unimodal or monotonically declining trends. To counteract this, we propose Z-Score Guided Early Stopping (ZGES), which halts search at the detected quality peak using local PRM-reward z-scores. Across multiple math benchmarks and model scales, ZGES outperforms both standard PRM-guided beam search and PRM-free methods. Ablation studies further highlight the advantages and robustness of ZGES’s adaptive stopping mechanism.
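
A toy sketch of what z-score-guided stopping might look like: compute the z-score of the newest step's PRM reward against a local window of recent steps and halt once it falls well below the recent mean. The window size and threshold here are invented for illustration, not taken from the paper.

```python
# Toy sketch of z-score-guided early stopping: halt the search when the local
# z-score of the current step's PRM reward drops below a threshold, signalling
# that step quality has passed its peak. Thresholds are made up.
import numpy as np

def should_stop(prm_rewards: list[float], window: int = 5,
                z_threshold: float = -1.0) -> bool:
    if len(prm_rewards) < window + 1:
        return False
    recent = np.array(prm_rewards[-(window + 1):-1])
    mu, sigma = recent.mean(), recent.std() + 1e-8
    z = (prm_rewards[-1] - mu) / sigma   # z-score of the newest step
    return z < z_threshold               # far below recent quality: stop

rewards = [0.5, 0.6, 0.7, 0.72, 0.71, 0.70, 0.40]
print(should_stop(rewards))   # True: the last step fell off the peak
```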

NeurIPS Conference 2025 Conference Paper

What Can RL Bring to VLA Generalization? An Empirical Study

  • Jijia Liu
  • Feng Gao
  • Bingwen Wei
  • Xinlei Chen
  • Qingmin Liao
  • Yi Wu
  • Chao Yu
  • Yu Wang

Large Vision-Language Action (VLA) models have shown significant potential for embodied AI. However, their predominant training via supervised fine-tuning (SFT) limits generalization due to susceptibility to compounding errors under distribution shifts. Reinforcement learning (RL) offers a path to overcome these limitations by optimizing for task objectives via trial-and-error, yet a systematic understanding of its specific generalization benefits for VLAs compared to SFT is lacking. To address this, our study introduces a comprehensive benchmark for evaluating VLA generalization and systematically investigates the impact of RL fine-tuning across diverse visual, semantic, and execution dimensions. Our extensive experiments reveal that RL fine-tuning, particularly with PPO, significantly enhances generalization in semantic understanding and execution robustness over SFT, while maintaining comparable visual robustness. We identify PPO as a more effective RL algorithm for VLAs than LLM-derived methods like DPO and GRPO. We also develop a simple recipe for efficient PPO training on VLAs, and demonstrate its practical utility for improving VLA generalization. The project page is at https://rlvla.github.io.

AAAI Conference 2024 Conference Paper

Accelerate Multi-Agent Reinforcement Learning in Zero-Sum Games with Subgame Curriculum Learning

  • Jiayu Chen
  • Zelai Xu
  • Yunfei Li
  • Chao Yu
  • Jiaming Song
  • Huazhong Yang
  • Fei Fang
  • Yu Wang

Learning Nash equilibrium (NE) in complex zero-sum games with multi-agent reinforcement learning (MARL) can be extremely computationally expensive. Curriculum learning is an effective way to accelerate learning, but an under-explored dimension for generating a curriculum is the difficulty-to-learn of the subgames, i.e., games induced by starting from a specific state. In this work, we present a novel subgame curriculum learning framework for zero-sum games. It adopts an adaptive initial state distribution by resetting agents to some previously visited states where they can quickly learn to improve performance. Building upon this framework, we derive a subgame selection metric that approximates the squared distance to NE values and further adopt a particle-based state sampler for subgame generation. Integrating these techniques leads to our new algorithm, Subgame Automatic Curriculum Learning (SACL), which is a realization of the subgame curriculum learning framework. SACL can be combined with any MARL algorithm such as MAPPO. Experiments in the particle-world environment and Google Research Football environment show SACL produces much stronger policies than baselines. In the challenging hide-and-seek quadrant environment, SACL produces all four emergent stages and uses only half the samples of MAPPO with self-play. The project website is at https://sites.google.com/view/sacl-neurips.

ICRA Conference 2024 Conference Paper

ESP: Extro-Spective Prediction for Long-term Behavior Reasoning in Emergency Scenarios

  • Dingrui Wang
  • Zheyuan Lai
  • Yuda Li
  • Yi Wu
  • Yuexin Ma
  • Johannes Betz
  • Ruigang Yang
  • Wei Li 0111

Emergent-scene safety is the key milestone for fully autonomous driving, and reliable on-time prediction is essential to maintain safety in emergency scenarios. However, these emergency scenarios are long-tailed and hard to collect, which restricts the system from obtaining reliable predictions. In this paper, we build a new dataset aimed at long-term prediction from inconspicuous state variations in history for emergency events, named the Extro-Spective Prediction (ESP) problem. Based on the proposed dataset, a flexible feature encoder for ESP is introduced to various prediction methods as a seamless plug-in, and its consistent performance improvement underscores its efficacy. Furthermore, a new metric named clamped temporal error (CTE) is proposed to give a more comprehensive evaluation of prediction performance, especially in time-sensitive, sub-second emergency events. Interestingly, since our ESP features can naturally be described in human-readable language, integrating them into ChatGPT also shows huge potential. The ESP dataset and all benchmarks are released at https://dingrui-wang.github.io/ESP-Dataset/.

AAMAS Conference 2024 Conference Paper

LLM-Powered Hierarchical Language Agent for Real-time Human-AI Coordination

  • Jijia Liu
  • Chao Yu
  • Jiaxuan Gao
  • Yuqing Xie
  • Qingmin Liao
  • Yi Wu
  • Yu Wang

AI agents powered by Large Language Models (LLMs) have made significant advances, enabling them to assist humans in diverse complex tasks and leading to a revolution in human-AI coordination. LLM-powered agents typically require invoking LLM APIs and employing artificially designed complex prompts, which results in high inference latency. While this paradigm works well in scenarios with minimal interactive demands, such as code generation, it is unsuitable for highly interactive and real-time applications, such as gaming. Traditional gaming AI often employs small models or reactive policies, enabling fast inference but offering limited task completion and interaction abilities. In this work, we consider Overcooked as our testbed, where players can communicate with natural language and cooperate to serve orders. We propose a Hierarchical Language Agent (HLA) for human-AI coordination that provides strong reasoning abilities while maintaining real-time execution. In particular, HLA adopts a hierarchical framework and comprises three modules: a proficient LLM, referred to as Slow Mind, for intention reasoning and language interaction; a lightweight LLM, referred to as Fast Mind, for generating macro actions; and a reactive policy, referred to as Executor, for transforming macro actions into atomic actions. Human studies show that HLA outperforms other baseline agents, including slow-mind-only agents and fast-mind-only agents, with stronger cooperation abilities, faster responses, and more consistent language communications.

AAMAS Conference 2024 Conference Paper

MESA: Cooperative Meta-Exploration in Multi-Agent Learning through Exploiting State-Action Space Structure

  • Zhicheng Zhang
  • Yancheng Liang
  • Yi Wu
  • Fei Fang

Multi-agent reinforcement learning (MARL) algorithms often struggle to find strategies close to a Pareto-optimal Nash equilibrium, owing largely to the lack of efficient exploration. The problem is exacerbated in sparse-reward settings by the larger variance exhibited in policy learning. This paper introduces MESA, a novel meta-exploration method for cooperative multi-agent learning. It learns to explore by first identifying the agents’ high-rewarding joint state-action subspace from training tasks and then learning a set of diverse exploration policies to “cover” the subspace. These trained exploration policies can be integrated with any off-policy MARL algorithm for test-time tasks. We first showcase MESA’s advantage in a multi-step matrix game. Furthermore, experiments show that with learned exploration policies, MESA achieves significantly better performance in sparse-reward tasks in several multi-agent particle environments and multi-agent MuJoCo environments, and exhibits the ability to generalize to more challenging tasks at test time.

ICRA Conference 2024 Conference Paper

NPC: Neural Predictive Control for Fuel-Efficient Autonomous Trucks

  • Jiaping Ren
  • Jiahao Xiang
  • Hongfei Gao
  • Jinchuan Zhang
  • Yiming Ren
  • Yuexin Ma
  • Yi Wu
  • Ruigang Yang

Fuel efficiency is a crucial aspect of long-distance cargo transportation by oil-powered trucks, as it economizes on costs and decreases carbon emissions. Current predictive control methods depend on an accurate model of vehicle dynamics and the engine, including weight, drag coefficient, and the Brake-specific Fuel Consumption (BSFC) map of the engine. We propose a purely data-driven method, Neural Predictive Control (NPC), which does not use any physical model of the vehicle. After training on over 20,000 km of historical data, the newly proposed NVFormer implicitly models the relationship between vehicle dynamics, road slope, fuel consumption, and control commands using the attention mechanism. Based on primitives sampled online from the past of the current freight trip and anchor-based future data synthesis, the NVFormer can infer optimal control commands for reasonable fuel consumption. The physical-model-free NPC outperforms the base PCC method with 2.41% and 3.45% greater fuel savings in simulation and open-road highway testing, respectively.

IROS Conference 2024 Conference Paper

Robot Generating Data for Learning Generalizable Visual Robotic Manipulation

  • Yunfei Li
  • Ying Yuan
  • Jingzhi Cui
  • Haoran Huan
  • Wei Fu
  • Jiaxuan Gao
  • Zekai Xu
  • Yi Wu

It has been a popular trend in AI to pretrain foundation models on massive data. However, collecting sufficient offline training trajectories for robot learning is particularly expensive since valid control actions are required. Therefore, most existing robotic datasets are collected from human experts. We tackle such a data collection issue with a new framework called "robot self-teaching", which asks the robot to self-generate effective training data instead of relying on human demonstrators. Our key idea is to train a separate data-generation policy operating on the state space to automatically generate meaningful actions and trajectories with ever-growing complexities. Then, these generated data can be further used to train a visual policy with strong compositional generalization capabilities. We validate our framework in two visual manipulation testbeds, including a multi-object stacking domain and a popular RL benchmark "Franka kitchen". Experiments show that the final visual policy trained on self-generated data can accomplish novel testing goals that require long-horizon robot executions. Project website https://sites.google.com/view/robot-self-teaching.

AAAI Conference 2023 Short Paper

AlphaSnake: Policy Iteration on a Nondeterministic NP-Hard Markov Decision Process (Student Abstract)

  • Kevin Du
  • Ian Gemp
  • Yi Wu
  • Yingying Wu

Reinforcement learning has been used to approach well-known NP-hard combinatorial problems in graph theory. Among these, Hamiltonian cycle problems are exceptionally difficult to analyze, even when restricted to individual instances of structurally complex graphs. In this paper, we use Monte Carlo Tree Search (MCTS), the search algorithm behind many state-of-the-art reinforcement learning algorithms such as AlphaZero, to create autonomous agents that learn to play the game of Snake, a game centered on properties of Hamiltonian cycles on grid graphs. The game of Snake can be formulated as a single-player discounted Markov Decision Process (MDP), where the agent must behave optimally in a stochastic environment. Determining the optimal policy for Snake, defined as the policy that maximizes the probability of winning -- or win rate -- with higher priority and minimizes the expected number of time steps to win with lower priority, is conjectured to be NP-hard. Performance-wise, compared to prior work in the Snake game, our algorithm is the first to achieve a win rate over 0.5 (a uniform random policy achieves a win rate < 2.57 x 10^{-15}), demonstrating the versatility of AlphaZero in tackling NP-hard problems.

AAMAS Conference 2023 Conference Paper

Asynchronous Multi-Agent Reinforcement Learning for Efficient Real-Time Multi-Robot Cooperative Exploration

  • Chao Yu
  • Xinyi Yang
  • Jiaxuan Gao
  • Jiayu Chen
  • Yunfei Li
  • Jijia Liu
  • Yunfei Xiang
  • Ruixin Huang

We consider the problem of cooperative exploration where multiple robots need to cooperatively explore an unknown region as fast as possible. Multi-agent reinforcement learning (MARL) has recently become a trending paradigm for solving this challenge. However, existing MARL-based methods adopt action-making steps as the metric for exploration efficiency by assuming all the agents act in a fully synchronous manner: i.e., every agent produces an action simultaneously and every action is executed instantaneously at each time step. Despite its mathematical simplicity, such a synchronous MARL formulation can be problematic for real-world robotic applications. It is typical that different robots may take slightly different wall-clock times to accomplish an atomic action, or even periodically get lost due to hardware issues. Simply waiting for every robot to be ready for the next action can be particularly time-inefficient. Therefore, we propose an asynchronous MARL solution, Asynchronous Coordination Explorer (ACE), to tackle this real-world challenge. We first extend a classical MARL algorithm, multi-agent PPO (MAPPO), to the asynchronous setting and additionally apply action-delay randomization to enforce the learned policy to generalize better to varying action delays in the real world. Moreover, each navigation agent is represented as a team-size-invariant CNN-based policy, which greatly benefits real-robot deployment by handling possible robot loss and allows bandwidth-efficient intra-agent communication through low-dimensional CNN features. We first validate our approach in a grid-based scenario. Both simulation and real-robot results show that ACE reduces actual exploration time by over 10% compared with classical approaches. We also apply our framework to a high-fidelity visual-based environment, Habitat, achieving a 28% improvement in exploration efficiency.

IJCAI Conference 2023 Conference Paper

Automatic Truss Design with Reinforcement Learning

  • Weihua Du
  • Jinglun Zhao
  • Chao Yu
  • Xingcheng Yao
  • Zimeng Song
  • Siyang Wu
  • Ruifeng Luo
  • Zhiyuan Liu

Truss layout design, namely finding a lightweight truss layout satisfying all the physical constraints, is a fundamental problem in the building industry. Generating the optimal layout is a challenging combinatorial optimization problem, which can be extremely expensive to solve by exhaustive search. Directly applying end-to-end reinforcement learning (RL) methods to truss layout design is also infeasible, since only a tiny portion of the entire layout space is valid under the physical constraints, leading to particularly sparse rewards for RL training. In this paper, we develop AutoTruss, a two-stage framework to efficiently generate both lightweight and valid truss layouts. AutoTruss first adopts Monte Carlo tree search to discover a diverse collection of valid layouts. Then RL is applied to iteratively refine the valid solutions. We conduct experiments and ablation studies in popular truss layout design test cases in both 2D and 3D settings. AutoTruss outperforms the best-reported layouts by 25.1% in the most challenging 3D test cases, resulting in the first effective deep-RL-based approach in the truss layout design literature.

TMLR Journal 2023 Journal Article

Beyond Information Gain: An Empirical Benchmark for Low-Switching-Cost Reinforcement Learning

  • Shusheng Xu
  • Yancheng Liang
  • Yunfei Li
  • Simon Shaolei Du
  • Yi Wu

A ubiquitous requirement in many practical reinforcement learning (RL) applications is that the deployed policy that actually interacts with the environment cannot change frequently. Such an RL setting is called low-switching-cost RL, i.e., achieving the highest reward while reducing the number of policy switches during training. It has been a recent trend in theoretical RL research to develop provably efficient RL algorithms with low switching cost. The core idea in these theoretical works is to measure the information gain and switch the policy when the information gain is doubled. Despite the theoretical advances, none of the existing approaches has been validated empirically. We conduct the first empirical evaluation of different policy switching criteria on popular RL testbeds, including a medical treatment environment, the Atari games, and robotic control tasks. Surprisingly, although information-gain-based methods do recover the optimal rewards, they often lead to a substantially higher switching cost. By contrast, we find that a feature-based criterion, which has been largely ignored in the theoretical research, consistently produces the best performance over all the domains. We hope our benchmark can bring insights to the community and inspire future research. Our code and complete results can be found at https://sites.google.com/view/low-switching-cost-rl
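
For intuition, here is an illustrative rendering of the "switch when information gain doubles" rule from the theoretical works the abstract discusses, tracking the log-determinant of a feature information matrix; the feature map and the exact criterion used in the benchmark are assumptions on my part.

```python
# Illustrative sketch of an information-gain switching rule: accumulate a
# feature information matrix and switch the deployed policy whenever its
# determinant has doubled since the last switch. phi() is a placeholder.
import numpy as np

d = 4
info = np.eye(d)                       # running information matrix
last_switch_logdet = np.linalg.slogdet(info)[1]
switches = 0

def phi(obs: np.ndarray) -> np.ndarray:
    return obs                         # stand-in feature map

rng = np.random.default_rng(0)
for step in range(1000):
    x = phi(rng.normal(size=d))
    info += np.outer(x, x)             # accumulate information
    logdet = np.linalg.slogdet(info)[1]
    if logdet >= last_switch_logdet + np.log(2):   # determinant doubled
        switches += 1                  # deploy the newly trained policy
        last_switch_logdet = logdet
print("policy switches:", switches)
```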

AAMAS Conference 2023 Conference Paper

Differentiable Arbitrating in Zero-sum Markov Games

  • Jing Wang
  • Meichen Song
  • Feng Gao
  • Boyi Liu
  • Zhaoran Wang
  • Yi Wu

We initiate the study of how to perturb the reward in a zero-sum Markov game with two players to induce a desirable Nash equilibrium, namely arbitrating. Such a problem admits a bi-level optimization formulation. The lower level requires solving the Nash equilibrium under a given reward function, which makes the overall problem challenging to optimize in an end-to-end way. We propose a backpropagation scheme that differentiates through the Nash equilibrium, which provides the gradient feedback for the upper level. In particular, our method only requires a black-box solver for the (regularized) Nash equilibrium (NE). We develop the convergence analysis for the proposed framework with proper black-box NE solvers and demonstrate the empirical successes in two multi-agent reinforcement learning (MARL) environments. Supplementary material with all the proofs in this paper can be found at https://arxiv.org/abs/2302.10058.

NeurIPS Conference 2023 Conference Paper

Domain Re-Modulation for Few-Shot Generative Domain Adaptation

  • Yi Wu
  • Ziqiang Li
  • Chaoyue Wang
  • Heliang Zheng
  • Shanshan Zhao
  • Bin Li
  • Dacheng Tao

In this study, we delve into the task of few-shot Generative Domain Adaptation (GDA), which involves transferring a pre-trained generator from one domain to a new domain using only a few reference images. Inspired by the way human brains acquire knowledge in new domains, we present an innovative generator structure called Domain Re-Modulation (DoRM). DoRM not only meets the criteria of high quality, large synthesis diversity, and cross-domain consistency, which were achieved by previous research in GDA, but also incorporates memory and domain association, akin to how human brains operate. Specifically, DoRM freezes the source generator and introduces new mapping and affine modules (M&A modules) to capture the attributes of the target domain during GDA. This process resembles the formation of new synapses in human brains. Consequently, a linearly combinable domain shift occurs in the style space. By incorporating multiple new M&A modules, the generator gains the capability to perform high-fidelity multi-domain and hybrid-domain generation. Moreover, to maintain cross-domain consistency more effectively, we introduce a similarity-based structure loss. This loss aligns the auto-correlation map of the target image with its corresponding auto-correlation map of the source image during training. Through extensive experiments, we demonstrate the superior performance of our DoRM and similarity-based structure loss in few-shot GDA, both quantitatively and qualitatively. Code will be available at https://github.com/wuyi2020/DoRM.

AAMAS Conference 2023 Conference Paper

Fictitious Cross-Play: Learning Global Nash Equilibrium in Mixed Cooperative-Competitive Games

  • Zelai Xu
  • Yancheng Liang
  • Chao Yu
  • Yu Wang
  • Yi Wu

Self-play (SP) is a popular multi-agent reinforcement learning (MARL) framework for solving competitive games, where each agent optimizes policy by treating others as part of the environment. Despite the empirical successes, the theoretical properties of SP-based methods are limited to two-player zero-sum games. However, for mixed cooperative-competitive games where agents on the same team need to cooperate with each other, we can show a simple counterexample where SP-based methods cannot converge to a global Nash equilibrium (NE) with high probability. Alternatively, Policy-Space Response Oracles (PSRO) is an iterative framework for learning NE, where the best responses w.r.t. previous policies are learned in each iteration. PSRO can be directly extended to mixed cooperative-competitive settings by jointly learning team best responses with all convergence properties unchanged. However, PSRO requires repeatedly training joint policies from scratch till convergence, which makes it hard to scale to complex games. In this work, we develop a novel algorithm, Fictitious Cross-Play (FXP), which inherits the benefits from both frameworks. FXP simultaneously trains an SP-based main policy and a counter population of best response policies. The main policy is trained by fictitious self-play and cross-play against the counter population, while the counter policies are trained as the best responses to the main policy’s past versions. We validate our method in matrix games and show that FXP converges to global NEs while SP methods fail. We also conduct experiments in a gridworld domain, where FXP achieves higher Elo ratings and lower exploitabilities than baselines, and a more challenging football game, where FXP defeats SOTA models with over 94% win rate.

NeurIPS Conference 2023 Conference Paper

Iteratively Learn Diverse Strategies with State Distance Information

  • Wei Fu
  • Weihua Du
  • Jingwei Li
  • Sunli Chen
  • Jingzhao Zhang
  • Yi Wu

In complex reinforcement learning (RL) problems, policies with similar rewards may have substantially different behaviors. It remains a fundamental challenge to optimize rewards while also discovering as many diverse strategies as possible, which can be crucial in many practical applications. Our study examines two design choices for tackling this challenge, i.e., diversity measure and computation framework. First, we find that with existing diversity measures, visually indistinguishable policies can still yield high diversity scores. To accurately capture the behavioral difference, we propose to incorporate the state-space distance information into the diversity measure. In addition, we examine two common computation frameworks for this problem, i.e., population-based training (PBT) and iterative learning (ITR). We show that although PBT is the precise problem formulation, ITR can achieve comparable diversity scores with higher computation efficiency, leading to improved solution quality in practice. Based on our analysis, we further combine ITR with two tractable realizations of the state-distance-based diversity measures and develop a novel diversity-driven RL algorithm, State-based Intrinsic-reward Policy Optimization (SIPO), with provable convergence properties. We empirically examine SIPO across three domains from robot locomotion to multi-agent games. In all of our testing environments, SIPO consistently produces strategically diverse and human-interpretable policies that cannot be discovered by existing baselines.

IJCAI Conference 2023 Conference Paper

KDLGT: A Linear Graph Transformer Framework via Kernel Decomposition Approach

  • Yi Wu
  • Yanyang Xu
  • Wenhao Zhu
  • Guojie Song
  • Zhouchen Lin
  • Liang Wang
  • Shaoguo Liu

In recent years, graph Transformers (GTs) have been demonstrated as a robust architecture for a wide range of graph learning tasks. However, the quadratic complexity of GTs limits their scalability on large-scale data, in comparison to Graph Neural Networks (GNNs). In this work, we propose the Kernel Decomposition Linear Graph Transformer (KDLGT), an accelerating framework for building scalable and powerful GTs. KDLGT employs the kernel decomposition approach to rearrange the order of matrix multiplication, thereby reducing complexity to linear. Additionally, it categorizes GTs into three distinct types and provides tailored accelerating methods for each category to encompass all types of GTs. Furthermore, we provide a theoretical analysis of the performance gap between KDLGT and self-attention to ensure its effectiveness. Under this framework, we select two representative GTs to design our models. Experiments on both real-world and synthetic datasets indicate that KDLGT not only achieves state-of-the-art performance on various datasets but also reaches an acceleration ratio of approximately 10 on graphs of certain sizes.
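
The core reordering behind kernel decomposition can be shown numerically: with a decomposable feature map phi, (phi(Q) phi(K)^T) V equals phi(Q) (phi(K)^T V), turning O(n^2 d) work into O(n d^2). The sketch below uses a simple ReLU feature map and omits the normalization a practical linear attention would apply; it illustrates the reordering only, not KDLGT itself.

```python
# Sketch of the kernel-decomposition reordering behind linear attention:
# associativity lets phi(Q) (phi(K)^T V) replace (phi(Q) phi(K)^T) V,
# avoiding the n-by-n attention matrix entirely.
import numpy as np

n, d = 1024, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(n, d)) for _ in range(3))
phi = lambda x: np.maximum(x, 0.0) + 1e-6    # simple positive feature map

quad = (phi(Q) @ phi(K).T) @ V               # quadratic in n
lin = phi(Q) @ (phi(K).T @ V)                # linear in n, same result
print(np.allclose(quad, lin))                # True up to float error
```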

AAAI Conference 2023 Conference Paper

Maximum Entropy Population-Based Training for Zero-Shot Human-AI Coordination

  • Rui Zhao
  • Jinming Song
  • Yufeng Yuan
  • Haifeng Hu
  • Yang Gao
  • Yi Wu
  • Zhongqian Sun
  • Wei Yang

We study the problem of training a Reinforcement Learning (RL) agent that is collaborative with humans without using human data. Although such agents can be obtained through self-play training, they can suffer significantly from the distributional shift when paired with unencountered partners, such as humans. In this paper, we propose Maximum Entropy Population-based training (MEP) to mitigate such distributional shift. In MEP, agents in the population are trained with our derived Population Entropy bonus to promote the pairwise diversity between agents and the individual diversity of agents themselves. After obtaining this diversified population, a common best agent is trained by pairing with agents in this population via prioritized sampling, where the prioritization is dynamically adjusted based on the training progress. We demonstrate the effectiveness of our method MEP, with comparison to Self-Play PPO (SP), Population-Based Training (PBT), Trajectory Diversity (TrajeDi), and Fictitious Co-Play (FCP) in both matrix game and Overcooked game environments, with partners being human proxy models and real humans. A supplementary video showing experimental results is available at https://youtu.be/Xh-FKD0AAKE.
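
As a sketch of the intuition behind a population entropy bonus, the snippet below scores the entropy of the mean action distribution across a population, which is larger when the agents behave differently at the same state; the exact form of MEP's derived bonus is not reproduced here.

```python
# Illustrative sketch of a population entropy score: the entropy of the mean
# action distribution across the population rewards populations whose members
# behave differently from one another.
import numpy as np

def population_entropy(policies: np.ndarray) -> float:
    """policies: (num_agents, num_actions) action probabilities at a state."""
    mean_policy = policies.mean(axis=0)
    return float(-(mean_policy * np.log(mean_policy + 1e-12)).sum())

identical = np.tile([0.9, 0.1], (3, 1))
diverse = np.array([[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]])
print(population_entropy(identical) < population_entropy(diverse))  # True
```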

TMLR Journal 2023 Journal Article

Understanding Curriculum Learning in Policy Optimization for Online Combinatorial Optimization

  • Runlong Zhou
  • Zelin He
  • Yuandong Tian
  • Yi Wu
  • Simon Shaolei Du

In recent years, reinforcement learning (RL) has started to show promising results in tackling combinatorial optimization (CO) problems, in particular when coupled with curriculum learning to facilitate training. Despite emerging empirical evidence, theoretical study of why RL helps is still at an early stage. This paper presents the first systematic study on policy optimization methods for online CO problems. We show that online CO problems can be naturally formulated as latent Markov Decision Processes (LMDPs), and prove convergence bounds on natural policy gradient (NPG) for solving LMDPs. Furthermore, our theory explains the benefit of curriculum learning: it can find a strong sampling policy and reduce the distribution shift, a critical quantity that governs the convergence rate in our theorem. For a canonical online CO problem, the Best Choice Problem (BCP), we formally prove that distribution shift is reduced exponentially with curriculum learning even if the curriculum is a randomly generated BCP on a smaller scale. Our theory also shows we can simplify the curriculum learning scheme used in prior work from multi-step to single-step. Lastly, we provide extensive experiments on the Best Choice Problem, Online Knapsack, and AdWords to verify our findings.

NeurIPS Conference 2022 Conference Paper

Grounded Reinforcement Learning: Learning to Win the Game under Human Commands

  • Shusheng Xu
  • Huaijie Wang
  • Yi Wu

We consider the problem of building a reinforcement learning (RL) agent that can both accomplish non-trivial tasks, like winning a real-time strategy game, and strictly follow high-level language commands from humans, like “attack”, even if a command is sub-optimal. We call this novel yet important problem Grounded Reinforcement Learning (GRL). Compared with other language grounding tasks, GRL is particularly non-trivial and cannot be simply solved by pure RL or behavior cloning (BC). From the RL perspective, it is extremely challenging to derive a precise reward function for human preferences since the commands are abstract and the valid behaviors are highly complicated and multi-modal. From the BC perspective, it is impossible to obtain perfect demonstrations since human strategies in complex games are typically sub-optimal. We tackle GRL via a simple, tractable, and practical constrained RL objective and develop an iterative RL algorithm, REinforced demonstration Distillation (RED), to obtain a strong GRL policy. We evaluate the policies derived by RED, BC and pure RL methods on a simplified real-time strategy game, MiniRTS. Experimental results and human studies show that the RED policy is able to consistently follow human commands and achieve a higher win rate than the baselines. We release our code and present more examples at https://sites.google.com/view/grounded-rl.

NeurIPS Conference 2022 Conference Paper

Pre-Trained Image Encoder for Generalizable Visual Reinforcement Learning

  • Zhecheng Yuan
  • Zhengrong Xue
  • Bo Yuan
  • Xueqian Wang
  • Yi Wu
  • Yang Gao
  • Huazhe Xu

Learning generalizable policies that can adapt to unseen environments remains challenging in visual Reinforcement Learning (RL). Existing approaches try to acquire a robust representation via diversifying the appearances of in-domain observations for better generalization. Limited by the specific observations of the environment, these methods ignore the possibility of exploring diverse real-world image datasets. In this paper, we investigate how a visual RL agent would benefit from off-the-shelf visual representations. Surprisingly, we find that the early layers in an ImageNet pre-trained ResNet model could provide rather generalizable representations for visual RL. Hence, we propose Pre-trained Image Encoder for Generalizable visual reinforcement learning (PIE-G), a simple yet effective framework that can generalize to unseen visual scenarios in a zero-shot manner. Extensive experiments are conducted on DMControl Generalization Benchmark, DMControl Manipulation Tasks, Drawer World, and CARLA to verify the effectiveness of PIE-G. Empirical evidence suggests PIE-G improves sample efficiency and significantly outperforms previous state-of-the-art methods in terms of generalization performance. In particular, PIE-G boasts a 55% generalization performance gain on average in the challenging video background setting. Project Page: https://sites.google.com/view/pie-g/home.

AAAI Conference 2022 Conference Paper

Sequence Level Contrastive Learning for Text Summarization

  • Shusheng Xu
  • Xingxing Zhang
  • Yi Wu
  • Furu Wei

Contrastive learning models have achieved great success in unsupervised visual representation learning, maximizing the similarities between feature representations of different views of the same image while minimizing the similarities between feature representations of views of different images. In text summarization, the output summary is a shorter form of the input document and the two have similar meanings. In this paper, we propose a contrastive learning model for supervised abstractive text summarization, where we view a document, its gold summary, and its model-generated summaries as different views of the same mean representation and maximize the similarities between them during training. We improve over a strong sequence-to-sequence text generation model (i.e., BART) on three different summarization datasets. Human evaluation also shows that our model achieves better faithfulness ratings compared to its counterpart without contrastive objectives. We release our code at https://github.com/xssstory/SeqCo.

NeurIPS Conference 2022 Conference Paper

The Surprising Effectiveness of PPO in Cooperative Multi-Agent Games

  • Chao Yu
  • Akash Velu
  • Eugene Vinitsky
  • Jiaxuan Gao
  • Yu Wang
  • Alexandre Bayen
  • Yi Wu

Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, the Hanabi challenge, and Google Research Football, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to competitive off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods are a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.

NeurIPS Conference 2021 Conference Paper

Native Chinese Reader: A Dataset Towards Native-Level Chinese Machine Reading Comprehension

  • Shusheng Xu
  • Yichen Liu
  • Xiaoyu Yi
  • Siyuan Zhou
  • Huizi Li
  • Yi Wu

We present Native Chinese Reader (NCR), a new machine reading comprehension (MRC) dataset with particularly long articles in both modern and classical Chinese. NCR is collected from the exam questions for the Chinese course in China’s high schools, which are designed to evaluate the language proficiency of native Chinese youth. Existing Chinese MRC datasets are either domain-specific or focus on short contexts of a few hundred characters in modern Chinese only. By contrast, NCR contains 8390 documents with an average length of 1024 characters, covering a wide range of Chinese writing styles, including modern articles, classical literature and classical poetry. A total of 20477 questions on these documents also require strong reasoning abilities and common sense to figure out the correct answers. We implemented multiple baseline models using popular Chinese pre-trained models and additionally launched an online competition using our dataset to examine the limit of current methods. The best model achieves 59% test accuracy, while human evaluation shows an average accuracy of 79%, indicating a significant performance gap between current MRC models and native Chinese speakers.

NeurIPS Conference 2021 Conference Paper

NovelD: A Simple yet Effective Exploration Criterion

  • Tianjun Zhang
  • Huazhe Xu
  • Xiaolong Wang
  • Yi Wu
  • Kurt Keutzer
  • Joseph E. Gonzalez
  • Yuandong Tian

Efficient exploration under sparse rewards remains a key challenge in deep reinforcement learning. Previous exploration methods (e.g., RND) have achieved strong results in multiple hard tasks. However, if there are multiple novel areas to explore, these methods often focus quickly on one without sufficiently trying others (in a depth-first-search manner). In some scenarios (e.g., the four-corridor environment in Sec. 4.2), we observe they explore one corridor for a long time and fail to cover all the states. On the other hand, in theoretical RL, with optimistic initialization and the inverse square root of the visitation count as a bonus, an agent does not suffer from this and explores different novel regions alternately (in a breadth-first-search manner). Inspired by this, we propose a simple but effective criterion called NovelD that weights every novel area approximately equally. Our algorithm is very simple yet shows comparable performance to, or even outperforms, multiple SOTA exploration methods in many hard exploration tasks. Specifically, NovelD solves all the static procedurally-generated tasks in MiniGrid with just 120M environment steps, without any curriculum learning. In comparison, the previous SOTA only solves 50% of them. NovelD also achieves SOTA on multiple tasks in NetHack, a rogue-like game that contains more challenging procedurally-generated environments. In multiple Atari games (e.g., Montezuma's Revenge, Venture, Gravitar), NovelD outperforms RND. We analyze NovelD thoroughly in MiniGrid and find that, empirically, it helps the agent explore the environment more uniformly with a focus on exploring beyond the boundary.
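
A toy sketch of a NovelD-style criterion as the abstract describes it: reward the clipped increase in novelty between consecutive states, and only on the first visit within an episode. Count-based novelty stands in for RND here, and the coefficient is an illustrative assumption.

```python
# Sketch of a NovelD-style intrinsic reward, using count-based novelty as a
# stand-in for RND: reward the clipped novelty *increase* between consecutive
# states, gated to the first visit within the current episode.
from collections import defaultdict

visit_counts = defaultdict(int)     # lifelong counts (novelty proxy)
episodic_seen = set()               # reset at the start of each episode
ALPHA = 0.5                         # illustrative scaling coefficient

def novelty(state) -> float:
    return 1.0 / (1.0 + visit_counts[state])

def intrinsic_reward(state, next_state) -> float:
    bonus = 0.0
    if next_state not in episodic_seen:          # first visit this episode
        bonus = max(novelty(next_state) - ALPHA * novelty(state), 0.0)
        episodic_seen.add(next_state)
    visit_counts[next_state] += 1
    return bonus

print(intrinsic_reward((0, 0), (0, 1)))  # novel transition: positive bonus
print(intrinsic_reward((0, 0), (0, 1)))  # revisit within episode: 0.0
```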

IJCAI Conference 2021 Conference Paper

Temporal Induced Self-Play for Stochastic Bayesian Games

  • Weizhe Chen
  • Zihan Zhou
  • Yi Wu
  • Fei Fang

One practical requirement in solving dynamic games is to ensure that the players play well from any decision point onward. To satisfy this requirement, existing efforts focus on equilibrium refinement, but the scalability and applicability of existing techniques are limited. In this paper, we propose Temporal-Induced Self-Play (TISP), a novel reinforcement learning-based framework to find strategies with decent performance from any decision point onward. TISP uses belief-space representation, backward induction, policy learning, and non-parametric approximation. Building upon TISP, we design a policy-gradient-based algorithm, TISP-PG. We prove that TISP-based algorithms can find approximate Perfect Bayesian Equilibrium in zero-sum one-sided stochastic Bayesian games with finite horizon. We test TISP-based algorithms in various games, including finitely repeated security games and a grid-world game. The results show that TISP-PG is more scalable than existing mathematical programming-based methods and significantly outperforms other learning-based methods.

NeurIPS Conference 2021 Conference Paper

Variational Automatic Curriculum Learning for Sparse-Reward Cooperative Multi-Agent Problems

  • Jiayu Chen
  • Yuanxin Zhang
  • Yuanfan Xu
  • Huimin Ma
  • Huazhong Yang
  • Jiaming Song
  • Yu Wang
  • Yi Wu

We introduce an automatic curriculum algorithm, Variational Automatic Curriculum Learning (VACL), for solving challenging goal-conditioned cooperative multi-agent reinforcement learning problems. We motivate our curriculum learning paradigm through a variational perspective, where the learning objective can be decomposed into two terms: task learning on the current curriculum, and curriculum update to a new task distribution. Local optimization over the second term suggests that the curriculum should gradually expand the training tasks from easy to hard. Our VACL algorithm implements this variational paradigm with two practical components, task expansion and entity curriculum, which produces a series of training tasks over both the task configurations as well as the number of entities in the task. Experiment results show that VACL solves a collection of sparse-reward problems with a large number of agents. Particularly, using a single desktop machine, VACL achieves 98% coverage rate with 100 agents in the simple-spread benchmark and reproduces the ramp-use behavior originally shown in OpenAI’s hide-and-seek project.

NeurIPS Conference 2020 Conference Paper

Multi-Task Reinforcement Learning with Soft Modularization

  • Ruihan Yang
  • Huazhe Xu
  • Yi Wu
  • Xiaolong Wang

Multi-task learning is a very challenging problem in reinforcement learning. While training multiple tasks jointly allows the policies to share parameters across different tasks, the optimization problem becomes non-trivial: it remains unclear what parameters in the network should be reused across tasks and how the gradients from different tasks may interfere with each other. Thus, instead of naively sharing parameters across tasks, we introduce an explicit modularization technique on policy representation to alleviate this optimization issue. Given a base policy network, we design a routing network which estimates different routing strategies to reconfigure the base network for each task. Instead of directly selecting routes for each task, our task-specific policy uses a method called soft modularization to softly combine all the possible routes, which makes it suitable for sequential tasks. We experiment with various robotics manipulation tasks in simulation and show our method improves both sample efficiency and performance over strong baselines by a large margin.

AAAI Conference 2019 Conference Paper

Deep Reinforcement Learning for Green Security Games with Real-Time Information

  • Yufei Wang
  • Zheyuan Ryan Shi
  • Lantao Yu
  • Yi Wu
  • Rohit Singh
  • Lucas Joppa
  • Fei Fang

Green Security Games (GSGs) have been proposed and applied to optimize patrols conducted by law enforcement agencies in green security domains such as combating poaching, illegal logging and overfishing. However, real-time information such as footprints and agents’ subsequent actions upon receiving the information, e.g., rangers following the footprints to chase the poacher, have been neglected in previous work. To fill the gap, we first propose a new game model GSG-I which augments GSGs with sequential movement and the vital element of real-time information. Second, we design a novel deep reinforcement learning-based algorithm, DeDOL, to compute a patrolling strategy that adapts to the real-time information against a best-responding attacker. DeDOL is built upon the double oracle framework and the policy-space response oracle, solving a restricted game and iteratively adding best response strategies to it through training deep Q-networks. Exploring the game structure, DeDOL uses domain-specific heuristic strategies as initial strategies and constructs several local modes for efficient and parallelized training. To our knowledge, this is the first attempt to use Deep Q-Learning for security games.

AAAI Conference 2019 Conference Paper

Robust Multi-Agent Reinforcement Learning via Minimax Deep Deterministic Policy Gradient

  • Shihui Li
  • Yi Wu
  • Xinyue Cui
  • Honghua Dong
  • Fei Fang
  • Stuart Russell

Despite the recent advances of deep reinforcement learning (DRL), agents trained by DRL tend to be brittle and sensitive to the training environment, especially in multi-agent scenarios. In the multi-agent setting, a DRL agent’s policy can easily get stuck in a poor local optimum w.r.t. its training partners: the learned policy may be only locally optimal to other agents’ current policies. In this paper, we focus on the problem of training robust DRL agents with continuous actions in the multi-agent learning setting, so that the trained agents can still generalize when their opponents’ policies alter. To tackle this problem, we propose a new algorithm, MiniMax Multi-agent Deep Deterministic Policy Gradient (M3DDPG), with the following contributions: (1) we introduce a minimax extension of the popular multi-agent deep deterministic policy gradient algorithm (MADDPG) for robust policy learning; (2) since the continuous action space leads to computational intractability in our minimax learning objective, we propose Multi-Agent Adversarial Learning (MAAL) to efficiently solve our proposed formulation. We empirically evaluate our M3DDPG algorithm in four mixed cooperative and competitive multi-agent environments, and the agents trained by our method significantly outperform existing baselines.

NeurIPS Conference 2019 Conference Paper

Stochastic Runge-Kutta Accelerates Langevin Monte Carlo and Beyond

  • Xuechen Li
  • Yi Wu
  • Lester Mackey
  • Murat Erdogdu

Sampling with Markov chain Monte Carlo methods typically amounts to discretizing some continuous-time dynamics with numerical integration. In this paper, we establish the convergence rate of sampling algorithms obtained by discretizing smooth Itô diffusions exhibiting fast $2$-Wasserstein contraction, based on local deviation properties of the integration scheme. In particular, we study a sampling algorithm constructed by discretizing the overdamped Langevin diffusion with a stochastic Runge-Kutta method. For strongly convex potentials that are smooth up to a certain order, its iterates converge to the target distribution in $2$-Wasserstein distance in $\tilde{\mathcal{O}}(d\epsilon^{-2/3})$ iterations. This improves upon the best-known rate for strongly log-concave sampling based on the overdamped Langevin equation using only the gradient oracle without adjustment. Additionally, we extend our analysis of stochastic Runge-Kutta methods to uniformly dissipative diffusions with possibly non-convex potentials and show that they achieve better rates than the Euler-Maruyama scheme in the dependence on the tolerance $\epsilon$. Numerical studies show that these algorithms lead to better stability and lower asymptotic errors.
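
For context, the baseline the paper improves on is the Euler-Maruyama discretization of the overdamped Langevin diffusion $dX_t = -\nabla f(X_t)\,dt + \sqrt{2}\,dB_t$; stochastic Runge-Kutta schemes add extra gradient and noise stages per step to reduce the discretization error. A minimal NumPy sketch of the baseline (function names are illustrative):

```python
import numpy as np

def euler_maruyama_langevin(grad_f, x0, step, n_iters, rng=None):
    """Euler-Maruyama discretization of overdamped Langevin dynamics:
    x_{k+1} = x_k - h * grad f(x_k) + sqrt(2h) * N(0, I).
    The paper's stochastic Runge-Kutta scheme replaces this single-stage
    update with a higher-order multi-stage one (not shown)."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        noise = rng.standard_normal(x.shape)
        x = x - step * grad_f(x) + np.sqrt(2.0 * step) * noise
    return x
```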

NeurIPS Conference 2018 Conference Paper

Meta-Learning MCMC Proposals

  • Tongzhou Wang
  • Yi Wu
  • Dave Moore
  • Stuart Russell

Effective implementations of sampling-based probabilistic inference often require manually constructed, model-specific proposals. Inspired by recent progress in meta-learning for training learning agents that can generalize to unseen environments, we propose a meta-learning approach to building effective and generalizable MCMC proposals. We parametrize the proposal as a neural network to provide fast approximations to block Gibbs conditionals. The learned neural proposals generalize to occurrences of common structural motifs across different models, allowing for the construction of a library of learned inference primitives that can accelerate inference on unseen models with no model-specific training required. We explore several applications, including open-universe Gaussian mixture models, in which our learned proposals outperform a hand-tuned sampler, and a real-world named entity recognition task, in which our sampler yields higher final F1 scores than classical single-site Gibbs sampling.
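
The inference-time picture can be sketched as follows: the learned network proposes a new value for a block from its Markov blanket, and a standard Metropolis-Hastings correction keeps the chain exact even where the proposal is imperfect. In the sketch below, `proposal_net` returning Gaussian parameters and the `log_joint` interface are illustrative assumptions, not the paper's API.

```python
import torch

def mh_step_with_learned_proposal(x_block, blanket, log_joint, proposal_net):
    """One Metropolis-Hastings step using a learned approximation to the
    block-Gibbs conditional p(x_block | Markov blanket) as the proposal."""
    mu, log_sigma = proposal_net(blanket)
    dist = torch.distributions.Normal(mu, log_sigma.exp())
    x_new = dist.sample()
    # MH correction: accept with prob min(1, ratio); keeps the chain exact
    # even when the neural proposal only approximates the conditional.
    log_alpha = (log_joint(x_new, blanket) - log_joint(x_block, blanket)
                 + dist.log_prob(x_block).sum() - dist.log_prob(x_new).sum())
    if torch.log(torch.rand(())) < log_alpha:
        return x_new
    return x_block
```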

AAAI Conference 2017 Conference Paper

A Nearly-Black-Box Online Algorithm for Joint Parameter and State Estimation in Temporal Models

  • Yusuf Erol
  • Yi Wu
  • Lei Li
  • Stuart Russell

Online joint parameter and state estimation is a core problem for temporal models. Most existing methods are either restricted to a particular class of models (e.g., the Storvik filter) or computationally expensive (e.g., particle MCMC). We propose a novel nearly-black-box algorithm, the Assumed Parameter Filter (APF), a hybrid of particle filtering for state variables and assumed density filtering for parameter variables. It has the following advantages: (a) it is online and computationally efficient; (b) it is applicable to both discrete and continuous parameter spaces with arbitrary transition dynamics. On a variety of toy and real models, APF generates more accurate results within a fixed computation budget compared to several standard algorithms from the literature.
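
A minimal sketch of one APF step, under the assumption that each particle pairs a state with an assumed-density posterior over the parameters: states are propagated and resampled as in a standard particle filter, while each particle's parameter posterior is refreshed by a moment-matching (ADF) update. All callables below are model-specific placeholders, not the paper's interface.

```python
import numpy as np

def apf_step(particles, observation, transition, likelihood, adf_update,
             rng=None):
    """One step of an assumed-parameter-filter-style update (sketch).
    particles: list of (state, parameter_posterior) pairs."""
    rng = rng or np.random.default_rng()
    states, param_posts, weights = [], [], []
    for state, post in particles:
        theta = post.sample(rng)                   # draw params from ADF posterior
        new_state = transition(state, theta, rng)  # propagate the state
        weights.append(likelihood(observation, new_state, theta))
        states.append(new_state)
        # Moment-matching update of the parameter posterior for this particle.
        param_posts.append(adf_update(post, new_state, observation))
    weights = np.asarray(weights)
    weights /= weights.sum()
    idx = rng.choice(len(states), size=len(states), p=weights)  # resample
    return [(states[i], param_posts[i]) for i in idx]
```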

NeurIPS Conference 2017 Conference Paper

Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments

  • Ryan Lowe
  • Yi Wu
  • Aviv Tamar
  • Jean Harb
  • Pieter Abbeel
  • Igor Mordatch

We explore deep reinforcement learning methods for multi-agent domains. We begin by analyzing the difficulty of traditional algorithms in the multi-agent case: Q-learning is challenged by an inherent non-stationarity of the environment, while policy gradient suffers from a variance that increases as the number of agents grows. We then present an adaptation of actor-critic methods that considers action policies of other agents and is able to successfully learn policies that require complex multi-agent coordination. Additionally, we introduce a training regimen utilizing an ensemble of policies for each agent that leads to more robust multi-agent policies. We show the strength of our approach compared to existing methods in cooperative as well as competitive scenarios, where agent populations are able to discover various physical and informational coordination strategies.
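
The key architectural idea is a centralized critic with decentralized actors: during training, each agent's critic conditions on the observations and actions of all agents, which sidesteps the non-stationarity that plain Q-learning faces, while execution still uses only local observations. A minimal PyTorch sketch; the dimensions and network shape are illustrative.

```python
import torch
import torch.nn as nn

class CentralizedCritic(nn.Module):
    """Centralized critic (sketch): Q_i(o_1..o_N, a_1..a_N) sees every
    agent's observation and action during training, while each actor
    remains a function of its own observation at execution time."""

    def __init__(self, obs_dims, act_dims, hidden=128):
        super().__init__()
        in_dim = sum(obs_dims) + sum(act_dims)
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, all_obs, all_actions):
        # all_obs / all_actions: lists of (batch, dim) tensors, one per agent.
        return self.net(torch.cat(all_obs + all_actions, dim=-1))
```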

IJCAI Conference 2017 Conference Paper

Value Iteration Networks

  • Aviv Tamar
  • Yi Wu
  • Garrett Thomas
  • Sergey Levine
  • Pieter Abbeel

We introduce the value iteration network (VIN): a fully differentiable neural network with a 'planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network and trained end-to-end using standard backpropagation. We evaluate VIN-based policies on discrete and continuous path-planning domains, and on a natural-language-based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains. This paper is a significantly abridged, IJCAI-audience-targeted version of the original NIPS 2016 paper of the same title, available at https://arxiv.org/abs/1602.02867
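
The planning module itself is compact enough to sketch: stacking the reward map with the current value map, convolving to produce per-action Q maps, and maxing over the action channel is one Bellman backup; applying it K times performs K steps of value iteration, all differentiably. A simplified PyTorch sketch (kernel size and channel layout are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class VIModule(nn.Module):
    """Differentiable value-iteration block in the spirit of VIN."""

    def __init__(self, n_actions, k_iters):
        super().__init__()
        # Q(s, a) from stacked [reward; value] maps; 3x3 kernel = local transitions.
        self.q_conv = nn.Conv2d(2, n_actions, kernel_size=3, padding=1, bias=False)
        self.k = k_iters

    def forward(self, reward_map):
        # reward_map: (batch, 1, H, W)
        value = torch.zeros_like(reward_map)
        for _ in range(self.k):
            q = self.q_conv(torch.cat([reward_map, value], dim=1))
            value, _ = q.max(dim=1, keepdim=True)  # Bellman backup: max over actions
        return value
```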

AAAI Conference 2016 Conference Paper

MC-HOG Correlation Tracking with Saliency Proposal

  • Guibo Zhu
  • Jinqiao Wang
  • Yi Wu
  • Xiaoyu Zhang
  • Hanqing Lu

Designing effective features and handling the model drift problem are two important aspects of online visual tracking. For feature representation, gradient and color features are the most widely used, but how to effectively combine them for visual tracking is still an open problem. In this paper, we propose a rich feature descriptor, MC-HOG, by leveraging rich gradient information across multiple color channels or spaces. MC-HOG features are then embedded into the correlation tracking framework to estimate the state of the target. For handling the model drift problem caused by occlusion or distractors, we propose saliency proposals as prior information to provide candidates and reduce background interference. In addition to saliency proposals, a ranking strategy is proposed to determine the importance of these proposals by exploiting the learnt appearance filter, historically preserved object samples and the distracting proposals. In this way, the proposed approach can effectively explore the color-gradient characteristics and alleviate the model drift problem. Extensive evaluations performed on the benchmark dataset show the superiority of the proposed method.
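
The correlation-tracking core into which MC-HOG is embedded can be sketched in a few lines: per-channel feature maps are correlated with the learned filter in the Fourier domain, the per-channel responses are summed, and the peak of the response map locates the target. The sketch below is generic; filter learning, cosine windowing, and MC-HOG extraction itself are omitted.

```python
import numpy as np

def correlation_response(filter_fft, feature_channels):
    """Generic correlation-filter localization step (sketch).
    filter_fft: list of per-channel filters in the Fourier domain.
    feature_channels: list of per-channel feature maps (e.g., MC-HOG)."""
    resp = np.zeros(feature_channels[0].shape)
    for h_hat, z in zip(filter_fft, feature_channels):
        # Correlation = conjugate multiplication in the Fourier domain.
        resp += np.real(np.fft.ifft2(np.conj(h_hat) * np.fft.fft2(z)))
    peak = np.unravel_index(np.argmax(resp), resp.shape)  # target location
    return resp, peak
```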

IJCAI Conference 2016 Conference Paper

Swift: Compiled Inference for Probabilistic Programming Languages

  • Yi Wu
  • Lei Li
  • Stuart Russell
  • Rastislav Bodik

A probabilistic program defines a probability measure over its semantic structures. One common goal of probabilistic programming languages (PPLs) is to compute posterior probabilities for arbitrary models and queries, given observed evidence, using a generic inference engine. Most PPL inference engines - even the compiled ones - incur significant runtime interpretation overhead, especially for contingent and open-universe models. This paper describes Swift, a compiler for the BLOG PPL. Swift-generated code incorporates optimizations that eliminate interpretation overhead, maintain dynamic dependencies efficiently, and handle memory management for possible worlds of varying sizes. Experiments comparing Swift with other PPL engines on a variety of inference problems demonstrate speedups ranging from 12x to 326x.

NeurIPS Conference 2016 Conference Paper

Value Iteration Networks

  • Aviv Tamar
  • Yi Wu
  • Garrett Thomas
  • Sergey Levine
  • Pieter Abbeel

We introduce the value iteration network (VIN): a fully differentiable neural network with a 'planning module' embedded within. VINs can learn to plan, and are suitable for predicting outcomes that involve planning-based reasoning, such as policies for reinforcement learning. Key to our approach is a novel differentiable approximation of the value-iteration algorithm, which can be represented as a convolutional neural network and trained end-to-end using standard backpropagation. We evaluate VIN-based policies on discrete and continuous path-planning domains, and on a natural-language-based search task. We show that by learning an explicit planning computation, VIN policies generalize better to new, unseen domains.

NeurIPS Conference 2012 Conference Paper

Dual-Space Analysis of the Sparse Linear Model

  • Yi Wu
  • David Wipf

Sparse linear (or generalized linear) models combine a standard likelihood function with a sparse prior on the unknown coefficients. These priors can conveniently be expressed as a maximization over zero-mean Gaussians with different variance hyperparameters. Standard MAP estimation (Type I) involves maximizing over both the hyperparameters and coefficients, while an empirical Bayesian alternative (Type II) first marginalizes the coefficients and then maximizes over the hyperparameters, leading to a tractable posterior approximation. The underlying cost functions can be related via a dual-space framework from Wipf et al. (2011), which allows both the Type I and Type II objectives to be expressed in either coefficient or hyperparameter space. This perspective is useful because some analyses or extensions are more conducive to development in one space or the other. Herein we consider the estimation of a trade-off parameter balancing sparsity and data fit. As this parameter is effectively a variance, natural estimators exist by assessing the problem in hyperparameter (variance) space, transferring natural ideas from Type II to solve what is much less intuitive for Type I. In contrast, for analyses of update rules and sparsity properties of local and global solutions, as well as extensions to more general likelihood models, we can leverage coefficient-space techniques developed for Type I and apply them to Type II. For example, this allows us to prove that Type II-inspired techniques can successfully recover sparse coefficients when unfavorable restricted isometry properties (RIP) lead to failure of popular L1 reconstructions. It also facilitates the analysis of Type II when non-Gaussian likelihood models lead to intractable integrations.
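
In the standard notation of this line of work, the two objectives the dual-space framework connects can be written as follows, with $\Phi$ the design matrix, $\lambda$ the trade-off parameter the abstract discusses, and $w_i \sim \mathcal{N}(0, \gamma_i)$ the Gaussian scale-mixture prior. This is a sketch of the usual forms, not quoted from the paper.

```latex
% Type I (MAP) vs. Type II (evidence maximization) for y = \Phi w + noise.
\text{Type I:}\quad
  \min_{w}\ \tfrac{1}{\lambda}\,\lVert y - \Phi w \rVert_2^2 + \sum_i g(w_i)
\qquad
\text{Type II:}\quad
  \min_{\gamma \ge 0}\ y^{\top} \Sigma_y^{-1} y + \log\det \Sigma_y,
\quad \Sigma_y = \lambda I + \Phi\,\mathrm{diag}(\gamma)\,\Phi^{\top}
```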

AAAI Conference 2012 Conference Paper

Polynomially Decomposable Global Cost Functions in Weighted Constraint Satisfaction

  • Jimmy Lee
  • Ka Lun Leung
  • Yi Wu

In maintaining consistencies, such as GAC*, FDGAC* and weak EDGAC*, for global cost functions, Weighted CSP (WCSP) solvers rely on the projection and extension operations, which entail the computation of the cost functions' minima. Tractability of this minimum computation is essential for efficient execution. Since projections/extensions modify the cost functions, an important issue is tractable projection-safety, concerning whether minimum cost computation remains tractable after projections/extensions. In this paper, we prove that tractable projection-safety is always possible for projections/extensions to/from the nullary cost function (W∅), and always impossible for projections/extensions to/from n-ary cost functions for n ≥ 2. When n = 1, the answer is indefinite. We give a simple negative example, while Lee and Leung's flow-based projection-safe cost functions are also tractable projection-safe. We propose polynomially decomposable cost functions, which are amenable to tractable minimum computation. We further prove that the polynomial decomposability property is unaffected by projections/extensions to/from unary cost functions. Thus, polynomially decomposable cost functions are tractable projection-safe. We show that SOFT AMONG, SOFT REGULAR, SOFT GRAMMAR and MAX WEIGHT/MIN WEIGHT are polynomially decomposable. They are embedded in a WCSP solver for extensive experiments to confirm the feasibility and efficiency of our proposal.
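
For intuition, projection to the nullary cost function W∅ moves the minimum cost of a cost function into W∅ and subtracts it from every tuple, preserving the total cost of every complete assignment. The sketch below enumerates tuples explicitly purely for illustration: for a global cost function over many variables this enumeration is exponential, which is exactly why the minimum must be computable in polynomial time, as tractable projection-safety demands.

```python
def project_to_nullary(w_nullary, cost_fn, tuples):
    """Projection to W0 (sketch): shift the minimum cost of a cost function
    into the nullary cost function and subtract it from every tuple.

    NOTE: enumerating `tuples` is exponential for global cost functions;
    real solvers require a polynomial-time minimum computation instead."""
    alpha = min(cost_fn(t) for t in tuples)            # the minimum computation
    shifted = {t: cost_fn(t) - alpha for t in tuples}  # residual cost function
    return w_nullary + alpha, (lambda t: shifted[t])
```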

IJCAI Conference 2011 Conference Paper

Feature Selection via Joint Embedding Learning and Sparse Regression

  • Chenping Hou
  • Feiping Nie
  • Dongyun Yi
  • Yi Wu

The problem of feature selection has aroused considerable research interest in the past few years. Traditional learning-based feature selection methods separate embedding learning and feature ranking. In this paper, we introduce a novel unsupervised feature selection approach via Joint Embedding Learning and Sparse Regression (JELSR). Instead of simply employing the graph Laplacian for embedding learning and then regression, we use weights obtained via locally linear approximation to construct the graph, and unify embedding learning and sparse regression to perform feature selection. By adding the l2,1-norm regularization, we can learn a sparse matrix for feature ranking. We also provide an effective method to solve the proposed problem. Compared with traditional unsupervised feature selection methods, our approach integrates the merits of embedding learning and sparse regression simultaneously. Extensive experimental results are provided to show the validity of the approach.
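
The final ranking step implied by the l2,1-norm penalty is simple to sketch: the penalty drives entire rows of the learned projection matrix toward zero, so features are scored by the l2 norm of their row. A minimal NumPy sketch; learning W itself (the joint embedding/regression optimization) is omitted.

```python
import numpy as np

def rank_features_by_row_norm(W):
    """Rank features from an l2,1-regularized projection matrix (sketch).
    W: (n_features, embedding_dim) matrix; rows of unimportant features
    are driven toward zero by the l2,1 penalty."""
    scores = np.linalg.norm(W, axis=1)  # ||w_i||_2 per feature
    return np.argsort(-scores)          # most important features first
```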

IJCAI Conference 2011 Conference Paper

Local and Structural Consistency for Multi-Manifold Clustering

  • Yong Wang
  • Yuan Jiang
  • Yi Wu
  • Zhi-Hua Zhou

Data sets containing multi-manifold structures are ubiquitous in real-world tasks, and effective grouping of such data is an important yet challenging problem. Though there have been many studies on this problem, it is not clear how to design principled methods for grouping multiple hybrid manifolds. In this paper, we show that spectral methods are potentially helpful for hybrid-manifold clustering when the neighborhood graph is constructed to connect neighboring samples from the same manifold. However, traditional algorithms which identify neighbors according to Euclidean distance will easily connect samples belonging to different manifolds. To handle this drawback, we propose a new criterion, i.e., the local and structural consistency criterion, which considers the neighboring information as well as the structural information implied by the samples. Based on this criterion, we develop a simple yet effective algorithm, named Local and Structural Consistency (LSC), for clustering with multiple hybrid manifolds. Experiments show that LSC achieves promising performance.

AAAI Conference 2011 Conference Paper

Localized K-Flats

  • Yong Wang
  • Yuan Jiang
  • Yi Wu
  • Zhi-Hua Zhou

K-flats is a model-based linear manifold clustering algorithm which has been successfully applied in many real-world scenarios. Though some previous works have shown that K-flats does not always provide good performance, little effort has been devoted to analyzing its inherent deficiency. In this paper, we address this challenge by showing that the deteriorating performance of K-flats can be attributed to the usual reconstruction error measure and the infinitely extending representations of linear models. We then propose the Localized K-flats algorithm (LKF), which introduces localized representations of linear models and a new distortion measure, to remove confusion among different clusters. Experiments on both synthetic and real-world data sets demonstrate the efficiency of the proposed algorithm. Moreover, preliminary experiments show that LKF has the potential to group manifolds with nonlinear structure.
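
For reference, plain K-flats, the algorithm LKF modifies, alternates between assigning each point to its nearest low-dimensional affine flat and refitting each flat by PCA on its assigned points. A minimal NumPy sketch of this baseline; LKF's localized representations and new distortion measure are not shown.

```python
import numpy as np

def k_flats(X, k, dim, n_iters=20, rng=None):
    """Plain K-flats (sketch): alternate assignment to the nearest
    d-dimensional affine flat and PCA refitting of each flat."""
    rng = rng or np.random.default_rng()
    n = len(X)
    labels = rng.integers(0, k, size=n)  # random initial assignment
    for _ in range(n_iters):
        flats = []
        for j in range(k):
            P = X[labels == j]
            if len(P) <= dim:  # guard: re-seed tiny/empty clusters
                P = X[rng.choice(n, size=dim + 1, replace=False)]
            mu = P.mean(axis=0)
            # Basis of the flat: top principal directions of the cluster.
            _, _, Vt = np.linalg.svd(P - mu, full_matrices=False)
            flats.append((mu, Vt[:dim]))
        # Reassign each point to the flat with the smallest residual.
        dists = np.stack([
            np.linalg.norm((X - mu) - (X - mu) @ B.T @ B, axis=1)
            for mu, B in flats])
        labels = dists.argmin(axis=0)
    return labels, flats
```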