Arrow Research search

Author name cluster

Kyle Wray

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
1 author row

Possible papers

6

JAIR Journal 2026 Journal Article

Backward Monte Carlo Tree Search: Charting Unsafe Regions in the Belief-Space

  • Anil Yildiz
  • Esen Yel
  • Marcell Vazquez-Chanlatte
  • Kyle Wray
  • Mykel J. Kochenderfer
  • Stefan J. Witwicki

Safety-critical systems often operate in partially observable environments, where assessing the safety of the underlying policy remains a fundamental challenge. This study focuses on evaluating policies by identifying regions of the belief-space that can lead the system’s policy to an undesirable state with a non-negligible probability. In this paper, we introduce Backward Monte Carlo Tree Search, the first Monte Carlo tree search framework that expands backward in time within the belief-space. The tree search begins from an undesired terminal belief and recursively explores its possible predecessors, constructing a tree of belief transitions that could lead to an unsafe outcome within a given horizon. Evaluations in gridworld and autonomous driving domains show that identifying beliefs from which failures may occur enables runtime risk forecasting and targeted policy retraining, marking a conceptual shift in how safety is validated under uncertainty.

AAMAS Conference 2024 Conference Paper

Decision Making in Non-Stationary Environments with Policy-Augmented Search

  • Ava Pettet
  • Yunuo Zhang
  • Baiting Luo
  • Kyle Wray
  • Hendrik Baier
  • Aron Laszka
  • Abhishek Dubey
  • Ayan Mukhopadhyay

Sequential decision-making is challenging in non-stationary environments, where the environment in which an agent operates can change over time. Policies learned before execution become stale when the environment changes, and relearning takes time and computational effort. Online search, on the other hand, can return sub-optimal actions when there are limitations on allowed runtime. In this paper, we introduce Policy-Augmented Monte Carlo tree search (PA-MCTS), which combines action-value estimates from an out-of-date policy with an online search using an up-to-date model of the environment. We prove several theoretical results about PA-MCTS. We also compare and contrast our approach with AlphaZero, another hybrid planning approach, and Deep Q Learning on several OpenAI Gym environments and show that PA-MCTS outperforms these baselines.

AAAI Conference 2018 Conference Paper

Integrated Cooperation and Competition in Multi-Agent Decision-Making

  • Kyle Wray
  • Akshat Kumar
  • Shlomo Zilberstein

Observing that many real-world sequential decision problems are not purely cooperative or purely competitive, we propose a new model—cooperative-competitive process (CCP)—that can simultaneously encapsulate both cooperation and competition. First, we discuss how the CCP model bridges the gap between cooperative and competitive models. Next, we investigate a specific class of group-dominant CCPs, in which agents cooperate to achieve a common goal as their primary objective, while also pursuing individual goals as a secondary objective. We provide an approximate solution for this class of problems that leverages stochastic finite-state controllers. The model is grounded in two multi-robot meeting and boxpushing domains that are implemented in simulation and demonstrated on two real robots.

AAAI Conference 2017 Conference Paper

Fast SSP Solvers Using Short-Sighted Labeling

  • Luis Pineda
  • Kyle Wray
  • Shlomo Zilberstein

State-of-the-art methods for solving SSPs often work by limiting planning to restricted regions of the state space. The resulting problems can then be solved quickly, and the process is repeated during execution when states outside the restricted region are encountered. Typically, these approaches focus on states that are within some distance measure of the start state (e. g. , number of actions or probability of being reached). However, these short-sighted approaches make it difficult to propagate information from states that are closer to a goal than to the start state, thus missing opportunities to improve planning. We present an alternative approach in which short-sightedness is used only to determine whether a state should be labeled as solved or not, but otherwise the set of states that can be accounted for during planning is unrestricted. Based on this idea, we propose the FLARES algorithm and show that it performs consistently well on a wide range of benchmark problems.

AAAI Conference 2016 Conference Paper

A POMDP Formulation of Proactive Learning

  • Kyle Wray
  • Shlomo Zilberstein

We cast the Proactive Learning (PAL) problem—Active Learning (AL) with multiple reluctant, fallible, cost-varying oracles—as a Partially Observable Markov Decision Process (POMDP). The agent selects an oracle at each time step to label a data point while it maintains a belief over the true underlying correctness of its current dataset’s labels. The goal is to minimize labeling costs while considering the value of obtaining correct labels, thus maximizing final resultant classifier accuracy. We prove three properties that show our particular formulation leads to a structured and bounded-size set of belief points, enabling strong performance of pointbased methods to solve the POMDP. Our method is compared with the original three algorithms proposed by Donmez and Carbonell and a simple baseline. We demonstrate that our approach matches or improves upon the original approach within five different oracle scenarios, each on two datasets. Finally, our algorithm provides a general, well-defined mathematical foundation to build upon.

AAAI Conference 2015 Conference Paper

Multi-Objective MDPs with Conditional Lexicographic Reward Preferences

  • Kyle Wray
  • Shlomo Zilberstein
  • Abdel-Illah Mouaddib

Sequential decision problems that involve multiple objectives are prevalent. Consider for example a driver of a semiautonomous car who may want to optimize competing objectives such as travel time and the effort associated with manual driving. We introduce a rich model called Lexicographic MDP (LMDP) and a corresponding planning algorithm called LVI that generalize previous work by allowing for conditional lexicographic preferences with slack. We analyze the convergence characteristics of LVI and establish its game theoretic properties. The performance of LVI in practice is tested within a realistic benchmark problem in the domain of semi-autonomous driving. Finally, we demonstrate how GPU-based optimization can improve the scalability of LVI and other value iteration algorithms for MDPs.