Author name cluster

Roy Fox

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

24 papers

2 author rows

KR Conference 2025 Conference Paper

Explanations for Unrealizability of Infinite-State Safety Shields

Andoni Rodríguez
Irfansha Shaik
Davide Corsi
Roy Fox
César Sánchez

Safe Reinforcement Learning focuses on developing optimal policies while ensuring safety. A popular method to address such a task is shielding, in which a correct-by-construction safety component is synthesized from logical specifications. Recently, shield synthesis has been extended to infinite-state domains, such as continuous environments. This makes shielding more applicable to realistic scenarios. However, often shields might be unrealizable because the specification is inconsistent. In order to address this gap, we present a method to obtain simple unconditional and conditional explanations that witness unrealizability, which goes by temporal formula unrolling: bounded strategy search. In this paper, we show different variants of the technique as well as its applicability

RLC Conference 2025 Conference Paper

Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distractions

Kyungmin Kim
JB Lanier
Roy Fox

Model-Based Reinforcement Learning (MBRL) has shown promise in visual control tasks due to its data efficiency. However, training MBRL agents to develop generalizable perception remains challenging, especially amid visual distractions that introduce noise in representation learning. We introduce Segmentation Dreamer (SD), a framework that facilitates representation learning in MBRL by incorporating a novel auxiliary task. Assuming that task-relevant components in images can be easily identified with prior knowledge in a given task, SD uses segmentation masks on image observations to reconstruct only task-relevant regions, reducing representation complexity. SD can leverage either ground-truth masks available in simulation or potentially imperfect segmentation foundation models. The latter is further improved by selectively applying the image reconstruction loss to mitigate misleading learning signals from mask prediction errors. In modified DeepMind Control suite and Meta-World tasks with added visual distractions, SD achieves significantly better sample efficiency and greater final performance than prior work and is especially effective in sparse reward tasks that had been unsolvable by prior work. In a real-world robotic lane-following task, our method trained with intentional distractions provides early evidence that a model-based method can transfer from simulation to a real robot under visual domain randomization.

RLJ Journal 2025 Journal Article

Make the Pertinent Salient: Task-Relevant Reconstruction for Visual Control with Distractions

Kyungmin Kim
JB Lanier
Roy Fox

PRL Workshop 2024 Workshop Paper

Q* Search: Heuristic Search with Deep Q-Networks

Forest Agostinelli
Shahaf S. Shperberg
Alexander Shmakov
Stephen Marcus McAleer
Roy Fox
Pierre Baldi

Efficiently solving problems with large action spaces using A* search has been of importance to the artificial intelligence community for decades. This is because the computation and memory requirements of A* search grow linearly with the size of the action space. This burden becomes even more apparent when A* search uses a heuristic function learned by computationally expensive function approximators, such as deep neural networks. To address this problem, we introduce Q* search, a search algorithm that uses deep Q-networks to guide search in order to take advantage of the fact that the sum of the transition costs and heuristic values of the children of a node can be computed with a single forward pass through a deep Q-network without explicitly generating those children. This significantly reduces computation time and requires only one node to be generated per iteration. We use Q* search on different domains and action spaces, showing that Q* suffers from only a small runtime overhead as the action size increases. In addition, our empirical results show Q* search is up to 129 times faster and generates up to 1288 times fewer nodes than A* search. Finally, although obtaining admissible heuristic functions from deep neural networks is an ongoing area of research, we prove that Q* search is guaranteed to find a shortest path given a heuristic function that does not overestimate the sum of the transition cost and cost-to-go of the state.
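
The idea in the abstract above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a hypothetical `q_values(state)` function standing in for the deep Q-network: one call returns, for every action, an estimate of the transition cost plus the cost-to-go of the resulting child. The open list holds (state, action) pairs, so a child is generated only when its pair is popped. The toy graph, the exact cost-to-go table and all names are assumptions made for illustration.

```python
import heapq
from itertools import count

# Toy deterministic problem (hypothetical): state -> action -> (next_state, cost)
GRAPH = {
    "A": {"x": ("B", 1.0), "y": ("C", 4.0)},
    "B": {"x": ("C", 1.0), "y": ("G", 5.0)},
    "C": {"x": ("G", 1.0)},
    "G": {},
}
# Exact cost-to-go values, used here in place of a learned network.
COST_TO_GO = {"A": 3.0, "B": 2.0, "C": 1.0, "G": 0.0}


def q_values(state):
    """Stand-in for one forward pass: transition cost + cost-to-go per action."""
    return {a: cost + COST_TO_GO[nxt] for a, (nxt, cost) in GRAPH[state].items()}


def step(state, action):
    """Generate a single child by applying one action."""
    return GRAPH[state][action]


def q_star_search(start, is_goal, step, q_values):
    tie = count()          # tiebreaker so the heap never compares states
    open_heap = []         # entries: (priority, tie, parent, action, g_parent, path)
    best_g = {start: 0.0}  # cheapest known path cost per generated state
    generated = 1          # the start state counts as generated

    def expand(state, g, path):
        # Score every action of `state` at once, without generating children.
        for action, q in q_values(state).items():
            heapq.heappush(open_heap, (g + q, next(tie), state, action, g, path))

    if is_goal(start):
        return [start], 0.0, generated
    expand(start, 0.0, [start])

    while open_heap:
        _, _, parent, action, g_parent, path = heapq.heappop(open_heap)
        child, cost = step(parent, action)  # the only node generated this iteration
        generated += 1
        g_child = g_parent + cost
        if is_goal(child):
            return path + [child], g_child, generated
        if g_child >= best_g.get(child, float("inf")):
            continue  # already reached this state at least as cheaply
        best_g[child] = g_child
        expand(child, g_child, path + [child])
    return None, float("inf"), generated


path, cost, generated = q_star_search("A", lambda s: s == "G", step, q_values)
print(path, cost, generated)
```

In this toy run the search generates one node per iteration and never materializes the children reached by the more expensive actions.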

RLJ Journal 2024 Journal Article

Reinforcement Learning from Delayed Observations via World Models

Armin Karamzade
Kyungmin Kim
Montek Kalsi
Roy Fox

In standard reinforcement learning settings, agents typically assume immediate feedback about the effects of their actions after taking them. However, in practice, this assumption may not hold true due to physical constraints and can significantly impact the performance of learning algorithms. In this paper, we address observation delays in partially observable environments. We propose leveraging world models, which have shown success in integrating past observations and learning dynamics, to handle observation delays. By reducing delayed POMDPs to delayed MDPs with world models, our methods can effectively handle partial observability, where existing approaches achieve sub-optimal performance or degrade quickly as observability decreases. Experiments suggest that one of our methods can outperform a naive model-based approach by up to 250%. Moreover, we evaluate our methods on visual delayed environments, for the first time showcasing delay-aware reinforcement learning continuous control with visual observations.

RLC Conference 2024 Conference Paper

Reinforcement Learning from Delayed Observations via World Models

Armin Karamzade
Kyungmin Kim
Montek Kalsi
Roy Fox

ICML Conference 2024 Conference Paper

Skill Set Optimization: Reinforcing Language Model Behavior via Transferable Skills

Kolby Nottingham
Bodhisattwa Prasad Majumder
Bhavana Dalvi Mishra
Sameer Singh 0001
Peter Clark
Roy Fox

Large language models (LLMs) have recently been used for sequential decision making in interactive environments. However, leveraging environment reward signals for continual LLM actor improvement is not straightforward. We propose Skill Set Optimization (SSO) for improving LLM actor performance through constructing and refining sets of transferable skills. SSO constructs skills by extracting common subtrajectories with high rewards and generating subgoals and instructions to represent each skill. These skills are provided to the LLM actor in-context to reinforce behaviors with high rewards. Then, SSO further refines the skill set by pruning skills that do not continue to result in high rewards. We evaluate our method in the classic videogame NetHack and the text environment ScienceWorld to demonstrate SSO’s ability to optimize a set of skills and perform in-context policy improvement. SSO outperforms baselines by 40% in our custom NetHack task and outperforms the previous state-of-the-art in ScienceWorld by 35%.

ICLR Conference 2024 Conference Paper

Toward Optimal Policy Population Growth in Two-Player Zero-Sum Games

Stephen Marcus McAleer
JB Lanier
Kevin A. Wang
Pierre Baldi
Tuomas Sandholm
Roy Fox

In competitive two-agent environments, deep reinforcement learning (RL) methods like Policy Space Response Oracles (PSRO) often increase exploitability between iterations, which is problematic when training in large games. To address this issue, we introduce anytime double oracle (ADO), an algorithm that ensures exploitability does not increase between iterations, and its approximate extensive-form version, anytime PSRO (APSRO). ADO converges to a Nash equilibrium while iteratively reducing exploitability. However, convergence in these algorithms may require adding all of a game's deterministic policies. To improve this, we propose Self-Play PSRO (SP-PSRO), which incorporates an approximately optimal stochastic policy into the population in each iteration. APSRO and SP-PSRO demonstrate lower exploitability and near-monotonic exploitability reduction in games like Leduc poker and Liar's Dice. Empirically, SP-PSRO often converges much faster than APSRO and PSRO, requiring only a few iterations in many games.

RLJ Journal 2024 Journal Article

Verification-Guided Shielding for Deep Reinforcement Learning

Davide Corsi
Guy Amir
Andoni Rodríguez
Guy Katz
César Sánchez
Roy Fox

In recent years, Deep Reinforcement Learning (DRL) has emerged as an effective approach to solving real-world tasks. However, despite their successes, DRL-based policies suffer from poor reliability, which limits their deployment in safety-critical domains. Various methods have been put forth to address this issue by providing formal safety guarantees. Two main approaches include shielding and verification. While shielding ensures the safe behavior of the policy by employing an external online component (i.e., a "shield") that overrides potentially dangerous actions, this approach has a significant computational cost as the shield must be invoked at runtime to validate every decision. On the other hand, verification is an offline process that can identify policies that are unsafe, prior to their deployment, yet without providing alternative actions when such a policy is deemed unsafe. In this work, we present verification-guided shielding, a novel approach that bridges the DRL reliability gap by integrating these two methods. Our approach combines both formal and probabilistic verification tools to partition the input domain into safe and unsafe regions. In addition, we employ clustering and symbolic representation procedures that compress the unsafe regions into a compact representation. This, in turn, allows the shield to be activated temporarily and efficiently, solely in (potentially) unsafe regions. Our novel approach significantly reduces runtime overhead while still preserving formal safety guarantees. We extensively evaluate our approach on two benchmarks from the robotic navigation domain, as well as provide an in-depth analysis of its scalability and completeness.
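
The runtime gating described in the abstract above can be sketched roughly as follows. This is an illustrative assumption, not the paper's implementation: the unsafe regions are represented here as axis-aligned boxes (the hypothetical `Box` class), and `toy_policy`, `toy_shield` and `gated_action` are made-up names. The point is only that the shield is consulted when the current state falls in a region flagged offline as potentially unsafe, and skipped otherwise.

```python
from dataclasses import dataclass
from typing import Tuple

State = Tuple[float, ...]


@dataclass(frozen=True)
class Box:
    """Axis-aligned region of the input domain (hypothetical representation)."""
    low: State
    high: State

    def contains(self, state: State) -> bool:
        return all(lo <= x <= hi for lo, x, hi in zip(self.low, state, self.high))


def gated_action(state, policy, shield, unsafe_regions):
    """Invoke the shield only when the state lies in a flagged unsafe region."""
    action = policy(state)
    if any(box.contains(state) for box in unsafe_regions):
        return shield(state, action)  # shield may override the proposed action
    return action                     # safe region: no shield call needed


# Toy setup (all values are assumptions for illustration).
UNSAFE = [Box(low=(0.0, 0.0), high=(1.0, 1.0))]
shield_calls = []


def toy_policy(state):
    return "forward"


def toy_shield(state, action):
    shield_calls.append(state)  # record how often the shield actually runs
    return "stop"


safe_choice = gated_action((5.0, 5.0), toy_policy, toy_shield, UNSAFE)
risky_choice = gated_action((0.5, 0.5), toy_policy, toy_shield, UNSAFE)
print(safe_choice, risky_choice, len(shield_calls))
```

Of the two decisions in this toy run, only the one taken inside the flagged box triggers a shield call.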

RLC Conference 2024 Conference Paper

Verification-Guided Shielding for Deep Reinforcement Learning

Davide Corsi
Guy Amir
Andoni Rodríguez
Guy Katz
César Sánchez
Roy Fox

In recent years, Deep Reinforcement Learning (DRL) has emerged as an effective approach to solving real-world tasks. However, despite their successes, DRL-based policies suffer from poor reliability, which limits their deployment in safety-critical domains. Various methods have been put forth to address this issue by providing formal safety guarantees. Two main approaches include shielding and verification. While shielding ensures the safe behavior of the policy by employing an external online component (i.e., a "shield") that overrides potentially dangerous actions, this approach has a significant computational cost as the shield must be invoked at runtime to validate every decision. On the other hand, verification is an offline process that can identify policies that are unsafe, prior to their deployment, yet without providing alternative actions when such a policy is deemed unsafe. In this work, we present verification-guided shielding, a novel approach that bridges the DRL reliability gap by integrating these two methods. Our approach combines both formal and probabilistic verification tools to partition the input domain into safe and unsafe regions. In addition, we employ clustering and symbolic representation procedures that compress the unsafe regions into a compact representation. This, in turn, allows the shield to be activated temporarily and efficiently, solely in (potentially) unsafe regions. Our novel approach significantly reduces runtime overhead while still preserving formal safety guarantees. We extensively evaluate our approach on two benchmarks from the robotic navigation domain, as well as provide an in-depth analysis of its scalability and completeness.

ICML Conference 2023 Conference Paper

Do Embodied Agents Dream of Pixelated Sheep: Embodied Decision Making using Language Guided World Modelling

Kolby Nottingham
Prithviraj Ammanabrolu
Alane Suhr
Yejin Choi 0001
Hannaneh Hajishirzi
Sameer Singh 0001
Roy Fox

Reinforcement learning (RL) agents typically learn tabula rasa, without prior knowledge of the world. However, if initialized with knowledge of high-level subgoals and transitions between subgoals, RL agents could utilize this Abstract World Model (AWM) for planning and exploration. We propose using few-shot large language models (LLMs) to hypothesize an AWM, that will be verified through world experience, to improve sample efficiency of RL agents. Our DECKARD agent applies LLM-guided exploration to item crafting in Minecraft in two phases: (1) the Dream phase where the agent uses an LLM to decompose a task into a sequence of subgoals, the hypothesized AWM; and (2) the Wake phase where the agent learns a modular policy for each subgoal and verifies or corrects the hypothesized AWM. Our method of hypothesizing an AWM with LLMs and then verifying the AWM based on agent experience not only increases sample efficiency over contemporary methods by an order of magnitude but is also robust to and corrects errors in the LLM, successfully blending noisy internet-scale information from LLMs with knowledge grounded in environment dynamics.

ICML Conference 2023 Conference Paper

Learning to Design Analog Circuits to Meet Threshold Specifications

Dmitrii Krylov
Pooya Khajeh
Junhan Ouyang
Thomas Reeves
Tongkai Liu
Hiba Ajmal
Hamidreza Aghasi
Roy Fox

Automated design of analog and radio-frequency circuits using supervised or reinforcement learning from simulation data has recently been studied as an alternative to manual expert design. It is straightforward for a design agent to learn an inverse function from desired performance metrics to circuit parameters. However, it is more common for a user to have threshold performance criteria rather than an exact target vector of feasible performance measures. In this work, we propose a method for generating from simulation data a dataset on which a system can be trained via supervised learning to design circuits to meet threshold specifications. We moreover perform the to-date most extensive evaluation of automated analog circuit design, including experimenting in a significantly more diverse set of circuits than in prior work, covering linear, nonlinear, and autonomous circuit configurations, and show that our method consistently reaches success rate better than 90% at 5% error margin, while also improving data efficiency by upward of an order of magnitude.

ICML Conference 2022 Conference Paper

Reducing Variance in Temporal-Difference Value Estimation via Ensemble of Deep Networks

Litian Liang
Yaosheng Xu
Stephen Marcus McAleer
Dailin Hu
Alexander Ihler
Pieter Abbeel
Roy Fox

In temporal-difference reinforcement learning algorithms, variance in value estimation can cause instability and overestimation of the maximal target value. Many algorithms have been proposed to reduce overestimation, including several recent ensemble methods; however, none have shown success in sample-efficient learning through addressing estimation variance as the root cause of overestimation. In this paper, we propose MeanQ, a simple ensemble method that estimates target values as ensemble means. Despite its simplicity, MeanQ shows remarkable sample efficiency in experiments on the Atari Learning Environment benchmark. Importantly, we find that an ensemble of size 5 sufficiently reduces estimation variance to obviate the lagging target network, eliminating it as a source of bias and further gaining sample efficiency. We justify intuitively and empirically the design choices in MeanQ, including the necessity of independent experience sampling. On a set of 26 benchmark Atari environments, MeanQ outperforms all tested baselines, including the best available baseline, SUNRISE, at 100K interaction steps in 16/26 environments, and by 68% on average. MeanQ also outperforms Rainbow DQN at 500K steps in 21/26 environments, and by 49% on average, and achieves average human-level performance using 200K (±100K) interaction steps. Our implementation is available at https://github.com/indylab/MeanQ.
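
A rough sketch of the idea of estimating target values as ensemble means, under stated assumptions: the abstract does not give the exact target formula, so the code below assumes a standard one-step target in which the per-action Q-values are averaged over ensemble members before the maximum over actions is taken. The function name `meanq_target` and the toy numbers are hypothetical; the authors' implementation is at the repository linked in the abstract.

```python
from statistics import fmean


def meanq_target(reward, next_q_values, gamma=0.99, done=False):
    """One-step TD target using the ensemble mean of next-state Q-values.

    next_q_values: list of K lists, one per ensemble member,
    each holding one Q-value estimate per action.
    """
    if done:
        return reward
    n_actions = len(next_q_values[0])
    # Average each action's estimate across ensemble members first ...
    mean_q = [fmean(member[a] for member in next_q_values) for a in range(n_actions)]
    # ... then maximize over actions, so member-specific noise is averaged out.
    return reward + gamma * max(mean_q)


# Three hypothetical ensemble members, two actions.
ensemble_next_q = [
    [1.0, 2.0],
    [3.0, 0.0],
    [2.0, 1.0],
]
target = meanq_target(reward=1.0, next_q_values=ensemble_next_q, gamma=0.5)

# For contrast: averaging each member's own maximum gives a larger value.
mean_of_max = fmean(max(member) for member in ensemble_next_q)
print(target, mean_of_max)
```

In the toy numbers, the maximum of the averaged estimates is 2.0, while the average of the per-member maxima is about 2.33, which illustrates how averaging before maximizing damps the upward bias caused by noisy estimates.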

PRL Workshop 2021 Workshop Paper

Obtaining Approximately Admissible Heuristic Functions through Deep Reinforcement Learning and A* Search

Forest Agostinelli
Stephen McAleer
Alexander Shmakov
Roy Fox
Marco Valtorta
Biplav Srivastava
Pierre Baldi

Deep reinforcement learning has been shown to be able to train deep neural networks to implement effective heuristic functions that can be used with A* search to solve problems with large state spaces. However, these learned heuristic functions are not guaranteed to be admissible. We introduce approximately admissible conversion, an algorithm that can convert any inadmissible heuristic function into a heuristic function that is admissible in the vast majority of cases with no domain-specific heuristic information. We apply approximately admissible conversion to heuristic functions parameterized by deep neural networks and show that these heuristic functions can be used to find optimal solutions, or bounded suboptimal solutions, even when doing a batched version of A* search. We test our method on the 15-puzzle and 24-puzzle and obtain a heuristic function that is empirically admissible over 99.99% of the time and that finds optimal solutions for 100% of all test configurations. To the best of our knowledge, this is the first demonstration that approximately admissible heuristics can be obtained using deep neural networks in a domain independent fashion.

real world applications would ensure that artificial intelligence agents can solve problems in the most efficient way possible, or close to the most efficient way possible, which could significantly reduce the resources consumed by such agents. Obtaining an admissible heuristic function often requires domain-specific knowledge. For example, pattern databases (PDBs) (Culberson and Schaeffer 1998) have been successful at finding optimal solutions to puzzles such as the Rubik’s cube (Korf 1997), 15-puzzle, and 24-puzzle (Korf and Felner 2002; Felner, Korf, and Hanan 2004). However, ensuring that these PDBs produce admissible heuristics requires knowledge about how the puzzle pieces interact. There has been previous research on using deep neural networks to learn heuristic functions (Chen and Wei 2011; Wang et al. 2019; Ferber, Helmert, and Hoffmann 2020) including the DeepCubeA algorithm (McAleer et al. 2019; Agostinelli et al. 2019) which used deep reinforcement learning and weighted A* search (Pohl 1970) to solve the aforementioned puzzles. However, the heuristic functions produced by DeepCubeA are not admissible. In this paper, we define an approximately admissible

NeurIPS Conference 2021 Conference Paper

XDO: A Double Oracle Algorithm for Extensive-Form Games

Stephen McAleer
JB Lanier
Kevin A Wang
Pierre Baldi
Roy Fox

Policy Space Response Oracles (PSRO) is a reinforcement learning (RL) algorithm for two-player zero-sum games that has been empirically shown to find approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to an approximate Nash equilibrium and can handle continuous actions, it may take an exponential number of iterations as the number of information states (infostates) grows. We propose Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm for two-player zero-sum games that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations an order of magnitude smaller than PSRO. Experiments on a modified Leduc poker game and Oshi-Zumo show that tabular XDO achieves a lower exploitability than CFR with the same amount of computation. We also find that NXDO outperforms PSRO and NFSP on a sequential multidimensional continuous-action game. NXDO is the first deep RL method that can find an approximate Nash equilibrium in high-dimensional continuous-action sequential games.

NeurIPS Conference 2020 Conference Paper

Pipeline PSRO: A Scalable Approach for Finding Approximate Nash Equilibria in Large Games

Stephen McAleer
JB Lanier
Roy Fox
Pierre Baldi

Finding approximate Nash equilibria in zero-sum imperfect-information games is challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable PSRO-based method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able to parallelize PSRO with convergence guarantees by maintaining a hierarchical pipeline of reinforcement learning workers, each training against the policies generated by lower levels in the hierarchy. We show that unlike existing methods, P2SRO converges to an approximate Nash equilibrium, and does so faster as the number of parallel workers increases, across a variety of imperfect information games. We also introduce an open-source environment for Barrage Stratego, a variant of Stratego with an approximate game tree complexity of 10^50. P2SRO is able to achieve state-of-the-art performance on Barrage Stratego and beats all existing bots. Experiment code is available at https://github.com/JBLanier/pipeline-psro.

ICRA Conference 2018 Conference Paper

Fast and Reliable Autonomous Surgical Debridement with Cable-Driven Robots Using a Two-Phase Calibration Procedure

Daniel Seita
Sanjay Krishnan
Roy Fox
Stephen McKinley
John F. Canny
Ken Goldberg

Automating precision subtasks such as debridement (removing dead or diseased tissue fragments) with Robotic Surgical Assistants (RSAs) such as the da Vinci Research Kit (dVRK) is challenging due to inherent nonlinearities in cable-driven systems. We propose and evaluate a novel two-phase coarse-to-fine calibration method. In Phase I (coarse), we place a red calibration marker on the end effector and let it randomly move through a set of open-loop trajectories to obtain a large sample set of camera pixels and internal robot end-effector configurations. This coarse data is then used to train a Deep Neural Network (DNN) to learn the coarse transformation bias. In Phase II (fine), the bias from Phase I is applied to move the end-effector toward a small set of specific target points on a printed sheet. For each target, a human operator manually adjusts the end-effector position by direct contact (not through teleoperation) and the residual compensation bias is recorded. This fine data is then used to train a Random Forest (RF) to learn the fine transformation bias. Subsequent experiments suggest that without calibration, position errors average 4.55 mm. Phase I can reduce average error to 2.14 mm and the combination of Phase I and Phase II can reduce average error to 1.08 mm. We apply these results to debridement of raisins and pumpkin seeds as fragment phantoms. Using an endoscopic stereo camera with standard edge detection, experiments with 120 trials achieved average success rates of 94.5%, exceeding prior results with much larger fragments (89.4%) and achieving a speedup of 2.1x, decreasing time per fragment from 15.8 seconds to 7.3 seconds. Source code, data, and videos are available at https://sites.google.com/view/calib-icra/.

ICML Conference 2018 Conference Paper

RLlib: Abstractions for Distributed Reinforcement Learning

Eric Liang
Richard Liaw
Robert Nishihara
Philipp Moritz
Roy Fox
Ken Goldberg
Joseph E. Gonzalez
Michael I. Jordan

Reinforcement learning (RL) algorithms involve the deep nesting of highly irregular computation patterns, each of which typically exhibits opportunities for distributed computation. We argue for distributing RL components in a composable way by adapting algorithms for top-down hierarchical control, thereby encapsulating parallelism and resource requirements within short-running compute tasks. We demonstrate the benefits of this principle through RLlib: a library that provides scalable software primitives for RL. These primitives enable a broad range of algorithms to be implemented with high performance, scalability, and substantial code reuse. RLlib is available as part of the open source Ray project at http://rllib.io/.

ICRA Conference 2018 Conference Paper

Robustly Adjusting Indoor Drip Irrigation Emitters with the Toyota HSR Robot

Ron Berenstein
Roy Fox
Stephen McKinley
Stefano Carpin
Ken Goldberg

Indoor plants in homes and commercial buildings such as malls, offices, airports, and hotels can benefit from precision irrigation to maintain healthy growth and reduce water consumption. As active valves are too costly, and ongoing precise manual adjustment of drip emitters is impractical, we explore how the Toyota HSR mobile manipulator robot can autonomously adjust low-cost passive emitters. To provide sufficient accuracy for gripper alignment, we designed a lightweight, modular Emitter Localization Device (ELD) with cameras and LEDs that can be non-invasively mounted on the arm. This paper presents details of the design, algorithms, and experiments with adjusting emitters using a two-phase procedure: (1) aligning the robot base using the built-in hand camera, and (2) aligning the gripper axis with the emitter axis using the ELD. We report success rates and sensitivity analysis to tune computer vision parameters and joint motor gains. Experiments suggest that emitters can be adjusted with a 95% success rate in approximately 20 seconds.

EWRL Workshop 2016 Workshop Paper

Principled Option Learning in Markov Decision Processes

Roy Fox
Michal Moshkovitz
Naftali Tishby

It is well known that options can make planning more efficient, among their many benefits. Thus far, algorithms for autonomously discovering a set of useful options were heuristic. Naturally, a principled way of finding a set of useful options may be more promising and insightful. In this paper we suggest a mathematical characterization of good sets of options using tools from information theory. This characterization enables us to find conditions for a set of options to be optimal and an algorithm that outputs a useful set of options. We illustrate the proposed algorithm in simulation.

UAI Conference 2016 Conference Paper

Taming the Noise in Reinforcement Learning via Soft Updates

Roy Fox
Ari Pakman
Naftali Tishby

Model-free reinforcement learning algorithms, such as Q-learning, perform poorly in the early stages of learning in noisy environments, because much effort is spent unlearning biased estimates of the state-action value function. The bias results from selecting, among several noisy estimates, the apparent optimum, which may actually be suboptimal. We propose G-learning, a new off-policy learning algorithm that regularizes the value estimates by penalizing deterministic policies in the beginning of the learning process. We show that this method reduces the bias of the value-function estimation, leading to faster convergence to the optimal value and the optimal policy. Moreover, G-learning enables the natural incorporation of prior domain knowledge, when available. The stochastic nature of G-learning also makes it avoid some exploration costs, a property usually attributed only to on-policy algorithms. We illustrate these ideas in several examples, where G-learning results in significant improvements of the convergence rate and the cost of the learning process.

NeurIPS Conference 2013 Conference Paper

A multi-agent control framework for co-adaptation in brain-computer interfaces

Josh Merel
Roy Fox
Tony Jebara
Liam Paninski

In a closed-loop brain-computer interface (BCI), adaptive decoders are used to learn parameters suited to decoding the user's neural response. Feedback to the user provides information which permits the neural tuning to also adapt. We present an approach to model this process of co-adaptation between the encoding model of the neural signal and the decoding algorithm as a multi-agent formulation of the linear quadratic Gaussian (LQG) control problem. In simulation we characterize how decoding performance improves as the neural encoding and adaptive decoder optimize, qualitatively resembling experimentally demonstrated closed-loop improvement. We then propose a novel, modified decoder update rule which is aware of the fact that the encoder is also changing and show it can improve simulated co-adaptation dynamics. Our modeling approach offers promise for gaining insights into co-adaptation as well as improving user learning of BCI control in practical settings.

ICML Conference 2012 Conference Paper

Bounded Planning in Passive POMDPs

Roy Fox
Naftali Tishby

AAAI Conference 2007 Conference Paper

A Reinforcement Learning Algorithm with Polynomial Interaction Complexity for Only-Costly-Observable MDPs

Roy Fox

An Unobservable MDP (UMDP) is a POMDP in which there are no observations. An Only-Costly-Observable MDP (OCOMDP) is a POMDP which extends a UMDP by allowing a particular costly action which completely observes the state. We introduce UR-MAX, a reinforcement learning algorithm with polynomial interaction complexity for unknown OCOMDPs.
