Arrow Research

Author name cluster

David Abel

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers (30)

ICLR Conference 2025 Conference Paper

A Black Swan Hypothesis: The Role of Human Irrationality in AI Safety

  • Hyunin Lee
  • Chanwoo Park
  • David Abel
  • Ming Jin 0002

Black swan events are statistically rare occurrences that carry extremely high risk. The typical view holds that black swan events originate from unpredictable, time-varying environments; however, the community lacks a comprehensive definition of black swan events. To this end, this paper argues that the standard view is incomplete, and claims that high-risk, statistically rare events can also occur in unchanging environments due to human misperception of their value and likelihood, which we call spatial black swan events. We first carefully categorize black swan events, focusing on spatial black swan events, and mathematically formalize the definition of black swan events. We hope these definitions can pave the way for the development of algorithms to prevent such events by rationally correcting human perception.

NeurIPS Conference 2025 Conference Paper

Enhancing Tactile-based Reinforcement Learning for Robotic Control

  • Elle Miller
  • Trevor McInroe
  • David Abel
  • Oisin Mac Aodha
  • Sethu Vijayakumar

Achieving safe, reliable real-world robotic manipulation requires agents to evolve beyond vision and incorporate tactile sensing to overcome sensory deficits and reliance on idealised state information. Despite its potential, the efficacy of tactile sensing in reinforcement learning (RL) remains inconsistent. We address this by developing self-supervised learning (SSL) methodologies to more effectively harness tactile observations, focusing on a scalable setup of proprioception and sparse binary contacts. We empirically demonstrate that sparse binary tactile signals are critical for dexterity, particularly for interactions that proprioceptive control errors do not register, such as decoupled robot-object motions. Our agents achieve superhuman dexterity in complex contact tasks (ball bouncing and Baoding ball rotation). Furthermore, we find that decoupling the SSL memory from the on-policy memory can improve performance. We release the Robot Tactile Olympiad ($\texttt{RoTO}$) benchmark to standardise and promote future research in tactile-based manipulation. Project page: https://elle-miller.github.io/tactile_rl.

ICML Conference 2025 Conference Paper

General agents need world models

  • Jonathan Richens
  • Tom Everitt
  • David Abel

Are world models a necessary ingredient for flexible, goal-directed behaviour, or is model-free learning sufficient? We provide a formal answer to this question, showing that any agent capable of generalizing to multi-step goal-directed tasks must have learned a predictive model of its environment. We show that this model can be extracted from the agent’s policy, and that increasing the agent’s performance or the complexity of the goals it can achieve requires learning increasingly accurate world models. This has a number of consequences: from developing safe and general agents, to bounding agent capabilities in complex environments, and providing new algorithms for eliciting world models from agents.

JMLR Journal 2025 Journal Article

Optimizing Return Distributions with Distributional Dynamic Programming

  • Bernardo Ávila Pires
  • Mark Rowland
  • Diana Borsa
  • Zhaohan Daniel Guo
  • Khimya Khetarpal
  • André Barreto
  • David Abel
  • Rémi Munos

We introduce distributional dynamic programming (DP) methods for optimizing statistical functionals of the return distribution, with standard reinforcement learning as a special case. Previous distributional DP methods could optimize the same class of expected utilities as classic DP. To go beyond, we combine distributional DP with stock augmentation, a technique previously introduced for classic DP in the context of risk-sensitive RL, where the MDP state is augmented with a statistic of the rewards obtained since the first time step. We find that a number of recently studied problems can be formulated as stock-augmented return distribution optimization, and we show that we can use distributional DP to solve them. We analyze distributional value and policy iteration, with bounds and a study of what objectives these distributional DP methods can or cannot optimize. We describe a number of applications outlining how to use distributional DP to solve different stock-augmented return distribution optimization problems, for example, maximizing conditional value-at-risk and homeostatic regulation. To highlight the practical potential of stock-augmented return distribution optimization and distributional DP, we introduce an agent that combines DQN and the core ideas of distributional DP, and empirically evaluate it on instances of the applications discussed.
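
To make stock augmentation concrete, below is a minimal Python sketch (an illustration, not the paper's agent) in which the state is extended with a running discounted-reward statistic; the environment interface and all names are hypothetical.

class StockAugmentedEnv:
    """Wrap an episodic environment so the state carries a reward 'stock'."""

    def __init__(self, env, gamma=0.99):
        self.env = env          # hypothetical env exposing reset()/step(a)
        self.gamma = gamma
        self.stock = 0.0        # statistic of rewards since the first step
        self.discount = 1.0

    def reset(self):
        self.stock, self.discount = 0.0, 1.0
        return (self.env.reset(), self.stock)

    def step(self, action):
        next_state, reward, done = self.env.step(action)
        self.stock += self.discount * reward   # update the reward statistic
        self.discount *= self.gamma
        return (next_state, self.stock), reward, done

Policies over these augmented states can condition on the reward accumulated so far, which is what lets distributional DP target objectives beyond the expected utilities reachable with plain Markov policies.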

NeurIPS Conference 2025 Conference Paper

Plasticity as the Mirror of Empowerment

  • David Abel
  • Michael Bowling
  • Andre Barreto
  • Will Dabney
  • Shi Dong
  • Steven Hansen
  • Anna Harutyunyan
  • Khimya Khetarpal

Agents are, minimally, entities that are influenced by their past observations and act to influence future observations. The latter capacity is captured by empowerment, which has served as a vital framing concept across artificial intelligence and cognitive science. The former capacity, however, is equally foundational: In what ways, and to what extent, can an agent be influenced by what it observes? In this paper, we ground this concept in a universal agent-centric measure that we refer to as plasticity, and reveal a fundamental connection to empowerment. Following a set of desiderata on a suitable definition, we define plasticity using a new information-theoretic quantity we call the generalized directed information. We show that this new quantity strictly generalizes the directed information introduced by Massey (1990) while preserving all of its desirable properties. Under this definition, we find that plasticity is well thought of as the mirror of empowerment: The two concepts are defined using the same measure, with only the direction of influence reversed. Our main result establishes a tension between the plasticity and empowerment of an agent, suggesting that agent design needs to be mindful of both characteristics. We explore the implications of these findings, and suggest that plasticity, empowerment, and their relationship are essential to understanding agency.
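
For reference, the directed information of Massey (1990), which the abstract's generalized directed information strictly extends, is usually written as

$$ I(X^n \to Y^n) \;=\; \sum_{t=1}^{n} I(X^t; Y_t \mid Y^{t-1}), $$

where $X^t = (X_1, \ldots, X_t)$. Conditioning on the past and present of $X$ but only the past of $Y$ is what makes the quantity directional, in contrast to the symmetric mutual information $I(X^n; Y^n)$; reversing the roles of the two processes reverses the direction of influence, which is the sense in which plasticity can mirror empowerment.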

NeurIPS Conference 2025 Conference Paper

Skill-Driven Neurosymbolic State Abstractions

  • Alper Ahmetoglu
  • Steven James
  • Cameron Allen
  • Sam Lobel
  • David Abel
  • George Konidaris

We consider how to construct state abstractions compatible with a given set of abstract actions, to obtain a well-formed abstract Markov decision process (MDP). We show that the Bellman equation suggests that abstract states should represent distributions over states in the ground MDP; we characterize the conditions under which the resulting process is Markov and approximately model-preserving, derive algorithms for constructing and planning with the abstract MDP, and apply them to a visual maze task. We generalize these results to the factored-actions case, characterizing the conditions that result in factored abstract states and applying the resulting algorithm to Montezuma's Revenge. These results provide a powerful and principled framework for constructing neurosymbolic abstract Markov decision processes.

ICLR Conference 2025 Conference Paper

Studying the Interplay Between the Actor and Critic Representations in Reinforcement Learning

  • Samuel Garcin
  • Trevor McInroe
  • Pablo Samuel Castro
  • Christopher G. Lucas
  • David Abel
  • Prakash Panangaden
  • Stefano V. Albrecht

Extracting relevant information from a stream of high-dimensional observations is a central challenge for deep reinforcement learning agents. Actor-critic algorithms add further complexity to this challenge, as it is often unclear whether the same information will be relevant to both the actor and the critic. To this end, we here explore the principles that underlie effective representations for the actor and for the critic in on-policy algorithms. We focus our study on understanding whether the actor and critic will benefit from separate, rather than shared, representations. Our primary finding is that when separated, the representations for the actor and critic systematically specialise in extracting different types of information from the environment: the actor's representation tends to focus on action-relevant information, while the critic's representation specialises in encoding value and dynamics information. We conduct a rigorous empirical study to understand how different representation learning approaches affect the actor and critic's specialisations and their downstream performance, in terms of sample efficiency and generalisation capabilities. Finally, we discover that a separated critic plays an important role in exploration and data collection during training. Our code, trained models and data are accessible at https://github.com/francelico/deac-rep.

ICML Conference 2024 Conference Paper

Pragmatic Feature Preferences: Learning Reward-Relevant Preferences from Human Input

  • Andi Peng
  • Yuying Sun
  • Tianmin Shu
  • David Abel

Humans use context to specify preferences over behaviors, i.e., their reward functions. Yet, algorithms for inferring reward models from preference data do not take this social learning view into account. Inspired by pragmatic human communication, we study how to extract fine-grained data regarding why an example is preferred that is useful for learning an accurate reward model. We propose to enrich preference queries to ask (1) which features of a given example are preferable, in addition to (2) comparisons between objects. We derive an approach for learning from these feature-level preferences, both for cases where users specify which features are reward-relevant, and when users do not. We evaluate our approach on linear bandit settings in both visual and language-based domains. Results support the efficiency of our approach in quickly converging to accurate rewards with fewer comparisons than example-only labels. Finally, we validate the real-world applicability with a behavioral experiment on a mushroom foraging task. Our findings suggest that incorporating pragmatic feature preferences is a promising approach for more efficient user-aligned reward learning.
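
As a rough, hypothetical illustration of how feature-level labels could supplement pairwise comparisons when the reward is linear in features, consider the Python sketch below; the update rules and names are assumptions for exposition, not the paper's algorithm.

import numpy as np

def pairwise_update(w, phi_a, phi_b, a_preferred, lr=0.1):
    # Bradley-Terry-style update from one comparison between two examples
    # with feature vectors phi_a and phi_b.
    d = phi_a - phi_b
    p_a = 1.0 / (1.0 + np.exp(-w @ d))            # P(a preferred | w)
    grad = (1.0 - p_a) * d if a_preferred else -p_a * d
    return w + lr * grad

def feature_label_update(w, liked, disliked, lr=0.1):
    # Feature-level preferences: the user marks which feature indices made
    # the example preferable, directly nudging individual reward weights;
    # one query thus constrains more dimensions than a single comparison.
    w = w.copy()
    w[list(liked)] += lr
    w[list(disliked)] -= lr
    return w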

RLC Conference 2024 Conference Paper

Three Dogmas of Reinforcement Learning

  • David Abel
  • Mark K Ho
  • Anna Harutyunyan

Modern reinforcement learning has been conditioned by at least three dogmas. The first is the environment spotlight, which refers to our tendency to focus on modeling environments rather than agents. The second is our treatment of learning as finding the solution to a task, rather than adaptation. The third is the reward hypothesis, which states that all goals and purposes can be well thought of as maximization of a reward signal. These three dogmas shape much of what we think of as the science of reinforcement learning. While each of the dogmas has played an important role in developing the field, it is time we bring them to the surface and reflect on whether they belong as basic ingredients of our scientific paradigm. In order to realize the potential of reinforcement learning as a canonical frame for researching intelligent agents, we suggest that it is time we shed dogmas one and two entirely, and embrace a nuanced approach to the third.

RLJ Journal 2024 Journal Article

Three Dogmas of Reinforcement Learning

  • David Abel
  • Mark K Ho
  • Anna Harutyunyan

Modern reinforcement learning has been conditioned by at least three dogmas. The first is the environment spotlight, which refers to our tendency to focus on modeling environments rather than agents. The second is our treatment of learning as finding the solution to a task, rather than adaptation. The third is the reward hypothesis, which states that all goals and purposes can be well thought of as maximization of a reward signal. These three dogmas shape much of what we think of as the science of reinforcement learning. While each of the dogmas has played an important role in developing the field, it is time we bring them to the surface and reflect on whether they belong as basic ingredients of our scientific paradigm. In order to realize the potential of reinforcement learning as a canonical frame for researching intelligent agents, we suggest that it is time we shed dogmas one and two entirely, and embrace a nuanced approach to the third.

NeurIPS Conference 2023 Conference Paper

A Definition of Continual Reinforcement Learning

  • David Abel
  • Andre Barreto
  • Benjamin Van Roy
  • Doina Precup
  • Hado P. van Hasselt
  • Satinder Singh

In a standard view of the reinforcement learning problem, an agent’s goal is to efficiently identify a policy that maximizes long-term reward. However, this perspective is based on a restricted view of learning as finding a solution, rather than treating learning as endless adaptation. In contrast, continual reinforcement learning refers to the setting in which the best agents never stop learning. Despite the importance of continual reinforcement learning, the community lacks a simple definition of the problem that highlights its commitments and makes its primary concepts precise and clear. To this end, this paper is dedicated to carefully defining the continual reinforcement learning problem. We formalize the notion of agents that “never stop learning” through a new mathematical language for analyzing and cataloging agents. Using this new language, we define a continual learning agent as one that can be understood as carrying out an implicit search process indefinitely, and continual reinforcement learning as the setting in which the best agents are all continual learning agents. We provide two motivating examples, illustrating that traditional views of multi-task reinforcement learning and continual supervised learning are special cases of our definition. Collectively, these definitions and perspectives formalize many intuitive concepts at the heart of learning, and open new research pathways surrounding continual learning agents.

ICML Conference 2023 Conference Paper

Settling the Reward Hypothesis

  • Michael H. Bowling
  • John D. Martin
  • David Abel
  • Will Dabney

The reward hypothesis posits that, "all of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward)." We aim to fully settle this hypothesis. This will not conclude with a simple affirmation or refutation, but rather specify completely the implicit requirements on goals and purposes under which the hypothesis holds.

IJCAI Conference 2022 Conference Paper

On the Expressivity of Markov Reward (Extended Abstract)

  • David Abel
  • Will Dabney
  • Anna Harutyunyan
  • Mark K. Ho
  • Michael L. Littman
  • Doina Precup
  • Satinder Singh

Reward is the driving force for reinforcement-learning agents. We here set out to understand the expressivity of Markov reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of "task": (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to perform each task type, and correctly determine when no such reward function exists.

AAAI Conference 2021 Conference Paper

Lipschitz Lifelong Reinforcement Learning

  • Erwan Lecarpentier
  • David Abel
  • Kavosh Asadi
  • Yuu Jinnai
  • Emmanuel Rachelson
  • Michael L. Littman

We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the task space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with improved convergence rate. Further, we show the method to experience no negative transfer with high probability. We illustrate the benefits of the method in Lifelong RL experiments.
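
Schematically, the Lipschitz property established here has the form

$$ \lvert V^*_{M}(s) - V^*_{M'}(s) \rvert \;\le\; L \cdot d(M, M') \quad \text{for all states } s, $$

where $d$ is a metric between MDPs and $L$ a constant; the particular pseudometric and constant are specified in the paper. Roughly, transfer can then use the value function of a previously solved task, shifted by this bound, as a principled initialization for a nearby task.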

NeurIPS Conference 2021 Conference Paper

On the Expressivity of Markov Reward

  • David Abel
  • Will Dabney
  • Anna Harutyunyan
  • Mark K. Ho
  • Michael Littman
  • Doina Precup
  • Satinder Singh

Reward is the driving force for reinforcement-learning agents. This paper is dedicated to understanding the expressivity of reward as a way to capture tasks that we would want an agent to perform. We frame this study around three new abstract notions of “task” that might be desirable: (1) a set of acceptable behaviors, (2) a partial ordering over behaviors, or (3) a partial ordering over trajectories. Our main results prove that while reward can express many of these tasks, there exist instances of each task type that no Markov reward function can capture. We then provide a set of polynomial-time algorithms that construct a Markov reward function that allows an agent to optimize tasks of each of these three types, and correctly determine when no such reward function exists. We conclude with an empirical study that corroborates and illustrates our theoretical findings.
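
To see the flavour of the negative results (an illustration in the spirit of the paper, not a quotation of its proofs), take a two-state MDP with actions $\{a, b\}$ and suppose the acceptable behaviors are exactly $\pi_1 = (a \text{ in } s_1,\, b \text{ in } s_2)$ and $\pi_2 = (b \text{ in } s_1,\, a \text{ in } s_2)$. A policy is optimal exactly when it is greedy with respect to $Q^*$ in every state, so the set of optimal policies is closed under state-wise recombination: if $\pi_1$ and $\pi_2$ were both optimal, the recombined policy playing $a$ in both states would be optimal too. Hence no Markov reward function can make exactly $\{\pi_1, \pi_2\}$ the acceptable set.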

ICML Conference 2021 Conference Paper

Revisiting Peng's Q(λ) for Modern Reinforcement Learning

  • Tadashi Kozuno
  • Yunhao Tang
  • Mark Rowland 0001
  • Rémi Munos
  • Steven Kapturowski
  • Will Dabney
  • Michal Valko
  • David Abel

Off-policy multi-step reinforcement learning algorithms consist of conservative and non-conservative algorithms: the former actively cut traces, whereas the latter do not. Recently, Munos et al. (2016) proved the convergence of conservative algorithms to an optimal Q-function. In contrast, non-conservative algorithms are thought to be unsafe and have limited or no theoretical guarantees. Nonetheless, recent studies have shown that non-conservative algorithms empirically outperform conservative ones. Motivated by the empirical results and the lack of theory, we carry out theoretical analyses of Peng’s Q($\lambda$), a representative example of non-conservative algorithms. We prove that it also converges to an optimal policy, provided that the behavior policy slowly tracks a greedy policy in a way similar to conservative policy iteration. Such a result has been conjectured to be true but has not been proven. We also experiment with Peng’s Q($\lambda$) in complex continuous control tasks, confirming that Peng’s Q($\lambda$) often outperforms conservative algorithms despite its simplicity. These results indicate that Peng’s Q($\lambda$), which was thought to be unsafe, is a theoretically sound and practically effective algorithm.
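
In its common return-based presentation (standard notation, not quoted from the paper), Peng's Q($\lambda$) mixes uncorrected n-step returns that follow the behavior policy and bootstrap with a max:

$$ G_t^{\lambda} = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}, \qquad G_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k} R_{t+k+1} + \gamma^{n} \max_{a} Q(S_{t+n}, a). $$

The absence of importance-sampling corrections or trace cutting is exactly what makes the algorithm non-conservative, and what the paper shows is nevertheless sound when the behavior policy tracks the greedy policy slowly.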

AAAI Conference 2020 Conference Paper

People Do Not Just Plan, They Plan to Plan

  • Mark Ho
  • David Abel
  • Jonathan Cohen
  • Michael Littman
  • Thomas Griffiths

Planning is useful. It lets people take actions that have desirable long-term consequences. But, planning is hard. It requires thinking about consequences, which consumes limited computational and cognitive resources. Thus, people should plan their actions, but they should also be smart about how they deploy resources used for planning their actions. Put another way, people should also “plan their plans”. Here, we formulate this aspect of planning as a meta-reasoning problem and formalize it in terms of a recursive Bellman objective that incorporates both task rewards and information-theoretic planning costs. Our account makes quantitative predictions about how people should plan and meta-plan as a function of the overall structure of a task, which we test in two experiments with human participants. We find that people’s reaction times reflect a planned use of information processing, consistent with our account. This formulation of planning to plan provides new insight into the function of hierarchical planning, state abstraction, and cognitive control in both humans and machines.

ICML Conference 2020 Conference Paper

What can I do here? A Theory of Affordances in Reinforcement Learning

  • Khimya Khetarpal
  • Zafarali Ahmed
  • Gheorghe Comanici
  • David Abel
  • Doina Precup

Reinforcement learning algorithms usually assume that all actions are always available to an agent. However, both people and animals understand the general link between the features of their environment and the actions that are feasible. Gibson (1977) coined the term "affordances" to describe the fact that certain states enable an agent to do certain actions, in the context of embodied agents. In this paper, we develop a theory of affordances for agents who learn and plan in Markov Decision Processes. Affordances play a dual role in this case. On one hand, they allow faster planning, by reducing the number of actions available in any given situation. On the other hand, they facilitate more efficient and precise learning of transition models from data, especially when such models require function approximation. We establish these properties through theoretical results as well as illustrative examples. We also propose an approach to learn affordances and use it to estimate transition models that are simpler and generalize better.
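
A minimal sketch of the planning speed-up, assuming a tabular MDP with hypothetical container names: value iteration maximizes only over the afforded subset of actions in each state rather than the full action set.

import numpy as np

def affordance_value_iteration(P, R, afforded, gamma=0.95, iters=200):
    # P[s][a]: dict mapping next state -> probability; R[s][a]: reward.
    # afforded[s]: the (non-empty) subset of actions deemed feasible in s;
    # shrinking each max over actions is what accelerates planning.
    V = np.zeros(len(P))
    for _ in range(iters):
        for s in range(len(P)):
            V[s] = max(
                R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a].items())
                for a in afforded[s]
            )
    return V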

AAAI Conference 2019 Short Paper

A Theory of State Abstraction for Reinforcement Learning

  • David Abel

Reinforcement learning presents a challenging problem: agents must generalize experiences, efficiently explore the world, and learn from feedback that is delayed and often sparse, all while making use of a limited computational budget. Abstraction is essential to all of these endeavors. Through abstraction, agents can form concise models of both their surroundings and behavior, supporting effective decision making in diverse and complex environments. To this end, the goal of my doctoral research is to characterize the role abstraction plays in reinforcement learning, with a focus on state abstraction. I offer three desiderata articulating what it means for a state abstraction to be useful, and introduce classes of state abstractions that provide a partial path toward satisfying these desiderata. Collectively, I develop theory for state abstractions that (1) preserve near-optimal behavior, (2) can be learned and computed efficiently, and (3) lower the time or data needed to make effective decisions. I close by discussing extensions of these results to an information-theoretic paradigm of abstraction, and an extension to hierarchical abstraction that enjoys the same desirable properties.

ICML Conference 2019 Conference Paper

Discovering Options for Exploration by Minimizing Cover Time

  • Yuu Jinnai
  • Jee Won Park
  • David Abel
  • George Konidaris 0001

One of the main challenges in reinforcement learning is solving tasks with sparse reward. We show that the difficulty of discovering a distant rewarding state in an MDP is bounded by the expected cover time of a random walk over the graph induced by the MDP’s transition dynamics. We therefore propose to accelerate exploration by constructing options that minimize cover time. We introduce a new option discovery algorithm that diminishes the expected cover time by connecting the most distant states in the state-space graph with options. We show empirically that the proposed algorithm improves learning in several domains with sparse rewards.
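
The brute-force sketch below illustrates the "connect the most distant states" idea on the graph induced by the transition dynamics; the adjacency-list representation is an assumption, and the paper's actual discovery algorithm is more refined than exhaustive search.

from collections import deque

def farthest_pair(adj):
    # adj: dict mapping each state to a list of neighboring states.
    def bfs(src):
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        return dist
    best = (None, None, -1)
    for s in adj:
        dist = bfs(s)
        t = max(dist, key=dist.get)
        if dist[t] > best[2]:
            best = (s, t, dist[t])
    return best[0], best[1]   # endpoints for a new option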

ICML Conference 2019 Conference Paper

Finding Options that Minimize Planning Time

  • Yuu Jinnai
  • David Abel
  • D. Ellis Hershkowitz
  • Michael L. Littman
  • George Konidaris 0001

We formalize the problem of selecting the optimal set of options for planning as that of computing the smallest set of options so that planning converges in less than a given maximum of value-iteration passes. We first show that the problem is NP-hard, even if the task is constrained to be deterministic—the first such complexity result for option discovery. We then present the first polynomial-time boundedly suboptimal approximation algorithm for this setting, and empirically evaluate it against both the optimal options and a representative collection of heuristic approaches in simple grid-based domains.

AAAI Conference 2019 Conference Paper

State Abstraction as Compression in Apprenticeship Learning

  • David Abel
  • Dilip Arumugam
  • Kavosh Asadi
  • Yuu Jinnai
  • Michael L. Littman
  • Lawson L.S. Wong

State abstraction can give rise to models of environments that are both compressed and useful, thereby enabling efficient sequential decision making. In this work, we offer the first formalism and analysis of the trade-off between compression and performance made in the context of state abstraction for Apprenticeship Learning. We build on Rate-Distortion theory, the classic Blahut-Arimoto algorithm, and the Information Bottleneck method to develop an algorithm for computing state abstractions that approximate the optimal tradeoff between compression and performance. We illustrate the power of this algorithmic structure to offer insights into effective abstraction, compression, and reinforcement learning through a mixture of analysis, visuals, and experimentation.
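
For concreteness, a generic Blahut-Arimoto iteration for a compression-performance trade-off of this flavour is sketched below; treating the distortion matrix as a value-loss proxy, and the names throughout, are assumptions rather than the paper's exact formulation.

import numpy as np

def blahut_arimoto(p_s, d, beta, iters=100):
    # p_s: marginal over ground states, shape (S,).
    # d[s, z]: distortion of mapping ground state s to abstract state z
    #          (e.g., some proxy for value loss under the abstraction).
    # beta:    trade-off between compression and performance.
    S, Z = d.shape
    q_z = np.full(Z, 1.0 / Z)                    # abstract-state marginal
    for _ in range(iters):
        q_zs = q_z[None, :] * np.exp(-beta * d)  # unnormalized q(z | s)
        q_zs /= q_zs.sum(axis=1, keepdims=True)
        q_z = p_s @ q_zs                         # re-estimate the marginal
    return q_zs                                  # soft state abstraction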

IJCAI Conference 2019 Conference Paper

The Expected-Length Model of Options

  • David Abel
  • John Winder
  • Marie desJardins
  • Michael Littman

Effective options can make reinforcement learning easier by enhancing an agent's ability to both explore in a targeted manner and plan further into the future. However, learning an appropriate model of an option's dynamics is hard, requiring the estimation of a highly parameterized probability distribution. This paper introduces and motivates the Expected-Length Model (ELM) for options, an alternative model of transition dynamics. We prove ELM is a (biased) estimator of the traditional Multi-Time Model (MTM), but provide a non-vacuous bound on their deviation. We further prove that, in stochastic shortest path problems, ELM induces a value function that is sufficiently similar to the one induced by MTM, and is thus capable of supporting near-optimal behavior. We explore the practical utility of this option model experimentally, finding consistent support for the thesis that ELM is a suitable replacement for MTM. In some cases, we find ELM leads to more sample-efficient learning, especially when options are arranged in a hierarchy.
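
Schematically (a reading of the contrast drawn in the abstract, not the paper's exact notation), the multi-time model discounts each possible option duration separately, while ELM collapses the duration distribution to its mean:

$$ T_{\mathrm{MTM}}(s' \mid s, o) = \sum_{k=1}^{\infty} \gamma^{k} \Pr(s',\, k \mid s, o), \qquad T_{\mathrm{ELM}}(s' \mid s, o) = \gamma^{\bar{d}} \Pr(s' \mid s, o), $$

where $\bar{d}$ is the option's expected duration from $s$. Estimating one scalar $\bar{d}$ instead of a full joint distribution over outcomes and durations is the source of both the bias and the statistical savings.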

RLDM Conference 2019 Conference Abstract

Value Preserving State-Action Abstractions

  • David Abel
  • Nate Umbanhowar
  • Dilip Arumugam
  • Doina Precup
  • Michael L. Littman

We here introduce combinations of state abstractions and options that preserve representation of near-optimal policies. We define φ-relative options, a general formalism for analyzing the value loss of options paired with a state abstraction, and prove that there exist classes of φ-relative options that preserve near-optimal behavior in any MDP.

AAAI Conference 2018 Conference Paper

Bandit-Based Solar Panel Control

  • David Abel
  • Edward Williams
  • Stephen Brawner
  • Emily Reif
  • Michael Littman

Solar panels sustainably harvest energy from the sun. To improve performance, panels are often equipped with a tracking mechanism that computes the sun’s position in the sky throughout the day. Based on the tracker’s estimate of the sun’s location, a controller orients the panel to minimize the angle of incidence between solar radiant energy and the photovoltaic cells on the surface of the panel, increasing total energy harvested. Prior work has developed efficient tracking algorithms that accurately compute the sun’s location to facilitate solar tracking and control. However, always pointing a panel directly at the sun does not account for diffuse irradiance in the sky, reflected irradiance from the ground and surrounding surfaces, power required to reorient the panel, shading effects from neighboring panels and foliage, or changing weather conditions (such as clouds), all of which are contributing factors to the total energy harvested by a fleet of solar panels. In this work, we show that a bandit-based approach can increase the total energy harvested by solar panels by learning to dynamically account for such other factors. Our contribution is threefold: (1) the development of a test bed based on typical solar and irradiance models for experimenting with solar panel control using a variety of learning methods, (2) simulated validation that bandit algorithms can effectively learn to control solar panels, and (3) the design and construction of an intelligent solar panel prototype that learns to angle itself using bandit algorithms.
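
A minimal bandit controller in the spirit of the paper could be a UCB1 loop over a discretized set of orientations; the discretization and the harvest interface are assumptions for illustration.

import math

def ucb1_control(angles, harvest, rounds=10000):
    # angles: candidate panel orientations; harvest(angle) returns the
    # energy observed over one control period (a hypothetical interface).
    n = [0] * len(angles)
    mean = [0.0] * len(angles)
    for t in range(1, rounds + 1):
        if t <= len(angles):
            i = t - 1                              # try each arm once
        else:
            i = max(range(len(angles)),
                    key=lambda j: mean[j] + math.sqrt(2 * math.log(t) / n[j]))
        r = harvest(angles[i])
        n[i] += 1
        mean[i] += (r - mean[i]) / n[i]            # incremental mean update
    return angles[max(range(len(angles)), key=lambda j: mean[j])]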

ICML Conference 2018 Conference Paper

Policy and Value Transfer in Lifelong Reinforcement Learning

  • David Abel
  • Yuu Jinnai
  • Yue Guo 0003
  • George Konidaris 0001
  • Michael L. Littman

We consider the problem of how best to use prior experience to bootstrap lifelong learning, where an agent faces a series of task instances drawn from some task distribution. First, we identify the initial policy that optimizes expected performance over the distribution of tasks for increasingly complex classes of policy and task distributions. We empirically demonstrate the relative performance of each policy class’ optimal element in a variety of simple task distributions. We then consider value-function initialization methods that preserve PAC guarantees while simultaneously minimizing the learning required in two learning algorithms, yielding MaxQInit, a practical new method for value-function-based transfer. We show that MaxQInit performs well in simple lifelong RL experiments.
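
A tabular sketch of the max-based initialization the abstract describes (the representation and names are assumptions): each Q-value for a new task starts at the maximum of the corresponding values across previously solved tasks, keeping the initialization optimistic for any task drawn from the same distribution.

def max_q_init(solved_q_tables, states, actions):
    # solved_q_tables: list of dicts mapping (state, action) -> Q-value,
    # one per previously solved task instance.
    return {
        (s, a): max(q[(s, a)] for q in solved_q_tables)
        for s in states for a in actions
    }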

ICML Conference 2018 Conference Paper

State Abstractions for Lifelong Reinforcement Learning

  • David Abel
  • Dilip Arumugam
  • Lucas Lehnert
  • Michael L. Littman

In lifelong reinforcement learning, agents must effectively transfer knowledge across tasks while simultaneously addressing exploration, credit assignment, and generalization. State abstraction can help overcome these hurdles by compressing the representation used by an agent, thereby reducing the computational and statistical burdens of learning. To this end, we here develop theory to compute and use state abstractions in lifelong reinforcement learning. We introduce two new classes of abstractions: (1) transitive state abstractions, whose optimal form can be computed efficiently, and (2) PAC state abstractions, which are guaranteed to hold with respect to a distribution of tasks. We show that the joint family of transitive PAC abstractions can be acquired efficiently, preserve near-optimal behavior, and experimentally reduce sample complexity in simple domains, thereby yielding a family of desirable abstractions for use in lifelong reinforcement learning. Along with these positive results, we show that there are pathological cases where state abstractions can negatively impact performance.
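
To see why transitivity buys efficiency: if the pairwise compatibility predicate is transitive, merging any related pair is globally safe, so a single union-find pass recovers the abstract states. A small sketch, with the predicate left as an assumption:

def cluster_transitive(states, related):
    # related(s, t) -> bool is assumed transitive (and symmetric), so
    # greedy merging cannot create an invalid cluster.
    parent = {s: s for s in states}
    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]   # path halving
            s = parent[s]
        return s
    for i, s in enumerate(states):
        for t in states[i + 1:]:
            if related(s, t):
                parent[find(t)] = find(s)
    return {s: find(s) for s in states}     # state -> abstract state id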

RLDM Conference 2017 Conference Abstract

Improving Solar Panel Efficiency Using Reinforcement Learning

  • David Abel
  • Emily Reif
  • Michael Littman

Solar panels sustainably harvest energy from the sun. To improve performance, panels are often equipped with a tracking mechanism that computes the sun’s position in the sky throughout the day. Based on the tracker’s estimate of the sun’s location, a controller orients the panel to minimize the angle of incidence between solar radiant energy and the photovoltaic cells on the surface of the panel, increasing total energy harvested. Prior work has developed efficient tracking algorithms that accurately compute the sun’s location to facilitate solar tracking and control. However, always pointing a panel directly at the sun does not account for diffuse irradiance in the sky, reflected irradiance from the ground and surrounding surfaces, or changing weather conditions (such as cloud coverage), all of which are contributing factors to the total energy harvested by a solar panel. In this work, we show that a reinforcement learning (RL) approach can increase the total energy harvested by solar panels by learning to dynamically account for such other factors. We advocate for the use of RL for solar panel control due to its effectiveness, negligible cost, and versatility. Our contribution is twofold: (1) an adaptation of typical RL algorithms to the task of improving solar panel performance, and (2) an experimental validation in simulation based on typical solar and irradiance models for experimenting with solar panel control. We evaluate the utility of various RL approaches compared to an idealized controller, an efficient state-of-the-art direct tracking algorithm, and a fixed panel in our simulated environment. We experiment across different time scales, in different places on earth, and with dramatically different percepts (sun coordinates and raw images of the sky with and without clouds), consistently demonstrating that simple RL algorithms improve over existing baselines.

ICML Conference 2016 Conference Paper

Near Optimal Behavior via Approximate State Abstraction

  • David Abel
  • D. Ellis Hershkowitz
  • Michael L. Littman

The combinatorial explosion that plagues planning and reinforcement learning (RL) algorithms can be moderated using state abstraction. Prohibitively large task representations can be condensed such that essential information is preserved, and consequently, solutions are tractably computable. However, exact abstractions, which treat only fully-identical situations as equivalent, fail to present opportunities for abstraction in environments where no two situations are exactly alike. In this work, we investigate approximate state abstractions, which treat nearly-identical situations as equivalent. We present theoretical guarantees of the quality of behaviors derived from four types of approximate abstractions. Additionally, we empirically demonstrate that approximate abstractions lead to a reduction in task complexity and bounded loss of optimality of behavior in a variety of environments.
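
One familiar instance (a greedy illustrative variant, not necessarily one of the paper's four types) aggregates states whose optimal Q-values agree to within epsilon for every action:

import numpy as np

def q_epsilon_abstraction(Q, eps):
    # Q: array of shape (num_states, num_actions) holding Q* values.
    phi, anchors = {}, []          # state -> abstract id; cluster anchors
    for s in range(Q.shape[0]):
        for z, r in enumerate(anchors):
            if np.max(np.abs(Q[s] - Q[r])) <= eps:
                phi[s] = z
                break
        else:                      # no close anchor: open a new cluster
            phi[s] = len(anchors)
            anchors.append(s)
    return phi

Larger eps yields a smaller abstract MDP at the price of a larger (but bounded) loss in the optimality of the resulting behavior.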

ICAPS Conference 2015 Conference Paper

Goal-Based Action Priors

  • David Abel
  • D. Ellis Hershkowitz
  • Gabriel Barth-Maron
  • Stephen Brawner
  • Kevin O'Farrell
  • James MacGlashan
  • Stefanie Tellex

Robots that interact with people must flexibly respond to requests by planning in stochastic state spaces that are often too large to solve for optimal behavior. In this work, we develop a framework for goal and state dependent action priors that can be used to prune away irrelevant actions based on the robot’s current goal, thereby greatly accelerating planning in a variety of complex stochastic environments. Our framework allows these goal-based action priors to be specified by an expert or to be learned from prior experience in related problems. We evaluate our approach in the video game Minecraft, whose complexity makes it an effective robot simulator. We also evaluate our approach in a robot cooking domain that is executed on a two-handed manipulator robot. In both cases, goal-based action priors enhance baseline planners by dramatically reducing the time taken to find a near-optimal plan.
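
A minimal sketch of prior-based pruning, with the prior's interface a hypothetical assumption: keep only the actions whose goal-conditioned prior probability clears a threshold, and hand the reduced set to the planner.

def prune_actions(actions, state, goal, prior, threshold=0.05):
    # prior(a, state, goal) returns the learned probability that action a
    # is relevant to the current goal in this state.
    kept = [a for a in actions if prior(a, state, goal) >= threshold]
    return kept or list(actions)   # never prune the set to empty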