Cameron Allen Papers

RLJ Journal 2025 Journal Article

Benchmarking Partial Observability in Reinforcement Learning with a Suite of Memory-Improvable Domains

Ruo Yu Tao
Kaicheng Guo
Cameron Allen
George Konidaris

Mitigating partial observability is a necessary but challenging task for general reinforcement learning algorithms. To improve an algorithm's ability to mitigate partial observability, researchers need comprehensive benchmarks to gauge progress. Most algorithms tackling partial observability are only evaluated on benchmarks with simple forms of state aliasing, such as feature masking and Gaussian noise. These existing benchmarks do not represent the many forms of partial observability seen in real domains, such as visual occlusion and unknown opponent intent. We argue that a partially observable benchmark should have two key properties. The first is coverage in its forms of partial observability, to ensure an algorithm's generalizability. The second is a large gap between the performance of a memoryless agent and an agent with more state information. This gap implies that an environment is memory improvable, meaning that performance gains in the domain come from an algorithm's ability to learn memory for mitigating partial observability rather than from other factors. We introduce best-practice experimental guidelines for benchmarking reinforcement learning under partial observability, as well as the open-source library POBAX: Partially Observable Benchmarks in JAX. We characterize the types of partial observability present in various environments and select representative environments for our benchmark. These environments include localization and mapping, visual control, games, and more. Additionally, these tasks are all memory improvable and require hard-to-learn memory functions, providing a concrete signal for partial observability research. This framework includes recommended hyperparameters for out-of-the-box evaluation, as well as highly performant environments implemented in JAX for GPU-scalable experimentation.

RLC Conference 2025 Conference Paper

Benchmarking Partial Observability in Reinforcement Learning with a Suite of Memory-Improvable Domains

Ruo Yu Tao
Kaicheng Guo
Cameron Allen
George Konidaris

Mitigating partial observability is a necessary but challenging task for general reinforcement learning algorithms. To improve an algorithm's ability to mitigate partial observability, researchers need comprehensive benchmarks to gauge progress. Most algorithms tackling partial observability are only evaluated on benchmarks with simple forms of state aliasing, such as feature masking and Gaussian noise. These existing benchmarks do not represent the many forms of partial observability seen in real domains, such as visual occlusion and unknown opponent intent. We argue that a partially observable benchmark should have two key properties. The first is coverage in its forms of partial observability, to ensure an algorithm's generalizability. The second is a large gap between the performance of a memoryless agent and an agent with more state information. This gap implies that an environment is memory improvable, meaning that performance gains in the domain come from an algorithm's ability to learn memory for mitigating partial observability rather than from other factors. We introduce best-practice experimental guidelines for benchmarking reinforcement learning under partial observability, as well as the open-source library POBAX: Partially Observable Benchmarks in JAX. We characterize the types of partial observability present in various environments and select representative environments for our benchmark. These environments include localization and mapping, visual control, games, and more. Additionally, these tasks are all memory improvable and require hard-to-learn memory functions, providing a concrete signal for partial observability research. This framework includes recommended hyperparameters for out-of-the-box evaluation, as well as highly performant environments implemented in JAX for GPU-scalable experimentation.

RLC Conference 2025 Conference Paper

Focused Skill Discovery: Learning to Control Specific State Variables while Minimizing Side Effects

Jonathan Colaço Carr
Qinyi Sun
Cameron Allen

Skills are essential for unlocking higher levels of problem solving. A common approach to discovering these skills is to learn ones that reliably reach different states, thus empowering the agent to control its environment. However, existing skill discovery algorithms often overlook the natural state variables present in many reinforcement learning problems, meaning that the discovered skills lack control of specific state variables. This can significantly hamper exploration efficiency, make skills more challenging to learn with, and lead to negative side effects in downstream tasks when the goal is under-specified. We introduce a general method that enables these skill discovery algorithms to learn focused skills: skills that target and control specific state variables. Our approach improves state space coverage by a factor of three, unlocks new learning capabilities, and automatically avoids negative side effects in downstream tasks.

RLJ Journal 2025 Journal Article

Focused Skill Discovery: Learning to Control Specific State Variables while Minimizing Side Effects

Jonathan Colaço Carr
Qinyi Sun
Cameron Allen

Skills are essential for unlocking higher levels of problem solving. A common approach to discovering these skills is to learn ones that reliably reach different states, thus empowering the agent to control its environment. However, existing skill discovery algorithms often overlook the natural state variables present in many reinforcement learning problems, meaning that the discovered skills lack control of specific state variables. This can significantly hamper exploration efficiency, make skills more challenging to learn with, and lead to negative side effects in downstream tasks when the goal is under-specified. We introduce a general method that enables these skill discovery algorithms to learn focused skills: skills that target and control specific state variables. Our approach improves state space coverage by a factor of three, unlocks new learning capabilities, and automatically avoids negative side effects in downstream tasks.

NeurIPS Conference 2025 Conference Paper

Skill-Driven Neurosymbolic State Abstractions

Alper Ahmetoglu
Steven James
Cameron Allen
Sam Lobel
David Abel
George Konidaris

We consider how to construct state abstractions compatible with a given set of abstract actions, to obtain a well-formed abstract Markov decision process (MDP). We show that the Bellman equation suggests that abstract states should represent distributions over states in the ground MDP; we characterize the conditions under which the resulting process is Markov and approximately model-preserving, derive algorithms for constructing and planning with the abstract MDP, and apply them to a visual maze task. We generalize these results to the factored actions case, characterizing the conditions that result in factored abstract states and apply the resulting algorithm to Montezuma's Revenge. These results provide a powerful and principled framework for constructing neurosymbolic abstract Markov decision processes.

NeurIPS Conference 2024 Conference Paper

Evidence of Learned Look-Ahead in a Chess-Playing Neural Network

Erik Jenner
Shreyas Kapur
Vasil Georgiev
Cameron Allen
Scott Emmons
Stuart Russell

Do neural networks learn to implement algorithms such as look-ahead or search "in the wild"? Or do they rely purely on collections of simple heuristics? We present evidence of learned look-ahead in the policy and value network of Leela Chess Zero, the currently strongest deep neural chess engine. We find that Leela internally represents future optimal moves and that these representations are crucial for its final output in certain board states. Concretely, we exploit the fact that Leela is a transformer that treats every chessboard square like a token in language models, and give three lines of evidence: (1) activations on certain squares of future moves are unusually important causally; (2) we find attention heads that move important information "forward and backward in time," e.g., from squares of future moves to squares of earlier ones; and (3) we train a simple probe that can predict the optimal move 2 turns ahead with 92% accuracy (in board states where Leela finds a single best line). These findings are clear evidence of learned look-ahead in neural networks and might be a step towards a better understanding of their capabilities.

NeurIPS Conference 2024 Conference Paper

Mitigating Partial Observability in Sequential Decision Processes via the Lambda Discrepancy

Cameron Allen
Aaron Kirtland
Ruo Yu Tao
Sam Lobel
Daniel Scott
Nicholas Petrocelli
Omer Gottesman
Ronald Parr

Reinforcement learning algorithms typically rely on the assumption that the environment dynamics and value function can be expressed in terms of a Markovian state representation. However, when state information is only partially observable, how can an agent learn such a state representation, and how can it detect when it has found one? We introduce a metric that can accomplish both objectives, without requiring access to, or knowledge of, an underlying, unobservable state space. Our metric, the λ-discrepancy, is the difference between two distinct temporal difference (TD) value estimates, each computed using TD(λ) with a different value of λ. Since TD(λ=0) makes an implicit Markov assumption and TD(λ=1) does not, a discrepancy between these estimates is a potential indicator of a non-Markovian state representation. Indeed, we prove that the λ-discrepancy is exactly zero for all Markov decision processes and almost always non-zero for a broad class of partially observable environments. We also demonstrate empirically that, once detected, minimizing the λ-discrepancy can help with learning a memory function to mitigate the corresponding partial observability. We then train a reinforcement learning agent that simultaneously constructs two recurrent value networks with different λ parameters and minimizes the difference between them as an auxiliary loss. The approach scales to challenging partially observable domains, where the resulting agent frequently performs significantly better (and never performs worse) than a baseline recurrent agent with only a single value network.
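
To make the idea of comparing TD(λ=0) with TD(λ=1) concrete, the following is a minimal sketch on a toy deterministic chain, not the paper's implementation. It assumes a fixed policy, a three-step episode in which the first and last hidden states share the observation "x", a discount of 0.5, and uniform visitation weighting over aliased states; the chain, the function names, and the per-observation difference reported here are all illustrative choices. TD(λ=1) is computed as the every-visit Monte Carlo return averaged per observation, and TD(λ=0) as the fixed point of one-step bootstrapped targets averaged per observation.

```python
GAMMA = 0.5  # illustrative discount factor (assumption)

# Hypothetical chain of hidden states under a fixed policy.
# Each entry is (observation, reward received when leaving that state).
# The first and last hidden states are aliased: both emit "x".
CHAIN = [("x", 0.0), ("y", 0.0), ("x", 1.0)]


def mc_values(chain, gamma):
    """TD(lambda=1): average discounted return per observation."""
    returns = []
    g = 0.0
    for _, reward in reversed(chain):
        g = reward + gamma * g
        returns.append(g)
    returns.reverse()
    totals, counts = {}, {}
    for (obs, _), g in zip(chain, returns):
        totals[obs] = totals.get(obs, 0.0) + g
        counts[obs] = counts.get(obs, 0) + 1
    return {obs: totals[obs] / counts[obs] for obs in totals}


def td0_values(chain, gamma, sweeps=200):
    """TD(lambda=0): fixed point of averaged one-step bootstrapped targets."""
    values = {obs: 0.0 for obs, _ in chain}
    for _ in range(sweeps):
        targets = {}
        for i, (obs, reward) in enumerate(chain):
            # Bootstrap from the value of the NEXT OBSERVATION (0 at episode end).
            next_value = values[chain[i + 1][0]] if i + 1 < len(chain) else 0.0
            targets.setdefault(obs, []).append(reward + gamma * next_value)
        values = {obs: sum(t) / len(t) for obs, t in targets.items()}
    return values


def lambda_discrepancy(chain, gamma):
    """Per-observation difference between the two value estimates."""
    mc = mc_values(chain, gamma)
    td0 = td0_values(chain, gamma)
    return {obs: mc[obs] - td0[obs] for obs in mc}


if __name__ == "__main__":
    print("aliased chain:", lambda_discrepancy(CHAIN, GAMMA))
    markov_chain = [("a", 0.0), ("y", 0.0), ("x", 1.0)]
    print("markov chain: ", lambda_discrepancy(markov_chain, GAMMA))
```

On the aliased chain the two estimates disagree for both observations, while giving every hidden state its own observation drives the difference to zero, which matches the qualitative claim in the abstract for this one toy case.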

PRL Workshop 2023 Workshop Paper

Task Scoping: Generating Task-Specific Simplifications of Open-Scope Planning Problems

Michael Fishman
Nishanth Kumar
Cameron Allen
Natasha Danas
Michael Littman
Stefanie Tellex
George Konidaris

A general-purpose agent must learn an open-scope world model: one rich enough to tackle any of the wide range of tasks it may be asked to solve over its operational lifetime. This stands in contrast with typical planning approaches, where the scope of a model is limited to a specific family of tasks that share significant structure. Unfortunately, planning to solve any specific task within an open-scope model is computationally intractable, even for state-of-the-art methods, due to the many states and actions that are necessarily present in the model but irrelevant to that problem. We propose task scoping: a method that exploits knowledge of the initial state, goal conditions, and transition system to automatically and efficiently remove provably irrelevant variables and actions from grounded planning problems. Our approach leverages causal link analysis and backwards reachability over state variables (rather than states) along with operator merging (when effects on relevant variables are identical). Using task scoping as a pre-planning step can shrink the search space by orders of magnitude and dramatically decrease planning time. We empirically demonstrate that these improvements occur across a variety of open-scope domains, including Minecraft, where our approach reduces search time by a factor of 75 for a state-of-the-art numeric planner, even after including the time required for task scoping itself.
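
The backwards-reachability step over state variables can be illustrated with a deliberately simplified sketch. It is not the task scoping algorithm itself: it ignores the initial state, causal link analysis, and operator merging, and simply grows a set of relevant variables backwards from the goal. The operator names, variable names, and the (preconditions, effects) encoding are hypothetical.

```python
def relevant_variables(operators, goal_vars):
    """Grow the set of relevant variables backwards from the goal.

    `operators` maps a name to (precondition_vars, effect_vars). A variable
    becomes relevant if it is a precondition of an operator that affects a
    variable already known to be relevant.
    """
    relevant = set(goal_vars)
    changed = True
    while changed:
        changed = False
        for pre, eff in operators.values():
            if relevant & eff and not pre <= relevant:
                relevant |= pre
                changed = True
    return relevant


def scoped_operators(operators, relevant):
    """Keep only operators that affect at least one relevant variable."""
    return {name for name, (_, eff) in operators.items() if relevant & eff}


# Hypothetical toy domain: only the first two operators matter for the goal.
OPERATORS = {
    "mine_ore": ({"has_pickaxe"}, {"has_ore"}),
    "smelt": ({"has_ore", "has_fuel"}, {"has_ingot"}),
    "plant_flower": ({"has_seed"}, {"garden_pretty"}),
}

if __name__ == "__main__":
    rel = relevant_variables(OPERATORS, {"has_ingot"})
    print(sorted(rel))
    print(sorted(scoped_operators(OPERATORS, rel)))
```

In this toy domain the flower-planting operator and its variables never become relevant to the goal, so a planner given only the scoped operators searches a smaller space.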

AAAI Conference 2022 Conference Paper

Optimistic Initialization for Exploration in Continuous Control

Sam Lobel
Omer Gottesman
Cameron Allen
Akhil Bagaria
George Konidaris

Optimistic initialization underpins many theoretically sound exploration schemes in tabular domains; however, in the deep function approximation setting, optimism can quickly disappear if initialized naïvely. We propose a framework for more effectively incorporating optimistic initialization into reinforcement learning for continuous control. Our approach uses metric information about the state-action space to estimate which transitions are still unexplored, and explicitly maintains the initial Q-value optimism for the corresponding state-action pairs. We also develop methods for efficiently approximating these training objectives, and for incorporating domain knowledge into the optimistic envelope to improve sample efficiency. We empirically evaluate these approaches on a variety of hard exploration problems in continuous control, where our method outperforms existing exploration techniques.

IJCAI Conference 2021 Conference Paper

Efficient Black-Box Planning Using Macro-Actions with Focused Effects

Cameron Allen
Michael Katz
Tim Klinger
George Konidaris
Matthew Riemer
Gerald Tesauro

The difficulty of deterministic planning increases exponentially with search-tree depth. Black-box planning presents an even greater challenge, since planners must operate without an explicit model of the domain. Heuristics can make search more efficient, but goal-aware heuristics for black-box planning usually rely on goal counting, which is often quite uninformative. In this work, we show how to overcome this limitation by discovering macro-actions that make the goal-count heuristic more accurate. Our approach searches for macro-actions with focused effects (i.e., macros that modify only a small number of state variables), which align well with the assumptions made by the goal-count heuristic. Focused macros dramatically improve black-box planning efficiency across a wide range of planning domains, sometimes beating even state-of-the-art planners with access to a full domain model.
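
The two quantities this abstract relies on, the goal-count heuristic and the number of state variables a macro-action modifies, can be sketched in a few lines. This is only an illustration of the definitions, not the paper's macro search procedure; the tuple state encoding, the two toy actions, and the function names are assumptions made for the example.

```python
def goal_count(state, goal):
    """Goal-count heuristic: number of goal variables not yet satisfied.

    `state` is a tuple of variable values; `goal` maps variable index to
    the required value.
    """
    return sum(1 for var, val in goal.items() if state[var] != val)


def apply_macro(state, macro):
    """Apply a macro-action (a sequence of black-box actions) to a state."""
    for action in macro:
        state = action(state)
    return state


def effect_size(state, macro):
    """Number of state variables the macro changes when applied to `state`."""
    result = apply_macro(state, macro)
    return sum(1 for before, after in zip(state, result) if before != after)


# Hypothetical black-box actions on a 4-variable state.
def swap01(state):
    """Swap the first two variables (a focused effect: touches 2 variables)."""
    return (state[1], state[0]) + state[2:]


def rotate(state):
    """Shift every variable one position left (touches every variable)."""
    return state[1:] + state[:1]


if __name__ == "__main__":
    solved = (0, 1, 2, 3)
    goal = dict(enumerate(solved))
    print(goal_count((1, 0, 2, 3), goal))
    print(effect_size(solved, [swap01]), effect_size(solved, [rotate]))
```

A macro such as `[swap01]` changes few variables, so each application moves the goal count by a small, predictable amount, whereas `[rotate]` disturbs every variable at once.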

NeurIPS Conference 2021 Conference Paper

Learning Markov State Abstractions for Deep Reinforcement Learning

Cameron Allen
Neev Parikh
Omer Gottesman
George Konidaris

A fundamental assumption of reinforcement learning in Markov decision processes (MDPs) is that the relevant decision process is, in fact, Markov. However, when MDPs have rich observations, agents typically learn by way of an abstract state representation, and such representations are not guaranteed to preserve the Markov property. We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. Our novel training objective is compatible with both online and offline training: it does not require a reward signal, but agents can capitalize on reward information when available. We empirically evaluate our approach on a visual gridworld domain and a set of continuous control benchmarks. Our approach learns representations that capture the underlying structure of the domain and lead to improved sample efficiency over state-of-the-art deep reinforcement learning with visual features, often matching or exceeding the performance achieved with hand-designed compact state information.
