Arrow Research

Author name cluster

Abhinav Bhatia

Papers that may be associated with this exact author name in Arrow. This page groups case-insensitive exact matches on the name; it is not a full author-identity disambiguation profile.

6 papers
2 author rows

Possible papers (6)

RLC 2025 · Conference Paper

RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$

  • Abhinav Bhatia
  • Samer B. Nashed
  • Shlomo Zilberstein

Meta reinforcement learning (Meta-RL) methods such as RL$^2$ have emerged as promising approaches for learning data-efficient RL algorithms tailored to a given task distribution. However, they show poor asymptotic performance and struggle with out-of-distribution tasks because they rely on sequence models, such as recurrent neural networks or transformers, to process experiences rather than summarizing them with general-purpose RL components such as value functions. In contrast, traditional RL algorithms are data-inefficient because they do not use domain knowledge, but they do converge to an optimal policy in the limit. We propose RL$^3$, a principled hybrid approach that incorporates action-values, learned per task via traditional RL, into the inputs to Meta-RL. We show that RL$^3$ earns greater cumulative reward in the long term than RL$^2$, drastically reduces meta-training time, and generalizes better to out-of-distribution tasks. Experiments are conducted on Meta-RL benchmarks and custom discrete domains that exhibit a range of short-term, long-term, and complex dependencies.
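
The core mechanism is concrete enough to sketch. Below is a minimal illustration, assuming a small tabular task: ordinary Q-learning runs inside each sampled task, and its action-value estimates are appended to the usual RL$^2$ inputs. TaskQLearner and rl3_input are illustrative names, not the authors' implementation.

```python
# Hedged sketch of the RL^3 input augmentation described in the abstract.
import numpy as np

class TaskQLearner:
    """General-purpose per-task value learner, reset for every sampled task."""
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99):
        self.q = np.zeros((n_states, n_actions))
        self.lr, self.gamma = lr, gamma

    def update(self, s, a, r, s_next, done):
        target = r + (0.0 if done else self.gamma * self.q[s_next].max())
        self.q[s, a] += self.lr * (target - self.q[s, a])

def rl3_input(obs_onehot, prev_act_onehot, prev_reward, done, q_learner, s):
    # RL^2 feeds only the first four pieces to its sequence model; RL^3 also
    # appends Q(s, .) as a general-purpose summary of in-task experience.
    return np.concatenate([obs_onehot, prev_act_onehot,
                           np.array([prev_reward, float(done)]),
                           q_learner.q[s]])
```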

RLJ 2025 · Journal Article

RL$^3$: Boosting Meta Reinforcement Learning via RL inside RL$^2$

  • Abhinav Bhatia
  • Samer B. Nashed
  • Shlomo Zilberstein

Meta reinforcement learning (Meta-RL) methods such as RL$^2$ have emerged as promising approaches for learning data-efficient RL algorithms tailored to a given task distribution. However, they show poor asymptotic performance and struggle with out-of-distribution tasks because they rely on sequence models, such as recurrent neural networks or transformers, to process experiences rather than summarizing them with general-purpose RL components such as value functions. In contrast, traditional RL algorithms are data-inefficient because they do not use domain knowledge, but they do converge to an optimal policy in the limit. We propose RL$^3$, a principled hybrid approach that incorporates action-values, learned per task via traditional RL, into the inputs to Meta-RL. We show that RL$^3$ earns greater cumulative reward in the long term than RL$^2$, drastically reduces meta-training time, and generalizes better to out-of-distribution tasks. Experiments are conducted on Meta-RL benchmarks and custom discrete domains that exhibit a range of short-term, long-term, and complex dependencies.

IROS 2022 · Conference Paper

Selecting the Partial State Abstractions of MDPs: A Metareasoning Approach with Deep Reinforcement Learning

  • Samer B. Nashed
  • Justin Svegliato
  • Abhinav Bhatia
  • Stuart Russell 0001
  • Shlomo Zilberstein

Markov decision processes (MDPs) are a common general-purpose model used in robotics for representing sequential decision-making problems. Given the complexity of robotics applications, a popular approach for approximately solving MDPs relies on state aggregation to reduce the size of the state space but at the expense of policy fidelity, offering a trade-off between policy quality and computation time. Naturally, this poses a challenging metareasoning problem: how can an autonomous system dynamically select different state abstractions that optimize this trade-off as it operates online? In this paper, we formalize this metareasoning problem with a notion of time-dependent utility and solve it using deep reinforcement learning. To do this, we develop several general, cheap heuristics that summarize the reward structure and transition topology of the MDP at hand to serve as effective features. Empirically, we demonstrate that our metareasoning approach outperforms several baseline approaches and a strong heuristic approach on a standard benchmark domain.
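
As a rough illustration of the pipeline the abstract describes, the sketch below computes a few cheap features of a tabular MDP's transition topology and reward structure and feeds them, along with elapsed time, to a learned selector. The feature set and the epsilon-greedy selector are assumptions, not the paper's exact design.

```python
# Illustrative sketch of metareasoning over state abstractions.
import numpy as np

def mdp_features(P, R):
    """P: (S, A, S) transition probabilities; R: (S, A) rewards."""
    branching = float((P > 0).sum(axis=2).mean())   # avg. successors per (s, a)
    self_loops = float(np.einsum('sas->', P)) / (P.shape[0] * P.shape[1])
    reward_spread = float(R.max() - R.min())
    reward_density = float((R != 0).mean())
    return np.array([branching, self_loops, reward_spread, reward_density])

def select_abstraction(q_values_fn, features, elapsed_time, n_options, eps=0.05):
    """q_values_fn: a critic trained offline (via deep RL in the paper) that
    scores each abstraction level under time-dependent utility."""
    if np.random.rand() < eps:
        return np.random.randint(n_options)
    return int(np.argmax(q_values_fn(np.append(features, elapsed_time))))
```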

ICAPS 2022 · Conference Paper

Tuning the Hyperparameters of Anytime Planning: A Metareasoning Approach with Deep Reinforcement Learning

  • Abhinav Bhatia
  • Justin Svegliato
  • Samer B. Nashed
  • Shlomo Zilberstein

Anytime planning algorithms often have hyperparameters that can be tuned at runtime to optimize their performance. While work on metareasoning has focused on when to interrupt an anytime planner and act on the current plan, the scope of metareasoning can be expanded to tuning the hyperparameters of the anytime planner at runtime. This paper introduces a general, decision-theoretic metareasoning approach that optimizes both the stopping point and hyperparameters of anytime planning. We begin by proposing a generalization of the standard meta-level control problem for anytime algorithms. We then offer a meta-level control technique that monitors and controls an anytime algorithm using deep reinforcement learning. Finally, we show that our approach boosts performance on a common benchmark domain that uses anytime weighted A* to solve a range of heuristic search problems and a mobile robot application that uses RRT* to solve motion planning problems.
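
Below is a minimal sketch of such a monitor-and-control loop, with toy stand-ins so it runs end to end. The single-weight hyperparameter, the utility form, and the myopic stopping rule are assumptions for illustration; the paper learns the controller with deep reinforcement learning.

```python
# Hedged sketch of meta-level control that tunes hyperparameters at runtime.
def utility(quality, t, time_cost=0.01):
    # Time-dependent utility: intrinsic plan value minus the cost of deliberation.
    return quality - time_cost * t

def metacontrol(plan_step, choose, horizon=100):
    """After each slice of anytime planning, a meta-level policy either stops
    or sets the hyperparameters for the next slice."""
    quality, hparams, t = 0.0, 1.5, 0
    for t in range(horizon):
        quality = plan_step(hparams, quality)           # one slice of planning
        action, hparams = choose(quality, t, hparams)   # learned via deep RL in the paper
        if action == "stop":
            break
    return quality, utility(quality, t)

# Toy stand-ins: a diminishing-returns quality curve and a myopic controller
# that stops once the marginal gain drops below the time cost.
plan_step = lambda w, q: q + (1.0 - q) * 0.1 / w
choose = lambda q, t, w: ("stop", w) if (1.0 - q) * 0.1 / w < 0.01 \
    else ("continue", max(w * 0.95, 1.0))
print(metacontrol(plan_step, choose))
```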

SoCS 2021 · Conference Paper

On the Benefits of Randomly Adjusting Anytime Weighted A*

  • Abhinav Bhatia
  • Justin Svegliato
  • Shlomo Zilberstein

Anytime Weighted A* (AWA*), an anytime heuristic search algorithm that uses a weight to scale the heuristic value of each node in the open list, has proven to be an effective way to manage the trade-off between solution quality and computation time in heuristic search. Finding the best weight, however, is challenging because it depends not only on the characteristics of the domain and the details of the instance at hand, but also on the available computation time. We propose a randomized version of this algorithm, called Randomized Weighted A* (RWA*), that randomly adjusts its weight at runtime, and show a counterintuitive phenomenon: RWA* generally performs as well as or better than AWA* with the best static weight on a range of benchmark problems. The result is a simple algorithm that is easy to implement and performs consistently well without any offline experimentation or parameter tuning.
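
The algorithm is simple enough to sketch. Below is an illustrative RWA*: standard anytime weighted A* over an admissible heuristic h, except that the weight is re-sampled at runtime (here, whenever a new incumbent solution is found). The weight set and the resampling schedule are assumptions for the sketch, not necessarily the paper's choices.

```python
# Hedged sketch of Randomized Weighted A* (RWA*).
import heapq, itertools, random

def rwa_star(start, goal, neighbors, h, weights=(1.0, 1.5, 2.0, 3.0)):
    """Anytime search: keeps improving an incumbent solution until the open
    list is exhausted; returns the best path cost found."""
    tie = itertools.count()            # tie-breaker so states need not be comparable
    w = random.choice(weights)
    incumbent = float("inf")
    g = {start: 0.0}
    open_list = [(w * h(start), next(tie), start)]
    while open_list:
        _, _, s = heapq.heappop(open_list)
        if g[s] + h(s) >= incumbent:   # with admissible h, s cannot improve the incumbent
            continue
        if s == goal:
            incumbent = g[s]           # record the better solution, keep searching,
            w = random.choice(weights) # and re-sample the weight (the "randomized" twist)
            continue
        for t, cost in neighbors(s):
            if g[s] + cost < g.get(t, float("inf")):
                g[t] = g[s] + cost
                heapq.heappush(open_list, (g[t] + w * h(t), next(tie), t))
    return incumbent
```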

ICAPS 2019 · Conference Paper

Resource Constrained Deep Reinforcement Learning

  • Abhinav Bhatia
  • Pradeep Varakantham
  • Akshat Kumar

In urban environments, resources have to be constantly matched to the “right” locations where customer demand is present. For instance, ambulances have to be matched to base stations regularly so as to reduce response time for emergency incidents in ERS (Emergency Response Systems); vehicles (cars, bikes, among others) have to be matched to docking stations to reduce lost demand in shared mobility systems. Such problems are challenging owing to demand uncertainty, combinatorial action spaces, and constraints on the allocation of resources (e.g., total resources, and minimum and maximum numbers of resources at locations and regions). Existing systems typically employ myopic, greedy optimization approaches to resource allocation, and such approaches are typically unable to handle surges or variances in demand patterns well. Recent work has demonstrated the ability of Deep RL methods to adapt well to highly uncertain environments. However, existing Deep RL methods are unable to handle combinatorial action spaces and constraints on the allocation of resources. To that end, we have developed three approaches on top of the well-known actor-critic approach DDPG (Deep Deterministic Policy Gradient) that are able to handle constraints on resource allocation. We also demonstrate that they outperform leading approaches on simulators validated on semi-real and real data sets.
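
One simple way to realize such allocation constraints on top of a DDPG-style actor is to repair its raw output into a feasible allocation. The sketch below rescales the proposal to the resource total and iteratively redistributes the residual left by per-location bounds; this projection step is an assumption for illustration, not necessarily one of the paper's three approaches.

```python
# Hedged sketch: projecting a raw actor output onto the allocation constraints.
import numpy as np

def project_allocation(raw, total, lo, hi, iters=50):
    """Approximately repair a raw actor output into a feasible allocation:
    nonnegative, summing to `total`, within per-location bounds [lo, hi].
    Assumes feasibility, i.e. n*lo <= total <= n*hi."""
    x = np.clip(np.asarray(raw, dtype=float), 1e-8, None)
    x = x / x.sum() * total               # match the total-resources constraint
    for _ in range(iters):
        x = np.clip(x, lo, hi)            # enforce per-location bounds...
        slack = total - x.sum()           # ...which may break the total
        if abs(slack) < 1e-9:
            break
        free = (x < hi) if slack > 0 else (x > lo)
        if not free.any():
            break
        x[free] += slack / free.sum()     # spread the residual over unbound slots
    return x

# e.g., 20 ambulances over 5 base stations, between 1 and 8 per station:
print(project_allocation(np.random.rand(5), total=20, lo=1, hi=8))
```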