Arrow Research search

Author name cluster

Prashant Doshi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

74 papers
2 author rows

Possible papers

74

AIJ Journal 2026 Journal Article

Decision-theoretic planning and cognitive modeling for active cyber deception

  • Aditya Shinde
  • Prashant Doshi

Cyber defense is evolving to include deception as a key strategy to thwart adversaries. Cyber deception elevates cyber defense by shifting the focus from intrusion detection and prevention to strategically influencing the attacker’s beliefs and perceptions. However, in its current form, deception is employed passively to mislead and misdirect adversaries using decoy systems called honeypots. We present a decision-theoretic approach to active intent recognition using honeypots. We model cyber deception as a sequential decision-making problem in a two-agent context situated on a single honeypot host. To explicitly reason about the influence of deception on the attacker’s beliefs, we introduce factored finitely-nested interactive POMDPs (I-POMDPX), a factored variant of the I-POMDP framework. We utilize the I-POMDPX framework to model the problem with multiple candidate attacker types, each of which models a cyber attack across various stages from the attacker’s initial entry to reaching its adversarial objective. Recursive reasoning facilitated by I-POMDPs enables the defender to simulate interactions where the attacker is oblivious of a defender, and also scenarios where the attacker reasons about the defender’s actions. The defending I-POMDPX-based agent uses decoys to engage the attacker at multiple phases to form increasingly accurate predictions of the attacker’s behavior and intent. Subsequently, we leverage the explicit and subjective reasoning capability of the I-POMDPX to model cognitive biases known to play a role in deception. Specifically, we model the fundamental attribution error (FAE) and confirmation bias. We show that the cognitive modeling of these biases using the I-POMDPX framework plays a crucial role in deceiving sophisticated adversaries. We evaluate our framework in both simulations and with the I-POMDPX agent deployed on a honeypot host with instrumentation. Our experiments show that the I-POMDPX-based agent outperforms commonly used deception strategies in intent recognition on honeypots. We explore how the defender’s deception evolves as the attacker becomes more strategic. At higher levels of reasoning, we demonstrate how the defender can leverage the computational modeling of the attacker’s cognitive biases to facilitate deception against sophisticated adversaries. This emerging application of autonomous agents offers a new approach to cyber defense that contrasts with the traditional action-reaction dynamic that has defined interactions between cyber attackers and defenders for years.

ICRA Conference 2025 Conference Paper

A Novel Computational Framework of Robot Trust for Human-Robot Teams

  • Bhavana Nare
  • John Frericks
  • Anusha Challa
  • Prashant Doshi
  • Kyle Johnsen 0001

When humans collaborate, they form positive or negative experiences with each other. These experiences depend on various factors such as the individual's skills, abilities, and agency. In this paper, we consider human-robot collaborations and present a novel model of an autonomous robot's trust in humans based on the probability of the robot having a positive experience with the human. The model defines a dynamic trust-building process that translates into a computationally accessible implementation. We hypothesize predictors of a positive experience with human teammates and derive trust in individual humans. As the interactions continue, team members develop an affinity toward each other. The robot's affinity towards humans can be viewed as kinship, and we also investigate how kinship affects trust and distrust. We present an algorithm for how the robot may use kinship-mediated trust in its decision-making, and demonstrate its use in simulated missions truly requiring human-robot collaboration.
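The abstract frames trust as the probability of the robot having a positive experience with the human, built up dynamically over repeated interactions. As a minimal sketch of that core idea only, assuming a simple Beta-Bernoulli update (the paper's actual model, with skills, agency, and kinship, is not given here; the class and variable names are illustrative):

```python
# Hypothetical sketch: trust as the estimated probability of a positive
# experience, maintained with a conjugate Beta-Bernoulli update.
# Not the paper's model -- names and structure are illustrative.

class TrustModel:
    def __init__(self, alpha: float = 1.0, beta: float = 1.0):
        # Beta(alpha, beta) prior over P(positive experience)
        self.alpha = alpha
        self.beta = beta

    def observe(self, positive: bool) -> None:
        # Conjugate update after each interaction outcome
        if positive:
            self.alpha += 1
        else:
            self.beta += 1

    @property
    def trust(self) -> float:
        # Posterior mean of P(positive experience)
        return self.alpha / (self.alpha + self.beta)

robot_trust = TrustModel()
for outcome in [True, True, False, True]:  # three positive, one negative
    robot_trust.observe(outcome)
print(round(robot_trust.trust, 3))  # 4/6 -> 0.667
```

The conjugacy is what makes such a process "computationally accessible": each interaction updates two counters, and the trust estimate is a single division.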

AIJ Journal 2025 Journal Article

Active legibility in multiagent reinforcement learning

  • Yanyu Liu
  • Yinghui Pan
  • Yifeng Zeng
  • Biyang Ma
  • Prashant Doshi

Multiagent sequential decision problems arise in many critical applications, including urban transportation, autonomous driving, and military operations. Their widely known solution, multiagent reinforcement learning, has evolved tremendously in recent years. Among these, the paradigm of modeling other agents attracts our interest, as it differs from traditional value decomposition or communication mechanisms. It enables agents to understand and anticipate others' behaviors and facilitates their collaboration. Inspired by recent research on legibility, which allows agents to reveal their intentions through their behavior, we propose a multiagent active legibility framework to improve their performance. The legibility-oriented framework drives agents to conduct legible actions so as to help others optimise their behaviors. In addition, we design a series of problem domains that emulate common scenarios where legibility is needed and effectively characterize legibility in multiagent reinforcement learning. The experimental results demonstrate that the new framework is more efficient and requires less training time compared to several multiagent reinforcement learning algorithms.

  • We propose the multiagent active legibility framework to develop legible plans in MARL.
  • We propose the legibility reward shaping technique and prove its correctness.
  • We design multiple problem domains to showcase the plan recognition and legibility.

UAI Conference 2025 Conference Paper

Adaptive Human-Robot Collaboration using Type-Based IRL

  • Prasanth Sengadu Suresh
  • Prashant Doshi
  • Bikramjit Banerjee

Human-robot collaboration (HRC) integrates the consistency and precision of robotic systems with the dexterity and cognitive abilities of humans to create synergy. However, human performance may degrade due to various factors (e.g., fatigue, trust) which can manifest unpredictably, typically resulting in diminished output and reduced quality. To address this challenge toward successful HRCs, we present a human-aware approach to collaboration using a novel multi-agent decision-making framework. Type-based decentralized Markov decision processes (TB-DecMDP) additionally model latent, causal decision-making factors influencing agent behavior (e.g., fatigue), leading to dynamic agent types. In this framework, agents can switch between types and each maintains a belief about others’ current type based on observed actions while aiming to achieve a shared objective. We introduce a new inverse reinforcement learning (IRL) algorithm, TB-DecAIRL, which uses TB-DecMDP to model complex HRCs. TB-DecAIRL learns a type-contingent reward function and corresponding vector of policies from team demonstrations. Our evaluations in a realistic HRC problem setting establish that modeling human types in TB-DecAIRL improves robot behavior over the default of ignoring human factors, by increasing throughput in a human-robot produce sorting task.

IROS Conference 2025 Conference Paper

Analyzing Human Perceptions of a MEDEVAC Robot in a Simulated Evacuation Scenario

  • Tyson Jordan
  • Pranav Pandey
  • Prashant Doshi
  • Ramviyas Parasuraman
  • Adam Goodie

The use of autonomous systems in medical evacuation (MEDEVAC) scenarios is promising, but existing implementations overlook key insights from human-robot interaction (HRI) research. Studies on human-machine teams demonstrate that human perceptions of a machine teammate are critical in governing the machine’s performance. Consequently, it is essential to identify the factors that contribute to positive human perceptions in human-machine teams. Here, we present a mixed factorial design to assess human perceptions of a MEDEVAC robot in a simulated evacuation scenario. Participants were assigned to the role of casualty (CAS) or bystander (BYS) and subjected to three within-subjects conditions based on the MEDEVAC robot’s operating mode: autonomous-slow (AS), autonomous-fast (AF), and teleoperation (TO). During each trial, a MEDEVAC robot navigated an 11-meter path, acquiring a casualty and transporting them to an ambulance exchange point while avoiding an idle bystander. Following each trial, subjects completed a questionnaire measuring their emotional states, perceived safety, and social compatibility with the robot. Results indicate a consistent main effect of operating mode on reported emotional states and perceived safety. Pairwise analyses suggest that the employment of the AF operating mode negatively impacted perceptions along these dimensions. There were no persistent differences between CAS and BYS responses.

ECAI Conference 2025 Conference Paper

Inferring Hidden Behavioral Signatures of Cyber Adversaries Using Inverse Reinforcement Learning

  • Aditya Shinde
  • Prashant Doshi

This paper presents an emerging approach to attacker preference modeling from system-level audit logs using inverse reinforcement learning (IRL). Adversary modeling is an important capability in cybersecurity that lets defenders characterize behaviors of potential attackers, which enables attribution to known cyber adversary groups. Existing approaches rely on documenting an ever-evolving set of attacker tools and techniques to track known threat actors. Although attacks evolve constantly, attacker behavioral preferences are intrinsic and less volatile. Our approach learns the behavioral preferences of cyber adversaries from forensics data on their tools and techniques. We model the attacker as an expert decision-making agent with unknown behavioral preferences situated in a computer host. We leverage attack provenance graphs of audit logs to derive a state-action trajectory of the attack. We test our approach on open datasets of audit logs containing real attack data. Our results demonstrate for the first time that low-level forensics data can automatically reveal an adversary’s subjective preferences, which serves as an additional dimension to modeling and documenting cyber adversaries. Attackers’ preferences tend to be less dynamic despite their different tools and indicate predispositions that are inherent to the attacker. As such, these inferred preferences can potentially serve as unique behavioral signatures of attackers and improve threat attribution.

UAI Conference 2025 Conference Paper

MOHITO: Multi-Agent Reinforcement Learning using Hypergraphs for Task-Open Systems

  • Gayathri Anil
  • Prashant Doshi
  • Daniel Redder
  • Adam Eck
  • Leen-Kiat Soh

Open agent systems are prevalent in the real world, where the sets of agents and tasks change over time. In this paper, we focus on task-open multi-agent systems, exemplified by applications such as ridesharing, where passengers (tasks) appear spontaneously over time and disappear if not attended to promptly. Task-open settings challenge us with an action space which changes dynamically. This renders existing reinforcement learning (RL) methods, intended for fixed state and action spaces, inapplicable. Whereas multi-task learning approaches learn policies generalized to multiple known and related tasks, they struggle to adapt to previously unseen tasks. Conversely, lifelong learning adapts to new tasks over time, but generally assumes that tasks come sequentially from a static and known distribution rather than simultaneously and unpredictably. We introduce a novel category of RL for addressing task openness, modeled using a task-open Markov game. Our approach, MOHITO, is a multi-agent actor-critic schema which represents knowledge about the relationships between agents and changing tasks and actions as dynamically evolving 3-uniform hypergraphs. As popular multi-agent RL testbeds do not exhibit task openness, we evaluate MOHITO on two realistic and naturally task-open domains to establish its efficacy and provide a benchmark for future work in this setting.

NeurIPS Conference 2024 Conference Paper

An Autoencoder-Like Nonnegative Matrix Co-Factorization for Improved Student Cognitive Modeling

  • Shenbao Yu
  • Yinghui Pan
  • Yifeng Zeng
  • Prashant Doshi
  • Guoquan Liu
  • Kim-Leng Poh
  • Mingwei Lin

Student cognitive modeling (SCM) is a fundamental task in intelligent education, with applications ranging from personalized learning to educational resource allocation. By exploiting students' response logs, SCM aims to predict their exercise performance as well as estimate knowledge proficiency in a subject. Data mining approaches such as matrix factorization can obtain high accuracy in predicting student performance on exercises, but the knowledge proficiency is unknown or poorly estimated. The situation is further exacerbated if only sparse interactions exist between exercises and students (or knowledge concepts). To solve this dilemma, we root monotonicity (a fundamental psychometric theory on educational assessments) in a co-factorization framework and present an autoencoder-like nonnegative matrix co-factorization (AE-NMCF), which improves the accuracy of estimating the student's knowledge proficiency via an encoder-decoder learning pipeline. The resulting estimation problem is nonconvex with nonnegative constraints. We introduce a projected gradient method based on block coordinate descent with Lipschitz constants and guarantee the method's theoretical convergence. Experiments on several real-world data sets demonstrate the efficacy of our approach in terms of both performance prediction accuracy and knowledge estimation ability, when compared with existing student cognitive models.

JAAMAS Journal 2024 Journal Article

Modeling and reinforcement learning in partially observable many-agent systems

  • Keyang He
  • Prashant Doshi
  • Bikramjit Banerjee

There is a prevalence of multiagent reinforcement learning (MARL) methods that engage in centralized training. These methods rely on all the agents sharing various types of information, such as their actions or gradients, with a centralized trainer or each other during the learning. Subsequently, the methods produce agent policies whose prescriptions and performance are contingent on other agents engaging in behavior assumed by the centralized training. But, in many contexts, such as mixed or adversarial settings, this assumption may not be feasible. In this article, we present a new line of methods that relaxes this assumption and engages in decentralized training resulting in the agent’s individual policy. The interactive advantage actor-critic (IA2C) maintains and updates beliefs over other agents’ candidate behaviors based on (noisy) observations, thus enabling learning at the agent’s own level. We also address MARL’s prohibitive curse of dimensionality due to the presence of many agents in the system. Under assumptions of action anonymity and population homogeneity, often exhibited in practice, large numbers of other agents can be modeled aggregately by the count vectors of their actions instead of individual agent models. More importantly, we may model the distribution of these vectors and its update using the Dirichlet-multinomial model, which offers an elegant way to scale IA2C to many-agent systems. We evaluate the performance of the fully decentralized IA2C along with other known baselines on a novel Organization domain, which we introduce, and on instances of two existing domains. Experimental comparisons with prominent and recent baselines show that IA2C is more sample efficient, more robust to noise, and can scale to learning in systems with up to a hundred agents.

AAMAS Conference 2024 Conference Paper

Modeling Cognitive Biases in Decision-theoretic Planning for Active Cyber Deception

  • Aditya Shinde
  • Prashant Doshi

This paper presents an approach to modeling and exploiting cognitive biases of cyber attackers in planning for active deception. Sophisticated cyber attacks are primarily orchestrated by human actors. Hence, we focus on the human aspect of the attacker’s decision-making process. Humans deviate from rational decision-making due to various cognitive biases. Here, we focus on fundamental attribution error (FAE) and confirmation bias and their role in cyber deception because these biases contribute to humans being deceived. We use the decision-theoretic planning framework of finitely-nested factored I-POMDP (I-POMDPX), which allows us to explicitly model FAE in multi-agent settings and build cognitive models of the attackers. We show how these biases impact their beliefs as they act and obtain more information about the environment and the adversary. The tractability of the I-POMDPX also allows for modeling agents at a higher strategy level where the optimal policy relies on induction and exploitation of these biases. Hence, we also present an I-POMDPX-based rational defender agent that can model the attacker’s beliefs under the influence of FAE and confirmation bias from a higher strategic level, and exploit them. Our experiments in simulated interactions show that the I-POMDPX-based defender agent can induce FAE in an attacker to distort the attacker’s beliefs. Consequently, the defender agent can exploit the attacker’s cognitive biases to extend the duration of the attack to facilitate the attacker’s intent recognition in a controlled environment. Our work provides a general decision-theoretic formulation of FAE and confirmation bias, and demonstrates its role in planning for agent-based active cyber deception.

IROS Conference 2024 Conference Paper

Open Human-Robot Collaboration using Decentralized Inverse Reinforcement Learning

  • Prasanth Sengadu Suresh
  • Siddarth Jain
  • Prashant Doshi
  • Diego Romeres

The growing interest in human-robot collaboration (HRC), where humans and robots cooperate towards shared goals, has seen significant advancements over the past decade. While previous research has addressed various challenges, several key issues remain unresolved. Many domains within HRC involve activities that do not necessarily require human presence throughout the entire task. Existing literature typically models HRC as a closed system, where all agents are present for the entire duration of the task. In contrast, an open model offers flexibility by allowing an agent to enter and exit the collaboration as needed, enabling them to concurrently manage other tasks. In this paper, we introduce a novel multiagent framework called oDec-MDP, designed specifically to model open HRC scenarios where agents can join or leave tasks flexibly during execution. We generalize a recent multiagent inverse reinforcement learning method, Dec-AIRL, to learn from open systems modeled using the oDec-MDP. Our method is validated through experiments conducted in both a simplified toy firefighting domain and a realistic dyadic human-robot collaborative assembly. Results show that our framework and learning method improve upon their closed-system counterparts.

AAMAS Conference 2023 Conference Paper

Dec-AIRL: Decentralized Adversarial IRL for Human-Robot Teaming

  • Prasanth Sengadu Suresh
  • Yikang Gui
  • Prashant Doshi

We present a new method for inverse reinforcement learning (IRL) that allows an agent to learn from expert demonstrations and then spontaneously collaborate with a human on the same task. We generalize adversarial IRL (AIRL) to work in a decentralized setting using a decentralized Markov decision process (Dec-MDP) as the underlying model. We posit that a Dec-MDP is a better-suited model for pragmatic multi-agent IRL compared to the multi-agent Markov decision process (MMDP) or the Markov game, which have been utilized thus far. This is because the latter models require an agent to know the global state of the environment, which is impractical in the real world as it may include agent-specific attributes (e.g., joint angles) that may not be directly observable by the other agents. We test our method on two domains: a formative simulated patient assistance scenario and a summative real-world use-inspired domain of sorting onions on a line conveyor. Our method (Dec-AIRL) significantly improves on the previous techniques in both domains. These results indicate that a decentralized multi-agent IRL formalism promotes effective teaming in human-robot collaborative tasks.

AAMAS Conference 2022 Conference Paper

A Hierarchical Bayesian Process for Inverse RL in Partially-Controlled Environments

  • Kenneth Bogert
  • Prashant Doshi

Robots learning from observations in the real world may encounter objects or agents in the environment, other than the expert giving the demonstration, that cause nuisance observations. These confounding elements are typically removed in fully-controlled environments such as virtual simulations or lab settings. When complete removal is impossible the nuisance observations must be filtered out. However, identifying the sources of observations when large amounts of observations are made is difficult. To address this, we present a hierarchical Bayesian process that models both the expert’s and the confounding elements’ observations thereby explicitly modeling the diverse observations a robot may receive. We extend an existing inverse reinforcement learning algorithm originally designed to work under partial occlusion of the expert to consider the diverse and noisy observations. In a simulated robotic produce-sorting domain containing both occlusion and confounding elements, we demonstrate the model’s effectiveness. In particular, our technique outperforms several other comparative methods, second only to having perfect knowledge of the subject’s trajectory.

UAI Conference 2022 Conference Paper

Decision-theoretic planning with communication in open multiagent systems

  • Anirudh Kakarlapudi
  • Gayathri Anil
  • Adam Eck
  • Prashant Doshi
  • Leen-Kiat Soh

In open multiagent systems, the set of agents operating in the environment changes over time and in ways that are nontrivial to predict. For example, if collaborative robots were tasked with fighting wildfires, they may run out of suppressants and be temporarily unavailable to assist their peers. Because an agent’s optimal action depends on the actions of others, each agent must not only predict the actions of its peers, but, before that, reason whether they are even present to perform an action. Addressing openness thus requires agents to model each other’s presence, which can be enhanced through agents communicating about their presence in the environment. At the same time, communicative acts can also incur costs (e.g., consuming limited bandwidth), and thus an agent must trade off the benefits of enhanced coordination with the costs of communication. We present a new principled, decision-theoretic method in the context provided by the recent communicative interactive POMDP framework for planning in open agent settings that balances this tradeoff. Simulations of multiagent wildfire suppression problems demonstrate how communication can improve planning in open agent environments, as well as how agents trade off the benefits and costs of communication under different scenarios.

UAI Conference 2022 Conference Paper

Marginal MAP estimation for inverse RL under occlusion with observer noise

  • Prasanth Sengadu Suresh
  • Prashant Doshi

We consider the problem of learning the behavioral preferences of an expert engaged in a task from noisy and partially-observable demonstrations. This is motivated by real-world applications such as a line robot learning from observing a human worker, where some observations are occluded by environmental elements. Furthermore, robotic perception tends to be imperfect and noisy. Previous techniques for inverse reinforcement learning (IRL) take the approach of either omitting the missing portions or inferring it as part of expectation-maximization, which tends to be slow and prone to local optima. We present a new method that generalizes the well-known Bayesian maximum-a-posteriori (MAP) IRL method by marginalizing the occluded portions of the trajectory. This is then extended with an observation model to account for perception noise. This novel application of marginal MAP (MMAP) to IRL significantly improves on the previous IRL technique under occlusion in both formative evaluations on a toy problem and in a summative evaluation on a produce sorting line task by a physical robot.

UAI Conference 2022 Conference Paper

Reinforcement learning in many-agent settings under partial observability

  • Keyang He
  • Prashant Doshi
  • Bikramjit Banerjee

Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep RL, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity to a recently presented actor-critic based MARL algorithm, interactive A2C. We introduce a Dirichlet-multinomial model for maintaining beliefs over the agent population when agents’ actions are not perfectly observable. We show that the posterior is a mixture of Dirichlet distributions that we approximate as a single component for tractability. We also show that the prediction accuracy of this method increases with more agents. Finally, we show empirically that our method can learn optimal behaviors in two recently introduced pragmatic domains with large agent populations, and is robust in partially observable environments.
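The aggregate count-vector idea in this and the JAAMAS 2024 abstract above can be illustrated with a generic Dirichlet-multinomial update. This is a minimal sketch under assumed names and an assumed toy action set, not the papers' implementation: under action anonymity, only the counts of actions matter, so the belief representation does not grow with the number of agents.

```python
# Illustrative sketch (not the papers' code): modeling a large agent
# population aggregately via counts of observed actions, using a
# Dirichlet-multinomial conjugate update. Action set is assumed.

from collections import Counter

ACTIONS = ["left", "stay", "right"]  # assumed toy action set

def dirichlet_posterior(prior: dict, observed_actions: list) -> dict:
    """Add observed action counts to the Dirichlet concentration parameters."""
    counts = Counter(observed_actions)
    return {a: prior[a] + counts.get(a, 0) for a in ACTIONS}

def predictive(params: dict) -> dict:
    """Posterior-mean prediction of the population's action distribution."""
    total = sum(params.values())
    return {a: params[a] / total for a in ACTIONS}

prior = {a: 1.0 for a in ACTIONS}  # uniform Dirichlet(1, 1, 1)
post = dirichlet_posterior(prior, ["left", "left", "stay", "left"])
print(predictive(post))  # posterior means: left 4/7, stay 2/7, right 1/7
```

Whether four agents or four hundred acted, the update touches one counter per action type, which is the scalability property the abstract highlights; handling imperfectly observed actions (the mixture-of-Dirichlets posterior) is the paper's additional contribution and is not sketched here.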

AIJ Journal 2021 Journal Article

A survey of inverse reinforcement learning: Challenges, methods and progress

  • Saurabh Arora
  • Prashant Doshi

Inverse reinforcement learning (IRL) is the problem of inferring the reward function of an agent, given its policy or observed behavior. Analogous to RL, IRL is perceived both as a problem and as a class of methods. By categorically surveying the extant literature in IRL, this article serves as a comprehensive reference for researchers and practitioners of machine learning as well as those new to it to understand the challenges of IRL and select the approaches best suited for the problem at hand. The survey formally introduces the IRL problem along with its central challenges such as the difficulty in performing accurate inference and its generalizability, its sensitivity to prior knowledge, and the disproportionate growth in solution complexity with problem size. The article surveys a vast collection of foundational methods grouped together by the commonality of their objectives, and elaborates how these methods mitigate the challenges. We further discuss extensions to the traditional IRL methods for handling imperfect perception, an incomplete model, learning multiple reward functions and nonlinear reward functions. The article concludes the survey with a discussion of some broad advances in the research area and currently open research questions.

AAMAS Conference 2021 Conference Paper

Cooperative-Competitive Reinforcement Learning with History-Dependent Rewards

  • Keyang He
  • Bikramjit Banerjee
  • Prashant Doshi

Consider a typical organization whose worker agents seek to collectively cooperate for its general betterment. However, each individual agent simultaneously seeks to act to secure a larger chunk than its co-workers of the annual increment in compensation, which usually comes from a fixed pot. As such, the agents in an organization must cooperate and compete. Another feature of many organizations is that a worker receives a bonus, which is often a fraction of the previous year’s total profit. As such, the agent derives a reward that is also partly dependent on historical performance. How should the individual agent decide to act in this context? Few methods for the mixed cooperative-competitive setting have been presented in recent years, but these are challenged by problem domains whose reward functions additionally depend on historical information. Recent deep multi-agent reinforcement learning (MARL) methods using long short-term memory (LSTM) may be used, but these adopt a joint perspective to the interaction or require explicit exchange of information among the agents to promote cooperation, which may not be possible under competition. In this paper, we first show that the agent’s decision-making problem can be modeled as an interactive partially observable Markov decision process (I-POMDP) that captures the dynamics of a history-dependent reward. We present an interactive advantage actor-critic method (IA2C+), which combines the independent advantage actor-critic network with a belief filter that maintains a belief distribution over other agents’ models. Empirical results show that IA2C+ learns the optimal policy faster and more robustly than several baselines.

AAMAS Conference 2021 Conference Paper

Cyber Attack Intent Recognition and Active Deception using Factored Interactive POMDPs

  • Aditya Shinde
  • Prashant Doshi
  • Omid Setayeshfar

This paper presents an intelligent and adaptive agent that employs deception to recognize a cyber adversary’s intent on a honeypot host. Unlike previous approaches to cyber deception, which mainly focus on delaying or confusing the attackers, we focus on engaging with them to learn their intent. We model cyber deception as a sequential decision-making problem in a two-agent context. We introduce factored finitely-nested interactive POMDPs (I-POMDPX) and use this framework to model the problem with multiple attacker types. Our approach models cyber attacks on a single honeypot host across multiple phases from the attacker’s initial entry to reaching its adversarial objective. The defending I-POMDPX-based agent uses decoys to engage with the attacker at multiple phases to form increasingly accurate predictions of the attacker’s behavior and intent. The use of I-POMDPs also enables us to model the adversary’s mental state and investigate how deception affects their beliefs. Our experiments in both simulation and with the agent deployed on a host system show that the I-POMDPX-based agent performs significantly better at intent recognition than commonly used deception strategies on honeypots. This emerging application of autonomous agents offers a new approach that contrasts with the traditional action-reaction dynamic that has defined interactions between cyber attackers and defenders for years.

ICAPS Conference 2021 Conference Paper

Data-Driven Decision-Theoretic Planning using Recurrent Sum-Product-Max Networks

  • Hari Teja Tatavarti
  • Prashant Doshi
  • Layton Hayes

Sum-product networks (SPN) are knowledge compilation models and are related to other graphical models for efficient probabilistic inference such as arithmetic circuits and AND/OR graphs. Recent investigations into generalizing SPNs have yielded sum-product-max networks (SPMN) which offer a data-driven alternative for decision making that has predominantly relied on handcrafted models. However, SPMNs are not suited for decision-theoretic planning which involves sequential decision making over multiple time steps. In this paper, we present recurrent SPMNs (RSPMN) that learn from and model decision-making data over time. RSPMNs utilize a template network that is unfolded as needed depending on the length of the data sequence. This is significant as RSPMNs not only inherit the benefits of SPNs in being data driven and mostly tractable, they are also well suited for planning problems. We establish soundness conditions on the template network, which guarantee that the resulting SPMN is valid, and present a structure learning algorithm to learn a sound template. RSPMNs learned on a testbed of data sets, some generated using RDDLSim, yield maximum expected utilities (MEUs) and policies that are close to the optimal on perfectly-observed domains and easily improve on a recent batch-constrained RL method, which is important because RSPMNs offer a new model-based approach to offline RL.

ICRA Conference 2021 Conference Paper

Min-Max Entropy Inverse RL of Multiple Tasks

  • Saurabh Arora
  • Prashant Doshi
  • Bikramjit Banerjee

Multi-task IRL recognizes that expert(s) could be switching between multiple ways of solving the same problem, or interleaving demonstrations of multiple tasks. The learner aims to learn the reward functions that individually guide these distinct ways. We present a new method for multi-task IRL that generalizes the well-known maximum entropy approach by combining it with a Dirichlet process based minimum entropy clustering of the observed data. This yields a single nonlinear optimization problem, called MinMaxEnt Multi-task IRL (MME-MTIRL), which can be solved using the Lagrangian relaxation and gradient descent methods. We evaluate MME-MTIRL on the robotic task of sorting onions on a processing line where the expert utilizes multiple ways of detecting and removing blemished onions. The method is able to learn the underlying reward functions to a high level of accuracy and it improves on the previous approaches.

IJCAI Conference 2021 Conference Paper

State-Based Recurrent SPMNs for Decision-Theoretic Planning under Partial Observability

  • Layton Hayes
  • Prashant Doshi
  • Swaraj Pawar
  • Hari Teja Tatavarti

The sum-product network (SPN) has been extended to model sequence data with the recurrent SPN (RSPN), and to decision-making problems with sum-product-max networks (SPMN). In this paper, we build on the concepts introduced by these extensions and present state-based recurrent SPMNs (S-RSPMNs) as a generalization of SPMNs to sequential decision-making problems where the state may not be perfectly observed. As with recurrent SPNs, S-RSPMNs utilize a repeatable template network to model sequences of arbitrary lengths. We present an algorithm for learning compact template structures by identifying unique belief states and the transitions between them through a state matching process that utilizes augmented data. To our knowledge, this is the first data-driven approach that learns graphical models for planning under partial observability, which can be solved efficiently. S-RSPMNs retain the linear solution complexity of SPMNs, and we demonstrate significant improvements in compactness of representation and the run time of structure learning and inference in sequential domains.

JAAMAS Journal 2020 Journal Article

I2RL: online inverse reinforcement learning under occlusion

  • Saurabh Arora
  • Prashant Doshi
  • Bikramjit Banerjee

Inverse reinforcement learning (IRL) is the problem of learning the preferences of an agent from observing its behavior on a task. It inverts RL which focuses on learning an agent’s behavior on a task based on the reward signals received. IRL is witnessing sustained attention due to promising applications in robotics, computer games, and finance, as well as in other sectors. Methods for IRL have, for the most part, focused on batch settings where the observed agent’s behavioral data has already been collected. However, the related problem of online IRL—where observations are incrementally accrued, yet the real-time demands of the application often prohibit a full rerun of an IRL method—has received significantly less attention. We introduce the first formal framework for online IRL, called incremental IRL (I2RL), which can serve as a common ground for online IRL methods. We demonstrate the usefulness of this framework by casting existing online IRL techniques into this framework. Importantly, we present a new method that advances maximum entropy IRL with hidden variables to the online setting. Our analysis shows that the new method has monotonically improving performance with more demonstration data as well as probabilistically bounded error, both under full and partial observability. Simulated and physical robot experiments in a multi-robot patrolling application situated in varied-sized worlds, which involves learning under high levels of occlusion, show a significantly improved performance of I2RL as compared to both batch IRL and an online imitation learning method.

AIJ Journal 2020 Journal Article

Recursively modeling other agents for decision making: A research perspective

  • Prashant Doshi
  • Piotr Gmytrasiewicz
  • Edmund Durfee

Individuals exhibit theory of mind, attributing beliefs, intent, and mental states to others as explanations of observed actions. Dennett's intentional stance offers an analogous abstraction for computational agents seeking to understand, explain, or predict others' behaviors. These recognized theories provide a formal basis to ongoing investigations of recursive modeling. We review and situate various frameworks for recursive modeling that have been studied in game- and decision- theories, and have yielded methods useful to AI researchers. Sustained attention given to these frameworks has produced new analyses and methods with an aim toward making recursive modeling practicable. Indeed, we also review some emerging uses and the insights these yielded, which are indicative of pragmatic progress in this area. The significance of these frameworks is that higher-order reasoning is critical to correctly recognizing others' intent or outthinking opponents. Such reasoning has been utilized in academic, business, military, security, and other contexts both to train and inform decision-making agents in organizational and strategic contexts, and also to more realistically predict and best respond to other agents' intent.

ICRA Conference 2020 Conference Paper

SA-Net: Robust State-Action Recognition for Learning from Observations

  • Nihal Soans
  • Ehsan Asali
  • Yi Hong
  • Prashant Doshi

Learning from observation (LfO) offers a new paradigm for transferring task behavior to robots. LfO requires the robot to observe the task being performed and decompose the sensed streaming data into sequences of state-action pairs, which are then input to LfO methods. Thus, recognizing the state-action pairs correctly and quickly in sensed data is a crucial prerequisite. We present SA-Net, a deep neural network architecture that recognizes state-action pairs from RGB-D data streams. SA-Net performs well in two replicated robotic applications of LfO - one involving mobile ground robots and another involving a robotic manipulator - which demonstrates that the architecture could generalize well to differing contexts. Comprehensive evaluations including deployment on a physical robot show that SA-Net significantly improves on the accuracy of the previous methods under various conditions.

AAAI Conference 2020 Conference Paper

Scalable Decision-Theoretic Planning in Open and Typed Multiagent Systems

  • Adam Eck
  • Maulik Shah
  • Prashant Doshi
  • Leen-Kiat Soh

In open agent systems, the set of agents that are cooperating or competing changes over time and in ways that are nontrivial to predict. For example, if collaborative robots were tasked with fighting wildfires, they may run out of suppressants and be temporarily unavailable to assist their peers. We consider the problem of planning in these contexts with the additional challenges that the agents are unable to communicate with each other and that there are many of them. Because an agent’s optimal action depends on the actions of others, each agent must not only predict the actions of its peers, but, before that, reason whether they are even present to perform an action. Addressing openness thus requires agents to model each other’s presence, which becomes computationally intractable with high numbers of agents. We present a novel, principled, and scalable method in this context that enables an agent to reason about others’ presence in its shared environment and their actions. Our method extrapolates models of a few peers to the overall behavior of the many-agent system, and combines it with a generalization of Monte Carlo tree search to perform individual agent reasoning in many-agent open environments. Theoretical analyses establish the number of agents to model in order to achieve acceptable worst case bounds on extrapolation error, as well as regret bounds on the agent’s utility from modeling only some neighbors. Simulations of multiagent wildfire suppression problems demonstrate our approach’s efficacy compared with alternative baselines.

UAI Conference 2019 Conference Paper

Evacuate or Not? A POMDP Model of the Decision Making of Individuals in Hurricane Evacuation Zones

  • Adithya Raam Sankar
  • Prashant Doshi
  • Adam Goodie

Recent hurricanes in the Atlantic region of the southern United States triggered a series of evacuation orders in the coastal cities of Florida, Georgia, and Texas. While some of these urged voluntary evacuations, most were mandatory orders. Despite governments asking people to vacate their homes for their own safety, many do not. We aim to understand the observable and hidden variables involved in the decision-making process and model these in a partially observable Markov decision process, which predicts whether a person will evacuate or not given his or her current situation. We consider the features of the particular hurricane, the dynamic situation that the individual is experiencing, and demographic factors that influence the decision making of individuals. The process model is represented as a dynamic influence diagram and evaluated on data collected via a comprehensive survey of hurricane-impacted individuals.

AAAI Conference 2019 Conference Paper

Model-Free IRL Using Maximum Likelihood Estimation

  • Vinamra Jain
  • Prashant Doshi
  • Bikramjit Banerjee

We propose a probabilistic model for estimating population flow, which is defined as populations of the transition between areas over time, given aggregated spatio-temporal population data. Since there is no information about individual trajectories in the aggregated data, it is not straightforward to estimate population flow. With the proposed method, we utilize a collective graphical model with which we can learn individual transition models from the aggregated data by analytically marginalizing the individual locations. Learning a spatio-temporal collective graphical model only from the aggregated data is an ill-posed problem since the number of parameters to be estimated exceeds the number of observations. The proposed method reduces the effective number of parameters by modeling the transition probabilities with a neural network that takes the locations of the origin and the destination areas and the time of day as inputs. By this modeling, we can automatically learn nonlinear spatio-temporal relationships flexibly among transitions, locations, and times. With four real-world population data sets in Japan and China, we demonstrate that the proposed method can estimate the transition population more accurately than existing methods.

RLDM Conference 2019 Conference Abstract

Modeling cooperative and competitive decision-making in the Tiger Task

  • Saurabh A Kumar
  • Prashant Doshi
  • Michael Spezio
  • Jan P Gläscher

The mathematical models underlying reinforcement learning help us understand how agents navigate the world and maximize future reward. Partially Observable Markov Decision Processes (POMDPs), an extension of classic RL, allow for action planning in uncertain environments. In this study we set out to investigate human decision-making under these circumstances in the context of cooperation and competition using the iconic Tiger Task (TT) in single-player, cooperative, and competitive multi-player versions. The task mimics the setting of a game show, in which the participant has to choose between two doors hiding either a tiger (-100 points) or a treasure (+10 points), or take a probabilistic hint about the tiger location (-1 point). In addition to the probabilistic location hints, the multi-player TT also includes probabilistic information about the other player’s actions. POMDPs have been successfully used in simulations of the single-player TT. A critical feature is the belief (a probability distribution) over the current position in the state space. Here, we leverage interactive POMDPs (I-POMDPs) to model choice data from the cooperative and competitive multi-player TT. I-POMDPs construct a model of the other player’s beliefs, which is incorporated into the agent’s own valuation process. We demonstrate using hierarchical logistic regression modeling that the cooperative context elicits better choices and more accurate predictions of the other player’s actions. Furthermore, we show that participants generate Bayesian beliefs to guide their actions. Critically, including the social information in the belief updating improves model performance, underlining that participants use this information in their belief computations. In the next step we will use I-POMDPs that explicitly model other players as intentional agents to investigate the generation of mental models and Theory of Mind in cooperative and competitive decision-making in humans.
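The belief update at the heart of the single-player task can be sketched in a few lines, using the payoffs stated in the abstract; the 85% hint accuracy below is an assumed illustrative value, not a figure from the paper:

```python
# Minimal sketch of the Tiger Task belief update: after paying -1 point for a
# probabilistic hint, the player Bayes-updates P(tiger behind left door).
# The hint accuracy (acc=0.85) is an assumed illustrative value.

def belief_update(b_left, heard_left, acc=0.85):
    """Bayes update of P(tiger left) given a hint heard from the left or right."""
    if heard_left:
        num = acc * b_left
        den = acc * b_left + (1 - acc) * (1 - b_left)
    else:
        num = (1 - acc) * b_left
        den = (1 - acc) * b_left + acc * (1 - b_left)
    return num / den

b = 0.5                                  # uniform prior over the tiger's location
b = belief_update(b, heard_left=True)    # one "growl from the left" hint
print(round(b, 3))                       # belief shifts toward the left door
```

With a symmetric prior, a single accurate-hint observation moves the belief from 0.5 to the hint's accuracy, which is why repeated listening actions are worth their small cost.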

RLDM Conference 2019 Conference Abstract

Modeling models of others’ mental states: characterizing Theory of Mind during cooperative interaction

  • Tessa Rusch
  • Prashant Doshi
  • Martin Hebart
  • Michael Spezio
  • Jan P Gläscher

Humans are experts in cooperation. To effectively engage with others they have to apply Theory of Mind (ToM); that is, they have to model others' beliefs, desires, and intentions and predict their behavior from these mental states. Here, we investigate ToM processes during real-time reciprocal coordination between two players engaging in a cooperative decision game. The game consists of a noisy and unstable environment. To succeed, participants have to model the state of the world and their partner’s belief about it and integrate both pieces of information into a coherent decision. Thereby the game combines social and non-social learning into a single decision problem. To quantify the learning processes underlying participants’ actions, we modeled the behavior with Interactive Partially Observable Markov Decision Processes (I-POMDP). The I-POMDP framework extends single-agent action planning under uncertainty to the multi-agent domain by including intentional models of other agents. Using this framework we successfully predicted interactive behavior. Furthermore, we extracted participants’ beliefs about the environment and their beliefs about the mental states of their partners, giving us direct access to the cognitive operations underlying cooperative behavior. By relating players’ own beliefs with their partners’ models of themselves, we show that dyads whose beliefs are more aligned coordinate more successfully. This provides strong evidence that behavioral coordination relies on mental alignment.

AAMAS Conference 2019 Conference Paper

Online Inverse Reinforcement Learning Under Occlusion

  • Saurabh Arora
  • Prashant Doshi
  • Bikramjit Banerjee

Inverse reinforcement learning (IRL) is the problem of learning the preferences of an agent from observing its behavior on a task. While this problem is witnessing sustained attention, the related problem of online IRL – where the observations are incrementally accrued, yet the real-time demands of the application often prohibit a full rerun of an IRL method – has received much less attention. We introduce a formal framework for online IRL, called incremental IRL (I2RL), and a new method that advances maximum entropy IRL with hidden variables, to this setting. Our analysis shows that the new method has a monotonically improving performance with more demonstration data, as well as probabilistically bounded error, both under full and partial observability. Experiments in a simulated robotic application, which involves learning under occlusion, show the significantly improved performance of I2RL as compared to both batch IRL and an online imitation learning method.
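The maximum-entropy machinery this line of work builds on has a simple core: the gradient of the max-entropy IRL objective is the difference between the expert's and the learner's feature expectations. The feature vectors and learning rate in this sketch are illustrative stand-ins, not values from the paper:

```python
# Sketch of one max-entropy IRL update: adjust linear reward weights so that
# feature expectations under the learner's policy move toward those observed
# in the (possibly occluded) demonstrations. Feature values are illustrative.

def maxent_step(theta, mu_expert, mu_policy, lr=0.1):
    """One gradient-ascent step; the gradient of the max-entropy
    log-likelihood is mu_expert - mu_policy."""
    return [t + lr * (e - p) for t, e, p in zip(theta, mu_expert, mu_policy)]

theta = [0.0, 0.0, 0.0]
mu_expert = [0.8, 0.1, 0.1]   # empirical feature counts from demonstrations
mu_policy = [0.4, 0.3, 0.3]   # expected features under the current soft policy
theta = maxent_step(theta, mu_expert, mu_policy)
print(theta)   # weights grow for features the expert visits more often
```

In the online (I2RL) setting, such a step would be applied incrementally as new demonstration segments arrive, rather than after rerunning on the full batch.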

IROS Conference 2018 Conference Paper

Inverse Learning of Robot Behavior for Collaborative Planning

  • Maulesh Trivedi
  • Prashant Doshi

Inverse reinforcement learning (IRL) is an important basis for learning from demonstrations. Observing an agent, human or robotic, perform a task provides information and facilitates learning the task. We show how the agent's preferences learned using IRL can be incorporated in a subject robot's decision making and planning, to enable the robot to spontaneously collaborate with the previously observed agent on the task. We prioritize a real-world application, where a line robot will autonomously collaborate with another robot in sorting ripe and unripe fruit such as oranges. Toward this, our evaluations utilize a colored-ball sorting task as an analog using simulated TurtleBots equipped with Phantom X arms. Our method is comprehensive, providing first answers to questions such as how should the robot acquire the complete model for the collaborative planning problem and how should it solve the problem to obtain a plan that permits collaboration without disrupting the line robot's behavior.

AIJ Journal 2018 Journal Article

Multi-robot inverse reinforcement learning under occlusion with estimation of state transitions

  • Kenneth Bogert
  • Prashant Doshi

Inverse reinforcement learning (IRL), analogously to RL, refers to both the problem and associated methods by which an agent passively observing another agent's actions over time, seeks to learn the latter's reward function. The learning agent is typically called the learner while the observed agent is often an expert in popular applications such as in learning from demonstrations. Some of the assumptions that underlie current IRL methods are impractical for many robotic applications. Specifically, they assume that the learner has full observability of the expert as it performs its task; that the learner has full knowledge of the expert's dynamics; and that there is always only one expert agent in the environment. For example, these assumptions are particularly restrictive in our application scenario where a subject robot is tasked with penetrating a perimeter patrol by two other robots after observing them from a vantage point. In our instance of this problem, the learner can observe at most 10% of the patrol. We relax these assumptions and systematically generalize a known IRL method, Maximum Entropy IRL, to enable the subject to learn the preferences of the patrolling robots, subsequently their behaviors, and predict their future positions well enough to plan a route to its goal state without being spotted. Challenged by occlusion, multiple interacting robots, and partially known dynamics we demonstrate empirically that the generalization improves significantly on several baselines in its ability to inversely learn in this application setting. Of note, it leads to significant improvement in the learner's overall success rate of penetrating the patrols. Our methods represent significant steps towards making IRL pragmatic and applicable to real-world contexts.

NeurIPS Conference 2018 Conference Paper

Online Structure Learning for Feed-Forward and Recurrent Sum-Product Networks

  • Agastya Kalra
  • Abdullah Rashwan
  • Wei-Shou Hsu
  • Pascal Poupart
  • Prashant Doshi
  • Georgios Trimponias

Sum-product networks have recently emerged as an attractive representation due to their dual view as a special type of deep neural network with clear semantics and a special type of probabilistic graphical model for which inference is always tractable. Those properties follow from some conditions (i.e., completeness and decomposability) that must be respected by the structure of the network. As a result, it is not easy to specify a valid sum-product network by hand and therefore structure learning techniques are typically used in practice. This paper describes a new online structure learning technique for feed-forward and recurrent SPNs. The algorithm is demonstrated on real-world datasets with continuous features for which it is not clear what network architecture might be best, including sequence datasets of varying length.

ICRA Conference 2017 Conference Paper

A layered HMM for predicting motion of a leader in multi-robot settings

  • Sina Solaimanpour
  • Prashant Doshi

We focus on a mobile robot that must learn another robot's motion model from observations to track it in a given map. This problem has several real-world applications such as self-driving cars being electronically towed by other cars and for telepresence robots. Our context is a nested particle filter, a generalization of the traditional particle filter, that allows both self-localization and tracking of another robot simultaneously. While the robot's observations are used to weight nested particles, the problem arises during the propagation step of the nested particles during which a motion model is needed. We introduce a novel layered hidden Markov model for this problem and present an on-line algorithm which learns the HMM parameters from observations gathered during the run. We demonstrate significantly improved tracking accuracy when using this new model to predict the motion of a leading mobile robot, in comparison to pre-defined and random motion models as previously used in literature.
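The propagate-weight-resample loop that nested particle filtering generalizes can be sketched in a plain, non-nested 1-D form; the motion and sensor models below are toy stand-ins, not the paper's layered HMM:

```python
import random

# Minimal particle-filter sketch of the propagate/weight/resample cycle.
# The leader drifts right one unit per step; the sensor model favors
# particles close to the observation. Both models are illustrative toys.

def pf_step(particles, motion, observe, z):
    # 1. propagate each particle through the (learned) motion model
    moved = [motion(p) for p in particles]
    # 2. weight by the likelihood of the observation z
    w = [observe(z, p) for p in moved]
    total = sum(w)
    w = [wi / total for wi in w]
    # 3. resample in proportion to the weights
    return random.choices(moved, weights=w, k=len(particles))

random.seed(0)
motion = lambda p: p + 1.0 + random.gauss(0, 0.1)   # leader drifts right
observe = lambda z, p: 1.0 / (1e-6 + abs(z - p))    # closer particles score higher
particles = [0.0] * 100
for z in [1.0, 2.0, 3.0]:
    particles = pf_step(particles, motion, observe, z)
est = sum(particles) / len(particles)
print(round(est, 2))   # estimate tracks the leader near position 3
```

In the nested setting, each particle of the self-localizing robot would itself carry a particle set over the tracked robot's state, with the learned motion model driving the inner propagation step.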

JAIR Journal 2017 Journal Article

Decision-Theoretic Planning Under Anonymity in Agent Populations

  • Ekhlas Sonu
  • Yingke Chen
  • Prashant Doshi

We study the problem of self-interested planning under uncertainty in settings shared with more than a thousand other agents, each of which plans at its own individual level. We refer to such large numbers of agents as an agent population. The decision-theoretic formalism of interactive partially observable Markov decision process (I-POMDP) is used to model the agent's self-interested planning. The first contribution of this article is a method for drastically scaling the finitely-nested I-POMDP to certain agent populations for the first time. Our method exploits two types of structure that are often exhibited by agent populations -- anonymity and context-specific independence. We present a variant called the many-agent I-POMDP that models both these types of structure to plan efficiently under uncertainty in multiagent settings. In particular, the complexity of the belief update and solution in the many-agent I-POMDP is polynomial in the number of agents compared with the exponential growth that challenges the original framework. While exploiting structure helps mitigate the curse of many agents, the well-known curse of history that afflicts I-POMDPs continues to challenge scalability in terms of the planning horizon. The second contribution of this article is an application of the branch-and-bound scheme to reduce the exponential growth of the search tree for look ahead. For this, we introduce new fast-computing upper and lower bounds for the exact value function of the many-agent I-POMDP. This speeds up the look-ahead computations without trading off optimality, and reduces both memory and run time complexity. The third contribution is a comprehensive empirical evaluation of the methods on three new problem domains -- policing large protests, controlling traffic congestion at a busy intersection, and improving the AI for the popular Clash of Clans multiplayer game. We demonstrate the feasibility of exact self-interested planning in these large problems, and that our methods for speeding up the planning are effective. Altogether, these contributions represent a principled and significant advance toward moving self-interested planning under uncertainty to real-world applications.

AAAI Conference 2017 Conference Paper

On Markov Games Played by Bayesian and Boundedly-Rational Players

  • Muthukumaran Chandrasekaran
  • Yingke Chen
  • Prashant Doshi

We present a new game-theoretic framework in which Bayesian players with bounded rationality engage in a Markov game and each has private but incomplete information regarding other players’ types. Instead of utilizing Harsanyi’s abstract types and a common prior, we construct intentional player types whose structure is explicit and induces a finite-level belief hierarchy. We characterize an equilibrium in this game and establish the conditions for existence of the equilibrium. The computation of finding such equilibria is formalized as a constraint satisfaction problem and its effectiveness is demonstrated on two cooperative domains.

UAI Conference 2017 Conference Paper

Robust Model Equivalence using Stochastic Bisimulation for N-Agent Interactive DIDs

  • Muthukumaran Chandrasekaran
  • Junhuan Zhang
  • Prashant Doshi
  • Yifeng Zeng

I-DIDs suffer disproportionately from the curse of dimensionality, dominated by the exponential growth in the number of models over time. Previous methods for scaling I-DIDs identify notions of equivalence between models, such as behavioral equivalence (BE). But this requires that the models be solved first. Also, model space compression across agents has not been previously investigated. We present a way to compress the space of models across agents, possibly with different frames, and do so without having to solve them first, using stochastic bisimulation. We test our approach on two non-cooperative partially observable domains with up to 20 agents.

JAAMAS Journal 2016 Journal Article

Can bounded and self-interested agents be teammates? Application to planning in ad hoc teams

  • Muthukumaran Chandrasekaran
  • Prashant Doshi
  • Yingke Chen

Planning for ad hoc teamwork is challenging because it involves agents collaborating without any prior coordination or communication. The focus is on principled methods for a single agent to cooperate with others. This motivates investigating the ad hoc teamwork problem in the context of self-interested decision-making frameworks. Agents engaged in individual decision making in multiagent settings face the task of having to reason about other agents’ actions, which may in turn involve reasoning about others. An established approximation that operationalizes this approach is to bound the infinite nesting from below by introducing level 0 models. For the purposes of this study, individual, self-interested decision making in multiagent settings is modeled using interactive dynamic influence diagrams (I-DID). These are graphical models with the benefit that they naturally offer a factored representation of the problem, allowing agents to ascribe dynamic models to others and reason about them. We demonstrate that an implication of bounded, finitely-nested reasoning by a self-interested agent is that we may not obtain optimal team solutions in cooperative settings, if it is part of a team. We address this limitation by including models at level 0 whose solutions involve reinforcement learning. We show how the learning is integrated into planning in the context of I-DIDs. This facilitates optimal teammate behavior, and we demonstrate its applicability to ad hoc teamwork on several problem domains and configurations.

AAAI Conference 2016 Conference Paper

Decision Sum-Product-Max Networks

  • Mazen Melibari
  • Pascal Poupart
  • Prashant Doshi

Sum-Product Networks (SPNs) were recently proposed as a new class of probabilistic graphical models that guarantee tractable inference, even on models with high-treewidth. In this paper, we propose a new extension to SPNs, called Decision Sum-Product-Max Networks (Decision-SPMNs), that makes SPNs suitable for discrete multi-stage decision problems. We present an algorithm that solves Decision-SPMNs in a time that is linear in the size of the network. We also present algorithms to learn the parameters of the network from data.

UAI Conference 2016 Conference Paper

Individual Planning in Open and Typed Agent Systems

  • Muthukumaran Chandrasekaran
  • Adam Eck
  • Prashant Doshi
  • Leen-Kiat Soh

Open agent systems are multiagent systems in which one or more agents may leave the system at any time, possibly resuming after some interval, and in which new agents may also join. Planning in such systems becomes challenging in the absence of inter-agent communication because agents must predict if others have left the system or new agents are now present to decide on possibly choosing a different line of action. In this paper, we prioritize open systems where agents of differing types may leave and possibly reenter, but new agents do not join. With the help of a realistic domain – wildfire suppression – we motivate the need for individual planning in open environments and present a first approach for robust decision-theoretic planning in such multiagent systems. Evaluations in domain simulations clearly demonstrate the improved performance compared to previous methods that disregard the openness.

IJCAI Conference 2016 Conference Paper

Sum-Product-Max Networks for Tractable Decision Making

  • Mazen Melibari
  • Pascal Poupart
  • Prashant Doshi

Investigations into probabilistic graphical models for decision making have predominantly centered on influence diagrams (IDs) and decision circuits (DCs) for representation and computation of decision rules that maximize expected utility. Since IDs are typically handcrafted and DCs are compiled from IDs, in this paper we propose an approach to learn the structure and parameters of decision-making problems directly from data. We present a new representation called sum-product-max network (SPMN) that generalizes a sum-product network (SPN) to the class of decision-making problems and whose solution, analogous to DCs, scales linearly in the size of the network. We show that SPMNs may be reduced to DCs linearly and present a first method for learning SPMNs from data. This approach is significant because it facilitates a novel paradigm of tractable decision making driven by data.
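Why the solution scales linearly in network size can be seen in a toy bottom-up pass over sum, product, and max nodes; the tuple encoding of nodes here is illustrative, not the paper's representation:

```python
# Sketch of SPMN-style evaluation: one bottom-up pass visits each node once,
# so computing the maximum expected utility (MEU) is linear in network size.
# Node encoding (nested tuples) is a hypothetical stand-in.

def meu(node):
    kind = node[0]
    if kind == "leaf":                      # ("leaf", utility)
        return node[1]
    if kind == "sum":                       # ("sum", [(weight, child), ...]) — chance
        return sum(w * meu(c) for w, c in node[1])
    if kind == "prod":                      # ("prod", [child, ...]) — independent factors
        out = 1.0
        for c in node[1]:
            out *= meu(c)
        return out
    if kind == "max":                       # ("max", [child, ...]) — decision node
        return max(meu(c) for c in node[1])

# Toy network: a decision (max) over two chance (sum) nodes.
net = ("max", [
    ("sum", [(0.7, ("leaf", 10.0)), (0.3, ("leaf", -100.0))]),
    ("sum", [(0.5, ("leaf", 4.0)),  (0.5, ("leaf", 4.0))]),
])
print(meu(net))   # → 4.0  (the safe branch beats 0.7*10 + 0.3*(-100) = -23)
```

The max nodes also record which child attained the maximum, which is how a decision rule, rather than just a value, is read off the same pass.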

ICAPS Conference 2015 Conference Paper

Individual Planning in Agent Populations: Exploiting Anonymity and Frame-Action Hypergraphs

  • Ekhlas Sonu
  • Yingke Chen
  • Prashant Doshi

Interactive partially observable Markov decision processes (I-POMDP) provide a formal framework for planning for a self-interested agent in multiagent settings. An agent operating in a multiagent environment must deliberate about the actions that other agents may take and the effect these actions have on the environment and the rewards it receives. Traditional I-POMDPs model this dependence on the actions of other agents using joint action and model spaces. Therefore, the solution complexity grows exponentially with the number of agents thereby complicating scalability. In this paper, we model and extend anonymity and context-specific independence — problem structures often present in agent populations — for computational gain. We empirically demonstrate the efficiency from exploiting these problem structures by solving a new multiagent problem involving more than 1,000 agents.
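The gain from anonymity can be illustrated with a quick count: when the agent's value depends only on how many peers take each frame-action pair, not on who takes it, the space of joint behaviors collapses from exponential to polynomial in the population size. The stars-and-bars count below is standard combinatorics, used here purely as an illustration:

```python
from math import comb

# With N anonymous peers and k frame-action pairs, joint actions number k**N,
# but distinct count vectors ("configurations") number C(N + k - 1, k - 1),
# which is polynomial in N for fixed k.

def num_configurations(n_agents, k):
    return comb(n_agents + k - 1, k - 1)

print(num_configurations(1000, 3))                    # → 501501
print(3 ** 1000 > num_configurations(1000, 3))        # → True (exponential vs polynomial)
```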

NeurIPS Conference 2015 Conference Paper

Individual Planning in Infinite-Horizon Multiagent Settings: Inference, Structure and Scalability

  • Xia Qu
  • Prashant Doshi

This paper provides the first formalization of self-interested planning in multiagent settings using expectation-maximization (EM). Our formalization in the context of infinite-horizon and finitely-nested interactive POMDPs (I-POMDP) is distinct from EM formulations for POMDPs and cooperative multiagent planning frameworks. We exploit the graphical model structure specific to I-POMDPs, and present a new approach based on block-coordinate descent for further speed up. Forward filtering-backward sampling -- a combination of exact filtering with sampling -- is explored to exploit problem structure.

IROS Conference 2015 Conference Paper

Localization and tracking under extreme and persistent sensory occlusion

  • Kedar Marathe
  • Prashant Doshi

We focus on a mobile robot that must keep itself localized while closely following another robot or human. This problem has many real-world applications including that of a co-bot engaged in a follow-the-leader behavior or a robot that is participating in a convoy. If the robot is expected to eventually break away and reach its own goal, then the robot must stay self-localized. A key challenge for localization while tailing another is the extreme and persistent occlusion of the robot's sensors by the dynamic obstacle in front of it that is not modeled in its map. Current Monte Carlo localization (MCL) methods use sensor models with random noise, which are inadequate under such occlusion. We utilize a particle filter that simultaneously tracks the subject robot and the leader. We introduce novel particle weighting and adaptive sampling schemes that significantly improve the follower's localization. The result is a robust and adaptive MCL for applications involving persistent occlusion.

IJCAI Conference 2015 Conference Paper

Toward Estimating Others' Transition Models Under Occlusion for Multi-Robot IRL

  • Kenneth Bogert
  • Prashant Doshi

Multi-robot inverse reinforcement learning (mIRL) is broadly useful for learning, from observations, the behaviors of multiple robots executing fixed trajectories and interacting with each other. In this paper, we relax a crucial assumption in IRL to make it better suited for wider robotic applications: we allow the transition functions of other robots to be stochastic and do not assume that the transition error probabilities are known to the learner. Challenged by occlusion where large portions of others’ state spaces are fully hidden, we present a new approach that maps stochastic transitions to distributions over features. Then, the underconstrained problem is solved using nonlinear optimization that maximizes entropy to learn the transition function of each robot from occluded observations. Our methods represent significant and first steps toward making mIRL pragmatic.

JAAMAS Journal 2014 Journal Article

Scalable solutions of interactive POMDPs using generalized and bounded policy iteration

  • Ekhlas Sonu
  • Prashant Doshi

Policy iteration algorithms for partially observable Markov decision processes (POMDPs) offer the benefits of quicker convergence compared to value iteration and the ability to operate directly on the solution, which usually takes the form of a finite state automaton. However, the finite state controller tends to grow quickly in size across iterations, making its evaluation and improvement computationally costly. Bounded policy iteration provides a way of keeping the controller size fixed while improving it monotonically until convergence, although it is susceptible to getting trapped in local optima. Despite these limitations, policy iteration algorithms are viable alternatives to value iteration, and allow POMDPs to scale. In this article, we generalize the bounded policy iteration technique to problems involving multiple agents. Specifically, we show how we may perform bounded policy iteration with anytime behavior in settings formalized by the interactive POMDP framework, which generalizes POMDPs to non-stationary contexts shared with multiple other agents. Although policy iteration has been extended to decentralized POMDPs, the context there is strictly cooperative. Its novel generalization in this article makes it useful in non-cooperative settings as well. As interactive POMDPs involve modeling other agents sharing the environment, we ascribe controllers to predict others' actions, with the benefit that the controllers compactly represent the model space. We show how we may exploit the agent's initial belief, often available, toward further improving the controller, particularly in large domains, though at the expense of increased computation, which we compensate for.
We extensively evaluate the approach on multiple problem domains with some that are significantly large in their dimensions, and in contexts with uncertainty about the other agent’s frames and those involving multiple other agents, and demonstrate its properties and scalability.

IJCAI Conference 2013 Conference Paper

Bimodal Switching for Online Planning in Multiagent Settings

  • Ekhlas Sonu
  • Prashant Doshi

We present a bimodal method for online planning in partially observable multiagent settings as formalized by a finitely-nested interactive partially observable Markov decision process (I-POMDP). An agent planning in an environment shared with another updates beliefs both over the physical state and the other agent's models. In problems where we do not observe the other's action explicitly but must infer it from sensing its effect on the state, observations are more informative about the other when the belief over the state space has reduced uncertainty. For typical, uncertain initial beliefs, we model the agent as if it were acting alone and utilize fast online planning for POMDPs. Subsequently, the agent switches to online planning in multiagent settings. We maintain tight lower and upper bounds at each step, and switch over when the difference between them falls below a threshold.
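The switching criterion itself is simple to state; the following is a minimal sketch under the assumption that the planner exposes scalar value bounds, with the threshold and mode names invented for illustration.

```python
def choose_mode(lower_bound, upper_bound, eps=0.05):
    """Bimodal switching rule (sketch): while the gap between the value
    bounds is still large, plan as if acting alone ('pomdp' mode); once
    the bounds tighten to within eps, switch to full multiagent planning
    ('ipomdp' mode). eps and the mode names are illustrative placeholders."""
    return "pomdp" if upper_bound - lower_bound >= eps else "ipomdp"
```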

AAMAS Conference 2012 Conference Paper

GaTAC: A Scalable and Realistic Testbed for Multiagent Decision Making

  • Ekhlas Sonu
  • Prashant Doshi

In an attempt to bridge the gap between theoretical advances in multiagent decision-making algorithms and their application in real-world scenarios, we present the Georgia testbed for autonomous control of vehicles (GaTAC). GaTAC provides a low-cost, open-source and flexible environment for realistically simulating and evaluating policies generated by multiagent decision-making algorithms in real-world problem domains pertaining to the control of autonomous uninhabited aerial vehicles (AUAVs). We describe GaTAC in detail and demonstrate how it can be used to simulate an example AUAV problem. We expect GaTAC to facilitate the development and evaluation of scalable decision-making algorithms with results that have immediate practical implications.

AAMAS Conference 2012 Conference Paper

Generalized and Bounded Policy Iteration for Finitely-Nested Interactive POMDPs: Scaling Up

  • Ekhlas Sonu
  • Prashant Doshi

Policy iteration algorithms for partially observable Markov decision processes (POMDPs) offer the benefits of quick convergence and the ability to operate directly on the solution, which usually takes the form of a finite state controller. However, the controller tends to grow quickly in size across iterations, making its evaluation and improvement costly. Bounded policy iteration provides a way of keeping the controller size fixed while improving it monotonically until convergence, although it is susceptible to getting trapped in local optima. Despite these limitations, policy iteration algorithms are viable alternatives to value iteration. In this paper, we generalize the bounded policy iteration technique to problems involving multiple agents. Specifically, we show how we may perform policy iteration in settings formalized by the interactive POMDP framework. Although policy iteration has been extended to decentralized POMDPs, the context there is strictly cooperative. Its generalization here makes it useful in non-cooperative settings as well. As interactive POMDPs involve modeling others, we ascribe nested controllers to predict others' actions, with the benefit that the controllers compactly represent the model space. We evaluate our approach on multiple problem domains, and demonstrate its properties and scalability.

AAAI Conference 2012 Conference Paper

Improved Convergence of Iterative Ontology Alignment using Block-Coordinate Descent

  • Uthayasanker Thayasivam
  • Prashant Doshi

A wealth of ontologies, many of which overlap in their scope, has made aligning ontologies an important problem for the semantic Web. Consequently, several algorithms now exist for automatically aligning ontologies, with mixed success in their performances. Crucial challenges for these algorithms involve scaling to large ontologies, and as applications of ontology alignment evolve, performing the alignment in a reasonable amount of time without compromising on the quality of the alignment. A class of alignment algorithms is iterative and often consumes more time than others while delivering solutions of high quality. We present a novel and general approach for speeding up the multivariable optimization process utilized by these algorithms. Specifically, we use the technique of block-coordinate descent in order to possibly improve the speed of convergence of the iterative alignment techniques. We integrate this approach into three well-known alignment systems and show that the enhanced systems generate similar or improved alignments in significantly less time on a comprehensive testbed of ontology pairs. This represents an important step toward making alignment techniques computationally more feasible.
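Block-coordinate descent in its generic form alternately minimizes over one block of variables while holding the others fixed. A minimal sketch on a made-up two-block quadratic (a stand-in for an alignment objective, not any of the three systems' actual optimization):

```python
def bcd_quadratic(iters=50):
    """Block-coordinate descent on f(x, y) = (x-1)^2 + (y+2)^2 + 0.5*x*y:
    alternately minimize exactly over one block while the other is fixed.
    The objective is an illustrative stand-in, not an alignment objective."""
    x, y = 0.0, 0.0
    for _ in range(iters):
        x = 1.0 - 0.25 * y   # argmin over x: solve df/dx = 2(x-1) + 0.5y = 0
        y = -2.0 - 0.25 * x  # argmin over y: solve df/dy = 2(y+2) + 0.5x = 0
    return x, y
```

Because each block update is an exact minimization, the objective decreases monotonically, which is the property the paper leverages to speed convergence of iterative aligners.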

AAMAS Conference 2012 Conference Paper

Modeling Deep Strategic Reasoning by Humans in Competitive Games

  • Xia Qu
  • Prashant Doshi
  • Adam Goodie

Prior literature on strategic reasoning by humans, of the sort "what do you think that I think that you think," holds that humans generally do not reason beyond a single level. However, recent evidence suggests that if games are made competitive and therefore representationally simpler, humans exhibit behavior that is more consistent with deeper levels of recursive reasoning. We seek to computationally model behavioral data that is consistent with deep recursive reasoning in competitive games. We use generative, process models built from agent frameworks that simulate the observed data well and also exhibit psychological intuition.

AAMAS Conference 2011 Conference Paper

Approximating Behavioral Equivalence of Models Using Top-K Policy Paths

  • Yifeng Zeng
  • Yingke Chen
  • Prashant Doshi

Decision making and game play in multiagent settings must often contend with behavioral models of other agents in order to predict their actions. One approach that reduces the complexity of the unconstrained model space is to group models that tend to be behaviorally equivalent. In this paper, we seek to further compress the model space by introducing an approximate measure of behavioral equivalence and using it to group models.

AAMAS Conference 2011 Conference Paper

Identifying and Exploiting Weak-Information Inducing Actions in Solving POMDPs

  • Ekhlas Sonu
  • Prashant Doshi

We present a method for identifying actions that lead to observations which are only weakly informative in the context of partially observable Markov decision processes (POMDP). We call such actions weak- (inclusive of zero-) information inducing. Policy subtrees rooted at these actions may be computed more efficiently. While zero-information inducing actions may be exploited without error, we show that the error due to the quicker backup for weak but non-zero information inducing actions is bounded. We demonstrate the substantial computational savings that exploiting such actions may bring to exact and approximate solutions of POMDPs.

AAAI Conference 2011 Conference Paper

Utilizing Partial Policies for Identifying Equivalence of Behavioral Models

  • Yifeng Zeng
  • Prashant Doshi
  • Yinghui Pan
  • Hua Mao
  • Muthukumaran Chandrasekaran
  • Jian Luo

We present a novel approach for identifying exact and approximate behavioral equivalence between models of agents. This is significant because both decision making and game play in multiagent settings must contend with behavioral models of other agents in order to predict their actions. One approach that reduces the complexity of the model space is to group models that are behaviorally equivalent. Identifying equivalence between models requires solving them and comparing entire policy trees. Because the trees grow exponentially with the horizon, our approach is to focus on partial policy trees for comparison and determining the distance between updated beliefs at the leaves of the trees. We propose a principled way to determine how much of the policy trees to consider, which trades off solution quality for efficiency. We investigate this approach in the context of the interactive dynamic influence diagram and evaluate its performance.

AAMAS Conference 2010 Conference Paper

Modeling Recursive Reasoning by Humans Using Empirically Informed Interactive POMDPs

  • Prashant Doshi
  • Xia Qu
  • Adam Goodie
  • Diana Young

Recursive reasoning of the form "what do I think that you think that I think" (and so on) arises often while acting rationally in multiagent settings. Several multiagent decision-making frameworks, such as RMM, I-POMDP and the theory of mind, model recursive reasoning as integral to an agent's rational choice. Real-world application settings for multiagent decision making are often mixed, involving humans and human-controlled agents. In two large experiments, we studied the level of recursive reasoning generally displayed by humans while playing sequential general-sum and fixed-sum, two-player games. Our results show that subjects experiencing a general-sum strategic game display the first or second level of recursive thinking, with the first level being more prominent. However, if the game is made simpler and more competitive with fixed-sum payoffs, subjects predominantly attribute first-level recursive thinking to opponents, thereby acting using the second level of reasoning. Subsequently, we model the behavioral data obtained from the studies using the I-POMDP framework, appropriately augmented with well-known human judgment and decision models. The accuracy of the predictions by our models suggests that these could be viable ways of computationally modeling strategic behavioral data.

IJCAI Conference 2009 Conference Paper

Speeding Up Exact Solutions of Interactive Dynamic Influence Diagrams Using Action Equivalence

  • Yifeng Zeng
  • Prashant Doshi

Interactive dynamic influence diagrams (I-DIDs) are graphical models for sequential decision making in partially observable settings shared by other agents. Algorithms for solving I-DIDs face the challenge of an exponentially growing space of candidate models ascribed to other agents, over time. The previous approach for exactly solving I-DIDs groups together models having similar solutions into behaviorally equivalent classes and updates these classes. We present a new method that, in addition to aggregating behaviorally equivalent models, further groups models that prescribe identical actions at a single time step. We show how to update these augmented classes and prove that our method is exact. The new approach enables us to bound the aggregated model space by the cardinality of other agents' actions. We evaluate its performance and provide empirical results in support.

AAMAS Conference 2009 Conference Paper

Improved Approximation of Interactive Dynamic Influence Diagrams Using Discriminative Model Updates

  • Prashant Doshi
  • Yifeng Zeng

Interactive dynamic influence diagrams (I-DIDs) are graphical models for sequential decision making in uncertain settings shared by other agents. Algorithms for solving I-DIDs face the challenge of an exponentially growing space of candidate models ascribed to other agents, over time. We formalize the concept of a minimal model set, which facilitates qualitative comparisons between different approximation techniques. We then present a new approximation technique that minimizes the space of candidate models by discriminating between model updates. We empirically demonstrate that our approach significantly outperforms the previous clustering-based approximation technique.

AAAI Conference 2008 Conference Paper

Generalized Point Based Value Iteration for Interactive POMDPs

  • Prashant Doshi

We develop a point based method for solving finitely nested interactive POMDPs approximately. Analogously to point based value iteration (PBVI) in POMDPs, we maintain a set of belief points and form value functions composed of those value vectors that are optimal at these points. However, as we focus on multiagent settings, the beliefs are nested and computation of the value vectors relies on predicted actions of others. Consequently, we develop a novel interactive generalization of PBVI applicable to multiagent settings.
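The single-agent PBVI backup this paper generalizes can be sketched on a tiny POMDP. The two-state model below is entirely made up for illustration; the interactive generalization additionally conditions the backup on predicted actions of the other agents' models, which this sketch omits.

```python
import numpy as np

# Made-up two-state, two-action, two-observation POMDP for illustration.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],   # T[a][s][s']
              [[0.5, 0.5], [0.5, 0.5]]])
Z = np.array([[[0.8, 0.2], [0.2, 0.8]],   # Z[a][s'][o]
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0], [0.0, 1.0]])    # R[a][s]
gamma = 0.9

def pbvi_backup(beliefs, alphas):
    """One point-based backup: for each belief point, keep only the
    alpha-vector that is optimal at that point."""
    new = []
    for b in beliefs:
        best_vec, best_val = None, -np.inf
        for a in range(len(R)):
            vec = R[a].astype(float).copy()
            for o in range(Z.shape[2]):
                # g[s] = sum_s' T[a][s][s'] * Z[a][s'][o] * alpha[s']
                cands = [(T[a] * Z[a][:, o][None, :]) @ al for al in alphas]
                vec = vec + gamma * max(cands, key=lambda g: float(b @ g))
            if float(b @ vec) > best_val:
                best_vec, best_val = vec, float(b @ vec)
        new.append(best_vec)
    return new
```

Repeating the backup from a zero value function yields a monotonically improving, point-wise optimal set of vectors whose size stays bounded by the number of belief points.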

JAAMAS Journal 2008 Journal Article

Graphical models for interactive POMDPs: representations and solutions

  • Prashant Doshi
  • Yifeng Zeng
  • Qiongyu Chen

We develop new graphical representations for the problem of sequential decision making in partially observable multiagent environments, as formalized by interactive partially observable Markov decision processes (I-POMDPs). The graphical models called interactive influence diagrams (I-IDs) and their dynamic counterparts, interactive dynamic influence diagrams (I-DIDs), seek to explicitly model the structure that is often present in real-world problems by decomposing the situation into chance and decision variables, and the dependencies between the variables. I-DIDs generalize DIDs, which may be viewed as graphical representations of POMDPs, to multiagent settings in the same way that I-POMDPs generalize POMDPs. I-DIDs may be used to compute the policy of an agent given its belief as the agent acts and observes in a setting that is populated by other interacting agents. Using several examples, we show how I-IDs and I-DIDs may be applied and demonstrate their usefulness. We also show how the models may be solved using the standard algorithms that are applicable to DIDs. Solving I-DIDs exactly involves knowing the solutions of possible models of the other agents. The space of models grows exponentially with the number of time steps. We present a method of solving I-DIDs approximately by limiting the number of other agents’ candidate models at each time step to a constant. We do this by clustering models that are likely to be behaviorally equivalent and selecting a representative set from the clusters. We discuss the error bound of the approximation technique and demonstrate its empirical performance.

AAMAS Conference 2007 Conference Paper

Approximate State Estimation in Multiagent Settings with Continuous or Large Discrete State Spaces

  • Prashant Doshi

We present a new method for carrying out state estimation in multiagent settings that are characterized by continuous or large discrete state spaces. State estimation in multiagent settings involves updating an agent's belief over the physical states and the space of other agents' models. We factor out the models of the other agents and update the agent's belief over these models, as exactly as possible. Simultaneously, we sample particles from the distribution over the large physical state space and project the particles in time.

AAMAS Conference 2007 Conference Paper

Graphical Models for Online Solutions to Interactive POMDPs

  • Prashant Doshi
  • Yifeng Zeng
  • Qiongyu Chen

We develop a new graphical representation for interactive partially observable Markov decision processes (I-POMDPs) that is significantly more transparent and semantically clear than the previous representation. These graphical models called interactive dynamic influence diagrams (I-DIDs) seek to explicitly model the structure that is often present in real-world problems by decomposing the situation into chance and decision variables, and the dependencies between the variables. I-DIDs generalize DIDs, which may be viewed as graphical representations of POMDPs, to multiagent settings in the same way that I-POMDPs generalize POMDPs. I-DIDs may be used to compute the policy of an agent online as the agent acts and observes in a setting that is populated by other interacting agents. Using several examples, we show how I-DIDs may be applied and demonstrate their usefulness.

AAAI Conference 2007 Conference Paper

Improved State Estimation in Multiagent Settings with Continuous or Large Discrete State Spaces

  • Prashant Doshi

State estimation in multiagent settings involves updating an agent’s belief over the physical states and the space of other agents’ models. Performance of the previous approach to state estimation, the interactive particle filter, degrades with large state spaces because it distributes the particles over both the physical state space and the other agents’ models. We present an improved method for estimating the state in a class of multiagent settings that are characterized in part by continuous or large discrete state spaces. We factor out the models of the other agents and update the agent’s belief over these models, as exactly as possible. Simultaneously, we sample particles from the distribution over the large physical state space and project the particles in time. This approach is equivalent to Rao-Blackwellising the interactive particle filter. We focus our analysis on the special class of problems where the nested beliefs are represented using Gaussians, the problem dynamics using conditional linear Gaussians (CLGs) and the observation functions using softmax or CLGs. These distributions adequately represent many realistic applications.

AAAI Conference 2006 Conference Paper

Inexact Matching of Ontology Graphs Using Expectation-Maximization

  • Prashant Doshi

We present a new method for mapping ontology schemas that address similar domains. The problem of ontology mapping is crucial since we are witnessing a decentralized development and publication of ontological data. We formulate the problem of inferring a match between two ontologies as a maximum likelihood problem, and solve it using the technique of expectation-maximization (EM). Specifically, we adopt directed graphs as our model for ontologies and use a generalized version of EM to arrive at a mapping between the nodes of the graphs. We exploit the structural and lexical similarity between the graphs, and improve on previous approaches by generating a many-one correspondence between the concept nodes. We provide preliminary experimental results in support of our method and outline its limitations.

AAAI Conference 2006 Conference Paper

On the Difficulty of Achieving Equilibrium in Interactive POMDPs

  • Prashant Doshi

We analyze the asymptotic behavior of agents engaged in an infinite horizon partially observable stochastic game as formalized by the interactive POMDP framework. We show that when agents’ initial beliefs satisfy a truth compatibility condition, their behavior converges to a subjective ε-equilibrium in finite time, and to subjective equilibrium in the limit. This result is a generalization of a similar result in repeated games, to partially observable stochastic games. However, it turns out that the equilibrating process is difficult to demonstrate computationally because of the difficulty in coming up with initial beliefs that are both natural and satisfy the truth compatibility condition. Our results, therefore, shed some negative light on using equilibria as a solution concept for decision making in partially observable stochastic games.

AAAI Conference 2005 Conference Paper

A Particle Filtering Based Approach to Approximating Interactive POMDPs

  • Prashant Doshi

POMDPs provide a principled framework for sequential planning in single agent settings. An extension of POMDPs to multiagent settings, called interactive POMDPs (I-POMDPs), replaces POMDP belief spaces with interactive hierarchical belief systems which represent an agent’s belief about the physical world, about beliefs of the other agent(s), about their beliefs about others’ beliefs, and so on. This modification makes the difficulties of obtaining solutions due to complexity of the belief and policy spaces even more acute. We describe a method for obtaining approximate solutions to I-POMDPs based on particle filtering (PF). We utilize the interactive PF which descends the levels of interactive belief hierarchies and samples and propagates beliefs at each level. The interactive PF is able to deal with the belief space complexity, but it does not address the policy space complexity. We provide experimental results and chart future work.