Arrow Research

Author name cluster

Bikramjit Banerjee

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

37 papers
2 author rows

Possible papers (37)

UAI Conference 2025 Conference Paper

Adaptive Human-Robot Collaboration using Type-Based IRL

  • Prasanth Sengadu Suresh
  • Prashant Doshi
  • Bikramjit Banerjee

Human-robot collaboration (HRC) integrates the consistency and precision of robotic systems with the dexterity and cognitive abilities of humans to create synergy. However, human performance may degrade due to various factors (e.g., fatigue, trust) that can manifest unpredictably, typically resulting in diminished output and reduced quality. To address this challenge toward successful HRCs, we present a human-aware approach to collaboration using a novel multi-agent decision-making framework. Type-based decentralized Markov decision processes (TB-DecMDP) additionally model latent, causal decision-making factors influencing agent behavior (e.g., fatigue), leading to dynamic agent types. In this framework, agents can switch between types and each maintains a belief about others’ current type based on observed actions while aiming to achieve a shared objective. We introduce a new inverse reinforcement learning (IRL) algorithm, TB-DecAIRL, which uses TB-DecMDP to model complex HRCs. TB-DecAIRL learns a type-contingent reward function and corresponding vector of policies from team demonstrations. Our evaluations in a realistic HRC problem setting establish that modeling human types in TB-DecAIRL improves robot behavior over the default of ignoring human factors, by increasing throughput in a human-robot produce sorting task.
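
The belief maintenance over teammates' latent types that this abstract describes can be illustrated with a small Bayesian filter. The sketch below is a minimal rendition under assumed, hypothetical quantities (a known type-conditioned policy and type-switching matrix); TB-DecAIRL itself learns rewards and policies, which this sketch does not attempt.

```python
import numpy as np

# Minimal sketch (not TB-DecAIRL): Bayesian filtering of a teammate's
# latent type, e.g., "fresh" vs. "fatigued". The type-conditioned policy
# and the type-switching matrix below are hypothetical stand-ins.

def update_type_belief(belief, observed_action, policy, type_transition):
    """Predict the type switch, then reweight by the likelihood of the
    observed action under each candidate type."""
    predicted = type_transition @ belief          # P(type') before observing
    likelihood = policy[:, observed_action]       # P(a_obs | type')
    posterior = likelihood * predicted
    return posterior / posterior.sum()

policy = np.array([[0.6, 0.3, 0.1],               # type 0 ("fresh")
                   [0.1, 0.2, 0.7]])              # type 1 ("fatigued")
type_transition = np.array([[0.9, 0.05],          # column j: previous type j
                            [0.1, 0.95]])
belief = np.array([0.5, 0.5])
for a in (2, 2, 1):                               # repeatedly observing slow actions
    belief = update_type_belief(belief, a, policy, type_transition)
print(belief)                                     # mass shifts toward "fatigued"
```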

TMLR Journal 2025 Journal Article

Quasipseudometric Value Functions with Dense Rewards

  • Khadichabonu Valieva
  • Bikramjit Banerjee

As a generalization of reinforcement learning (RL) to parametrizable goals, goal conditioned RL (GCRL) has a broad range of applications, particularly in challenging tasks in robotics. Recent work has established that the optimal value function of GCRL, $Q^\ast(s, a, g)$, has a quasipseudometric structure, leading to targeted neural architectures that respect such structure. However, the relevant analyses assume a sparse reward setting—a known aggravating factor to sample complexity. We show that the key property underpinning a quasipseudometric, viz., the triangle inequality, is preserved under a dense reward setting as well, and we specifically identify the key condition necessary for the triangle inequality. Contrary to earlier findings where dense rewards were shown to be detrimental to GCRL, we conjecture that dense reward functions that satisfy this condition can only improve, never worsen, sample complexity. We evaluate this proposal in 12 standard benchmark environments in GCRL featuring challenging continuous control tasks. Our empirical results confirm that training a quasipseudometric value function in our dense reward setting indeed either improves upon, or preserves, the sample complexity of training with sparse rewards. This opens up opportunities to train efficient neural architectures with dense rewards, compounding their benefits to sample complexity.
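
For readers unfamiliar with the term: a quasipseudometric satisfies the identity and triangle-inequality axioms of a metric but drops symmetry. In generic notation (ours, not quoted from the paper):

```latex
d(x, x) = 0, \qquad d(x, z) \,\le\, d(x, y) + d(y, z) \quad \forall\, x, y, z,
\qquad \text{with } d(x, y) \ne d(y, x) \text{ allowed.}
```

Under the common cost-to-go reading $d(s, g) = -V^\ast(s, g)$ used in sparse-reward GCRL, the triangle inequality is exactly the statement that detouring through a waypoint can never beat the direct optimal value; the paper's contribution is a condition under which this survives dense rewards.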

JAAMAS Journal 2024 Journal Article

Modeling and reinforcement learning in partially observable many-agent systems

  • Keyang He
  • Prashant Doshi
  • Bikramjit Banerjee

There is a prevalence of multiagent reinforcement learning (MARL) methods that engage in centralized training. These methods rely on all the agents sharing various types of information, such as their actions or gradients, with a centralized trainer or each other during the learning. Subsequently, the methods produce agent policies whose prescriptions and performance are contingent on other agents engaging in behavior assumed by the centralized training. But, in many contexts, such as mixed or adversarial settings, this assumption may not be feasible. In this article, we present a new line of methods that relaxes this assumption and engages in decentralized training resulting in the agent’s individual policy. The interactive advantage actor-critic (IA2C) maintains and updates beliefs over other agents’ candidate behaviors based on (noisy) observations, thus enabling learning at the agent’s own level. We also address MARL’s prohibitive curse of dimensionality due to the presence of many agents in the system. Under assumptions of action anonymity and population homogeneity, often exhibited in practice, large numbers of other agents can be modeled aggregately by the count vectors of their actions instead of individual agent models. More importantly, we may model the distribution of these vectors and its update using the Dirichlet-multinomial model, which offers an elegant way to scale IA2C to many-agent systems. We evaluate the performance of the fully decentralized IA2C along with other known baselines on a novel Organization domain, which we introduce, and on instances of two existing domains. Experimental comparisons with prominent and recent baselines show that IA2C is more sample efficient, more robust to noise, and can scale to learning in systems with up to a hundred agents.
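
The Dirichlet-multinomial aggregation of many agents is the part that admits a compact illustration. Below is a generic conjugate-update sketch under action anonymity (our simplification; the class name is hypothetical and the full IA2C belief machinery is not shown):

```python
import numpy as np

# Generic sketch: under action anonymity, many other agents are summarized
# by a count vector over the K actions. A Dirichlet prior over the action
# distribution is conjugate to these multinomial counts.

class DirichletActionModel:
    def __init__(self, num_actions, prior=1.0):
        self.alpha = np.full(num_actions, prior)   # Dirichlet pseudo-counts

    def update(self, observed_counts):
        self.alpha += observed_counts              # conjugate posterior update

    def expected_action_distribution(self):
        return self.alpha / self.alpha.sum()

model = DirichletActionModel(num_actions=3)
model.update(np.array([70, 20, 10]))               # counts from 100 anonymous agents
print(model.expected_action_distribution())        # approx. [0.69, 0.20, 0.11]
```

The point of the aggregation is that the belief's size is fixed by the number of actions, not the number of agents, which is what lets IA2C-style learning scale.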

KER Journal 2024 Journal Article

Reinforcement actor-critic learning as a rehearsal in MicroRTS

  • Shiron Manandhar
  • Bikramjit Banerjee

Real-time strategy (RTS) games have provided a fertile ground for AI research with notable recent successes based on deep reinforcement learning (RL). However, RL remains a data-hungry approach featuring a high sample complexity. In this paper, we focus on a sample complexity reduction technique called reinforcement learning as a rehearsal (RLaR) and on the RTS game of MicroRTS to formulate and evaluate it. RLaR has been formulated in the context of action-value function based RL before. Here, we formulate it for a different RL framework, called actor-critic RL. We show that on the one hand the actor-critic framework allows RLaR to be much simpler, but on the other hand, it leaves room for a key component of RLaR: a prediction function that relates a learner’s observations with that of its opponent. This function, when leveraged for exploration, accelerates RL as our experiments in MicroRTS show. Further experiments provide evidence that RLaR may reduce actor noise compared to a variant that does not utilize RLaR’s exploration. This study provides the first evaluation of RLaR’s efficacy in a domain with a large strategy space.

UAI Conference 2022 Conference Paper

Reinforcement learning in many-agent settings under partial observability

  • Keyang He
  • Prashant Doshi
  • Bikramjit Banerjee

Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep RL, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity to a recently presented actor-critic based MARL algorithm, interactive A2C. We introduce a Dirichlet-multinomial model for maintaining beliefs over the agent population when agents’ actions are not perfectly observable. We show that the posterior is a mixture of Dirichlet distributions that we approximate as a single component for tractability. We also show that the prediction accuracy of this method increases with more agents. Finally we show empirically that our method can learn optimal behaviors in two recently introduced pragmatic domains with large agent population, and demonstrates robustness in partially observable environments.
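
When the counts themselves are only noisily observed, the posterior becomes a mixture of Dirichlets, which the paper approximates by a single component for tractability. A generic moment-matching collapse (our illustration; the paper's exact projection may differ) looks like:

```python
import numpy as np

# Illustrative collapse of a Dirichlet mixture to one component:
# match the mixture mean and reuse the weighted total concentration.
# This is a common generic approximation, not necessarily the paper's.

def collapse_dirichlet_mixture(weights, alphas):
    weights = np.asarray(weights, dtype=float)
    alphas = np.asarray(alphas, dtype=float)        # shape: (components, K)
    means = alphas / alphas.sum(axis=1, keepdims=True)
    mix_mean = weights @ means                      # mixture mean over actions
    concentration = weights @ alphas.sum(axis=1)    # weighted pseudo-count mass
    return concentration * mix_mean                 # single Dirichlet parameter

weights = [0.7, 0.3]                                # posterior mixture weights
alphas = [[8.0, 2.0, 2.0],                          # component pseudo-counts
          [2.0, 6.0, 4.0]]
print(collapse_dirichlet_mixture(weights, alphas))  # approx. [6.2, 3.2, 2.6]
```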

AAMAS Conference 2021 Conference Paper

Cooperative-Competitive Reinforcement Learning with History-Dependent Rewards

  • Keyang He
  • Bikramjit Banerjee
  • Prashant Doshi

Consider a typical organization whose worker agents seek to collectively cooperate for its general betterment. However, each individual agent simultaneously seeks to act to secure a larger chunk than its co-workers of the annual increment in compensation, which usually comes from a fixed pot. As such, the agents in an organization must cooperate and compete. Another feature of many organizations is that a worker receives a bonus, which is often a fraction of previous year’s total profit. As such, the agent derives a reward that is also partly dependent on historical performance. How should the individual agent decide to act in this context? Few methods for the mixed cooperative-competitive setting have been presented in recent years, but these are challenged by problem domains whose reward functions additionally depend on historical information. Recent deep multi-agent reinforcement learning (MARL) methods using long short-term memory (LSTM) may be used, but these adopt a joint perspective to the interaction or require explicit exchange of information among the agents to promote cooperation, which may not be possible under competition. In this paper, we first show that the agent’s decision-making problem can be modeled as an interactive partially observable Markov decision process (I-POMDP) that captures the dynamic of a history-dependent reward. We present an interactive advantage actor-critic method (IA2C+), which combines the independent advantage actor-critic network with a belief filter that maintains a belief distribution over other agents’ models. Empirical results show that IA2C+ learns the optimal policy faster and more robustly than several baselines.

ICRA Conference 2021 Conference Paper

Min-Max Entropy Inverse RL of Multiple Tasks

  • Saurabh Arora
  • Prashant Doshi
  • Bikramjit Banerjee

Multi-task IRL recognizes that expert(s) could be switching between multiple ways of solving the same problem, or interleaving demonstrations of multiple tasks. The learner aims to learn the reward functions that individually guide these distinct ways. We present a new method for multi-task IRL that generalizes the well-known maximum entropy approach by combining it with a Dirichlet process based minimum entropy clustering of the observed data. This yields a single nonlinear optimization problem, called MinMaxEnt Multi-task IRL (MME-MTIRL), which can be solved using the Lagrangian relaxation and gradient descent methods. We evaluate MME-MTIRL on the robotic task of sorting onions on a processing line where the expert utilizes multiple ways of detecting and removing blemished onions. The method is able to learn the underlying reward functions to a high level of accuracy and it improves on the previous approaches.
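
The single-task objective that MME-MTIRL generalizes is the familiar maximum entropy IRL model, in which trajectory probability is exponential in accumulated reward. In standard notation (ours, not the paper's multi-task formulation):

```latex
P(\tau \mid \theta) = \frac{\exp\!\big(\theta^{\top}\phi(\tau)\big)}{Z(\theta)},
\qquad
\theta^{\ast} = \arg\max_{\theta} \sum_{\tau \in \mathcal{D}} \log P(\tau \mid \theta),
```

where $\phi(\tau)$ are trajectory features and $Z(\theta)$ is the partition function. MME-MTIRL couples several such maximum entropy objectives through a Dirichlet process based, minimum entropy assignment of the demonstrations to tasks, yielding the single nonlinear program the abstract mentions.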

KER Journal 2020 Journal Article

Human–agent transfer from observations

  • Bikramjit Banerjee
  • Sneha Racharla

Learning from human demonstration (LfD), among many speedup techniques for reinforcement learning (RL), has seen many successful applications. We consider one LfD technique called human–agent transfer (HAT), where a model of the human demonstrator’s decision function is induced via supervised learning and used as an initial bias for RL. Some recent work in LfD has investigated learning from observations only, that is, when only the demonstrator’s states (and not its actions) are available to the learner. Since the demonstrator’s actions are treated as labels for HAT, supervised learning becomes untenable in their absence. We adapt the idea of learning an inverse dynamics model from the data acquired by the learner’s interactions with the environment and deploy it to fill in the missing actions of the demonstrator. The resulting version of HAT—called state-only HAT (SoHAT)—is experimentally shown to preserve some advantages of HAT in benchmark domains with both discrete and continuous actions. This paper also establishes principled modifications of an existing baseline algorithm—called A3C—to create its HAT and SoHAT variants that are used in our experiments.
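
The inverse dynamics trick at the heart of SoHAT is easy to sketch. The toy below is entirely our construction (a 1-D world and a scikit-learn classifier standing in for the learned model) and shows the three steps: fit a = f(s, s') from the learner's own interactions, label the demonstrator's state-only trace, then proceed as in ordinary HAT.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# (1) Inverse dynamics from the learner's own experience: in this toy
#     1-D world, action 1 moves right by 0.1 and action 0 moves left.
S = rng.uniform(0, 1, size=500)
A = rng.integers(0, 2, size=500)
S_next = S + np.where(A == 1, 0.1, -0.1) + rng.normal(0, 0.01, size=500)
inverse_dynamics = LogisticRegression().fit(np.column_stack([S, S_next]), A)

# (2) The demonstrator supplied states only; infer the missing actions.
demo_states = np.array([0.2, 0.3, 0.4, 0.5, 0.6])
pairs = np.column_stack([demo_states[:-1], demo_states[1:]])
inferred_actions = inverse_dynamics.predict(pairs)
print(inferred_actions)            # expected: [1 1 1 1] (moving right)

# (3) The (state, inferred action) pairs now play the role of labeled
#     demonstrations that seed HAT's supervised policy model.
```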

JAAMAS Journal 2020 Journal Article

I2RL: online inverse reinforcement learning under occlusion

  • Saurabh Arora
  • Prashant Doshi
  • Bikramjit Banerjee

Inverse reinforcement learning (IRL) is the problem of learning the preferences of an agent from observing its behavior on a task. It inverts RL which focuses on learning an agent’s behavior on a task based on the reward signals received. IRL is witnessing sustained attention due to promising applications in robotics, computer games, and finance, as well as in other sectors. Methods for IRL have, for the most part, focused on batch settings where the observed agent’s behavioral data has already been collected. However, the related problem of online IRL—where observations are incrementally accrued, yet the real-time demands of the application often prohibit a full rerun of an IRL method—has received significantly less attention. We introduce the first formal framework for online IRL, called incremental IRL (I2RL), which can serve as a common ground for online IRL methods. We demonstrate the usefulness of this framework by casting existing online IRL techniques into this framework. Importantly, we present a new method that advances maximum entropy IRL with hidden variables to the online setting. Our analysis shows that the new method has monotonically improving performance with more demonstration data as well as probabilistically bounded error, both under full and partial observability. Simulated and physical robot experiments in a multi-robot patrolling application situated in varied-sized worlds, which involves learning under high levels of occlusion, show a significantly improved performance of I2RL as compared to both batch IRL and an online imitation learning method.
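
The incremental character of I2RL can be conveyed by how demonstration statistics are maintained: a streaming update means a new session never forces a pass over old data. The sketch below is a deliberate simplification (a running mean of trajectory feature counts), not the paper's full method with hidden variables.

```python
import numpy as np

# Simplified I2RL-style bookkeeping: keep the empirical feature expectation
# of all demonstrations seen so far as a streaming mean, so each new session
# updates the IRL target statistic without revisiting earlier sessions.

class IncrementalDemoStats:
    def __init__(self, num_features):
        self.mean_features = np.zeros(num_features)
        self.num_trajectories = 0

    def add_session(self, trajectory_feature_counts):
        for phi in trajectory_feature_counts:
            self.num_trajectories += 1
            self.mean_features += (phi - self.mean_features) / self.num_trajectories
        return self.mean_features    # target for the next IRL refinement step

stats = IncrementalDemoStats(num_features=3)
stats.add_session([np.array([1.0, 0.0, 2.0])])
print(stats.add_session([np.array([3.0, 1.0, 0.0]),
                         np.array([2.0, 2.0, 1.0])]))   # -> [2. 1. 1.]
```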

AAAI Conference 2019 Conference Paper

Model-Free IRL Using Maximum Likelihood Estimation

  • Vinamra Jain
  • Prashant Doshi
  • Bikramjit Banerjee

AAMAS Conference 2019 Conference Paper

Online Inverse Reinforcement Learning Under Occlusion

  • Saurabh Arora
  • Prashant Doshi
  • Bikramjit Banerjee

Inverse reinforcement learning (IRL) is the problem of learning the preferences of an agent from observing its behavior on a task. While this problem is witnessing sustained attention, the related problem of online IRL – where the observations are incrementally accrued, yet the real-time demands of the application often prohibit a full rerun of an IRL method – has received much less attention. We introduce a formal framework for online IRL, called incremental IRL (I2RL), and a new method that advances maximum entropy IRL with hidden variables, to this setting. Our analysis shows that the new method has a monotonically improving performance with more demonstration data, as well as probabilistically bounded error, both under full and partial observability. Experiments in a simulated robotic application, which involves learning under occlusion, show the significantly improved performance of I2RL as compared to both batch IRL and an online imitation learning method.

KER Journal 2019 Journal Article

Team learning from human demonstration with coordination confidence

  • Bikramjit Banerjee
  • Syamala Vittanala
  • Matthew Edmund Taylor

Among an array of techniques proposed to speed-up reinforcement learning (RL), learning from human demonstration has a proven record of success. A related technique, called Human-Agent Transfer, and its confidence-based derivatives have been successfully applied to single-agent RL. This article investigates their application to collaborative multi-agent RL problems. We show that a first-cut extension may leave room for improvement in some domains, and propose a new algorithm called coordination confidence (CC). CC analyzes the difference in perspectives between a human demonstrator (global view) and the learning agents (local view) and informs the agents’ action choices when the difference is critical and simply following the human demonstration can lead to miscoordination. We conduct experiments in three domains to investigate the performance of CC in comparison with relevant baselines.

IROS Conference 2018 Conference Paper

Autonomous Acquisition of Behavior Trees for Robot Control

  • Bikramjit Banerjee

Behavior trees (BT) are a popular control architecture in the computer game industry, and have been more recently applied in robotics. One open question is how intelligent agents/robots can autonomously acquire their behavior trees for task level control. In contrast with existing approaches that either refine an initially given BT, or directly build the BT based on human feedback/demonstration, we leverage reinforcement learning (RL), which allows robots to autonomously learn control policies by repeated task interaction, albeit policies expressed in a language more difficult to interpret than BTs. The learned control policy is then converted to a behavior tree via our proposed decanonicalization algorithm. The feasibility of this idea is based on a proposed notion of canonical behavior trees (CBT). In particular, we show (1) that CBTs are sufficiently expressive to capture RL control policies, and (2) that RL can be independent of an optimal behavior permutation, despite the BT convention of left-to-right priority, thus obviating the need for a combinatorial search. Two evaluation domains help illustrate our approach.
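
The abstract does not reproduce the decanonicalization algorithm, but the direction of the conversion can be illustrated: a policy, viewed as condition-action rules, maps naturally onto a BT whose children are tried in left-to-right priority order. Everything below is a hypothetical illustration, not the paper's CBT construction.

```python
# Hypothetical illustration: compile condition-action rules into a
# fallback-style behavior tree whose children are tried left to right.

def policy_to_behavior_tree(rules):
    """rules: dict mapping a state-predicate name to an action name."""
    return ("fallback", list(rules.items()))

def tick(tree, state):
    """Return the first action whose condition holds in `state`."""
    _, children = tree
    for condition, action in children:
        if state.get(condition, False):
            return action
    return None                       # tree fails: no condition matched

rules = {"at_goal": "stop", "obstacle_ahead": "turn_left", "path_clear": "go"}
tree = policy_to_behavior_tree(rules)
print(tick(tree, {"obstacle_ahead": True}))   # -> turn_left
print(tick(tree, {"path_clear": True}))       # -> go
```

The paper's result (2) says the left-to-right ordering need not be searched over combinatorially; the sketch fixes an arbitrary order only for illustration.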

IS Journal 2017 Journal Article

Multirobot Systems

  • Tsz-Chiu Au
  • Bikramjit Banerjee
  • Prithviraj Dasgupta
  • Peter Stone

The guest editors describe the six articles appearing in this special issue on multirobot systems.

AAAI Conference 2016 Conference Paper

Detection of Plan Deviation in Multi-Agent Systems

  • Bikramjit Banerjee
  • Steven Loscalzo
  • Daniel Thompson

Plan monitoring in a collaborative multi-agent system requires an agent to not only monitor the execution of its own plan, but also to detect possible deviations or failures in the plan execution of its teammates. In domains featuring partial observability and uncertainty in the agents’ sensing and actuation, especially where communication among agents is sparse (as a part of a cost-minimized plan), plan monitoring can be a significant challenge. We design an Expectation Maximization (EM) based algorithm for detection of plan deviation of teammates in such a multi-agent system. However, a direct implementation of this algorithm is intractable, so we also design an alternative approach grounded on the agents’ plans, for tractability. We establish its equivalence to the intractable version, and evaluate these techniques in some challenging tasks.

TAAS Journal 2014 Journal Article

Reinforcement Learning of Informed Initial Policies for Decentralized Planning

  • Landon Kraemer
  • Bikramjit Banerjee

Decentralized partially observable Markov decision processes (Dec-POMDPs) offer a formal model for planning in cooperative multiagent systems where agents operate with noisy sensors and actuators, as well as local information. Prevalent solution techniques are centralized and model based—limitations that we address by distributed reinforcement learning (RL). We particularly favor alternate learning, where agents alternately learn best responses to each other, which appears to outperform concurrent RL. However, alternate learning requires an initial policy. We propose two principled approaches to generating informed initial policies: a naive approach that lays the foundation for a more sophisticated approach. We empirically demonstrate that the refined approach produces near-optimal solutions in many challenging benchmark settings, staking a claim to being an efficient (and realistic) approximate solver in its own right. Furthermore, alternate best response learning seeded with such policies quickly learns high-quality policies as well.

JAAMAS Journal 2014 Journal Article

The complexity of multi-agent plan recognition

  • Bikramjit Banerjee
  • Jeremy Lyle
  • Landon Kraemer

Multi-agent plan recognition (MAPR) seeks to identify the dynamic team structures and team plans from observations of the action sequences of a set of intelligent agents, based on a library of known team plans (plan library), and an evaluation function. It has important applications in decision support, team work, analyzing data from automated monitoring, surveillance, and intelligence analysis in general. We introduce a general model for MAPR that accommodates different representations of the plan library, and includes single agent plan recognition as a special case. Thus it provides an ideal substrate to investigate and contrast the complexities of single and multi-agent plan recognition. Using this model we generate theoretical insights on hardness, with practical implications. A key feature of these results is that they are baseline, i.e., the polynomial solvability results are given in terms of a compact and expressive plan language (context free language), while the hardness results are given in terms of a less compact language. Consequently the hardness results continue to hold in virtually all realistic plan languages, while the polynomial solvability results extend to the subsets of the context free plan language. In particular, we show that MAPR is in P (polynomial in the size of the plan library and the observation trace) if the number of agents is fixed (in particular 1) but NP-complete otherwise. If the number of agents is a variable, then even the one step MAPR problem is NP-complete. While these results pertain to abduction, we also investigate a related question: adaptation, i.e., the problem of refining the evaluation function based on feedback. We show that adaptation is also NP-hard for a variable number of agents, but easy for a single agent. These results establish a clear distinction between the hardnesses of single and multi-agent plan recognition even in idealized settings, indicating the necessity of a fundamentally different set of techniques for the latter.

AAMAS Conference 2013 Conference Paper

Concurrent Reinforcement Learning as a Rehearsal for Decentralized Planning Under Uncertainty

  • Landon Kraemer
  • Bikramjit Banerjee

Decentralized partially-observable Markov decision processes (Dec-POMDPs) are a powerful tool for modeling multi-agent planning and decision-making under uncertainty. Prevalent Dec-POMDP solution techniques require centralized computation given full knowledge of the underlying model. Reinforcement learning (RL) based approaches have been recently proposed for distributed solution of Dec-POMDPs without full prior knowledge of the model, but these methods assume that conditions during learning and policy execution are identical. This assumption may not always be necessary and may make learning difficult. We propose a novel RL approach in which agents rehearse with information that will not be available during policy execution, yet learn policies that do not explicitly rely on this information. We show experimentally that incorporating such information can ease the difficulties faced by non-rehearsal-based learners, and demonstrate fast, (near) optimal performance on many existing benchmark Dec-POMDP problems.
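
The rehearsal asymmetry, privileged information for the critic during training but observation-only inputs for the actor, can be shown in a few lines. This is a structural sketch of the idea (close in spirit to what is now often called an asymmetric actor-critic), not the paper's algorithm for Dec-POMDPs; all names and the toy feedback are assumptions.

```python
import numpy as np

# Structural sketch of rehearsal: the critic's value estimate may use the
# full state (available only while learning), while the actor conditions
# only on the local observation, so the policy stays executable later.

rng = np.random.default_rng(1)
n_obs, n_state, n_actions, lr = 4, 8, 3, 0.1
actor_w = np.zeros((n_obs, n_actions))   # observation -> action logits
critic_w = np.zeros(n_state)             # full state -> value (rehearsal only)

obs = rng.normal(size=n_obs)
full_state = rng.normal(size=n_state)    # includes info hidden at execution
logits = obs @ actor_w
probs = np.exp(logits - logits.max()); probs /= probs.sum()
action = rng.choice(n_actions, p=probs)
reward, next_value = 1.0, 0.0            # stand-in environment feedback

td_error = reward + next_value - full_state @ critic_w
critic_w += lr * td_error * full_state                 # privileged critic update
grad = -probs; grad[action] += 1.0                     # d log pi(a|obs) / d logits
actor_w += lr * td_error * np.outer(obs, grad)         # observation-only actor update
```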

AAAI Conference 2013 Conference Paper

Pruning for Monte Carlo Distributed Reinforcement Learning in Decentralized POMDPs

  • Bikramjit Banerjee

Decentralized partially observable Markov decision processes (Dec-POMDPs) offer a powerful modeling technique for realistic multi-agent coordination problems under uncertainty. Prevalent solution techniques are centralized and assume prior knowledge of the model. Recently a Monte Carlo based distributed reinforcement learning approach was proposed, where agents take turns to learn best responses to each other’s policies. This promotes decentralization of the policy computation problem, and relaxes reliance on the full knowledge of the problem parameters. However, this Monte Carlo approach has a large sample complexity, which we address in this paper. In particular, we propose and analyze a modified version of the previous algorithm that adaptively eliminates parts of the experience tree from further exploration, thus requiring fewer samples while ensuring unchanged confidence in the learned value function. Experiments demonstrate significant reduction in sample complexity – the maximum reductions ranging from 61% to 91% over different benchmark Dec-POMDP problems – with the final policies being often better due to more focused exploration.

AAMAS Conference 2012 Conference Paper

Efficient Context Free Parsing of Multi-agent Activities for Team and Plan Recognition

  • Bikramjit Banerjee
  • Jeremy Lyle
  • Landon Kraemer

We extend a recent formalization of multi-agent plan recognition (MAPR), to accommodate compact multi-agent plan libraries in the form of context free grammars (CFG), incomplete plan executions, and uncertainty in the observation trace. Some existing approaches for single agent plan recognition cast it as a problem of parsing a single agent activity trace. With the help of our multi-agent CFG, we do the same for MAPR. However, known hardness results from multi-agent plan recognition constrain our options for efficient parsing, but we claim that static teams are a necessary (though not sufficient) condition for polynomial parsing. The necessity is supported by the fact that MAPR becomes NP-complete when teams can change dynamically. For sufficiency, we impose additional restrictions and claim that if the social structure among the agents is of certain types, then polynomial time parsing is possible.

AAAI Conference 2012 Conference Paper

Sample Bounded Distributed Reinforcement Learning for Decentralized POMDPs

  • Bikramjit Banerjee
  • Jeremy Lyle
  • Landon Kraemer
  • Rajesh Yellamraju

Decentralized partially observable Markov decision processes (Dec-POMDPs) offer a powerful modeling technique for realistic multi-agent coordination problems under uncertainty. Prevalent solution techniques are centralized and assume prior knowledge of the model. We propose a distributed reinforcement learning approach, where agents take turns to learn best responses to each other’s policies. This promotes decentralization of the policy computation problem, and relaxes reliance on the full knowledge of the problem parameters. We derive the relation between the sample complexity of best response learning and error tolerance. Our key contribution is to show that sample complexity could grow exponentially with the problem horizon. We show empirically that even if the sample requirement is set lower than what theory demands, our learning approach can produce (near) optimal policies in some benchmark Dec-POMDP problems.

AAAI Conference 2011 Conference Paper

Branch and Price for Multi-Agent Plan Recognition

  • Bikramjit Banerjee
  • Landon Kraemer

The problem of identifying the (dynamic) team structures and team behaviors from the observed activities of multiple agents is called Multi-Agent Plan Recognition (MAPR). We extend a recent formalization of this problem to accommodate a compact, partially ordered, multi-agent plan language, as well as complex plan execution models – particularly plan abandonment and activity interleaving. We adopt a branch and price approach to solve MAPR in such a challenging setting, and fully instantiate the (generic) pricing problem for MAPR. We show experimentally that this approach outperforms a recently proposed hypothesis pruning algorithm in two domains: multi-agent blocks world, and intrusion detection. The key benefit of the branch and price approach is its ability to grow the necessary component (occurrence) space from which the hypotheses are constructed, rather than begin with a fully enumerated component space that has an intractable size, and search it with pruning. Our formulation of MAPR has the broad objective of bringing mature Operations Research methodologies to bear upon MAPR, envisaged to have a similar impact as mature SAT-solvers had on planning.

AAMAS Conference 2010 Conference Paper

Action Discovery for Reinforcement Learning

  • Bikramjit Banerjee
  • Landon Kraemer

The design of reinforcement learning solutions to many problems artificially constrains the action set available to an agent, in order to limit the exploration/sample complexity. While exploring, if an agent can discover new actions that can break through the constraints of its basic/atomic action set, then the quality of the learned decision policy could improve. On the flip side, considering all possible non-atomic actions might explode the exploration complexity. We present a potential based solution to this dilemma, and empirically evaluate it in grid navigation tasks. In particular, we show that both the solution quality and the sample complexity improve significantly when basic reinforcement learning is coupled with action discovery. Our approach relies on reducing the number of decision points, which is particularly suited for multiagent coordination learning, since agents tend to learn more easily with fewer coordination problems (CPs). To demonstrate this we extend action discovery to multi-agent reinforcement learning. We show that Joint Action Learners (JALs) indeed learn coordination policies of higher quality with lower sample complexity when coupled with action discovery, in a multi-agent box-pushing task.

AAMAS Conference 2010 Conference Paper

Coalition Structure Generation in Multi-Agent Systems with Mixed Externalities

  • Bikramjit Banerjee
  • Landon Kraemer

Coalition structure generation (CSG) for multi-agent systems is a well-studied problem. A vast majority of the previous work and the state-of-the-art approaches to CSG assume a characteristic function form of the coalition values, where a coalition's value is independent of the other coalitions in the coalition structure. Recently, there has been interest in the more realistic partition function form of coalition values, where the value of a coalition is affected by how the other agents are partitioned, via externalities. However, the most recent studies in this direction have focused on cases where all externalities are either always positive or always negative, and results on coalition structure generation in more general settings (in particular, mixed externalities) are lacking. In this paper we propose a framework based on agent-types that incorporates mixed externalities and demonstrate that it includes the previous settings as special cases. We also generalize some previous results in anytime CSG, showing that those results are again special cases. In particular, we extend the existing branch and bound algorithm to this new setting and show empirically that significant pruning can be achieved when searching for the optimal coalition structure. This extends the state-of-the-art in CSG for multi-agent systems.

AAAI Conference 2010 Conference Paper

Multi-Agent Plan Recognition: Formalization and Algorithms

  • Bikramjit Banerjee
  • Landon Kraemer
  • Jeremy Lyle

Multi-Agent Plan Recognition (MAPR) seeks to identify the dynamic team structures and team behaviors from the observations of the activity-sequences of a set of intelligent agents, based on a library of known team-activities (plan library). It has important applications in analyzing data from automated monitoring, surveillance, and intelligence analysis in general. In this paper, we formalize MAPR using a basic model that explicates the cost of abduction in single agent plan recognition by “flattening” or decompressing the (usually compact, hierarchical) plan library. We show that single-agent plan recognition with a decompressed library can be solved in time polynomial in the input size, while it is known that with a compressed (by partial ordering constraints) library it is NP-complete. This leads to an important insight: that although the compactness of the plan library plays an important role in the hardness of single-agent plan recognition (as recognized in the existing literature), that is not the case with multiple agents. We show, for the first time, that MAPR is NP-complete even when the (multi-agent) plan library is fully decompressed. As with previous solution approaches, we break the problem into two stages: hypothesis generation and hypothesis search. We show that Knuth’s “Algorithm X” (with the efficient “dancing links” representation) is particularly suited for our model, and can be adapted to perform a branch and bound search for the second stage, in this model. We show empirically that this new approach leads to significant pruning of the hypothesis space in MAPR.
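
Knuth's Algorithm X solves exact cover by picking the most constrained column and branching over the rows that cover it, undoing its deletions on backtrack. The compact dict-of-sets rendition below is a standard substitute for the full dancing-links structure (illustrative code with a toy instance, not the paper's branch and bound adaptation for MAPR hypotheses).

```python
# Algorithm X for exact cover with dicts of sets instead of dancing links.
# Y maps each row to the columns it covers; X maps each column to the rows
# covering it. solve() yields every exact cover of the columns.

def solve(X, Y, solution=None):
    solution = solution if solution is not None else []
    if not X:
        yield list(solution)
        return
    col = min(X, key=lambda c: len(X[c]))      # most-constrained column first
    for row in list(X[col]):
        solution.append(row)
        removed = select(X, Y, row)
        yield from solve(X, Y, solution)
        deselect(X, Y, row, removed)
        solution.pop()

def select(X, Y, row):
    removed = []
    for col in Y[row]:
        for other in X[col]:
            for c in Y[other]:
                if c != col:
                    X[c].discard(other)
        removed.append((col, X.pop(col)))
    return removed

def deselect(X, Y, row, removed):
    for col, rows in reversed(removed):
        X[col] = rows
        for other in rows:
            for c in Y[other]:
                if c != col:
                    X[c].add(other)

Y = {"A": [1, 4], "B": [2, 3], "C": [1, 2], "D": [3, 4]}
X = {c: {r for r in Y if c in Y[r]} for c in (1, 2, 3, 4)}
print(list(solve(X, Y)))    # exact covers: ['A','B'] and ['C','D'] (order may vary)
```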

AAMAS Conference 2010 Conference Paper

Validation of Agent Based Crowd Egress Simulation

  • Bikramjit Banerjee
  • Landon Kraemer

Crowd behavior simulation has been an active field of research because of its utility in several applications such as emergency planning and evacuations, designing and planning pedestrian areas, subway or rail-road stations, besides in education, training and entertainment. The most advanced and realistic simulation systems employ intelligent autonomous agents with a balance between individual and group intelligence for scalability of the architectures. Although several systems have even been commercialized, little attention has been accorded to the problem of validating the outcomes of these simulations in a generalized manner, against reality. The extent of validation fails to go much beyond visual matching between the simulation and the actual scenarios (with recordings of human crowds), which can lead to highly subjective and often questionable conclusions. The existing numerical measures often rely on ad-hoc applications, e.g., local crowd densities are measured to verify patterns, without a systematic procedure to identify at what times in the simulation and the scenarios the densities can be compared. Furthermore, if there are multiple systems that simulate crowd behavior in the same scenario in the same virtual environment, then no technique is currently known to quantitatively compare these systems in terms of realism. In this abstract, we present the first (to the best of our knowledge) principled, unified, and automated technique to quantitatively validate and compare the global performance of crowd egress simulation systems. We also evaluate a multi-agent based crowd egress simulation system (that we have recently developed, but do not discuss here) using our technique and demonstrate a high degree of validity of that system as well.

IJCAI Conference 2007 Conference Paper

General Game Learning Using Knowledge Transfer

  • Bikramjit Banerjee
  • Peter Stone

We present a reinforcement learning game player that can interact with a General Game Playing system and transfer knowledge learned in one game to expedite learning in many other games. We use the technique of value-function transfer where general features are extracted from the state space of a previous game and matched with the completely different state space of a new game. To capture the underlying similarity of vastly disparate state spaces arising from different games, we use a game-tree lookahead structure for features. We show that such feature-based value function transfer learns superior policies faster than a reinforcement learning agent that does not use knowledge transfer. Furthermore, knowledge transfer using lookahead features can capture opponent-specific value-functions, i.e., can exploit an opponent's weaknesses to learn faster than a reinforcement learner that uses lookahead with minimax (pessimistic) search against the same opponent.

JAAMAS Journal 2007 Journal Article

Generalized multiagent learning with performance bound

  • Bikramjit Banerjee
  • Jing Peng

We present new multiagent learning (MAL) algorithms with the general philosophy of policy convergence against some classes of opponents but otherwise ensuring high payoffs. We consider a 3-class breakdown of opponent types: (eventually) stationary, self-play and “other” (see Definition 4) agents. We start with ReDVaLeR that can satisfy policy convergence against the first two types and no-regret against the third, but it needs to know the type of the opponents. This serves as a baseline to delineate the difficulty of achieving these goals. We show that a simple modification on ReDVaLeR yields a new algorithm, RVσ(t), that achieves no-regret payoffs in all games, and convergence to Nash equilibria in self-play (and to best response against eventually stationary opponents—a corollary of no-regret) simultaneously, without knowing the opponent types, but in a smaller class of games than ReDVaLeR. RVσ(t) effectively ensures the performance of a learner during the process of learning, as opposed to the performance of a learned behavior. We show that the expression for regret of RVσ(t) can have a slightly better form than those of other comparable algorithms like GIGA and GIGA-WoLF though, contrastingly, our analysis is in continuous time. Moreover, experiments show that RVσ(t) can converge to an equilibrium in some cases where GIGA, GIGA-WoLF would fail, and to better equilibria where GIGA, GIGA-WoLF converge to undesirable equilibria (coordination games). This important class of coordination games also highlights the key desirability of policy convergence as a criterion for MAL in self-play instead of high average payoffs. To our knowledge, this is also the first successful (guaranteed) attempt at policy convergence of a no-regret algorithm in the Shapley game.
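
Since this entry and the two AAAI entries below all revolve around no-regret guarantees, it may help to fix the standard external-regret definition (generic notation, ours, not quoted from any of the papers):

```latex
R_T \;=\; \max_{a \in A} \sum_{t=1}^{T} r_t(a) \;-\; \sum_{t=1}^{T} r_t(a_t),
\qquad \text{no-regret:} \quad \limsup_{T \to \infty} \frac{R_T}{T} \,\le\, 0,
```

i.e., the learner's realized cumulative payoff is asymptotically no worse than that of the best fixed action chosen in hindsight; RVσ(t)'s distinction is achieving this while also converging in policy in self-play.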

AAAI Conference 2005 Conference Paper

Efficient No-Regret Multiagent Learning

  • Bikramjit Banerjee

We present new results on the efficiency of no-regret algorithms in the context of multiagent learning. We use a known approach to augment a large class of no-regret algorithms to allow stochastic sampling of actions and observation of scalar reward of only the action played. We show that the average actual payoffs of the resulting learner get (1) close to the best response against (eventually) stationary opponents, (2) close to the asymptotic optimal payoff against opponents that play a converging sequence of policies, and (3) close to at least a dynamic variant of minimax payoff against arbitrary opponents, with a high probability in polynomial time. In addition, the polynomial bounds are shown to be significantly better than previously known bounds. Furthermore, we do not need to assume that the learner knows the game matrices and can observe the opponents’ actions, unlike previous work.

AAAI Conference 2004 Conference Paper

Performance Bounded Reinforcement Learning in Strategic Interactions

  • Bikramjit Banerjee
  • Jing Peng

Despite increasing deployment of agent technologies in several business and industry domains, user confidence in fully automated agent driven applications is noticeably lacking. The main reasons for such lack of trust in complete automation are scalability and the non-existence of reasonable guarantees in the performance of self-adapting software. In this paper we address the latter issue in the context of learning agents in a multiagent system (MAS). Performance guarantees for most existing on-line multiagent learning (MAL) algorithms are realizable only in the limit, thereby seriously limiting their practical utility. Our goal is to provide certain meaningful guarantees about the performance of a learner in a MAS, while it is learning. In particular, we present a novel MAL algorithm that (i) converges to a best response against stationary opponents, (ii) converges to a Nash equilibrium in self-play, and (iii) achieves a constant bounded expected regret at any time (no-average-regret asymptotically) in arbitrary sized general-sum games with non-negative payoffs, and against any number of opponents.