Arrow Research search

Author name cluster

Pradeep Varakantham

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

117 papers
2 author rows

Possible papers

117

AAAI Conference 2026 Conference Paper

Optimizing Ride-Pooling Operations with Extended Pickup and Drop-Off Flexibility

  • Hao Jiang
  • Yixin Xu
  • Pradeep Varakantham

The core of efficient on-demand ride-pooling lies in solving the Ride-Pool Matching Problem (RMP), which involves assigning multiple customer requests to single vehicles under various service constraints (e.g., pickup windows, detour allowances, and vehicle occupancy). A significant missed opportunity in most current RMP approaches is the assumption that passengers must be picked up and dropped off exactly at their requested locations. Allowing passengers to walk even short distances to meet vehicles could unlock substantial improvements in ride-pooling operations. Building on this limitation, this paper introduces a novel matching method that strategically incorporates flexible pickup and drop-off locations. Our approach simultaneously determines the optimal assignment of vehicles to requests (one vehicle to potentially multiple requests and each request to at most one vehicle), identifies advantageous meeting points for passengers, and plans efficient vehicle routes. This comprehensive optimization respects all service constraints and considers the long-term implications of routing decisions. To achieve this integrated solution, we first employ a tree-based approach to enumerate all feasible pairings between passengers and vehicles. Subsequently, we calculate an optimal route for each of these feasible matches. Finally, we evaluate the quality of all possible assignments and select the most advantageous matching for implementation. In our experimental evaluation on city-scale taxi datasets, we demonstrate that our method improves the number of served requests by up to 13% and reduces the average vehicle travel distance by up to 21%.
By serving more passengers with less driving distance, our approach achieves greater efficiency in a more sustainable manner — using fewer resources to deliver better service and creating a win-win outcome for all stakeholders, including customers, drivers, the aggregator, and the environment.
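The enumerate-score-assign pipeline the abstract describes can be sketched in miniature. Everything here (1-D locations, the feasibility rule, the scoring formula, and greedy selection in place of the paper's optimal assignment and route computation) is a simplifying assumption, not the authors' method:

```python
from itertools import combinations

def feasible(vehicle, requests, max_detour=2.0):
    """Hypothetical feasibility check: total pickup detour within an allowance."""
    detour = sum(abs(r - vehicle) for r in requests)  # toy 1-D geometry
    return detour <= max_detour * len(requests)

def best_matching(vehicles, requests, group_size=2):
    """Enumerate feasible vehicle/request groups, score each, then greedily
    keep the highest-scoring non-overlapping matches (each request served once)."""
    candidates = []
    for v in vehicles:
        for k in range(1, group_size + 1):
            for group in combinations(requests, k):
                if feasible(v, group):
                    # toy score: requests served minus normalized travel cost
                    score = len(group) - 0.1 * sum(abs(r - v) for r in group)
                    candidates.append((score, v, group))
    candidates.sort(reverse=True)
    assigned_v, assigned_r, matching = set(), set(), []
    for score, v, group in candidates:
        if v in assigned_v or any(r in assigned_r for r in group):
            continue
        assigned_v.add(v)
        assigned_r.update(group)
        matching.append((v, group))
    return matching
```

In the paper, the scoring step would evaluate an optimal route per feasible match and the final selection would be a global assignment, not a greedy pass.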

ICLR Conference 2025 Conference Paper

Bootstrapping Language Models with DPO Implicit Rewards

  • Changyu Chen
  • Zichen Liu
  • Chao Du
  • Tianyu Pang
  • Qian Liu 0033
  • Arunesh Sinha
  • Pradeep Varakantham
  • Min Lin

Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate two refinements to further improve our approach: 1) length-regularized reward shaping to make the preference dataset length-unbiased; 2) experience replay to enhance the quality of the preference dataset. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment. It achieves an increase of more than 8% in length-controlled win rate on AlpacaEval 2 for all the different base models that we tried, without relying on external feedback. Our code is available at https://github.com/sail-sg/dice.
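The bootstrapping step rests on DPO's implicit reward, r(x, y) = β (log π_θ(y|x) − log π_ref(y|x)). A minimal sketch of using that reward to rank two sampled responses into a new (chosen, rejected) pair; the log-probabilities are toy inputs, and the paper's length-regularized shaping and experience replay are omitted:

```python
def dpo_implicit_reward(logp_policy, logp_ref, beta=0.1):
    """DPO's implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x))."""
    return beta * (logp_policy - logp_ref)

def build_preference_pair(prompt, responses, logp_policy, logp_ref, beta=0.1):
    """Rank sampled responses by implicit reward to form the next DPO round's
    (chosen, rejected) pair -- the bootstrapping step DICE describes."""
    rewards = [dpo_implicit_reward(lp, lr, beta)
               for lp, lr in zip(logp_policy, logp_ref)]
    hi = max(range(len(responses)), key=rewards.__getitem__)
    lo = min(range(len(responses)), key=rewards.__getitem__)
    return {"prompt": prompt, "chosen": responses[hi], "rejected": responses[lo]}
```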

AAMAS Conference 2025 Conference Paper

EduQate: Generating Adaptive Curricula through RMABs in Education Settings

  • Sidney Tio
  • Dexun Li
  • Pradeep Varakantham

There has been significant interest in the development of personalized and adaptive educational tools that cater to a student’s individual learning progress. A crucial aspect of developing such tools is exploring how mastery can be achieved across a diverse yet related range of content in an efficient manner. While Reinforcement Learning and Multi-armed Bandits have shown promise in educational settings, existing works often assume the independence of learning content, neglecting the prevalent interdependencies between such content. In response, we introduce Education Network Restless Multi-armed Bandits (EdNetRMABs), utilizing a network to represent the relationships between interdependent arms. Subsequently, we propose EduQate, a method employing interdependency-aware Q-learning to make informed decisions on arm selection at each time step. We establish the optimality guarantee of EduQate and demonstrate its efficacy compared to baseline policies, using students modeled from both synthetic and real-world data.

AAAI Conference 2025 Conference Paper

Marginal Benefit Driven RL Teacher for Unsupervised Environment Design

  • Dexun Li
  • Wenjun Li
  • Pradeep Varakantham

Training generally capable agents in complex environments is a challenging task that involves identifying "right" environments at the training stage. Recent research has highlighted the potential of the Unsupervised Environment Design framework, which generates environment instances/levels adaptively at the frontier of the agent’s capabilities using regret measures. While regret approaches have shown great promise in generating feasible environments, they can produce difficult environments that are challenging for an RL agent to learn from. This is because regret represents the best-case (upper bound) learning potential and not the actual learning potential of an environment. To address this limitation, we propose an alternative mechanism that employs marginal benefit, focusing on the improvement (in terms of generalized performance) the agent policy gets for a given environment. The advantage of this new mechanism is that it is agent-focused (and not environment-focused) and generates the "right" environments depending on the agent's policy. Additionally, to improve the generalizability of the agent, we introduce a representative state diversity metric that aims to generate varied experiences for the agent. Finally, we provide detailed experimental results and ablation analysis to showcase the effectiveness of our new methods. We obtain SOTA results among RL-based environment generation methods.

NeurIPS Conference 2025 Conference Paper

No Experts, No Problem: Avoidance Learning from Bad Demonstrations

  • Huy Hoang
  • Tien Mai
  • Pradeep Varakantham

This paper addresses the problem of learning avoidance behavior within the context of offline imitation learning. In contrast to conventional methodologies that prioritize the replication of expert or near-expert demonstrations, our work investigates a setting where expert (or desirable) data is absent, and the objective is to learn to eschew undesirable actions by leveraging demonstrations of such behavior (i.e., learning from negative examples). To address this challenge, we propose a novel training objective grounded in the maximum entropy principle. We further characterize the fundamental properties of this objective function, reformulating the learning process as a cooperative inverse Q-learning task. Moreover, we introduce an efficient strategy for the integration of unlabeled data (i.e., data of indeterminate quality) to facilitate unbiased and practical offline training. The efficacy of our method is evaluated across standard benchmark environments, where it consistently outperforms state-of-the-art baselines.

AAAI Conference 2025 Conference Paper

Offline Safe Reinforcement Learning Using Trajectory Classification

  • Ze Gong
  • Akshat Kumar
  • Pradeep Varakantham

Offline safe reinforcement learning (RL) has emerged as a promising approach for learning safe behaviors without engaging in risky online interactions with the environment. Most existing methods in offline safe RL rely on cost constraints at each time step (derived from global cost constraints) and this can result in either overly conservative policies or violation of safety constraints. In this paper, we propose to learn a policy that generates desirable trajectories and avoids undesirable trajectories. To be specific, we first partition the pre-collected dataset of state-action trajectories into desirable and undesirable subsets. Intuitively, the desirable set contains high-reward and safe trajectories, and the undesirable set contains unsafe trajectories and low-reward safe trajectories. Second, we learn a policy that generates desirable trajectories and avoids undesirable trajectories, where (un)desirability scores are provided by a classifier learnt from the dataset of desirable and undesirable trajectories. This approach bypasses the computational complexity and stability issues of a min-max objective that is employed in existing methods. Theoretically, we also show our approach’s strong connections to existing learning paradigms involving human feedback. Finally, we extensively evaluate our method using the DSRL benchmark for offline safe RL. Empirically, our method outperforms competitive baselines, achieving higher rewards and better constraint satisfaction across a wide variety of benchmark tasks.

ICLR Conference 2025 Conference Paper

On Generalization Across Environments In Multi-Objective Reinforcement Learning

  • Jayden Teoh
  • Pradeep Varakantham
  • Peter Vamplew 0001

Real-world sequential decision-making tasks often require balancing trade-offs between multiple conflicting objectives, making Multi-Objective Reinforcement Learning (MORL) an increasingly prominent field of research. Despite recent advances, existing MORL literature has narrowly focused on performance within static environments, neglecting the importance of generalizing across diverse settings. Conversely, existing research on generalization in RL has always assumed scalar rewards, overlooking the inherent multi-objectivity of real-world problems. Generalization in the multi-objective context is fundamentally more challenging, as it requires learning a Pareto set of policies addressing varying preferences across multiple objectives. In this paper, we formalize the concept of generalization in MORL and how it can be evaluated. We then contribute a novel benchmark featuring diverse multi-objective domains with parameterized environment configurations to facilitate future studies in this area. Our baseline evaluations of state-of-the-art MORL algorithms on this benchmark reveal limited generalization capabilities, suggesting significant room for improvement. Our empirical findings also expose limitations in the expressivity of scalar rewards, emphasizing the need for multi-objective specifications to achieve effective generalization. We further analyzed the algorithmic complexities within current MORL approaches that could impede the transfer in performance from the single- to multiple-environment settings. This work fills a critical gap and lays the groundwork for future research that brings together two key areas in reinforcement learning: solving multi-objective decision-making problems and generalizing across diverse environments. We make our code available at https://github.com/JaydenTeoh/MORL-Generalization.

AAMAS Conference 2025 Conference Paper

On Learning Informative Trajectory Embeddings for Imitation, Classification and Regression

  • Zichang Ge
  • Changyu Chen
  • Arunesh Sinha
  • Pradeep Varakantham

In real-world sequential decision making tasks like autonomous driving, robotics, and healthcare, learning from observed state-action trajectories is critical for tasks like imitation, classification, and clustering. For example, self-driving cars must replicate human driving behaviors, while robots and healthcare systems benefit from modeling decision sequences, whether or not they come from expert data. Existing trajectory encoding methods often focus on specific tasks or rely on reward signals, limiting their ability to generalize across domains and tasks. Inspired by the success of embedding models like CLIP and BERT in static domains, we propose a novel method for embedding state-action trajectories into a latent space that captures the skills and competencies in the dynamic underlying decision-making processes. This method operates without the need for reward labels, enabling better generalization across diverse domains and tasks. Our contributions are threefold: (1) We introduce a trajectory embedding approach that captures multiple abilities from state-action data. (2) The learned embeddings exhibit strong representational power across downstream tasks, including imitation, classification, clustering, and regression. (3) The embeddings demonstrate unique properties, such as controlling agent behaviors in IQ-Learn and an additive structure in the latent space. Experimental results confirm that our method outperforms traditional approaches, offering more flexible and powerful trajectory representations for various applications. Our code is available at https://github.com/Erasmo1015/vte.

ICLR Conference 2025 Conference Paper

On Minimizing Adversarial Counterfactual Error in Adversarial Reinforcement Learning

  • Roman Belaire
  • Arunesh Sinha
  • Pradeep Varakantham

Deep Reinforcement Learning (DRL) policies are highly susceptible to adversarial noise in observations, which poses significant risks in safety-critical scenarios. The challenge inherent to adversarial perturbations is that by altering the information observed by the agent, the state becomes only partially observable. Existing approaches address this by either enforcing consistent actions across nearby states or maximizing the worst-case value within adversarially perturbed observations. However, the former suffers from performance degradation when attacks succeed, while the latter tends to be overly conservative, leading to suboptimal performance in benign settings. We hypothesize that these limitations stem from their failing to account for partial observability directly. To this end, we introduce a novel objective called Adversarial Counterfactual Error (ACoE), defined on the beliefs about the true state and balancing value optimization with robustness. To make ACoE scalable in model-free settings, we propose the theoretically-grounded surrogate objective Cumulative-ACoE (C-ACoE). Our empirical evaluations on standard benchmarks (MuJoCo, Atari, and Highway) demonstrate that our method significantly outperforms current state-of-the-art approaches for addressing adversarial RL challenges, offering a promising direction for improving robustness in DRL under adversarial conditions. Our code is available at https://github.com/romanbelaire/acoe-robust-rl.

ICLR Conference 2025 Conference Paper

Semantic Loss Guided Data Efficient Supervised Fine Tuning for Safe Responses in LLMs

  • Yuxiao Lu
  • Arunesh Sinha
  • Pradeep Varakantham

Large Language Models (LLMs) generating unsafe responses to toxic prompts is a significant issue in their applications. While various efforts aim to address this safety concern, previous approaches often demand substantial human data collection or rely on the less dependable option of using another LLM to generate corrective data. In this paper, we address this problem while overcoming the limitation of requiring large amounts of high-quality human data. Our method requires only a small set of unsafe responses to toxic prompts, easily obtained from the unsafe LLM itself. By employing a semantic cost combined with a negative Earth Mover Distance (EMD) loss, we guide the LLM away from generating unsafe responses. Additionally, we propose a novel lower bound for EMD loss, enabling more efficient optimization. Our results demonstrate superior performance and data efficiency compared to baselines, and we further examine the nuanced effects of over-alignment and potential degradation of language capabilities when using contrastive data.

AAAI Conference 2024 Conference Paper

Handling Long and Richly Constrained Tasks through Constrained Hierarchical Reinforcement Learning

  • Yuxiao Lu
  • Arunesh Sinha
  • Pradeep Varakantham

Safety in goal-directed Reinforcement Learning (RL) settings has typically been handled through constraints over trajectories, and such approaches have demonstrated good performance primarily in short-horizon tasks. In this paper, we are specifically interested in the problem of solving temporally extended decision making problems such as robots cleaning different areas in a house while avoiding slippery and unsafe areas (e.g., stairs) and retaining enough charge to move to a charging dock; in the presence of complex safety constraints. Our key contribution is a (safety) Constrained Search with Hierarchical Reinforcement Learning (CoSHRL) mechanism that combines an upper level constrained search agent (which computes a reward maximizing policy from a given start to a far away goal state while satisfying cost constraints) with a low-level goal conditioned RL agent (which estimates cost and reward values to move between nearby states). A major advantage of CoSHRL is that it can handle constraints on the cost value distribution (e.g., on Conditional Value at Risk, CVaR) and can adjust to flexible constraint thresholds without retraining. We perform extensive experiments with different types of safety constraints to demonstrate the utility of our approach over leading approaches in constrained and hierarchical RL.

AAAI Conference 2024 Conference Paper

Imitate the Good and Avoid the Bad: An Incremental Approach to Safe Reinforcement Learning

  • Huy Hoang
  • Tien Mai
  • Pradeep Varakantham

A popular framework for enforcing safe actions in Reinforcement Learning (RL) is Constrained RL, where trajectory-based constraints on expected cost (or other cost measures) are employed to enforce safety and more importantly these constraints are enforced while maximizing expected reward. Most recent approaches for solving Constrained RL convert the trajectory-based cost constraint into a surrogate problem that can be solved using minor modifications to RL methods. A key drawback with such approaches is an over- or underestimation of the cost constraint at each state. Therefore, we provide an approach that does not modify the trajectory-based cost constraint and instead imitates "good" trajectories and avoids "bad" trajectories generated from incrementally improving policies. We employ an oracle that utilizes a reward threshold (which is varied with learning) and the overall cost constraint to label trajectories as "good" or "bad". A key advantage of our approach is that we are able to work from any starting policy or set of trajectories and improve on it. In an exhaustive set of experiments, we demonstrate that our approach is able to outperform top benchmark approaches for solving Constrained RL problems, with respect to expected cost, CVaR cost, or even unknown cost constraints.
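The oracle's labeling rule can be sketched directly. The trajectory fields and fixed thresholds below are hypothetical stand-ins; in the paper the reward threshold varies as learning progresses:

```python
def label_trajectories(trajectories, reward_threshold, cost_budget):
    """Toy oracle: a trajectory is 'good' if its return clears the reward
    threshold AND its cumulative cost stays within the constraint budget;
    everything else is 'bad'."""
    good, bad = [], []
    for traj in trajectories:
        if traj["return"] >= reward_threshold and traj["cost"] <= cost_budget:
            good.append(traj)
        else:
            bad.append(traj)
    return good, bad
```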

ICAPS Conference 2024 Conference Paper

Imitating Cost-Constrained Behaviors in Reinforcement Learning

  • Qian Shao
  • Pradeep Varakantham
  • Shih-Fen Cheng

Complex planning and scheduling problems have long been solved using various optimization or heuristic approaches. In recent years, imitation learning that aims to learn from expert demonstrations has been proposed as a viable alternative to solving these problems. Generally speaking, imitation learning is designed to learn either the reward (or preference) model or directly the behavioral policy by observing the behavior of an expert. Existing work in imitation learning and inverse reinforcement learning has focused on imitation primarily in unconstrained settings (e.g., no limit on fuel consumed by the vehicle). However, in many real-world domains, the behavior of an expert is governed not only by reward (or preference) but also by constraints. For instance, decisions on self-driving delivery vehicles are dependent not only on the route preferences/rewards (depending on past demand data) but also on the fuel in the vehicle and the time available. In such problems, imitation learning is challenging as decisions are not only dictated by the reward model but are also dependent on a cost-constrained model. In this paper, we provide multiple methods that match expert distributions in the presence of trajectory cost constraints through (a) Lagrangian-based method; (b) Meta-gradients to find a good trade-off between expected return and minimizing constraint violation; and (c) Cost-violation-based alternating gradient. We empirically show that leading imitation learning approaches imitate cost-constrained behaviors poorly and our meta-gradient-based approach achieves the best performance.

NeurIPS Conference 2024 Conference Paper

Improving Environment Novelty Quantification for Effective Unsupervised Environment Design

  • Jayden Teoh
  • Wenjun Li
  • Pradeep Varakantham

Unsupervised Environment Design (UED) formalizes the problem of autocurricula through interactive training between a teacher agent and a student agent. The teacher generates new training environments with high learning potential, curating an adaptive curriculum that strengthens the student's ability to handle unseen scenarios. Existing UED methods mainly rely on regret, a metric that measures the difference between the agent's optimal and actual performance, to guide curriculum design. Regret-driven methods generate curricula that progressively increase environment complexity for the student but overlook environment novelty — a critical element for enhancing an agent's generalizability. Measuring environment novelty is especially challenging due to the underspecified nature of environment parameters in UED, and existing approaches face significant limitations. To address this, this paper introduces the Coverage-based Evaluation of Novelty In Environment (CENIE) framework. CENIE proposes a scalable, domain-agnostic, and curriculum-aware approach to quantifying environment novelty by leveraging the student's state-action space coverage from previous curriculum experiences. We then propose an implementation of CENIE that models this coverage and measures environment novelty using Gaussian Mixture Models. By integrating both regret and novelty as complementary objectives for curriculum design, CENIE facilitates effective exploration across the state-action space while progressively increasing curriculum complexity. Empirical evaluations demonstrate that augmenting existing regret-based UED algorithms with CENIE achieves state-of-the-art performance across multiple benchmarks, underscoring the effectiveness of novelty-driven autocurricula for robust generalization.
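A minimal stand-in for CENIE's coverage model: fit a model to the student's past state-action visits, then score a candidate environment's novelty as the mean negative log-likelihood of the visits it induces. A single diagonal Gaussian is assumed here in place of the paper's Gaussian Mixture Model:

```python
import numpy as np

def fit_coverage_model(past_visits):
    """Fit a single-component, diagonal Gaussian to past state-action visits --
    a simplified stand-in for the GMM CENIE fits over curriculum experience."""
    mu = past_visits.mean(axis=0)
    var = past_visits.var(axis=0) + 1e-6  # floor variance for stability
    return mu, var

def environment_novelty(model, new_visits):
    """Mean negative log-likelihood of the state-action pairs the student
    visits in the candidate environment: higher means more novel coverage."""
    mu, var = model
    nll = 0.5 * (((new_visits - mu) ** 2) / var + np.log(2 * np.pi * var))
    return float(nll.sum(axis=1).mean())
```

In the full framework this novelty score is combined with regret when curating the curriculum.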

ECAI Conference 2024 Conference Paper

Preserving the Privacy of Reward Functions in MDPs through Deception

  • Shashank Reddy Chirra
  • Pradeep Varakantham
  • Praveen Paruchuri

Preserving the privacy of preferences (or rewards) of a sequential decision-making agent when decisions are observable is crucial in many physical and cybersecurity domains. For instance, in wildlife monitoring, forest rangers must conduct surveillance without revealing animal locations to poachers. This paper addresses privacy preservation in planning over a sequence of actions in MDPs, where the reward function represents the preference structure to be protected. Observers can use Inverse RL (IRL) to learn these preferences, making this a challenging task. Current research on Differential Privacy (DP) in this setting fails to ensure a lower bound on the minimum expected reward and offers theoretical guarantees that are inadequate against IRL-based observers. To bridge this gap, we propose a novel approach rooted in the theory of deception. Deception includes two models: dissimulation (hiding the truth) and simulation (showing the wrong). As our first contribution, we theoretically demonstrate a significant privacy leak in the current dissimulation-based method. Our second contribution is a novel RL-based planning algorithm that uses simulation to effectively address these privacy concerns while ensuring a guarantee on the expected reward. Through experimentation on multiple benchmark problems, we show that our proposed approach outperforms existing methods in preserving the privacy of reward functions. Code to reproduce the results can be found at: https://github.com/shshnkreddy/DeceptiveRL

AAMAS Conference 2024 Conference Paper

Regret-based Defense in Adversarial Reinforcement Learning

  • Roman Belaire
  • Pradeep Varakantham
  • Thanh Nguyen
  • David Lo

Deep Reinforcement Learning (DRL) policies are vulnerable to adversarial noise in observations, which can have disastrous consequences in safety-critical environments. For instance, a self-driving car receiving adversarially perturbed sensory observations about traffic signs (e.g., a stop sign physically altered to be perceived as a speed limit sign) can be fatal. Leading existing approaches for making RL algorithms robust to an observation-perturbing adversary have focused on (a) regularization approaches that make expected value objectives robust by adding adversarial loss terms; or (b) employing “maximin” (i.e., maximizing the minimum value) notions of robustness. While regularization approaches are adept at reducing the probability of successful attacks, their performance drops significantly when an attack is successful. On the other hand, maximin objectives, while robust, can be extremely conservative. To this end, we focus on optimizing a well-studied robustness objective, namely regret. To ensure the solutions provided are not too conservative, we optimize an approximation of regret using three different methods. We demonstrate that our methods outperform existing best approaches for adversarial RL problems across a variety of standard benchmarks from literature.

AAAI Conference 2024 Conference Paper

Reward Penalties on Augmented States for Solving Richly Constrained RL Effectively

  • Hao Jiang
  • Tien Mai
  • Pradeep Varakantham
  • Huy Hoang

Constrained Reinforcement Learning employs trajectory-based cost constraints (such as expected cost, Value at Risk, or Conditional VaR cost) to compute safe policies. The challenge lies in handling these constraints effectively while optimizing expected reward. Existing methods convert such trajectory-based constraints into local cost constraints, but they rely on cost estimates, leading to either aggressive or conservative solutions with regards to cost. We propose an unconstrained formulation that employs reward penalties over states augmented with costs to compute safe policies. Unlike standard primal-dual methods, our approach penalizes only infeasible trajectories through state augmentation. This ensures that increasing the penalty parameter always guarantees a feasible policy, a feature lacking in primal-dual methods. Our approach exhibits strong empirical performance and theoretical properties, offering a fresh paradigm for solving complex Constrained RL problems, including rich constraints like expected cost, Value at Risk, and Conditional Value at Risk. Our experimental results demonstrate superior performance compared to leading approaches across various constraint types on multiple benchmark problems.
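The augmented-state idea can be sketched as a single transition: the cost accrued so far travels with the state, and the reward penalty fires only once a trajectory becomes infeasible, unlike a per-step Lagrangian term. The penalty size and the toy signature are assumptions:

```python
def augmented_step(state, accrued_cost, reward, cost, budget, penalty=100.0):
    """One step in a cost-augmented MDP sketch: the augmented state carries the
    accumulated cost, and the reward is penalized only when that accumulated
    cost exceeds the budget (i.e., only infeasible trajectories are penalized)."""
    accrued_cost += cost
    shaped = reward - (penalty if accrued_cost > budget else 0.0)
    return (state, accrued_cost), shaped
```

Because only infeasible trajectories are penalized, raising `penalty` pushes the optimal policy toward feasibility, which is the guarantee the abstract contrasts with primal-dual methods.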

NeurIPS Conference 2024 Conference Paper

Safety through feedback in Constrained RL

  • Shashank Reddy Chirra
  • Pradeep Varakantham
  • Praveen Paruchuri

In safety-critical RL settings, the inclusion of an additional cost function is often favoured over the arduous task of modifying the reward function to ensure the agent's safe behaviour. However, designing or evaluating such a cost function can be prohibitively expensive. For instance, in the domain of self-driving, designing a cost function that encompasses all unsafe behaviours (e.g., aggressive lane changes, risky overtakes) is inherently complex; it must also consider all the actors present in the scene, making it expensive to evaluate. In such scenarios, the cost function can be learned from feedback collected offline in between training rounds. This feedback can be system generated or elicited from a human observing the training process. Previous approaches have not been able to scale to complex environments and are constrained to receiving feedback at the state level, which can be expensive to collect. To this end, we introduce an approach that scales to more complex domains and extends beyond state-level feedback, thus reducing the burden on the evaluator. Inferring the cost function in such settings poses challenges, particularly in assigning credit to individual states based on trajectory-level feedback. To address this, we propose a surrogate objective that transforms the problem into a state-level supervised classification task with noisy labels, which can be solved efficiently. Additionally, it is often infeasible to collect feedback for every trajectory generated by the agent; hence, two fundamental questions arise: (1) Which trajectories should be presented to the human? and (2) How many trajectories are necessary for effective learning? To address these questions, we introduce a novelty-based sampling mechanism that selectively involves the evaluator only when the agent encounters a novel trajectory, and discontinues querying once the trajectories are no longer novel.
We showcase the efficiency of our method through experimentation on several benchmark Safety Gymnasium environments and realistic self-driving scenarios. Our method demonstrates near-optimal performance, comparable to when the cost function is known, by relying solely on trajectory-level feedback across multiple domains. This highlights both the effectiveness and scalability of our approach. The code to replicate these results can be found at https://github.com/shshnkreddy/RLSF
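The novelty-based sampling rule can be sketched with a toy distance-based novelty test. Fixed-length trajectory feature vectors and an L2 threshold are assumptions; the paper's novelty measure differs:

```python
def should_query(trajectory, seen, threshold=1.0):
    """Query the evaluator only when the trajectory is novel: its distance to
    every previously queried trajectory exceeds the threshold. Queried
    trajectories are remembered, so querying stops once nothing is novel."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    if all(dist(trajectory, s) > threshold for s in seen):
        seen.append(trajectory)
        return True
    return False
```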

NeurIPS Conference 2024 Conference Paper

SPRINQL: Sub-optimal Demonstrations driven Offline Imitation Learning

  • Huy Hoang
  • Tien Mai
  • Pradeep Varakantham

We focus on offline imitation learning (IL), which aims to mimic an expert's behavior using demonstrations without any interaction with the environment. One of the main challenges in offline IL is the limited support of expert demonstrations, which typically cover only a small fraction of the state-action space. While it may not be feasible to obtain numerous expert demonstrations, it is often possible to gather a larger set of sub-optimal demonstrations. For example, in treatment optimization problems, there are varying levels of doctor treatments available for different chronic conditions. These range from treatment specialists and experienced general practitioners to less experienced general practitioners. Similarly, when robots are trained to imitate humans in routine tasks, they might learn from individuals with different levels of expertise and efficiency. In this paper, we propose an offline IL approach that leverages the larger set of sub-optimal demonstrations while effectively mimicking expert trajectories. Existing offline IL methods based on behavior cloning or distribution matching often face issues such as overfitting to the limited set of expert demonstrations or inadvertently imitating sub-optimal trajectories from the larger dataset. Our approach, which is based on inverse soft-Q learning, learns from both expert and sub-optimal demonstrations. It assigns higher importance (through learned weights) to aligning with expert demonstrations and lower importance to aligning with sub-optimal ones. A key contribution of our approach, called SPRINQL, is transforming the offline IL problem into a convex optimization over the space of Q functions. Through comprehensive experimental evaluations, we demonstrate that the SPRINQL algorithm achieves state-of-the-art (SOTA) performance on offline IL benchmarks. Code is available at https://github.com/hmhuy0/SPRINQL.
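A toy rendering of the weighting idea, with fixed weights in place of the weights SPRINQL learns, and a plain negative log-likelihood alignment term as a stand-in for its inverse soft-Q objective:

```python
def weighted_imitation_loss(expert_lls, subopt_lls, w_expert=1.0, w_subopt=0.2):
    """Hypothetical weighted objective: average negative log-likelihood of
    demonstrations under the policy, weighted to favor the small expert set
    over the larger sub-optimal set."""
    n_e, n_s = len(expert_lls), len(subopt_lls)
    return (-w_expert * sum(expert_lls) / n_e
            - w_subopt * sum(subopt_lls) / n_s)
```

With these weights, improving fit to expert data lowers the loss more than an equal improvement on sub-optimal data, which is the intended asymmetry.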

AAMAS Conference 2024 Conference Paper

Unifying Regret and State-Action Space Coverage for Effective Unsupervised Environment Design

  • Jayden Teoh Jing Teoh
  • Wenjun Li
  • Pradeep Varakantham

Unsupervised Environment Design (UED) employs interactive training between a teacher agent and a student agent to train generally capable student agents. Existing UED methods primarily rely on regret to progressively introduce curriculum complexity for the student, but often overlook the importance of environment novelty, a critical element for enhancing an agent's exploration and generalization capabilities. The effects of environment novelty in UED remain largely uninvestigated. This paper addresses this gap by introducing the GMM-based Evaluation of Novelty In Environments (GENIE) framework. GENIE quantifies environment novelty within the UED paradigm by using Gaussian Mixture Models. To assess GENIE's effectiveness in quantifying novelty and driving exploration, we integrate it with ACCEL, the state-of-the-art UED algorithm. Empirical results demonstrate the superior zero-shot performance of this extended approach over existing UED algorithms, including its predecessor. By providing a means to quantify environment novelty, GENIE lays the groundwork for future UED algorithms to unify novelty-driven exploration and regret-driven exploitation in curriculum generation.

AAAI Conference 2024 Conference Paper

Unsupervised Training Sequence Design: Efficient and Generalizable Agent Training

  • Wenjun Li
  • Pradeep Varakantham

To train generalizable Reinforcement Learning (RL) agents, researchers recently proposed the Unsupervised Environment Design (UED) framework, in which a teacher agent creates a very large number of training environments and a student agent trains on the experiences in these environments to be robust against unseen testing scenarios. For example, to train a student to master the "stepping over stumps" task, the teacher will create numerous training environments with varying stump heights and shapes. In this paper, we argue that UED neglects training efficiency, and its need for a very large number of environments (henceforth referred to as infinite horizon training) makes it less suitable for training robots and non-expert humans. In real-world applications where either creating new training scenarios is expensive or training efficiency is of critical importance, we want to maximize both the learning efficiency and learning outcome of the student. To achieve efficient finite horizon training, we propose a novel Markov Decision Process (MDP) formulation for the teacher agent, referred to as Unsupervised Training Sequence Design (UTSD). Specifically, we encode salient information from the student policy (e.g., behaviors and learning progress) into the teacher's state space, enabling the teacher to closely track the student's learning progress and consequently discover optimal training sequences with finite lengths. Additionally, we explore the teacher's efficient adaptation to unseen students at test time by employing a context-based meta-learning approach, which leverages the teacher's past experiences with various students. Finally, we empirically demonstrate our teacher's capability to design efficient and effective training sequences for students with varying capabilities.

AAMAS Conference 2023 Conference Paper

Avoiding Starvation of Arms in Restless Multi-Armed Bandits

  • Dexun Li
  • Pradeep Varakantham

Restless multi-armed bandits (RMAB) is a popular framework for optimizing performance with limited resources under uncertainty. It is an extremely useful model for monitoring beneficiaries (arms) and executing timely interventions using health workers (limited resources) to ensure optimal benefit in public health settings. For instance, RMAB has been used to track patients' health and monitor their adherence in tuberculosis settings, ensure pregnant mothers listen to automated calls about good pregnancy practices, etc. Due to the limited resources, typically certain individuals, communities, or regions are starved of interventions, which can potentially have a significant negative impact on the individual/community in the long term. To that end, we first define a soft fairness objective, which entails that an algorithm never probabilistically favor one arm over another if the long-term cumulative reward of choosing the latter arm is higher. Then we provide a scalable approach to ensure long-term optimality while satisfying the proposed fairness constraints in RMAB. Our method, referred to as SoftFair, can balance the trade-off between the goal of having resources uniformly distributed and maximizing cumulative rewards. SoftFair also provides theoretical performance guarantees and is asymptotically optimal. Finally, we demonstrate the utility of our approaches on simulated benchmarks and show that the soft fairness objective can be handled without a significant sacrifice in the optimal value.

AAAI Conference 2023 Conference Paper

Constrained Reinforcement Learning in Hard Exploration Problems

  • Pathmanathan Pankayaraj
  • Pradeep Varakantham

One approach to guaranteeing safety in Reinforcement Learning is through cost constraints that are dependent on the policy. Recent works in constrained RL have developed methods that ensure constraints are enforced even at learning time while maximizing the overall value of the policy. Unfortunately, as demonstrated in our experimental results, such approaches do not perform well on complex multi-level tasks with longer episode lengths or sparse rewards. To that end, we propose a scalable hierarchical approach for constrained RL problems that employs backward cost value functions in the context of task hierarchy and a novel intrinsic reward function in lower levels of the hierarchy to enable cost constraint enforcement. One of our key contributions is in proving that backward value functions are theoretically viable even when there are multiple levels of decision making. We also show that our new approach, referred to as Hierarchically Limited consTraint Enforcement (HiLiTE), significantly improves on state-of-the-art Constrained RL approaches for many benchmark problems from the literature. We further demonstrate that this performance (on value and constraint enforcement) clearly outperforms existing best approaches for constrained RL and hierarchical RL.

AAAI Conference 2023 Conference Paper

Future Aware Pricing and Matching for Sustainable On-Demand Ride Pooling

  • Xianjie Zhang
  • Pradeep Varakantham
  • Hao Jiang

The popularity of on-demand ride pooling stems from the benefits offered to customers (lower prices), taxi drivers (higher revenue), the environment (lower carbon footprint due to fewer vehicles) and aggregation companies like Uber (higher revenue). To achieve these benefits, two key interlinked challenges have to be solved effectively: (a) pricing -- setting prices for customer requests for taxis; and (b) matching -- assignment of customers (that accepted the prices) to taxis/cars. Traditionally, both these challenges have been studied individually and using myopic approaches (considering only current requests), without considering the impact of current matching on addressing future requests. In this paper, we develop a novel framework that handles the pricing and matching problems together, while also considering the future impact of the pricing and matching decisions. In our experimental results on a real-world taxi dataset, we demonstrate that our framework can significantly improve revenue (up to 17% and on average 6.4%) in a sustainable manner by reducing the number of vehicles (up to 14% and on average 10.6%) required to obtain a given fixed revenue and the overall distance travelled by vehicles (up to 11.1% and on average 3.7%). That is, we provide a win-win for all stakeholders involved: lower prices for customers, higher revenue for drivers and the aggregator (ride pooling company), and benefits to the environment from fewer vehicles on the road and less fuel consumed.

IJCAI Conference 2023 Conference Paper

Generalization through Diversity: Improving Unsupervised Environment Design

  • Wenjun Li
  • Pradeep Varakantham
  • Dexun Li

Agent decision making using Reinforcement Learning (RL) heavily relies on either a model or simulator of the environment (e.g., moving in an 8x8 maze with three rooms, playing Chess on an 8x8 board). Due to this dependence, small changes in the environment (e.g., positions of obstacles in the maze, size of the board) can severely affect the effectiveness of the policy learned by the agent. To that end, existing work has proposed training RL agents on an adaptive curriculum of environments (generated automatically) to improve performance on out-of-distribution (OOD) test scenarios. Specifically, existing research has employed the potential for the agent to learn in an environment (captured using Generalized Advantage Estimation, GAE) as the key factor to select the next environment(s) to train the agent. However, such a mechanism can select similar environments (with a high potential to learn), thereby making agent training redundant on all but one of those environments. To that end, we provide a principled approach to adaptively identify diverse environments based on a novel distance measure relevant to environment design. We empirically demonstrate the versatility and effectiveness of our method in comparison to multiple leading approaches for unsupervised environment design on three distinct benchmark problems used in literature.

NeurIPS Conference 2023 Conference Paper

Generative Modelling of Stochastic Actions with Arbitrary Constraints in Reinforcement Learning

  • Changyu Chen
  • Ramesha Karunasena
  • Thanh Nguyen
  • Arunesh Sinha
  • Pradeep Varakantham

Many problems in Reinforcement Learning (RL) seek an optimal policy with large discrete multidimensional yet unordered action spaces; these include problems in randomized allocation of resources such as placements of multiple security resources and emergency response units, etc. A challenge in this setting is that the underlying action space is categorical (discrete and unordered) and large, for which existing RL methods do not perform well. Moreover, these problems require validity of the realized action (allocation); this validity constraint is often difficult to express compactly in a closed mathematical form. The allocation nature of the problem also prefers stochastic optimal policies, if one exists. In this work, we address these challenges by (1) applying a (state) conditional normalizing flow to compactly represent the stochastic policy — the compactness arises due to the network only producing one sampled action and the corresponding log probability of the action, which is then used by an actor-critic method; and (2) employing an invalid action rejection method (via a valid action oracle) to update the base policy. The action rejection is enabled by a modified policy gradient that we derive. Finally, we conduct extensive experiments to show the scalability of our approach compared to prior methods and the ability to enforce arbitrary state-conditional constraints on the support of the distribution of actions in any state.

AAMAS Conference 2023 Conference Paper

Knowledge Compilation for Constrained Combinatorial Action Spaces in Reinforcement Learning

  • Jiajing Ling
  • Moritz Lukas Schuler
  • Akshat Kumar
  • Pradeep Varakantham

Action-constrained reinforcement learning (ACRL), where any action taken in a state must satisfy given constraints, has several practical applications such as resource allocation in supply-demand matching, and path planning among others. A key challenge is to enforce constraints when the action space is discrete and combinatorial. To address this, first, we assume an action is represented using propositional variables, and action constraints are represented using Boolean functions. Second, we compactly encode the set of all valid actions that satisfy action constraints using a probabilistic sentential decision diagram (PSDD), a recently proposed knowledge compilation framework. Parameters of the PSDD compactly encode the probability distribution over all valid actions. Consequently, the learning task becomes optimizing PSDD parameters to maximize the RL objective. Third, we show how to embed the PSDD parameters using deep neural networks, and optimize them using a deep Q-learning based algorithm. By design, our approach is guaranteed to never violate any constraint, and does not involve any expensive projection step over the constraint space. Finally, we show how practical resource allocation constraints can be encoded using a PSDD. Empirically, our approach works better than previous ACRL methods, which often violate constraints and are not scalable, as they involve a computationally expensive projection-over-constraints step.
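PSDD compilation itself is beyond a short snippet, but the underlying idea (a probability distribution supported only on constraint-satisfying combinatorial actions) can be sketched naively: enumerate the Boolean assignments that satisfy a constraint and sample in proportion to per-action weights. The PSDD's contribution is doing this compactly, without enumeration; the functions and the example constraint below are illustrative, not from the paper.

```python
import itertools
import random

def valid_actions(n_vars, constraint):
    """Enumerate every Boolean assignment over n_vars that satisfies the
    constraint. (A PSDD represents this set, and a distribution over it,
    compactly instead of by brute-force enumeration.)"""
    return [a for a in itertools.product([0, 1], repeat=n_vars) if constraint(a)]

def sample_valid(actions, weights, rng):
    """Sample a valid action with probability proportional to its weight,
    mirroring the role of learned PSDD parameters: by construction, an
    invalid action can never be sampled."""
    total = sum(weights[a] for a in actions)
    r = rng.random() * total
    for a in actions:
        r -= weights[a]
        if r <= 0:
            return a
    return actions[-1]
```

For a toy resource-allocation constraint such as "exactly 2 of 4 resources are deployed" (`lambda a: sum(a) == 2`), every sampled action satisfies the constraint, so no projection step is ever needed.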

ECAI Conference 2023 Conference Paper

On Sustainable Ride Pooling Through Conditional Expected Value Decomposition

  • Avinandan Bose
  • Hao Jiang
  • Pradeep Varakantham
  • Zichang Ge

Centralized Multi-Agent Reinforcement Learning (MARL) presents itself as an ideal framework for aggregation companies (e.g., Uber, Lyft, Deliveroo) that have to take a sequential set of centralized decisions on assigning individual agents (typically resources like taxis, food delivery personnel) to customer requests online in the presence of demand uncertainty. However, centralized learning is especially challenging in such very large scale environments, with thousands of agents/resources and hundreds of thousands of requests coming in each day. In this paper, we provide a novel value decomposition mechanism that is able to tackle the scale and provide high quality (matching) decisions at each time step. We show that our value decomposition approach, Conditional Expectation based Value Decomposition (CEVD), is more sustainable (requires 9.9% fewer vehicles to serve an equal number of requests) and more efficient (serves 9.76% more requests, while traveling 13.32% less distance) than the current best approach over two different city scale (New York and Chicago) benchmarks for ride pooling using taxis.

AAMAS Conference 2023 Conference Paper

Strategic Planning for Flexible Agent Availability in Large Taxi Fleets

  • Rajiv Ranjan Kumar
  • Pradeep Varakantham
  • Shih-Fen Cheng

In large scale multi-agent systems like taxi fleets, individual agents (taxi drivers) are self-interested (maximizing their own profits) and this can introduce inefficiencies in the system. One such inefficiency is with regard to the "required" availability of taxis at different time periods during the day. Since a taxi driver can work a limited number of hours in a day (e.g., 8-10 hours in a city like Singapore), there is a need to optimize the specific hours, so as to maximize individual as well as social welfare. Technically, this corresponds to solving a large scale multi-stage selfish routing game with transition uncertainty. Existing work in addressing this problem is either unable to handle "driver" constraints (e.g., breaks during work hours) or not scalable. To that end, we provide a novel mechanism that builds on replicator dynamics through ideas from behavior cloning. We demonstrate that our methods provide significantly better policies than the existing approach in terms of improving individual agent revenue and overall agent availability.

IJCAI Conference 2023 Conference Paper

Transferable Curricula through Difficulty Conditioned Generators

  • Sidney Tio
  • Pradeep Varakantham

Advancements in reinforcement learning (RL) have demonstrated superhuman performance in complex tasks such as StarCraft, Go, and Chess. However, knowledge transfer from artificial "experts" to humans remains a significant challenge. A promising avenue for such transfer is the use of curricula. Recent methods in curriculum generation focus on training RL agents efficiently, yet such methods rely on surrogate measures to track student progress and are not suited for training robots in the real world (or, more ambitiously, humans). In this paper, we introduce a method named Parameterized Environment Response Model (PERM) that shows promising results in training RL agents in parameterized environments. Inspired by Item Response Theory, PERM seeks to model the difficulty of environments and the ability of RL agents directly. Given that RL agents and humans are trained more efficiently within the "zone of proximal development", our method generates a curriculum by matching the difficulty of an environment to the current ability of the student. In addition, PERM can be trained offline and does not employ non-stationary measures of student ability, making it suitable for transfer between students. We demonstrate PERM's ability to represent the environment parameter space, and training RL agents with PERM produces strong performance in deterministic environments. Lastly, we show that our method is transferable between students, without any sacrifice in training quality.

UAI Conference 2022 Conference Paper

Efficient resource allocation with fairness constraints in restless multi-armed bandits

  • Dexun Li
  • Pradeep Varakantham

Restless Multi-Armed Bandits (RMAB) is an apt model to represent decision-making problems in public health interventions (e.g., tuberculosis, maternal, and child care), anti-poaching planning, sensor monitoring, personalized recommendations and many more. Existing research in RMAB has contributed mechanisms and theoretical results to a wide variety of settings, where the focus is on maximizing expected value. In this paper, we are interested in ensuring that RMAB decision making is also fair to different arms while maximizing expected value. In the context of public health settings, this would ensure that different people and/or communities are fairly represented while making public health intervention decisions. To achieve this goal, we formally define the fairness constraints in RMAB and provide planning and learning methods to solve RMAB in a fair manner. We demonstrate key theoretical properties of fair RMAB and experimentally demonstrate that our proposed methods handle fairness constraints without sacrificing significantly on solution quality.

AAAI Conference 2022 Conference Paper

Field Study in Deploying Restless Multi-Armed Bandits: Assisting Non-profits in Improving Maternal and Child Health

  • Aditya Mate
  • Lovish Madaan
  • Aparna Taneja
  • Neha Madhiwalla
  • Shresth Verma
  • Gargi Singh
  • Aparna Hegde
  • Pradeep Varakantham

The widespread availability of cell phones has enabled non-profits to deliver critical health information to their beneficiaries in a timely manner. This paper describes our work to assist non-profits that employ automated messaging programs to deliver timely preventive care information to beneficiaries (new and expecting mothers) during pregnancy and after delivery. Unfortunately, a key challenge in such information delivery programs is that a significant fraction of beneficiaries drop out of the program. Yet, non-profits often have limited health-worker resources (time) to place crucial service calls for live interaction with beneficiaries to prevent such engagement drops. To assist non-profits in optimizing this limited resource, we developed a Restless Multi-Armed Bandits (RMABs) system. One key technical contribution in this system is a novel clustering method of offline historical data to infer unknown RMAB parameters. Our second major contribution is evaluation of our RMAB system in collaboration with an NGO, via a real-world service quality improvement study. The study compared strategies for optimizing service calls to 23003 participants over a period of 7 weeks to reduce engagement drops. We show that the RMAB group provides statistically significant improvement over other comparison groups, reducing engagement drops by ∼30%. To the best of our knowledge, this is the first study demonstrating the utility of RMABs in real world public health settings. We are transitioning our RMAB system to the NGO for real-world use.

AAMAS Conference 2022 Conference Paper

Hierarchical Value Decomposition for Effective On-demand Ride-Pooling

  • Jiang Hao
  • Pradeep Varakantham

On-demand ride-pooling (e.g., UberPool, GrabShare) services focus on serving multiple different customer requests using each vehicle, i.e., an empty or partially filled vehicle can be assigned requests from different passengers with different origins and destinations. On the other hand, in Taxi on Demand (ToD) services (e.g., UberX), one vehicle is assigned to only one request at a time. On-demand ride pooling is not only beneficial to customers (lower cost), drivers (higher revenue per trip) and aggregation companies (higher revenue), but is also of crucial importance to the environment, as it reduces the number of vehicles required on the roads. Since each vehicle has to be matched with a combination of customer requests, the matching problem in ride pooling is significantly more challenging. Due to this complexity, most existing solutions to the ride-pooling problem are myopic in that they either ignore the future impact of current matches or the impact of other taxis on the expected revenue earned by a taxi. In this paper, we build on an approximate dynamic programming framework to consider the impact of other taxis on the value of a taxi (expected revenue earned until the end of the horizon) through a novel hierarchical value decomposition framework. On a real-world city-scale taxi dataset, we show a significant improvement of up to 10.7% in requests served compared to the existing best method for on-demand ride pooling.
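The matching step shared by these ride-pooling papers, assigning each vehicle at most one combination of requests while serving each request at most once, can be sketched with a naive greedy heuristic. The papers solve this step as an (integer) linear program over learned values; `greedy_match` and its inputs below are illustrative stand-ins.

```python
def greedy_match(scored_combos):
    """Greedily assign request combinations to vehicles by descending score,
    ensuring each vehicle gets at most one combination and each request is
    served at most once. (The papers solve this step optimally as an ILP;
    greedy is a simple, suboptimal stand-in.)

    scored_combos: list of (score, vehicle_id, tuple_of_request_ids)."""
    assigned_vehicles, served_requests, matching = set(), set(), []
    for score, vehicle, combo in sorted(scored_combos, reverse=True):
        # Skip combos that reuse an assigned vehicle or an already-served request.
        if vehicle in assigned_vehicles or served_requests & set(combo):
            continue
        assigned_vehicles.add(vehicle)
        served_requests |= set(combo)
        matching.append((vehicle, combo))
    return matching
```

In the non-myopic approaches above, the `score` of a combination would come from a learned value function (expected future revenue) rather than immediate reward alone.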

ICAPS Conference 2022 Conference Paper

Joint Pricing and Matching for City-Scale Ride-Pooling

  • Sanket Shah
  • Meghna Lowalekar
  • Pradeep Varakantham

Central to efficient ride-pooling are two challenges: (1) how to `price' customers' requests for rides, and (2) if the customer agrees to that price, how to best `match' these requests to drivers. While both of them are interdependent, each challenge's individual complexity has meant that, historically, they have been decoupled and studied individually. This paper creates a framework for batched pricing and matching in which pricing is seen as a meta-level optimisation over different possible matching decisions. Our key contributions are in developing a variant of the revenue-maximizing auction corresponding to the meta-level optimization problem, and then providing a scalable mechanism for computing posted prices. We test our algorithm on real-world data at city-scale and show that our algorithm reliably matches demand to supply across a range of parameters.

AAMAS Conference 2021 Conference Paper

Adaptive Operating Hours for Improved Performance of Taxi Fleets

  • Rajiv Ranjan Kumar
  • Pradeep Varakantham
  • Shih-Fen Cheng

Taxi fleets and car aggregation systems are an important component of the urban public transportation system. Taxis and cars in taxi fleets and car aggregation systems (e.g., Uber) are dependent on a large number of self-controlled and profit-driven taxi drivers, which introduces inefficiencies in the system. There are two ways in which taxi fleet performance can be optimized: (i) operational decision making: improve assignment of taxis/cars to customers, while accounting for future demand; (ii) strategic decision making: optimize operating hours of (taxi and car) drivers. Existing research has primarily focused on the operational decisions in (i), and we focus on the strategic decisions in (ii). We first model this complex real world decision making problem (with thousands of taxi drivers) as a multi-stage stochastic congestion game with a non-dedicated set of agents (i.e., agents start operation at a random stage and exit the game after a fixed time), where there is a dynamic population of agents (constrained by the maximum number of drivers). We provide planning and learning methods for computing the ideal operating hours in such a game, so as to improve efficiency of the overall fleet. In our experimental results, we demonstrate that our planning based approach provides up to 16% improvement in revenue over the existing method on a real world taxi dataset. The learning based approach further improves the performance and achieves up to 10% more revenue than the planning approach.

UAI Conference 2021 Conference Paper

CLAIM: curriculum learning policy for influence maximization in unknown social networks

  • Dexun Li
  • Meghna Lowalekar
  • Pradeep Varakantham

Influence maximization is the problem of finding a small subset of nodes in a network that can maximize the diffusion of information. Recently, it has also found application in HIV prevention, substance abuse prevention, micro-finance adoption, etc., where the goal is to identify the set of peer leaders in a real-world physical social network who can disseminate information to a large group of people. Unlike online social networks, real-world networks are not completely known, and collecting information about the network is costly as it involves surveying multiple people. In this paper, we focus on this problem of network discovery for influence maximization. The existing work in this direction proposes a reinforcement learning framework. Since environment interactions in real-world settings are costly, it is important for reinforcement learning algorithms to have the minimum possible environment interactions, i.e., to be sample efficient. In this work, we propose CLAIM - Curriculum LeArning Policy for Influence Maximization - to improve the sample efficiency of RL methods. We conduct experiments on real-world datasets and show that our approach can outperform the current best approach.

IJCAI Conference 2021 Conference Paper

Learn to Intervene: An Adaptive Learning Policy for Restless Bandits in Application to Preventive Healthcare

  • Arpita Biswas
  • Gaurav Aggarwal
  • Pradeep Varakantham
  • Milind Tambe

In many public health settings, it is important for patients to adhere to health programs, such as taking medications and periodic health checks. Unfortunately, beneficiaries may gradually disengage from such programs, which is detrimental to their health. A concrete example of gradual disengagement has been observed by an organization that carries out a free automated call-based program for spreading preventive care information among pregnant women. Many women stop picking up calls after being enrolled for a few months. To avoid such disengagements, it is important to provide timely interventions. Such interventions are often expensive and can be provided to only a small fraction of the beneficiaries. We model this scenario as a restless multi-armed bandit (RMAB) problem, where each beneficiary is assumed to transition from one state to another depending on the intervention. Moreover, since the transition probabilities are unknown a priori, we propose a Whittle index based Q-Learning mechanism and show that it converges to the optimal solution. Our method improves over existing learning-based methods for RMABs on multiple benchmarks from literature and also on the maternal healthcare dataset.
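The index-policy structure behind this line of RMAB work can be roughly illustrated as follows. This is a hedged stand-in, not the paper's Whittle index based Q-learning: a true Whittle index is the passivity subsidy that equalizes the active and passive actions, and the paper learns it online; here we simply rank arms by the learned advantage of intervening and act on the top few within budget. All names are illustrative.

```python
def select_arms(q_values, current_states, budget):
    """Intervene on the `budget` arms whose active action (1) has the
    largest learned advantage over the passive action (0) in their
    current state. Illustrative index-style ranking only, not the
    paper's Whittle index computation.

    q_values: {arm: {state: {0: q_passive, 1: q_active}}}
    current_states: {arm: state}"""
    scores = {arm: q_values[arm][s][1] - q_values[arm][s][0]
              for arm, s in current_states.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:budget])
```

The budget constraint mirrors the setting in the abstract: interventions (e.g., live service calls) can only be provided to a small fraction of beneficiaries at each step.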

AAMAS Conference 2021 Conference Paper

Learning Index Policies for Restless Bandits with Application to Maternal Healthcare

  • Arpita Biswas
  • Gaurav Aggarwal
  • Pradeep Varakantham
  • Milind Tambe

In many community health settings, it is crucial to have a systematic monitoring and intervention process to ensure that patients adhere to healthcare programs, such as periodic health checks or taking medications. When these interventions are expensive, they can be provided to only a fixed small fraction of the patients at any period of time. Hence, it is important to carefully choose the beneficiaries who should be provided with interventions and when. We model this scenario as a restless multi-armed bandit (RMAB) problem, where each beneficiary is assumed to transition from one state to another depending on the intervention provided to them. In practice, the transition probabilities are unknown a priori, and hence, we propose a mechanism that balances the explore-exploit trade-off. Empirically, we find that our proposed mechanism outperforms the baseline intervention scheme on a maternal healthcare dataset.

JAIR Journal 2021 Journal Article

Zone pAth Construction (ZAC) based Approaches for Effective Real-Time Ridesharing

  • Meghna Lowalekar
  • Pradeep Varakantham
  • Patrick Jaillet

Real-time ridesharing systems such as UberPool, Lyft Line and GrabShare have become hugely popular as they reduce the costs for customers, improve per trip revenue for drivers and reduce traffic on the roads by grouping customers with similar itineraries. The key challenge in these systems is to group the “right” requests to travel together in the “right” available vehicles in real-time, so that the objective (e.g., requests served, revenue or delay) is optimized. This challenge has been addressed in existing work by: (i) generating as many relevant feasible combinations of requests (with respect to the available delay for customers) as possible in real-time; and then (ii) optimizing assignment of the feasible request combinations to vehicles. Since the number of request combinations increases exponentially with the increase in vehicle capacity and number of requests, unfortunately, such approaches have to employ ad hoc heuristics to identify a subset of request combinations for assignment. Our key contribution is in developing approaches that employ zone (abstraction of individual locations) paths instead of request combinations. Zone paths allow for generation of significantly more “relevant” combinations (in comparison to ad hoc heuristics) in real-time than competing approaches due to two reasons: (i) Each zone path can typically represent multiple request combinations; (ii) Zone paths are generated using a combination of offline and online methods. Specifically, we contribute both myopic (ridesharing assignment focussed on current requests only) and non-myopic (ridesharing assignment considers impact on expected future requests) approaches that employ zone paths. In our experimental results, we demonstrate that our myopic approach outperforms the current best myopic approach for ridesharing on both real-world and synthetic datasets (with respect to both objective and runtime). 
We also show that our non-myopic approach obtains 14.7% improvement over existing myopic approach. Our non-myopic approach gets improvements of up to 12.48% over a recent non-myopic approach, NeurADP. Even when NeurADP is allowed to optimize learning over test settings, results largely remain comparable except in a couple of cases, where NeurADP performs better.

AAAI Conference 2020 Conference Paper

Neural Approximate Dynamic Programming for On-Demand Ride-Pooling

  • Sanket Shah
  • Meghna Lowalekar
  • Pradeep Varakantham

On-demand ride-pooling (e.g., UberPool, LyftLine, GrabShare) has recently become popular because of its ability to lower costs for passengers while simultaneously increasing revenue for drivers and aggregation companies (e.g., Uber). Unlike in Taxi on Demand (ToD) services – where a vehicle is assigned one passenger at a time – in on-demand ride-pooling, each vehicle must simultaneously serve multiple passengers with heterogeneous origin and destination pairs without violating any quality constraints. To ensure near real-time response, existing solutions to the real-time ride-pooling problem are myopic in that they optimise the objective (e.g., maximise the number of passengers served) for the current time step without considering the effect such an assignment could have on assignments in future time steps. However, considering the future effects of an assignment that also has to consider what combinations of passenger requests can be assigned to vehicles adds a layer of combinatorial complexity to the already challenging problem of considering future effects in the ToD case. A popular approach that addresses the limitations of myopic assignments in ToD problems is Approximate Dynamic Programming (ADP). However, existing ADP methods for ToD can only handle Linear Program (LP) based assignments, as the value update relies on dual values from the LP. The assignment problem in ride pooling requires an Integer Linear Program (ILP) that has weak LP relaxations. Therefore, our key technical contribution is in providing a general ADP method that can learn from the ILP based assignment found in ride-pooling. Additionally, we handle the extra combinatorial complexity from combinations of passenger requests by using a Neural Network based approximate value function and show a connection to Deep Reinforcement Learning that allows us to learn this value-function with increased stability and sample-efficiency.
We show that our approach easily outperforms leading approaches for on-demand ride-pooling on a real-world dataset by up to 16%, a significant improvement in city-scale transportation problems.

ICAPS Conference 2020 Conference Paper

Online Traffic Signal Control through Sample-Based Constrained Optimization

  • Srishti Dhamija
  • Alolika Gon
  • Pradeep Varakantham
  • William Yeoh 0001

Traffic congestion reduces the productivity of individuals by increasing time spent in traffic and also increases pollution. To reduce traffic congestion by better handling dynamic traffic patterns, recent work has focused on online traffic signal control. Typically, the objective in traffic signal control is to minimize expected delay over all vehicles given the uncertainty associated with vehicle turn movements at intersections. In order to ensure responsiveness in decision making, a typical approach is to compute a schedule that minimizes the delay for the expected scenario of vehicle movements instead of minimizing expected delay over the feasible vehicle movement scenarios. Such an approximation degrades schedule quality with respect to expected delay as vehicle turn uncertainty at intersections increases. We introduce TUSERACT (TUrn-SamplE-based Real-time trAffic signal ConTrol), an approach that minimizes expected delay over samples of turn movement uncertainty of vehicles. Specifically, our key contributions are: (a) a scalable constraint program formulation that, by exploiting the insight that vehicle turn movements do not change with the traffic signal control schedule, computes a schedule minimizing expected delay across multiple vehicle movement samples for a traffic signal; (b) a novel mechanism to coordinate multiple traffic signals through vehicle turn movement samples; and (c) a comprehensive experimental evaluation to demonstrate the utility of TUSERACT over SURTRAC, a leading approach for online traffic signal control which makes the aforementioned approximation. Our approach provides substantially lower (up to 60%) mean expected delay relative to SURTRAC with very few turn movement samples while providing real-time decision making on both real and synthetic networks.

AAAI Conference 2020 Conference Paper

Solving Online Threat Screening Games using Constrained Action Space Reinforcement Learning

  • Shah Sanket
  • Arunesh Sinha
  • Pradeep Varakantham
  • Perrault Andrew
  • Milind Tambe

Large-scale screening for potential threats with limited resources and capacity for screening is a problem of interest at airports, seaports, and other ports of entry. Adversaries can observe screening procedures and arrive at a time when there will be gaps in screening due to limited resource capacities. To capture this game between ports and adversaries, this problem has been previously represented as a Stackelberg game, referred to as a Threat Screening Game (TSG). Given the significant complexity associated with solving TSGs and uncertainty in arrivals of customers, existing work has assumed that screenees arrive and are allocated security resources at the beginning of the time-window. In practice, screenees such as airport passengers arrive in bursts correlated with flight time and are not bound by fixed time-windows. To address this, we propose an online threat screening model in which the screening strategy is determined adaptively as a passenger arrives while satisfying a hard bound on the acceptable risk of not screening a threat. To solve the online problem, we first reformulate it as a Markov Decision Process (MDP) in which the hard bound on risk translates to a constraint on the action space and then solve the resultant MDP using Deep Reinforcement Learning (DRL). To this end, we provide a novel way to efficiently enforce linear inequality constraints on the action output in DRL. We show that our solution allows us to significantly reduce screenee wait time without compromising on the risk.

AAMAS Conference 2019 Conference Paper

A Homophily-Free Community Detection Framework for Trajectories with Delayed Responses

  • Chung-Kyun Han
  • Shih-Fen Cheng
  • Pradeep Varakantham

Community detection has been widely studied in the areas of social network analysis and recommendation systems. However, most existing research focuses on cases where relationships are explicit or depend on simultaneous appearance. In this paper, we propose to study the community detection problem where the relationships are based not on simultaneous appearances, but on time-delayed appearances. In other words, we aim to capture the relationship where one individual physically follows another individual. In our attempt to capture such relationships, the major challenge is the presence of spatial homophily, i.e., individuals are attracted to locations due to their popularity and not because of communications. In tackling the community detection problem with spatial homophily and delayed responses, we make the following key contributions: (1) We introduce a four-phase framework, which excludes homophily by way of quantified impacts. (2) To validate the framework, we generate a synthetic dataset based on a known community structure and then infer that community structure. (3) Finally, we execute this framework on a real-world dataset with more than 6,000 taxis in Singapore. Our results are also compared to those of a baseline approach without homophily-elimination.

UAI Conference 2019 Conference Paper

Correlated Learning for Aggregation Systems

  • Tanvi Verma
  • Pradeep Varakantham

Aggregation systems (e.g., Uber, Lyft, FoodPanda, Deliveroo) have been increasingly used to improve efficiency in numerous environments, including transportation, logistics, and food and grocery delivery. In these systems, a centralized entity (e.g., Uber) aggregates supply and assigns it to demand so as to optimize a central metric such as profit, number of requests, or delay. Because a metric of importance to the centralized entity is being optimized, the interests of individuals (e.g., drivers, delivery boys) can be sacrificed. Therefore, in this paper, we focus on the problem of serving individual interests, i.e., learning revenue-maximizing policies for individuals in the presence of a self-interested central entity. Since there is a large number of homogeneous learning agents, we represent the problem as an Anonymous Multi-Agent Reinforcement Learning (AyMARL) problem. By using the self-interested centralized entity as a correlation entity, we provide a novel learning mechanism that helps individual agents maximize their individual revenue. Our Correlated Learning (CL) algorithm is able to outperform existing mechanisms on a generic simulator for aggregation systems and on multiple other benchmark Multi-Agent Reinforcement Learning (MARL) problems.

ICAPS Conference 2019 Conference Paper

Entropy Based Independent Learning in Anonymous Multi-Agent Settings

  • Tanvi Verma
  • Pradeep Varakantham
  • Hoong Chuin Lau

Efficient sequential matching of supply and demand is a problem of interest in many online-to-offline services: for instance, Uber, Lyft, and Grab for matching taxis to customers; Ubereats, Deliveroo, FoodPanda, etc. for matching restaurants to customers. In these online-to-offline service problems, individuals who are responsible for supply (e.g., taxi drivers, delivery bikes or delivery van drivers) earn more by being at the "right" place at the "right" time. We are interested in developing approaches that learn to guide individuals to be in the "right" place at the "right" time (to maximize revenue) in the presence of other similar "learning" individuals and with only local aggregated observation of other agents' states (e.g., only the number of other taxis in the same zone as the current agent).

AAMAS Conference 2019 Conference Paper

RE-ORG: An Online Repositioning Guidance Agent

  • Muralidhar Konda
  • Pradeep Varakantham
  • Aayush Saxena
  • Meghna Lowalekar

Continuous matching of supply and demand through online/real-time decision making is a problem of critical importance in many urban environments [1–4, 7, 9, 10]. Examples include bike/scooter sharing systems, car sharing, food/grocery delivery, smart vending machines and other similar aggregation systems. In these systems, apart from supply (e.g., bikes/scooters, cars, restaurants/supermarkets) and demand (e.g., customers), there may also be matching resources that help match supply and demand (e.g., trucks, car drivers, delivery boys). The success of these systems depends on the ability to have supply available at the “right” locations at the “right” time. The matching decisions taken by the company operators are typically performed based on the current status of supply, demand and location of matching resources. In this paper, we present RE-ORG (Repositioning agEnt for Online spatio-tempoRal matchinG problems) to provide guidance for matching resources to have supply at the right locations at the right time to serve demand. Apart from providing guidance on matching decisions, RE-ORG also provides a real-time status of the supply, demand and matching resources.

ICAPS Conference 2019 Conference Paper

Resource Constrained Deep Reinforcement Learning

  • Abhinav Bhatia
  • Pradeep Varakantham
  • Akshat Kumar

In urban environments, resources have to be constantly matched to the “right” locations where customer demand is present. For instance, ambulances have to be matched to base stations regularly so as to reduce response time for emergency incidents in ERS (Emergency Response Systems); vehicles (cars, bikes among others) have to be matched to docking stations to reduce lost demand in shared mobility systems. Such problems are challenging owing to demand uncertainty, combinatorial action spaces and constraints on the allocation of resources (e.g., total resources, minimum and maximum number of resources at locations and regions). Existing systems typically employ myopic and greedy optimization approaches to resource allocation. Such approaches are typically unable to handle surges or variances in demand patterns well. Recent work has demonstrated the ability of Deep RL methods to adapt well to highly uncertain environments. However, existing Deep RL methods are unable to handle combinatorial action spaces and constraints on the allocation of resources. To that end, we have developed three approaches on top of the well-known actor-critic approach DDPG (Deep Deterministic Policy Gradient) that are able to handle constraints on resource allocation. We also demonstrate that they are able to outperform leading approaches on simulators validated on semi-real and real data sets.

ICAPS Conference 2019 Conference Paper

ZAC: A Zone Path Construction Approach for Effective Real-Time Ridesharing

  • Meghna Lowalekar
  • Pradeep Varakantham
  • Patrick Jaillet

Real-time ridesharing systems such as UberPool, Lyft Line, GrabShare have become hugely popular as they reduce the costs for customers, improve per trip revenue for drivers and reduce traffic on the roads by grouping customers with similar itineraries. The key challenge in these systems is to group the right requests to travel in available vehicles in real-time, so that the objective (e.g., requests served, revenue or delay) is optimized. The most relevant existing work has focussed on generating as many relevant feasible (with respect to available delay for customers) combinations of requests (referred to as trips) as possible in real-time. Since the number of trips increases exponentially with the increase in vehicle capacity and number of requests, unfortunately, such an approach has to employ ad hoc heuristics to identify relevant trips. To that end, we propose an approach that generates many zone (abstraction of individual locations) paths – where each zone path can represent multiple trips (combinations of requests) – and assigns available vehicles to these zone paths to optimize the objective. The key advantage of our approach is that these zone paths are generated using a combination of offline and online methods, consequently allowing for the generation of many more relevant combinations in real-time than competing approaches. We demonstrate that our approach outperforms (with respect to both objective and runtime) the current best approach for ridesharing on both real world and synthetic datasets.

AAMAS Conference 2018 Conference Paper

A Driver Guidance System for Taxis in Singapore

  • Shashi Shekhar Jha
  • Shih-Fen Cheng
  • Meghna Lowalekar
  • Nicholas Wong
  • Rishikeshan Rajendram
  • Pradeep Varakantham
  • Nghia Troung Troung
  • Firmansyah Bin Abd Rahman

Traditional taxi fleet operators the world over have been facing intense competition from various ride-hailing services such as Uber and Grab. Based on our studies of the taxi industry in Singapore, we see that the emergence of Uber and Grab in the ride-hailing market has greatly impacted the taxi industry: the average daily taxi ridership for the past two years has been falling continuously, by close to 20% in total. In this work, we discuss how efficient real-time data analytics and large-scale multiagent optimization technology could help taxi drivers compete against more technologically advanced service platforms. Our system has been in field trial with close to 400 drivers, and our initial results show that by following our recommendations, drivers on average save 21.5% on roaming time.

ICAPS Conference 2018 Conference Paper

Bounded Rank Optimization for Effective and Efficient Emergency Response

  • Pallavi Manohar
  • Pradeep Varakantham
  • Hoong Chuin Lau

Effective placement of emergency response vehicles (such as ambulances, fire trucks, police cars) to deal with medical, fire or criminal activities can reduce incident response time by a few seconds, which in turn can potentially save a human life. Owing to its adoption in Emergency Medical Services (EMSs) worldwide, existing research on improving emergency response has focused on optimizing the objective of bounded time (i.e., the number of incidents served within a fixed time). Due to the dependence of this objective on temporal uncertainty, optimizing the bounded time objective is challenging. In this paper, we propose a new objective referred to as the bounded rank (the number of incidents served by a base station whose rank is below a bounded rank value) that has nice theoretical properties and serves as an indirect substitute for the bounded time objective. To understand the theoretical properties of this new objective in the context of the spatio-temporal uncertainty associated with emergency incidents, we first provide a Poisson Point Process (PPP) model of the emergency response problem. We then formally define the bounded rank objective in the context of the model and demonstrate that the bounded rank metric is monotone submodular. Due to the monotone submodularity of the objective, we can propose a greedy approach that provides an a priori guarantee of 50% from optimal and a much tighter a posteriori guarantee. Practically and more importantly, we demonstrate that optimizing this bounded rank objective on simulators validated on real data (and not just on the abstract PPP model) provides better results than the best known approach for optimizing the bounded time objective.

UAI Conference 2018 Conference Paper

Decentralized Planning for Non-dedicated Agent Teams with Submodular Rewards in Uncertain Environments

  • Pritee Agrawal
  • Pradeep Varakantham
  • William Yeoh 0001

Decentralized planning under uncertainty for agent teams is a problem of interest in many domains including (but not limited to) disaster rescue, sensor networks and security patrolling. Decentralized MDPs (Dec-MDPs) have traditionally been used to represent such decentralized planning under uncertainty problems. However, in many domains, agents may not be dedicated to the team for the entire time horizon. For instance, due to limited availability of resources, it is quite common for police personnel to leave patrolling teams to attend to accidents. Such non-dedication can arise due to the emergence of higher priority tasks or damage to existing agents. However, there is very limited literature dealing with the handling of non-dedication in decentralized settings. To that end, we provide a general model to represent problems dealing with cooperative and decentralized planning for non-dedicated agent teams. We also provide two greedy approaches (an offline one and an offline-online one) that are able to deal with agents leaving the team in an effective and efficient way by exploiting the submodularity property. Finally, we demonstrate that our approaches are able to obtain more than 90% of optimal solution quality on benchmark problems from the literature.

AAAI Conference 2018 Conference Paper

Dispatch Guided Allocation Optimization for Effective Emergency Response

  • Supriyo Ghosh
  • Pradeep Varakantham

Effective emergency (medical, fire or criminal) response is crucial for improving safety and security in urban environments. Recent research on improving the effectiveness of emergency management systems (EMSs) has utilized data-driven optimization models for efficient allocation of emergency response vehicles (ERVs) to base locations. However, these data-driven optimization models either ignore the dispatch strategy of ERVs (typically the nearest available ERV is dispatched to serve an incident) or employ myopic approaches (e.g., a greedy approach based on marginal gain). This results in allocations that are not synchronised with the real evolution dynamics on the ground or that can be improved significantly. To bridge this gap, we make the following contributions: (1) We first provide a novel exact optimization model for allocation of ERVs that incorporates the non-linear real-world dispatch strategy as linear constraints and ensures that the optimization exactly imitates the real-world dynamics of the EMS; (2) In order to improve scalability, we then provide two novel heuristic approaches to solve problems with a large number of emergency incidents; and (3) Finally, using two real-world EMS data sets, we empirically demonstrate that our heuristic approaches provide significant improvement over the best known benchmark approach.

AIJ Journal 2018 Journal Article

Online spatio-temporal matching in stochastic and dynamic domains

  • Meghna Lowalekar
  • Pradeep Varakantham
  • Patrick Jaillet

Online spatio-temporal matching of servers/services to customers is a problem that arises at a large scale in many domains associated with shared transportation (e.g., taxis, ride sharing, super shuttles, etc.) and delivery services (e.g., food, equipment, clothing, home fuel, etc.). A key characteristic of these problems is that the matching of servers/services to customers in one stage has a direct impact on the matching in the next stage. For instance, it is efficient for taxis to pick up customers closer to the drop-off point of the customer from the first stage of matching. Traditionally, greedy/myopic approaches have been adopted to address such large scale online matching problems. While they provide solutions in a scalable manner, due to their myopic nature, the quality of matching obtained can be improved significantly (as demonstrated in our experimental results). In this paper, we present a multi-stage stochastic optimization formulation to consider potential future demand scenarios (obtained from past data). We then provide an enhancement to solve large scale problems more effectively and efficiently online. We also provide worst-case theoretical bounds on the performance of different approaches. Finally, we demonstrate the significant improvement provided by our techniques over myopic approaches and two other multi-stage approaches from the literature (Approximate Dynamic Programming and a Hybrid Multi-Stage Stochastic optimization formulation) on three real-world taxi data sets.

ICAPS Conference 2018 Conference Paper

Reserved Optimisation: Handling Incident Priorities in Emergency Response Systems

  • Muralidhar Konda
  • Supriyo Ghosh
  • Pradeep Varakantham

Emergency (medical, fire or criminal) Management Systems (EMSs) are crucial for ensuring public safety and security. Typically, in many cities, less than 20% of the cases received by EMSs belong to the extremely serious category and require immediate help. The rest of the incidents are typically less serious and thereby allow more flexibility in response time. Therefore, for efficient management of EMS requests, several EMSs now categorise an incoming emergency request into a priority level based on well studied “triaging” methods. Leading research on optimising emergency response has either focussed on data-driven models for settings with homogeneous incidents or on generic heuristics (that are not data-driven) in multi-priority incident settings. In this paper, we provide data-driven models that employ tiered optimisation of allocation and dispatch simultaneously to ensure high priority incidents are served effectively. To that end, we make the following contributions in this paper: (1) For a given dataset of historical incidents, we first provide an optimisation model that maximises the percentage of highest priority incidents served within a threshold response time, while ensuring threshold response times for other priority incidents. Apart from optimising the allocation, this optimisation model also provides a detailed dispatch strategy that fits the given set of historical incidents well; (2) To better handle the high variance (spatial and temporal) in the arrival of high priority incidents, we reserve a set of ERVs for high priority incidents. Our second contribution is in modifying our optimisation model to reserve a subset of ERVs for high priority incidents while considering a minor modification to the nearest-available-ERV dispatch strategy; and (3) Finally, using a real-world EMS data set, we experimentally demonstrate that our solution with a detailed dispatch strategy outperforms the existing benchmark approach.
Moreover, due to the presence of few high priority incidents and the significant spatio-temporal uncertainty associated with them, we show that a simple dispatch strategy with reserved ERVs outperforms the detailed dispatch strategy.

TIST Journal 2018 Journal Article

Risk-Sensitive Stochastic Orienteering Problems for Trip Optimization in Urban Environments

  • Pradeep Varakantham
  • Akshat Kumar
  • Hoong Chuin Lau
  • William Yeoh

Orienteering Problems (OPs) are used to model many routing and trip planning problems. OPs are a variant of the well-known traveling salesman problem where the goal is to compute the highest reward path that includes a subset of vertices and has an overall travel time less than a specified deadline. However, the applicability of OPs is limited due to the assumption of deterministic and static travel times. To that end, Campbell et al. extended OPs to Stochastic OPs (SOPs) to represent uncertain travel times (Campbell et al. 2011). In this article, we make the following key contributions: (1) We extend SOPs to Dynamic SOPs (DSOPs), which allow for time-dependent travel times; (2) we introduce a new objective criterion for SOPs and DSOPs to represent a percentile measure of risk; (3) we provide non-linear optimization formulations along with their linear equivalents for solving the risk-sensitive SOPs and DSOPs; (4) we provide a local search mechanism for solving the risk-sensitive SOPs and DSOPs; and (5) we provide results on existing benchmark problems and a real-world theme park trip planning problem.

AAAI Conference 2018 Conference Paper

Upping the Game of Taxi Driving in the Age of Uber

  • Shashi Shekhar Jha
  • Shih-Fen Cheng
  • Meghna Lowalekar
  • Nicholas Wong
  • Rishikeshan Rajendram
  • Trong Khiem Tran
  • Pradeep Varakantham
  • Nghia Truong Trong

In most cities, taxis play an important role in providing point-to-point transportation service. If the taxi service is reliable, responsive, and cost-effective, past studies show that taxi-like services can be a viable choice for replacing a significant amount of private cars. However, making taxi services efficient is extremely challenging, mainly due to the fact that taxi drivers are self-interested and they operate with only local information. Although past research has demonstrated how recommendation systems could potentially help taxi drivers improve their performance, most of these efforts are not feasible in practice. This is mostly due to the lack of both comprehensive data coverage and an efficient recommendation engine that can scale to tens of thousands of drivers. In this paper, we propose a comprehensive working platform called the Driver Guidance System (DGS). With real-time citywide taxi data provided by our collaborator in Singapore, we demonstrate how we can combine real-time data analytics and large-scale optimization to create a guidance system that can potentially benefit tens of thousands of taxi drivers. Via a realistic agent-based simulation, we demonstrate that drivers following DGS can significantly improve their performance over ordinary drivers, regardless of the adoption ratio. We have completed the design and building of our system and have recently entered the field trial phase.

ICAPS Conference 2017 Conference Paper

Augmenting Decisions of Taxi Drivers through Reinforcement Learning for Improving Revenues

  • Tanvi Verma
  • Pradeep Varakantham
  • Sarit Kraus
  • Hoong Chuin Lau

Taxis (which include cars working with car aggregation systems such as Uber, Grab, Lyft, etc.) have become a critical component of urban transportation. While most research and applications in the context of taxis have focused on improving performance from a customer perspective, in this paper, we focus on improving performance from a taxi driver perspective. Higher revenues for taxi drivers can help bring more drivers into the system, thereby improving availability for customers in dense urban cities. Typically, when there is no customer on board, taxi drivers will cruise around to find customers either directly (on the street) or indirectly (due to a request from a nearby customer on the phone or on aggregation systems). For such cruising taxis, we develop a Reinforcement Learning (RL) based system that learns from real trajectory logs of drivers to advise them on the right locations to find customers so as to maximize their revenue. There are multiple translational challenges involved in building this RL system based on real data, such as annotating the activities (e.g., roaming, going to a taxi stand, etc.) observed in trajectory logs, identifying the right features for the state and action spaces, and evaluating against real driver performance observed in the dataset. We also provide a dynamic abstraction mechanism to improve the basic learning mechanism. Finally, we provide a thorough evaluation on a real-world data set from a developed Asian city and demonstrate that an RL based system can provide significant benefits to the drivers.

AAAI Conference 2017 Conference Paper

Decentralized Planning in Stochastic Environments with Submodular Rewards

  • Rajiv Kumar
  • Pradeep Varakantham
  • Akshat Kumar

Decentralized Markov Decision Processes (Dec-MDPs) provide a rich framework to represent cooperative decentralized and stochastic planning problems under transition uncertainty. However, solving a Dec-MDP to generate coordinated yet decentralized policies is NEXP-hard. Researchers have made significant progress in providing approximate approaches to improve scalability with respect to the number of agents. However, there has been little or no research devoted to finding guarantees on solution quality for approximate approaches considering multiple (more than two) agents. We have a similar situation with respect to the competitive decentralized planning problem and the Stochastic Game (SG) model. To address this, we identify models in the cooperative and competitive cases that rely on submodular rewards, where we show that existing approximate approaches can provide strong quality guarantees (a priori, and for the cooperative case also a posteriori guarantees). We then provide solution approaches and demonstrate improved online guarantees on benchmark problems from the literature for the cooperative case.

JAIR Journal 2017 Journal Article

Dynamic Repositioning to Reduce Lost Demand in Bike Sharing Systems

  • Supriyo Ghosh
  • Pradeep Varakantham
  • Yossiri Adulyasak
  • Patrick Jaillet

Bike Sharing Systems (BSSs) are widely adopted in major cities of the world due to concerns associated with extensive private vehicle usage, namely, increased carbon emissions, traffic congestion and usage of nonrenewable resources. In a BSS, base stations are strategically placed throughout a city and each station is stocked with a pre-determined number of bikes at the beginning of the day. Customers hire the bikes from one station and return them at another station. Due to unpredictable movements of customers hiring bikes, there is either congestion (more than required) or starvation (fewer than required) of bikes at base stations. Existing data has shown that congestion/starvation is a common phenomenon that leads to a large number of unsatisfied customers resulting in a significant loss in customer demand. In order to tackle this problem, we propose an optimisation formulation to reposition bikes using vehicles while also considering the routes for vehicles and future expected demand. Furthermore, we contribute two approaches that rely on decomposability in the problem (bike repositioning and vehicle routing) and aggregation of base stations to reduce the computation time significantly. Finally, we demonstrate the utility of our approach by comparing against two benchmark approaches on two real-world data sets of bike sharing systems. These approaches are evaluated using a simulation where the movements of customers are generated from real-world data sets.

AAMAS Conference 2017 Conference Paper

Exploiting Anonymity and Homogeneity in Factored Dec-MDPs through Precomputed Binomial Distributions

  • Rajiv Ranjan Kumar
  • Pradeep Varakantham

Recent work in decentralized stochastic planning for cooperative agents has focussed on exploiting the homogeneity of agents and anonymity in interactions to solve problems with large numbers of agents. Due to a linear optimization formulation that computes a joint policy and an objective that indirectly approximates joint expected reward with the reward for the expected number of agents in all state-action pairs, these approaches have ensured improved scalability. Such an objective closely approximates joint expected reward when there are many agents, due to the law of large numbers. However, performance deteriorates in problems with fewer agents. In this paper, we improve on the previous line of work by providing a linear optimization formulation that employs a more direct approximation of joint expected reward. The new approximation is based on offline computation of binomial distributions. Our new technique is not only able to improve quality performance on problems with large numbers of agents, but is also able to perform on par with existing best approaches on problems with fewer agents. This is achieved without sacrificing the scalability/run-time performance of previous work.

ICAPS Conference 2017 Conference Paper

Incentivizing the Use of Bike Trailers for Dynamic Repositioning in Bike Sharing Systems

  • Supriyo Ghosh
  • Pradeep Varakantham

A Bike Sharing System (BSS) is a green mode of transportation that is employed extensively for short distance travel in major cities of the world. Unfortunately, user behaviour driven by personal needs can often result in empty or full base stations, thereby resulting in loss of customer demand. To counter this loss in customer demand, BSS operators typically utilize a fleet of carrier vehicles for repositioning bikes between stations. However, this fuel-burning mode of repositioning incurs significant routing and labor costs and further increases carbon emissions. Therefore, we propose a potentially self-sustaining and environment-friendly system of dynamic repositioning that moves idle bikes during the day with the help of bike trailers. A bike trailer is an add-on to a bike that can help carry 3-5 bikes at once. Specifically, we make the following key contributions: (i) We provide an optimization formulation that generates “repositioning” tasks so as to minimize the expected lost demand over past demand scenarios; (ii) Within the budget constraints of the operator, we then design a mechanism to crowdsource the tasks among potential users who intend to execute repositioning tasks; (iii) Finally, we provide extensive results on a wide range of demand scenarios from a real-world data set to demonstrate that our approach is highly competitive with the existing fuel-burning mode of repositioning while being green.

IJCAI Conference 2017 Conference Paper

Mechanism Design for Strategic Project Scheduling

  • Pradeep Varakantham
  • Na Fu

Organizing large scale projects (e.g., conferences, IT shows, F1 races) requires precise scheduling of multiple dependent tasks on common resources, where multiple selfish entities are competing to execute the individual tasks. In this paper, we consider a well studied and rich scheduling model referred to as RCPSP (Resource Constrained Project Scheduling Problem). The key change to this model that we consider in this paper is the presence of selfish entities competing to perform individual tasks with the aim of maximizing their own utility. Due to the selfish entities in play, the goal of the scheduling problem is no longer only to minimize makespan for the entire project, but rather to maximize social welfare while ensuring incentive compatibility and economic efficiency. We show that the traditional VCG mechanism is not incentive compatible in this context and hence provide two new practical mechanisms that extend VCG. These new mechanisms, referred to as Individual Completion based Payments (ICP) and Social Completion based Payments (SCP), provide strong theoretical properties, including strategy-proofness.

ICAPS Conference 2017 Conference Paper

Online Repositioning in Bike Sharing Systems

  • Meghna Lowalekar
  • Pradeep Varakantham
  • Supriyo Ghosh
  • Sanjay Dominik Jena
  • Patrick Jaillet

Due to increased traffic congestion and carbon emissions, Bike Sharing Systems (BSSs) are adopted in various cities for short distance travels, specifically for last mile transportation. The success of a bike sharing system depends on its ability to have bikes available at the "right" base stations at the "right" times. Typically, carrier vehicles are used to perform repositioning of bikes between stations so as to satisfy customer requests. Owing to the uncertainty in customer demand and day-long repositioning, the problem of having bikes available at the right base stations at the right times is a challenging one. In this paper, we propose a multi-stage stochastic formulation that considers expected future demand over a set of scenarios to find an efficient repositioning strategy for bike sharing systems. Furthermore, we provide a Lagrangian decomposition approach (that decouples the global problem into routing and repositioning slaves and employs a novel DP approach to efficiently solve the routing slave) and a greedy online anticipatory heuristic to solve large scale problems effectively and efficiently. Finally, in our experimental results, we demonstrate significant reduction in lost demand provided by our techniques on real world datasets from two bike sharing companies in comparison to existing benchmark approaches.

IJCAI Conference 2017 Conference Paper

Proactive and Reactive Coordination of Non-dedicated Agent Teams Operating in Uncertain Environments

  • Pritee Agrawal
  • Pradeep Varakantham

Domains such as disaster rescue, security patrolling etc. often feature dynamic environments where allocations of tasks to agents become ineffective due to unforeseen conditions that may require agents to leave the team. Agents leave the team either due to arrival of high priority tasks (e.g., emergency, accident or violation) or due to some damage to the agent. Existing research in task allocation has only considered a fixed number of agents and, in some instances, arrival of new agents on the team. However, there is little or no literature that considers situations where agents leave the team after task allocation. To that end, we first provide a general model to represent non-dedicated teams. Second, we provide a proactive approach based on sample average approximation to generate a strategy that works well across different feasible scenarios of agents leaving the team. We also provide a two-stage approach that computes a policy that adapts the allocation based on the observed state of the team. Third, we provide a reactive approach that rearranges the allocated tasks to better adapt to leaving agents. Finally, we provide a detailed evaluation of our approaches on existing benchmark problems.

JAIR Journal 2017 Journal Article

Sampling Based Approaches for Minimizing Regret in Uncertain Markov Decision Processes (MDPs)

  • Asrar Ahmed
  • Pradeep Varakantham
  • Meghna Lowalekar
  • Yossiri Adulyasak
  • Patrick Jaillet

Markov Decision Processes (MDPs) are an effective model to represent decision processes in the presence of transitional uncertainty and reward tradeoffs. However, due to the difficulty in exactly specifying the transition and reward functions in MDPs, researchers have proposed uncertain MDP models and robustness objectives in solving those models. Most approaches for computing robust policies have focused on the computation of maximin policies which maximize the value in the worst case amongst all realisations of uncertainty. Given the overly conservative nature of maximin policies, recent work has proposed minimax regret as an ideal alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only and are also limited in their scalability. Therefore, we provide a general model of uncertain MDPs that considers uncertainty over both transition and reward functions. Furthermore, we also consider dependence of the uncertainty across different states and decision epochs. We also provide a mixed integer linear program formulation for minimizing regret given a set of samples of the transition and reward functions in the uncertain MDP. In addition, we provide two myopic variants of regret, namely Cumulative Expected Myopic Regret (CEMR) and One Step Regret (OSR), that can be optimized in a scalable manner. Specifically, we provide dynamic programming and policy iteration based algorithms to optimize CEMR and OSR respectively. Finally, to demonstrate the effectiveness of our approaches, we provide comparisons on two benchmark problems from literature. We observe that optimizing the myopic variants of regret, OSR and CEMR, is better than directly optimizing regret.
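
The sample-based regret objective can be sketched with toy numbers (this is an illustration of the regret criterion only, not the paper's MILP): for each candidate policy, its regret under a sampled model is the gap to the best policy for that sample, and the minimax-regret policy minimizes the worst such gap.

```python
def max_regret(policy, samples):
    # samples: list of dicts mapping each candidate policy to its value
    # under one sampled (transition, reward) instantiation of the MDP.
    return max(max(v.values()) - v[policy] for v in samples)

def minimax_regret_policy(samples):
    # Pick the policy whose worst-case regret across samples is smallest.
    return min(samples[0], key=lambda pi: max_regret(pi, samples))

# Three hypothetical policies evaluated under two sampled models.
samples = [
    {"a": 9, "b": 5, "c": 6},   # sample 1: "a" is best here
    {"a": 1, "b": 5, "c": 4},   # sample 2: "b" is best here
]
```

Here "a" and "b" each incur regret 4 in the sample where the other dominates, while the hedging policy "c" caps its regret at 3 and is the minimax-regret choice.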

AAAI Conference 2016 Conference Paper

A Proactive Sampling Approach to Project Scheduling under Uncertainty

  • Pradeep Varakantham
  • Na Fu
  • Hoong Chuin Lau

Uncertainty in activity durations is a key characteristic of many real world scheduling problems in manufacturing, logistics and project management. RCPSP/max with durational uncertainty is a general model that can be used to represent durational uncertainty in a wide variety of scheduling problems where there exist resource constraints. However, computing schedules or execution strategies for RCPSP/max with durational uncertainty is NP-hard and hence we focus on providing approximation methods in this paper. We provide a principled approximation approach based on Sample Average Approximation (SAA) to compute proactive schedules for RCPSP/max with durational uncertainty. We further contribute an extension to SAA for improving scalability significantly without sacrificing solution quality. Not only is our approach able to compute schedules at runtimes comparable to existing approaches, it also provides lower α-quantile makespan (also referred to as α-robust makespan) values than the best known approach on benchmark problems from the literature.

ECAI Conference 2016 Conference Paper

An Intelligent System for Personalized Conference Event Recommendation and Scheduling

  • Aldy Gunawan
  • Hoong Chuin Lau
  • Pradeep Varakantham
  • Wenjie Wang

Many conference mobile apps today lack the ability to automatically generate optimal schedules based on delegates' preferences. This entails two major challenges: (a) identifying preferences of users; and (b) given the preferences, generating a schedule that optimizes those preferences. In this paper, we specifically focus on academic conferences, where users are prompted to input their preferred keywords. Our key contribution is an integrated conference scheduling agent that automatically recognizes user preferences based on keywords, provides a list of recommended talks and optimizes the user's schedule based on these preferences. To demonstrate the utility of our integrated conference scheduling agent, we first demonstrated the app at the International Conference on Autonomous Agents and Multi-Agent Systems (AAMAS 2015) and conducted a survey to collect data, which we use to verify the results presented in this paper. The agent provides well-calibrated results with respect to precision, accuracy and recall. We also tested the app at the 2015 WI-IAT International Conference (Singapore). The Android and web-based apps have been demonstrated and deployed at AAMAS 2016 (Singapore) with positive responses from the users.

ECAI Conference 2016 Conference Paper

Detecting Communities Using Coordination Games: A Short Paper

  • Radhika Arava
  • Pradeep Varakantham

Communities typically capture homophily as people of the same community share many common features. This paper is motivated by the problem of community detection in social networks, as it can help improve our understanding of the network topology. Given the selfish nature of humans to align with like-minded people, we employ game theoretic models and algorithms to detect communities. Specifically, we employ coordination games to represent interactions between individuals in a social network. We provide a novel and scalable two-phase algorithm, NashOverlap, to compute an accurate overlapping community structure in the given network. We evaluate our algorithm against the best existing methods for community detection and show that it improves significantly on benchmark networks with respect to the standard normalised mutual information measure.

AAAI Conference 2016 Conference Paper

NLU Framework for Voice Enabling Non-Native Applications on Smart Devices

  • Soujanya Lanka
  • Deepika Pathania
  • Pooja Kushalappa
  • Pradeep Varakantham

Voice is a critical user interface on smart devices (wearables, phones, speakers, televisions) to access applications (or services) available on them. Unfortunately, only a few native applications (provided by the OS developer) are typically voice enabled in today's devices. Since the utility of a smart device is determined more by the strength of external applications developed for the device, voice enabling non-native applications in a scalable, seamless manner within the device is a critical use case and is the focus of our work. We have developed a Natural Language Understanding (NLU) framework that uses templates supported by the application (as determined by the application developer). This framework can be employed in any mobile OS (Android, iOS, Tizen, Android Wear) for a wide range of devices. To aid this demonstration, we have implemented the framework as a service in Android OS. When the user issues a voice command, the natural language query is obtained by this service (using a local, cloud-based or hybrid speech recognizer). The service then executes our NLU framework to identify the relevant application and particular action details. In this demonstration, we will showcase this NLU framework implemented as an Android service on a set of applications that will be installed on the fly. Specifically, we will show how the voice queries are understood and necessary services are launched on Android smart wearables and phones.

AAAI Conference 2016 Conference Paper

Online Spatio-Temporal Matching in Stochastic and Dynamic Domains

  • Meghna Lowalekar
  • Pradeep Varakantham
  • Patrick Jaillet

Spatio-temporal matching of services to customers online is a problem that arises on a large scale in many domains associated with shared transportation (ex: taxis, ride sharing, super shuttles, etc.) and delivery services (ex: food, equipment, clothing, home fuel, etc.). A key characteristic of these problems is that matching of services to customers in one round has a direct impact on the matching of services to customers in the next round. For instance, in the case of taxis, in the second round taxis can only pick up customers closer to the drop off point of the customer from the first round of matching. Traditionally, greedy myopic approaches have been adopted to address such large scale online matching problems. While they provide solutions in a scalable manner, due to their myopic nature the quality of matching obtained can be improved significantly (demonstrated in our experimental results). In this paper, we present a two stage stochastic optimization formulation to consider expected future demand. We then provide multiple enhancements to solve large scale problems more effectively and efficiently. Finally, we demonstrate the significant improvement provided by our techniques over myopic approaches on two real world taxi data sets.
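
The abstract's key observation, that this round's match determines where vehicles end up for the next round, can be shown with a one-dimensional toy (positions on a line; all numbers and names are illustrative, not from the paper's formulation): a myopic matcher minimizes only the current pickup cost, while a scenario-aware matcher also averages the next-round pickup cost over sampled future requests.

```python
def choose_request(taxi, requests, future_samples):
    # Score a request by current pickup cost plus the expected next-round
    # pickup cost from its drop-off, averaged over sampled future pickups.
    def score(req):
        pickup, drop = req
        now = abs(taxi - pickup)
        later = (sum(abs(drop - f) for f in future_samples) / len(future_samples)
                 if future_samples else 0.0)
        return now + later
    return min(range(len(requests)), key=lambda i: score(requests[i]))

requests = [(1, 50),   # nearby pickup, but drops the taxi far from future demand
            (4, 10)]   # slightly farther pickup, drop stays near future demand
myopic = choose_request(0, requests, [])            # ignores the future
lookahead = choose_request(0, requests, [8, 12])    # samples of future pickups
```

The myopic matcher grabs the closer pickup (index 0) and strands the taxi at position 50; the two-stage score picks index 1, paying 3 extra units now to save about 38 expected units next round.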

AAMAS Conference 2016 Conference Paper

PRESS: Personalized Event Scheduling Recommender System (Demonstration)

  • Hoong Chuin Lau
  • Aldy Gunawan
  • Pradeep Varakantham
  • Wenjie Wang

This paper presents a personalized event scheduling recommender system, PRESS, for a large conference setting with multiple parallel tracks. PRESS is a mobile application that gathers personalized information from a user and recommends talks/demos to attend. The input from a user includes a list of keyword preferences and (optionally) preferred talks. We use the MALLET topic model package to analyze the set of conference papers and classify them based on automatically identified topics. We propose an algorithm to generate a list of recommended papers based on the user keywords and the MALLET topics. An optimization model is then applied to obtain a feasible schedule. The recommended set is matched against the papers selected by the user, which we obtained from a survey conducted at AAMAS-15 in Istanbul, Turkey. We show that PRESS is able to provide reasonable accuracy, precision and recall rates. PRESS will be deployed live during AAMAS-16 in Singapore.

AAAI Conference 2016 Conference Paper

Robust Decision Making for Stochastic Network Design

  • Akshat Kumar
  • Arambam Singh
  • Pradeep Varakantham
  • Daniel Sheldon

We address the problem of robust decision making for stochastic network design. Our work is motivated by spatial conservation planning, where the goal is to take management decisions within a fixed budget to maximize the expected spread of a population of species over a network of land parcels. Most previous work for this problem assumes that accurate estimates of different network parameters (edge activation probabilities, habitat suitability scores) are available, which is an unrealistic assumption. To address this shortcoming, we assume that network parameters are only partially known, specified via interval bounds. We then develop a decision making approach that computes the solution with minimax regret. We provide new theoretical results regarding the structure of the minimax regret solution which help develop a computationally efficient approach. Empirically, we show that previous approaches that work on point estimates of network parameters result in high regret on several standard benchmarks, while our approach provides significantly more robust solutions.

AAMAS Conference 2016 Conference Paper

Robust Influence Maximization (Extended Abstract)

  • Meghna Lowalekar
  • Pradeep Varakantham
  • Akshat Kumar

Influence Maximization [2] is the problem of finding a fixed size set of nodes, which will maximize the expected number of influenced nodes in a social network. The number of influenced nodes is dependent on the influence strength of edges, which can be very noisy. The noise in the influence strengths can be modeled using a random noise or adversarial noise model. It has been shown that all random processes that independently affect edges of the graph can be absorbed into the activation probabilities themselves, and hence random noise can be captured within the independent cascade model. On the other hand, similar to He et al. [1], we consider adversarial noise, where the influence strength for an edge (u, v) can lie anywhere in an interval [p̌(u,v), p̂(u,v)], with the exact values chosen by an adversary from this interval. The problems of evaluating robustness of a given solution and computing robust optimal solutions have received scant attention in the literature and are of key interest in this paper. Specifically, we aim to minimize (over all available seed sets) the maximum (over all instantiations of influence strengths) regret. Concretely, the key contributions are: (1) We show that maximum regret for a given solution is attained when the influence strength on each edge is set to one of the extreme values of its interval. (2) We provide a novel way of considering samples that accounts for the noise in influence strength on all edges. (3) We develop a framework which provides an approach to get a regret-optimal solution and, more importantly, a metric to evaluate robustness of a given solution based on the regret-optimal solution. (4) Finally, we show results on evaluating the robustness of the well known greedy approach. Surprisingly, even without considering noise in influence strengths explicitly, the greedy approach achieves highly robust solutions on small-medium scale social network instances.
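
Contribution (1), that maximum regret is attained at interval endpoints, can be checked numerically on a toy two-node network (this is a hand-built illustration, not the paper's model or proof): a dense grid search over the intervals never beats the maximum taken over the interval endpoints alone.

```python
from itertools import product

def spread(seed, p_ab, p_ba):
    # Expected influenced nodes in a two-node network A<->B under
    # independent cascade: the seed plus its neighbour with edge probability.
    return 1 + (p_ab if seed == "A" else p_ba)

def regret(seed, p_ab, p_ba):
    best = max(spread("A", p_ab, p_ba), spread("B", p_ab, p_ba))
    return best - spread(seed, p_ab, p_ba)

def max_regret_extremes(seed, ab_iv, ba_iv):
    # Evaluate regret only at the endpoints of each edge's interval.
    return max(regret(seed, a, b) for a, b in product(ab_iv, ba_iv))

ab_iv, ba_iv = (0.2, 0.6), (0.3, 0.9)   # hypothetical strength intervals
grid = [i / 100 for i in range(101)]
grid_max = max(regret("A", a, b)
               for a in grid if ab_iv[0] <= a <= ab_iv[1]
               for b in grid if ba_iv[0] <= b <= ba_iv[1])
```

Both searches agree that seeding A incurs maximum regret 0.7, attained at the extreme point (p̌(A,B), p̂(B,A)) = (0.2, 0.9).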

ICAPS Conference 2016 Conference Paper

Robust Partial Order Schedules for RCPSP/max with Durational Uncertainty

  • Na Fu
  • Pradeep Varakantham
  • Hoong Chuin Lau

In this work, we consider RCPSP/max with durational uncertainty. We focus on computing robust Partial Order Schedules (or, in short, POS) which can be executed with risk-controlled feasibility and optimality, i.e., there is a stochastic a posteriori quality guarantee that the derived POS can be executed with all constraints honored and completion before the robust makespan. To address this problem, we propose BACCHUS: a solution method based on Benders Accelerated Cut Creation for Handling Uncertainty in Scheduling. In our proposed approach, we first give an MILP formulation for the deterministic RCPSP/max and partition the model into a POS generation process and start time schedule determination. Then we develop a Benders algorithm and propose a cut-generation scheme designed for effective convergence to optimality for RCPSP/max. To account for durational uncertainty, we extend the deterministic model by additionally considering duration scenarios. In the extended MILP, the risks of constraint violation and failure to meet the robust makespan are counted during POS exploration. We then approximate the uncertainty problem by computing a risk-related percentile of activity durations from the uncertainty distributions. Finally, we apply a Pareto cut-generation scheme and propose heuristics for infeasibility cuts to accelerate the algorithm. Experimental results demonstrate that BACCHUS efficiently and effectively generates robust solutions for scheduling under uncertainty.

IJCAI Conference 2016 Conference Paper

Robust Repositioning to Counter Unpredictable Demand in Bike Sharing Systems

  • Supriyo Ghosh
  • Michael Trick
  • Pradeep Varakantham

Bike Sharing Systems (BSSs) experience a significant loss in customer demand due to starvation (empty base stations precluding bike pickup) or congestion (full base stations precluding bike return). Therefore, BSS operators reposition bikes between stations with the help of carrier vehicles. Due to the unpredictable and dynamically changing nature of demand, myopic reasoning typically provides below-par performance. We propose an online and robust repositioning approach to minimise the loss in customer demand while considering the possible uncertainty in future demand. Specifically, we develop a scenario generation approach based on an iterative two-player game to compute a repositioning strategy, by assuming that the environment can generate the worst demand scenario (out of the feasible demand scenarios) against the current repositioning solution. Extensive computational results from a simulation built on a real-world data set from a bike sharing company demonstrate that our approach can significantly reduce the expected lost demand over existing benchmark approaches.
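
The iterative two-player loop described above can be sketched on a tiny table of plans and demand scenarios (the plans, scenarios and lost-demand numbers below are invented for illustration; the paper solves optimization problems at each step rather than enumerating tables):

```python
def iterative_game(plans, scenarios, lost):
    # Adversary repeatedly adds the worst scenario for the current plan;
    # the planner best-responds against the accumulated scenario set.
    active = [scenarios[0]]
    plan = min(plans, key=lambda p: max(lost[p, s] for s in active))
    while True:
        worst = max(scenarios, key=lambda s: lost[plan, s])
        if worst in active:
            return plan          # fixed point: adversary cannot do better
        active.append(worst)
        plan = min(plans, key=lambda p: max(lost[p, s] for s in active))

lost = {("A", "s1"): 10, ("A", "s2"): 10,    # steady plan
        ("B", "s1"): 0,  ("B", "s2"): 18}    # great on s1, poor on s2
robust = iterative_game(["A", "B"], ["s1", "s2"], lost)
```

Plan B has the lower average lost demand, so a purely expected-value planner would choose it; the game converges to plan A instead, whose worst-case lost demand (10) beats B's (18).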

IJCAI Conference 2016 Conference Paper

Scalable Greedy Algorithms for Task/Resource Constrained Multi-Agent Stochastic Planning

  • Pritee Agrawal
  • Pradeep Varakantham
  • William Yeoh

Synergistic interactions between task/resource allocation and stochastic planning exist in many environments such as transportation and logistics, UAV task assignment and disaster rescue. Existing research in exploiting these synergistic interactions between the two problems has either only considered domains where tasks/resources are completely independent of each other or has focussed on approaches with limited scalability. In this paper, we address these two limitations by introducing a generic model for task/resource constrained multi-agent stochastic planning, referred to as TasC-MDPs. We provide two scalable greedy algorithms, one of which provides a posteriori quality guarantees. Finally, we illustrate the high scalability and solution performance of our approaches in comparison with existing work on two benchmark problems from the literature.

IJCAI Conference 2016 Conference Paper

Sequential Decision Making for Improving Efficiency in Urban Environments

  • Pradeep Varakantham

Rapid urbanization (more than 50% of the world's population now resides in cities) coupled with the natural lack of coordination in usage of common resources (ex: bikes, ambulances, taxis, traffic personnel, attractions) has a detrimental effect on a wide variety of response metrics (ex: waiting times, response time for emergency needs) and coverage metrics (ex: predictability of traffic/security patrols) in cities of today. Motivated by the need to improve response and coverage metrics in urban environments, my research group is focussed on building intelligent agent systems that make sequential decisions to continuously match the available supply of resources to an uncertain demand for resources. Our broad approach to generating these sequential decision strategies is a combination of data analytics (to obtain a model) and multi-stage optimization (planning/scheduling) under uncertainty (to solve the model). While we perform data analytics, our contributions are focussed on multi-stage optimization under uncertainty. We exploit key properties of urban environments, namely homogeneity and anonymity, limited influence of individual entities, and abstraction and near decomposability, to solve "multi-stage optimization under uncertainty" effectively and efficiently.

AAAI Conference 2016 Conference Paper

Solving Risk-Sensitive POMDPs With and Without Cost Observations

  • Ping Hou
  • William Yeoh
  • Pradeep Varakantham

Partially Observable Markov Decision Processes (POMDPs) are often used to model planning problems under uncertainty. The goal in Risk-Sensitive POMDPs (RS-POMDPs) is to find a policy that maximizes the probability that the cumulative cost is within some user-defined cost threshold. In this paper, unlike existing POMDP literature, we distinguish between the two cases of whether costs can or cannot be observed and show the empirical impact of cost observations. We also introduce a new search-based algorithm to solve RS-POMDPs and show that it is faster and more scalable than existing approaches in two synthetic domains and a taxi domain generated with real-world data.

ICAPS Conference 2016 Conference Paper

Strategic Planning for Setting Up Base Stations in Emergency Medical Systems

  • Supriyo Ghosh
  • Pradeep Varakantham

Emergency Medical Systems (EMSs) are an important component of public health-care services. Improving infrastructure for EMS, and specifically the construction of base stations at the "right" locations to reduce response times, is the main focus of this paper. This is a computationally challenging task because of the: (a) exponentially large action space arising from having to consider combinations of potential base locations, which themselves can be numerous; and (b) direct impact on the performance of the ambulance allocation problem, where we decide the allocation of ambulances to bases. We present an incremental greedy approach to discover the placement of bases that maximises the service level of EMS. Using the properties of submodular optimisation, we show that our greedy algorithm provides quality-guaranteed solutions for one of the objectives employed in real EMSs. Furthermore, we validate our derived policy by employing a real-life event driven simulator that incorporates the real dynamics of EMS. Finally, we show the utility of our approaches on a real-world dataset from a large Asian city and demonstrate significant improvement over the best known approaches from the literature.
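
The incremental greedy idea can be sketched with a coverage-style objective (the candidate bases and demand zones below are hypothetical; the paper's actual service-level objective and guarantee differ in detail): repeatedly add the base with the largest marginal gain, which for monotone submodular objectives carries the classic (1 - 1/e) approximation guarantee.

```python
def greedy_base_placement(candidates, coverage, k):
    # Incremental greedy: at each step add the candidate base whose
    # coverage set adds the most not-yet-covered demand zones.
    chosen, covered = [], set()
    for _ in range(k):
        best = max(candidates, key=lambda c: len(coverage[c] - covered))
        if not coverage[best] - covered:
            break   # no remaining base adds new coverage
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

# Hypothetical candidate bases and the demand zones each can reach in time.
coverage = {"c1": {1, 2, 3}, "c2": {3, 4}, "c3": {4, 5, 6, 7}}
chosen, covered = greedy_base_placement(["c1", "c2", "c3"], coverage, k=2)
```

With a budget of two bases, greedy picks "c3" (4 new zones) and then "c1" (3 new zones), covering all 7 zones; note "c2" is never needed because its zones are absorbed by the first two picks.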

ICAPS Conference 2015 Conference Paper

Dynamic Redeployment to Counter Congestion or Starvation in Vehicle Sharing Systems

  • Supriyo Ghosh
  • Pradeep Varakantham
  • Yossiri Adulyasak
  • Patrick Jaillet

Extensive usage of private vehicles has led to increased traffic congestion, carbon emissions, and usage of non-renewable resources. These concerns have led to the wide adoption of vehicle sharing (ex: bike sharing, car sharing) systems in many cities of the world. In vehicle-sharing systems, base stations (ex: docking stations for bikes) are strategically placed throughout a city and each of the base stations contains a pre-determined number of vehicles at the beginning of each day. Due to the stochastic and individualistic movement of customers, there is typically either congestion (more than required) or starvation (fewer than required) of vehicles at certain base stations. As demonstrated in our experimental results, this happens often and can cause a significant loss in demand. We propose to dynamically redeploy idle vehicles using carriers so as to minimize lost demand or alternatively maximize revenue for the vehicle sharing company. To that end, we contribute an optimization formulation to jointly address the redeployment (of vehicles) and routing (of carriers) problems and provide two approaches that rely on decomposability and abstraction of problem domains to reduce the computation time significantly. Finally, we demonstrate the utility of our approaches on two real world data sets of bike-sharing companies.

SoCS Conference 2015 Conference Paper

Dynamic Redeployment to Counter Congestion or Starvation in Vehicle Sharing Systems

  • Supriyo Ghosh
  • Pradeep Varakantham
  • Yossiri Adulyasak
  • Patrick Jaillet

Vehicle sharing (ex: bike sharing, car sharing) systems, an attractive alternative to private transportation, are widely adopted in major cities around the world. In vehicle-sharing systems, base stations (ex: docking stations for bikes) are strategically placed throughout a city and each of the base stations contains a pre-determined number of vehicles at the beginning of each day. Due to the stochastic and individualistic movement of customers, there is typically either congestion (more than required) or starvation (fewer than required) of vehicles at certain base stations, which causes a significant loss in demand. We propose to dynamically redeploy idle vehicles using carriers so as to minimize lost demand or alternatively maximize revenue for the vehicle sharing company. To that end, we contribute an optimization formulation to jointly address the redeployment (of vehicles) and routing (of carriers) problems and provide two approaches that rely on decomposability and abstraction of problem domains to reduce the computation time significantly.

IJCAI Conference 2015 Conference Paper

Probabilistic Inference Based Message-Passing for Resource Constrained DCOPs

  • Supriyo Ghosh
  • Akshat Kumar
  • Pradeep Varakantham

Distributed constraint optimization (DCOP) is an important framework for coordinated multiagent decision making. We address a practically useful variant of DCOP, called resource-constrained DCOP (RC-DCOP), which takes into account agents’ consumption of shared limited resources. We present a promising new class of algorithms for RC-DCOPs by translating the underlying coordination problem to probabilistic inference. Using inference techniques such as expectation-maximization and convex optimization machinery, we develop a novel convergent message-passing algorithm for RC-DCOPs. Experiments on standard benchmarks show that our approach provides better quality than previous best DCOP algorithms and has a much lower failure rate. Comparisons against an efficient centralized solver show that our approach provides near-optimal solutions, and is significantly faster on larger instances.

AAAI Conference 2015 Conference Paper

Risk Based Optimization for Improving Emergency Medical Systems

  • Sandhya Saisubramanian
  • Pradeep Varakantham
  • Hoong Chuin Lau

In emergency medical systems, arriving at the incident location a few seconds early can save a human life. Thus, this paper is motivated by the need to reduce the response time – the time taken to arrive at the incident location after receiving the emergency call – of Emergency Response Vehicles, ERVs (ex: ambulances, fire rescue vehicles) for as many requests as possible. We expect to achieve this primarily by positioning the "right" number of ERVs at the "right" places and at the "right" times. Given the exponentially large action space (with respect to the number of ERVs and their placement) and the stochasticity in location and timing of emergency incidents, this problem is computationally challenging. To that end, our contributions building on existing data-driven approaches are three-fold: 1. Based on real world evaluation metrics, we provide a risk based optimization criterion to learn from past incident data. Instead of minimizing expected response time, we minimize the largest value of response time such that the risk of finding requests that have a higher value is bounded (ex: only 10% of requests should have a response time greater than 8 minutes). 2. We develop a mixed integer linear optimization formulation to learn and compute an allocation from a set of input requests while considering the risk criterion. 3. To allow for "live" reallocation of ambulances, we provide a decomposition method based on Lagrangian Relaxation to significantly reduce the run-time of the optimization formulation. Finally, we provide an exhaustive evaluation on real-world datasets from two Asian cities that demonstrates the improvement provided by our approach over current practice and the best known approach from the literature.
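
The risk criterion in contribution 1 is essentially an empirical quantile of the response-time distribution rather than its mean. A minimal sketch with invented response times (minutes):

```python
import math

def alpha_quantile(times, alpha=0.9):
    # Smallest threshold t such that at most (1 - alpha) of the response
    # times exceed t, i.e. the empirical alpha-quantile.
    s = sorted(times)
    return s[math.ceil(alpha * len(s)) - 1]

times = [3, 4, 5, 6, 7, 8, 9, 10, 11, 30]   # one extreme outlier
threshold = alpha_quantile(times, alpha=0.9)
```

On these numbers, minimizing the mean (9.3) would reward trimming the single outlier, whereas the risk criterion reports the 90th-percentile value 11: at most 10% of requests exceed it, matching the "only 10% of requests greater than the bound" style of constraint in the abstract.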

AAAI Conference 2015 Conference Paper

Solving Uncertain MDPs with Objectives that Are Separable over Instantiations of Model Uncertainty

  • Yossiri Adulyasak
  • Pradeep Varakantham
  • Asrar Ahmed
  • Patrick Jaillet

Markov Decision Processes (MDPs) offer an effective mechanism for planning under uncertainty. However, due to unavoidable uncertainty over models, it is difficult to obtain an exact specification of an MDP. We are interested in solving MDPs where transition and reward functions are not exactly specified. Existing research has primarily focussed on computing infinite horizon stationary policies when optimizing robustness, regret and percentile based objectives. We focus specifically on finite horizon problems, with a special emphasis on objectives that are separable over individual instantiations of model uncertainty (i.e., objectives that can be expressed as a sum over instantiations of model uncertainty): (a) First, we identify two separable objectives for uncertain MDPs: Average Value Maximization (AVM) and Confidence Probability Maximisation (CPM). (b) Second, we provide optimization based solutions to compute policies for uncertain MDPs with such objectives. In particular, we exploit the separability of the AVM and CPM objectives by employing Lagrangian dual decomposition (LDD). (c) Finally, we demonstrate the utility of the LDD approach on a benchmark problem from the literature.

AAAI Conference 2014 Conference Paper

Decentralized Stochastic Planning with Anonymity in Interactions

  • Pradeep Varakantham
  • Yossiri Adulyasak
  • Patrick Jaillet

In this paper, we solve cooperative decentralized stochastic planning problems, where the interactions between agents (specified using transition and reward functions) are dependent on the number of agents (and not on the identity of the individual agents) involved in the interaction. A collision of robots in a narrow corridor, defender teams coordinating patrol activities to secure a target, etc. are examples of such anonymous interactions. Formally, we consider problems that are a subset of the well known Decentralized MDP (DEC-MDP) model, where the anonymity in interactions is specified within the joint reward and transition functions. In this paper, not only do we introduce a general model called D-SPAIT to capture anonymity in interactions, but also provide optimization based optimal and local-optimal solutions for generalizable sub-categories of D-SPAIT.

ICAPS Conference 2014 Conference Paper

Revisiting Risk-Sensitive MDPs: New Algorithms and Results

  • Ping Hou
  • William Yeoh 0001
  • Pradeep Varakantham

While Markov Decision Processes (MDPs) have been shown to be effective models for planning under uncertainty, the objective to minimize the expected cumulative cost is inappropriate for high-stakes planning problems. As such, Yu, Lin, and Yan (1998) introduced the Risk-Sensitive MDP (RS-MDP) model, where the objective is to find a policy that maximizes the probability that the cumulative cost is within some user-defined cost threshold. In this paper, we revisit this problem and introduce new algorithms that are based on classical techniques, such as depth-first search and dynamic programming, and a recently introduced technique called Topological Value Iteration (TVI). We demonstrate the applicability of our approach on randomly generated MDPs as well as domains from the ICAPS 2011 International Probabilistic Planning Competition (IPPC).
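The RS-MDP objective (maximize the probability that cumulative cost stays within a threshold) admits dynamic programming over an augmented (state, remaining budget) space. A minimal sketch on a hypothetical two-step chain domain; the states, actions, and costs are invented for illustration and are not the paper's benchmarks or its exact algorithms:

```python
from functools import lru_cache

# Hypothetical chain domain: states 0, 1, 2 with goal state 2.
# Each action maps to a list of (cost, probability, next_state) outcomes.
ACTIONS = {
    0: {"safe":  [(2, 1.0, 1)],
        "risky": [(1, 0.5, 1), (4, 0.5, 1)]},
    1: {"safe":  [(2, 1.0, 2)],
        "risky": [(1, 0.5, 2), (4, 0.5, 2)]},
}
GOAL = 2

@lru_cache(maxsize=None)
def best_prob(state, budget):
    """Max probability of reaching GOAL with cumulative cost <= budget,
    by dynamic programming over the augmented (state, budget) space,
    in the spirit of threshold-style RS-MDP formulations."""
    if state == GOAL:
        return 1.0
    if budget <= 0:
        return 0.0
    return max(
        # outcomes costing more than the remaining budget contribute 0
        sum(p * best_prob(nxt, budget - c)
            for c, p, nxt in outcomes if c <= budget)
        for outcomes in ACTIONS[state].values()
    )

print(best_prob(0, 3))  # -> 0.5
```

With budget 3 the agent must gamble on at least one risky action, so the best success probability is 0.5; with budget 4 two safe actions guarantee success.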

ECAI Conference 2014 Conference Paper

Unleashing Dec-MDPs in Security Games: Enabling Effective Defender Teamwork

  • Eric Anyung Shieh
  • Albert Xin Jiang
  • Amulya Yadav
  • Pradeep Varakantham
  • Milind Tambe

Multiagent teamwork and defender-attacker security games are two areas that are currently receiving significant attention within multiagent systems research. Unfortunately, despite the need for effective teamwork among multiple defenders, little has been done to harness the teamwork research in security games. This paper is the first to remedy this situation by integrating the powerful teamwork mechanisms offered by Dec-MDPs into security games. We offer the following novel contributions in this paper: (i) New models of security games where a defender team's pure strategy is defined as a Dec-MDP policy for addressing coordination under uncertainty; (ii) New algorithms based on column generation that enable efficient generation of mixed strategies given this new model; (iii) Handling global events during defender execution for effective teamwork; (iv) Exploration of the robustness of randomized pure strategies. The paper opens the door to a potentially new area combining computational game theory and multiagent teamwork.

NeurIPS Conference 2013 Conference Paper

Regret based Robust Solutions for Uncertain Markov Decision Processes

  • Asrar Ahmed
  • Pradeep Varakantham
  • Yossiri Adulyasak
  • Patrick Jaillet

In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focused on the computation of maximin policies which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed minimax regret as a suitable alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) Handle uncertainties over both transition and reward models; (b) Dependence of model uncertainties across state, action pairs and decision epochs; (c) Scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from the literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.

JAAMAS Journal 2013 Journal Article

TESLA: an extended study of an energy-saving agent that leverages schedule flexibility

  • Jun-young Kwak
  • Pradeep Varakantham
  • Wendy Wood

This paper presents transformative energy-saving schedule-leveraging agent (TESLA), an agent for optimizing energy usage in commercial buildings. TESLA’s key insight is that adding flexibility to event/meeting schedules can lead to significant energy savings. This paper provides four key contributions: (i) online scheduling algorithms, which are at the heart of TESLA, to solve a stochastic mixed integer linear program for energy-efficient scheduling of incrementally/dynamically arriving meetings and events; (ii) an algorithm to effectively identify key meetings that lead to significant energy savings by adjusting their flexibility; (iii) an extensive analysis on energy savings achieved by TESLA; and (iv) surveys of real users which indicate that TESLA’s assumptions of user flexibility hold in practice. TESLA was evaluated on data gathered from over 110,000 meetings held at nine campus buildings during an 8-month period in 2011–2012 at the University of Southern California and Singapore Management University. These results and analysis show that, compared to the current systems, TESLA can substantially reduce overall energy consumption.

AAMAS Conference 2012 Conference Paper

Active Malware Analysis using Stochastic Games

  • Simon Williamson
  • Pradeep Varakantham
  • Debin Gao
  • Ong Chen Hui

Cyber security is increasingly important for defending computer systems from loss of privacy or unauthorised use. One important aspect is threat analysis - how does an attacker infiltrate a system and what do they want once they are inside. This paper considers the problem of Active Malware Analysis, where we learn about the human or software intruder by actively interacting with it, with the goal of learning about its behaviours and intentions, whilst at the same time the intruder may be trying to avoid detection or to conceal those behaviours and intentions. This game-theoretic active learning is then used to obtain a behavioural clustering of malware, an important contribution both for understanding malware at a high level and, more crucially, for the deployment of effective anti-malware defences. This paper makes the following contributions: (i) A formal definition of the game-theoretic active malware analysis problem; (ii) A fast algorithm for learning about a malware sample in the active analysis problem which utilises the concept of reducing entropy in the beliefs about the malware; (iii) A virtual machine based agent architecture for the implementation of the active malware analysis problem and (iv) A behaviour based clustering of malware behaviour which is shown to be more accurate than a similar clustering using only passive information about the malware.

AAAI Conference 2012 Conference Paper

Decision Support for Agent Populations in Uncertain and Congested Environments

  • Pradeep Varakantham
  • Shih-Fen Cheng
  • Geoff Gordon
  • Asrar Ahmed

This research is motivated by large scale problems in urban transportation and labor mobility where there is congestion for resources and uncertainty in movement. In such domains, even though the individual agents do not have an identity of their own and do not explicitly interact with other agents, they affect other agents. While there has been much research in handling such implicit effects, it has primarily assumed deterministic movements of agents. We address the issue of decision support for individual agents that are identical and have involuntary movements in dynamic environments. For instance, in a taxi fleet serving a city, when a taxi is hired by a customer, its movements are uncontrolled and depend on (a) the customer's requirements; and (b) the location of other taxis in the fleet. Towards addressing decision support in such problems, we make two key contributions: (a) A framework to represent the decision problem for selfish individuals in a dynamic population, where there is transitional uncertainty (involuntary movements); and (b) Two techniques (Fictitious Play for Symmetric Agent Populations, FP-SAP and Softmax based Flow Update, SMFU) that converge to equilibrium solutions. We show that our techniques (apart from providing equilibrium strategies) outperform “driver” strategies with respect to overall availability of taxis and the revenue obtained by the taxi drivers. We demonstrate this on a real world data set with 8,000 taxis and 83 zones (representing the entire area of Singapore).

AAMAS Conference 2012 Conference Paper

Delayed Observation Planning in Partially Observable Domains

  • Pradeep Varakantham
  • Janusz Marecki

Traditional models for planning under uncertainty such as Markov Decision Processes (MDPs) or Partially Observable MDPs (POMDPs) assume that the observations about the results of agent actions are instantly available to the agent. In so doing, they are no longer applicable to domains where observations are received with delays caused by temporary unavailability of information (e.g., delayed response of the market to a new product). To that end, we make the following key contributions towards solving Delayed observation POMDPs (D-POMDPs): (i) We first provide a parameterized approximate algorithm for solving D-POMDPs efficiently, with desired accuracy; and (ii) We then propose a policy execution technique that adjusts the policy at runtime to account for the actual realization of observations. We show the performance of our techniques on POMDP benchmark problems with delayed observations, where explicit modeling of delayed observations leads to solutions of superior quality.

UAI Conference 2012 Conference Paper

Dynamic Stochastic Orienteering Problems for Risk-Aware Applications

  • Hoong Chuin Lau
  • William Yeoh 0001
  • Pradeep Varakantham
  • Duc Thien Nguyen
  • HuaXing Chen

Orienteering problems (OPs) are a variant of the well-known prize-collecting traveling salesman problem, where the salesman needs to choose a subset of cities to visit within a given deadline. OPs and their extensions with stochastic travel times (SOPs) have been used to model vehicle routing problems and tourist trip design problems. However, they suffer from two limitations – travel times between cities are assumed to be time independent and the route provided is independent of the risk preference (with respect to violating the deadline) of the user. To address these issues, we make the following contributions: We introduce (1) a dynamic SOP (DSOP) model, which is an extension of SOPs with dynamic (time-dependent) travel times; (2) a risk-sensitive criterion to allow for different risk preferences; and (3) a local search algorithm to solve DSOPs with this risk-sensitive criterion. We evaluated our algorithms on a real-world dataset for a theme park navigation problem as well as synthetic datasets employed in the literature.

AAMAS Conference 2012 Conference Paper

Lagrangian Relaxation for Large-Scale Multi-Agent Planning

  • Geoff Gordon
  • Pradeep Varakantham
  • William Yeoh
  • Hoong Chuin Lau
  • Ajay Srinivasan Aravamudhan
  • Shih-Fen Cheng

Multi-agent planning is a well-studied problem with applications in various areas. Due to computational constraints, existing research typically focuses either on unstructured domains with many agents, where we are content with heuristic solutions, or domains with small numbers of agents or special structure, where we can find provably near-optimal solutions. In contrast, here we focus on provably near-optimal solutions in domains with many agents, by exploiting influence limits. To that end, we make two key contributions: (a) an algorithm, based on Lagrangian relaxation and randomized rounding, for solving multi-agent planning problems represented as large mixed-integer programs; (b) a proof of convergence of our algorithm to a near-optimal solution.
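Randomized rounding, the second ingredient named in contribution (a), converts a fractional solution of a relaxed integer program into a 0/1 solution that matches the relaxation in expectation. A generic sketch of the basic technique (this is the textbook rounding rule, not necessarily the paper's exact scheme):

```python
import random

def randomized_rounding(fractional, seed=0):
    """Round a fractional (LP-relaxation) solution to a 0/1 vector by
    setting each variable to 1 with probability equal to its fractional
    value, so the rounded solution equals the relaxation in expectation."""
    rng = random.Random(seed)
    return [1 if rng.random() < x else 0 for x in fractional]

# Hypothetical fractional assignment from a relaxed multi-agent plan.
frac = [0.9, 0.2, 0.5, 1.0, 0.0]
print(randomized_rounding(frac))
```

Variables that the relaxation already sets to 0 or 1 are preserved exactly; only genuinely fractional variables are randomized, which is what makes near-optimality guarantees of the Lagrangian-relaxation approach possible.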

AAMAS Conference 2012 Conference Paper

Prioritized Shaping of Models for Solving DEC-POMDPs

  • Pradeep Varakantham
  • William Yeoh
  • Prasanna Velagapudi
  • Katia Sycara
  • Paul Scerri

An interesting class of multi-agent POMDP planning problems can be solved by having agents iteratively solve individual POMDPs, find interactions with other individual plans, shape their transition and reward functions to encourage good interactions and discourage bad ones and then recompute a new plan. D-TREMOR showed that this approach can allow distributed planning for hundreds of agents. However, the quality and speed of the planning process depends on the prioritization scheme used. Lower priority agents shape their models with respect to the models of higher priority agents. In this paper, we introduce a new prioritization scheme that is guaranteed to converge and is empirically better, in terms of solution quality and planning time, than the existing prioritization scheme for some problems.

AAMAS Conference 2012 Conference Paper

SAVES: A Sustainable Multiagent Application to Conserve Building Energy Considering Occupants

  • Jun-young Kwak
  • Pradeep Varakantham
  • Rajiv Maheswaran
  • Milind Tambe
  • Farrokh Jazizadeh
  • Geoffrey Kavulya
  • Laura Klein
  • Burcin Becerik-Gerber

This paper describes an innovative multiagent system called SAVES with the goal of conserving energy in commercial buildings. We specifically focus on an application to be deployed in an existing university building that provides several key novelties: (i) jointly performed with the university facility management team, SAVES is based on actual occupant preferences and schedules, actual energy consumption and loss data, real sensors and hand-held devices, etc.; (ii) it addresses novel scenarios that require negotiations with groups of building occupants to conserve energy; (iii) it focuses on a non-residential building, where human occupants do not have a direct financial incentive in saving energy and thus requires a different mechanism to effectively motivate occupants; and (iv) SAVES uses a novel algorithm for generating optimal MDP policies that explicitly consider multiple criteria optimization (energy and personal comfort) as well as uncertainty over occupant preferences when negotiating energy reduction - this combination of challenges has not been considered in previous MDP algorithms. In a validated simulation testbed, we show that SAVES substantially reduces the overall energy consumption compared to the existing control method while achieving comparable average satisfaction levels for occupants. As a real-world test, we provide results of a trial study where SAVES is shown to lead occupants to conserve energy in real buildings.

UAI Conference 2012 Conference Paper

Uncertain Congestion Games with Assorted Human Agent Populations

  • Asrar Ahmed
  • Pradeep Varakantham
  • Shih-Fen Cheng

Congestion games model a wide variety of real-world resource congestion problems, such as selfish network routing, traffic route guidance in congested areas, taxi fleet optimization and crowd movement in busy areas. However, existing research in congestion games assumes: (a) deterministic movement of agents between resources; and (b) perfect rationality (i.e. maximizing their own expected value) of all agents. Such assumptions are not reasonable in dynamic domains where decision support has to be provided to humans. For instance, in optimizing the performance of a taxi fleet serving a city, movement of taxis can be involuntary or nondeterministic (decided by the specific customer who hires the taxi) and more importantly, taxi drivers may not follow advice provided by the decision support system (due to bounded rationality of humans). To that end, we contribute: (a) a general framework for representing congestion games under uncertainty for populations with assorted notions of rationality; (b) a scalable approach for solving the decision problem for perfectly rational agents which are in the mix with boundedly rational agents; and (c) a detailed evaluation on a synthetic and real-world data set to illustrate the usefulness of our new approach with respect to key social welfare metrics in the context of an assorted human-agent population. An interesting result from our experiments on a real-world taxi fleet optimization problem is that it is better (in terms of revenue and operational efficiency) for taxi drivers to follow perfectly rational strategies irrespective of the percentage of drivers not following the advice.

AAMAS Conference 2011 Conference Paper

Adaptive Decision Support for Structured Organizations: A Case for OrgPOMDPs

  • Pradeep Varakantham
  • Nathan Schurr
  • Alan Carlin
  • Christopher Amato

In today's world, organizations are faced with increasingly large and complex problems that require decision-making under uncertainty. Current methods for optimizing such decisions fall short of handling the problem scale and time constraints. We argue that this is due to existing methods not exploiting the inherent structure of the organizations which solve these problems. We propose a new model called the OrgPOMDP (Organizational POMDP), which is based on the partially observable Markov decision process (POMDP). This new model combines two powerful representations for modeling large scale problems: hierarchical modeling and factored representations. In this paper we make three key contributions: (a) Introduce the OrgPOMDP model; (b) Present an algorithm to solve OrgPOMDP problems efficiently; and (c) Apply OrgPOMDPs to scenarios in an existing large organization, the Air and Space Operation Center (AOC). We conduct experiments and show that our OrgPOMDP approach results in greater scalability and greatly reduced runtime. In fact, as the size of the problem increases, we soon reach a point at which the OrgPOMDP approach continues to provide solutions while traditional POMDP methods cannot. We also provide an empirical evaluation to highlight the benefits of an organization implementing an OrgPOMDP policy.

AAMAS Conference 2011 Conference Paper

Decentralized Decision Support for an Agent Population in Dynamic and Uncertain Domains

  • Pradeep Varakantham
  • Shih-Fen Cheng
  • Nguyen Thi Duong

This research is motivated by problems in urban transportation and labor mobility, where the agent flow is dynamic, non-deterministic and on a large scale. In such domains, even though the individual agents do not have an identity of their own and do not explicitly impact other agents, they have implicit interactions with other agents. While there has been much research in handling such implicit effects, it has primarily assumed controlled movements of agents in static environments. We address the issue of decision support for individual agents having involuntary movements in dynamic environments. For instance, in a taxi fleet serving a city: (i) Movements of a taxi are uncontrolled when it is hired by a customer. (ii) Depending on movements of other taxis in the fleet, the environment and hence the movement model for the current taxi changes. Towards addressing this problem, we make three key contributions: (a) A framework to represent the decision problem for individuals in a dynamic population, where there is uncertainty in movements; (b) A novel heuristic technique called Iterative Sampled OPtimization (ISOP) and greedy heuristics to solve large scale problems in domains of interest; and (c) Analyze the solutions provided by our techniques on problems inspired from a real world data set of a taxi fleet operator in Singapore. As shown in the experimental results, our techniques are able to provide strategies that outperform "driver" strategies with respect to: (i) overall availability of taxis; and (ii) the revenue obtained by the taxi drivers.

AAMAS Conference 2011 Conference Paper

Distributed Model Shaping for Scaling to Decentralized POMDPs with Hundreds of Agents

  • Prasanna Velagapudi
  • Pradeep Varakantham
  • Katia Sycara
  • Paul Scerri

The use of distributed POMDPs for cooperative teams has been severely limited by the incredibly large joint policy space that results from combining the policy spaces of the individual agents. However, much of the computational cost of exploring the entire joint policy space can be avoided by observing that in many domains important interactions between agents occur in a relatively small set of scenarios, previously defined as coordination locales (CLs). Moreover, even when numerous interactions might occur, given a set of individual policies there are relatively few actual interactions. Exploiting this observation and building on an existing model shaping algorithm, this paper presents D-TREMOR, an algorithm in which cooperative agents iteratively generate individual policies, identify and communicate possible interactions between their policies, shape their models based on this information and generate new policies. D-TREMOR has three properties that jointly distinguish it from previous DEC-POMDP work: (1) it is completely distributed; (2) it is scalable (allowing 100 agents to compute a "good" joint policy in under 6 hours) and (3) it has low communication overhead. D-TREMOR complements these traits with the following key contributions, which ensure improved scalability and solution quality: (a) techniques to ensure convergence; (b) faster approaches to detect and evaluate CLs; (c) heuristics to capture dependencies between CLs; and (d) novel shaping heuristics to aggregate effects of CLs. While the resulting policies are not globally optimal, empirical results show that agents have policies that effectively manage uncertainty and the joint policy is better than policies generated by independent solvers.

AAMAS Conference 2011 Conference Paper

Incremental DCOP Search Algorithms for Solving Dynamic DCOPs

  • William Yeoh
  • Pradeep Varakantham
  • Xiaoxun Sun
  • Sven Koenig

Distributed constraint optimization problems (DCOPs) are well-suited for modeling multi-agent coordination problems. However, most research has focused on developing algorithms for solving static DCOPs. In this paper, we model dynamic DCOPs as sequences of (static) DCOPs with changes from one DCOP to the next one in the sequence. We introduce the ReuseBounds procedure, which can be used by any-space ADOPT and any-space BnB-ADOPT to find cost-minimal solutions for all DCOPs in the sequence faster than by solving each DCOP individually. This procedure allows those agents that are guaranteed to remain unaffected by a change to reuse their lower and upper bounds from the previous DCOP when solving the next one in the sequence. Our experimental results show that the speedup gained from this procedure increases with the amount of memory the agents have available.

AAMAS Conference 2010 Conference Paper

Analyzing the impact of human bias on human-agent teams in resource allocation domains

  • Praveen Paruchuri
  • Pradeep Varakantham
  • Katia Sycara
  • Paul Scerri

As agent-human teams get increasingly deployed in the real world, agent designers need to take into account that humans and agents have different abilities to specify preferences. In this paper, we focus on how human biases in specifying preferences for resources impact the performance of large, heterogeneous teams. In particular, we model the inclination of humans to simplify their preference functions and to exaggerate their utility for desired resources. We then study the effect of these biases on two different problems, which are representative of most resource allocation problems addressed in the literature.

AAMAS Conference 2010 Conference Paper

Risk-Sensitive Planning in Partially Observable Environments

  • Janusz Marecki
  • Pradeep Varakantham

Partially Observable Markov Decision Process (POMDP) is a popular framework for planning under uncertainty in partially observable domains. Yet, the POMDP model is risk-neutral in that it assumes that the agent is maximizing the expected reward of its actions. In contrast, in domains like financial planning, it is often required that the agent's decisions are risk-sensitive (maximize the utility of agent actions, for non-linear utility functions). Unfortunately, existing POMDP solvers cannot solve such planning problems exactly. By considering piecewise linear approximations of utility functions, this paper addresses this shortcoming in three contributions: (i) It defines the Risk-Sensitive POMDP model; (ii) It derives the fundamental properties of the underlying value functions and provides a functional value iteration technique to compute them exactly; and (iii) It proposes an efficient procedure to determine the dominated value functions, to speed up the algorithm. Our experiments show that the proposed approach is feasible and applicable to realistic financial planning domains.

ICAPS Conference 2010 Conference Paper

Towards Finding Robust Execution Strategies for RCPSP/max with Durational Uncertainty

  • Na Fu
  • Pradeep Varakantham
  • Hoong Chuin Lau

Resource Constrained Project Scheduling Problems with minimum and maximum time lags (RCPSP/max) have been studied extensively in the literature. However, the more realistic RCPSP/max problems - ones where durations of activities are not known with certainty - have received scant interest and hence are the main focus of the paper. Towards addressing the significant computational complexity involved in tackling RCPSP/max with durational uncertainty, we employ a local search mechanism to generate robust schedules. In this regard, we make two key contributions: (a) Introducing and studying the key properties of a new decision rule to specify start times of activities with respect to dynamic realizations of the duration uncertainty; and (b) Deriving the fitness function that is used to guide the local search towards robust schedules. Experimental results show that the performance of local search is improved with the new fitness evaluation over the best known existing approach.

AAMAS Conference 2009 Conference Paper

Caching Schemes for DCOP Search Algorithms

  • William Yeoh
  • Pradeep Varakantham
  • Sven Koenig

Distributed Constraint Optimization (DCOP) is useful for solving agent-coordination problems. Any-space DCOP search algorithms require only a small amount of memory but can be sped up by caching information. However, their current caching schemes do not exploit the cached information when deciding which information to preempt from the cache when a new piece of information needs to be cached. Our contributions are three-fold: (1) We frame the problem as an optimization problem. (2) We introduce three new caching schemes (MaxPriority, MaxEffort and MaxUtility) that exploit the cached information in a DCOP-specific way. (3) We evaluate how the resulting speedup depends on the search strategy of the DCOP search algorithm. Our experimental results show that, on all tested DCOP problem classes, our MaxEffort and MaxUtility schemes speed up ADOPT (which uses best-first search) more than the other tested caching schemes, while our MaxPriority scheme speeds up BnB-ADOPT (which uses depth-first branch-and-bound search) at least as much as the other tested caching schemes.
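A score-guided preemption rule of the general flavour described here can be sketched as a bounded cache that evicts the lowest-scoring entry when full. This is a generic illustration under our own assumptions, not the paper's DCOP-specific MaxPriority/MaxEffort/MaxUtility computations:

```python
def evict_min_utility(cache, capacity, key, value, utility):
    """Minimal sketch of utility-guided cache preemption: when the
    cache is full, evict the entry whose stored utility score is
    lowest, then insert the new (value, utility) entry."""
    if key not in cache and len(cache) >= capacity:
        victim = min(cache, key=lambda k: cache[k][1])
        del cache[victim]
    cache[key] = (value, utility)

cache = {}
for k, v, u in [("a", 1, 5.0), ("b", 2, 1.0), ("c", 3, 3.0)]:
    evict_min_utility(cache, capacity=2, key=k, value=v, utility=u)
print(sorted(cache))  # -> ['a', 'c']  ("b" had the lowest utility)
```

The interesting part of the paper's schemes is how that score is derived from the DCOP search itself (priorities, effort estimates, or expected utility of reuse) rather than from a generic recency rule.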

ICAPS Conference 2009 Conference Paper

Exploiting Coordination Locales in Distributed POMDPs via Social Model Shaping

  • Pradeep Varakantham
  • Jun-young Kwak
  • Matthew E. Taylor
  • Janusz Marecki
  • Paul Scerri
  • Milind Tambe

Distributed POMDPs provide an expressive framework for modeling multiagent collaboration problems, but NEXP-Complete complexity hinders their scalability and application in real-world domains. This paper introduces a subclass of distributed POMDPs, and TREMOR, an algorithm to solve such distributed POMDPs. The primary novelty of TREMOR is that agents plan individually with a single agent POMDP solver and use social model shaping to implicitly coordinate with other agents. Experiments demonstrate that TREMOR can provide solutions orders of magnitude faster than existing algorithms while achieving comparable, or even superior, solution quality.

ICAPS Conference 2008 Conference Paper

Linear Relaxation Techniques for Task Management in Uncertain Settings

  • Pradeep Varakantham
  • Stephen F. Smith

In this paper, we consider the problem of assisting a busy user in managing her workload of pending tasks. We assume that our user is typically oversubscribed, and is invariably juggling multiple concurrent streams of tasks (or workflows) of varying importance and urgency. There is uncertainty with respect to the duration of a pending task as well as the amount of follow-on work that may be generated as a result of executing the task. The user's goal is to be as productive as possible; i.e., to execute tasks that realize the maximum cumulative payoff. This is achieved by enabling the assistant to provide advice about where and how to shed load when all tasks cannot be done. A simple temporal problem with uncertainty and preferences (called an STPPU) provides a natural framework for representing the user's current set of tasks. However, current STPPU solution techniques are inadequate as a basis for generating advice in this context, since they are applicable only in the restrictive case where all pending tasks can be accomplished within time constraints, and our principal concern is support in oversubscribed circumstances. We present two techniques based on linear relaxation for solving this oversubscribed problem. Given an ordering of tasks, these algorithms identify which tasks to ignore, which to compress and by how much, to maximize quality. We show experimentally that our approaches perform significantly better than techniques adapted from prior research in oversubscribed scheduling.

AAMAS Conference 2008 Conference Paper

Not All Agents Are Equal: Scaling up Distributed POMDPs for Agent Networks

  • Janusz Marecki
  • Tapana Gupta
  • Pradeep Varakantham
  • Milind Tambe
  • Makoto Yokoo

Many applications of networks of agents, including mobile sensor networks, unmanned air vehicles, autonomous underwater vehicles, involve 100s of agents acting collaboratively under uncertainty. Distributed Partially Observable Markov Decision Problems (Distributed POMDPs) are well-suited to address such applications, but so far, only limited scale-ups of up to five agents have been demonstrated. This paper escalates the scale-up, presenting an algorithm called FANS, increasing the number of agents in distributed POMDPs for the first time into double digits. FANS is founded on finite state machines (FSMs) for policy representation and exploits these FSMs to provide three key contributions: (i) Not all agents within an agent network need the same expressivity of policy representation; FANS introduces novel heuristics to automatically vary the FSM size in different agents for scale-up; (ii) FANS illustrates efficient integration of its FSM-based policy search within algorithms that exploit agent network structure; (iii) FANS provides significant speedups in policy evaluation and heuristic computations within the network algorithms by exploiting the FSMs for dynamic programming. Experimental results show not only orders of magnitude improvements over previous best known algorithms for smaller-scale domains (with similar solution quality), but also a scale-up into double digits in terms of numbers of agents.

IJCAI Conference 2007 Conference Paper

  • Pradeep Varakantham
  • Rajiv Maheswaran
  • Tapana Gupta
  • Milind Tambe

While POMDPs (Partially Observable Markov Decision Problems) are a popular computational model with wide-ranging applications, the computational cost for optimal policy generation is prohibitive. Researchers are investigating ever-more efficient algorithms, yet many applications demand that such algorithms bound any loss in policy quality when chasing efficiency. To address this challenge, we present two new techniques. The first approximates in the value space to obtain solutions efficiently for a pre-specified error bound. Unlike existing techniques, our technique guarantees the resulting policy will meet this bound. Furthermore, it does not require costly computations to determine the quality loss of the policy. Our second technique prunes large tracts of belief space that are unreachable, allowing faster policy computation without any sacrifice in optimality. The combination of the two techniques, which are complementary to existing optimal policy generation algorithms, provides solutions with tight error bounds efficiently in domains where competing algorithms fail to provide such tight bounds.

AAMAS Conference 2007 Conference Paper

Letting loose a SPIDER on a network of POMDPs: Generating quality guaranteed policies

  • Pradeep Varakantham
  • Janusz Marecki
  • Yuichi Yabu
  • Milind Tambe
  • Makoto Yokoo

Distributed Partially Observable Markov Decision Problems (Distributed POMDPs) are a popular approach for modeling multi-agent systems acting in uncertain domains. Given the significant complexity of solving distributed POMDPs, particularly as we scale up the numbers of agents, one popular approach has focused on approximate solutions. Though this approach is efficient, the algorithms within this approach do not provide any guarantees on solution quality. A second less popular approach focuses on global optimality, but typical results are available only for two agents, and also at considerable computational cost. This paper overcomes the limitations of both these approaches by providing SPIDER, a novel combination of three key features for policy generation in distributed POMDPs: (i) it exploits agent interaction structure given a network of agents (i.e., allowing easier scale-up to larger numbers of agents); (ii) it uses a combination of heuristics to speed up policy search; and (iii) it allows quality guaranteed approximations, allowing a systematic tradeoff of solution quality for time. Experimental results show orders of magnitude improvement in performance when compared with previous global optimal algorithms.

IJCAI Conference 2005 Conference Paper

Networked Distributed POMDPs: A Synergy of Distributed Constraint Optimization and POMDPs

  • Ranjit Nair
  • Pradeep Varakantham
  • Milind Tambe
  • Makoto Yokoo

In many real-world multiagent applications such as distributed sensor nets, a network of agents is formed based on each agent’s limited interactions with a small number of neighbors. While distributed POMDPs capture the real-world uncertainty in multiagent domains, they fail to exploit such locality of interaction. Distributed constraint optimization (DCOP) captures the locality of interaction but fails to capture planning under uncertainty. This paper presents a new model synthesized from distributed POMDPs and DCOPs, called Networked Distributed POMDPs (ND-POMDPs). Exploiting network structure enables us to present a distributed policy generation algorithm that performs local search.