Arrow Research search

Author name cluster

Balaraman Ravindran

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

48 papers
2 author rows

Possible papers

48

AAAI Conference 2026 Conference Paper

SafeMIL: Learning Offline Safe Imitation Policy from Non-Preferred Trajectories

  • Returaj Burnwal
  • Nirav Pravinbhai Bhatt
  • Balaraman Ravindran

In this work, we study the problem of offline safe imitation learning (IL). In many real-world settings, online interactions can be risky, and accurately specifying the reward and the safety cost information at each timestep can be difficult. However, it is often feasible to collect trajectories reflecting undesirable or risky behavior, implicitly conveying the behavior the agent should avoid. We refer to these trajectories as non-preferred trajectories. Unlike standard IL, which aims to mimic demonstrations, our agent must also learn to avoid risky behavior using non-preferred trajectories. In this paper, we propose a novel approach, SafeMIL, to learn a parameterized cost that predicts if the state-action pair is risky via Multiple Instance Learning. The learned cost is then used to avoid non-preferred behaviors, resulting in a policy that prioritizes safety. We empirically demonstrate that our approach can learn a safer policy that satisfies cost constraints without degrading the reward performance, thereby outperforming several baselines.
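The multiple-instance step at the heart of the approach above can be pictured as max-pooling a per-pair risk score over a trajectory: a non-preferred trajectory is a risky bag even if only some of its state-action pairs are actually risky. A minimal sketch of that pooling idea, with the scorer and all names purely illustrative (not from the paper):

```python
def bag_risk(score_fn, trajectory):
    """Multiple-instance pooling: a bag (trajectory) is as risky as its
    riskiest instance, so one bad state-action pair flags the whole bag."""
    return max(score_fn(sa) for sa in trajectory)

# Toy per-instance scorer: risk grows with the action magnitude
# (a hypothetical stand-in for the learned parameterized cost).
score = lambda sa: min(1.0, abs(sa[1]) / 10.0)

safe_traj = [(0, 1), (1, 2), (2, 1)]    # all low-risk (state, action) pairs
risky_traj = [(0, 1), (1, 9), (2, 1)]   # one high-risk pair taints the bag
```

Training the scorer so that bags from non-preferred trajectories score high (and others low) is what lets a single trajectory-level label supervise per-pair costs.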

AAMAS Conference 2024 Conference Paper

MABL: Bi-Level Latent-Variable World Model for Sample-Efficient Multi-Agent Reinforcement Learning

  • Aravind Venugopal
  • Stephanie Milani
  • Fei Fang
  • Balaraman Ravindran

Multi-agent reinforcement learning (MARL) methods often suffer from high sample complexity, limiting their use in real-world problems where data is sparse or expensive to collect. Although latent-variable world models have been employed to address this issue by generating abundant synthetic data for MARL training, most of these models cannot encode vital global information available during training into their latent states, which hampers learning efficiency. The few exceptions that incorporate global information assume centralized execution of their learned policies, which is impractical in many applications with partial observability. We propose a novel model-based MARL algorithm, MABL (Multi-Agent Bi-Level world model), that learns a bi-level latent-variable world model from high-dimensional inputs. Unlike existing models, MABL is capable of encoding essential global information into the latent states during training while guaranteeing the decentralized execution of learned policies. For each agent, MABL learns a global latent state at the upper level, which is used to inform the learning of an agent latent state at the lower level. During execution, agents exclusively use lower-level latent states and act independently. Crucially, MABL can be combined with any model-free MARL algorithm for policy learning. In our empirical evaluation with complex discrete and continuous multi-agent tasks including SMAC, Flatland, and MAMuJoCo, MABL surpasses SOTA multi-agent latent-variable world models in both sample efficiency and overall performance.

AAMAS Conference 2023 Conference Paper

Matching Options to Tasks using Option-Indexed Hierarchical Reinforcement Learning

  • Kushal Chauhan
  • Soumya Chatterjee
  • Akash Reddy
  • Aniruddha S
  • Balaraman Ravindran
  • Pradeep Shenoy

The options framework in Hierarchical Reinforcement Learning breaks down overall goals into a combination of simpler tasks (options) and their policies, allowing for abstraction in the action space. Ideally, options can be reused across different goals; indeed, this is necessary to build a continual learning agent that can effectively leverage its prior experience. Previous approaches allow limited transfer of pre-learned options to new task settings. We propose a novel option indexing approach to hierarchical learning (OI-HRL), where we learn an affinity function between options and items present in the environment. With OI-HRL, we effectively reuse a large library of pre-trained options in zero-shot generalization at test time by restricting goal-directed learning to relevant options alone. We develop a meta-training loop that learns the representations of options and environments over a series of HRL problems by incorporating feedback about the relevance of retrieved options to the higher-level goal. Our model is competitive with oracular baselines and substantially better than a baseline with the entire option pool available for learning the hierarchical policy.

TMLR Journal 2022 Journal Article

Benchmarking and Analyzing Unsupervised Network Representation Learning and the Illusion of Progress

  • Saket Gurukar
  • Priyesh Vijayan
  • Srinivasan Parthasarathy
  • Balaraman Ravindran
  • Aakash Srinivasan
  • Goonmeet Bajaj
  • Chen Cai
  • Moniba Keymanesh

A number of methods have been developed for unsupervised network representation learning -- ranging from classical methods based on the graph spectra to recent random walk based methods and from deep learning based methods to matrix factorization based methods. Each new study inevitably seeks to establish the relative superiority of the proposed method over others. The lack of a standard assessment protocol and benchmark suite often leaves practitioners wondering if a new idea represents a significant scientific advance. In this work, we articulate a clear and pressing need to systematically and rigorously benchmark such methods. Our overall assessment -- a result of a careful benchmarking of 15 methods for unsupervised network representation learning on 16 non-attributed graphs (several with different characteristics) -- is that many recently proposed improvements are somewhat of an illusion when assessed through the lens of downstream tasks such as link prediction and node classification. Specifically, we find that several proposed improvements are marginal at best and that aspects of many of these datasets often render such small differences insignificant, especially when viewed from a rigorous statistical lens. A more detailed analysis of our results identifies several new insights: first, we find that classical methods, often dismissed or not considered by recent efforts, can compete on certain types of datasets if they are tuned appropriately; second, we find that from a qualitative standpoint, a couple of methods based on matrix factorization offer a small but not always consistent advantage over alternative methods; third, no single method completely outperforms other embedding methods on both node classification and link prediction tasks. Finally, we also present several analyses that reveal settings under which certain algorithms perform well (e.g., the role of neighborhood context and dataset properties that impact performance). An important outcome of this study is the benchmark and evaluation protocol, which practitioners may find useful for future research in this area.

ICLR Conference 2022 Conference Paper

Causal Contextual Bandits with Targeted Interventions

  • Chandrasekar Subramanian
  • Balaraman Ravindran

We study a contextual bandit setting where the learning agent has the ability to perform interventions on targeted subsets of the population, apart from possessing qualitative causal side-information. This novel formalism captures intricacies in real-world scenarios such as software product experimentation where targeted experiments can be conducted. However, this fundamentally changes the set of options that the agent has, compared to standard contextual bandit settings, necessitating new techniques. This is also the first work that integrates causal side-information in a contextual bandit setting, where the agent aims to learn a policy that maps contexts to arms (as opposed to just identifying one best arm). We propose a new algorithm, which we show empirically performs better than baselines on experiments that use purely synthetic data and on real-world-inspired experiments. We also prove a regret bound that theoretically guarantees performance.

IJCAI Conference 2022 Conference Paper

Evolutionary Approach to Security Games with Signaling

  • Adam Żychowski
  • Jacek Mańdziuk
  • Elizabeth Bondi
  • Aravind Venugopal
  • Milind Tambe
  • Balaraman Ravindran

Green Security Games have become a popular way to model scenarios involving the protection of natural resources, such as wildlife. Sensors (e.g., drones equipped with cameras) have also begun to play a role in these scenarios by providing real-time information. Incorporating both human and sensor defender resources strategically is the subject of recent work on Security Games with Signaling (SGS). However, current methods to solve SGS do not scale well in terms of time or memory. We therefore propose a novel approach to SGS, which, for the first time in this domain, employs an Evolutionary Computation paradigm: EASGS. EASGS effectively searches the huge SGS solution space via suitable solution encoding in a chromosome and a specially-designed set of operators. The operators include three types of mutations, each focusing on a particular aspect of the SGS solution, optimized crossover and a local coverage improvement scheme (a memetic aspect of EASGS). We also introduce a new set of benchmark games, based on dense or locally-dense graphs that reflect real-world SGS settings. In the majority of 342 test game instances, EASGS outperforms state-of-the-art methods, including a reinforcement learning method, in terms of time scalability, nearly constant memory utilization, and quality of the returned defender's strategies (expected payoffs).

AAAI Conference 2021 Conference Paper

An Enhanced Advising Model in Teacher-Student Framework using State Categorization

  • Daksh Anand
  • Vaibhav Gupta
  • Praveen Paruchuri
  • Balaraman Ravindran

The teacher-student framework aims to improve the sample efficiency of RL algorithms by deploying an advising mechanism in which a teacher helps a student by guiding its exploration. Prior work in this field has considered an advising mechanism where the teacher advises the student about the optimal action to take in a given state. However, real-world teachers can leverage domain expertise to provide more informative signals. Using this insight, we propose to extend the current advising framework wherein the teacher would provide not only the optimal action but also a qualitative assessment of the state. We introduce a novel architecture, namely Advice Replay Memory (ARM), to effectively reuse the advice provided by the teacher. We demonstrate the robustness of our approach by showcasing our experiments on multiple Atari 2600 games using a fixed set of hyper-parameters. Additionally, we show that a student taking help even from a suboptimal teacher can achieve significant performance boosts and eventually outperform the teacher. Our approach outperforms the baselines even when provided with comparatively suboptimal teachers and an advising budget that is smaller by orders of magnitude. The contributions of our paper are four-fold: (a) supplementing the student’s knowledge by providing the state category; (b) the introduction of ARM to effectively reuse advice throughout learning; (c) the ability to achieve a significant performance boost even with a coarse state categorization; and (d) enabling the student to outperform the teacher.
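The Advice Replay Memory described above amounts to a buffer of advice tuples that the student re-samples during learning. A minimal sketch, with the interface and field names assumed rather than taken from the paper:

```python
import random
from collections import deque

class AdviceReplayMemory:
    """Buffer of teacher advice: (state, advised_action, state_category)."""
    def __init__(self, capacity=1000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, advised_action, state_category):
        self.buffer.append((state, advised_action, state_category))

    def sample(self, batch_size):
        """Re-sample stored advice so it keeps shaping the student's
        updates long after the advising budget is exhausted."""
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

arm = AdviceReplayMemory(capacity=100)
arm.add(state=(0, 0), advised_action=2, state_category="near-goal")
arm.add(state=(3, 1), advised_action=0, state_category="hazard")
batch = arm.sample(2)
```

The `deque` with `maxlen` gives FIFO eviction for free, so the buffer holds the most recent advice once capacity is reached.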

JAIR Journal 2021 Journal Article

MADRaS: Multi Agent Driving Simulator

  • Anirban Santara
  • Sohan Rudra
  • Sree Aditya Buridi
  • Meha Kaushik
  • Abhishek Naik
  • Bharat Kaul
  • Balaraman Ravindran

Autonomous driving has emerged as one of the most active areas of research as it has the promise of making transportation safer and more efficient than ever before. Most real-world autonomous driving pipelines perform perception, motion planning and action in a loop. In this work we present MADRaS, an open-source multi-agent driving simulator for use in the design and evaluation of motion planning algorithms for autonomous driving. Given a start and a goal state, the task of motion planning is to solve for a sequence of position, orientation and speed values in order to navigate between the states while adhering to safety constraints. These constraints often involve the behaviors of other agents in the environment. MADRaS provides a platform for constructing a wide variety of highway and track driving scenarios where multiple driving agents can be trained for motion planning tasks using reinforcement learning and other machine learning algorithms. MADRaS is built on TORCS, an open-source car-racing simulator. TORCS offers a variety of cars with different dynamic properties and driving tracks with different geometries and surfaces. MADRaS inherits these functionalities from TORCS and introduces support for multi-agent training, inter-vehicular communication, noisy observations, stochastic actions, and custom traffic cars whose behaviors can be programmed to simulate challenging traffic conditions encountered in the real world. MADRaS can be used to create driving tasks whose complexities can be tuned along eight axes in well-defined steps. This makes it particularly suited for curriculum and continual learning. MADRaS is lightweight and it provides a convenient OpenAI Gym interface for independent control of each car. Apart from the primitive steering-acceleration-brake control mode of TORCS, MADRaS offers a hierarchical track-position and speed control mode that can potentially be used to achieve better generalization.
MADRaS uses a UDP-based client-server model where the simulation engine is the server and each client is a driving agent. MADRaS uses multiprocessing to run each agent as a parallel process for efficiency and integrates well with popular reinforcement learning libraries like RLlib. We show experiments on single- and multi-agent reinforcement learning with and without curriculum.
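The "convenient OpenAI Gym interface" mentioned above implies the standard reset/step control loop. The sketch below runs that loop against a mock environment; the class, observation fields, and reward are placeholders standing in for the real simulator, which requires TORCS:

```python
class MockDrivingEnv:
    """Stand-in exposing the classic Gym API: reset() -> obs,
    step(action) -> (obs, reward, done, info)."""
    def __init__(self, horizon=5):
        self.horizon, self.t = horizon, 0

    def reset(self):
        self.t = 0
        return {"track_pos": 0.0, "speed": 0.0}

    def step(self, action):
        self.t += 1
        obs = {"track_pos": 0.0, "speed": float(self.t)}
        reward = 1.0                       # placeholder shaping reward
        done = self.t >= self.horizon
        return obs, reward, done, {}

env = MockDrivingEnv()
obs, total, done = env.reset(), 0.0, False
while not done:                            # one episode of the control loop
    action = (0.0, 0.5, 0.0)               # (steer, accel, brake) placeholder
    obs, reward, done, info = env.step(action)
    total += reward
```

In the multi-agent setting, each car's client would run this same loop in its own process against the shared simulation server.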

AAMAS Conference 2021 Conference Paper

Reinforcement Learning for Unified Allocation and Patrolling in Signaling Games with Uncertainty

  • Aravind Venugopal
  • Elizabeth Bondi
  • Harshavardhan Kamarthi
  • Keval Dholakia
  • Balaraman Ravindran
  • Milind Tambe

Green Security Games (GSGs) have been successfully used in the protection of valuable resources such as fisheries, forests, and wildlife. Real-world deployment involves both resource allocation and subsequent coordinated patrolling with communication in the presence of real-time, uncertain information. Previous game models do not address both of these stages simultaneously. Furthermore, adopting existing solution strategies is difficult since they do not scale well for larger, more complex variants of the game models. We propose a novel GSG model to address these challenges. We also present a novel algorithm, CombSGPO, to compute a defender strategy for this game model. CombSGPO performs policy search over a multidimensional, discrete action space to compute an allocation strategy that is best suited to a best-response patrolling strategy for the defender, learnt by training a multi-agent Deep Q-Network. We show via experiments that CombSGPO converges to better strategies and is more scalable than comparable approaches. From a detailed analysis of the coordination and signaling behavior learnt by CombSGPO, we find that strategic signaling emerges in the final learnt strategy.

AAAI Conference 2021 Conference Paper

Relational Boosted Bandits

  • Ashutosh Kakadiya
  • Sriraam Natarajan
  • Balaraman Ravindran

Contextual bandits algorithms have become essential in real-world user interaction problems in recent years. However, these algorithms represent context as attribute-value representations, which makes them infeasible for real-world domains like social networks, which are inherently relational. We propose Relational Boosted Bandits (RB2), a contextual bandits algorithm for relational domains based on (relational) boosted trees. RB2 enables us to learn interpretable and explainable models due to the more descriptive nature of the relational representation. We empirically demonstrate the effectiveness and interpretability of RB2 on tasks such as link prediction, relational classification, and recommendation.

ICAPS Conference 2021 Conference Paper

RePReL: Integrating Relational Planning and Reinforcement Learning for Effective Abstraction

  • Harsha Kokel
  • Arjun Manoharan
  • Sriraam Natarajan
  • Balaraman Ravindran
  • Prasad Tadepalli

State abstraction is necessary for better task transfer in complex reinforcement learning environments. Inspired by the benefit of state abstraction in MAXQ and building upon hybrid planner-RL architectures, we propose RePReL, a hierarchical framework that leverages a relational planner to provide useful state abstractions. Our experiments demonstrate that the abstractions enable faster learning and efficient transfer across tasks. More importantly, our framework enables the application of standard RL approaches for learning in structured domains. The benefit of using the state abstractions is critical in relational settings, where the number and/or types of objects are not fixed a priori. Our experiments clearly show that the RePReL framework not only achieves better performance and efficient learning on the task at hand but also demonstrates better generalization to unseen tasks.

PRL Workshop 2021 Workshop Paper

RePReL: Integrating Relational Planning and Reinforcement Learning for Effective Abstraction

  • Harsha Kokel
  • Arjun Manoharan
  • Sriraam Natarajan
  • Balaraman Ravindran
  • Prasad Tadepalli

State abstraction enables sample-efficient learning and better task transfer in complex reinforcement learning environments. Inspired by the benefits of state abstraction in hierarchical planning and learning, we propose RePReL, a hierarchical framework that leverages a relational planner to provide useful state abstractions for learning. State abstraction is especially beneficial in relational settings, where the number and/or types of objects are not fixed a priori. Our experiments show that the RePReL framework not only achieves better performance and efficient learning on the task at hand but also demonstrates better generalization to unseen tasks. It has been argued that for human-level general intelligence, the ability to detect compositional structure in the domain (Lake et al. 2017) and form task-specific abstractions (Konidaris 2019) is necessary. Our RePReL framework takes a step in that direction by formalizing the prior domain knowledge that gives rise to effective task-specific abstractions.

AAMAS Conference 2021 Conference Paper

SEERL: Sample Efficient Ensemble Reinforcement Learning

  • Rohan Saphal
  • Balaraman Ravindran
  • Dheevatsa Mudigere
  • Sasikant Avancha
  • Bharat Kaul

Ensemble learning is a very prevalent method employed in machine learning. The relative success of ensemble methods is attributed to their ability to tackle a wide range of instances and complex problems that require different low-level approaches. However, ensemble methods are relatively less popular in reinforcement learning owing to the high sample complexity and computational expense involved in obtaining a diverse ensemble. We present a novel training and model selection framework for model-free reinforcement learning algorithms that uses ensembles of policies obtained from a single training run. These policies are diverse in nature and are learned through directed perturbation of the model parameters at regular intervals. We show that learning and selecting an adequately diverse set of policies is required for a good ensemble, while extreme diversity can prove detrimental to overall performance. Selection of an adequately diverse set of policies is done through our novel policy selection framework. We evaluate our approach on challenging discrete and continuous control tasks and also discuss various ensembling strategies. Our framework is substantially sample efficient, computationally inexpensive and is seen to outperform state-of-the-art (SOTA) scores in Atari 2600 and MuJoCo.
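The evaluation-time ensembling of snapshot policies can be illustrated with one simple strategy, majority voting (the paper discusses several strategies; the toy policies and perturbation scheme here are illustrative only):

```python
from collections import Counter

def snapshot_policies(num_snapshots, perturb):
    """Toy stand-in for policies saved at regular intervals of one run:
    each 'snapshot' differs only through a perturbation offset."""
    return [lambda state, k=k: (state + perturb * k) % 3
            for k in range(num_snapshots)]

def majority_vote(policies, state):
    """One simple ensembling strategy: act by majority vote of snapshots."""
    votes = Counter(p(state) for p in policies)
    return votes.most_common(1)[0][0]

policies = snapshot_policies(num_snapshots=5, perturb=0)  # identical snapshots
action = majority_vote(policies, state=2)                 # unanimous vote
```

The `k=k` default argument freezes each snapshot's offset at creation time, the standard idiom for closures in a loop.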

ICLR Conference 2020 Conference Paper

EMPIR: Ensembles of Mixed Precision Deep Networks for Increased Robustness Against Adversarial Attacks

  • Sanchari Sen
  • Balaraman Ravindran
  • Anand Raghunathan

Ensuring robustness of Deep Neural Networks (DNNs) is crucial to their adoption in safety-critical applications such as self-driving cars, drones, and healthcare. Notably, DNNs are vulnerable to adversarial attacks in which small input perturbations can produce catastrophic misclassifications. In this work, we propose EMPIR, ensembles of quantized DNN models with different numerical precisions, as a new approach to increase robustness against adversarial attacks. EMPIR is based on the observation that quantized neural networks often demonstrate much higher robustness to adversarial attacks than full precision networks, but at the cost of a substantial loss in accuracy on the original (unperturbed) inputs. EMPIR overcomes this limitation to achieve the "best of both worlds", i.e., the higher unperturbed accuracies of the full precision models combined with the higher robustness of the low precision models, by composing them in an ensemble. Further, as low precision DNN models have significantly lower computational and storage requirements than full precision models, EMPIR models only incur modest compute and memory overheads compared to a single full-precision model (<25% in our evaluations). We evaluate EMPIR across a suite of DNNs for 3 different image recognition tasks (MNIST, CIFAR-10 and ImageNet) and under 4 different adversarial attacks. Our results indicate that EMPIR boosts the average adversarial accuracies by 42.6%, 15.2% and 10.5% for the DNN models trained on the MNIST, CIFAR-10 and ImageNet datasets respectively, when compared to single full-precision models, without sacrificing accuracy on the unperturbed inputs.
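EMPIR's composition step, combining the outputs of members quantized to different precisions, can be sketched with a toy one-layer scorer in place of the real CNNs (the quantizer, weights, and averaging rule below are illustrative assumptions):

```python
import math

def quantize(w, bits):
    """Uniform quantization of a weight in [-1, 1] to 2**bits levels."""
    levels = 2 ** bits - 1
    return round((w + 1.0) / 2.0 * levels) / levels * 2.0 - 1.0

def predict(weights, x):
    """Toy 'model': softmax over per-class linear scores w_c * x."""
    scores = [w * x for w in weights]
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

full = [0.8, -0.3, 0.1]                     # full-precision class weights
members = [full,
           [quantize(w, 4) for w in full],  # 4-bit ensemble member
           [quantize(w, 2) for w in full]]  # 2-bit ensemble member

x = 1.5
preds = [predict(m, x) for m in members]    # each member's class probabilities
avg = [sum(p[i] for p in preds) / len(preds) for i in range(3)]
```

An adversarial perturbation crafted against the full-precision member often transfers poorly to the coarsely quantized members, which is the intuition behind averaging their outputs.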

AAAI Conference 2020 Short Paper

ERLP: Ensembles of Reinforcement Learning Policies (Student Abstract)

  • Rohan Saphal
  • Balaraman Ravindran
  • Dheevatsa Mudigere
  • Sasikanth Avancha
  • Bharat Kaul

Reinforcement learning algorithms are sensitive to hyperparameters and require tuning and tweaking for specific environments to improve performance. Ensembles of reinforcement learning models, on the other hand, are known to be much more robust and stable. However, training multiple models independently on an environment suffers from high sample complexity. We present here a methodology to create multiple models from a single training instance that can be used in an ensemble through directed perturbation of the model parameters at regular intervals. This allows training a single model that converges to several local minima during the optimization process as a result of the perturbation. By saving the model parameters at each such instance, we obtain multiple policies during training that are ensembled during evaluation. We evaluate our approach on challenging discrete and continuous control tasks and also discuss various ensembling strategies. Our framework is substantially sample efficient, computationally inexpensive and is seen to outperform state-of-the-art (SOTA) approaches.

IROS Conference 2020 Conference Paper

Understanding Dynamic Scenes using Graph Convolution Networks

  • Sravan Mylavarapu
  • Mahtab Sandhu
  • Priyesh Vijayan
  • K. Madhava Krishna
  • Balaraman Ravindran
  • Anoop M. Namboodiri

We present a novel Multi-Relational Graph Convolutional Network (MRGCN) based framework to model on-road vehicle behaviors from a sequence of temporally ordered frames as grabbed by a moving monocular camera. The input to MRGCN is a multi-relational graph where the graph's nodes represent the active and passive agents/objects in the scene, and the bidirectional edges that connect every pair of nodes are encodings of their spatio-temporal relations. We show that this explicit encoding and use of an intermediate spatio-temporal interaction graph is well suited for our tasks compared to learning end-to-end directly on a set of temporally ordered spatial relations. We also propose an attention mechanism for MRGCNs that, conditioned on the scene, dynamically scores the importance of information from different interaction types. The proposed framework achieves significant performance gains over prior methods on vehicle-behavior classification tasks on four datasets. We also show a seamless transfer of learning to multiple datasets without resorting to fine-tuning. Such behavior prediction methods find immediate relevance in a variety of navigation tasks such as behavior planning, state estimation, and applications relating to the detection of traffic violations over videos.

AAMAS Conference 2019 Conference Paper

Advice Replay Approach for Richer Knowledge Transfer in Teacher Student Framework

  • Vaibhav Gupta
  • Daksh Anand
  • Praveen Paruchuri
  • Balaraman Ravindran

One of the major drawbacks of RL is the low sample efficiency of the learning algorithms. In many cases domain expertise can help to mitigate this effect. The Teacher-Student framework is one such paradigm, where a more experienced agent (teacher), upon being queried, helps to accelerate the student’s learning by providing advice on the action to take in a given state. Real-world teachers not only provide the action to take in a given state but also provide a more informative signal using the synthesis of knowledge they may have gained with experience. With this motivation, we propose a richer advising framework where the teacher augments the student’s knowledge by also providing the expected long-term reward of following that action. The student can then use this value to steadily guide its Q-Network in the correct direction, which can lead to quicker convergence. To help the student revisit the advice received throughout its learning, we introduce an additional memory called the Advice Replay Memory (ARM). Results show that a student following our approach (a) is able to exploit the environment better, and (b) has a steeper learning curve.

AAMAS Conference 2019 Conference Paper

MaMiC: Macro and Micro Curriculum for Robotic Reinforcement Learning

  • Manan Tomar
  • Akhil Sathuluri
  • Balaraman Ravindran

Shaping in humans and animals has been shown to be a powerful tool for learning complex tasks as compared to learning in a randomized fashion. This makes the problem less complex and enables one to solve the easier sub-task at hand first. Generating a curriculum for such guided learning involves subjecting the agent to easier goals first, and then gradually increasing their difficulty. This paper takes a similar direction and proposes a dual curriculum scheme for solving robotic manipulation tasks with sparse rewards, called MaMiC. It includes a macro curriculum scheme which divides the task into multiple sub-tasks followed by a micro curriculum scheme which enables the agent to learn between such discovered sub-tasks. We show how combining macro and micro curriculum strategies helps in overcoming major exploratory constraints considered in robot manipulation tasks without having to engineer any complex rewards. The performance of such a dual curriculum scheme is analyzed on the Fetch environments.

AAAI Conference 2019 Short Paper

MaMiC: Macro and Micro Curriculum for Robotic Reinforcement Learning

  • Manan Tomar
  • Akhil Sathuluri
  • Balaraman Ravindran

Generating a curriculum for guided learning involves subjecting the agent to easier goals first, and then gradually increasing their difficulty. This work takes a similar direction and proposes a dual curriculum scheme for solving robotic manipulation tasks with sparse rewards, called MaMiC. It includes a macro curriculum scheme which divides the task into multiple subtasks followed by a micro curriculum scheme which enables the agent to learn between such discovered subtasks. We show how combining macro and micro curriculum strategies help in overcoming major exploratory constraints considered in robot manipulation tasks without having to engineer any complex rewards and also illustrate the meaning and usage of the individual curricula. The performance of such a scheme is analysed on the Fetch environments.

IJCAI Conference 2019 Conference Paper

Successor Options: An Option Discovery Framework for Reinforcement Learning

  • Rahul Ramesh
  • Manan Tomar
  • Balaraman Ravindran

The options framework in reinforcement learning models the notion of a skill or a temporally extended sequence of actions. The discovery of a reusable set of skills has typically entailed building options that navigate to bottleneck states. In this work, we instead adopt a complementary approach, where we attempt to discover options that navigate to landmark states. These states are prototypical representatives of well-connected regions and can hence access the associated region with relative ease. In this work, we propose Successor Options, which leverages Successor representations to build a model of the state space. The intra-option policies are learnt using a novel pseudo-reward and the model scales to high-dimensional spaces since it does not construct an explicit graph of the entire state space. Additionally, we also propose an Incremental Successor Options model that iterates between constructing Successor representations and building options, which is useful when robust Successor representations cannot be built solely from primitive actions. We demonstrate the efficacy of our approach on a collection of grid-worlds, and on the high-dimensional robotic control environment of Fetch.
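The Successor representation leveraged above is, for a fixed policy, the expected discounted future state occupancy M = Σ_t γ^t P^t = (I − γP)^{-1}. A small sketch computing it by fixed-point iteration on a toy 3-state ring (the transition matrix is assumed, not from the paper):

```python
def successor_representation(P, gamma, iters=500):
    """Iterate M <- I + gamma * P @ M, the fixed point of which
    is (I - gamma * P)^{-1}, the tabular successor representation."""
    n = len(P)
    M = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(iters):
        M = [[(1.0 if i == j else 0.0) +
              gamma * sum(P[i][k] * M[k][j] for k in range(n))
              for j in range(n)] for i in range(n)]
    return M

# Deterministic ring: 0 -> 1 -> 2 -> 0 under the fixed policy.
P = [[0, 1, 0],
     [0, 0, 1],
     [1, 0, 0]]
M = successor_representation(P, gamma=0.9)
row_sums = [sum(row) for row in M]   # each row sums to 1 / (1 - gamma)
```

States with similar rows of M occupy the same well-connected region, which is what makes clustering successor representations a natural way to find landmark states.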

AAAI Conference 2018 Conference Paper

Efficient-UCBV: An Almost Optimal Algorithm Using Variance Estimates

  • Subhojyoti Mukherjee
  • K. P. Naveen
  • Nandan Sudarsanam
  • Balaraman Ravindran

We propose a novel variant of the UCB algorithm (referred to as Efficient-UCB-Variance (EUCBV)) for minimizing cumulative regret in the stochastic multi-armed bandit (MAB) setting. EUCBV incorporates the arm elimination strategy proposed in UCB-Improved (Auer and Ortner 2010), while taking into account the variance estimates to compute the arms’ confidence bounds, similar to UCBV (Audibert, Munos, and Szepesvári 2009). Through a theoretical analysis we establish that EUCBV incurs a gap-dependent regret bound of O(K σ²_max log(T Δ²/K) / Δ) after T trials, where Δ is the minimal gap between optimal and sub-optimal arms; the above bound is an improvement over that of existing state-of-the-art UCB algorithms (such as UCB1, UCB-Improved, UCBV, MOSS). Further, EUCBV incurs a gap-independent regret bound of O(√(KT)), which is an improvement over that of UCB1, UCBV and UCB-Improved, while being comparable with that of MOSS and OCUCB. Through an extensive numerical study we show that EUCBV significantly outperforms the popular UCB variants (like MOSS, OCUCB, etc.) as well as Thompson sampling and Bayes-UCB algorithms.
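The variance-aware confidence bound above builds on the UCB-V index of Audibert et al.: the empirical mean plus a bonus that shrinks when the empirical variance is low. A sketch of that index (the classic UCBV form, not the paper's exact EUCBV rule):

```python
import math

def ucbv_index(mean, var, n, t):
    """UCB-V index: empirical mean plus a variance-aware exploration bonus.
    Arms with low empirical variance get much tighter confidence bounds."""
    return mean + math.sqrt(2.0 * var * math.log(t) / n) + 3.0 * math.log(t) / n

# Two arms, equal means and pull counts, different empirical variances:
t, n = 100, 10
low_var = ucbv_index(mean=0.5, var=0.01, n=n, t=t)
high_var = ucbv_index(mean=0.5, var=0.25, n=n, t=t)
# The high-variance arm keeps the larger exploration bonus.
```

UCB1, by contrast, uses the worst-case variance bound for [0, 1] rewards, so it cannot exploit arms that happen to have low variance.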

AAMAS Conference 2018 Conference Paper

RAIL: Risk-Averse Imitation Learning

  • Anirban Santara
  • Abhishek Naik
  • Balaraman Ravindran
  • Dipankar Das
  • Dheevatsa Mudigere
  • Sasikanth Avancha
  • Bharat Kaul

Imitation learning algorithms learn viable policies by imitating an expert’s behavior when reward signals are not available. Generative Adversarial Imitation Learning (GAIL) is a state-of-the-art algorithm for learning policies when the expert’s behavior is available as a fixed set of trajectories. Evaluating learned policies in terms of the expert’s cost function, we observe that the distribution of trajectory-costs is often more heavy-tailed for GAIL-agents than for the expert on a number of benchmark continuous-control tasks. Thus, high-cost trajectories, corresponding to tail-end events of catastrophic failure, are more likely to be encountered by the GAIL-agents than the expert. This makes the reliability of GAIL-agents questionable when it comes to deployment in risk-sensitive applications like robotic surgery and autonomous driving. In this work, we aim to minimize the occurrence of tail-end events by minimizing tail risk within the GAIL framework. We quantify tail risk by the Conditional Value-at-Risk (CVaR) of trajectories and develop the Risk-Averse Imitation Learning (RAIL) algorithm. We observe that the policies learned with RAIL show lower tail-end risk than those of vanilla GAIL. Thus, the proposed RAIL algorithm appears as a potent alternative to GAIL for improved reliability in risk-sensitive applications.
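Tail risk as quantified above, CVaR at level α, is the expected cost over the worst (1 − α) fraction of trajectories. A minimal empirical-CVaR sketch:

```python
def empirical_cvar(costs, alpha):
    """Empirical CVaR_alpha: mean of the worst (1 - alpha) fraction of costs."""
    worst = sorted(costs, reverse=True)
    k = max(1, int(round((1.0 - alpha) * len(costs))))
    return sum(worst[:k]) / k

costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]    # toy per-trajectory costs
tail = empirical_cvar(costs, alpha=0.8)     # mean of the worst 20%: (10 + 9) / 2
```

Unlike the plain mean, CVaR is insensitive to improvements outside the tail, which is why minimizing it targets exactly the catastrophic-failure trajectories.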

RLDM Conference 2017 Conference Abstract

Attend, Adapt and Transfer: Attentive Deep Architecture for Adaptive Transfer from multiple sources in the same domain

  • Aravind Srinivas Lakshminarayanan
  • Janarthanan Rajendran
  • Mitesh M. Khapra
  • Prasanna Parthasarathi
  • Balaraman Ravindran

Transferring knowledge from prior source tasks in solving a new target task can be useful in several learning applications. The application of transfer poses two serious challenges which have not been adequately addressed. First, the agent should be able to avoid negative transfer, which happens when the transfer hampers or slows down the learning instead of helping it. Second, the agent should be able to selectively transfer, which is the ability to select and transfer from different and multiple source tasks for different parts of the state space of the target task. We propose A2T (Attend, Adapt and Transfer), an attentive deep architecture which adapts and transfers from multiple source tasks. Our model is generic enough to effect transfer of either policies or value functions. Empirical evaluations on different learning algorithms show that A2T is an effective architecture for transfer by being able to avoid negative transfer while transferring selectively from multiple source tasks in the same domain.

AAAI Conference 2017 Conference Paper

Dynamic Action Repetition for Deep Reinforcement Learning

  • Aravind Lakshminarayanan
  • Sahil Sharma
  • Balaraman Ravindran

One of the long-standing goals of Artificial Intelligence (AI) is to build cognitive agents which can perform complex tasks from raw sensory inputs without explicit supervision (Lake et al. 2016). Recent progress in combining Reinforcement Learning objective functions and Deep Learning architectures has achieved promising results for such tasks. An important aspect of such sequential decision making problems, which has largely been neglected, is for the agent to decide on the duration of time for which to commit to actions. Such action repetition is important for computational efficiency, which is necessary for the agent to respond in real-time to events (in applications such as self-driving cars). Action Repetition arises naturally in real life as well as simulated environments. The time scale of executing an action enables an agent (both humans and AI) to decide the granularity of control during task execution. Current state-of-the-art Deep Reinforcement Learning models, whether they are off-policy (Mnih et al. 2015; Wang et al. 2015) or on-policy (Mnih et al. 2016), consist of a framework with a static action repetition paradigm, wherein the action decided by the agent is repeated for a fixed number of time steps regardless of the contextual state while executing the task. In this paper, we propose a new framework - Dynamic Action Repetition - which changes Action Repetition Rate (the time scale of repeating an action) from a hyper-parameter of an algorithm to a dynamically learnable quantity. At every decision-making step, our models allow the agent to commit to an action and the time scale of executing the action. We show empirically that such a dynamic time scale mechanism improves the performance on relatively harder games in the Atari 2600 domain, independent of the underlying Deep Reinforcement Learning algorithm used.
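One common way to make the repetition rate learnable, in the spirit of the framework described above (the specific actions, repetition rates, and helper names below are illustrative assumptions, not the paper's), is to let the agent act in an augmented space of (action, repeat) pairs:

```python
# Augment a discrete action set with candidate repetition rates: at each
# decision step the agent commits to both an action and its time scale.
ACTIONS = ["left", "right", "fire"]
REPEAT_RATES = [1, 4, 20]          # illustrative time scales

AUGMENTED = [(a, r) for a in ACTIONS for r in REPEAT_RATES]

def step_with_repetition(env_step, action, repeat):
    """Apply `action` for `repeat` frames, accumulating reward.

    `env_step` is any callable returning (reward, done) per frame.
    """
    total = 0.0
    for _ in range(repeat):
        reward, done = env_step(action)
        total += reward
        if done:
            break
    return total

# The policy is then learned over AUGMENTED (|A| * |R| choices), so the
# repetition rate becomes a decision rather than a fixed hyper-parameter.
```

Learning over the product space trades a larger output layer for contextual control of the action granularity.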

RLDM Conference 2017 Conference Abstract

Optimal sample size for A/B tests using cumulative regret

  • Nandan Sudarsanam
  • PitchaiKannu Balaji
  • Balaraman Ravindran

This work presents theoretical results for determining optimal sample size for A/B tests in the online setting. Our explorations differ from previous studies on three accounts. The first is that we recommend a sample size on the basis of cumulative regret, as opposed to the typical statistical significance tests commonly used in A/B tests. The second is that we seek to optimize expected cumulative regret rather than the upper bound on cumulative regret. The third and most important contribution is that, similar to the Bayesian framework, we model the theoretical means of the alternatives as a random variable, which enables us to go beyond the gap-dependent and gap-independent results that are typical in bandit studies. We study Gaussian and binary reward distributions, with corresponding Gaussian and Uniform distribution of means. Specifically, in the case which explores a Gaussian reward distribution (noise) along with a different Gaussian distribution representing the theoretical means of the alternatives, we derive a closed form solution for the optimal sample size which is only a function of the trial horizon and a ratio of the standard deviations of these two distributions. Our results are compared to settings where an equivalent fixed gap is assumed between the means of the alternatives. Our results indicate that when the gap between alternatives is modeled as a random variable, the optimal sample sizes deviate significantly from the corresponding fixed gap settings.

RLDM Conference 2017 Conference Abstract

Shared Learning in Ensemble Deep Q-Networks

  • Rakesh R Menon
  • Manu Srinath Halvagal
  • Balaraman Ravindran

Most deep RL solutions still use extensions of conventional exploration strategies that have been well studied and offer theoretical guarantees in bandit problems and simple MDPs. However, exploration in large state spaces needs to be more directed than is possible with these traditional exploration strategies such as ε-greedy. The recently proposed Bootstrapped DQN offers a new exploration strategy that is capable of deep directed exploration and is better suited for deep RL problems. Bootstrapped DQN works by learning multiple independent estimates of the action-value function and guiding action selection using a randomly selected estimate. The method relies on variability among the different value estimators (called heads) for effective and deep exploration. In Bootstrapped DQN, this variability is ensured through both selective masking of training examples as well as by the random initialization of network parameters of each head. The network is trained using the Double DQN update rule. Double DQN is an adaptation of Double Q-Learning which is meant to reduce the overestimation bias in Q-learning. In both Double DQN and Bootstrapped DQN, the target network is used as a stand-in for an independent estimate of the action-value function in the update rule. Independent estimates are needed for Double Q-learning to perform effective updates. However, the target network is highly coupled to the online network leading to imperfect double Q-learning updates. We propose shared learning, an algorithm which takes advantage of the ensemble architecture of Bootstrapped DQN to overcome the issue with coupled estimates described above. Further, we supplement our algorithm with a framework to share learned experience amongst the bootstrapped heads. We demonstrate how this method can help in speeding up the existing Bootstrapped DQN algorithm with minimal computational overhead.
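The decoupling idea can be sketched as follows: one bootstrapped head selects the greedy next action while a different head scores it, instead of relying on a target network that is tied to the online network. This is an illustrative reading of the abstract, not the authors' implementation; the tabular `q_heads` structure below is an assumption for brevity:

```python
import random

import numpy as np

def select_head(num_heads, rng):
    """Bootstrapped-DQN exploration: follow one random head per episode."""
    return rng.randrange(num_heads)

def double_q_target(q_heads, next_state, reward, gamma, acting, evaluating):
    """Double-Q-style target with two bootstrapped heads.

    `acting` chooses the argmax action; `evaluating` scores it. Using a
    different ensemble head as the evaluator gives a less coupled second
    estimate than the usual online/target-network pair.
    """
    a_star = int(np.argmax(q_heads[acting][next_state]))
    return reward + gamma * q_heads[evaluating][next_state][a_star]
```

With tabular Q-values per head, the two-head target reduces to a plain double-Q update whenever the heads are trained on disjoint experience masks.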

IJCAI Conference 2017 Conference Paper

Thresholding Bandits with Augmented UCB

  • Subhojyoti Mukherjee
  • Naveen Kolar Purushothama
  • Nandan Sudarsanam
  • Balaraman Ravindran

In this paper we propose the Augmented-UCB (AugUCB) algorithm for a fixed-budget version of the thresholding bandit problem (TBP), where the objective is to identify a set of arms whose quality is above a threshold. A key feature of AugUCB is that it uses both mean and variance estimates to eliminate arms that have been sufficiently explored; to the best of our knowledge this is the first algorithm to employ such an approach for the considered TBP. Theoretically, we obtain an upper bound on the loss (probability of mis-classification) incurred by AugUCB. Although UCBEV in the literature provides a better guarantee, it is important to emphasize that UCBEV has access to problem complexity (whose computation requires arms' mean and variances), and hence is not realistic in practice; this is in contrast to AugUCB whose implementation does not require any such complexity inputs. We conduct extensive simulation experiments to validate the performance of AugUCB. Through our simulation work, we establish that AugUCB, owing to its utilization of variance estimates, performs significantly better than the state-of-the-art APT, CSAR and other non-variance-based algorithms.
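A hedged sketch of the kind of mean-plus-variance elimination test described above (the width formula and the `budget_term` parameter are illustrative stand-ins, not AugUCB's exact constants or schedule):

```python
import math

def confidence_width(var, n, budget_term):
    """Variance-aware confidence width after n pulls (illustrative form)."""
    return math.sqrt(var * budget_term / n) + budget_term / n

def can_eliminate(mean, var, n, threshold, budget_term):
    """Stop sampling an arm once its interval clears the threshold.

    An arm is classified (above or below) as soon as its whole
    confidence interval lies on one side of the threshold.
    """
    w = confidence_width(var, n, budget_term)
    return mean + w < threshold or mean - w > threshold
```

Because low-variance arms get narrow intervals quickly, the remaining budget concentrates on arms whose means sit near the threshold, which is where the variance-blind tests waste samples.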

EWRL Workshop 2015 Workshop Paper

A Reinforcement Learning Approach to Online Learning of Decision Trees

  • Abhinav Garlapati
  • Aditi Raghunathan
  • Vaishnavh Nagarajan
  • Balaraman Ravindran

Online decision tree learning algorithms typically examine all features of a new data point to update model parameters. We propose a novel alternative, Reinforcement Learning-based Decision Trees (RLDT), that uses Reinforcement Learning (RL) to actively examine a minimal number of features of a data point to classify it with high accuracy. Furthermore, RLDT optimizes a long term return, providing a better alternative to the traditional myopic greedy approach to growing decision trees. We demonstrate that this approach performs as well as batch learning algorithms and other online decision tree learning algorithms, while making significantly fewer queries about the features of the data points. We also show that RLDT can effectively handle concept drift.

IJCAI Conference 2015 Conference Paper

CEIL: A Scalable, Resolution Limit Free Approach for Detecting Communities in Large Networks

  • Vishnu Sankar
  • Balaraman Ravindran
  • Shivashankar S

Real world networks typically exhibit non-uniform edge densities with there being a higher concentration of edges within modules or communities. Various scoring functions have been proposed to quantify the quality of such communities. In this paper, we argue that the popular scoring functions suffer from certain limitations. We identify the necessary features that a scoring function should incorporate in order to characterize good community structure and propose a new scoring function, CEIL (Community detection using External and Internal scores in Large networks), which conforms closely with our characterization. We also demonstrate experimentally the superiority of our scoring function over the existing scoring functions. Modularity, a very popular scoring function, exhibits resolution limit, i.e., one cannot find communities that are much smaller in size compared to the size of the network. In many real world networks, community size does not grow in proportion to the network size. This implies that resolution limit is a serious problem in large networks. Modularity is still very popular since it offers many advantages such as fast algorithms for maximizing the score, and non-trivial community structures corresponding to the maxima. We show analytically that the CEIL score does not suffer from resolution limit. We also modify the Louvain method, one of the fastest greedy algorithms for maximizing modularity, to maximize the CEIL score. We show that our algorithm gives the expected communities in synthetic networks as opposed to maximizing modularity. We also show that the community labels given by our algorithm matches closely with the ground truth community labels in real world networks. Our algorithm is on par with Louvain method in computation time and hence scales well to large networks.

IJCAI Conference 2015 Conference Paper

Extended Discriminative Random Walk: A Hypergraph Approach to Multi-View Multi-Relational Transductive Learning

  • Sai Nageswar Satchidanand
  • Harini Ananthapadmanaban
  • Balaraman Ravindran

Transductive inference on graphs has been garnering increasing attention due to the connected nature of many real-life data sources, such as online social media and biological data (protein-protein interaction network, gene networks, etc.). Typically relational information in the data is encoded as edges in a graph but often it is important to model multi-way interactions, such as in collaboration networks and reaction networks. In this work we model multiway relations as hypergraphs and extend the discriminative random walk (DRW) framework, originally proposed for transductive inference on single graphs, to the case of multiple hypergraphs. We use the extended DRW framework for inference on multi-view, multi-relational data in a natural way, by representing attribute descriptions of the data also as hypergraphs. We further exploit the structure of hypergraphs to modify the random walk operator to take into account class imbalance in the data. This work is among very few approaches to explicitly address class imbalance in the in-network classification setting, using random walks. We compare our approach to methods proposed for inference on hypergraphs, and to methods proposed for multi-view data and show that empirically we achieve better performance. We also compare to methods specifically tailored for class-imbalanced data and show that our approach achieves comparable performance even on non-network data.

NeurIPS Conference 2014 Conference Paper

An Autoencoder Approach to Learning Bilingual Word Representations

  • Sarath Chandar A P
  • Stanislas Lauly
  • Hugo Larochelle
  • Mitesh Khapra
  • Balaraman Ravindran
  • Vikas Raykar
  • Amrita Saha

Cross-language learning allows us to use training data from one language to build models for a different language. Many approaches to bilingual learning require that we have word-level alignment of sentences from parallel corpora. In this work we explore the use of autoencoder-based methods for cross-language learning of vectorial word representations that are aligned between two languages, while not relying on word-level alignments. We show that by simply learning to reconstruct the bag-of-words representations of aligned sentences, within and between languages, we can in fact learn high-quality representations and do without word alignments. We empirically investigate the success of our approach on the problem of cross-language text classification, where a classifier trained on a given language (e.g., English) must learn to generalize to a different language (e.g., German). In experiments on 3 language pairs, we show that our approach achieves state-of-the-art performance, outperforming a method exploiting word alignments and a strong machine translation baseline.

ICRA Conference 2014 Conference Paper

RRTPI: Policy iteration on continuous domains using rapidly-exploring random trees

  • Manimaran Sivasamy Sivamurugan
  • Balaraman Ravindran

Path planning in continuous spaces has been a central problem in robotics. In the case of systems with complex dynamics, the performance of sampling based techniques relies on identifying a good approximation to the cost-to-go distance metric. We propose a technique that uses reinforcement learning to learn this distance metric on the fly from samples and combine it with existing sampling based planners to produce near optimal solutions. The resulting algorithm - RRTPI can solve problems with complex dynamics in a sample efficient manner while preserving asymptotic guarantees. We provide experimental evaluation of this technique on domains with underactuated and underpowered dynamics.

AAMAS Conference 2012 Conference Paper

Learning in a Small World

  • Arun Tejasvi Chaganty
  • Prateek Gaur
  • Balaraman Ravindran

Understanding how we are able to perform a diverse set of complex tasks is a central question for the Artificial Intelligence community. A popular approach is to use temporal abstraction as a framework to capture the notion of subtasks. However, this transfers the problem to finding the right subtasks, which is still an open problem. Existing approaches for subtask generation require too much knowledge of the environment, and the abstractions they create can overwhelm the agent. We propose a simple algorithm inspired by small world networks to learn subtasks while solving a task that requires virtually no information of the environment. Additionally, we show that the subtasks we learn can be easily composed by the agent to solve any other task; more formally, we prove that any task can be solved using only a logarithmic combination of these subtasks and primitive actions. Experimental results show that the subtasks we generate outperform other popular subtask generation schemes on standard domains.

ICRA Conference 2012 Conference Paper

Where do I look now? Gaze allocation during visually guided manipulation

  • José I. Nuñez-Varela
  • Balaraman Ravindran
  • Jeremy L. Wyatt

In this work we present principled methods for the coordination of a robot's oculomotor system with the rest of its body motor systems. The problem is to decide which physical actions to perform next and where the robot's gaze should be directed in order to gain information that is relevant to the success of its physical actions. Previous work on this problem has shown that a reward-based coordination mechanism provides an efficient solution. However, that approach does not allow the robot to move its gaze to different parts of the scene, it considers the robot to have only one motor system, and assumes that the actions have the same duration. The main contributions of our work are to extend that previous reward-based approach by making decisions about where to fixate the robot's gaze, handling multiple motor systems, and handling actions of variable duration. We compare our approach against two common baselines: random and round robin gaze allocation. We show how our method provides a more effective strategy to allocate gaze where it is needed the most.

UAI Conference 2011 Conference Paper

Fractional Moments on Bandit Problems

  • Ananda Narayanan B.
  • Balaraman Ravindran

Reinforcement learning addresses the dilemma between exploration to find profitable actions and exploitation to act according to the best observations already made. Bandit problems are one such class of problems in stateless environments that represent this explore/exploit situation. We propose a learning algorithm for bandit problems based on fractional expectation of rewards acquired. The algorithm is theoretically shown to converge on an ε-optimal arm and achieve O(n) sample complexity. Experimental results show the algorithm incurs substantially lower regrets than parameter-optimized ε-greedy and SoftMax approaches and other low sample complexity state-of-the-art techniques.

EWRL Workshop 2011 Conference Paper

Options with Exceptions

  • Munu Sairamesh
  • Balaraman Ravindran

An option is a policy fragment that represents a solution to a frequent subproblem encountered in a domain. Options may be treated as temporally extended actions thus allowing us to reuse that solution in solving larger problems. Often, it is hard to find subproblems that are exactly the same. These differences, however small, need to be accounted for in the reused policy. In this paper, the notion of options with exceptions is introduced to address such scenarios. This is inspired by the Ripple Down Rules approach used in data mining and knowledge representation communities. The goal is to develop an option representation so that small changes in the subproblem solutions can be accommodated without losing the original solution. We empirically validate the proposed framework on a simulated game domain.

ICRA Conference 2010 Conference Paper

Accurate mobile robot localization in indoor environments using bluetooth

  • Aswin N. Raghavan
  • Harini Ananthapadmanaban
  • Manimaran Sivasamy Sivamurugan
  • Balaraman Ravindran

In this paper, we describe an accurate method for localization of a mobile robot using bluetooth. We introduce novel approaches for obtaining distance estimates and trilateration that overcome the hitherto known limitations of using Bluetooth for localization. Our approach is reliable and has the potential of being scaled to multi-agent scenarios. The proposed approach was tested on a mobile robot, and we present the experimental results. The error obtained was 0.427 ± 0.229 m, which proves the accuracy of our method.
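Trilateration from range estimates can be solved generically by linearizing the range equations and taking a least-squares fit. The sketch below shows that textbook approach, not the paper's own distance-estimation and trilateration refinements:

```python
import numpy as np

def trilaterate(anchors, dists):
    """Least-squares 2D position from >= 3 anchor/distance pairs.

    Subtracting the first range equation from the others linearizes
    (x - x_i)^2 + (y - y_i)^2 = d_i^2 into
    2(x_i - x_0)x + 2(y_i - y_0)y = d_0^2 - d_i^2 + |p_i|^2 - |p_0|^2.
    """
    anchors = np.asarray(anchors, dtype=float)
    dists = np.asarray(dists, dtype=float)
    A = 2.0 * (anchors[1:] - anchors[0])
    b = (dists[0] ** 2 - dists[1:] ** 2
         + np.sum(anchors[1:] ** 2, axis=1) - np.sum(anchors[0] ** 2))
    pos, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pos

# With exact distances the true position is recovered:
anchors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
true = np.array([1.0, 1.0])
d = [np.linalg.norm(true - np.asarray(a)) for a in anchors]
print(trilaterate(anchors, d))   # ≈ [1. 1.]
```

With noisy Bluetooth-derived distances, the same least-squares system simply absorbs the errors; robustness then comes from the quality of the distance estimates, which is where the paper's contribution lies.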

AAMAS Conference 2010 Conference Paper

Game Theoretic Network Centrality: Exact Formulas and Efficient Algorithms

  • Karthik Aadithya
  • Balaraman Ravindran

The concept of centrality plays an important role in network analysis. Game theoretic centrality measures have been recently proposed, which are based on computing the Shapley Value (SV) of each node (agent) in a suitably constructed co-operative network game. However, the naive method of exact computation of SVs takes exponential time in the number of nodes. In this paper, we develop analytical formulas for computing SVs of nodes for various kinds of centrality-related co-operative games played on both weighted and unweighted networks. These formulas not only provide an efficient and error-free way of computing node centralities, but their surprisingly simple closed form expressions also offer intuition into why certain nodes are relatively more important to a network.
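For one game in this family, the coverage game v(S) = |S ∪ N(S)| on an unweighted graph, the closed form reported in this line of work is SV(v) = Σ over u in {v} ∪ N(v) of 1/(1 + deg(u)). The sketch below (illustrative code, not the paper's) checks that formula against brute-force permutation enumeration on a toy graph:

```python
import itertools
import math
from fractions import Fraction

def coverage_value(S, adj):
    """v(S) = number of nodes in S or adjacent to S."""
    covered = set(S)
    for v in S:
        covered.update(adj[v])
    return len(covered)

def closed_form_sv(adj):
    """Closed form for the coverage game: SV(v) = sum 1/(1+deg(u))."""
    return {v: sum(Fraction(1, 1 + len(adj[u])) for u in [v] + adj[v])
            for v in adj}

def brute_force_sv(adj):
    """Exact Shapley value by enumerating all permutations (exponential)."""
    nodes = list(adj)
    sv = {v: Fraction(0) for v in nodes}
    for perm in itertools.permutations(nodes):
        seen = []
        for v in perm:
            sv[v] += coverage_value(seen + [v], adj) - coverage_value(seen, adj)
            seen.append(v)
    n = math.factorial(len(nodes))
    return {v: x / n for v, x in sv.items()}

path = {0: [1], 1: [0, 2], 2: [1]}     # path graph 0-1-2
assert closed_form_sv(path) == brute_force_sv(path)
```

The closed form runs in time linear in the number of edges, versus the factorial cost of direct enumeration, which is the efficiency gap the abstract refers to.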

ECAI Conference 2010 Conference Paper

Multi Grain Sentiment Analysis using Collective Classification

  • S. Shivashankar
  • Balaraman Ravindran

Multi grain sentiment analysis is the task of simultaneously classifying sentiment expressed at different levels of granularity, as opposed to a single level at a time. Models built for multi grain sentiment analysis assume a fully labeled corpus at the fine grained level or coarse grained level or both. Huge amounts of online reviews are not fully labeled at any of the levels, but are partially labeled at both levels. We propose a multi grain collective classification framework to not only exploit the information available at all the levels but also use intra dependencies at each level and inter dependencies between the levels. We demonstrate empirically that the proposed framework enables better performance at both the levels compared to baseline approaches.

IROS Conference 2010 Conference Paper

Transfer learning across heterogeneous robots with action sequence mapping

  • Balaji Lakshmanan
  • Balaraman Ravindran

Transfer learning refers to reusing the knowledge gained while solving a task, to solve a related task more efficiently. Much of the prior work on transfer learning, assumes that identical robots were involved in both the tasks. In this work we focus on transfer learning across heterogeneous robots while solving the same task. The action capabilities of the robots are different and are unknown to each other. The actions of one robot cannot be mimicked by another even if known. Such situations arise in multi-robot systems. The objective then is to speed-up the learning of one robot, i.e., reduce its initial exploration, using very minimal knowledge from a different robot. We propose a framework in which the knowledge transfer is effected through a pseudo reward function generated from the trajectories followed by a different robot while solving the same task. The framework can effectively be used even with a single trajectory. We extend the framework to enable the robot to learn an equivalence between certain sequences of its actions and certain sequences of actions of the other robot. These are then used to learn faster on subsequent tasks. We empirically validate the framework in a rooms world domain.

IJCAI Conference 2007 Conference Paper

  • Pranjal Awasthi
  • Aakanksha Gagrani
  • Balaraman Ravindran

In this paper we present a discriminative framework based on conditional random fields for stochastic modeling of images in a hierarchical fashion. The main advantage of the proposed framework is its ability to incorporate a rich set of interactions among the image sites. We achieve this by inducing a hierarchy of hidden variables over the given label field. The proposed tree like structure of our model eliminates the need for a huge parameter space and at the same time permits the use of exact and efficient inference procedures based on belief propagation. We demonstrate the generality of our approach by applying it to two important computer vision tasks, namely image labeling and object detection. The model parameters are trained using the contrastive divergence algorithm. We report the performance on real world images and compare it with the existing approaches.

IJCAI Conference 2007 Conference Paper

  • Balaraman Ravindran
  • Andrew G. Barto
  • Vimal Mathew

Deictic representation is a representational paradigm, based on selective attention and pointers, that allows an agent to learn and reason about rich complex environments. In this article we present a hierarchical reinforcement learning framework that employs aspects of deictic representation. We also present a Bayesian algorithm for learning the correct representation for a given sub-problem and empirically validate it on a complex game environment.

IJCAI Conference 2003 Conference Paper

SMDP Homomorphisms: An Algebraic Approach to Abstraction in Semi-Markov Decision Processes

  • Balaraman Ravindran
  • Andrew G. Barto

To operate effectively in complex environments learning agents require the ability to selectively ignore irrelevant details and form useful abstractions. In this article we consider the question of what constitutes a useful abstraction in a stochastic sequential decision problem modeled as a semi-Markov Decision Process (SMDP). We introduce the notion of SMDP homomorphism and argue that it provides a useful tool for a rigorous study of abstraction for SMDPs. We present an SMDP minimization framework and an abstraction framework for factored MDPs based on SMDP homomorphisms. We also model different classes of abstractions that arise in hierarchical systems. Although we use the options framework for purposes of illustration, the ideas are more generally applicable. We also show that the conditions for abstraction we employ are a generalization of earlier work by Dietterich as applied to the options framework.
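For the MDP special case, the homomorphism conditions can be stated concretely; this is the standard commutativity formulation (the SMDP version additionally constrains the joint transition-time distributions, omitted here):

```latex
% An MDP homomorphism h = (f, \{g_s\}) from M = (S, A, P, R)
% to M' = (S', A', P', R') satisfies, for all s, s' \in S and a \in A:
\begin{align}
  R'\bigl(f(s),\, g_s(a)\bigr) &= R(s, a) \\
  P'\bigl(f(s),\, g_s(a),\, f(s')\bigr)
    &= \sum_{s'' \in f^{-1}(f(s'))} P(s, a, s'')
\end{align}
% Rewards are preserved exactly, and transition probabilities are
% aggregated over each block of states that f maps together, so a
% policy lifted from the abstract MDP M' is sound in M.
```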

NeurIPS Conference 1998 Conference Paper

Improved Switching among Temporally Abstract Actions

  • Richard Sutton
  • Satinder Singh
  • Doina Precup
  • Balaraman Ravindran

In robotics and other control applications it is commonplace to have a pre-existing set of controllers for solving subtasks, perhaps hand-crafted or previously learned or planned, and still face a difficult problem of how to choose and switch among the controllers to solve an overall task as well as possible. In this paper we present a framework based on Markov decision processes and semi-Markov decision processes for phrasing this problem, a basic theorem regarding the improvement in performance that can be obtained by switching flexibly between given controllers, and example applications of the theorem. In particular, we show how an agent can plan with these high-level controllers and then use the results of such planning to find an even better plan, by modifying the existing controllers, with negligible additional cost and no re-planning. In one of our examples, the complexity of the problem is reduced from 24 billion state-action pairs to less than a million state-controller pairs. In many applications, solutions to parts of a task are known, either because they were hand-crafted by people or because they were previously learned or planned. For example, in robotics applications, there may exist controllers for moving joints to positions, picking up objects, controlling eye movements, or navigating along hallways. More generally, an intelligent system may have available to it several temporally extended courses of action to choose from. In such cases, a key challenge is to take full advantage of the existing temporally extended actions, to choose or switch among them effectively, and to plan at their level rather than at the level of individual actions. Recently, several researchers have begun to address these challenges within the framework of reinforcement learning and Markov decision processes (e.g., Singh, 1992; Kaelbling, 1993; Dayan & Hinton, 1993; Thrun and Schwartz, 1995; Sutton, 1995; Dietterich, 1998; Parr & Russell, 1998; McGovern, Sutton & Fagg, 1997). Common to much of this recent work is the modeling of a temporally extended action as a policy (controller) and a condition for terminating, which we together refer to as an option (Sutton, Precup & Singh, 1998). In this paper we consider the problem of effectively combining given options into one overall policy, generalizing prior work by Kaelbling (1993). Sections 1-3 introduce the framework; our new results are in Sections 4 and 5.