Author name cluster

Sridhar Mahadevan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

64 papers

2 author rows

AAAI Conference 2026 Conference Paper

Rethinking AI: From Functions to Functors

Sridhar Mahadevan

We propose a new theoretical foundation for artificial intelligence (AI) and machine learning (ML), building on ideas in pure mathematics relating to categories and functors. This paper builds on our AAAI 2025 tutorial Thinking with Functors: Category Theory for A(G)I, which provides background material. In addition, our recent papers on intuitionistic j-do calculus in Topos Causal Models} and GAIA: Categorical Foundations of Generative AI, illustrate how to generalize well-known formalisms in AI, such as causal inference and deep learning, to a category-theoretic setting.

PDF Details DOI

NeurIPS Conference 2025 Conference Paper

Universal Causal Inference in a Topos

Sridhar Mahadevan

In this paper, we explore the universal properties underlying causal inference by formulating it in terms of a topos. More concretely, we introduce topos causal models (TCMs), a strict generalization of the popular structural causal models (SCMs). A topos category has several properties that make it attractive: a general theory for how to combine local functions that define ``independent causal mechanisms" into a consistent global function building on the theory of sheaves in a topos; a generic way to define causal interventions using a subobject classifier in a topos category; and finally, an internal logical language for causal and counterfactual reasoning that emerges from the topos itself. A striking characteristic of subobject classifiers is that they induce an intuitionistic logic, whose semantics is based on the partially ordered lattice of subobjects. We show that the underlying subobject classifier for causal inference is not Boolean in general, but forms a Heyting algebra. We define the internal Mitchell-B\'enabou language, a typed local set theory, associated with causal models, and its associated Kripke-Joyal intuitionistic semantics. We prove a universal property of TCM, namely that any causal functor mapping decomposable structure to probabilistic semantics factors uniquely through a TCM representation.

AAAI Conference 2023 Conference Paper

Smoothed Online Combinatorial Optimization Using Imperfect Predictions

Kai Wang
Zhao Song
Georgios Theocharous
Sridhar Mahadevan

Smoothed online combinatorial optimization considers a learner who repeatedly chooses a combinatorial decision to minimize an unknown changing cost function with a penalty on switching decisions in consecutive rounds. We study smoothed online combinatorial optimization problems when an imperfect predictive model is available, where the model can forecast the future cost functions with uncertainty. We show that using predictions to plan for a finite time horizon leads to regret dependent on the total predictive uncertainty and an additional switching cost. This observation suggests choosing a suitable planning window to balance between uncertainty and switching cost, which leads to an online algorithm with guarantees on the upper and lower bounds of the cumulative regret. Empirically, our algorithm shows a significant improvement in cumulative regret compared to other baselines in synthetic online distributed streaming problems.

PDF Details DOI

ICML Conference 2020 Conference Paper

Optimizing for the Future in Non-Stationary MDPs

Yash Chandak
Georgios Theocharous
Shiv Shankar
Martha White
Sridhar Mahadevan
Philip S. Thomas

Most reinforcement learning methods are based upon the key assumption that the transition dynamics and reward functions are fixed, that is, the underlying Markov decision process is stationary. However, in many real-world applications, this assumption is violated, and using existing algorithms may result in a performance lag. To proactively search for a good future policy, we present a policy gradient algorithm that maximizes a forecast of future performance. This forecast is obtained by fitting a curve to the counter-factual estimates of policy performance over time, without explicitly modeling the underlying non-stationarity. The resulting algorithm amounts to a non-uniform reweighting of past data, and we observe that minimizing performance over some of the data from past episodes can be beneficial when searching for a policy that maximizes future performance. We show that our algorithm, called Prognosticator, is more robust to non-stationarity than two online adaptation techniques, on three simulated problems motivated by real-world applications.

AAAI Conference 2018 Conference Paper

Imagination Machines: A New Challenge for Artificial Intelligence

Sridhar Mahadevan

The aim of this paper is to propose a new overarching challenge for AI: the design of imagination machines. Imagination has been defined as the capacity to mentally transcend time, place, and/or circumstance. Much of the success of AI currently comes from a revolution in data science, specifically the use of deep learning neural networks to extract structure from data. This paper argues for the development of a new field called imagination science, which extends data science beyond its current realm of learning probability distributions from samples. Numerous examples are given in the paper to illustrate that human achievements in the arts, literature, poetry, and science may lie beyond the realm of data science, because they require abilities that go beyond finding correlations: for example, generating samples from a novel probability distribution different from the one given during training; causal reasoning to uncover interpretable explanations; or analogical reasoning to generalize to novel situations (e.g., imagination in art, representing alien life in a distant galaxy, understanding a story about talking animals, or inventing representations to model the large-scale structure of the universe). We describe the key challenges in automating imagination, discuss connections between ongoing research and imagination, and outline why automation of imagination provides a powerful launching pad for transforming AI.

JAIR Journal 2018 Journal Article

Proximal Gradient Temporal Difference Learning: Stable Reinforcement Learning with Polynomial Sample Complexity

Bo Liu
Ian Gemp
Mohammad Ghavamzadeh
Ji Liu
Sridhar Mahadevan
Marek Petrik

In this paper, we introduce proximal gradient temporal difference learning, which provides a principled way of designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not by starting from their original objective functions, as previously attempted, but rather from a primal-dual saddle-point objective function. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and do not provide any finite-sample analysis. We also propose an accelerated algorithm, called GTD2-MP, that uses proximal "mirror maps" to yield an improved convergence rate. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.

PDF Details DOI

IJCAI Conference 2016 Conference Paper

Proximal Gradient Temporal Difference Learning Algorithms

Bo Liu
Ji Liu
Mohammad Ghavamzadeh
Sridhar Mahadevan
Marek Petrik

In this paper, we describe proximal gradient temporal difference learning, which provides a principled way for designing and analyzing true stochastic gradient temporal difference learning algorithms. We show how gradient TD (GTD) reinforcement learning methods can be formally derived, not with respect to their original objective functions as previously attempted, but rather with respect to primal-dual saddle-point objective functions. We also conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and no finite-sample analysis had been attempted. An accelerated algorithm is also proposed, namely GTD2-MP, which use proximal mirror maps to yield acceleration. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.

AAAI Conference 2015 Conference Paper

Aligning Mixed Manifolds

Thomas Boucher
CJ Carey
Sridhar Mahadevan
Melinda Dyar

Current manifold alignment methods can effectively align data sets that are drawn from a non-intersecting set of manifolds. However, as data sets become increasingly high-dimensional and complex, this assumption may not hold. This paper proposes a novel manifold alignment algorithm, low rank alignment (LRA), that uses a low rank representation (instead of a nearest neighbor graph construction) to embed and align data sets drawn from mixtures of manifolds. LRA does not require the tuning of a sensitive nearest neighbor hyperparameter or prior knowledge of the number of manifolds, both of which are common drawbacks with existing techniques. We demonstrate the effectiveness of our algorithm in two real-world applications: a transfer learning task in spectroscopy and a canonical information retrieval task.

UAI Conference 2015 Conference Paper

Finite-Sample Analysis of Proximal Gradient TD Algorithms

Bo Liu 0006
Ji Liu 0002
Mohammad Ghavamzadeh
Sridhar Mahadevan
Marek Petrik

In this paper, we show for the first time how gradient TD (GTD) reinforcement learning methods can be formally derived as true stochastic gradient algorithms, not with respect to their original objective functions as previously attempted, but rather using derived primal-dual saddle-point objective functions. We then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Previous analyses of this class of algorithms use stochastic approximation techniques to prove asymptotic convergence, and no finite-sample analysis had been attempted. Two novel GTD algorithms are also proposed, namely projected GTD2 and GTD2-MP, which use proximal “mirror maps” to yield improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis imply that the GTD family of algorithms are comparable and may indeed be preferred over existing least squares TD methods for off-policy learning, due to their linear complexity. We provide experimental results showing the improved performance of our accelerated gradient TD methods.

RLDM Conference 2015 Conference Abstract

Proximal Reinforcement Learning: Learning to Act in Primal Dual Spaces

Sridhar Mahadevan

In this talk, we set forth a new framework for reinforcement learning developed by us over the past few years, one that yields mathematically rigorous solutions to longstanding fundamental questions that have re- mained unresolved over the past three decades: (i) how to design “safe” reinforcement learning algorithms that remain in a stable region of the parameter space (ii) how to design true stochastic gradient temporal- difference learning algorithms and give finite-sample bounds characterizing their convergence? (iii) more broadly, how to specify a flexible algorithmic framework that simplifies the design of reinforcement learning algorithms for various objective functions? The most important idea that emerges as a motif throughout the solution of these three problems is the use of primal dual spaces connected through the use of “mirror maps”: Legendre transforms that elegantly unify and generalize a myriad past algorithms for solving reinforcement learning problems, from natural gradient actor-critic methods and exponentiated-gradient methods to gradient TD and sparse RL methods. We introduce mirror-descent RL, a powerful family of RL methods that uses mirror maps through different Legendre transforms to achieve reliability, scalability, and sparsity. Our work builds extensively on the past 50 years of advances in stochastic optimization, from the study of proximal mappings, monotone operators, and operator splitting began in the mid-1950s to recent advances in first-order optimization and saddle-point extragradient methods for solving variational inequalities.

AAAI Conference 2014 Conference Paper

Manifold Spanning Graphs

CJ Carey
Sridhar Mahadevan

Graph construction is the essential first step for nearly all manifold learning algorithms. While many applications assume that a simple k-nearest or -close neighbors graph will accurately model the topology of the underlying manifold, these methods often require expert tuning and may not produce high quality graphs. In this paper, the hyperparameter sensitivity of existing graph construction methods is demonstrated. We then present a new algorithm for unsupervised graph construction, based on minimal assumptions about the input data and its manifold structure.

AAAI Conference 2013 Conference Paper

Basis Adaptation for Sparse Nonlinear Reinforcement Learning

Sridhar Mahadevan
Stephen Giguere
Nicholas Jacek

This paper presents a new approach to representation discovery in reinforcement learning (RL) using basis adaptation. We introduce a general framework for basis adaptation as nonlinear separable least-squares value function approximation based on ﬁnding Fréchet gradients of an error function using variable projection functionals. We then present a scalable proximal gradientbased approach for basis adaptation using the recently proposed mirror-descent framework for RL. Unlike traditional temporal-difference (TD) methods for RL, mirror descent based RL methods undertake proximal gradient updates of weights in a dual space, which is linked together with the primal space using a Legendre transform involving the gradient of a strongly convex function. Mirror descent RL can be viewed as a proximal TD algorithm using Bregman divergence as the distance generating function. We present a new class of regularized proximal-gradient based TD methods, which combine feature selection through sparse L1 regularization and basis adaptation. Experimental results are provided to illustrate and validate the approach.

IJCAI Conference 2013 Conference Paper

Manifold Alignment Preserving Global Geometry

Chang Wang
Sridhar Mahadevan

This paper proposes a novel algorithm for manifold alignment preserving global geometry. This approach constructs mapping functions that project data instances from different input domains to a new lower-dimensional space, simultaneously matching the instances in correspondence and preserving global distances between instances within the original domains. In contrast to previous approaches, which are largely based on preserving local geometry, the proposed approach is suited to applications where the global manifold geometry needs to be respected. We evaluate the effectiveness of our algorithm for transfer learning in two real-world cross-lingual information retrieval tasks.

PDF Details DOI

AAAI Conference 2013 Conference Paper

Multiscale Manifold Learning

Chang Wang
Sridhar Mahadevan

Many high-dimensional data sets that lie on a lowdimensional manifold exhibit nontrivial regularities at multiple scales. Most work in manifold learning ignores this multiscale structure. In this paper, we propose approaches to explore the deep structure of manifolds. The proposed approaches are based on the diffusion wavelets framework, data driven, and able to directly process directional neighborhood relationships without ad-hoc symmetrization. The proposed multiscale algorithms are evaluated using both synthetic and real-world data sets, and shown to outperform previous manifold learning methods.

NeurIPS Conference 2013 Conference Paper

Projected Natural Actor-Critic

Philip Thomas
William Dabney
Stephen Giguere
Sridhar Mahadevan

Natural actor-critics are a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability - their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.

AAAI Conference 2012 Conference Paper

Manifold Warping: Manifold Alignment over Time

Hoa Vu
Clifton Carey
Sridhar Mahadevan

Knowledge transfer is computationally challenging, due in part to the curse of dimensionality, compounded by source and target domains expressed using different features (e. g. , documents written in different languages). Recent work on manifold learning has shown that data collected in real-world settings often have high-dimensional representations, but lie on low-dimensional manifolds. Furthermore, data sets collected from similar generating processes often present different high-dimensional views, even though their underlying manifolds are similar. The ability to align these data sets and extract this common structure is critical for many transfer learning tasks. In this paper, we present a novel framework for aligning two sequentially-ordered data sets, taking advantage of a shared low-dimensional manifold representation. Our approach combines traditional manifold alignment and dynamic time warping algorithms using alternating projections. We also show that the previously-proposed canonical time warping algorithm is a special case of our approach. We provide a theoretical formulation as well as experimental results on synthetic and real-world data, comparing manifold warping to other alignment methods.

NeurIPS Conference 2012 Conference Paper

Regularized Off-Policy TD-Learning

Bo Liu
Sridhar Mahadevan
Ji Liu

We present a novel $l_1$ regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm.

UAI Conference 2012 Conference Paper

Sparse Q-learning with Mirror Descent

Sridhar Mahadevan
Bo Liu 0006

This paper explores a new framework for reinforcement learning based on online convex optimization, in particular mirror descent and related algorithms. Mirror descent can be viewed as an enhanced gradient method, particularly suited to minimization of convex functions in highdimensional spaces. Unlike traditional gradient methods, mirror descent undertakes gradient updates of weights in both the dual space and primal space, which are linked together using a Legendre transform. Mirror descent can be viewed as a proximal algorithm where the distance generating function used is a Bregman divergence. A new class of proximal-gradient based temporaldifference (TD) methods are presented based on different Bregman divergences, which are more powerful than regular TD learning. Examples of Bregman divergences that are studied include pnorm functions, and Mahalanobis distance based on the covariance of sample gradients. A new family of sparse mirror-descent reinforcement learning methods are proposed, which are able to find sparse fixed points of an l1-regularized Bellman equation at significantly less computational cost than previous methods based on secondorder matrix methods. An experimental study of mirror-descent reinforcement learning is presented using discrete and continuous Markov decision processes. Reinforcement learning (RL) models the interaction between the agent and an environment as a Markov decision process (MDP), a widely adopted model of sequential decision-making. The major aim of this paper is to investigate a new framework for reinforcement learning and sequential decision-making, based on ideas from online convex optimization, in particular mirror descent [NY83] and related algorithms [BT03, Nes09]. Mirror descent generalizes classical first-order gradient descent to non-Euclidean geometries by using a distance-generating function specific to a particular geometry. Mirror descent can be viewed as a proximal algorithm [BT03], where the distance generating function used is a Bregman divergence [Bre67]. Mirror descent is a powerful framework for convex optimization in high-dimensional spaces: an early success of this approach was its use in positron-emission tomography (PET) imaging involving an optimization problem with millions of variables [BTMN01]. The mirror descent algorithm has led to new methods for sparse classification and regression [DHS11, SST11a]. Mirror descent has also been extended from its original deterministic setting [NY83] to a stochastic approximation setting [NJLS09], which makes it highly appropriate for RL, as the standard temporal-difference (TD) learning method for solving MDPs also has its origins in stochastic approximation theory [Bor08]. l1 regularized methods for solving MDPs have become a topic of recent attention, including a technique combining least-squares TD (LSTD) with least-angle regression (LARS) [KN09]; another method for combining l1 regularization with approximate linear programming [PTPZ10]; finally, a linear complementarity formulation of l1 regularization [JPWP10]. These methods involve matrix inversion, requiring near cubic complexity in the number of (active) features. In contrast, mirror-descent based RL methods promise similar performance guarantees involving only linear complexity in the number of features. Recent work in online learning [DHS11, SST11a] has explored the application of mirror descent in developing sparse methods for regularized classification and regression. This paper investigates the use of mirror-descent for sparse reinforcement learning.

IJCAI Conference 2011 Conference Paper

Heterogeneous Domain Adaptation Using Manifold Alignment

Chang Wang
Sridhar Mahadevan

We propose a manifold alignment based approach for heterogeneous domain adaptation. A key aspect of this approach is to construct mappings to link different feature spaces in order to transfer knowledge across domains. The new approach can reuse labeled data from multiple source domains in a target domain even in the case when the input domains do not share any common features or instances. As a pre-processing step, our approach can also be combined with existing domain adaptation approaches to learn a common feature space for all input domains. This paper extends existing manifold alignment approaches by making use of labels rather than correspondences to align the manifolds. This extension significantly broadens the application scope of manifold alignment, since the correspondence relationship required by existing alignment approaches is hard to obtain in many applications.

PDF Details DOI

IJCAI Conference 2011 Conference Paper

Jointly Learning Data-Dependent Label and Locality-Preserving Projections

Chang Wang
Sridhar Mahadevan

This paper describes a novel framework to jointly learn data-dependent label and locality-preserving projections. Given a set of data instances from multiple classes, the proposed approach can automatically learn which classes are more similar to each other, and construct discriminative features using both labeled and unlabeled data to map similar classes to similar locations in a lower dimensional space. In contrast to linear discriminant analysis (LDA) and its variants, which can only return c-1 features for a problem with c classes, the proposed approach can generate d features, where d is bounded only by the number of the input features. We describe and evaluate the new approach both theoretically and experimentally, and compare its performance with other state of the art methods.

PDF Details DOI

NeurIPS Conference 2010 Conference Paper

Basis Construction from Power Series Expansions of Value Functions

Sridhar Mahadevan
Bo Liu

This paper explores links between basis construction methods in Markov decision processes and power series expansions of value functions. This perspective provides a useful framework to analyze properties of existing bases, as well as provides insight into constructing more effective bases. Krylov and Bellman error bases are based on the Neumann series expansion. These bases incur very large initial Bellman errors, and can converge rather slowly as the discount factor approaches unity. The Laurent series expansion, which relates discounted and average-reward formulations, provides both an explanation for this slow convergence as well as suggests a way to construct more efficient basis representations. The first two terms in the Laurent series represent the scaled average-reward and the average-adjusted sum of rewards, and subsequent terms expand the discounted value function using powers of a generalized inverse called the Drazin (or group inverse) of a singular matrix derived from the transition matrix. Experiments show that Drazin bases converge considerably more quickly than several other bases, particularly for large values of the discount factor. An incremental variant of Drazin bases called Bellman average-reward bases (BARBs) is described, which provides some of the same benefits at lower computational cost.

AAMAS Conference 2010 Conference Paper

Basis Function Construction for Hierarchical Reinforcement Learning

Sarah Osentoski
Sridhar Mahadevan

Much past work on solving Markov decision processes (MDPs) using reinforcement learning (RL) has relied on combining parameter estimation methods with hand-designed function approximationarchitectures for representing value functions. Recently, there hasbeen growing interest in a broader framework that combines representation discovery and control learning, where value functions areapproximated using a linear combination of task-dependent basisfunctions learned during the course of solving a particular MDP. This paper introduces an approach to automatic basis function construction for hierarchical reinforcement learning (HRL). Our approach generalizes past work on basis construction to multi-levelaction hierarchies by forming a compressed representation of asemi-Markov decision process (SMDP) at multiple levels of temporal abstraction. The specific approach is based on hierarchicalspectral analysis of graphs induced on an SMDP's state space fromsample trajectories. We present experimental results on benchmarkSMDPs, showing significant speedups when compared to hand-designed approximation architectures.

AAAI Conference 2010 Conference Paper

Compressing POMDPs Using Locality Preserving Non-Negative Matrix Factorization

Georgios Theocharous
Sridhar Mahadevan

Partially Observable Markov Decision Processes (POMDPs) are a well-established and rigorous framework for sequential decision-making under uncertainty. POMDPs are well-known to be intractable to solve exactly, and there has been significant work on finding tractable approximation methods. One well-studied approach is to find a compression of the original POMDP by projecting the belief states to a lower-dimensional space. We present a novel dimensionality reduction method for POMDPs based on locality preserving non-negative matrix factorization. Unlike previous approaches, such as Krylov compression and regular non-negative matrix factorization, our approach preserves the local geometry of the belief space manifold. We present results on standard benchmark POMDPs showing improved performance over previously explored compression algorithms for POMDPs.

AAAI Conference 2010 Conference Paper

Representation Discovery in Sequential Decision Making

Sridhar Mahadevan

Automatically constructing novel representations of tasks from analysis of state spaces is a longstanding fundamental challenge in AI. I review recent progress on this problem for sequential decision making tasks modeled as Markov decision processes. Specifically, I discuss three classes of representation discovery problems: finding functional, state, and temporal abstractions. I describe solution techniques varying along several dimensions: diagonalization or dilation methods using approximate or exact transition models; rewardspecific vs reward-invariant methods; global vs. local representation construction methods; multiscale vs. flat discovery methods; and finally, orthogonal vs. redundant representation discovery methods. I conclude by describing a number of open problems for future work.

IJCAI Conference 2009 Conference Paper

Chang Wang
Sridhar Mahadevan

Manifold alignment has been found to be useful in many areas of machine learning and data mining. In this paper we introduce a novel manifold alignment approach, which differs from “semisupervised alignment” and “Procrustes alignment” in that it does not require predetermining correspondences. Our approach learns a projection that maps data instances (from two different spaces) to a lower dimensional space simultaneously matching the local geometry and preserving the neighborhood relationship within each set. This approach also builds connections between spaces de- ﬁned by different features and makes direct knowledge transfer possible. The performance of our algorithm is demonstrated and validated in a series of carefully designed experiments in information retrieval and bioinformatics.

IJCAI Conference 2009 Conference Paper

Chang Wang
Sridhar Mahadevan

We introduce a nonparametric approach to multiscale analysis of document corpora using a hierarchical matrix analysis framework called diffusion wavelets. In contrast to eigenvector methods, diffusion wavelets construct multiscale basis functions. In this framework, a hierarchy is automatically constructed by an iterative series of dilation and orthogonalization steps beginning with an initial set of orthogonal basis functions, such as the unitvector bases. Each set of basis functions at a given level is constructed from the bases at the lower level by dilation using the dyadic powers of a diffusion operator. A novel aspect of our work is that the diffusion analysis is conducted on the space of variables (words), instead of instances (documents). This approach can automatically and efﬁciently determine the number of levels of the topical hierarchy, as well as the topics at each level. Multiscale analysis of document corpora is achieved by using the projections of the documents onto the spaces spanned by basis functions at different levels. Further, when the input term-term matrix is a “local” diffusion operator, the algorithm runs in time approximately linear in the number of non-zero elements of the matrix. The approach is illustrated on various data sets including NIPS conference papers, 20 Newsgroups and TDT2 data.

AAAI Conference 2008 Conference Paper

Fast Spectral Learning using Lanczos Eigenspace Projections

Sridhar Mahadevan

The core computational step in spectral learning – finding the projection of a function onto the eigenspace of a symmetric operator, such as a graph Laplacian – generally incurs a cubic computational complexity O(N3 ). This paper describes the use of Lanczos eigenspace projections for accelerating spectral projections, which reduces the complexity to O(nTop + n2 N) operations, where n is the number of distinct eigenvalues, and Top is the complexity of multiplying T by a vector. This approach is based on diagonalizing the restriction of the operator to the Krylov space spanned by the operator and a projected function. Even further savings can be accrued by constructing an approximate Lanczos tridiagonal representation of the Krylov-space restricted operator. A key novelty of this paper is the use of Krylov-subspace modulated Lanczos acceleration for multi-resolution wavelet analysis. A challenging problem of learning to control a robot arm is used to test the proposed approach.

ICML Conference 2008 Conference Paper

Manifold alignment using Procrustes analysis

Chang Wang 0001
Sridhar Mahadevan

ICML Conference 2007 Conference Paper

Adaptive mesh compression in 3D computer graphics using multiscale manifold learning

Sridhar Mahadevan

This paper investigates compression of 3D objects in computer graphics using manifold learning. Spectral compression uses the eigenvectors of the graph Laplacian of an object's topology to adaptively compress 3D objects. 3D compression is a challenging application domain: object models can have > 10 5 vertices, and reliably computing the basis functions on large graphs is numerically challenging. In this paper, we introduce a novel multiscale manifold learning approach to 3D mesh compression using diffusion wavelets , a general extension of wavelets to graphs with arbitrary topology. Unlike the "global" nature of Laplacian bases, diffusion wavelet bases are compact, and multiscale in nature. We decompose large graphs using a fast graph partitioning method, and combine local multiscale wavelet bases computed on each subgraph. We present results showing that multiscale diffusion wavelets bases are superior to the Laplacian bases for adaptive compression of large 3D objects.

ICML Conference 2007 Conference Paper

Constructing basis functions from directed graphs for value function approximation

Jeffrey Johns
Sridhar Mahadevan

Basis functions derived from an undirected graph connecting nearby samples from a Markov decision process (MDP) have proven useful for approximating value functions. The success of this technique is attributed to the smoothness of the basis functions with respect to the state space geometry. This paper explores the properties of bases created from directed graphs which are a more natural fit for expressing state connectivity. Digraphs capture the effect of non-reversible MDPs whose value functions may not be smooth across adjacent states. We provide an analysis using the Dirichlet sum of the directed graph Laplacian to show how the smoothness of the basis functions is affected by the graph's invariant distribution. Experiments in discrete and continuous MDPs with non-reversible actions demonstrate a significant improvement in the policies learned using directed graph bases.

JMLR Journal 2007 Journal Article

Hierarchical Average Reward Reinforcement Learning

Mohammad Ghavamzadeh
Sridhar Mahadevan

Hierarchical reinforcement learning (HRL) is a general framework for scaling reinforcement learning (RL) to problems with large state and action spaces by using the task (or action) structure to restrict the space of policies. Prior work in HRL including HAMs, options, MAXQ, and PHAMs has been limited to the discrete-time discounted reward semi-Markov decision process (SMDP) model. The average reward optimality criterion has been recognized to be more appropriate for a wide class of continuing tasks than the discounted framework. Although average reward RL has been studied for decades, prior work has been largely limited to flat policy representations. In this paper, we develop a framework for HRL based on the average reward optimality criterion. We investigate two formulations of HRL based on the average reward SMDP model, both for discrete-time and continuous-time. These formulations correspond to two notions of optimality that have been previously explored in HRL: hierarchical optimality and recursive optimality. We present algorithms that learn to find hierarchically and recursively optimal average reward policies under discrete-time and continuous-time average reward SMDP models. We use two automated guided vehicle (AGV) scheduling tasks as experimental testbeds to study the empirical performance of the proposed algorithms. The first problem is a relatively simple AGV scheduling task, in which the hierarchically and recursively optimal policies are different. We compare the proposed algorithms with three other HRL methods, including a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm on this problem. The second problem is a larger AGV scheduling task. We model this problem using both discrete-time and continuous-time models. We use a hierarchical task decomposition in which the hierarchically and recursively optimal policies are the same for this problem. We compare the performance of the proposed algorithms with a hierarchically optimal discounted reward algorithm and a recursively optimal discounted reward algorithm, as well as a non-hierarchical average reward algorithm. The results show that the proposed hierarchical average reward algorithms converge to the same performance as their discounted reward counterparts. [abs] [ pdf ][ bib ] &copy JMLR 2007. ( edit, beta )

ICML Conference 2007 Conference Paper

Learning state-action basis functions for hierarchical MDPs

Sarah Osentoski
Sridhar Mahadevan

ICAPS Conference 2007 Conference Paper

Learning to Plan Using Harmonic Analysis of Diffusion Models

Sridhar Mahadevan
Sarah Osentoski
Jeffrey Johns
Kimberly Ferguson
Chang Wang 0001

This paper describes a new harmonic analysis framework for planning based on estimating a diffusion model that models information flow on a graph (discrete state space) or a manifold (continuous state space) using the Laplace heat equation. Diffusion models can be significantly easier to learn than transition models, and yet provide much of the same speedups in performance over model-free methods. Several types of diffusion models are described, including undirected and directed state-based models, as well as state-action models. Two methods for constructing novel basis representations from diffusion models are described: Fourier methods diagonalize a symmetric diffusion operator called the Laplacian; Wavelet methods dilate unit basis functions progressively using powers of the diffusion operator. A new variant of policy iteration -- called representation policy iteration -- is described consisting of an outer loop that estimates new basis functions by diagonalization or dilation, and an inner loop that learns the best policy representable within the linear span of the current basis functions. Results from continuous and discrete MDPs are provided to illustrate the new approach.

JMLR Journal 2007 Journal Article

Proto-value Functions: A Laplacian Framework for Learning Representation and Control in Markov Decision Processes

Sridhar Mahadevan
Mauro Maggioni

This paper introduces a novel spectral framework for solving Markov decision processes (MDPs) by jointly learning representations and optimal policies. The major components of the framework described in this paper include: (i) A general scheme for constructing representations or basis functions by diagonalizing symmetric diffusion operators (ii) A specific instantiation of this approach where global basis functions called proto-value functions (PVFs) are formed using the eigenvectors of the graph Laplacian on an undirected graph formed from state transitions induced by the MDP (iii) A three-phased procedure called representation policy iteration comprising of a sample collection phase, a representation learning phase that constructs basis functions from samples, and a final parameter estimation phase that determines an (approximately) optimal policy within the (linear) subspace spanned by the (current) basis functions. (iv) A specific instantiation of the RPI framework using least-squares policy iteration (LSPI) as the parameter estimation method (v) Several strategies for scaling the proposed approach to large discrete and continuous state spaces, including the Nyström extension for out-of-sample interpolation of eigenfunctions, and the use of Kronecker sum factorization to construct compact eigenfunctions in product spaces such as factored MDPs (vi) Finally, a series of illustrative discrete and continuous control tasks, which both illustrate the concepts and provide a benchmark for evaluating the proposed approach. Many challenges remain to be addressed in scaling the proposed framework to large MDPs, and several elaboration of the proposed framework are briefly summarized at the end. [abs] [ pdf ][ bib ] &copy JMLR 2007. ( edit, beta )

ICML Conference 2006 Conference Paper

Fast direct policy evaluation using multiscale analysis of Markov diffusion processes

Mauro Maggioni
Sridhar Mahadevan

JAAMAS Journal 2006 Journal Article

Hierarchical multi-agent reinforcement learning

Mohammad Ghavamzadeh
Sridhar Mahadevan
Rajbala Makar

Abstract In this paper, we investigate the use of hierarchical reinforcement learning (HRL) to speed up the acquisition of cooperative multi-agent tasks. We introduce a hierarchical multi-agent reinforcement learning (RL) framework, and propose a hierarchical multi-agent RL algorithm called Cooperative HRL. In this framework, agents are cooperative and homogeneous (use the same task decomposition). Learning is decentralized, with each agent learning three interrelated skills: how to perform each individual subtask, the order in which to carry them out, and how to coordinate with other agents. We define cooperative subtasks to be those subtasks in which coordination among agents significantly improves the performance of the overall task. Those levels of the hierarchy which include cooperative subtasks are called cooperation levels. A fundamental property of the proposed approach is that it allows agents to learn coordination faster by sharing information at the level of cooperative subtasks, rather than attempting to learn coordination at the level of primitive actions. We study the empirical performance of the Cooperative HRL algorithm using two testbeds: a simulated two-robot trash collection task, and a larger four-agent automated guided vehicle (AGV) scheduling problem. We compare the performance and speed of Cooperative HRL with other learning algorithms, as well as several well-known industrial AGV heuristics. We also address the issue of rational communication behavior among autonomous agents in this paper. The goal is for agents to learn both action and communication policies that together optimize the task given a communication cost. We extend the multi-agent HRL framework to include communication decisions and propose a cooperative multi-agent HRL algorithm called COM-Cooperative HRL. In this algorithm, we add a communication level to the hierarchical decomposition of the problem below each cooperation level. Before an agent makes a decision at a cooperative subtask, it decides if it is worthwhile to perform a communication action. A communication action has a certain cost and provides the agent with the actions selected by the other agents at a cooperation level. We demonstrate the efficiency of the COM-Cooperative HRL algorithm as well as the relation between the communication cost and the learned communication policy using a multi-agent taxi problem.

AAAI Conference 2006 Conference Paper

Learning Representation and Control in Continuous Markov Decision Processes

Sridhar Mahadevan
Kimberly Ferguson

This paper presents a novel framework for simultaneously learning representation and control in continuous Markov decision processes. Our approach builds on the framework of proto-value functions, in which the underlying representation or basis functions are automatically derived from a spectral analysis of the state space manifold. The proto-value functions correspond to the eigenfunctions of the graph Laplacian. We describe an approach to extend the eigenfunctions to novel states using the Nyström extension. A least-squares policy iteration method is used to learn the control policy, where the underlying subspace for approximating the value function is spanned by the learned proto-value functions. A detailed set of experiments is presented using classic benchmark tasks, including the inverted pendulum and the mountain car, showing the sensitivity in performance to various parameters, and including comparisons with a parametric radial basis function method.

ICML Conference 2005 Conference Paper

Coarticulation: an approach for generating concurrent plans in Markov decision processes

Khashayar Rohanimanesh
Sridhar Mahadevan

ICML Conference 2005 Conference Paper

Proto-value functions: developmental reinforcement learning

Sridhar Mahadevan

UAI Conference 2005 Conference Paper

Representation Policy Iteration

Sridhar Mahadevan

AAAI Conference 2005 Conference Paper

Samuel Meets Amarel: Automating Value Function Approximation Using Global State Space Analysis

Sridhar Mahadevan

Most work on value function approximation adheres to Samuel’s original design: agents learn a task-specific value function using parameter estimation, where the approximation architecture (e. g, polynomials) is specified by a human designer. This paper proposes a novel framework generalizing Samuel’s paradigm using a coordinate-free approach to value function approximation. Agents learn both representations and value functions by constructing geometrically customized taskindependent basis functions that form an orthonormal set for the Hilbert space of smooth functions on the underlying state space manifold. The approach rests on a technical result showing that the space of smooth functions on a (compact) Riemanian manifold has a discrete spectrum associated with the Laplace-Beltrami operator. In the discrete setting, spectral analysis of the graph Laplacian yields a set of geometrically customized basis functions for approximating and decomposing value functions. The proposed framework generalizes Samuel’s value function approximation paradigm by combining it with a formalization of Saul Amarel’s paradigm of representation learning through global state space analysis.

NeurIPS Conference 2005 Conference Paper

Value Function Approximation with Diffusion Wavelets and Laplacian Eigenfunctions

Sridhar Mahadevan
Mauro Maggioni

We investigate the problem of automatically constructing efﬁcient rep- resentations or basis functions for approximating value functions based on analyzing the structure and topology of the state space. In particu- lar, two novel approaches to value function approximation are explored based on automatically constructing basis functions on state spaces that can be represented as graphs or manifolds: one approach uses the eigen- functions of the Laplacian, in effect performing a global Fourier analysis on the graph; the second approach is based on diffusion wavelets, which generalize classical wavelets to graphs using multiscale dilations induced by powers of a diffusion operator or random walk on the graph. Together, these approaches form the foundation of a new generation of methods for solving large Markov decision processes, in which the underlying repre- sentation and policies are simultaneously learned.

NeurIPS Conference 2004 Conference Paper

Coarticulation in Markov Decision Processes

Khashayar Rohanimanesh
Robert Platt
Sridhar Mahadevan
Roderic Grupen

We investigate an approach for simultaneously committing to mul- tiple activities, each modeled as a temporally extended action in a semi-Markov decision process (SMDP). For each activity we de- fine a set of admissible solutions consisting of the redundant set of optimal policies, and those policies that ascend the optimal state- value function associated with them. A plan is then generated by merging them in such a way that the solutions to the subordinate activities are realized in the set of admissible solutions satisfying the superior activities. We present our theoretical results and em- pirically evaluate our approach in a simulated domain.

IROS Conference 2004 Conference Paper

Learning hierarchical models of activity

Sarah Osentoski
Victoria Manfredi
Sridhar Mahadevan

This paper investigates learning hierarchical statistical activity models in indoor environments. The abstract hidden Markov model (AHMM) is used to represent behaviors in stochastic environments. We train the model using both labeled and unlabeled data and estimate the parameters using expectation maximization (EM). Results are shown on three datasets: data collected in lab, entryway, and home environments. The results show that hierarchical models outperform flat models.

AAMAS Conference 2004 Conference Paper

Learning to Communicate and Act using Hierarchical Reinforcement Learning

Mohammad Ghavamzadeh
Sridhar Mahadevan

ICAPS Conference 2004 Conference Paper

Probabilistic Plan Recognition in Multiagent Systems

Suchi Saria
Sridhar Mahadevan

We present a theoretical framework for online probabilistic plan recognition in cooperative multiagent systems. Our model extends the Abstract Hidden Markov Model (AHMM) (Bui, Venkatesh, and West 2002), and consists of a hierarchical dynamic Bayes network that allows reasoning about the interaction among multiple cooperating agents. We provide an in-depth analysis of two different policy termination schemes, Tall and Tany for concurrent action introduced in (Rohanimanesh and Mahadevan 2003). In the Tall scheme, a joint policy terminates only when all agents have terminated executing their individual policies. In the Tany scheme, a joint policy terminates as soon as any of the agents terminates executing its individual policy. Since exact inference is intractable, we describe an approximate algorithm using Rao-Blackwellized particle filtering. Our approximate inference procedure reduces the complexity from exponential time in N, the number of agents and K, the number of levels, to time linear in both N and K^ ≤ K (the lowest-level of plan coordination) for the Tall termination scheme and O(N log N) and linear in K^ for the Tany termination scheme.

ICML Conference 2003 Conference Paper

Hierarchical Policy Gradient Algorithms

Mohammad Ghavamzadeh
Sridhar Mahadevan

ICRA Conference 2002 Conference Paper

Approximate Planning with Hierarchical Partially Observable Markov Decision Process Models for Robot Navigation

Georgios Theocharous
Sridhar Mahadevan

We propose and investigate a planning framework based on the hierarchical partially observable Markov decision process model (HPOMDP), and apply it to robot navigation. We show how this framework can be used to produce more robust plans as compared to flat models such as partially observable Markov decision processes (POMDPs). In our approach the environment is modeled at different levels of resolution, where abstract states represent both spatial and temporal abstraction. We test our hierarchical POMDP approach using a large simulated and real navigation environment. The results show that the robot is more successful in navigating to goals starting with no positional knowledge (uniform initial belief state distribution) using the hierarchical POMDP framework as compared to the flat POMDP approach.

ICML Conference 2002 Conference Paper

Hierarchically Optimal Average Reward Reinforcement Learning

Mohammad Ghavamzadeh
Sridhar Mahadevan

IROS Conference 2002 Conference Paper

Learning the hierarchical structure of spatial environments using multiresolution statistical models

Georgios Theocharous
Sridhar Mahadevan

We explore the use of hierarchical Partially Observable Markov Decision Process (HPOMDP) models to represent and learn a multiresolution spatial structure representation of indoor office environments. The hierarchical POMDP model is based on the hierarchical Hidden Markov Model (HHMM). HPOMDPs can be learned from sequences of observations using an extension of the hierarchical Baum-Welch estimation algorithm for HHMMs. We apply the HPOMDP model to indoor robot navigation and show how this framework can be used to represent multiresolution spatial maps. In the HPOMDP framework the environment is modeled at different levels of resolutions where abstract states represent both spatial and temporal abstraction. We test our hierarchical POMDP approach using a large simulated (modeled after a real environment) navigation environment. The results show that the hierarchical POMDP model is more capable in inferring the spatial structure than a uniform resolution "flat" POMDP.

NeurIPS Conference 2002 Conference Paper

Learning to Take Concurrent Actions

Khashayar Rohanimanesh
Sridhar Mahadevan

We investigate a general semi-Markov Decision Process (SMDP) framework for modeling concurrent decision making, where agents learn optimal plans over concurrent temporally extended actions. We introduce three types of parallel termination schemes { all, any and continue { and theoretically and experimentally compare them.

ICML Conference 2001 Conference Paper

Continuous-Time Hierarchical Reinforcement Learning

Mohammad Ghavamzadeh
Sridhar Mahadevan

UAI Conference 2001 Conference Paper

Decision-Theoretic Planning with Concurrent Temporally Extended Actions

Khashayar Rohanimanesh
Sridhar Mahadevan

ICRA Conference 2001 Conference Paper

Learning Hierarchical Partially Observable Markov Decision Process Models for Robot Navigation

Georgios Theocharous
Khashayar Rohanimanesh
Sridhar Mahadevan

We propose and investigate a general framework for hierarchical modeling of partially observable environments, such as office buildings, using hierarchical hidden Markov models (HHMMs). Our main goal is to explore hierarchical modeling as a basis for designing more efficient methods for model construction and usage. As a case study we focus on indoor robot navigation and show how this framework can be used to learn a hierarchy of models of the environment at different levels of spatial abstraction. We introduce the idea of model reuse that can be used to combine already learned models into a larger model. We describe an extension of the HHMM model to includes actions, which we call hierarchical POMDPs, and describe a modified hierarchical Baum-Welch algorithm to learn these models. We train different families of hierarchical models for a simulated and a real world corridor environment and compare them with the standard "flat" representation of the same environment. We show that the hierarchical POMDP approach, combined with model reuse, allows learning hierarchical models that fit the data better and train faster than flat models.

NeurIPS Conference 2000 Conference Paper

Hierarchical Memory-Based Reinforcement Learning

Natalia Hernandez-Gardiol
Sridhar Mahadevan

Sridhar Mahadevan Department of Computer Science Michigan State University East Lansing, MI 48824 mahadeva@cse. msu. edu A key challenge for reinforcement learning is scaling up to large partially observable domains. In this paper, we show how a hier(cid: 173) archy of behaviors can be used to create and select among variable length short-term memories appropriate for a task. At higher lev(cid: 173) els in the hierarchy, the agent abstracts over lower-level details and looks back over a variable number of high-level decisions in time. We formalize this idea in a framework called Hierarchical Suffix Memory (HSM). HSM uses a memory-based SMDP learning method to rapidly propagate delayed reward across long decision sequences. We describe a detailed experimental study comparing memory vs. hierarchy using the HSM framework on a realistic corridor navigation task.

ICML Conference 1999 Conference Paper

Hierarchical Optimization of Policy-Coupled Semi-Markov Decision Processes

Gang Wang
Sridhar Mahadevan

AAAI Conference 1996 Conference Paper

An Average-Reward Reinforcement Learning Algorithm for Computing Bias-Optimal Policies

Sridhar Mahadevan

Average-reward reinforcement learning (ARL) is an undiscounted optimality framework that is generally applicable to a broad range of control tasks. ARL computes gain-optimal control policies that maximize the expected payoff per step. However, gain-optimality has some intrinsic limitations as an optimality criterion, since for example, it cannot distinguish between different policies that all reach an absorbing goal state, but incur varying costs. A more selective criterion is bias optimality, which can filter gain-optimal policies to select those that reach absorbing goals with the minimum cost. While several ARL algorithms for computing gain-optimal policies have been proposed, none of these algorithms can guarantee bias optimality, since this requires solving at least two nested optimality equations. In this paper, we describe a novel model-based ARL algorithm for computing bias-optimal policies. We test the proposed algorithm using an admission control queuing system, and show that it is able to utilize the queue much more efficiently than a gain-optimal method by learning bias-optimal policies.

ICML Conference 1996 Conference Paper

Sensitive Discount Optimality: Unifying Discounted and Average Reward Reinforcement Learning

Sridhar Mahadevan

ICML Conference 1994 Conference Paper

To Discount or Not to Discount in Reinforcement Learning: A Case Study Comparing R Learning and Q Learning

Sridhar Mahadevan

AIJ Journal 1993 Journal Article

An apprentice-based approach to knowledge acquisition

Sridhar Mahadevan
Tom M. Mitchell
Jack Mostow
Lou Steinberg
Prasad V. Tadepalli

AIJ Journal 1992 Journal Article

Automatic programming of behavior-based robots using reinforcement learning

Sridhar Mahadevan
Jonathan Connell

AAAI Conference 1991 Conference Paper

Automatic Programming of Behavior-Based Robots Using Reinforcement Learning

Sridhar Mahadevan

This paper describes a general approach for automatically programming a behavior-based robot. New behaviors are learned by trial and error using a performance feedback function as reinforcement. Two algorithms for behavior learning are described that combine techniques for propagating reinforcement values temporally across actions and spatially across states. A behavior-based robot called OBELIX (see Figure 1) is described that learns several component behaviors in an example task involving pushing boxes. An experimental study using the robot suggests two conclusions. One, the learning techniques are able to learn the individual behaviors, sometimes outperforming a handcoded program. Two, using a behavior-based architecture is better than using a monolithic architecture for learning the box pushing task.

IJCAI Conference 1985 Conference Paper

LEAP: A Learning Apprentice for VLSI Design

Tom M. Mitchell
Sridhar Mahadevan
Louis I. Steinberg

IJCAI Conference 1985 Conference Paper

Verification-based Learning: A Generalized Strategy for Inferring Problem-Reduction Methods

Sridhar Mahadevan