ICML Conference 2011 Conference Paper
Mean-Variance Optimization in Markov Decision Processes
- Shie Mannor
- John N. Tsitsiklis
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
ICML Conference 2011 Conference Paper
JMLR Journal 2009 Journal Article
We study online learning where a decision maker interacts with Nature with the objective of maximizing her long-term average reward subject to some sample path average constraints. We define the reward-in-hindsight as the highest reward the decision maker could have achieved, while satisfying the constraints, had she known Nature's choices in advance. We show that in general the reward-in-hindsight is not attainable. The convex hull of the reward-in-hindsight function is, however, attainable. For the important case of a single constraint, the convex hull turns out to be the highest attainable function. Using a calibrated forecasting rule, we provide an explicit strategy that attains this convex hull. We also measure the performance of heuristic methods based on non-calibrated forecasters in experiments involving a CPU power management problem. [abs] [ pdf ][ bib ] © JMLR 2009. ( edit, beta )
ICML Conference 2004 Conference Paper
JMLR Journal 2004 Journal Article
We consider the multi-armed bandit problem under the PAC ("probably approximately correct") model. It was shown by Even-Dar et al. (2002) that given n arms, a total of O (( n /ε 2 )log(1/δ)) trials suffices in order to find an ε-optimal arm with probability at least 1-δ. We establish a matching lower bound on the expected number of trials under any sampling policy. We furthermore generalize the lower bound, and show an explicit dependence on the (unknown) statistics of the arms. We also provide a similar bound within a Bayesian setting. The case where the statistics of the arms are known but the identities of the arms are not, is also discussed. For this case, we provide a lower bound of Θ((1/ε 2 )( n +log(1/δ))) on the expected number of trials, as well as a sampling policy with a matching upper bound. If instead of the expected number of trials, we consider the maximum (over all sample paths) number of trials, we establish a matching upper and lower bound of the form Θ(( n /ε 2 )log(1/δ)). Finally, we derive lower bounds on the expected regret, in the spirit of Lai and Robbins. [abs] [ pdf ][ ps.gz ][ ps ]
JMLR Journal 2002 Journal Article
We consider a finite-state Markov decision problem and establish the convergence of a special case of optimistic policy iteration that involves Monte Carlo estimation of Q -values, in conjunction with greedy policy selection. We provide convergence results for a number of algorithmic variations, including one that involves temporal difference learning (bootstrapping) instead of Monte Carlo estimation. We also indicate some extensions that either fail or are unlikely to go through.
TCS Journal 2001 Journal Article
In this paper we study problems such as: given a discrete time dynamical system of the form x(t+1)=f(x(t)) where f: R n→R n is a piecewise affine function, decide whether all trajectories converge to 0. We show in our main theorem that this Attractivity Problem is undecidable as soon as n⩾2. The same is true of two related problems: Stability (is the dynamical system globally asymptotically stable?) and Mortality (do all trajectories go through 0?). We then show that Attractivity and Stability become decidable in dimension 1 for continuous functions.
FOCS Conference 1990 Conference Paper
The authors consider a situation in which two processors P/sub 1/ and P/sub 2/ are to evaluate one or more functions f/sub 1/, .. ., f/sub s/ of two vector variables x and y, under the assumption that processor P/sub 1/ (respectively, P/sub 2/) has access only to the value of x (respectively, y) and the functional form of f/sub 1/, .. ., f/sub s/. They consider a continuous model of communication whereby real-valued messages are transmitted, and they study the minimum number of messages required for the desired computation. Tight lower bounds are established for the following three problems: (1) each f/sub i/ is a rational function and only one-way communication is allowed. (2) The variables x and y are matrices and the processors wish to solve the linear system (x+y)z=b for the unknown z. (3) The processors wish to evaluate a particular root of the polynomial equation Sigma (x/sub i/+y/sub i/)z/sup i/=0, where the sum is from i=0 to n-1. >
FOCS Conference 1990 Conference Paper
The authors show a sharp dichotomy between systems of identical automata with symmetric global control whose behavior is easy to predict and those whose behavior is hard to predict. The division pertains to whether the global control rule is invariant with respect to permutations of the states of the automaton. It is also shown that testing whether the global control rule has this invariance property is an undecidable problem. It is argued that there is a natural analog between complexity in the present model and chaos in dynamical systems. >