Arrow Research

Author name cluster

Gerald Tesauro

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

40 papers
2 author rows

Possible papers (40)

JAAMAS Journal 2002 Journal Article

Pricing in Agent Economies Using Multi-Agent Q-Learning

  • Gerald Tesauro
  • Jeffrey O. Kephart

This paper investigates how adaptive software agents may utilize reinforcement learning algorithms such as Q-learning to make economic decisions such as setting prices in a competitive marketplace. For a single adaptive agent facing fixed-strategy opponents, ordinary Q-learning is guaranteed to find the optimal policy. However, for a population of agents each trying to adapt in the presence of other adaptive agents, the problem becomes non-stationary and history dependent, and it is not known whether any global convergence will be obtained, and if so, whether such solutions will be optimal. In this paper, we study simultaneous Q-learning by two competing seller agents in three moderately realistic economic models. This is the simplest case in which interesting multi-agent phenomena can occur, and the state space is small enough so that lookup tables can be used to represent the Q-functions. We find that, despite the lack of theoretical guarantees, simultaneous convergence to self-consistent optimal solutions is obtained in each model, at least for small values of the discount parameter. In some cases, exact or approximate convergence is also found even at large discount parameters. We show how the Q-derived policies increase profitability and damp out or eliminate cyclic price “wars” compared to simpler policies based on zero lookahead or short-term lookahead. In one of the models (the “Shopbot” model) where the sellers' profit functions are symmetric, we find that Q-learning can produce either symmetric or broken-symmetry policies, depending on the discount parameter and on initial conditions.
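
A minimal sketch of the alternating-price setup the abstract describes, assuming a toy profit function and a coarse price grid (both invented for illustration); each seller Q-learns over the opponent's last posted price, not the paper's actual economic models:

```python
import numpy as np

rng = np.random.default_rng(0)
PRICES = np.linspace(0.1, 1.0, 10)   # hypothetical discrete price grid
N = len(PRICES)
GAMMA, ALPHA, EPS = 0.1, 0.1, 0.1    # small discount, as in the abstract

def profit(p_own, p_opp):
    # Toy stand-in for the paper's economic models: the cheaper seller
    # captures most of the demand, and profit is price times share.
    share = 1.0 if p_own < p_opp else (0.5 if p_own == p_opp else 0.2)
    return p_own * share

Q = [np.zeros((N, N)), np.zeros((N, N))]  # Q[i][opponent price, own price]
p = [0, 0]                                # current price indices
pending = [None, None]                    # (state, action) awaiting the reply

for t in range(100000):
    i = t % 2                             # sellers alternate price updates
    s = p[1 - i]                          # state: opponent's current price
    if pending[i] is not None:            # finish last turn's update now
        ps, pa = pending[i]               # that the opponent has replied
        r = profit(PRICES[pa], PRICES[ps])
        Q[i][ps, pa] += ALPHA * (r + GAMMA * Q[i][s].max() - Q[i][ps, pa])
    a = rng.integers(N) if rng.random() < EPS else int(Q[i][s].argmax())
    pending[i] = (s, a)
    p[i] = a

print("seller 0 greedy replies:", PRICES[Q[0].argmax(axis=1)])
```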

AAAI Conference 2022 Conference Paper

Context-Specific Representation Abstraction for Deep Option Learning

  • Marwa Abdulhai
  • Dong-Ki Kim
  • Matthew Riemer
  • Miao Liu
  • Gerald Tesauro
  • Jonathan P. How

Hierarchical reinforcement learning has focused on discovering temporally extended actions, such as options, that can provide benefits in problems requiring extensive exploration. One promising approach that learns these options end-to-end is the option-critic (OC) framework. In this paper, we examine OC and show that it does not decompose a problem into simpler sub-problems, but instead increases the size of the search over policy space, with each option considering the entire state space during learning. This issue can result in practical limitations of this method, including sample inefficient learning. To address this problem, we introduce Context-Specific Representation Abstraction for Deep Option Learning (CRADOL), a new framework that considers both temporal abstraction and context-specific representation abstraction to effectively reduce the size of the search over policy space. Specifically, our method learns a factored belief state representation that enables each option to learn a policy over only a subsection of the state space. We test our method against hierarchical, non-hierarchical, and modular recurrent neural network baselines, demonstrating significant sample efficiency improvements in challenging partially observable environments.

NeurIPS Conference 2022 Conference Paper

Influencing Long-Term Behavior in Multiagent Reinforcement Learning

  • Dong-Ki Kim
  • Matthew Riemer
  • Miao Liu
  • Jakob Foerster
  • Michael Everett
  • Chuangchuang Sun
  • Gerald Tesauro
  • Jonathan P. How

The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other agents and influence the evolution of future policies towards desirable behavior for its own benefit. Unfortunately, previous approaches for achieving this suffer from myopic evaluation, considering only a finite number of policy updates. As such, these methods can only influence transient future policies rather than achieving the promise of scalable equilibrium selection approaches that influence the behavior at convergence. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will converge to. Our paper characterizes desirable solution concepts within this problem setting and provides practical approaches for optimizing over possible outcomes. As a result of our farsighted objective, we demonstrate better long-term performance than state-of-the-art baselines across a suite of diverse multiagent benchmark domains.

ICML Conference 2021 Conference Paper

A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning

  • Dong-Ki Kim
  • Miao Liu 0001
  • Matthew Riemer
  • Chuangchuang Sun
  • Marwa Abdulhai
  • Golnaz Habibi
  • Sebastian Lopez-Cot
  • Gerald Tesauro

A fundamental challenge in multiagent reinforcement learning is to learn beneficial behaviors in a shared environment with other simultaneously learning agents. In particular, each agent perceives the environment as effectively non-stationary due to the changing policies of other agents. Moreover, each agent is itself constantly learning, leading to natural non-stationarity in the distribution of experiences encountered. In this paper, we propose a novel meta-multiagent policy gradient theorem that directly accounts for the non-stationary policy dynamics inherent to multiagent learning settings. This is achieved by modeling our gradient updates to consider both an agent’s own non-stationary policy dynamics and the non-stationary policy dynamics of other agents in the environment. We show that our theoretically grounded approach provides a general solution to the multiagent learning problem, which inherently comprises all key aspects of previous state-of-the-art approaches on this topic. We test our method on a diverse suite of multiagent benchmarks and demonstrate a more efficient ability to adapt to new agents as they learn than baseline methods across the full spectrum of mixed incentive, competitive, and cooperative domains.

PRL Workshop 2021 Workshop Paper

AI Planning Annotation in Reinforcement Learning: Options and Beyond

  • Junkyu Lee
  • Michael Katz
  • Don Joven Agravante
  • Miao Liu
  • Tim Klinger
  • Murray Campbell
  • Shirin Sohrabi
  • Gerald Tesauro

AI planning and reinforcement learning (RL) both solve sequential decision-making problems, taking fundamentally different approaches. In this work, we aim to bring AI planning and RL closer by investigating the relationship between abstractions in AI planning and the options framework in RL. To this end, we propose annotating RL tasks with AI planning models, allowing us to define options based purely on the planning model. Our experimental investigation shows that these options can be quickly trained offline and can improve the sample efficiency of a reinforcement learning algorithm.

IJCAI Conference 2021 Conference Paper

Efficient Black-Box Planning Using Macro-Actions with Focused Effects

  • Cameron Allen
  • Michael Katz
  • Tim Klinger
  • George Konidaris
  • Matthew Riemer
  • Gerald Tesauro

The difficulty of deterministic planning increases exponentially with search-tree depth. Black-box planning presents an even greater challenge, since planners must operate without an explicit model of the domain. Heuristics can make search more efficient, but goal-aware heuristics for black-box planning usually rely on goal counting, which is often quite uninformative. In this work, we show how to overcome this limitation by discovering macro-actions that make the goal-count heuristic more accurate. Our approach searches for macro-actions with focused effects (i.e., macros that modify only a small number of state variables), which align well with the assumptions made by the goal-count heuristic. Focused macros dramatically improve black-box planning efficiency across a wide range of planning domains, sometimes beating even state-of-the-art planners with access to a full domain model.
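
A compact sketch of the search side of this idea: greedy best-first black-box search under the goal-count heuristic, with macros offered alongside primitive actions. The toy domain and the hand-written macro below are invented stand-ins for the macros the paper discovers:

```python
from heapq import heappush, heappop

def goal_count(state, goal):
    # Goal-count heuristic: number of goal variables not yet achieved.
    return sum(1 for var, val in goal.items() if state.get(var) != val)

def gbfs(start, goal, actions):
    """Greedy best-first black-box search. `actions` maps a state to
    (name, successor) pairs; macros are offered like primitive actions."""
    frontier = [(goal_count(start, goal), 0, start, [])]
    seen, tick = set(), 0
    while frontier:
        h, _, state, plan = heappop(frontier)
        if h == 0:
            return plan
        key = tuple(sorted(state.items()))
        if key in seen:
            continue
        seen.add(key)
        for name, nxt in actions(state):
            tick += 1
            heappush(frontier, (goal_count(nxt, goal), tick, nxt, plan + [name]))
    return None

def actions(s):
    # Toy domain: primitives flip one bit; the macro flips two at once,
    # standing in for a discovered focused-effect macro.
    out = [(f"flip-{v}", {**s, v: 1 - s[v]}) for v in s]
    out.append(("macro-ab", {**s, "a": 1 - s["a"], "b": 1 - s["b"]}))
    return out

print(gbfs({"a": 0, "b": 0, "c": 0}, {"a": 1, "b": 1, "c": 1}, actions))
```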

AAAI Conference 2021 Short Paper

RL Generalization in a Theory of Mind Game Through a Sleep Metaphor (Student Abstract)

  • Tyler Malloy
  • Tim Klinger
  • Miao Liu
  • Gerald Tesauro
  • Matthew Riemer
  • Chris R. Sims

Training agents to learn efficiently in multi-agent environments can benefit from the explicit modelling of other agents’ beliefs, especially in complex limited-information games such as the Hanabi card game. However, generalization is also highly relevant to performance in these games, though model comparisons at large training timescales can be difficult. In this work, we address this by introducing a novel model trained using a sleep metaphor on a reduced-complexity version of the Hanabi game. This sleep metaphor consists of an altered training regimen, as well as an information-theoretic constraint on the agent’s policy. Results from experimentation demonstrate improved performance through this sleep-metaphor method, and provide a promising motivation for using similar techniques in more complex methods that incorporate explicit models of other agents’ beliefs.

AAAI Conference 2021 Conference Paper

Text-based RL Agents with Commonsense Knowledge: New Challenges, Environments and Baselines

  • Keerthiram Murugesan
  • Mattia Atzeni
  • Pavan Kapanipathi
  • Pushkar Shukla
  • Sadhana Kumaravel
  • Gerald Tesauro
  • Kartik Talamadupula
  • Mrinmaya Sachan

Text-based games have emerged as an important test-bed for Reinforcement Learning (RL) research, requiring RL agents to combine grounded language understanding with sequential decision making. In this paper, we examine the problem of infusing RL agents with commonsense knowledge. Such knowledge would allow agents to efficiently act in the world by pruning out implausible actions, and to perform lookahead planning to determine how current actions might affect future world states. We design a new text-based gaming environment called TextWorld Commonsense (TWC) for training and evaluating RL agents with a specific kind of commonsense knowledge about objects, their attributes, and affordances. We also introduce several baseline RL agents which track the sequential context and dynamically retrieve the relevant commonsense knowledge from ConceptNet. We show that agents which incorporate commonsense knowledge in TWC perform better, while acting more efficiently. We conduct user-studies to estimate human performance on TWC and show that there is ample room for future improvement.

NeurIPS Conference 2020 Conference Paper

Decentralized TD Tracking with Linear Function Approximation and its Finite-Time Analysis

  • Gang Wang
  • Songtao Lu
  • Georgios Giannakis
  • Gerald Tesauro
  • Jian Sun

The present contribution deals with decentralized policy evaluation in multi-agent Markov decision processes using temporal-difference (TD) methods with linear function approximation for scalability. The agents cooperate to estimate the value function of such a process by observing continual state transitions of a shared environment over the graph of interconnected nodes (agents), along with locally private rewards. Different from existing consensus-type TD algorithms, the approach here develops a simple decentralized TD tracker by wedding TD learning with gradient tracking techniques. The non-asymptotic properties of the novel TD tracker are established for both independent and identically distributed (i.i.d.) as well as Markovian transitions through a unifying multistep Lyapunov analysis. In contrast to the prior art, the novel algorithm forgoes the limiting error bounds on the number of agents, which endows it with performance comparable to that of centralized TD methods that are the sharpest known to date.
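
A consensus-style sketch of decentralized TD(0) with linear features over a shared state stream and private rewards. Note this mixes parameters with a doubly stochastic matrix rather than implementing the paper's gradient-tracking tracker, and every quantity here (features, transitions, rewards) is a toy stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)
N_AGENTS, N_STATES, D = 4, 6, 3
GAMMA, ALPHA = 0.9, 0.05

PHI = rng.normal(size=(N_STATES, D))                  # shared linear features
P = rng.dirichlet(np.ones(N_STATES), size=N_STATES)   # shared transition matrix
W = np.full((N_AGENTS, N_AGENTS), 1.0 / N_AGENTS)     # doubly stochastic mixing

theta = np.zeros((N_AGENTS, D))                       # one parameter vector per agent

def td_direction(i, s, s2):
    # Every agent observes the shared transition but only its private reward.
    r = rng.normal(loc=i + 1)                         # invented private reward
    delta = r + GAMMA * PHI[s2] @ theta[i] - PHI[s] @ theta[i]
    return delta * PHI[s]

s = 0
for t in range(5000):
    s2 = rng.choice(N_STATES, p=P[s])
    g = np.stack([td_direction(i, s, s2) for i in range(N_AGENTS)])
    theta = W @ theta + ALPHA * g                     # consensus mixing + local TD step
    s = s2

print("max parameter disagreement:", np.abs(theta - theta.mean(axis=0)).max())
```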

AAAI Conference 2020 Conference Paper

On the Role of Weight Sharing During Deep Option Learning

  • Matthew Riemer
  • Ignacio Cases
  • Clemens Rosenbaum
  • Miao Liu
  • Gerald Tesauro

The options framework is a popular approach for building temporally extended actions in reinforcement learning. In particular, the option-critic architecture provides general purpose policy gradient theorems for learning actions from scratch that are extended in time. However, past work makes the key assumption that each of the components of option-critic has independent parameters. In this work we note that while this key assumption of the policy gradient theorems of option-critic holds in the tabular case, it is always violated in practice for the deep function approximation setting. We thus reconsider this assumption and consider more general extensions of option-critic and hierarchical option-critic training that optimize for the full architecture with each update. It turns out that not assuming parameter independence challenges a belief in prior work that training the policy over options can be disentangled from the dynamics of the underlying options. In fact, learning can be sped up by focusing the policy over options on states where options are actually likely to terminate. We put our new algorithms to the test in application to sample efficient learning of Atari games, and demonstrate significantly improved stability and faster convergence when learning long options.

AAAI Conference 2019 Conference Paper

Hybrid Reinforcement Learning with Expert State Sequences

  • Xiaoxiao Guo
  • Shiyu Chang
  • Mo Yu
  • Gerald Tesauro
  • Murray Campbell

Existing imitation learning approaches often require that the complete demonstration data, including sequences of actions and states, are available. In this paper, we consider a more realistic and difficult scenario where a reinforcement learning agent only has access to the state sequences of an expert, while the expert actions are unobserved. We propose a novel tensor-based model to infer the unobserved actions of the expert state sequences. The policy of the agent is then optimized via a hybrid objective combining reinforcement learning and imitation learning. We evaluated our hybrid approach on an illustrative domain and Atari games. The empirical results show that (1) the agents are able to leverage expert state sequences to learn faster than pure reinforcement learning baselines, (2) our tensor-based action inference model is advantageous compared to standard deep neural networks in inferring expert actions, and (3) the hybrid policy optimization objective is robust against noise in expert state sequences.

AAAI Conference 2019 Conference Paper

Learning to Teach in Cooperative Multiagent Reinforcement Learning

  • Shayegan Omidshafiei
  • Dong-Ki Kim
  • Miao Liu
  • Gerald Tesauro
  • Matthew Riemer
  • Christopher Amato
  • Murray Campbell
  • Jonathan P. How

Collective human knowledge has clearly benefited from the fact that innovations by individuals are taught to others through communication. Similar to human social groups, agents in distributed learning systems would likely benefit from communication to share knowledge and teach skills. The problem of teaching to improve agent learning has been investigated by prior works, but these approaches make assumptions that prevent application of teaching to general multiagent problems, or require domain expertise for problems they can apply to. This learning to teach problem has inherent complexities related to measuring long-term impacts of teaching that compound the standard multiagent coordination challenges. In contrast to existing works, this paper presents the first general framework and algorithm for intelligent agents to learn to teach in a multiagent environment. Our algorithm, Learning to Coordinate and Teach Reinforcement (LeCTR), addresses peer-to-peer teaching in cooperative multiagent reinforcement learning. Each agent in our approach learns both when and what to advise, then uses the received advice to improve local learning. Importantly, these roles are not fixed; these agents learn to assume the role of student and/or teacher at the appropriate moments, requesting and providing advice in order to improve teamwide performance and learning. Empirical comparisons against state-of-the-art teaching methods show that our teaching agents not only learn significantly faster, but also learn to coordinate in tasks where existing methods fail.

NeurIPS Conference 2018 Conference Paper

Dialog-based Interactive Image Retrieval

  • Xiaoxiao Guo
  • Hui Wu
  • Yu Cheng
  • Steven Rennie
  • Gerald Tesauro
  • Rogerio Feris

Existing methods for interactive image retrieval have demonstrated the merit of integrating user feedback, improving retrieval results. However, most current systems rely on restricted forms of user feedback, such as binary relevance responses, or feedback based on a fixed set of relative attributes, which limits their impact. In this paper, we introduce a new approach to interactive image search that enables users to provide feedback via natural language, allowing for more natural and effective interaction. We formulate the task of dialog-based interactive image retrieval as a reinforcement learning problem, and reward the dialog system for improving the rank of the target image during each dialog turn. To mitigate the cumbersome and costly process of collecting human-machine conversations as the dialog system learns, we train our system with a user simulator, which is itself trained to describe the differences between target and candidate images. The efficacy of our approach is demonstrated in a footwear retrieval application. Experiments on both simulated and real-world data show that 1) our proposed learning framework achieves better accuracy than other supervised and reinforcement learning baselines and 2) user feedback based on natural language rather than pre-specified attributes leads to more effective retrieval results, and a more natural and expressive communication interface.

NeurIPS Conference 2018 Conference Paper

Learning Abstract Options

  • Matthew Riemer
  • Miao Liu
  • Gerald Tesauro

Building systems that autonomously create temporal abstractions from data is a key challenge in scaling learning and planning in reinforcement learning. One popular approach for addressing this challenge is the options framework (Sutton et al., 1999). However, only recently in (Bacon et al., 2017) was a policy gradient theorem derived for online learning of general purpose options in an end-to-end fashion. In this work, we extend previous work on this topic that only focuses on learning a two-level hierarchy including options and primitive actions to enable learning simultaneously at multiple resolutions in time. We achieve this by considering an arbitrarily deep hierarchy of options where high level temporally extended options are composed of lower level options with finer resolutions in time. We extend results from (Bacon et al., 2017) and derive policy gradient theorems for a deep hierarchy of options. Our proposed hierarchical option-critic architecture is capable of learning internal policies, termination conditions, and hierarchical compositions over options without the need for any intrinsic rewards or subgoals. Our empirical results in both discrete and continuous environments demonstrate the efficiency of our framework.
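
A minimal sketch of executing (not learning) a deep option hierarchy: each option holds a policy over lower-level options and a termination test, and execution maintains a call stack. All names and the toy domain below are hypothetical:

```python
import random

class Option:
    """An option: a policy over lower-level options (or primitive actions,
    given as strings) plus a state-dependent termination probability."""
    def __init__(self, name, policy, beta):
        self.name, self.policy, self.beta = name, policy, beta

def run(root, state, env_step, max_steps=20):
    stack = [root]                       # active options, highest level first
    for _ in range(max_steps):
        choice = stack[-1].policy(state)
        if isinstance(choice, Option):   # descend one level in the hierarchy
            stack.append(choice)
            continue
        state = env_step(state, choice)  # a primitive hits the environment
        # Terminate finished options bottom-up; the root never terminates.
        while len(stack) > 1 and random.random() < stack[-1].beta(state):
            stack.pop()
    return state

# Toy usage: the root repeatedly invokes a one-step "move right" option.
leaf = Option("right", lambda s: "+1", lambda s: 1.0)
root = Option("root", lambda s: leaf, lambda s: 0.0)
print(run(root, 0, lambda s, a: s + 1))   # 10 primitive steps in 20 iterations
```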

IS Journal 2017 Journal Article

Cognitive Computing

  • Mohan Sridharan
  • Gerald Tesauro
  • James Hendler

The guest editors of this special issue on cognitive computing discuss the field in general and the four articles they selected to represent it in particular.

AAAI Conference 2017 Conference Paper

Multiresolution Recurrent Neural Networks: An Application to Dialogue Response Generation

  • Iulian Serban
  • Tim Klinger
  • Gerald Tesauro
  • Kartik Talamadupula
  • Bowen Zhou
  • Yoshua Bengio
  • Aaron Courville

We introduce a new class of models called multiresolution recurrent neural networks, which explicitly model natural language generation at multiple levels of abstraction. The models extend the sequence-to-sequence framework to generate two parallel stochastic processes: a sequence of high-level coarse tokens, and a sequence of natural language words (e.g., sentences). The coarse sequences follow a latent stochastic process with a factorial representation, which helps the models generalize to new examples. The coarse sequences can also incorporate task-specific knowledge, when available. In our experiments, the coarse sequences are extracted using automatic procedures, which are designed to capture compositional structure and semantics. These procedures enable training the multiresolution recurrent neural networks by maximizing the exact joint log-likelihood over both sequences. We apply the models to dialogue response generation in the technical support domain and compare them with several competing models. The multiresolution recurrent neural networks outperform competing models by a substantial margin, achieving state-of-the-art results according to both a human evaluation study and automatic evaluation metrics. Furthermore, experiments show the proposed models generate more fluent, relevant and goal-oriented responses.

AAAI Conference 2016 Conference Paper

Selecting Near-Optimal Learners via Incremental Data Allocation

  • Ashish Sabharwal
  • Horst Samulowitz
  • Gerald Tesauro

We study a novel machine learning (ML) problem setting of sequentially allocating small subsets of training data amongst a large set of classifiers. The goal is to select a classifier that will give near-optimal accuracy when trained on all data, while also minimizing the cost of misallocated samples. This is motivated by large modern datasets and ML toolkits with many combinations of learning algorithms and hyperparameters. Inspired by the principle of “optimism under uncertainty,” we propose an innovative strategy, Data Allocation using Upper Bounds (DAUB), which robustly achieves these objectives across a variety of real-world datasets. We further develop substantial theoretical support for DAUB in an idealized setting where the expected accuracy of a classifier trained on n samples can be known exactly. Under these conditions we establish a rigorous sub-linear bound on the regret of the approach (in terms of misallocated data), as well as a rigorous bound on suboptimality of the selected classifier. Our accuracy estimates using real-world datasets only entail mild violations of the theoretical scenario, suggesting that the practical behavior of DAUB is likely to approach the idealized behavior.
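
A rough sketch of the allocation loop, assuming scikit-learn-style learners exposing fit() and score(); the upper bound here is a simple slope extrapolation of the observed learning curve, standing in for the bounds DAUB actually derives:

```python
import numpy as np

def daub_sketch(learners, X, y, n_full, start=100, ratio=1.5, rounds=30):
    """Optimism under uncertainty over a set of learners. Each round, the
    learner whose optimistic extrapolation to n_full samples is highest
    gets a larger share of the training data."""
    alloc = np.full(len(learners), start)
    acc = np.full(len(learners), -1.0)   # last observed accuracy per learner
    bound = np.ones(len(learners))       # untried learners are maximally optimistic
    prev = {}                            # learner index -> (n, accuracy)
    for _ in range(rounds):
        i = int(np.argmax(bound))
        n = int(min(alloc[i], n_full))
        learners[i].fit(X[:n], y[:n])
        a = learners[i].score(X, y)
        if i in prev:                    # slope-based optimistic extrapolation
            n0, a0 = prev[i]
            slope = max(0.0, (a - a0) / max(n - n0, 1))
            bound[i] = min(1.0, a + slope * (n_full - n))
        prev[i], acc[i] = (n, a), a
        alloc[i] = int(alloc[i] * ratio)
    return int(np.argmax(acc))           # learner to train on all the data
```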

AAAI Conference 2015 Conference Paper

Budgeted Prediction with Expert Advice

  • Kareem Amin
  • Satyen Kale
  • Gerald Tesauro
  • Deepak Turaga

We consider a budgeted variant of the problem of learning from expert advice with N experts. Each queried expert incurs a cost and there is a given budget B on the total cost of experts that can be queried in any prediction round. We provide an online learning algorithm for this setting with regret after T prediction rounds bounded by O(√((C/B) log(N) T)), where C is the total cost of all experts. We complement this upper bound with a nearly matching lower bound Ω(√((C/B) T)) on the regret of any algorithm for this problem. We also provide experimental validation of our algorithm.

AAAI Conference 2015 Conference Paper

Towards Cognitive Automation of Data Science

  • Alain Biem
  • Maria Butrico
  • Mark Feblowitz
  • Tim Klinger
  • Yuri Malitsky
  • Kenney Ng
  • Adam Perer
  • Chandra Reddy

A Data Scientist typically performs a number of tedious and time-consuming steps to derive insight from a raw data set. The process usually starts with data ingestion, cleaning, and transformation (e.g., outlier removal, missing value imputation), then proceeds to model building, and finally a presentation of predictions that align with the end-users' objectives and preferences. It is a long, complex, and sometimes artful process requiring substantial time and effort, especially because of the combinatorial explosion in choices of algorithms (and platforms), their parameters, and their compositions. Tools that can help automate steps in this process have the potential to accelerate the time-to-delivery of useful results, expand the reach of data science to non-experts, and offer a more systematic exploration of the available options. This work presents a step towards this goal.

UAI Conference 2010 Conference Paper

Bayesian Inference in Monte-Carlo Tree Search

  • Gerald Tesauro
  • V. T. Rajan
  • Richard B. Segal

Monte-Carlo Tree Search (MCTS) methods are drawing great interest after yielding breakthrough results in computer Go. This paper proposes a Bayesian approach to MCTS that is inspired by distribution-free approaches such as UCT [13], yet significantly differs in important respects. The Bayesian framework allows potentially much more accurate (Bayes-optimal) estimation of node values and node uncertainties from a limited number of simulation trials. We further propose propagating inference in the tree via fast analytic Gaussian approximation methods: this can make the overhead of Bayesian inference manageable in domains such as Go, while preserving high accuracy of expected-value estimates. We find substantial empirical outperformance of UCT in an idealized bandit-tree test environment, where we can obtain valuable insights by comparing with known ground truth. Additionally we rigorously prove on-policy and off-policy convergence of the proposed methods.
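
A toy sketch of the Bayesian flavor of node statistics: each node keeps a Gaussian belief, selection samples from the children's posteriors, and a crude "track the best child" rule stands in for the paper's analytic Gaussian-max propagation. Everything here is an invented simplification:

```python
import math
import random

class Node:
    def __init__(self):
        self.mu, self.var = 0.0, 1.0        # Gaussian belief over node value
        self.children = {}

def select(node):
    # Thompson-style selection: sample each child's posterior, take the max.
    return max(node.children,
               key=lambda a: random.gauss(node.children[a].mu,
                                          math.sqrt(node.children[a].var)))

def update_leaf(node, reward, obs_var=1.0):
    # Conjugate Gaussian update from a single simulation outcome.
    k = node.var / (node.var + obs_var)
    node.mu += k * (reward - node.mu)
    node.var *= 1 - k

def backup(node):
    # Crude stand-in for analytic Gaussian-max propagation:
    # the parent's belief simply tracks its best child's belief.
    best = max(node.children.values(), key=lambda c: c.mu)
    node.mu, node.var = best.mu, best.var

# Toy demo: arm "b" has the higher true mean and should be preferred.
root = Node()
root.children = {"a": Node(), "b": Node()}
for _ in range(500):
    arm = select(root)
    update_leaf(root.children[arm], random.gauss(0.2 if arm == "a" else 0.5, 1.0))
    backup(root)
print("preferred arm:", max(root.children, key=lambda a: root.children[a].mu))
```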

ICML Conference 2009 Conference Paper

Monte-Carlo simulation balancing

  • David Silver 0001
  • Gerald Tesauro

In this paper we introduce the first algorithms for efficiently learning a simulation policy for Monte-Carlo search. Our main idea is to optimise the balance of a simulation policy, so that an accurate spread of simulation outcomes is maintained, rather than optimising the direct strength of the simulation policy. We develop two algorithms for balancing a simulation policy by gradient descent. The first algorithm optimises the balance of complete simulations, using a policy gradient algorithm; whereas the second algorithm optimises the balance over every two steps of simulation. We compare our algorithms to reinforcement learning and supervised learning algorithms for maximising the strength of the simulation policy. We test each algorithm in the domain of 5×5 and 6×6 Computer Go, using a softmax policy that is parameterised by weights for a hundred simple patterns. When used in a simple Monte-Carlo search, the policies learnt by simulation balancing achieved significantly better performance, with half the mean squared error of a uniform random policy, and similar overall performance to a sophisticated Go engine.
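
A skeletal sketch of the whole-simulation balancing gradient, with invented move features and outcomes in place of Go positions; two independent sample halves keep the product of the two estimates unbiased:

```python
import numpy as np

rng = np.random.default_rng(2)
D, ALPHA = 5, 0.01
theta = np.zeros(D)            # weights over simple patterns, as in the paper

def rollout(horizon=10):
    """One softmax-policy simulation. Returns the game outcome z and
    sum_t grad log pi(a_t | s_t), the accumulated score-function term."""
    score = np.zeros(D)
    for _ in range(horizon):
        feats = rng.normal(size=(3, D))    # invented per-move features
        p = np.exp(feats @ theta)
        p /= p.sum()
        a = rng.choice(3, p=p)
        score += feats[a] - p @ feats
    return rng.normal(), score             # invented outcome, not a Go result

def balancing_step(v_star, n=64):
    """Gradient step shrinking (v_star - E[z])^2; independent halves
    estimate E[z] and E[z * score] so their product stays unbiased."""
    global theta
    samples = [rollout() for _ in range(n)]
    mean_z = np.mean([z for z, _ in samples[: n // 2]])
    g = np.mean([z * sc for z, sc in samples[n // 2 :]], axis=0)
    theta += ALPHA * (v_star - mean_z) * g
```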

AAMAS Conference 2008 Conference Paper

Autonomic Multi-Agent Management of Power and Performance in Data Centers

  • Rajarshi Das
  • Jeffrey O. Kephart
  • Charles Lefurgy
  • Gerald Tesauro
  • David W. Levine
  • Hoi Chan

The rapidly rising cost and environmental impact of energy consumption in data centers has become a multi-billion dollar concern globally. In response, the IT Industry is actively engaged in a first-to-market race to develop energy-conserving hardware and software solutions that do not sacrifice performance objectives. In this work we demonstrate a prototype of an integrated data center power management solution that employs server management tools, appropriate sensors and monitors, and an agent-based approach to achieve specified power and performance objectives. By intelligently turning off servers under low-load conditions, we can achieve over 25% power savings over the unmanaged case without incurring SLA penalties for typical daily and weekly periodic demands seen in webserver farms.

NeurIPS Conference 2007 Conference Paper

Managing Power Consumption and Performance of Computing Systems Using Reinforcement Learning

  • Gerald Tesauro
  • Rajarshi Das
  • Hoi Chan
  • Jeffrey Kephart
  • David Levine
  • Freeman Rawson
  • Charles Lefurgy

Electrical power management in large-scale IT systems such as commercial datacenters is an application area of rapidly growing interest from both an economic and ecological perspective, with billions of dollars and millions of metric tons of CO2 emissions at stake annually. Businesses want to save power without sacrificing performance. This paper presents a reinforcement learning approach to simultaneous online management of both performance and power consumption. We apply RL in a realistic laboratory testbed using a Blade cluster and dynamically varying HTTP workload running on a commercial web applications middleware platform. We embed a CPU frequency controller in the Blade servers’ firmware, and we train policies for this controller using a multi-criteria reward signal depending on both application performance and CPU power consumption. Our testbed scenario posed a number of challenges to successful use of RL, including multiple disparate reward functions, limited decision sampling rates, and pathologies arising when using multiple sensor readings as state variables. We describe innovative practical solutions to these challenges, and demonstrate clear performance improvements over both hand-designed policies as well as obvious “cookbook” RL implementations.

AAAI Conference 2005 Conference Paper

New Approaches to Optimization and Utility Elicitation in Autonomic Computing

  • Relu Patrascu
  • Rajarshi Das
  • Gerald Tesauro

Autonomic (self-managing) computing systems face the critical problem of resource allocation to different computing elements. Adopting a recent model, we view the problem of provisioning resources as involving utility elicitation and optimization to allocate resources given imprecise utility information. In this paper, we propose a new algorithm for regret-based optimization that performs significantly faster than that proposed in earlier work. We also explore new regret-based elicitation heuristics that are able to find near-optimal allocations while requiring a very small amount of utility information from the distributed computing elements. Since regret computation is intensive, we compare these to the more tractable Nelder-Mead optimization technique w.r.t. the amount of utility information required.

AAAI Conference 2005 Conference Paper

Online Resource Allocation Using Decompositional Reinforcement Learning

  • Gerald Tesauro

This paper considers a novel application domain for reinforcement learning: that of “autonomic computing,” i.e., self-managing computing systems. RL is applied to an online resource allocation task in a distributed multi-application computing environment with independent time-varying load in each application. The task is to allocate servers in real time so as to maximize the sum of performance-based expected utility in each application. This task may be treated as a composite MDP, and to exploit the problem structure, a simple localized RL approach is proposed, with better scalability than previous approaches. The RL approach is tested in a realistic prototype data center comprising real servers, real HTTP requests, and realistic time-varying demand. This domain poses a number of major challenges associated with live training in a real system, including: the need for rapid training, exploration that avoids excessive penalties, and handling complex, potentially non-Markovian system effects. The early results are encouraging: in overnight training, RL performs as well as or slightly better than heavily researched model-based approaches derived from queuing theory.
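
A toy sketch of the decompositional idea: one local value table per application, with a global arbiter choosing the server split that maximizes the summed local values. The utility function and demand process are invented, and the local backup ignores the joint constraint for brevity:

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
N_SERVERS, N_APPS, DEMAND_LEVELS = 6, 2, 3
ALPHA, GAMMA = 0.1, 0.5

# One local value table per application: V[i][demand, servers assigned].
V = [np.zeros((DEMAND_LEVELS, N_SERVERS + 1)) for _ in range(N_APPS)]

def utility(demand, servers):
    # Toy SLA-style utility: diminishing returns in servers, scaled by demand.
    return (demand + 1) * (1 - np.exp(-servers)) - 0.1 * servers

def allocate(demands):
    # Global arbitration: the split of servers maximizing summed local values.
    best, best_v = None, -np.inf
    for split in itertools.product(range(N_SERVERS + 1), repeat=N_APPS):
        if sum(split) != N_SERVERS:
            continue
        v = sum(V[i][demands[i], split[i]] for i in range(N_APPS))
        if v > best_v:
            best, best_v = split, v
    return best

demands = [0] * N_APPS
for t in range(5000):
    split = allocate(demands)
    nxt = [int(rng.integers(DEMAND_LEVELS)) for _ in range(N_APPS)]
    for i in range(N_APPS):
        r = utility(demands[i], split[i])
        # Local backup; maxing over own servers ignores the joint constraint.
        target = r + GAMMA * V[i][nxt[i]].max()
        V[i][demands[i], split[i]] += ALPHA * (target - V[i][demands[i], split[i]])
    demands = nxt

print("allocation at peak demand:", allocate([DEMAND_LEVELS - 1] * N_APPS))
```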

UAI Conference 2003 Conference Paper

Cooperative Negotiation in Autonomic Systems using Incremental Utility Elicitation

  • Craig Boutilier
  • Rajarshi Das
  • Jeffrey O. Kephart
  • Gerald Tesauro
  • William E. Walsh

Decentralized resource allocation is a key problem for large-scale autonomic (or self-managing) computing systems. Motivated by a data center scenario, we explore efficient techniques for resolving resource conflicts via cooperative negotiation. Rather than computing in advance the functional dependence of each element's utility upon the amount of resource it receives, which could be prohibitively expensive, each element's utility is elicited incrementally. Such incremental utility elicitation strategies require the evaluation of only a small set of sampled utility function points, yet they find near-optimal allocations with respect to a minimax regret criterion. We describe preliminary computational experiments that illustrate the benefit of our approach.

NeurIPS Conference 2003 Conference Paper

Extending Q-Learning to General Adaptive Multi-Agent Systems

  • Gerald Tesauro

Recent multi-agent extensions of Q-Learning require knowledge of other agents’ payoffs and Q-functions, and assume game-theoretic play at all times by all other agents. This paper proposes a fundamentally different approach, dubbed “Hyper-Q” Learning, in which values of mixed strategies rather than base actions are learned, and in which other agents’ strategies are estimated from observed actions via Bayesian inference. Hyper-Q may be effective against many different types of adaptive agents, even if they are persistently dynamic. Against certain broad categories of adaptation, it is argued that Hyper-Q may converge to exact optimal time-varying policies. In tests using Rock-Paper-Scissors, Hyper-Q learns to significantly exploit an Infinitesimal Gradient Ascent (IGA) player, as well as a Policy Hill Climber (PHC) player. Preliminary analysis of Hyper-Q against itself is also presented.
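
A coarse-grid sketch of the Hyper-Q idea on Rock-Paper-Scissors: mixed strategies live on a discretized simplex, and an exponential moving average stands in for the paper's Bayesian opponent-strategy inference. The adaptive opponent here is invented, not IGA or PHC:

```python
import itertools
import numpy as np

rng = np.random.default_rng(4)
# Own mixed strategies and opponent estimates live on a coarse simplex grid
# (steps of 1/4), a stand-in for a finer discretization.
GRID = [np.array(p) / 4.0 for p in itertools.product(range(5), repeat=3)
        if sum(p) == 4]
ALPHA, GAMMA, MU, EPS = 0.1, 0.9, 0.02, 0.1

def payoff(a, b):
    # Rock-paper-scissors from our perspective: 1 win, -1 loss, 0 draw.
    return [[0, -1, 1], [1, 0, -1], [-1, 1, 0]][a][b]

def nearest(p):
    # Bin a strategy estimate to the closest grid point.
    return min(range(len(GRID)), key=lambda i: np.abs(GRID[i] - p).sum())

Q = np.zeros((len(GRID), len(GRID)))  # Q(opponent-estimate bin, our mixed strategy)
est, ours = np.ones(3) / 3, np.ones(3) / 3

for t in range(20000):
    s = nearest(est)
    x = rng.integers(len(GRID)) if rng.random() < EPS else int(Q[s].argmax())
    a = rng.choice(3, p=GRID[x])
    b = (int(np.argmax(ours)) + 1) % 3          # crude adaptive opponent:
                                                # counter our most frequent action
    ours = (1 - MU) * ours + MU * np.eye(3)[a]
    est = (1 - MU) * est + MU * np.eye(3)[b]    # EMA in place of Bayesian inference
    s2 = nearest(est)
    Q[s, x] += ALPHA * (payoff(a, b) + GAMMA * Q[s2].max() - Q[s, x])

print("learned reply to estimated opponent:", GRID[int(Q[nearest(est)].argmax())])
```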

NeurIPS Conference 1996 Conference Paper

On-line Policy Improvement using Monte-Carlo Search

  • Gerald Tesauro
  • Gregory Galperin

We present a Monte-Carlo simulation algorithm for real-time policy improvement of an adaptive controller. In the Monte-Carlo simulation, the long-term expected reward of each possible action is statistically measured, using the initial policy to make decisions in each step of the simulation. The action maximizing the measured expected reward is then taken, resulting in an improved policy. Our algorithm is easily parallelizable and has been implemented on the IBM SP1 and SP2 parallel-RISC supercomputers. We have obtained promising initial results in applying this algorithm to the domain of backgammon. Results are reported for a wide variety of initial policies, ranging from a random policy to TD-Gammon, an extremely strong multi-layer neural network. In each case, the Monte-Carlo algorithm gives a substantial reduction, by as much as a factor of 5 or more, in the error rate of the base players. The algorithm is also potentially useful in many other adaptive control applications in which it is possible to simulate the environment.
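
The core of the method fits in a few lines: estimate each action's long-term reward by rolling out the base policy, then act greedily. The function names and signatures below are hypothetical placeholders for a concrete simulator:

```python
def rollout_value(state, action, base_policy, step, terminal, value, n=32):
    """Estimate the long-term reward of taking `action` now and following
    base_policy thereafter, averaged over n simulated games."""
    total = 0.0
    for _ in range(n):
        s = step(state, action)
        while not terminal(s):
            s = step(s, base_policy(s))
        total += value(s)
    return total / n

def improved_policy(state, legal, base_policy, step, terminal, value):
    # One step of policy improvement: act greedily on the rollout estimates.
    return max(legal(state),
               key=lambda a: rollout_value(state, a, base_policy,
                                           step, terminal, value))
```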

NeurIPS Conference 1991 Conference Paper

Practical Issues in Temporal Difference Learning

  • Gerald Tesauro

This paper examines whether temporal difference methods for training connectionist networks, such as Sutton's TD(λ) algorithm, can be successfully applied to complex real-world problems. A number of important practical issues are identified and discussed from a general theoretical perspective. These practical issues are then examined in the context of a case study in which TD(λ) is applied to learning the game of backgammon from the outcome of self-play. This is apparently the first application of this algorithm to a complex nontrivial task. It is found that, with zero knowledge built in, the network is able to learn from scratch to play the entire game at a fairly strong intermediate level of performance, which is clearly better than conventional commercial programs, and which in fact surpasses comparable networks trained on a massive human expert data set. The hidden units in these networks have apparently discovered useful features, a longstanding goal of computer games research. Furthermore, when a set of hand-crafted features is added to the input representation, the resulting networks reach a near-expert level of performance, and have achieved good results against world-class human play.
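
For reference, a linear TD(λ) prediction update for an episodic self-play game (the paper trains a multilayer network, but the eligibility-trace mechanics are the same; the linear scorer keeps this short):

```python
import numpy as np

def td_lambda_episode(w, features, outcome, lam=0.7, alpha=0.1, gamma=1.0):
    """One self-play episode of linear TD(lambda). `features` is the list
    of observed position feature vectors; `outcome` is the final result,
    the only reward in a game."""
    e = np.zeros_like(w)                  # eligibility trace
    for t in range(len(features)):
        v = features[t] @ w
        v_next = outcome if t + 1 == len(features) else features[t + 1] @ w
        delta = gamma * v_next - v        # no intermediate rewards
        e = gamma * lam * e + features[t]
        w += alpha * delta * e
    return w
```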

NeurIPS Conference 1990 Conference Paper

Can neural networks do better than the Vapnik-Chervonenkis bounds?

  • David Cohn
  • Gerald Tesauro

We describe a series of careful numerical experiments which measure the average generalization capability of neural networks trained on a variety of simple functions. These experiments are designed to test whether average generalization performance can surpass the worst-case bounds obtained from formal learning theory using the Vapnik-Chervonenkis dimension (Blumer et al., 1989). We indeed find that, in some cases, the average generalization is significantly better than the VC bound: the approach to perfect performance is exponential in the number of examples m, rather than the 1/m result of the bound. In other cases, we do find the 1/m behavior of the VC bound, and in these cases, the numerical prefactor is closely related to the prefactor contained in the bound.

NeurIPS Conference 1989 Conference Paper

Asymptotic Convergence of Backpropagation: Numerical Experiments

  • Subutai Ahmad
  • Gerald Tesauro
  • Yu He

We have calculated, both analytically and in simulations, the rate of convergence at long times in the backpropagation learning algorithm for networks with and without hidden units. Our basic finding for units using the standard sigmoid transfer function is 1/t convergence of the error for large t, with at most logarithmic corrections for networks with hidden units. Other transfer functions may lead to a slower polynomial rate of convergence. Our analytic calculations were presented in (Tesauro, He & Ahmad, 1989). Here we focus in more detail on our empirical measurements of the convergence rate in numerical simulations, which confirm our analytic results.

NeurIPS Conference 1989 Conference Paper

Neural Network Visualization

  • Jakub Wejchert
  • Gerald Tesauro

We have developed graphics to visualize static and dynamic information in layered neural network learning systems. Emphasis was placed on creating new visuals that make use of spatial arrangements, size information, animation and color. We applied these tools to the study of back-propagation learning of simple Boolean predicates, and have obtained new insights into the dynamics of the learning process.

NeurIPS Conference 1988 Conference Paper

Connectionist Learning of Expert Preferences by Comparison Training

  • Gerald Tesauro

A new training paradigm, called the “comparison paradigm,” is introduced for tasks in which a network must learn to choose a preferred pattern from a set of n alternatives, based on examples of human expert preferences. In this paradigm, the input to the network consists of two of the n alternatives, and the trained output is the expert's judgement of which pattern is better. This paradigm is applied to the learning of backgammon, a difficult board game in which the expert selects a move from a set of legal moves. With comparison training, much higher levels of performance can be achieved, with networks that are much smaller, and with coding schemes that are much simpler and easier to understand. Furthermore, it is possible to set up the network so that it always produces consistent rank-orderings.
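
A small sketch of comparison training with a linear scorer and a pairwise logistic loss; because one shared scoring function ranks all alternatives, the rank-ordering is consistent by construction. The features here are synthetic, not backgammon positions:

```python
import numpy as np

rng = np.random.default_rng(5)
D, ALPHA = 8, 0.05
w = np.zeros(D)

def comparison_update(better, worse):
    """One comparison-training step: the same scorer is applied to both
    alternatives, and the pairwise logistic loss pushes the preferred
    pattern's score above the other's."""
    global w
    margin = (better - worse) @ w
    p = 1.0 / (1.0 + np.exp(margin))    # gradient weight of the logistic loss
    w += ALPHA * p * (better - worse)

# Toy usage: the "expert" prefers patterns with a larger first feature.
for _ in range(2000):
    a, b = rng.normal(size=D), rng.normal(size=D)
    comparison_update(*((a, b) if a[0] > b[0] else (b, a)))
print("learned weight on feature 0:", w[0])
```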

NeurIPS Conference 1988 Conference Paper

Scaling and Generalization in Neural Networks: A Case Study

  • Subutai Ahmad
  • Gerald Tesauro

The issues of scaling and generalization have emerged as key issues in current studies of supervised learning from examples in neural networks. Questions such as how many training patterns and training cycles are needed for a problem of a given size and difficulty, how to represent the inputs, and how to choose useful training exemplars, are of considerable theoretical and practical importance. Several intuitive rules of thumb have been obtained from empirical studies, but as yet there are few rigorous results. In this paper we summarize a study of generalization in the simplest possible case: perceptron networks learning linearly separable functions. The task chosen was the majority function (i.e., return a 1 if a majority of the input units are on), a predicate with a number of useful properties. We find that many aspects of generalization in multilayer networks learning large, difficult tasks are reproduced in this simple domain, in which concrete numerical results and even some analytic understanding can be achieved.
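
The studied setting is easy to reproduce: a plain perceptron learning the majority predicate, with generalization measured on held-out random patterns (training sizes and constants below are arbitrary choices, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 15                                  # input units (odd, so no ties)
w, b = np.zeros(N), 0.0

def majority(x):
    # Target predicate from the paper: 1 iff a majority of inputs are on.
    return 1 if x.sum() > N / 2 else 0

# Plain perceptron rule on random training patterns.
for _ in range(5000):
    x = rng.integers(0, 2, size=N)
    y = majority(x)
    pred = 1 if x @ w + b > 0 else 0
    w += (y - pred) * x
    b += (y - pred)

test = rng.integers(0, 2, size=(1000, N))
acc = np.mean([(1 if t @ w + b > 0 else 0) == majority(t) for t in test])
print("generalization accuracy:", acc)
```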

NeurIPS Conference 1987 Conference Paper

A 'Neural' Network that Learns to Play Backgammon

  • Gerald Tesauro
  • Terrence Sejnowski

We describe a class of connectionist networks that have learned to play backgammon at an intermediate-to-advanced level. The networks were trained by a supervised learning procedure on a large set of sample positions evaluated by a human expert. In actual match play against humans and conventional computer programs, the networks demonstrate substantial ability to generalize on the basis of expert knowledge. Our study touches on some of the most important issues in network learning theory, including the development of efficient coding schemes and training procedures, scaling, generalization, the use of real-valued inputs and outputs, and techniques for escaping from local minima. Practical applications in games and other domains are also discussed.