Arrow Research search

Author name cluster

Thore Graepel

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

50 papers
2 author rows

Possible papers

50

NeurIPS Conference 2025 Conference Paper

PerturBench: Benchmarking Machine Learning Models for Cellular Perturbation Analysis

  • Yan Wu
  • Esther Wershof
  • Sebastian Schmon
  • Marcel Nassar
  • Błażej Osiński
  • Ridvan Eksi
  • Zichao Yan
  • Rory Stark

We introduce a comprehensive framework for modeling single-cell transcriptomic responses to perturbations, aimed at standardizing benchmarking in this rapidly evolving field. Our approach includes a modular and user-friendly model development and evaluation platform, a collection of diverse perturbational datasets, and a set of metrics designed to fairly compare models and dissect their performance. Through extensive evaluation of both published and baseline models across diverse datasets, we highlight the limitations of widely used models, such as mode collapse. We also demonstrate the importance of rank metrics, which complement traditional model fit measures, such as RMSE, for validating model effectiveness. Notably, our results show that while no single model architecture clearly outperforms others, simpler architectures are generally competitive and scale well with larger datasets. Overall, this benchmarking exercise sets new standards for model evaluation, supports robust model development, and furthers the use of these models to simulate genetic and chemical screens for therapeutic discovery.

ICLR Conference 2022 Conference Paper

EigenGame Unloaded: When playing games is better than optimizing

  • Ian Gemp
  • Brian McWilliams
  • Claire Vernade
  • Thore Graepel

We build on the recently proposed EigenGame that views eigendecomposition as a competitive game. EigenGame's updates are biased if computed using minibatches of data, which hinders convergence and more sophisticated parallelism in the stochastic setting. In this work, we propose an unbiased stochastic update that is asymptotically equivalent to EigenGame, enjoys greater parallelism allowing computation on datasets of larger sample sizes, and outperforms EigenGame in experiments. We present applications to finding the principal components of massive datasets and performing spectral clustering of graphs. We analyze and discuss our proposed update in the context of EigenGame and the shift in perspective from optimization to games.

ICLR Conference 2022 Conference Paper

NeuPL: Neural Population Learning

  • Siqi Liu 0002
  • Luke Marris
  • Daniel Hennes
  • Josh Merel
  • Nicolas Heess
  • Thore Graepel

Learning in strategy games (e.g. StarCraft, poker) requires the discovery of diverse policies. This is often achieved by iteratively training new policies against existing ones, growing a policy population that is robust to exploit. This iterative approach suffers from two issues in real-world games: a) under finite budget, approximate best-response operators at each iteration need truncating, resulting in under-trained good-responses populating the population; b) repeated learning of basic skills at each iteration is wasteful and becomes intractable in the presence of increasingly strong opponents. In this work, we propose Neural Population Learning (NeuPL) as a solution to both issues. NeuPL offers convergence guarantees to a population of best-responses under mild assumptions. By representing a population of policies within a single conditional model, NeuPL enables transfer learning across policies. Empirically, we show the generality, improved performance and efficiency of NeuPL across several test domains. Most interestingly, we show that novel strategies become more accessible, not less, as the neural population expands.

IJCAI Conference 2021 Conference Paper

A Neural Network Auction For Group Decision Making Over a Continuous Space

  • Yoram Bachrach
  • Ian Gemp
  • Marta Garnelo
  • Janos Kramar
  • Tom Eccles
  • Dan Rosenbaum
  • Thore Graepel

We propose a system for conducting an auction over locations in a continuous space. It enables participants to express their preferences over possible choices of location in the space, selecting the location that maximizes the total utility of all agents. We prevent agents from tricking the system into selecting a location that improves their individual utility at the expense of others by using a pricing rule that gives agents no incentive to misreport their true preferences. The system queries participants for their utility in many random locations, then trains a neural network to approximate the preference function of each participant. The parameters of these neural network models are transmitted and processed by the auction mechanism, which composes these into differentiable models that are optimized through gradient ascent to compute the final chosen location and charged prices.
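The final stage described above, composing per-agent utility models and ascending the summed-utility gradient, can be sketched as follows. The quadratic utilities here are hypothetical stand-ins for the trained per-participant networks, and `select_location` and `make_grad` are illustrative names, not the paper's API:

```python
import numpy as np

def make_grad(p):
    """Gradient of a hypothetical quadratic utility u(x) = -||x - p||^2,
    standing in for a trained per-participant utility network."""
    return lambda x: 2.0 * (p - x)

def select_location(utility_grads, x0, lr=0.1, steps=200):
    """Ascend the gradient of the summed utilities to pick the final
    location (a sketch of the mechanism's optimization stage)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x + lr * sum(g(x) for g in utility_grads)
    return x
```

With quadratic utilities the total-utility maximizer is simply the mean of the agents' ideal points, which makes the sketch easy to sanity-check.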

ICLR Conference 2021 Conference Paper

EigenGame: PCA as a Nash Equilibrium

  • Ian Gemp
  • Brian McWilliams
  • Claire Vernade
  • Thore Graepel

We present a novel view on principal components analysis as a competitive game in which each approximate eigenvector is controlled by a player whose goal is to maximize their own utility function. We analyze the properties of this PCA game and the behavior of its gradient based updates. The resulting algorithm---which combines elements from Oja's rule with a generalized Gram-Schmidt orthogonalization---is naturally decentralized and hence parallelizable through message passing. We demonstrate the scalability of the algorithm with experiments on large image datasets and neural network activations. We discuss how this new view of PCA as a differentiable game can lead to further algorithmic developments and insights.
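The update the abstract describes, combining an Oja-style reward with generalized Gram-Schmidt penalties from "parent" players, can be sketched in a full-batch form. This is a minimal simplification for illustration, assuming a precomputed covariance matrix; the function name, step size, and iteration counts are assumptions, not the paper's implementation:

```python
import numpy as np

def eigengame(M, k, steps=2000, lr=0.1, seed=0):
    """Full-batch sketch of an EigenGame-style update on covariance M:
    player i ascends its utility gradient M v_i, penalized by the
    M-components along parent players j < i, then renormalizes."""
    rng = np.random.default_rng(seed)
    d = M.shape[0]
    V = rng.normal(size=(d, k))
    V /= np.linalg.norm(V, axis=0)
    for _ in range(steps):
        for i in range(k):
            v = V[:, i]
            grad = M @ v
            for j in range(i):  # generalized Gram-Schmidt penalties
                u = V[:, j]
                grad -= (v @ M @ u) / (u @ M @ u) * (M @ u)
            v = v + lr * grad   # gradient ascent on the utility
            V[:, i] = v / np.linalg.norm(v)
    return V
```

On a well-conditioned matrix with distinct eigenvalues, the columns of `V` align with the top-k eigenvectors, matching `np.linalg.eigh` up to sign.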

JAIR Journal 2021 Journal Article

Game Plan: What AI can do for Football, and What Football can do for AI

  • Karl Tuyls
  • Shayegan Omidshafiei
  • Paul Muller
  • Zhe Wang
  • Jerome Connor
  • Daniel Hennes
  • Ian Graham
  • William Spearman

The rapid progress in artificial intelligence (AI) and machine learning has opened unprecedented analytics possibilities in various team and individual sports, including baseball, basketball, and tennis. More recently, AI techniques have been applied to football, due to a huge increase in data collection by professional teams, increased computational power, and advances in machine learning, with the goal of better addressing new scientific challenges involved in the analysis of both individual players’ and coordinated teams’ behaviors. The research challenges associated with predictive and prescriptive football analytics require new developments and progress at the intersection of statistical learning, game theory, and computer vision. In this paper, we provide an overarching perspective highlighting how the combination of these fields, in particular, forms a unique microcosm for AI research, while offering mutual benefits for professional teams, spectators, and broadcasters in the years to come. We illustrate that this duality makes football analytics a game changer of tremendous value, in terms of not only changing the game of football itself, but also in terms of what this domain can mean for the field of AI. We review the state-of-the-art and exemplify the types of analysis enabled by combining the aforementioned fields, including illustrative examples of counterfactual analysis using predictive models, and the combination of game-theoretic analysis of penalty kicks with statistical learning of player attributes. We conclude by highlighting envisioned downstream impacts, including possibilities for extensions to other sports (real and virtual).

ICML Conference 2021 Conference Paper

Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers

  • Luke Marris
  • Paul Muller
  • Marc Lanctot
  • Karl Tuyls
  • Thore Graepel

Two-player, constant-sum games are well studied in the literature, but there has been limited progress outside of this setting. We propose Joint Policy-Space Response Oracles (JPSRO), an algorithm for training agents in n-player, general-sum extensive form games, which provably converges to an equilibrium. We further suggest correlated equilibria (CE) as promising meta-solvers, and propose a novel solution concept Maximum Gini Correlated Equilibrium (MGCE), a principled and computationally efficient family of solutions for solving the correlated equilibrium selection problem. We conduct several experiments using CE meta-solvers for JPSRO and demonstrate convergence on n-player, general-sum games.

ICML Conference 2021 Conference Paper

Scalable Evaluation of Multi-Agent Reinforcement Learning with Melting Pot

  • Joel Z. Leibo
  • Edgar A. Duéñez-Guzmán
  • Alexander Sasha Vezhnevets
  • John P. Agapiou
  • Peter Sunehag
  • Raphael Koster
  • Jayd Matyas
  • Charlie Beattie

Existing evaluation suites for multi-agent reinforcement learning (MARL) do not assess generalization to novel situations as their primary objective (unlike supervised learning benchmarks). Our contribution, Melting Pot, is a MARL evaluation suite that fills this gap and uses reinforcement learning to reduce the human labor required to create novel test scenarios. This works because one agent’s behavior constitutes (part of) another agent’s environment. To demonstrate scalability, we have created over 80 unique test scenarios covering a broad range of research topics such as social dilemmas, reciprocity, resource sharing, and task partitioning. We apply these test scenarios to standard MARL training algorithms, and demonstrate how Melting Pot reveals weaknesses not apparent from training performance alone.

ICLR Conference 2020 Conference Paper

A Generalized Training Approach for Multiagent Learning

  • Paul Muller
  • Shayegan Omidshafiei
  • Mark Rowland 0001
  • Karl Tuyls
  • Julien Pérolat
  • Siqi Liu 0002
  • Daniel Hennes
  • Luke Marris

This paper investigates a population-based training regime based on game-theoretic principles called Policy-Space Response Oracles (PSRO). PSRO is general in the sense that it (1) encompasses well-known algorithms such as fictitious play and double oracle as special cases, and (2) in principle applies to general-sum, many-player games. Despite this, prior studies of PSRO have focused on two-player zero-sum games, a regime wherein Nash equilibria are tractably computable. In moving from two-player zero-sum games to more general settings, computation of Nash equilibria quickly becomes infeasible. Here, we extend the theoretical underpinnings of PSRO by considering an alternative solution concept, α-Rank, which is unique (thus faces no equilibrium selection issues, unlike Nash) and applies readily to general-sum, many-player settings. We establish convergence guarantees in several game classes, and identify links between Nash equilibria and α-Rank. We demonstrate the competitive performance of α-Rank-based PSRO against an exact Nash solver-based PSRO in 2-player Kuhn and Leduc Poker. We then go beyond the reach of prior PSRO applications by considering 3- to 5-player poker games, yielding instances where α-Rank achieves faster convergence than approximate Nash solvers, thus establishing it as a favorable general games solver. We also carry out an initial empirical validation in MuJoCo soccer, illustrating the feasibility of the proposed approach in another complex domain.

NeurIPS Conference 2020 Conference Paper

Learning to Play No-Press Diplomacy with Best Response Policy Iteration

  • Thomas Anthony
  • Tom Eccles
  • Andrea Tacchetti
  • János Kramár
  • Ian Gemp
  • Thomas Hudson
  • Nicolas Porcel
  • Marc Lanctot

Recent advances in deep reinforcement learning (RL) have led to considerable progress in many 2-player zero-sum games, such as Go, poker, and StarCraft. The purely adversarial nature of such games allows for conceptually simple and principled application of RL methods. However, real-world settings are many-agent, and agent interactions are complex mixtures of common-interest and competitive aspects. We consider Diplomacy, a 7-player board game designed to accentuate dilemmas resulting from many-agent interactions. It also features a large combinatorial action space and simultaneous moves, which are challenging for RL algorithms. We propose a simple yet effective approximate best response operator, designed to handle large combinatorial action spaces and simultaneous moves. We also introduce a family of policy iteration methods that approximate fictitious play. With these methods, we successfully apply RL to Diplomacy: we show that our agents convincingly outperform the previous state-of-the-art, and game theoretic equilibrium analysis shows that the new process yields consistent improvements.

ICLR Conference 2020 Conference Paper

Smooth markets: A basic mechanism for organizing gradient-based learners

  • David Balduzzi
  • Wojciech Czarnecki 0001
  • Thomas W. Anthony 0001
  • Ian Gemp
  • Edward Hughes 0001
  • Joel Z. Leibo
  • Georgios Piliouras
  • Thore Graepel

With the success of modern machine learning, it is becoming increasingly important to understand and control how learning algorithms interact. Unfortunately, negative results from game theory show there is little hope of understanding or controlling general n-player games. We therefore introduce smooth markets (SM-games), a class of n-player games with pairwise zero sum interactions. SM-games codify a common design pattern in machine learning that includes some GANs, adversarial training, and other recent algorithms. We show that SM-games are amenable to analysis and optimization using first-order methods.

NeurIPS Conference 2019 Conference Paper

Biases for Emergent Communication in Multi-agent Reinforcement Learning

  • Tom Eccles
  • Yoram Bachrach
  • Guy Lever
  • Angeliki Lazaridou
  • Thore Graepel

We study the problem of emergent communication, in which language arises because speakers and listeners must communicate information in order to solve tasks. In temporally extended reinforcement learning domains, it has proved hard to learn such communication without centralized training of agents, due in part to a difficult joint exploration problem. We introduce inductive biases for positive signalling and positive listening, which ease this problem. In a simple one-step environment, we demonstrate how these biases ease the learning problem. We also apply our methods to a more extended environment, showing that agents with these inductive biases achieve better performance, and analyse the resulting communications protocols.

JAAMAS Journal 2019 Journal Article

Bounds and dynamics for empirical game theoretic analysis

  • Karl Tuyls
  • Julien Perolat
  • Thore Graepel

This paper provides several theoretical results for empirical game theory. Specifically, we introduce bounds for empirical game theoretical analysis of complex multi-agent interactions. In doing so we provide insights in the empirical meta game showing that a Nash equilibrium of the estimated meta-game is an approximate Nash equilibrium of the true underlying meta-game. We investigate and show how many data samples are required to obtain a close enough approximation of the underlying game. Additionally, we extend the evolutionary dynamics analysis of meta-games using heuristic payoff tables (HPTs) to asymmetric games. The state-of-the-art has only considered evolutionary dynamics of symmetric HPTs in which agents have access to the same strategy sets and the payoff structure is symmetric, implying that agents are interchangeable. Finally, we carry out an empirical illustration of the generalised method in several domains, illustrating the theory and evolutionary dynamics of several versions of the AlphaGo algorithm (symmetric), the dynamics of the Colonel Blotto game played by human players on Facebook (symmetric), the dynamics of several teams of players in the capture the flag game (symmetric), and an example of a meta-game in Leduc Poker (asymmetric), generated by the policy-space response oracle multi-agent learning algorithm.

JMLR Journal 2019 Journal Article

Differentiable Game Mechanics

  • Alistair Letcher
  • David Balduzzi
  • Sébastien Racanière
  • James Martens
  • Jakob Foerster
  • Karl Tuyls
  • Thore Graepel

Deep learning is built on the foundational guarantee that gradient descent on an objective function converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, that exhibit multiple interacting losses. The behavior of gradient-based methods in games is not well understood -- and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new tools to understand and control the dynamics in $n$-player differentiable games. The key result is to decompose the game Jacobian into two components. The first, symmetric component, is related to potential games, which reduce to gradient descent on an implicit function. The second, antisymmetric component, relates to Hamiltonian games, a new class of games that obey a conservation law akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in differentiable games. Basic experiments show SGA is competitive with recently proposed algorithms for finding stable fixed points in GANs -- while at the same time being applicable to, and having guarantees in, much more general cases.
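The SGA update described here adjusts the simultaneous gradient ξ by the antisymmetric part of the game Jacobian, ξ + λAᵀξ with A = (J − Jᵀ)/2. A minimal sketch (function names and the finite, explicitly supplied Jacobian are illustrative assumptions), demonstrated on the bilinear game f(x, y) = xy where plain simultaneous descent cycles:

```python
import numpy as np

def sga_step(grad_fn, jac_fn, w, lr=0.05, lam=1.0):
    """One Symplectic Gradient Adjustment step (sketch): adjust the
    simultaneous gradient xi by the antisymmetric Jacobian part,
    xi_adj = xi + lam * A^T xi, with A = (J - J^T) / 2."""
    xi = grad_fn(w)
    J = jac_fn(w)
    A = 0.5 * (J - J.T)
    return w - lr * (xi + lam * (A.T @ xi))

# Bilinear zero-sum game: player 1 minimizes x*y, player 2 minimizes -x*y.
grad_fn = lambda w: np.array([w[1], -w[0]])          # simultaneous gradient
jac_fn = lambda w: np.array([[0.0, 1.0], [-1.0, 0.0]])
```

Iterating `sga_step` from any start drives the joint strategy to the stable fixed point at the origin, whereas unadjusted simultaneous gradient descent orbits it.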

AAMAS Conference 2019 Conference Paper

Malthusian Reinforcement Learning

  • Joel Z. Leibo
  • Julien Perolat
  • Edward Hughes
  • Steven Wheelwright
  • Adam H. Marblestone
  • Edgar Duéñez-Guzmán
  • Peter Sunehag
  • Iain Dunning

Here we explore a new algorithmic framework for multi-agent reinforcement learning, called Malthusian reinforcement learning, which extends self-play to include fitness-linked population size dynamics that drive ongoing innovation. In Malthusian RL, increases in a subpopulation’s average return drive subsequent increases in its size, just as Thomas Malthus argued in 1798 was the relationship between preindustrial income levels and population growth [24]. Malthusian reinforcement learning harnesses the competitive pressures arising from growing and shrinking population size to drive agents to explore regions of state and policy spaces that they could not otherwise reach. Furthermore, in environments where there are potential gains from specialization and division of labor, we show that Malthusian reinforcement learning is better positioned to take advantage of such synergies than algorithms based on self-play.

ICML Conference 2019 Conference Paper

Open-ended learning in symmetric zero-sum games

  • David Balduzzi
  • Marta Garnelo
  • Yoram Bachrach
  • Wojciech Czarnecki 0001
  • Julien Pérolat
  • Max Jaderberg
  • Thore Graepel

Zero-sum games such as chess and poker are, abstractly, functions that evaluate pairs of agents, for example labeling them ‘winner’ and ‘loser’. If the game is approximately transitive, then self-play generates sequences of agents of increasing strength. However, nontransitive games, such as rock-paper-scissors, can exhibit strategic cycles, and there is no longer a clear objective – we want agents to increase in strength, but against whom is unclear. In this paper, we introduce a geometric framework for formulating agent objectives in zero-sum games, in order to construct adaptive sequences of objectives that yield open-ended learning. The framework allows us to reason about population performance in nontransitive games, and enables the development of a new algorithm (rectified Nash response, PSRO_rN) that uses game-theoretic niching to construct diverse populations of effective agents, producing a stronger set of agents than existing algorithms. We apply PSRO_rN to two highly nontransitive resource allocation games and find that PSRO_rN consistently outperforms the existing alternatives.

AAMAS Conference 2019 Conference Paper

The Body is Not a Given: Joint Agent Policy Learning and Morphology Evolution

  • Dylan Banarse
  • Yoram Bachrach
  • Siqi Liu
  • Guy Lever
  • Nicolas Heess
  • Chrisantha Fernando
  • Pushmeet Kohli
  • Thore Graepel

Reinforcement learning (RL) has proven to be a powerful paradigm for deriving complex behaviors from simple reward signals in a wide range of environments. When applying RL to continuous control agents in simulated physics environments, the body is usually considered to be part of the environment. However, during evolution the physical body of biological organisms and their controlling brains are co-evolved, thus exploring a much larger space of actuator/controller configurations. Put differently, the intelligence does not reside only in the agent’s mind, but also in the design of their body. We propose a method for uncovering strong agents, consisting of a good combination of a body and policy, based on combining RL with an evolutionary procedure. Given the resulting agent, we also propose an approach for identifying the body changes that contributed the most to the agent performance. We use the Shapley value from cooperative game theory to find the fair contribution of individual components, taking into account synergies between components. We evaluate our methods in an environment similar to the recently proposed Robo-Sumo task, where agents in a software physics simulator compete in tipping over their opponent or pushing them out of the arena. Our results show that the proposed methods are indeed capable of generating strong agents, significantly outperforming baselines that focus on optimizing the agent policy alone. A video is available at: https://youtu.be/CHlecRim9PI
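The Shapley value used above to credit individual components can be computed exactly for small component sets by averaging marginal contributions over all orderings. A generic sketch (the function name and toy characteristic function are illustrative, not the paper's code):

```python
import itertools
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values by enumerating permutations: each player's
    share is its marginal contribution averaged over all join orders,
    which accounts for synergies between components."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for order in itertools.permutations(players):
        coalition = set()
        for p in order:
            before = value(frozenset(coalition))
            coalition.add(p)
            phi[p] += value(frozenset(coalition)) - before
    for p in players:
        phi[p] /= factorial(n)
    return phi
```

For example, if two components are only valuable together (v = 1 iff both 'a' and 'b' are present), each earns 0.5 and a third, inert component earns 0.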

AAMAS Conference 2018 Conference Paper

A Generalised Method for Empirical Game Theoretic Analysis

  • Karl Tuyls
  • Julien Perolat
  • Marc Lanctot
  • Joel Z. Leibo
  • Thore Graepel

This paper provides theoretical bounds for empirical game theoretical analysis of complex multi-agent interactions. We provide insights in the empirical meta game showing that a Nash equilibrium of the meta-game is an approximate Nash equilibrium of the true underlying game. We investigate and show how many data samples are required to obtain a close enough approximation of the underlying game. Additionally, we extend the meta-game analysis methodology to asymmetric games. The state-of-the-art has only considered empirical games in which agents have access to the same strategy sets and the payoff structure is symmetric, implying that agents are interchangeable. Finally, we carry out an empirical illustration of the generalised method in several domains, illustrating the theory and evolutionary dynamics of several versions of the AlphaGo algorithm (symmetric), the dynamics of the Colonel Blotto game played by human players on Facebook (symmetric), and an example of a meta-game in Leduc Poker (asymmetric), generated by the PSRO multi-agent learning algorithm.
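The evolutionary-dynamics analysis of symmetric meta-games mentioned here is typically driven by single-population replicator dynamics. A minimal Euler-step sketch (names and step size are assumptions):

```python
import numpy as np

def replicator_step(x, payoff, dt=0.01):
    """One Euler step of single-population replicator dynamics on a
    symmetric meta-game: dx_i/dt = x_i * ((A x)_i - x^T A x), so
    strategies with above-average fitness grow in the population."""
    f = payoff @ x          # fitness of each strategy vs. the population
    avg = x @ f             # population-average fitness
    return x + dt * x * (f - avg)
```

At the uniform mixture of rock-paper-scissors the dynamics are stationary, while in a game with a dominated strategy the population flows toward the dominant one.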

NeurIPS Conference 2018 Conference Paper

Inequity aversion improves cooperation in intertemporal social dilemmas

  • Edward Hughes
  • Joel Leibo
  • Matthew Phillips
  • Karl Tuyls
  • Edgar Dueñez-Guzman
  • Antonio García Castañeda
  • Iain Dunning
  • Tina Zhu

Groups of humans are often able to find ways to cooperate with one another in complex, temporally extended social dilemmas. Models based on behavioral economics are only able to explain this phenomenon for unrealistic stateless matrix games. Recently, multi-agent reinforcement learning has been applied to generalize social dilemma problems to temporally and spatially extended Markov games. However, this has not yet generated an agent that learns to cooperate in social dilemmas as humans do. A key insight is that many, but not all, human individuals have inequity averse social preferences. This promotes a particular resolution of the matrix game social dilemma wherein inequity-averse individuals are personally pro-social and punish defectors. Here we extend this idea to Markov games and show that it promotes cooperation in several types of sequential social dilemma, via a profitable interaction with policy learnability. In particular, we find that inequity aversion improves temporal credit assignment for the important class of intertemporal social dilemmas. These results help explain how large-scale cooperation may emerge and persist.
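The inequity-averse preferences described above follow the Fehr-Schmidt form, which the paper extends to Markov games by applying it to (temporally smoothed) rewards. A sketch of the per-step subjective reward; the parameter values here are illustrative, not the paper's settings:

```python
def inequity_averse_utility(rewards, i, alpha=5.0, beta=0.05):
    """Fehr-Schmidt-style subjective reward for agent i: the extrinsic
    reward minus a penalty for disadvantageous inequity (others earn
    more, weight alpha) and advantageous inequity (agent i earns more,
    weight beta)."""
    n = len(rewards)
    r_i = rewards[i]
    disadvantage = sum(max(r_j - r_i, 0.0) for r_j in rewards)
    advantage = sum(max(r_i - r_j, 0.0) for r_j in rewards)
    return r_i - alpha / (n - 1) * disadvantage - beta / (n - 1) * advantage
```

With these weights, a defector who earns 1 while its partner earns 0 keeps most of its reward (0.95), while the exploited partner experiences a strongly negative subjective reward, creating pressure toward cooperation and punishment of defectors.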

NeurIPS Conference 2018 Conference Paper

Re-evaluating evaluation

  • David Balduzzi
  • Karl Tuyls
  • Julien Perolat
  • Thore Graepel

Progress in machine learning is measured by careful evaluation on problems of outstanding common interest. However, the proliferation of benchmark suites and environments, adversarial attacks, and other complications has diluted the basic evaluation model by overwhelming researchers with choices. Deliberate or accidental cherry picking is increasingly likely, and designing well-balanced evaluation suites requires increasing effort. In this paper we take a step back and propose Nash averaging. The approach builds on a detailed analysis of the algebraic structure of evaluation in two basic scenarios: agent-vs-agent and agent-vs-task. The key strength of Nash averaging is that it automatically adapts to redundancies in evaluation data, so that results are not biased by the incorporation of easy tasks or weak agents. Nash averaging thus encourages maximally inclusive evaluation -- since there is no harm (computational cost aside) from including all available tasks and agents.

ICML Conference 2018 Conference Paper

The Mechanics of n-Player Differentiable Games

  • David Balduzzi
  • Sébastien Racanière
  • James Martens
  • Jakob N. Foerster
  • Karl Tuyls
  • Thore Graepel

The cornerstone underpinning deep learning is the guarantee that gradient descent on an objective converges to local minima. Unfortunately, this guarantee fails in settings, such as generative adversarial nets, where there are multiple interacting losses. The behavior of gradient-based methods in games is not well understood – and is becoming increasingly important as adversarial and multi-objective architectures proliferate. In this paper, we develop new techniques to understand and control the dynamics in general games. The key result is to decompose the second-order dynamics into two components. The first is related to potential games, which reduce to gradient descent on an implicit function; the second relates to Hamiltonian games, a new class of games that obey a conservation law, akin to conservation laws in classical mechanical systems. The decomposition motivates Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games. Basic experiments show SGA is competitive with recently proposed algorithms for finding local Nash equilibria in GANs – whilst at the same time being applicable to – and having guarantees in – much more general games.

AAMAS Conference 2018 Conference Paper

Value-Decomposition Networks For Cooperative Multi-Agent Learning Based On Team Reward

  • Peter Sunehag
  • Guy Lever
  • Audrunas Gruslys
  • Wojciech Marian Czarnecki
  • Vinicius Zambaldi
  • Max Jaderberg
  • Marc Lanctot
  • Nicolas Sonnerat

We study the problem of cooperative multi-agent reinforcement learning with a single joint reward signal. This class of learning problems is difficult because of the often large combined action and observation spaces. In the fully centralized and decentralized approaches, we find the problem of spurious rewards and a phenomenon we call the “lazy agent” problem, which arises due to partial observability. We address these problems by training individual agents with a novel value-decomposition network architecture, which learns to decompose the team value function into agent-wise value functions.
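The decomposition described above makes the joint action-value additive, Q_tot(s, a) = Σᵢ Qᵢ(s, aᵢ), so greedy joint action selection factorizes across agents. A sketch with per-agent Q-values given as arrays (names are illustrative; the paper learns each Qᵢ with a network):

```python
import numpy as np

def vdn_joint_q(per_agent_q, actions):
    """Joint action-value under value decomposition: the sum of each
    agent's value for its own action."""
    return sum(q[a] for q, a in zip(per_agent_q, actions))

def greedy_joint_action(per_agent_q):
    """Because Q_tot is additive, its argmax decomposes: each agent can
    greedily maximize its own Q_i independently, avoiding a search over
    the exponentially large joint action space."""
    return [int(np.argmax(q)) for q in per_agent_q]
```

The independent per-agent argmax provably attains the same value as brute-force maximization over all joint actions, which is the point of the decomposition.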

NeurIPS Conference 2017 Conference Paper

A multi-agent reinforcement learning model of common-pool resource appropriation

  • Julien Pérolat
  • Joel Leibo
  • Vinicius Zambaldi
  • Charles Beattie
  • Karl Tuyls
  • Thore Graepel

Humanity faces numerous problems of common-pool resource appropriation. This class of multi-agent social dilemma includes the problems of ensuring sustainable use of fresh water, common fisheries, grazing pastures, and irrigation systems. Abstract models of common-pool resource appropriation based on non-cooperative game theory predict that self-interested agents will generally fail to find socially positive equilibria---a phenomenon called the tragedy of the commons. However, in reality, human societies are sometimes able to discover and implement stable cooperative solutions. Decades of behavioral game theory research have sought to uncover aspects of human behavior that make this possible. Most of that work was based on laboratory experiments where participants only make a single choice: how much to appropriate. Recognizing the importance of spatial and temporal resource dynamics, a recent trend has been toward experiments in more complex real-time video game-like environments. However, standard methods of non-cooperative game theory can no longer be used to generate predictions for this case. Here we show that deep reinforcement learning can be used instead. To that end, we study the emergent behavior of groups of independently learning agents in a partially observed Markov game modeling common-pool resource appropriation. Our experiments highlight the importance of trial-and-error learning in common-pool resource appropriation and shed light on the relationship between exclusion, sustainability, and inequality.

NeurIPS Conference 2017 Conference Paper

A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning

  • Marc Lanctot
  • Vinicius Zambaldi
  • Audrunas Gruslys
  • Angeliki Lazaridou
  • Karl Tuyls
  • Julien Perolat
  • David Silver
  • Thore Graepel

There has been a resurgence of interest in multiagent reinforcement learning (MARL), due partly to the recent success of deep neural networks. The simplest form of MARL is independent reinforcement learning (InRL), where each agent treats all of its experience as part of its (non-stationary) environment. In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe a meta-algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game theoretic analysis to compute meta-strategies for policy selection. The meta-algorithm generalizes previous algorithms such as InRL, iterated best response, double oracle, and fictitious play. Then, we propose a scalable implementation which reduces the memory requirement using decoupled meta-solvers. Finally, we demonstrate the generality of the resulting policies in three partially observable settings: gridworld coordination problems, emergent language games, and poker.

AAMAS Conference 2017 Conference Paper

Multi-agent Reinforcement Learning in Sequential Social Dilemmas

  • Joel Z. Leibo
  • Vinicius Zambaldi
  • Marc Lanctot
  • Janusz Marecki
  • Thore Graepel

Matrix games like Prisoner’s Dilemma have guided research on social dilemmas for decades. However, they necessarily treat the choice to cooperate or defect as an atomic action. In real-world social dilemmas these choices are temporally extended. Cooperativeness is a property that applies to policies, not elementary actions. We introduce sequential social dilemmas that share the mixed incentive structure of matrix game social dilemmas but also require agents to learn policies that implement their strategic intentions. We analyze the dynamics of policies learned by multiple self-interested independent learning agents, each using its own deep Q-network, on two Markov games we introduce here: 1. a fruit Gathering game and 2. a Wolfpack hunting game. We characterize how learned behavior in each domain changes as a function of environmental factors including resource abundance. Our experiments show how conflict can emerge from competition over shared resources and shed light on how the sequential nature of real world social dilemmas affects cooperation.

AAMAS Conference 2012 Conference Paper

Crowd IQ - Aggregating Opinions to Boost Performance

  • Yoram Bachrach
  • Thore Graepel
  • Gjergji Kasneci
  • Michal Kosinski
  • Jurgen Van-Gael

We show how the quality of the decisions based on the aggregated opinions of the crowd can be conveniently studied using a sample of individual responses to a standard IQ questionnaire. We aggregated the responses to the IQ questionnaire using simple majority voting and a machine learning approach based on a probabilistic graphical model. The score for the aggregated questionnaire, Crowd IQ, serves as a quality measure of decisions based on aggregating opinions, which also allows quantifying individual and crowd performance on the same scale. We show that Crowd IQ grows quickly with the size of the crowd but saturates, and that for small homogeneous crowds the Crowd IQ significantly exceeds the IQ of even their most intelligent member. We investigate alternative ways of aggregating the responses and the impact of the aggregation method on the resulting Crowd IQ. We also discuss Contextual IQ, a method of quantifying the individual participant's contribution to the Crowd IQ based on the Shapley value from cooperative game theory.
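The simplest aggregation scheme the abstract mentions, majority voting over multiple-choice responses, can be sketched as follows; the questionnaire data and helper names are invented for illustration.

```python
from collections import Counter

def majority_vote(responses):
    """Aggregate per-question answers from many respondents by simple
    majority vote (ties would fall to the first-seen answer via
    Counter.most_common; this toy data has no ties)."""
    n_questions = len(responses[0])
    aggregated = []
    for q in range(n_questions):
        votes = Counter(r[q] for r in responses)
        aggregated.append(votes.most_common(1)[0][0])
    return aggregated

def score(answers, key):
    """Number of questions answered correctly."""
    return sum(a == k for a, k in zip(answers, key))

# Invented 5-question answer key and three respondents.
key = ['A', 'B', 'C', 'D', 'A']
crowd = [
    ['A', 'B', 'C', 'A', 'B'],  # 3/5 correct
    ['A', 'B', 'D', 'D', 'A'],  # 4/5 correct
    ['B', 'B', 'C', 'D', 'A'],  # 4/5 correct
]
agg = majority_vote(crowd)
```

On this tiny example the aggregated answer sheet scores 5/5 even though no individual exceeds 4/5, which is the "crowd exceeds its most intelligent member" effect the abstract reports for small homogeneous crowds.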

AAAI Conference 2012 Conference Paper

Quality Expectation-Variance Tradeoffs in Crowdsourcing Contests

  • Xi Gao
  • Yoram Bachrach
  • Peter Key
  • Thore Graepel

We examine designs for crowdsourcing contests, where participants compete for rewards given to superior solutions of a task. We theoretically analyze tradeoffs between the expectation and variance of the principal’s utility (i.e., the best solution’s quality), and empirically test our theoretical predictions using a controlled experiment on Amazon Mechanical Turk. Our evaluation method is also crowdsourcing-based and relies on the peer prediction mechanism. Our theoretical analysis shows an expectation-variance tradeoff of the principal’s utility in such contests through a Pareto efficient frontier. In particular, we show that the simple contest with 2 authors and the 2-pair contest have good theoretical properties. In contrast, our empirical results show that the 2-pair contest is the superior design among all designs tested, achieving the highest expectation and lowest variance of the principal’s utility.
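A toy calculation illustrates why the number of entrants trades off the expectation and variance of the best submission. Assuming, purely for illustration, i.i.d. Uniform(0,1) solution qualities (not the paper's model), the moments of the maximum are available in closed form.

```python
def max_uniform_moments(n):
    """Expectation and variance of the best of n i.i.d. Uniform(0,1)
    qualities - a stand-in for the principal's utility in a simple
    n-entrant contest. E[max] = n/(n+1); Var[max] = n/((n+1)^2 (n+2))."""
    mean = n / (n + 1)
    var = n / ((n + 1) ** 2 * (n + 2))
    return mean, var
```

More entrants push expected best quality up while shrinking its variance; actual contest designs also change incentives and hence the quality distribution itself, which is what the paper's analysis addresses.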

AAMAS Conference 2011 Conference Paper

Rip-off: Playing the Cooperative Negotiation Game

  • Yoram Bachrach
  • Pushmeet Kohli
  • Thore Graepel

We propose "Rip-off", a new multi-player bargaining game based on the well-studied weighted voting game (WVG) model from cooperative game theory. Many different solution concepts, such as the Core and the Shapley value have been proposed to analyze models such as WVGs. However, there is little work on analyzing how humans actually play in such settings. We conducted several experiments where we let humans play "Rip-off". Our analysis reveals that although solutions of games played by humans do suffer from certain biases, a player's average payoff over several games is roughly reflected by the Shapley value.

AAAI Conference 2010 Conference Paper

Collaborative Expert Portfolio Management

  • David Stern
  • Horst Samulowitz
  • Ralf Herbrich
  • Thore Graepel
  • Luca Pulina
  • Armando Tacchella

We consider the task of assigning experts from a portfolio of specialists in order to solve a set of tasks. We apply a Bayesian model which combines collaborative filtering with a feature-based description of tasks and experts to yield a general framework for managing a portfolio of experts. The model learns an embedding of tasks and experts into a latent space in which affinity is measured by the inner product. The model can be trained incrementally on non-stationary data, tracking potentially changing expert and task characteristics. The approach allows us to use a principled decision theoretic framework for expert selection, allowing the user to choose a utility function that best suits their objectives. The model component for taking into account the performance feedback data is pluggable, allowing flexibility. We apply the model to manage a portfolio of algorithms to solve hard combinatorial problems. This is a well studied area, and we demonstrate a large improvement on the state of the art in one domain (constraint solving); in a second domain (combinatorial auctions) we created a portfolio that performed significantly better than any single algorithm.

NeurIPS Conference 2007 Conference Paper

TrueSkill Through Time: Revisiting the History of Chess

  • Pierre Dangauthier
  • Ralf Herbrich
  • Tom Minka
  • Thore Graepel

We extend the Bayesian skill rating system TrueSkill to infer entire time series of skills of players by smoothing through time instead of filtering. The skill of each participating player, say, every year is represented by a latent skill variable which is affected by the relevant game outcomes that year, and coupled with the skill variables of the previous and subsequent year. Inference in the resulting factor graph is carried out by approximate message passing (EP) along the time series of skills. As before the system tracks the uncertainty about player skills, explicitly models draws, can deal with any number of competing entities and can infer individual skills from team results. We extend the system to estimate player-specific draw margins. Based on these models we present an analysis of the skill curves of important players in the history of chess over the past 150 years. Results include plots of players’ lifetime skill development as well as the ability to compare the skills of different players across time. Our results indicate that a) the overall playing strength has increased over the past 150 years, and b) that modelling a player’s ability to force a draw provides significantly better predictive power.
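The smoothing-versus-filtering distinction can be illustrated with a one-dimensional stand-in: a random-walk Kalman filter followed by an RTS smoothing pass over yearly skill observations. This is a deliberate simplification - TrueSkill Through Time runs EP on a factor graph with game-outcome likelihoods and a draw model - and all parameter values are invented.

```python
def rts_smoother(obs, obs_var=1.0, drift_var=0.1):
    """1-D random-walk Kalman filter + RTS smoother: a toy analogue of
    inferring a skill time series by smoothing through time instead of
    filtering. obs[t] is a noisy skill observation for year t."""
    # Forward (filtering) pass.
    means, variances = [obs[0]], [obs_var]
    m, v = obs[0], obs_var
    for y in obs[1:]:
        v_pred = v + drift_var             # skill drifts between years
        k = v_pred / (v_pred + obs_var)    # Kalman gain
        m = m + k * (y - m)
        v = (1 - k) * v_pred
        means.append(m)
        variances.append(v)
    # Backward (smoothing) pass: later years inform earlier estimates.
    smoothed = list(means)
    for t in range(len(obs) - 2, -1, -1):
        c = variances[t] / (variances[t] + drift_var)
        smoothed[t] = means[t] + c * (smoothed[t + 1] - means[t])
    return smoothed
```

After the backward pass the estimate for the first year reflects the whole career, which is exactly what filtering alone cannot provide.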

JMLR Journal 2005 Journal Article

Generalization Bounds for the Area Under the ROC Curve

  • Shivani Agarwal
  • Thore Graepel
  • Ralf Herbrich
  • Sariel Har-Peled
  • Dan Roth

We study generalization properties of the area under the ROC curve (AUC), a quantity that has been advocated as an evaluation criterion for the bipartite ranking problem. The AUC is a different term than the error rate used for evaluation in classification problems; consequently, existing generalization bounds for the classification error rate cannot be used to draw conclusions about the AUC. In this paper, we define the expected accuracy of a ranking function (analogous to the expected error rate of a classification function), and derive distribution-free probabilistic bounds on the deviation of the empirical AUC of a ranking function (observed on a finite data sequence) from its expected accuracy. We derive both a large deviation bound, which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on a test sequence, and a uniform convergence bound, which serves to bound the expected accuracy of a learned ranking function in terms of its empirical AUC on a training sequence. Our uniform convergence bound is expressed in terms of a new set of combinatorial parameters that we term the bipartite rank-shatter coefficients; these play the same role in our result as do the standard VC-dimension related shatter coefficients (also known as the growth function) in uniform convergence results for the classification error rate. A comparison of our result with a recent uniform convergence result derived by Freund et al. (2003) for a quantity closely related to the AUC shows that the bound provided by our result can be considerably tighter.

NeurIPS Conference 2004 Conference Paper

A Large Deviation Bound for the Area Under the ROC Curve

  • Shivani Agarwal
  • Thore Graepel
  • Ralf Herbrich
  • Dan Roth

The area under the ROC curve (AUC) has been advocated as an evaluation criterion for the bipartite ranking problem. We study large deviation properties of the AUC; in particular, we derive a distribution-free large deviation bound for the AUC which serves to bound the expected accuracy of a ranking function in terms of its empirical AUC on an independent test sequence. A comparison of our result with a corresponding large deviation result for the classification error rate suggests that the test sample size required to obtain an ε-accurate estimate of the expected accuracy of a ranking function with δ-confidence is larger than that required to obtain an ε-accurate estimate of the expected error rate of a classification function with the same confidence. A simple application of the union bound allows the large deviation bound to be extended to learned ranking functions chosen from finite function classes.
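The empirical AUC both of these abstracts bound is the Wilcoxon-Mann-Whitney pairwise statistic: the fraction of (positive, negative) pairs ranked correctly, ties counted as one half. A minimal sketch:

```python
def empirical_auc(scores_pos, scores_neg):
    """Empirical AUC of a ranking function: the fraction of
    (positive, negative) example pairs whose positive member is scored
    higher, with ties contributing 1/2 (Wilcoxon-Mann-Whitney)."""
    total = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                total += 1.0
            elif sp == sn:
                total += 0.5
    return total / (len(scores_pos) * len(scores_neg))
```

With 3 positives and 2 negatives this averages over 6 pairs; the quadratic pair structure (rather than a sum of independent terms) is precisely why the classification-error bounds cited above do not transfer directly.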

NeurIPS Conference 2004 Conference Paper

Modelling Uncertainty in the Game of Go

  • David Stern
  • Thore Graepel
  • David MacKay

Go is an ancient oriental game whose complexity has defeated attempts to automate it. We suggest using probability in a Bayesian sense to model the uncertainty arising from the vast complexity of the game tree. We present a simple conditional Markov random field model for predicting the pointwise territory outcome of a game. The topology of the model reflects the spatial structure of the Go board. We describe a version of the Swendsen-Wang process for sampling from the model during learning and apply loopy belief propagation for rapid inference and prediction. The model is trained on several hundred records of professional games. Our experimental results indicate that the model successfully learns to predict territory despite its simplicity.
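The grid-structured territory model can be illustrated at toy scale. On a 3×3 Ising-style field, the territory marginals P(s_i = +1 | position) can even be computed exactly by enumerating all 2^9 outcomes; the coupling strength and local-field values below are invented for illustration, and at 19×19 this enumeration is intractable, which is why the paper needs Swendsen-Wang sampling and loopy belief propagation.

```python
import itertools
import math

def exact_marginals(coupling=0.5, field=None):
    """Exact territory marginals P(s_i = +1) for a 3x3 Ising-style grid
    model by brute-force enumeration of all 2^9 spin configurations.
    A toy stand-in for the paper's conditional random field: `coupling`
    ties neighbouring territory outcomes together, `field` plays the
    role of stones biasing local territory."""
    n = 3
    field = field or [0.5, 0, 0, 0, 0, 0, 0, 0, -0.5]  # opposing corner stones
    edges = [(r * n + c, r * n + c + 1) for r in range(n) for c in range(n - 1)]
    edges += [(r * n + c, (r + 1) * n + c) for r in range(n - 1) for c in range(n)]
    z = 0.0
    up = [0.0] * (n * n)
    for s in itertools.product((-1, 1), repeat=n * n):
        w = math.exp(sum(coupling * s[i] * s[j] for i, j in edges)
                     + sum(field[i] * s[i] for i in range(n * n)))
        z += w
        for i in range(n * n):
            if s[i] == 1:
                up[i] += w
    return [u / z for u in up]
```

The corner with the positive field is predicted as Black territory, its opposite as White, and the centre sits at exactly one half by symmetry - the coupling then spreads each stone's influence to its neighbours.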

NeurIPS Conference 2003 Conference Paper

Invariant Pattern Recognition by Semi-Definite Programming Machines

  • Thore Graepel
  • Ralf Herbrich

Knowledge about local invariances with respect to given pattern transformations can greatly improve the accuracy of classification. Previous approaches are either based on regularisation or on the generation of virtual (transformed) examples. We develop a new framework for learning linear classifiers under known transformations based on semidefinite programming. We present a new learning algorithm - the Semidefinite Programming Machine (SDPM) - which is able to find a maximum margin hyperplane when the training examples are polynomial trajectories instead of single points. The solution is found to be sparse in dual variables and allows to identify those points on the trajectory with minimal real-valued output as virtual support vectors. Extensions to segments of trajectories, to more than one transformation parameter, and to learning with kernels are discussed. In experiments we use a Taylor expansion to locally approximate rotational invariance in pixel images from USPS and find improvements over known methods.

NeurIPS Conference 2003 Conference Paper

Semi-Definite Programming by Perceptron Learning

  • Thore Graepel
  • Ralf Herbrich
  • Andriy Kharechko
  • John Shawe-Taylor

We present a modified version of the perceptron learning algorithm (PLA) which solves semidefinite programs (SDPs) in polynomial time. The algorithm is based on the following three observations: (i) semidefinite programs are linear programs with infinitely many (linear) constraints; (ii) every linear program can be solved by a sequence of constraint satisfaction problems with linear constraints; (iii) in general, the perceptron learning algorithm solves a constraint satisfaction problem with linear constraints in finitely many updates. Combining the PLA with a probabilistic rescaling algorithm (which, on average, increases the size of the feasible region) results in a probabilistic algorithm for solving SDPs that runs in polynomial time. We present preliminary results which demonstrate that the algorithm works, but is not competitive with state-of-the-art interior point methods.
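Observation (iii) rests on the classic perceptron learning algorithm, which fixes one violated linear constraint y⟨w, x⟩ > 0 per update and, on separable data, terminates after finitely many updates. A minimal sketch on a small separable sample (the data points are invented for illustration; the paper's contribution is reducing SDP feasibility to such constraint problems, not this routine itself):

```python
def perceptron(points, labels, max_epochs=100):
    """Classic perceptron learning algorithm: sweep the sample, and for
    any violated constraint y * <w, x> > 0 apply the update w += y * x.
    Returns a separating weight vector on linearly separable data."""
    dim = len(points[0])
    w = [0.0] * dim
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:  # all linear constraints satisfied
            return w
    return w
```

The last coordinate of each point acts as a constant bias feature, so the learned hyperplane need not pass through the origin.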

JMLR Journal 2001 Journal Article

Bayes Point Machines (Kernel Machines Section)

  • Ralf Herbrich
  • Thore Graepel
  • Colin Campbell

Kernel-classifiers comprise a powerful class of non-linear decision functions for binary classification. The support vector machine is an example of a learning algorithm for kernel classifiers that singles out the consistent classifier with the largest margin, i.e. minimal real-valued output on the training sample, within the set of consistent hypotheses, the so-called version space. We suggest the Bayes point machine as a well-founded improvement which approximates the Bayes-optimal decision by the centre of mass of version space. We present two algorithms to stochastically approximate the centre of mass of version space: a billiard sampling algorithm and a sampling algorithm based on the well known perceptron algorithm. It is shown how both algorithms can be extended to allow for soft-boundaries in order to admit training errors. Experimentally, we find that - for the zero training error case - Bayes point machines consistently outperform support vector machines on both surrogate data and real-world benchmark data sets. In the soft-boundary/soft-margin case, the improvement over support vector machines is shown to be reduced. Finally, we demonstrate that the real-valued output of single Bayes points on novel test points is a valid confidence measure and leads to a steady decrease in generalisation error when used as a rejection criterion.
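The second algorithm in the abstract can be sketched in simplified linear (kernel-free) form: run the perceptron on random permutations of the training sample and average the normalised solutions. Because version space is an intersection of half-spaces and hence convex, the average of consistent classifiers is itself consistent; this is a hedged simplification of the paper's perceptron-based sampling, with all data and names invented.

```python
import random

def bayes_point(points, labels, n_samples=20, seed=0):
    """Approximate the Bayes point (centre of mass of version space) as
    the mean of normalised perceptron solutions obtained from random
    permutations of the training sample. Linear toy version of the
    paper's perceptron-based sampling algorithm."""
    rng = random.Random(seed)
    dim = len(points[0])
    centre = [0.0] * dim
    data = list(zip(points, labels))
    for _ in range(n_samples):
        rng.shuffle(data)                  # each permutation yields a
        w = [0.0] * dim                    # different consistent classifier
        done = False
        while not done:
            done = True
            for x, y in data:
                if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                    w = [wi + y * xi for wi, xi in zip(w, x)]
                    done = False
        norm = sum(wi * wi for wi in w) ** 0.5
        centre = [c + wi / (norm * n_samples) for c, wi in zip(centre, w)]
    return centre
```

Each sampled solution lies strictly inside version space, so the averaged centre classifies every training point correctly while sitting nearer its centre of mass than any single perceptron run.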

NeurIPS Conference 2000 Conference Paper

A PAC-Bayesian Margin Bound for Linear Classifiers: Why SVMs work

  • Ralf Herbrich
  • Thore Graepel

We present a bound on the generalisation error of linear classifiers in terms of a refined margin quantity on the training set. The result is obtained in a PAC-Bayesian framework and is based on geometrical arguments in the space of linear classifiers. The new bound constitutes an exponential improvement of the so far tightest margin bound by Shawe-Taylor et al. [8] and scales logarithmically in the inverse margin. Even in the case of fewer training examples than input dimensions, sufficiently large margins lead to non-trivial bound values and - for maximum margins - to a vanishing complexity term. Furthermore, the classical margin is too coarse a measure for the essential quantity that controls the generalisation error: the volume ratio between the whole hypothesis space and the subset of consistent hypotheses. The practical relevance of the result lies in the fact that the well-known support vector machine is optimal w.r.t. the new bound only if the feature vectors are all of the same length. As a consequence we recommend to use SVMs on normalised feature vectors only - a recommendation that is well supported by our numerical experiments on two benchmark data sets.

NeurIPS Conference 2000 Conference Paper

From Margin to Sparsity

  • Thore Graepel
  • Ralf Herbrich
  • Robert Williamson

We present an improvement of Novikoff's perceptron convergence theorem. Reinterpreting this mistake bound as a margin dependent sparsity guarantee allows us to give a PAC-style generalisation error bound for the classifier learned by the perceptron learning algorithm. The bound value crucially depends on the margin a support vector machine would achieve on the same data set using the same kernel. Ironically, the bound yields better guarantees than are currently available for the support vector solution itself.
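Novikoff's theorem, the starting point of the abstract, bounds the perceptron's mistakes on separable data by (R/γ)², where R bounds the example norms and γ is the margin of some separator. A small sketch that computes the bound and checks it against an actual perceptron run on invented separable data:

```python
def novikoff_bound(points, labels, w_star):
    """Novikoff's mistake bound (R / gamma)^2, where R = max example norm
    and gamma = margin of the separator w_star after normalising it to
    unit length. w_star need only be *some* separating vector."""
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    norm = lambda a: dot(a, a) ** 0.5
    R = max(norm(x) for x in points)
    gamma = min(y * dot(w_star, x) for x, y in zip(points, labels)) / norm(w_star)
    return (R / gamma) ** 2

def perceptron_mistakes(points, labels, max_epochs=100):
    """Run the perceptron and count its updates (mistakes)."""
    w = [0.0] * len(points[0])
    mistakes = 0
    for _ in range(max_epochs):
        clean = True
        for x, y in zip(points, labels):
            if y * sum(wi * xi for wi, xi in zip(w, x)) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                clean = False
        if clean:
            return mistakes
    return mistakes
```

Reading the mistake count as the number of examples the final classifier depends on gives the margin-dependent sparsity guarantee the abstract reinterprets.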

NeurIPS Conference 2000 Conference Paper

Large Scale Bayes Point Machines

  • Ralf Herbrich
  • Thore Graepel

The concept of averaging over classifiers is fundamental to the Bayesian analysis of learning. Based on this viewpoint, it has recently been demonstrated for linear classifiers that the centre of mass of version space (the set of all classifiers consistent with the training set) - also known as the Bayes point - exhibits excellent generalisation abilities. However, the billiard algorithm as presented in [4] is restricted to small sample size because it requires O(m²) of memory and O(N·m²) computational steps, where m is the number of training patterns and N is the number of random draws from the posterior distribution. In this paper we present a method based on the simple perceptron learning algorithm which allows to overcome this algorithmic drawback. The method is algorithmically simple and is easily extended to the multi-class case. We present experimental results on the MNIST data set of handwritten digits which show that Bayes point machines (BPMs) are competitive with the current world champion, the support vector machine. In addition, the computational complexity of BPMs can be tuned by varying the number of samples from the posterior. Finally, rejecting test points on the basis of their (approximative) posterior probability leads to a rapid decrease in generalisation error, e.g. 0.1% generalisation error for a given rejection rate of 10%.

NeurIPS Conference 2000 Conference Paper

The Kernel Gibbs Sampler

  • Thore Graepel
  • Ralf Herbrich

We present an algorithm that samples the hypothesis space of kernel classifiers. A uniform prior over normalised weight vectors, combined with a likelihood based on a model of label noise, leads to a piecewise constant posterior that can be sampled by the kernel Gibbs sampler (KGS). The KGS is a Markov chain Monte Carlo method that chooses a random direction in parameter space and samples from the resulting piecewise constant density along the chosen line. The KGS can be used as an analytical tool for the exploration of Bayesian transduction, Bayes point machines, active learning, and evidence-based model selection on small data sets that are contaminated with label noise. For a simple toy example we demonstrate experimentally how a Bayes point machine based on the KGS outperforms an SVM that is incapable of taking label noise into account.
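The sampling step can be illustrated in the linear case. Along any great circle through the current normalised weight vector, the label-noise posterior is piecewise constant (it depends only on how many training labels are violated); the sketch below evaluates it on a fine angular grid rather than computing the exact breakpoints, and uses linear rather than kernel features — both simplifications of ours, not the paper's method:

```python
import numpy as np

rng = np.random.default_rng(2)

def kgs_step(w, X, y, q=0.1, grid=720):
    """One line-sampling step in the spirit of the kernel Gibbs sampler:
    pick a random great circle through the current unit weight vector w,
    evaluate the piecewise constant posterior p(w) ~ q^errors * (1-q)^rest
    along it, and sample the next w from that density (grid approximation)."""
    d = rng.normal(size=w.shape)
    d -= (d @ w) * w                   # random direction orthogonal to w
    d /= np.linalg.norm(d)
    theta = np.linspace(0, 2 * np.pi, grid, endpoint=False)
    W = np.outer(np.cos(theta), w) + np.outer(np.sin(theta), d)  # unit circle
    errs = ((y[None, :] * (W @ X.T)) <= 0).sum(axis=1)  # mistakes per angle
    logp = errs * np.log(q) + (len(y) - errs) * np.log(1 - q)
    p = np.exp(logp - logp.max())
    p /= p.sum()
    return W[rng.choice(grid, p=p)]

X = np.array([[2.0, 1.0], [1.0, 2.0], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = np.array([1.0, 0.0])
samples = []
for _ in range(50):
    w = kgs_step(w, X, y)
    samples.append(w)
```

With noise rate `q = 0.1`, regions of weight space that misclassify a point are down-weighted by a factor of `q / (1 - q)` per mistake rather than excluded outright — this is how the sampler tolerates label noise where a hard-margin SVM cannot.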

NeurIPS Conference 1999 Conference Paper

Bayesian Transduction

  • Thore Graepel
  • Ralf Herbrich
  • Klaus Obermayer

Transduction is an inference principle that takes a training sample and aims at estimating the values of a function at given points contained in the so-called working sample, as opposed to the whole of input space as in induction. Transduction provides a confidence measure on single predictions rather than on classifiers, a feature particularly important for risk-sensitive applications. The possibly infinite number of functions is reduced to a finite number of equivalence classes on the working sample. A rigorous Bayesian analysis reveals that for standard classification loss we cannot benefit from considering more than one test point at a time. The probability of the label of a given test point is determined as the posterior measure of the corresponding subset of hypothesis space. We consider the PAC setting of binary classification by linear discriminant functions (perceptrons) in kernel space, such that the probability of labels is determined by the volume ratio in version space. We suggest sampling this region by an ergodic billiard. Experimental results on real-world data indicate that Bayesian Transduction compares favourably to the well-known Support Vector Machine, in particular if the posterior probability of labellings is used as a confidence measure to exclude test points of low confidence.

NeurIPS Conference 1998 Conference Paper

Classification on Pairwise Proximity Data

  • Thore Graepel
  • Ralf Herbrich
  • Peter Bollmann-Sdorra
  • Klaus Obermayer

We investigate the problem of learning a classification task on data represented in terms of their pairwise proximities. This representation does not refer to an explicit feature representation of the data items and is thus more general than the standard approach of using Euclidean feature vectors, from which pairwise proximities can always be calculated. Our first approach is based on a combined linear embedding and classification procedure resulting in an extension of the Optimal Hyperplane algorithm to pseudo-Euclidean data. As an alternative we present another approach based on a linear threshold model in the proximity values themselves, which is optimized using Structural Risk Minimization. We show that prior knowledge about the problem can be incorporated by the choice of distance measures and examine different metrics w.r.t. their generalization. Finally, the algorithms are successfully applied to protein structure data and to data from the cat's cerebral cortex. They show better performance than K-nearest-neighbor classification.
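The pseudo-Euclidean embedding underlying the first approach can be sketched via double centering and eigendecomposition of the squared dissimilarities; negative eigenvalues signal non-Euclidean structure. A hedged sketch on toy data of our own, not the authors' code:

```python
import numpy as np

def pseudo_euclidean_embedding(D, k=2):
    """Embed items given a pairwise dissimilarity matrix D (n x n) into a
    k-dimensional pseudo-Euclidean space. Double-center the squared
    dissimilarities to obtain a Gram-like matrix, eigendecompose it, and
    keep the k eigenvalues of largest magnitude; the sign of each kept
    eigenvalue gives the signature of the embedding space."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # Gram-like matrix
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(-np.abs(vals))[:k]    # largest |eigenvalue| first
    lam, V = vals[order], vecs[:, order]
    coords = V * np.sqrt(np.abs(lam))        # embedding coordinates
    signature = np.sign(lam)                 # +1 Euclidean, -1 pseudo part
    return coords, signature

# Dissimilarities of 4 points on a line: the embedding recovers the 1-D layout
D = np.abs(np.subtract.outer([0.0, 1.0, 2.0, 3.0], [0.0, 1.0, 2.0, 3.0]))
coords, sig = pseudo_euclidean_embedding(D, k=2)
```

For genuinely non-metric proximities (such as the protein or cortex data mentioned above), some entries of `signature` come out negative, which is precisely why the Optimal Hyperplane algorithm has to be extended rather than applied as-is.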

NeurIPS Conference 1997 Conference Paper

An Annealed Self-Organizing Map for Source Channel Coding

  • Matthias Burger
  • Thore Graepel
  • Klaus Obermayer

We derive and analyse robust optimization schemes for noisy vector quantization on the basis of deterministic annealing. Starting from a cost function for central clustering that incorporates distortions from channel noise we develop a soft topographic vector quantization algorithm (STVQ) which is based on the maximum entropy principle and which performs a maximum-likelihood estimate in an expectation-maximization (EM) fashion. Annealing in the temperature parameter β leads to phase transitions in the existing code vector representation during the cooling process for which we calculate critical temperatures and modes as a function of eigenvectors and eigenvalues of the covariance matrix of the data and the transition matrix of the channel noise. A whole family of vector quantization algorithms is derived from STVQ, among them a deterministic annealing scheme for Kohonen's self-organizing map (SOM). This algorithm, which we call SSOM, is then applied to vector quantization of image data to be sent via a noisy binary symmetric channel. The algorithm's performance is compared to those of LBG and STVQ. While it is naturally superior to LBG, which does not take into account channel noise, its results compare very well to those of STVQ, which is computationally much more demanding.
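The EM-style annealing loop can be sketched as follows: soft assignments at inverse temperature β, coupled by a neighbourhood kernel that here plays the role of the channel transition matrix. This is a simplified sketch with a 1-D Gaussian neighbourhood and a fixed cooling schedule of our own choosing, not the authors' code:

```python
import numpy as np

rng = np.random.default_rng(1)

def soft_vq(X, n_units=8, betas=(0.1, 0.5, 2.0, 10.0), sigma=1.0, iters=20):
    """Deterministic-annealing soft vector quantizer in the spirit of
    STVQ/SSOM. E-step: assign each x softly to unit r with probability
    proportional to exp(-beta/2 * sum_s h[r,s] * ||x - w_s||^2).
    M-step: update each code vector as the h-coupled weighted mean.
    Annealing raises beta, hardening the assignments."""
    d = X.shape[1]
    W = rng.normal(scale=0.1, size=(n_units, d))        # code vectors
    idx = np.arange(n_units)
    H = np.exp(-0.5 * (idx[:, None] - idx[None, :]) ** 2 / sigma ** 2)
    H /= H.sum(axis=1, keepdims=True)                   # neighbourhood kernel
    for beta in betas:                                  # cooling schedule
        for _ in range(iters):
            # E-step: neighbourhood-smeared distortion, then soft assignment
            dist2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)  # (n, units)
            E = dist2 @ H.T
            P = np.exp(-0.5 * beta * (E - E.min(axis=1, keepdims=True)))
            P /= P.sum(axis=1, keepdims=True)
            # M-step: responsibilities of each unit, then weighted means
            Q = P @ H
            W = (Q.T @ X) / Q.sum(axis=0)[:, None]
    return W

# Two well-separated clusters; annealing splits the code vectors between them
X = np.concatenate([rng.normal(-2, 0.3, (100, 2)), rng.normal(2, 0.3, (100, 2))])
W = soft_vq(X)
```

At low β all code vectors collapse onto the data mean; the splits that appear as β is raised are the phase transitions whose critical temperatures the paper characterises via the data covariance and the noise transition matrix.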