Author name cluster

Philip Thomas

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers

2 author rows

NeurIPS Conference 2025 Conference Paper

Beyond Prediction: Managing the Repercussions of Machine Learning Applications

Aline Weber
Blossom Metevier
Yuriy Brun
Philip Thomas
Bruno Silva

Machine learning models are often designed to maximize a primary goal, such as accuracy. However, as these models are increasingly used to inform decisions that affect people's lives or well-being, it is often unclear what the real-world repercussions of their deployment might be—making it crucial to understand and manage such repercussions effectively. Models maximizing user engagement on social media platforms, e. g. , may inadvertently contribute to the spread of misinformation and content that deepens political polarization. This issue is not limited to social media—it extends to other applications where machine learning-informed decisions can have real-world repercussions, such as education, employment, and lending. Existing methods addressing this issue require prior knowledge or estimates of analytical models describing the relationship between a classifier's predictions and their corresponding repercussions. We introduce Theia, a novel classification algorithm capable of optimizing a primary objective, such as accuracy, while providing high-confidence guarantees about its potential repercussions. Importantly, Theia solves the open problem of providing such guarantees based solely on existing data with observations of previous repercussions. We prove that it satisfies constraints on a model's repercussions with high confidence and that it is guaranteed to identify a solution, if one exists, given sufficient data. We empirically demonstrate, using real-life data, that Theia can identify models that achieve high accuracy while ensuring, with high confidence, that constraints on their repercussions are satisfied.

NeurIPS Conference 2025 Conference Paper

Fair Continuous Resource Allocation with Equality of Impact

Blossom Metevier
Dennis Wei
Karthikeyan Natesan Ramamurthy
Philip Thomas

Recent works have studied fair resource allocation in social settings, where fairness is judged by the impact of allocation decisions rather than more traditional minimum or maximum thresholds on the allocations themselves. Our work significantly adds to this literature by developing continuous resource allocation strategies that adhere to equality of impact, a generalization of equality of opportunity. We derive methods to maximize total welfare across groups subject to minimal violation of equality of impact, in settings where the outcomes of allocations are unknown but have a diminishing marginal effect. While focused on a two-group setting, our study addresses a broader class of welfare dynamics than explored in prior work. Our contributions are threefold. First, we introduce Equality of Impact (EoI), a fairness criterion defined via group-level impact functions. Second, we design an online algorithm for non-noisy settings that leverages the problem’s geometric structure and achieves constant cumulative fairness regret. Third, we extend this approach to noisy environments with a meta-algorithm and empirically demonstrate that our methods find fair allocations and perform competitively relative to representative baselines.

NeurIPS Conference 2025 Conference Paper

Fair Representation Learning with Controllable High Confidence Guarantees via Adversarial Inference

Yuhong Luo
Austin Hoag
Xintong Wang
Philip Thomas
Przemyslaw Grabowicz

Representation learning is increasingly applied to generate representations that generalize well across multiple downstream tasks. Ensuring fairness guarantees in representation learning is crucial to prevent unfairness toward specific demographic groups in downstream tasks. In this work, we formally introduce the task of learning representations that achieve high-confidence fairness. We aim to guarantee that demographic disparity in every downstream prediction remains bounded by a *user-defined* error threshold $\epsilon$, with *controllable* high probability. To this end, we propose the ***F**air **R**epresentation learning with high-confidence **G**uarantees (FRG)* framework, which provides these high-confidence fairness guarantees by leveraging an optimized adversarial model. We empirically evaluate FRG on three real-world datasets, comparing its performance to six state-of-the-art fair representation learning methods. Our results demonstrate that FRG consistently bounds unfairness across a range of downstream models and tasks.

AAAI Conference 2020 Conference Paper

Lifelong Learning with a Changing Action Set

Yash Chandak
Georgios Theocharous
Chris Nota
Philip Thomas

In many real-world sequential decision making problems, the number of available actions (decisions) can vary over time. While problems like catastrophic forgetting, changing transition dynamics, changing rewards functions, etc. have been well-studied in the lifelong learning literature, the setting where the size of the action set changes remains unaddressed. In this paper, we present ﬁrst steps towards developing an algorithm that autonomously adapts to an action set whose size changes over time. To tackle this open problem, we break it into two problems that can be solved iteratively: inferring the underlying, unknown, structure in the space of actions and optimizing a policy that leverages this structure. We demonstrate the efﬁciency of this approach on large-scale real-world lifelong learning problems.

AAAI Conference 2020 Conference Paper

Reinforcement Learning When All Actions Are Not Always Available

Yash Chandak
Georgios Theocharous
Blossom Metevier
Philip Thomas

The Markov decision process (MDP) formulation used to model many real-world sequential decision making problems does not efﬁciently capture the setting where the set of available decisions (actions) at each time step is stochastic. Recently, the stochastic action set Markov decision process (SAS- MDP) formulation has been proposed, which better captures the concept of a stochastic action set. In this paper we argue that existing RL algorithms for SAS-MDPs can suffer from potential divergence issues, and present new policy gradient algorithms for SAS-MDPs that incorporate variance reduction techniques unique to this setting, and provide conditions for their convergence. We conclude with experiments that demonstrate the practicality of our approaches on tasks inspired by real-life use cases wherein the action set is stochastic.

NeurIPS Conference 2019 Conference Paper

A Meta-MDP Approach to Exploration for Lifelong Reinforcement Learning

Francisco Garcia
Philip Thomas

In this paper we consider the problem of how a reinforcement learning agent that is tasked with solving a sequence of reinforcement learning problems (a sequence of Markov decision processes) can use knowledge acquired early in its lifetime to improve its ability to solve new problems. We argue that previous experience with similar problems can provide an agent with information about how it should explore when facing a new but related problem. We show that the search for an optimal exploration strategy can be formulated as a reinforcement learning problem itself and demonstrate that such strategy can leverage patterns found in the structure of related problems. We conclude with experiments that show the benefits of optimizing an exploration strategy using our proposed framework.

RLDM Conference 2019 Conference Abstract

High-Probability Guarantees for Offline Contextual Bandits

Blossom Metevier
Sarah Brockman
Yuriy Brun
Emma Brunskill
Philip Thomas

We present an offline contextual bandit algorithm designed to satisfy a broad family of fairness constraints. Unlike previous work, our algorithm accepts multiple user-specified and problem-specific def- initions of fairness, including novel ones. Empirically, we evaluate our algorithm on applications related to an intelligent tutoring system (using data we collected via a user study) and criminal recidivism (using data released by ProPublica). In each setting our algorithm always produces fair policies that achieve rewards competitive with unsafe policies constructed by other offline and online contextual bandit algorithms.

RLDM Conference 2019 Conference Abstract

Learning Temporal Abstractions from Demonstration: A Probabilistic Ap- proach to Offline Option Discovery

Francisco M Garcia
Philip Thomas

The use of temporally extended actions often improves a reinforcement learning agent’s ability to learn solutions to complex tasks. The options framework is a popular method for defining closed-loop temporally extended actions, but the question of how to obtain options appropriate for a specific problem remains a subject of debate. In this paper, we consider good options to be those that allow an agent to represent optimal behavior with minimal decision-making by the policy over options, and propose learning options from historical data. Assuming access to demonstrations of (near)-optimal behavior, we formulate an optimization problem whose solution leads to the identification of options that allow an agent to reproduce optimal behavior with a small number of decisions. We provide experiments showing that the learned options lead to significant performance improvement and we show visually that the identified options are able to reproduce the demonstrated behavior.

NeurIPS Conference 2019 Conference Paper

Offline Contextual Bandits with High Probability Fairness Guarantees

Blossom Metevier
Stephen Giguere
Sarah Brockman
Ari Kobren
Yuriy Brun
Emma Brunskill
Philip Thomas

We present RobinHood, an ofﬂine contextual bandit algorithm designed to satisfy a broad family of fairness constraints. Our algorithm accepts multiple fairness deﬁnitions and allows users to construct their own unique fairness deﬁnitions for the problem at hand. We provide a theoretical analysis of RobinHood, which includes a proof that it will not return an unfair solution with probability greater than a user-speciﬁed threshold. We validate our algorithm on three applications: a tutoring system in which we conduct a user study and consider multiple unique fairness deﬁnitions; a loan approval setting (using the Statlog German credit data set) in which well-known fairness deﬁnitions are applied; and criminal recidivism (using data released by ProPublica). In each setting, our algorithm is able to produce fair policies that achieve performance competitive with other ofﬂine and online contextual bandit algorithms.

AAAI Conference 2017 Conference Paper

Importance Sampling with Unequal Support

Philip Thomas
Emma Brunskill

Importance sampling is often used in machine learning when training and testing data come from different distributions. In this paper we propose a new variant of importance sampling that can reduce the variance of importance samplingbased estimates by orders of magnitude when the supports of the training and testing distributions differ. After motivating and presenting our new importance sampling estimator, we provide a detailed theoretical analysis that characterizes both its bias and variance relative to the ordinary importance sampling estimator (in various settings, which include cases where ordinary importance sampling is biased, while our new estimator is not, and vice versa). We conclude with an example of how our new importance sampling estimator can be used to improve estimates of how well a new treatment policy for diabetes will work for an individual, using only data from when the individual used a previous treatment policy.

RLDM Conference 2017 Conference Abstract

Model Selection for Off-Policy Policy Evaluation

Yao Liu
Philip Thomas
Emma Brunskill

In this work we study the off-policy policy evaluation problem, which is about how to predict the value of a policy by data from other policies. This is crucial for many applications where we can not deploy new policy directly due to safety or cost. We consider the model selection problems for a better off-policy estimators, when we have models from different sources. Traditional off-policy policy evaluation method can be divided into importance sampling estimators or model based estimators, and they respectively suffer from high variance and bias. Recent work such as doubly robust and MAGIC, shows that we can get benefit from combining importance sampling method with model value. However they all assume that they have only one model. In case we have several different models, which is common in some complex domains, it may be hard to select the best one from them, and may lose the potential benefit from all models. we present a evidence example to show that select model by simply minimizing the notation of error in previous estimator (MAGIC) can fall into a wrong model, which suggest that selecting a best model for off-policy policy evaluation is non-trivial and worth of further exploration. We propose two new estimators of model bias and a cross validation way to help to choose a model, and shows the preliminary result.

NeurIPS Conference 2017 Conference Paper

Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation

Zhaohan Guo
Philip Thomas
Emma Brunskill

Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy policy evaluation (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for evaluating a policy without requiring it to ever be deployed. Importance sampling is a popular OPE method because it is robust to partial observability and works with continuous states and actions. However, the amount of historical data required by importance sampling can scale exponentially with the horizon of the problem: the number of sequential decisions that are made. We propose using policies over temporally extended actions, called options, and show that combining these policies with importance sampling can significantly improve performance for long-horizon problems. In addition, we can take advantage of special cases that arise due to options-based policies to further improve the performance of importance sampling. We further generalize these special cases to a general covariance testing rule that can be used to decide which weights to drop in an IS estimate, and derive a new IS algorithm called Incremental Importance Sampling that can provide significantly more accurate estimates for a broad class of domains.

RLDM Conference 2017 Conference Abstract

Using Options for Long-Horizon Off-Policy Evaluation

Zhaohan Guo
Philip Thomas
Emma Brunskill

Evaluating a policy by deploying it in the real world can be risky and costly. Off-policy evalua- tion (OPE) algorithms use historical data collected from running a previous policy to evaluate a new policy, which provides a means for evaluating a policy without requiring it to ever be deployed. Importance sam- pling is a popular OPE method because it is robust to partial observability and works with continuous states and actions. However, we show that the amount of historical data required by importance sampling can scale exponentially with the horizon of the problem: the number of sequential decisions that are made. We pro- pose using policies over temporally extended actions, called options, to address this long-horizon problem. We show theoretically and experimentally that combining importance sampling with options-based policies can significantly improve performance for long-horizon problems.

AAAI Conference 2016 Conference Paper

Increasing the Action Gap: New Operators for Reinforcement Learning

Marc G. Bellemare
Georg Ostrovski
Arthur Guez
Philip Thomas
Remi Munos

EWRL Workshop 2016 Workshop Paper

Magical Policy Search: Data Efficient Reinforcement Learning with Guarantees of Global Optimality

Philip Thomas
Emma Brunskill

We present a batch policy search algorithm that has several desirable properties: it has few parameters that require expert tuning, it can leverage approximate models of the environment, it can seamlessly handle continuous states and actions, (informally speaking) it is guaranteed to converge to a globally optimal policy even in partially observable environments, and in our simulations it outperforms a state-of-the-art baseline. The primary limitation of our algorithm is its high computational complexity—each policy improvement step involves the optimization of a known (not necessarily convex) function.

AAAI Conference 2015 Conference Paper

High-Confidence Off-Policy Evaluation

Philip Thomas
Georgios Theocharous
Mohammad Ghavamzadeh

Many reinforcement learning algorithms use trajectories collected from the execution of one or more policies to propose a new policy. Because execution of a bad policy can be costly or dangerous, techniques for evaluating the performance of the new policy without requiring its execution have been of recent interest in industry. Such off-policy evaluation methods, which estimate the performance of a policy using trajectories collected from the execution of other policies, heretofore have not provided confidences regarding the accuracy of their estimates. In this paper we propose an off-policy method for computing a lower confidence bound on the expected return of a policy.

NeurIPS Conference 2015 Conference Paper

Policy Evaluation Using the Ω-Return

Philip Thomas
Scott Niekum
Georgios Theocharous
George Konidaris

We propose the Ω-return as an alternative to the λ-return currently used by the TD(λ) family of algorithms. The benefit of the Ω-return is that it accounts for the correlation of different length returns. Because it is difficult to compute exactly, we suggest one way of approximating the Ω-return. We provide empirical studies that suggest that it is superior to the λ-return and γ-return for a variety of problems.

ICML Conference 2014 Conference Paper

Bias in Natural Actor-Critic Algorithms

Philip Thomas

We show that several popular discounted reward natural actor-critics, including the popular NAC-LSTD and eNAC algorithms, do not generate unbiased estimates of the natural policy gradient as claimed. We derive the first unbiased discounted reward natural actor-critics using batch and iterative approaches to gradient estimation. We argue that the bias makes the existing algorithms more appropriate for the average reward setting. We also show that, when Sarsa(lambda) is guaranteed to converge to an optimal policy, the objective function used by natural actor-critics is concave, so policy gradient methods are guaranteed to converge to globally optimal policies as well.

ICML Conference 2014 Conference Paper

GeNGA: A Generalization of Natural Gradient Ascent with Positive and Negative Convergence Results

Philip Thomas

Natural gradient ascent (NGA) is a popular optimization method that uses a positive definite metric tensor. In many applications the metric tensor is only guaranteed to be positive semidefinite (e. g. , when using the Fisher information matrix as the metric tensor), in which case NGA is not applicable. In our first contribution, we derive generalized natural gradient ascent (GeNGA), a generalization of NGA which allows for positive semidefinite non-smooth metric tensors. In our second contribution we show that, in standard settings, GeNGA and NGA can both be divergent. We then establish sufficient conditions to ensure that both achieve various forms of convergence. In our third contribution we show how several reinforcement learning methods that use NGA without positive definite metric tensors can be adapted to properly use GeNGA.

AAAI Conference 2014 Conference Paper

Natural Temporal Difference Learning

William Dabney
Philip Thomas

In this paper we investigate the application of natural gradient descent to Bellman error based reinforcement learning algorithms. This combination is interesting because natural gradient descent is invariant to the parameterization of the value function. This invariance property means that natural gradient descent adapts its update directions to correct for poorly conditioned representations. We present and analyze quadratic and linear time natural temporal difference learning algorithms, and prove that they are covariant. We conclude with experiments which suggest that the natural algorithms can match or outperform their non-natural counterparts using linear function approximation, and drastically improve upon their non-natural counterparts when using non-linear function approximation.

RLDM Conference 2013 Conference Abstract

Performance Metrics for Reinforcement Learning Algorithms

William Dabney
Philip Thomas
Andrew Barto

Due to the continued growth of the field of reinforcement learning (RL), the number of RL al- gorithms has increased to the point where an individual researcher cannot experiment with all of them. To facilitate the decision of which algorithms to invest time in, we, as a field, need methods that thoroughly and accurately describe the performance of algorithms. Two approaches for evaluating RL algorithms are commonly used, neither of which fully accomplishes these goals. The first approach is to manually test each algorithm with a collection of parameter values and report the best results found. This can introduce an unintended bias in an algorithm’s favor because researchers have more insight when selecting parameter values for methods with which they are familiar. The second approach is to perform a large parameter opti- mization for each algorithm and to report the best results found. This approach does not accurately capture the difficulty of finding good parameter values. The fundamental problem with both approaches is that the robustness of the algorithm to its parameter values is ignored. In the first approach this results in biased evaluations. On the other hand, in the second approach it causes only the combination of the RL algorithm and parameter optimization to be evaluated, which allows the parameter optimization to compensate for weaknesses in the RL algorithm. By this standard, directly searching for fixed policies in the parameter op- timization will yield the best algorithm possible. We propose a performance metric for RL algorithms that tells a much larger part of the story of an algorithm’s performance and robustness to its parameter values. It allows us as RL researchers to be better informed about the performance of our algorithms, and to report results that are also more informative to our audiences. The key insight is to measure performance in terms of expected percentage of fixed, deterministic policies that the algorithm outperforms.

NeurIPS Conference 2013 Conference Paper

Projected Natural Actor-Critic

Philip Thomas
William Dabney
Stephen Giguere
Sridhar Mahadevan

Natural actor-critics are a popular class of policy search algorithms for finding locally optimal policies for Markov decision processes. In this paper we address a drawback of natural actor-critics that limits their real-world applicability - their lack of safety guarantees. We present a principled algorithm for performing natural gradient descent over a constrained domain. In the context of reinforcement learning, this allows for natural actor-critic algorithms that are guaranteed to remain within a known safe region of policy space. While deriving our class of constrained natural actor-critic algorithms, which we call Projected Natural Actor-Critics (PNACs), we also elucidate the relationship between natural gradient descent and mirror descent.

NeurIPS Conference 2011 Conference Paper

Policy Gradient Coagent Networks

Philip Thomas

We present a novel class of actor-critic algorithms for actors consisting of sets of interacting modules. We present, analyze theoretically, and empirically evaluate an update rule for each module, which requires only local information: the module's input, output, and the TD error broadcast by a critic. Such updates are necessary when computation of compatible features becomes prohibitively difficult and are also desirable to increase the biological plausibility of reinforcement learning methods.

NeurIPS Conference 2011 Conference Paper

TD_gamma: Re-evaluating Complex Backups in Temporal Difference Learning

George Konidaris
Scott Niekum
Philip Thomas

We show that the lambda-return target used in the TD(lambda) family of algorithms is the maximum likelihood estimator for a specific model of how the variance of an n-step return estimate increases with n. We introduce the gamma-return estimator, an alternative target based on a more accurate model of variance, which defines the TD gamma family of complex-backup temporal difference learning algorithms. We derive TD gamma, the gamma-return equivalent of the original TD(lambda) algorithm, which eliminates the lambda parameter but can only perform updates at the end of an episode and requires time and space proportional to the episode length. We then derive a second algorithm, TD gamma(C), with a capacity parameter C. TD gamma(C) requires C times more time and memory than TD(lambda) and is incremental and online. We show that TD gamma outperforms TD(lambda) for any setting of lambda on 4 out of 5 benchmark domains, and that TD gamma(C) performs as well as or better than TD_gamma for intermediate settings of C.

AAAI Conference 2011 Conference Paper

Value Function Approximation in Reinforcement Learning Using the Fourier Basis

George Konidaris
Sarah Osentoski
Philip Thomas

We describe the Fourier basis, a linear value function approximation scheme based on the Fourier series. We empirically demonstrate that it performs well compared to radial basis functions and the polynomial basis, the two most popular ﬁxed bases for linear value function approximation, and is competitive with learned proto-value functions.