Arrow Research search

Author name cluster

Alekh Agarwal

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

70 papers
2 author rows

Possible papers (70)

ICML Conference 2025 Conference Paper

Catoni Contextual Bandits are Robust to Heavy-tailed Rewards

  • Chenlu Ye
  • Yujia Jin
  • Alekh Agarwal
  • Tong Zhang 0001

Typical contextual bandit algorithms assume that the rewards at each round lie in some fixed range $[0, R]$, and their regret scales polynomially with this reward range $R$. However, many practical scenarios naturally involve heavy-tailed rewards or rewards where the worst-case range can be substantially larger than the variance. In this paper, we develop an algorithmic approach building on Catoni’s estimator from robust statistics, and apply it to contextual bandits with general function approximation. When the variance of the reward at each round is known, we use a variance-weighted regression approach and establish a regret bound that depends only on the cumulative reward variance and logarithmically on the reward range $R$ as well as the number of rounds $T$. For the unknown-variance case, we further propose a careful peeling-based algorithm and remove the need for cumbersome variance estimation. With additional dependence on the fourth moment, our algorithm also enjoys a variance-based bound with logarithmic reward-range dependence. Moreover, we demonstrate the optimality of the leading-order term in our regret bound through a matching lower bound.
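
A minimal numpy sketch of the Catoni mean estimator underlying the approach (the paper's variance-weighted regression and peeling algorithm are not shown; the tuning of `alpha` below is one common variance-based choice, not necessarily the paper's):

```python
import numpy as np

def catoni_psi(x):
    # Catoni's influence function: grows only logarithmically,
    # limiting the impact of heavy-tailed outliers.
    return np.sign(x) * np.log1p(np.abs(x) + 0.5 * x**2)

def catoni_mean(samples, alpha):
    # Root of f(theta) = sum_i psi(alpha * (x_i - theta)); f is strictly
    # decreasing in theta, so bisection is safe.
    lo, hi = samples.min(), samples.max()
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if np.sum(catoni_psi(alpha * (samples - mid))) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
x = rng.standard_t(df=2.5, size=2000) + 1.0   # heavy-tailed, true mean 1.0
alpha = np.sqrt(2.0 / (x.var() * len(x)))     # a common variance-based tuning
print(np.mean(x), catoni_mean(x, alpha))      # Catoni is far more stable
```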

ICML Conference 2025 Conference Paper

Design Considerations in Offline Preference-based RL

  • Alekh Agarwal
  • Christoph Dann
  • Teodor V. Marinov

Offline algorithms for Reinforcement Learning from Human Preferences (RLHF), which use only a fixed dataset of sampled responses given an input, and preference feedback among these responses, have gained increasing prominence in the literature on aligning language models. In this paper, we study how the different design choices made in methods such as DPO, IPO, SLiC and many variants influence the quality of the learned policy, from a theoretical perspective. Our treatment yields insights into the choices of loss function, the policy which is used to normalize log-likelihoods, and also the role of the data sampling policy. Notably, our results do not rely on the standard reparameterization-style arguments used to motivate some of the algorithms in this family, which allows us to give a unified treatment to a broad class of methods. We also conduct a small empirical study to verify some of the theoretical findings on a standard summarization benchmark.

TMLR Journal 2025 Journal Article

Preserving Expert-Level Privacy in Offline Reinforcement Learning

  • Navodita Sharma
  • Vishnu Vinod
  • Abhradeep Guha Thakurta
  • Alekh Agarwal
  • Borja Balle
  • Christoph Dann
  • Aravindan Raghuveer

The offline reinforcement learning (RL) problem aims to learn an optimal policy from historical data collected by one or more behavioural policies (experts) by interacting with an environment. However, the individual experts may be privacy-sensitive in that the learnt policy may retain information about their precise choices. In some domains like personalized retrieval, advertising and healthcare, the expert choices are considered sensitive data. To provably protect the privacy of such experts, we propose a novel consensus-based expert-level differentially private offline RL training approach compatible with any existing offline RL algorithm. We prove rigorous differential privacy guarantees, while maintaining strong empirical performance. Unlike existing work in differentially private RL, we supplement the theory with proof-of-concept experiments on classic RL environments featuring large continuous state spaces, demonstrating substantial improvements over a natural baseline across multiple tasks.

ICLR Conference 2025 Conference Paper

Rewarding Progress: Scaling Automated Process Verifiers for LLM Reasoning

  • Amrith Setlur
  • Chirag Nagpal
  • Adam Fisch
  • Xinyang Geng
  • Jacob Eisenstein
  • Rishabh Agarwal
  • Alekh Agarwal
  • Jonathan Berant

A promising approach for improving reasoning in large language models is to use process reward models (PRMs). PRMs provide feedback at each step of a multi-step reasoning trace, improving credit assignment over outcome reward models (ORMs) that only provide feedback at the final step. However, collecting dense, per-step human labels is not scalable, and training PRMs from automatically-labeled data has thus far led to limited gains. With the goal of using PRMs to improve a *base* policy via test-time search and reinforcement learning (RL), we ask: “How should we design process rewards?” Our key insight is that, to be effective, the process reward for a step should measure *progress*: a change in the likelihood of producing a correct response in the future, before and after taking the step, as measured under a *prover* policy distinct from the base policy. Such progress values can *distinguish* good and bad steps generated by the base policy, even though the base policy itself cannot. Theoretically, we show that even weaker provers can improve the base policy, as long as they distinguish steps without being too misaligned with the base policy. Our results show that process rewards defined as progress under such provers improve the efficiency of exploration during test-time search and online RL. We empirically validate our claims by training **process advantage verifiers (PAVs)** to measure progress under such provers and show that compared to ORM, they are >8% more accurate, and 1.5-5x more compute-efficient. Equipped with these insights, our PAVs enable **one of the first results** showing a 6x gain in sample efficiency for a policy trained using online RL with PRMs vs. ORMs.
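
A hedged toy of the progress reward: `prover_success_prob`, `progress_reward`, and the toy prover are illustrative stand-ins; the paper trains process advantage verifiers (PAVs) to predict this quantity rather than estimating it by rollouts at inference time.

```python
import numpy as np

def prover_success_prob(prefix_steps, prover_policy, n_rollouts=8):
    # Monte-Carlo estimate of the chance the prover reaches a correct
    # final answer when continuing from `prefix_steps`.
    wins = sum(prover_policy(prefix_steps) for _ in range(n_rollouts))
    return wins / n_rollouts

def progress_reward(steps, prover_policy):
    # Per-step process reward: change in the prover's success probability
    # before vs. after each step (an advantage under the prover).
    rewards = []
    prev = prover_success_prob([], prover_policy)
    for t in range(1, len(steps) + 1):
        cur = prover_success_prob(steps[:t], prover_policy)
        rewards.append(cur - prev)
        prev = cur
    return rewards

# Toy prover: succeeds more often the more correct steps the prefix has.
rng = np.random.default_rng(1)
toy_prover = lambda prefix: rng.random() < 0.2 + 0.15 * sum(prefix)
print(progress_reward([1, 0, 1, 1], toy_prover))
```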

TMLR Journal 2025 Journal Article

Robust Preference Optimization through Reward Model Distillation

  • Adam Fisch
  • Jacob Eisenstein
  • Vicky Zayats
  • Alekh Agarwal
  • Ahmad Beirami
  • Chirag Nagpal
  • Peter Shaw
  • Jonathan Berant

Language model (LM) post-training (or alignment) involves maximizing a reward function that is derived from preference annotations. Direct Preference Optimization (DPO) is a popular offline alignment method that trains a policy directly on preference data without the need to train a reward model or apply reinforcement learning. However, the empirical evidence suggests that DPO typically assigns implicit rewards that overfit, and trend towards infinite magnitude. This frequently leads to degenerate policies, sometimes causing even the probabilities of the preferred generations to go to zero. In this work, we analyze this phenomenon and use distillation to get a better proxy for the true preference distribution over generation pairs: we train the LM such that its induced implicit reward, i.e., the scaled log-likelihood ratio of the model to the reference model, matches an explicit reward model trained on the preference data. Moreover, to account for uncertainty in the reward model we are distilling from, we optimize against a family of reward models that, as a whole, is likely to include at least one reasonable proxy for the preference distribution. Our results show that distilling from such a family of reward models leads to improved robustness to distribution shift in preference annotations, while preserving the simple supervised nature of DPO.
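
A minimal sketch of the distillation objective as described, assuming helper names (`policy_logp_*`, `rm_*`) that are illustrative rather than any released API; the uncertainty-aware family-of-reward-models variant is not shown:

```python
import torch

def distillation_loss(policy_logp_a, policy_logp_b,
                      ref_logp_a, ref_logp_b,
                      rm_a, rm_b, beta=0.1):
    # Implicit reward of the policy: scaled log-likelihood ratio to the
    # reference model, computed for each response in the pair.
    imp_a = beta * (policy_logp_a - ref_logp_a)
    imp_b = beta * (policy_logp_b - ref_logp_b)
    # Match the implicit reward *gap* to the explicit reward model's gap,
    # which pins the rewards down instead of letting them drift to infinity.
    return ((imp_a - imp_b) - (rm_a - rm_b)).pow(2).mean()

torch.manual_seed(0)
t = lambda: torch.randn(16)   # stand-in per-example log-probs / RM scores
print(distillation_loss(t(), t(), t(), t(), t(), t()))
```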

ICML Conference 2025 Conference Paper

Theoretical guarantees on the best-of-n alignment policy

  • Ahmad Beirami
  • Alekh Agarwal
  • Jonathan Berant
  • Alexander Nicholas D'Amour
  • Jacob Eisenstein
  • Chirag Nagpal
  • Ananda Theertha Suresh

A simple and effective method for the inference-time alignment of generative models is the best-of-$n$ policy, where $n$ samples are drawn from a reference policy, ranked based on a reward function, and the highest ranking one is selected. A commonly used analytical expression in the literature claims that the KL divergence between the best-of-$n$ policy and the reference policy is equal to $\log(n) - (n-1)/n$. We disprove the validity of this claim, and show that it is an upper bound on the actual KL divergence. We also explore the tightness of this upper bound in different regimes, and propose a new estimator for the KL divergence and empirically show that it provides a tight approximation. We also show that the win rate of the best-of-$n$ policy against the reference policy is upper bounded by $n/(n+1)$ and derive bounds on the tightness of this characterization. We conclude with analyzing the tradeoffs between win rate and KL divergence of the best-of-$n$ alignment policy, which demonstrate that very good tradeoffs are achievable with $n < 1000$.
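
A small numeric check of the quantities discussed, under a toy continuous reward model (for continuous rewards, the win rate of the best of $n$ i.i.d. draws against one reference draw is exactly $n/(n+1)$):

```python
import numpy as np

def bon_kl_upper_bound(n):
    # The commonly quoted expression; the paper shows it is an upper
    # bound on KL(best-of-n || reference), not an equality.
    return np.log(n) - (n - 1) / n

def bon_win_rate_bound(n):
    # Win rate of best-of-n against the reference is at most n/(n+1).
    return n / (n + 1)

rng = np.random.default_rng(0)
n, trials = 16, 100_000
best = rng.random((trials, n)).max(axis=1)   # reward of a best-of-n draw
ref = rng.random(trials)                     # reward of a reference draw
print(bon_kl_upper_bound(n), bon_win_rate_bound(n), (best > ref).mean())
```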

ICML Conference 2024 Conference Paper

A Minimaximalist Approach to Reinforcement Learning from Human Feedback

  • Gokul Swamy 0001
  • Christoph Dann
  • Rahul Kidambi
  • Zhiwei Steven Wu
  • Alekh Agarwal

We present Self-Play Preference Optimization (SPO), an algorithm for reinforcement learning from human feedback. Our approach is minimalist in that it requires neither training a reward model nor unstable adversarial training, and is therefore simple to implement. Our approach is maximalist in that it provably handles non-Markovian, intransitive, and stochastic preferences while being robust to the compounding errors that plague offline approaches to sequential prediction. To achieve the preceding qualities, we build upon the concept of a Minimax Winner (MW), a notion of preference aggregation from the social choice theory literature that frames learning from preferences as a zero-sum game between two policies. By leveraging the symmetry of this game, we prove that rather than using the traditional technique of dueling two policies to compute the MW, we can simply have a single agent play against itself while maintaining strong convergence guarantees. Practically, this corresponds to sampling multiple trajectories from a policy, asking a preference or teacher model to compare them, and then using the proportion of wins as the reward for a particular trajectory. We demonstrate that on a suite of continuous control tasks, we are able to learn significantly more efficiently than reward-model based approaches while maintaining robustness to the intransitive and stochastic preferences that frequently occur in practice when aggregating human judgments.
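
A hedged sketch of SPO's reward construction: sample several trajectories, compare them pairwise with a (possibly noisy, intransitive) preference oracle, and use each trajectory's win fraction as its reward. `prefer` and the scalar trajectory scores are toy stand-ins:

```python
import numpy as np

def spo_rewards(trajectory_scores, prefer):
    # Each trajectory's reward is the fraction of pairwise comparisons it
    # wins against the other samples from the same policy (self-play).
    m = len(trajectory_scores)
    rewards = np.zeros(m)
    for i in range(m):
        wins = sum(prefer(trajectory_scores[i], trajectory_scores[j])
                   for j in range(m) if j != i)
        rewards[i] = wins / (m - 1)
    return rewards

rng = np.random.default_rng(0)
scores = rng.normal(size=6)
# Noisy (hence possibly intransitive) preference oracle.
noisy_prefer = lambda a, b: (a - b + rng.normal(scale=0.5)) > 0
print(spo_rewards(scores, noisy_prefer))
```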

JMLR Journal 2024 Journal Article

Model-Free Representation Learning and Exploration in Low-Rank MDPs

  • Aditya Modi
  • Jinglin Chen
  • Akshay Krishnamurthy
  • Nan Jiang
  • Alekh Agarwal

The low-rank MDP has emerged as an important model for studying representation learning and exploration in reinforcement learning. With a known representation, several model-free exploration strategies exist. In contrast, all algorithms for the unknown representation setting are model-based, thereby requiring the ability to model the full dynamics. In this work, we present the first model-free representation learning algorithms for low-rank MDPs. The key algorithmic contribution is a new minimax representation learning objective, for which we provide variants with differing tradeoffs in their statistical and computational properties. We interleave this representation learning step with an exploration strategy to cover the state space in a reward-free manner. The resulting algorithms are provably sample efficient and can accommodate general function approximation to scale to complex environments.

ICML Conference 2024 Conference Paper

More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

  • Kaiwen Wang
  • Owen Oertell
  • Alekh Agarwal
  • Nathan Kallus
  • Wen Sun 0002

In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the return distribution, can obtain second-order bounds in both online and offline RL in general settings with function approximation. Second-order bounds are instance-dependent bounds that scale with the variance of return, which we prove are tighter than the previously known small-loss bounds of distributional RL. To the best of our knowledge, our results are the first second-order bounds for low-rank MDPs and for offline RL. When specializing to contextual bandits (one-step RL problem), we show that a distributional learning based optimism algorithm achieves a second-order worst-case regret bound, and a second-order gap dependent bound, simultaneously. We also empirically demonstrate the benefit of DistRL in contextual bandits on real-world datasets. We highlight that our analysis with DistRL is relatively simple, follows the general framework of optimism in the face of uncertainty and does not require weighted regression. Our results suggest that DistRL is a promising framework for obtaining second-order bounds in general RL settings, thus further reinforcing the benefits of DistRL.

NeurIPS Conference 2024 Conference Paper

Small steps no more: Global convergence of stochastic gradient bandits for arbitrary learning rates

  • Jincheng Mei
  • Bo Dai
  • Alekh Agarwal
  • Sharan Vaswani
  • Anant Raj
  • Csaba Szepesvári
  • Dale Schuurmans

We provide a new understanding of the stochastic gradient bandit algorithm by showing that it converges to a globally optimal policy almost surely using \emph{any} constant learning rate. This result demonstrates that the stochastic gradient algorithm continues to balance exploration and exploitation appropriately even in scenarios where standard smoothness and noise control assumptions break down. The proofs are based on novel findings about action sampling rates and the relationship between cumulative progress and noise, and extend the current understanding of how simple stochastic gradient methods behave in bandit settings.
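
A toy simulation of the algorithm analyzed here, a softmax bandit updated by plain REINFORCE-style stochastic gradients with a constant learning rate (the problem and the step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.9])
theta = np.zeros(3)          # softmax logits over the three arms
eta = 0.5                    # constant learning rate, per the result above

for t in range(20_000):
    pi = np.exp(theta - theta.max())
    pi /= pi.sum()
    a = rng.choice(3, p=pi)
    r = rng.normal(true_means[a], 1.0)   # stochastic reward
    grad = -pi * r                       # r * d log pi(a) / d theta ...
    grad[a] += r                         # ... = r * (e_a - pi)
    theta += eta * grad                  # plain constant-step SGD

print(pi)  # concentrates on the optimal arm (index 2)
```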

ICML Conference 2024 Conference Paper

The Non-linear F-Design and Applications to Interactive Learning

  • Alekh Agarwal
  • Jian Qian
  • Alexander Rakhlin
  • Tong Zhang 0001

We propose a generalization of the classical G-optimal design concept to non-linear function classes. The criterion, termed F-design, coincides with G-design in the linear case. We compute the value of the optimal design, termed the F-condition number, for several non-linear function classes. We further provide algorithms to construct designs with a bounded F-condition number. Finally, we employ the F-design in a variety of interactive machine learning tasks, where the design is naturally useful for data collection or exploration. We show that in four diverse settings of confidence band construction, contextual bandits, model-free reinforcement learning, and active learning, F-design can be combined with existing approaches in a black-box manner to yield state-of-the-art results in known problem settings as well as to generalize to novel ones.
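
For intuition, a sketch of the classical linear case that F-design reduces to: computing a G-optimal design with the standard multiplicative update, whose optimum has maximum leverage equal to $d$ (Kiefer-Wolfowitz equivalence). The paper's non-linear construction is more involved:

```python
import numpy as np

def g_optimal_design(X, iters=500):
    # Multiplicative-update algorithm for the linear G-design: up-weight
    # points with high leverage x^T A(rho)^{-1} x until the worst-case
    # leverage approaches d.
    n, d = X.shape
    rho = np.full(n, 1.0 / n)
    for _ in range(iters):
        A = X.T @ (rho[:, None] * X)
        lev = np.einsum('ij,jk,ik->i', X, np.linalg.inv(A), X)
        rho *= lev / d
        rho /= rho.sum()
    A = X.T @ (rho[:, None] * X)
    lev = np.einsum('ij,jk,ik->i', X, np.linalg.inv(A), X)
    return rho, lev.max()

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
rho, worst = g_optimal_design(X)
print(worst)  # close to d = 5 at (near-)optimality
```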

ICML Conference 2023 Conference Paper

Learning in POMDPs is Sample-Efficient with Hindsight Observability

  • Jonathan Lee 0002
  • Alekh Agarwal
  • Christoph Dann
  • Tong Zhang 0001

POMDPs capture a broad class of decision making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information either is revealed or can be computed at some point during the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a Hindsight Observable Markov Decision Process (HOMDP) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.

NeurIPS Conference 2023 Conference Paper

Ordering-based Conditions for Global Convergence of Policy Gradient Methods

  • Jincheng Mei
  • Bo Dai
  • Alekh Agarwal
  • Mohammad Ghavamzadeh
  • Csaba Szepesvari
  • Dale Schuurmans

We prove that, for finite-arm bandits with linear function approximation, the global convergence of policy gradient (PG) methods depends on inter-related properties between the policy update and the representation. First, we establish a few key observations that frame the study: **(i)** Global convergence can be achieved under linear function approximation without policy or reward realizability, both for the standard Softmax PG and natural policy gradient (NPG). **(ii)** Approximation error is not a key quantity for characterizing global convergence in either algorithm. **(iii)** The conditions on the representation that imply global convergence are different between these two algorithms. Overall, these observations call into question approximation error as an appropriate quantity for characterizing the global convergence of PG methods under linear function approximation. Second, motivated by these observations, we establish new general results: **(i)** NPG with linear function approximation achieves global convergence *if and only if* the projection of the reward onto the representable space preserves the optimal action's rank, a quantity that is not strongly related to approximation error. **(ii)** The global convergence of Softmax PG occurs if the representation satisfies a non-domination condition and can preserve the ranking of rewards, which goes well beyond policy or reward realizability. We provide experimental results to support these theoretical findings.

ICML Conference 2023 Conference Paper

Stochastic Gradient Succeeds for Bandits

  • Jincheng Mei
  • Zixin Zhong
  • Bo Dai 0001
  • Alekh Agarwal
  • Csaba Szepesvári
  • Dale Schuurmans

We show that the stochastic gradient bandit algorithm converges to a globally optimal policy at an $O(1/t)$ rate, even with a constant step size. Remarkably, global convergence of the stochastic gradient bandit algorithm has not been previously established, even though it is an old algorithm known to be applicable to bandits. The new result is achieved by establishing two novel technical findings: first, the noise of the stochastic updates in the gradient bandit algorithm satisfies a strong “growth condition” property, where the variance diminishes whenever progress becomes small, implying that additional noise control via diminishing step sizes is unnecessary; second, a form of “weak exploration” is automatically achieved through the stochastic gradient updates, since they prevent the action probabilities from decaying faster than $O(1/t)$, thus ensuring that every action is sampled infinitely often with probability $1$. These two findings can be used to show that the stochastic gradient update is already “sufficient” for bandits in the sense that exploration versus exploitation is automatically balanced in a manner that ensures almost sure convergence to a global optimum. These novel theoretical findings are further verified by experimental results.

ICML Conference 2022 Conference Paper

Adversarially Trained Actor Critic for Offline Reinforcement Learning

  • Ching-An Cheng
  • Tengyang Xie
  • Nan Jiang 0008
  • Alekh Agarwal

We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game framing of offline RL: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.

ICML Conference 2022 Conference Paper

Efficient Reinforcement Learning in Block MDPs: A Model-free Representation Learning approach

  • Xuezhou Zhang
  • Yuda Song 0001
  • Masatoshi Uehara
  • Mengdi Wang 0001
  • Alekh Agarwal
  • Wen Sun 0002

We present BRIEE, an algorithm for efficient reinforcement learning in Markov Decision Processes with block-structured dynamics (i.e., Block MDPs), where rich observations are generated from a set of unknown latent states. BRIEE interleaves latent state discovery, exploration, and exploitation, and can provably learn a near-optimal policy with sample complexity scaling polynomially in the number of latent states, actions, and the time horizon, with no dependence on the size of the potentially infinite observation space. Empirically, we show that BRIEE is more sample efficient than the state-of-the-art Block MDP algorithm HOMER and other empirical RL baselines on challenging rich-observation combination lock problems which require deep exploration.

NeurIPS Conference 2022 Conference Paper

Model-based RL with Optimistic Posterior Sampling: Structural Conditions and Sample Complexity

  • Alekh Agarwal
  • Tong Zhang

We propose a general framework to design posterior sampling methods for model-based RL. We show that the proposed algorithms can be analyzed by reducing regret to Hellinger distance in conditional probability estimation. We further show that optimistic posterior sampling can control this Hellinger distance, when we measure model error via data likelihood. This technique allows us to design and analyze unified posterior sampling algorithms with state-of-the-art sample complexity guarantees for many model-based RL settings. We illustrate our general result in many special cases, demonstrating the versatility of our framework.

NeurIPS Conference 2022 Conference Paper

On the Statistical Efficiency of Reward-Free Exploration in Non-Linear RL

  • Jinglin Chen
  • Aditya Modi
  • Akshay Krishnamurthy
  • Nan Jiang
  • Alekh Agarwal

We study reward-free reinforcement learning (RL) under general non-linear function approximation, and establish sample efficiency and hardness results under various standard structural assumptions. On the positive side, we propose the RFOLIVE (Reward-Free OLIVE) algorithm for sample-efficient reward-free exploration under minimal structural assumptions, which covers the previously studied settings of linear MDPs (Jin et al., 2020b), linear completeness (Zanette et al., 2020b) and low-rank MDPs with unknown representation (Modi et al., 2021). Our analyses indicate that the explorability or reachability assumptions, previously made for the latter two settings, are not statistically necessary for reward-free exploration. On the negative side, we provide a statistical hardness result for both reward-free and reward-aware exploration under linear completeness assumptions when the underlying features are unknown, showing an exponential separation between low-rank and linear completeness settings.

ICLR Conference 2022 Conference Paper

Provably Filtering Exogenous Distractors using Multistep Inverse Dynamics

  • Yonathan Efroni
  • Dipendra Misra
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • John Langford 0001

Many real-world applications of reinforcement learning (RL) require the agent to deal with high-dimensional observations such as those generated from a megapixel camera. Prior work has addressed such problems with representation learning, through which the agent can provably extract endogenous, latent state information from raw observations and subsequently plan efficiently. However, such approaches can fail in the presence of temporally correlated noise in the observations, a phenomenon that is common in practice. We initiate the formal study of latent state discovery in the presence of such exogenous noise sources by proposing a new model, the Exogenous Block MDP (EX-BMDP), for rich observation RL. We start by establishing several negative results, highlighting failure cases of prior representation learning based approaches. Then, we introduce the Predictive Path Elimination (PPE) algorithm, which learns a generalization of inverse dynamics and is provably sample- and computationally efficient in EX-BMDPs when the endogenous state dynamics are near-deterministic. The sample complexity of PPE depends polynomially on the size of the latent endogenous state space while not directly depending on the size of the observation space, nor the exogenous state space. We provide experiments on challenging exploration problems which show that our approach works empirically.

JMLR Journal 2021 Journal Article

A Contextual Bandit Bake-off

  • Alberto Bietti
  • Alekh Agarwal
  • John Langford

Contextual bandit algorithms are essential for solving many real-world interactive machine learning problems. Despite multiple recent successes on statistically optimal and computationally efficient methods, the practical behavior of these algorithms is still poorly understood. We leverage the availability of large numbers of supervised learning datasets to empirically evaluate contextual bandit algorithms, focusing on practical methods that learn by relying on optimization oracles from supervised learning. We find that a recent method (Foster et al., 2018) using optimism under uncertainty works the best overall. A surprisingly close second is a simple greedy baseline that only explores implicitly through the diversity of contexts, followed by a variant of Online Cover (Agarwal et al., 2014) which tends to be more conservative but robust to problem specification by design. Along the way, we also evaluate various components of contextual bandit algorithm design such as loss estimators. Overall, this is a thorough study and review of contextual bandit methodology.

NeurIPS Conference 2021 Conference Paper

Bellman-consistent Pessimism for Offline Reinforcement Learning

  • Tengyang Xie
  • Ching-An Cheng
  • Nan Jiang
  • Paul Mineiro
  • Alekh Agarwal

The use of pessimism when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness, as is standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear function approximation where stronger expressivity assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal O(d)$ in its sample complexity (when the action space is finite). Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.

JMLR Journal 2021 Journal Article

On the Theory of Policy Gradient Methods: Optimality, Approximation, and Distribution Shift

  • Alekh Agarwal
  • Sham M. Kakade
  • Jason D. Lee
  • Gaurav Mahajan

Policy gradient methods are among the most effective methods in challenging reinforcement learning problems with large state and/or action spaces. However, little is known about even their most basic theoretical convergence properties, including: if and how fast they converge to a globally optimal solution or how they cope with approximation error due to using a restricted class of parametric policies. This work provides provable characterizations of the computational, approximation, and sample size properties of policy gradient methods in the context of discounted Markov Decision Processes (MDPs). We focus on both: "tabular" policy parameterizations, where the optimal policy is contained in the class and where we show global convergence to the optimal policy; and parametric policy classes (considering both log-linear and neural policy classes), which may not contain the optimal policy and where we provide agnostic learning results. One central contribution of this work is in providing approximation guarantees that are average case --- which avoid explicit worst-case dependencies on the size of state space --- by making a formal connection to supervised learning under distribution shift. This characterization shows an important interplay between estimation error, approximation error, and exploration (as characterized through a precisely defined condition number).

ICML Conference 2021 Conference Paper

Provably Correct Optimization and Exploration with Non-linear Policies

  • Fei Feng
  • Wotao Yin
  • Alekh Agarwal
  • Lin F. Yang

Policy optimization methods remain a powerful workhorse in empirical Reinforcement Learning (RL), with a focus on neural policies that can easily reason over complex and continuous state and/or action spaces. Theoretical understanding of strategic exploration in policy-based methods with non-linear function approximation, however, is largely missing. In this paper, we address this question by designing ENIAC, an actor-critic method that allows non-linear function approximation in the critic. We show that under certain assumptions, e.g., a bounded eluder dimension $d$ for the critic class, the learner finds a near-optimal policy in $\widetilde{O}(\mathrm{poly}(d))$ exploration rounds. The method is robust to model misspecification and strictly extends existing works on linear function approximation. We also develop some computational optimizations of our approach with slightly worse statistical guarantees, and an empirical adaptation building on existing deep RL tools. We empirically evaluate this adaptation, and show that it outperforms prior heuristics inspired by linear methods, establishing the value in correctly reasoning about the agent’s uncertainty under non-linear function approximation.

ICLR Conference 2020 Conference Paper

Deep Batch Active Learning by Diverse, Uncertain Gradient Lower Bounds

  • Jordan T. Ash
  • Chicheng Zhang
  • Akshay Krishnamurthy
  • John Langford 0001
  • Alekh Agarwal

We design a new algorithm for batch active learning with deep neural network models. Our algorithm, Batch Active learning by Diverse Gradient Embeddings (BADGE), samples groups of points that are disparate and high-magnitude when represented in a hallucinated gradient space, a strategy designed to incorporate both predictive uncertainty and sample diversity into every selected batch. Crucially, BADGE trades off between diversity and uncertainty without requiring any hand-tuned hyperparameters. While other approaches sometimes succeed for particular batch sizes or architectures, BADGE consistently performs as well or better, making it a useful option for real world active learning problems.
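
A compact sketch of BADGE's selection step under stated assumptions (the softmax outputs and penultimate-layer features below are random stand-ins for a real network): build hallucinated gradient embeddings, then pick a batch by k-means++ seeding.

```python
import numpy as np

def badge_select(probs, feats, batch_size, rng):
    # Hallucinated gradient embedding: gradient of the loss at the model's
    # own most-likely label, w.r.t. the last linear layer.
    labels = probs.argmax(axis=1)
    scale = probs.copy()
    scale[np.arange(len(probs)), labels] -= 1.0   # (p - one_hot(yhat))
    g = (scale[:, :, None] * feats[:, None, :]).reshape(len(probs), -1)
    # k-means++ seeding: sample points far (in squared distance) from
    # everything chosen so far, favoring diverse AND high-magnitude points.
    chosen = [int(rng.integers(len(g)))]
    d2 = ((g - g[chosen[0]]) ** 2).sum(axis=1)
    while len(chosen) < batch_size:
        idx = int(rng.choice(len(g), p=d2 / d2.sum()))
        chosen.append(idx)
        d2 = np.minimum(d2, ((g - g[idx]) ** 2).sum(axis=1))
    return chosen

rng = np.random.default_rng(0)
probs = rng.dirichlet(np.ones(4), size=300)   # stand-in softmax outputs
feats = rng.normal(size=(300, 16))            # stand-in features
print(badge_select(probs, feats, batch_size=10, rng=rng))
```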

NeurIPS Conference 2020 Conference Paper

FLAMBE: Structural Complexity and Representation Learning of Low Rank MDPs

  • Alekh Agarwal
  • Sham Kakade
  • Akshay Krishnamurthy
  • Wen Sun

In order to deal with the curse of dimensionality in reinforcement learning (RL), it is common practice to make parametric assumptions where values or policies are functions of some low dimensional feature space. This work focuses on the representation learning question: how can we learn such features? Under the assumption that the underlying (unknown) dynamics correspond to a low rank transition matrix, we show how the representation learning question is related to a particular non-linear matrix decomposition problem. Structurally, we make precise connections between these low rank MDPs and latent variable models, showing how they significantly generalize prior formulations, such as block MDPs, for representation learning in RL. Algorithmically, we develop FLAMBE, which engages in exploration and representation learning for provably efficient RL in low rank transition models. On a technical level, our analysis eliminates reachability assumptions that appear in prior results on the simpler block MDP model and may be of independent interest.

AAAI Conference 2020 Conference Paper

Metareasoning in Modular Software Systems: On-the-Fly Configuration Using Reinforcement Learning with Rich Contextual Representations

  • Aditya Modi
  • Debadeepta Dey
  • Alekh Agarwal
  • Adith Swaminathan
  • Besmira Nushi
  • Sean Andrist
  • Eric Horvitz

Assemblies of modular subsystems are being pressed into service to perform sensing, reasoning, and decision making in high-stakes, time-critical tasks in areas such as transportation, healthcare, and industrial automation. We address the opportunity to maximize the utility of an overall computing system by employing reinforcement learning to guide the configuration of the set of interacting modules that comprise the system. System-wide optimization is a combinatorial challenge: local attempts to boost the performance of a specific module by modifying its configuration often lead to losses in the overall utility of the system’s performance, as the distribution of inputs to downstream modules changes drastically. We present metareasoning techniques which consider a rich representation of the input, monitor the state of the entire pipeline, and adjust the configuration of modules on-the-fly so as to maximize the utility of a system’s operation. We show significant improvement in both real-world and synthetic pipelines across a variety of reinforcement learning techniques.

NeurIPS Conference 2020 Conference Paper

PC-PG: Policy Cover Directed Exploration for Provable Policy Gradient Learning

  • Alekh Agarwal
  • Mikael Henaff
  • Sham Kakade
  • Wen Sun

Direct policy gradient methods for reinforcement learning are a successful approach for a variety of reasons: they are model free, they directly optimize the performance metric of interest, and they allow for richly parameterized policies. Their primary drawback is that, by being local in nature, they fail to adequately explore the environment. In contrast, while model-based approaches and Q-learning can, at least in theory, directly handle exploration through the use of optimism, their ability to handle model misspecification and function approximation is far less evident. This work introduces the POLICY COVER GUIDED POLICY GRADIENT (PC-PG) algorithm, which provably balances the exploration vs. exploitation tradeoff using an ensemble of learned policies (the policy cover). PC-PG enjoys polynomial sample complexity and run time for both tabular MDPs and, more generally, linear MDPs in an infinite dimensional RKHS. Furthermore, PC-PG also has strong guarantees under model misspecification that go beyond the standard worst-case $L_\infty$ assumptions; these include approximation guarantees for state aggregation under an average case error assumption, along with guarantees under a more general assumption where the approximation error under distribution shift is controlled. We complement the theory with empirical evaluation across a variety of domains in both reward-free and reward-driven settings.

NeurIPS Conference 2020 Conference Paper

Policy Improvement via Imitation of Multiple Oracles

  • Ching-An Cheng
  • Andrey Kolobov
  • Alekh Agarwal

Despite its promise, reinforcement learning’s real-world adoption has been hampered by the need for costly exploration to learn a good policy. Imitation learning (IL) mitigates this shortcoming by using an oracle policy during training as a bootstrap to accelerate the learning process. However, in many practical situations, the learner has access to multiple suboptimal oracles, which may provide conflicting advice in a state. The existing IL literature provides a limited treatment of such scenarios. Whereas in the single-oracle case, the return of the oracle’s policy provides an obvious benchmark for the learner to compete against, neither such a benchmark nor principled ways of outperforming it are known for the multi-oracle setting. In this paper, we propose the state-wise maximum of the oracle policies’ values as a natural baseline to resolve conflicting advice from multiple oracles. Using a reduction of policy optimization to online learning, we introduce a novel IL algorithm MAMBA, which can provably learn a policy competitive with this benchmark. In particular, MAMBA optimizes policies by using a gradient estimator in the style of generalized advantage estimation (GAE). Our theoretical analysis shows that this design makes MAMBA robust and enables it to outperform the oracle policies by a larger margin than the IL state of the art, even in the single-oracle case. In an evaluation against standard policy gradient with GAE and AggreVaTe(D), we showcase MAMBA’s ability to leverage demonstrations both from a single and from multiple weak oracles, and significantly speed up policy optimization.

NeurIPS Conference 2020 Conference Paper

Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration

  • Yao Liu
  • Adith Swaminathan
  • Alekh Agarwal
  • Emma Brunskill

Batch reinforcement learning (RL) is important for applying RL algorithms to many high-stakes tasks. Doing batch RL in a way that yields a reliable new policy in large domains is challenging: a new decision policy may visit states and actions outside the support of the batch data, and function approximation and optimization with limited samples can further increase the potential of learning policies with overly optimistic estimates of their future performance. Some recent approaches to address these concerns have shown promise, but can still be overly optimistic in their expected outcomes. Theoretical work that provides strong guarantees on the performance of the output policy relies on a strong concentrability assumption, which makes it unsuitable for cases where the ratio between state-action distributions of behavior policy and some candidate policies is large. This is because, in the traditional analysis, the error bound scales up with this ratio. We show that using \emph{pessimistic value estimates} in the low-data regions in Bellman optimality and evaluation back-up can yield more adaptive and stronger guarantees when the concentrability assumption does not hold. In certain settings, they can find the approximately best policy within the state-action space explored by the batch data, without requiring a priori assumptions of concentrability. We highlight the necessity of our pessimistic update and the limitations of previous algorithms and analyses by illustrative MDP examples and demonstrate an empirical comparison of our algorithm and other state-of-the-art batch RL baselines in standard benchmarks.
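
A hedged tabular illustration of the pessimistic backup: value estimates at low-count state-action pairs are pulled down before the Bellman update. The count-based penalty form is an illustrative choice, not the paper's exact construction.

```python
import numpy as np

def pessimistic_q(counts, rewards, gamma=0.95, beta=1.0, iters=300):
    # counts[s, a, s']: batch transition counts; rewards[s, a]: mean rewards.
    # Low-count (s, a) pairs receive a larger penalty, discouraging the
    # learned policy from leaving the support of the batch data.
    n_sa = counts.sum(axis=2)
    p_hat = counts / np.maximum(n_sa[:, :, None], 1.0)
    penalty = beta / np.sqrt(np.maximum(n_sa, 1.0))
    q = np.zeros_like(rewards)
    for _ in range(iters):
        v = q.max(axis=1)                            # greedy value of current Q
        q = rewards + gamma * (p_hat @ v) - penalty  # pessimistic backup
    return q

rng = np.random.default_rng(0)
counts = rng.integers(0, 5, size=(6, 2, 6)).astype(float)
rewards = rng.random((6, 2))
print(pessimistic_q(counts, rewards).round(2))
```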

NeurIPS Conference 2020 Conference Paper

Safe Reinforcement Learning via Curriculum Induction

  • Matteo Turchetta
  • Andrey Kolobov
  • Shital Shah
  • Andreas Krause
  • Alekh Agarwal

In safety-critical applications, autonomous agents may need to learn in an environment where mistakes can be very costly. In such settings, the agent needs to behave safely not only after but also while learning. To achieve this, existing safe reinforcement learning methods make an agent rely on priors that let it avoid dangerous situations during exploration with high probability, but both the probabilistic guarantees and the smoothness assumptions inherent in the priors are not viable in many scenarios of interest such as autonomous driving. This paper presents an alternative approach inspired by human teaching, where an agent learns under the supervision of an automatic instructor that saves the agent from violating constraints during learning. In this model, we introduce the monitor that neither needs to know how to do well at the task the agent is learning nor needs to know how the environment works. Instead, it has a library of reset controllers that it activates when the agent starts behaving dangerously, preventing it from doing damage. Crucially, the choices of which reset controller to apply in which situation affect the speed of agent learning. Based on observing agents' progress the teacher itself learns a policy for choosing the reset controllers, a curriculum, to optimize the agent's final policy reward. Our experiments use this framework in two environments to induce curricula for safe and efficient learning.

JMLR Journal 2019 Journal Article

Active Learning for Cost-Sensitive Classification

  • Akshay Krishnamurthy
  • Alekh Agarwal
  • Tzu-Kuo Huang
  • Hal Daumé III
  • John Langford

We design an active learning algorithm for cost-sensitive multiclass classification: problems where different errors have different costs. Our algorithm, COAL, makes predictions by regressing to each label's cost and predicting the smallest. On a new example, it uses a set of regressors that perform well on past data to estimate possible costs for each label. It queries only the labels that could be the best, ignoring the sure losers. We prove COAL can be efficiently implemented for any regression family that admits squared loss optimization; it also enjoys strong guarantees with respect to predictive performance and labeling effort. We empirically compare COAL to passive learning and several active learning baselines, showing significant improvements in labeling effort and test cost on real-world datasets.
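
A rough sketch of COAL's query rule, approximating the version space of good regressors with a bootstrap ensemble (the Ridge regressors and all names below are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.utils import resample

def cost_ranges(models, x):
    # Min/max predicted cost per label across the ensemble, a stand-in for
    # the spread of predictions over regressors that fit past data well.
    preds = np.array([[m.predict(x[None])[0] for m in per_label]
                      for per_label in models])     # (labels, ensemble)
    return preds.min(axis=1), preds.max(axis=1)

def labels_to_query(models, x):
    lo, hi = cost_ranges(models, x)
    # Query a label only if its optimistic cost could still be the best;
    # "sure losers" (lo above someone else's worst case) are skipped.
    return [k for k in range(len(lo)) if lo[k] <= hi.min()]

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
C = np.abs(X @ rng.normal(size=(5, 3))) + 0.1 * rng.normal(size=(400, 3))
models = [[Ridge(alpha=1.0).fit(*resample(X, C[:, k], random_state=s))
           for s in range(10)] for k in range(3)]
print(labels_to_query(models, rng.normal(size=5)))
```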

NeurIPS Conference 2019 Conference Paper

Bias Correction of Learned Generative Models using Likelihood-Free Importance Weighting

  • Aditya Grover
  • Jiaming Song
  • Ashish Kapoor
  • Kenneth Tran
  • Alekh Agarwal
  • Eric Horvitz
  • Stefano Ermon

A learned generative model often produces biased statistics relative to the underlying data distribution. A standard technique to correct this bias is importance sampling, where samples from the model are weighted by the likelihood ratio under model and true distributions. When the likelihood ratio is unknown, it can be estimated by training a probabilistic classifier to distinguish samples from the two distributions. We employ this likelihood-free importance weighting method to correct for the bias in generative models. We find that this technique consistently improves standard goodness-of-fit metrics for evaluating the sample quality of state-of-the-art deep generative models, suggesting reduced bias. Finally, we demonstrate its utility on representative applications in a) data augmentation for classification using generative adversarial networks, and b) model-based policy evaluation using off-policy data.
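
A self-contained sketch of the likelihood-free importance weighting step: train a probabilistic classifier to separate data from model samples, then weight each model sample by $D/(1-D)$. The Gaussians here are toy stand-ins for a deep generative model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(5000, 1))   # "data" distribution
fake = rng.normal(0.5, 1.0, size=(5000, 1))   # biased "model" samples

X = np.vstack([real, fake])
y = np.concatenate([np.ones(5000), np.zeros(5000)])
clf = LogisticRegression().fit(X, y)

# Importance weight for a model sample x: D(x) / (1 - D(x)) estimates the
# likelihood ratio p_data(x) / p_model(x), correcting model-based statistics.
d = clf.predict_proba(fake)[:, 1]
w = d / (1.0 - d)
print(fake.mean(), np.average(fake[:, 0], weights=w), real.mean())
```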

ICML Conference 2019 Conference Paper

Fair Regression: Quantitative Definitions and Reduction-Based Algorithms

  • Alekh Agarwal
  • Miroslav Dudík
  • Zhiwei Steven Wu

In this paper, we study the prediction of a real-valued target, such as a risk score or recidivism rate, while guaranteeing a quantitative notion of fairness with respect to a protected attribute such as gender or race. We call this class of problems fair regression. We propose general schemes for fair regression under two notions of fairness: (1) statistical parity, which asks that the prediction be statistically independent of the protected attribute, and (2) bounded group loss, which asks that the prediction error restricted to any protected group remain below some pre-determined level. While we only study these two notions of fairness, our schemes are applicable to arbitrary Lipschitz-continuous losses, and so they encompass least-squares regression, logistic regression, quantile regression, and many other tasks. Our schemes only require access to standard risk minimization algorithms (such as standard classification or least-squares regression) while providing theoretical guarantees on the optimality and fairness of the obtained solutions. In addition to analyzing theoretical properties of our schemes, we empirically demonstrate their ability to uncover fairness–accuracy frontiers on several standard datasets.

UAI Conference 2019 Conference Paper

Off-Policy Policy Gradient with Stationary Distribution Correction

  • Yao Liu 0009
  • Adith Swaminathan
  • Alekh Agarwal
  • Emma Brunskill

We study the problem of off-policy policy optimization in Markov decision processes, and develop a novel off-policy policy gradient method. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data, and what would be the distribution of states under the learned policy. Here we build on recent progress for estimating the ratio of the state distributions under behavior and evaluation policies for policy evaluation, and present an off-policy policy gradient optimization technique that can account for this mismatch in distributions. We present an illustrative example of why this is important and a theoretical convergence guarantee for our approach. Empirically, we compare our method in simulations to several strong baselines which do not correct for this mismatch, significantly improving in the quality of the policy discovered.

RLDM Conference 2019 Conference Abstract

Off-Policy Policy Gradient with Stationary Distribution Correction

  • Yao Liu
  • Adith Swaminathan
  • Alekh Agarwal
  • Emma Brunskill

The ability to use data about prior decisions, and their outcomes, to make counterfactual inferences about how alternative decision policies might perform is a cornerstone of intelligent behavior and has substantial practical importance. We focus on the problem of performing such counterfactual inferences in the context of sequential decision making in a Markov decision process, and consider how to perform off-policy policy optimization using a policy gradient method. Policy gradient methods have had great recent success when used in online reinforcement learning, can often be a nice way to encode inductive bias, and are able to tackle continuous action domains. Prior off-policy policy gradient approaches have generally ignored the mismatch between the distribution of states visited under the behavior policy used to collect data, and what would be the distribution of states under a new target policy. Here we build on recent progress in estimating the ratio of the Markov chain stationary distributions of states for policy evaluation, and present an off-policy policy gradient optimization technique that can account for this mismatch in distributions. We present an illustrative example of why this is important, and empirical simulations to suggest the benefits of this approach. We hope this is a step towards practical algorithms that can efficiently leverage prior data in order to inform better future decision policies.

ICML Conference 2019 Conference Paper

Provably efficient RL with Rich Observations via Latent State Decoding

  • Simon S. Du
  • Akshay Krishnamurthy
  • Nan Jiang 0008
  • Alekh Agarwal
  • Miroslav Dudík
  • John Langford 0001

We study the exploration problem in episodic MDPs with rich observations generated from a small number of latent states. Under certain identifiability assumptions, we demonstrate how to estimate a mapping from the observations to latent states inductively through a sequence of regression and clustering steps—where previously decoded latent states provide labels for later regression problems—and use it to construct good exploration policies. We provide finite-sample guarantees on the quality of the learned state decoding function and exploration policies, and complement our theory with an empirical evaluation on a class of hard exploration problems. Our method exponentially improves over $Q$-learning with naïve exploration, even when $Q$-learning has cheating access to latent states.

ICML Conference 2019 Conference Paper

Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback

  • Chicheng Zhang
  • Alekh Agarwal
  • Hal Daumé III
  • John Langford 0001
  • Sahand Negahban

We investigate the feasibility of learning from both fully-labeled supervised data and contextual bandit data. We specifically consider settings in which the underlying learning signal may be different between these two data sources. Theoretically, we state and prove no-regret algorithms for learning that is robust to divergences between the two sources. Empirically, we evaluate some of these algorithms on a large selection of datasets, showing that our approaches are feasible, and helpful in practice.

ICML Conference 2018 Conference Paper

A Reductions Approach to Fair Classification

  • Alekh Agarwal
  • Alina Beygelzimer
  • Miroslav Dudík
  • John Langford 0001
  • Hanna M. Wallach

We present a systematic approach for achieving fairness in a binary classification setting. While we focus on two well-known quantitative definitions of fairness, our approach encompasses many other previously studied definitions as special cases. The key idea is to reduce fair classification to a sequence of cost-sensitive classification problems, whose solutions yield a randomized classifier with the lowest (empirical) error subject to the desired constraints. We introduce two reductions that work for any representation of the cost-sensitive classifier and compare favorably to prior baselines on a variety of data sets, while overcoming several of their disadvantages.
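
A simplified, single-constraint sketch of the reduction for demographic parity: a Lagrange multiplier turns the constrained problem into per-example costs, which a standard weighted classifier then minimizes. This is an illustrative loop, not the paper's full exponentiated-gradient algorithm.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4000
g = rng.integers(2, size=n)                        # protected attribute
X = np.c_[rng.normal(size=(n, 3)), g]
y = ((X[:, 0] + 0.8 * g + 0.3 * rng.normal(size=n)) > 0.5).astype(int)
p0, p1 = (g == 0).mean(), (g == 1).mean()

lam, eta = 0.0, 0.3
for _ in range(30):
    # Cost of predicting 1 minus cost of predicting 0, per example, under
    # the current multiplier on the demographic-parity constraint.
    delta = (1.0 - 2.0 * y) + np.where(g == 1, lam / p1, -lam / p0)
    z = (delta < 0).astype(int)                    # cheaper action as label
    clf = LogisticRegression(max_iter=1000).fit(
        X, z, sample_weight=np.abs(delta) + 1e-9)  # cost-sensitive oracle call
    pred = clf.predict(X)
    gap = pred[g == 1].mean() - pred[g == 0].mean()
    lam += eta * gap                               # multiplier ascent step
print(round(float(gap), 3))                        # parity gap shrinks toward 0
```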

ICML Conference 2018 Conference Paper

Hierarchical Imitation and Reinforcement Learning

  • Hoang Minh Le 0002
  • Nan Jiang 0008
  • Alekh Agarwal
  • Miroslav Dudík
  • Yisong Yue
  • Hal Daumé III

We study how to effectively leverage expert feedback to learn sequential decision-making policies. We focus on problems with sparse rewards and long time horizons, which typically pose significant challenges in reinforcement learning. We propose an algorithmic framework, called hierarchical guidance, that leverages the hierarchical structure of the underlying problem to integrate different modes of expert interaction. Our framework can incorporate different combinations of imitation learning (IL) and reinforcement learning (RL) at different levels, leading to dramatic reductions in both expert effort and cost of exploration. Using long-horizon benchmarks, including Montezuma’s Revenge, we demonstrate that our approach can learn significantly faster than hierarchical RL, and be significantly more label-efficient than standard IL. We also theoretically analyze labeling cost for certain instantiations of our framework.

NeurIPS Conference 2018 Conference Paper

On Oracle-Efficient PAC RL with Rich Observations

  • Christoph Dann
  • Nan Jiang
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • John Langford
  • Robert Schapire

We study the computational tractability of PAC reinforcement learning with rich observations. We present new provably sample-efficient algorithms for environments with deterministic hidden state dynamics and stochastic rich observations. These methods operate in an oracle model of computation -- accessing policy and value function classes exclusively through standard optimization primitives -- and therefore represent computationally efficient alternatives to prior algorithms that require enumeration. With stochastic hidden state dynamics, we prove that the only known sample-efficient algorithm, OLIVE, cannot be implemented in the oracle model. We also present several examples that illustrate fundamental challenges of tractable PAC reinforcement learning in such general settings.

ICML Conference 2018 Conference Paper

Practical Contextual Bandits with Regression Oracles

  • Dylan J. Foster
  • Alekh Agarwal
  • Miroslav Dudík
  • Haipeng Luo
  • Robert E. Schapire

A major challenge in contextual bandits is to design general-purpose algorithms that are both practically useful and theoretically well-founded. We present a new technique that has the empirical and computational advantages of realizability-based approaches combined with the flexibility of agnostic methods. Our algorithms leverage the availability of a regression oracle for the value-function class, a more realistic and reasonable oracle than the classification oracles over policies typically assumed by agnostic methods. Our approach generalizes both UCB and LinUCB to far more expressive possible model classes and achieves low regret under certain distributional assumptions. In an extensive empirical evaluation, we find that our approach typically matches or outperforms both realizability-based and agnostic baselines.

ICML Conference 2017 Conference Paper

Active Learning for Cost-Sensitive Classification

  • Akshay Krishnamurthy
  • Alekh Agarwal
  • Tzu-Kuo Huang
  • Hal Daumé III
  • John Langford 0001

We design an active learning algorithm for cost-sensitive multiclass classification: problems where different errors have different costs. Our algorithm, COAL, makes predictions by regressing to each label’s cost and predicting the smallest. On a new example, it uses a set of regressors that perform well on past data to estimate possible costs for each label. It queries only the labels that could be the best, ignoring the sure losers. We prove COAL can be efficiently implemented for any regression family that admits squared loss optimization; it also enjoys strong guarantees with respect to predictive performance and labeling effort. Our experiments with COAL show significant improvements in labeling effort and test cost over passive and active baselines.

ICML Conference 2017 Conference Paper

Contextual Decision Processes with low Bellman rank are PAC-Learnable

  • Nan Jiang 0008
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • John Langford 0001
  • Robert E. Schapire

This paper studies systematic exploration for reinforcement learning (RL) with rich observations and function approximation. We introduce contextual decision processes (CDPs), that unify most prior RL settings. Our first contribution is a complexity measure, the Bellman rank, that we show enables tractable learning of near-optimal behavior in CDPs and is naturally small for many well-studied RL models. Our second contribution is a new RL algorithm that does systematic exploration to learn near-optimal behavior in CDPs with low Bellman rank. The algorithm requires a number of samples that is polynomial in all relevant parameters but independent of the number of unique contexts. Our approach uses Bellman error minimization with optimistic exploration and provides new insights into efficient exploration for RL with function approximation.

RLDM Conference 2017 Conference Abstract

Contextual Decision Processes with low Bellman rank are PAC-Learnable

  • Nan Jiang
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • John Langford
  • Robert Schapire

This paper studies systematic exploration for reinforcement learning (RL) with rich observations and function approximation. We introduce contextual decision processes (CDPs), that unify and generalize most prior RL settings. Our first contribution is a complexity measure, the Bellman Rank, that we show enables tractable learning of near-optimal behavior in these processes and is naturally small for many well-studied RL settings. Our second contribution is a new RL algorithm that engages in systematic exploration to learn near-optimal behavior in CDPs with low Bellman Rank. The algorithm requires a number of samples that is polynomial in all relevant parameters but independent of the number of unique contexts. Our approach uses Bellman error minimization with optimistic exploration and provides new insights into efficient exploration for RL with function approximation.

NeurIPS Conference 2017 Conference Paper

Off-policy evaluation for slate recommendation

  • Adith Swaminathan
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • Miro Dudik
  • John Langford
  • Damien Jose
  • Imed Zitouni

This paper studies the evaluation of policies that recommend an ordered set of items (e.g., a ranking) based on some context---a common scenario in web search, ads, and recommendation. We build on techniques from combinatorial bandits to introduce a new practical estimator that uses logged data to estimate a policy's performance. A thorough empirical evaluation on real-world data reveals that our estimator is accurate in a variety of settings, including as a subroutine in a learning-to-rank task, where it achieves competitive performance. We derive conditions under which our estimator is unbiased---these conditions are weaker than prior heuristics for slate evaluation---and experimentally demonstrate a smaller bias than parametric approaches, even when these conditions are violated. Finally, our theory and experiments also show exponential savings in the amount of required data compared with general unbiased estimators.
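
A minimal sketch of the pseudoinverse-style slate estimator in a context-free toy, under an additive-reward assumption (uniform logging over slates, linear slate rewards; all sizes and the data-generating process are our own choices, not the paper's setup):

    import numpy as np

    rng = np.random.default_rng(0)
    L_slots, m = 3, 5                       # slots per slate, items per slot
    d = L_slots * m

    def onehot(slate):                      # slate -> indicator over (slot, item)
        v = np.zeros(d)
        for slot, item in enumerate(slate):
            v[slot * m + item] = 1.0
        return v

    phi = rng.normal(size=d) * 0.5          # hidden additive per-(slot,item) utilities
    n = 20000
    X = np.stack([onehot(tuple(rng.integers(m, size=L_slots))) for _ in range(n)])
    R = X @ phi + rng.normal(scale=0.1, size=n)   # logged slate rewards

    Gamma = X.T @ X / n                     # empirical E_mu[1_s 1_s^T] (rank-deficient,
                                            # hence the pseudoinverse below)
    target = onehot(tuple(int(np.argmax(phi[s*m:(s+1)*m])) for s in range(L_slots)))
    V_hat = np.mean(R * (X @ np.linalg.pinv(Gamma) @ target))
    print(V_hat, target @ phi)              # estimate vs. true value of the target slate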

ICML Conference 2017 Conference Paper

Optimal and Adaptive Off-policy Evaluation in Contextual Bandits

  • Yu-Xiang Wang 0003
  • Alekh Agarwal
  • Miroslav Dudík

We study the off-policy evaluation problem—estimating the value of a target policy using data collected by another policy—under the contextual bandit model. We consider the general (agnostic) setting without access to a consistent model of rewards and establish a minimax lower bound on the mean squared error (MSE). The bound is matched up to constants by the inverse propensity scoring (IPS) and doubly robust (DR) estimators. This highlights the difficulty of the agnostic contextual setting, in contrast with multi-armed bandits and contextual bandits with access to a consistent reward model, where IPS is suboptimal. We then propose the SWITCH estimator, which can use an existing reward model (not necessarily consistent) to achieve a better bias-variance tradeoff than IPS and DR. We prove an upper bound on its MSE and demonstrate its benefits empirically on a diverse collection of datasets, often outperforming prior work by orders of magnitude.
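
The thresholding idea behind SWITCH can be sketched in a few lines (a sketch of the switching rule only, with our own argument layout, not the paper's exact tuning): use importance weighting wherever the weight $\pi/\mu$ is small, and fall back on the reward model elsewhere.

    import numpy as np

    def switch_estimate(r, w_logged, w_all, pi_all, rhat_all, tau):
        """r, w_logged: (n,) rewards and importance weights for the logged actions.
        w_all, pi_all, rhat_all: (n, K) weights, target-policy probabilities, and
        reward-model predictions for every action. tau: switching threshold."""
        ips_part = r * w_logged * (w_logged <= tau)                 # low-weight actions
        model_part = np.sum(pi_all * rhat_all * (w_all > tau), axis=1)  # the rest
        return float(np.mean(ips_part + model_part))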

NeurIPS Conference 2016 Conference Paper

Contextual semibandits via supervised learning oracles

  • Akshay Krishnamurthy
  • Alekh Agarwal
  • Miro Dudik

We study an online decision making problem where on each round a learner chooses a list of items based on some side information, receives a scalar feedback value for each individual item, and a reward that is linearly related to this feedback. These problems, known as contextual semibandits, arise in crowdsourcing, recommendation, and many other domains. This paper reduces contextual semibandits to supervised learning, allowing us to leverage powerful supervised learning methods in this partial-feedback setting. Our first reduction applies when the mapping from feedback to reward is known and leads to a computationally efficient algorithm with near-optimal regret. We show that this algorithm outperforms state-of-the-art approaches on real-world learning-to-rank datasets, demonstrating the advantage of oracle-based algorithms. Our second reduction applies to the previously unstudied setting when the linear mapping from feedback to reward is unknown. Our regret guarantees are superior to prior techniques that ignore the feedback.

NeurIPS Conference 2016 Conference Paper

Efficient Second Order Online Learning by Sketching

  • Haipeng Luo
  • Alekh Agarwal
  • Nicolò Cesa-Bianchi
  • John Langford

We propose Sketched Online Newton (SON), an online second-order learning algorithm that enjoys substantially improved regret guarantees for ill-conditioned data. SON is an enhanced version of the Online Newton Step which, via sketching techniques, enjoys a running time linear in the dimension and sketch size. We further develop sparse forms of the sketching methods (such as Oja's rule), making the computation linear in the sparsity of features. Together, these improvements eliminate all computational obstacles of previous second-order online learning approaches.
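
For reference, here is the full-matrix Online Newton Step that SON accelerates; SON replaces the $d \times d$ matrix below with a low-rank sketch so each update costs time proportional to the sketch size rather than $d^2$. The constants and the unprojected update are simplifying assumptions of this sketch.

    import numpy as np

    def online_newton_step(grad_stream, d, gamma=1.0, eps=1.0):
        """Yield iterates of a basic (unprojected) Online Newton Step."""
        A = eps * np.eye(d)
        w = np.zeros(d)
        for g in grad_stream:
            A += np.outer(g, g)                  # accumulate second-order statistics
            w = w - np.linalg.solve(A, g) / gamma
            yield w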

NeurIPS Conference 2016 Conference Paper

PAC Reinforcement Learning with Rich Observations

  • Akshay Krishnamurthy
  • Alekh Agarwal
  • John Langford

We propose and study a new model for reinforcement learning with rich observations, generalizing contextual bandits to sequential decision making. These models require an agent to take actions based on observations (features) with the goal of achieving long-term performance competitive with a large set of policies. To avoid barriers to sample-efficient learning associated with large observation spaces and general POMDPs, we focus on problems that can be summarized by a small number of hidden states and have long-term rewards that are predictable by a reactive function class. In this setting, we design and analyze a new reinforcement learning algorithm, Least Squares Value Elimination by Exploration. We prove that the algorithm learns near optimal behavior after a number of episodes that is polynomial in all relevant parameters, logarithmic in the number of policies, and independent of the size of the observation space. Our result provides theoretical justification for reinforcement learning with function approximation.

ICML Conference 2015 Conference Paper

A Lower Bound for the Optimization of Finite Sums

  • Alekh Agarwal
  • Léon Bottou

This paper presents a lower bound for optimizing a finite sum of $n$ functions, where each function is $L$-smooth and the sum is $\mu$-strongly convex. We show that no algorithm can reach an error $\varepsilon$ in minimizing all functions from this class in fewer than $\Omega(n + \sqrt{n(\kappa-1)}\log(1/\varepsilon))$ iterations, where $\kappa = L/\mu$ is a surrogate condition number. We then compare this lower bound to upper bounds for recently developed methods specializing to this setting. When the functions involved in this sum are not arbitrary, but based on i.i.d. random data, we further contrast these complexity results with those for optimal first-order methods that directly optimize the sum. The conclusion we draw is that a lot of caution is necessary for an accurate comparison, and we identify machine learning scenarios where the new methods help computationally.
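
To make the bound concrete with illustrative numbers of our own (not from the paper): with $n = 10^4$ component functions and condition number $\kappa = 101$, the second term is $\sqrt{n(\kappa-1)}\,\log(1/\varepsilon) = \sqrt{10^4 \cdot 100}\,\log(1/\varepsilon) = 10^3 \log(1/\varepsilon)$, so any method in this class needs $\Omega(10^4 + 10^3\log(1/\varepsilon))$ iterations; the $n$ term dominates until the target accuracy $\varepsilon$ is extremely small.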

NeurIPS Conference 2015 Conference Paper

Efficient and Parsimonious Agnostic Active Learning

  • Tzu-Kuo Huang
  • Alekh Agarwal
  • Daniel Hsu
  • John Langford
  • Robert Schapire

We develop a new active learning algorithm for the streaming setting satisfying three important properties: 1) It provably works for any classifier representation and classification problem, including those with severe noise. 2) It is efficiently implementable with an ERM oracle. 3) It is more aggressive than all previous approaches satisfying 1 and 2. To do this, we create an algorithm based on a newly defined optimization problem and analyze it. We also conduct the first experimental analysis of all efficient agnostic active learning algorithms, evaluating their strengths and weaknesses in different settings.

NeurIPS Conference 2015 Conference Paper

Fast Convergence of Regularized Learning in Games

  • Vasilis Syrgkanis
  • Alekh Agarwal
  • Haipeng Luo
  • Robert Schapire

We show that natural classes of regularized learning algorithms with a form of recency bias achieve faster convergence rates to approximate efficiency and to coarse correlated equilibria in multiplayer normal form games. When each player in a game uses an algorithm from our class, their individual regret decays at $O(T^{-3/4})$, while the sum of utilities converges to an approximate optimum at $O(T^{-1})$--an improvement upon the worst case $O(T^{-1/2})$ rates. We show a black-box reduction for any algorithm in the class to achieve $\tilde{O}(T^{-1/2})$ rates against an adversary, while maintaining the faster rates against algorithms in the class. Our results extend those of Rakhlin and Sridharan~\cite{Rakhlin2013} and Daskalakis et al.~\cite{Daskalakis2014}, who only analyzed two-player zero-sum games for specific algorithms.
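
One member of the class is Optimistic Hedge, whose recency bias amounts to treating the most recent loss vector as a prediction of the next one. A minimal sketch (the step size and loop shape are illustrative choices of ours):

    import numpy as np

    def optimistic_hedge(loss_stream, K, eta=0.1):
        """Yield the weight vector played at each round over K actions."""
        L = np.zeros(K)            # cumulative losses
        m = np.zeros(K)            # predicted upcoming loss (here: the last loss)
        for loss in loss_stream:
            w = np.exp(-eta * (L + m))
            w /= w.sum()
            yield w                # play w, then observe loss
            L += loss
            m = loss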

ICML Conference 2015 Conference Paper

Learning to Search Better than Your Teacher

  • Kai-Wei Chang 0001
  • Akshay Krishnamurthy
  • Alekh Agarwal
  • Hal Daumé III
  • John Langford 0001

Methods for learning to search for structured prediction typically imitate a reference policy, with existing theoretical guarantees demonstrating low regret compared to that reference. This is unsatisfactory in many applications where the reference policy is suboptimal and the goal of learning is to improve upon it. Can learning to search work even when the reference is poor? We provide a new learning to search algorithm, LOLS, which does well relative to the reference policy, but additionally guarantees low regret compared to deviations from the learned policy: a local-optimality guarantee. Consequently, LOLS can improve upon the reference policy, unlike previous algorithms. This enables us to develop structured contextual bandits, a partial information structured prediction setting with many potential applications.
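
A structural sketch of one LOLS data-collection episode: roll in with the learned policy, deviate once at a random step, and roll out with the reference policy with probability beta, otherwise with the learned policy; the resulting per-action cost vector is one cost-sensitive example. The `env` interface (reset/step/actions) and callable policies below are our own assumed scaffolding, not the paper's code.

    import random

    def lols_collect(env, learned, reference, beta, horizon, rng=random):
        t_dev = rng.randrange(horizon)
        state = env.reset()
        for _ in range(t_dev):                          # roll-in with learned policy
            state, _ = env.step(state, learned(state))
        costs = {}
        for a in env.actions(state):                    # try each one-step deviation
            s, c = env.step(state, a)
            rollout = reference if rng.random() < beta else learned
            for _ in range(t_dev + 1, horizon):         # roll-out to the horizon
                s, ci = env.step(s, rollout(s))
                c += ci
            costs[a] = c
        return state, costs                             # one cost-sensitive example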

JMLR Journal 2014 Journal Article

A Reliable Effective Terascale Linear Learning System

  • Alekh Agarwal
  • Olivier Chapelle
  • Miroslav Dudík
  • John Langford

We present a system and a set of techniques for learning linear predictors with convex losses on terascale data sets, with trillions of features (the number of features here refers to the number of non-zero entries in the data matrix), billions of training examples, and millions of parameters in an hour using a cluster of 1000 machines. Individually none of the component techniques are new, but the careful synthesis required to obtain an efficient implementation is. The result is, to our knowledge, the most scalable and efficient linear learning system reported in the literature. (All the empirical evaluation reported in this work was carried out between May and October 2011.) We describe and thoroughly evaluate the components of the system, showing the importance of the various design choices.

ICML Conference 2014 Conference Paper

Least Squares Revisited: Scalable Approaches for Multi-class Prediction

  • Alekh Agarwal
  • Sham M. Kakade
  • Nikos Karampatziakis
  • Le Song
  • Gregory Valiant

This work provides simple algorithms for multi-class (and multi-label) prediction in settings where both the number of examples n and the data dimension d are relatively large. These robust and parameter free algorithms are essentially iterative least-squares updates and very versatile both in theory and in practice. On the theoretical front, we present several variants with convergence guarantees. Owing to their effective use of second-order structure, these algorithms are substantially better than first-order methods in many practical scenarios. On the empirical side, we show how to scale our approach to high dimensional datasets, achieving dramatic computational speedups over popular optimization packages such as Liblinear and Vowpal Wabbit on standard datasets (MNIST and CIFAR-10), while attaining state-of-the-art accuracies.
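
The one-shot core of the approach can be sketched as regressing one-hot labels on the features and predicting the class with the highest score; the paper's algorithms iterate second-order refinements of this step, which we omit here.

    import numpy as np

    def least_squares_multiclass(X, y, num_classes, lam=1e-3):
        """Return a (d, K) weight matrix from ridge regression on one-hot labels."""
        Y = np.eye(num_classes)[y]                        # one-hot targets
        d = X.shape[1]
        return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

    # usage: W = least_squares_multiclass(X_train, y_train, K)
    #        predictions = np.argmax(X_test @ W, axis=1)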

NeurIPS Conference 2014 Conference Paper

Scalable Non-linear Learning with Adaptive Polynomial Expansions

  • Alekh Agarwal
  • Alina Beygelzimer
  • Daniel Hsu
  • John Langford
  • Matus Telgarsky

Can we effectively learn a nonlinear representation in time comparable to linear learning? We describe a new algorithm that explicitly and adaptively expands higher-order interaction features over base linear representations. The algorithm is designed for extreme computational efficiency, and an extensive experimental study shows that its computation/prediction tradeoff ability compares very favorably against strong baselines.
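
A sketch of the adaptive-expansion idea: append pairwise products of the base features whose learned linear weights are largest in magnitude, then retrain on the wider representation. The selection rule and the parameter k are illustrative, not the paper's exact criterion.

    import numpy as np
    from itertools import combinations

    def expand_adaptively(X, w, k=3):
        """X: (n, d) base features; w: (d,) learned linear weights."""
        top = np.argsort(-np.abs(w))[:k]                 # most influential base features
        products = [(X[:, i] * X[:, j])[:, None] for i, j in combinations(top, 2)]
        return np.hstack([X] + products)                 # base + interaction features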

ICML Conference 2014 Conference Paper

Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits

  • Alekh Agarwal
  • Daniel J. Hsu
  • Satyen Kale
  • John Langford 0001
  • Lihong Li 0001
  • Robert E. Schapire

We present a new algorithm for the contextual bandit learning problem, where the learner repeatedly takes one of $K$ actions in response to the observed context, and observes the reward only for that action. Our method assumes access to an oracle for solving fully supervised cost-sensitive classification problems and achieves the statistically optimal regret guarantee with only $\tilde{O}(\sqrt{KT})$ oracle calls across all $T$ rounds. By doing so, we obtain the most practical contextual bandit learning algorithm amongst approaches that work for general policy classes. We conduct a proof-of-concept experiment which demonstrates the excellent computational and statistical performance of (an online variant of) our algorithm relative to several strong baselines.
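
The reduction step that makes the oracle usable can be sketched as follows: logged (action, reward, propensity) triples become full cost-sensitive examples via inverse propensity scoring, which the supervised oracle then optimizes over the policy class. This shows only that sub-step; the paper's algorithm additionally maintains a carefully chosen distribution over policies.

    import numpy as np

    def ips_cost_matrix(actions, rewards, probs, K):
        """Unbiased cost-sensitive examples from bandit logs: C[i, a] estimates
        the (negated) expected reward of action a on example i."""
        n = len(actions)
        C = np.zeros((n, K))
        C[np.arange(n), actions] = -np.asarray(rewards) / np.asarray(probs)
        return C      # the oracle returns argmin_pi sum_i C[i, pi(x_i)]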

ICML Conference 2013 Conference Paper

Selective sampling algorithms for cost-sensitive multiclass prediction

  • Alekh Agarwal

In this paper, we study the problem of active learning for cost-sensitive multiclass classification. We propose selective sampling algorithms, which process the data in a streaming fashion, querying only a subset of the labels. For these algorithms, we analyze the regret and label complexity when the labels are generated according to a generalized linear model. We establish that the gains of active learning over passive learning can range from none to exponentially large, based on a natural notion of margin. We also present a safety guarantee to guard against model mismatch. Numerical simulations show that our algorithms indeed obtain a low regret with a small number of queries.
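
The margin-based query decision can be sketched as below: request the label when the gap between the two smallest predicted costs is within a threshold that shrinks over rounds. The specific threshold schedule is an illustrative choice of ours, not the paper's exact rule.

    import numpy as np

    def should_query(predicted_costs, t, c=1.0):
        """predicted_costs: per-label cost predictions; t: round index >= 1."""
        best, second = np.partition(np.asarray(predicted_costs), 1)[:2]
        return (second - best) <= c / np.sqrt(t)     # small margin -> query the label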

NeurIPS Conference 2012 Conference Paper

Stochastic optimization and sparse statistical recovery: Optimal algorithms for high dimensions

  • Alekh Agarwal
  • Sahand Negahban
  • Martin Wainwright

We develop and analyze stochastic optimization algorithms for problems in which the expected loss is strongly convex, and the optimum is (approximately) sparse. Previous approaches are able to exploit only one of these two structures, yielding a $O(d/T)$ convergence rate for strongly convex objectives in $d$ dimensions and a $O(\sqrt{s(\log d)/T})$ convergence rate when the optimum is $s$-sparse. Our algorithm is based on successively solving a series of $\ell_1$-regularized optimization problems using Nesterov's dual averaging algorithm. We establish that the error of our solution after $T$ iterations is at most $O(s(\log d)/T)$, with natural extensions to approximate sparsity. Our results apply to locally Lipschitz losses including the logistic, exponential, hinge and least-squares losses. By recourse to statistical minimax results, we show that our convergence rates are optimal up to constants. The effectiveness of our approach is also confirmed in numerical simulations where we compare to several baselines on a least-squares regression problem.
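
The inner $\ell_1$-regularized dual-averaging subproblem admits a closed-form update: soft-threshold the running gradient average. Below is a simplified single-epoch loop in the spirit of that subroutine (essentially Xiao's RDA update, not the paper's multi-epoch scheme with its annealed regularization).

    import numpy as np

    def rda_l1(grad_stream, d, lam=0.1, gamma=1.0):
        """Yield iterates of l1-regularized dual averaging."""
        gbar = np.zeros(d)
        for t, g in enumerate(grad_stream, start=1):
            gbar += (g - gbar) / t                       # running gradient average
            shrunk = np.sign(gbar) * np.maximum(np.abs(gbar) - lam, 0.0)
            yield -(np.sqrt(t) / gamma) * shrunk         # current (sparse) iterate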

NeurIPS Conference 2011 Conference Paper

Distributed Delayed Stochastic Optimization

  • Alekh Agarwal
  • John Duchi

We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show $n$-node architectures whose optimization error in stochastic problems---in spite of asynchronous delays---scales asymptotically as $O(1/\sqrt{nT})$, which is known to be optimal even in the absence of delays.
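
A serial simulation of the master/worker pattern makes the delayed-update structure concrete: each step applies a gradient computed at parameters that are `delay` rounds stale. The step-size schedule and the fixed delay are illustrative assumptions of this sketch.

    import numpy as np
    from collections import deque

    def delayed_sgd(grad_fn, w0, T, delay, step=lambda t: 0.1 / np.sqrt(t)):
        """Yield iterates of SGD driven by gradients at stale parameters."""
        w = np.array(w0, dtype=float)
        history = deque([w.copy()] * (delay + 1), maxlen=delay + 1)
        for t in range(1, T + 1):
            g = grad_fn(history[0])      # gradient at the iterate from `delay` steps ago
            w = w - step(t) * g
            history.append(w.copy())
            yield w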

UAI Conference 2011 Conference Paper

Learning with Missing Features

  • Afshin Rostamizadeh
  • Alekh Agarwal
  • Peter L. Bartlett

We introduce new online and batch algorithms that are robust to data with missing features, a situation that arises in many practical applications. In the online setup, we allow for the comparison hypothesis to change as a function of the subset of features that is observed on any given round, extending the standard setting where the comparison hypothesis is fixed throughout. In the batch setup, we present a convex relaxation of a non-convex problem to jointly estimate an imputation function, used to fill in the values of missing features, along with the classification hypothesis. We prove regret bounds in the online setting and Rademacher complexity bounds for the batch i.i.d. setting. The algorithms are tested on several UCI datasets, showing superior performance over baseline imputation methods.

NeurIPS Conference 2011 Conference Paper

Stochastic convex optimization with bandit feedback

  • Alekh Agarwal
  • Dean Foster
  • Daniel Hsu
  • Sham Kakade
  • Alexander Rakhlin

This paper addresses the problem of minimizing a convex, Lipschitz function $f$ over a convex, compact set $X$ under a stochastic bandit feedback model. In this model, the algorithm is allowed to observe noisy realizations of the function value $f(x)$ at any query point $x \in X$. We demonstrate a generalization of the ellipsoid algorithm that incurs $O(\mathrm{poly}(d)\sqrt{T})$ regret. Since any algorithm has regret at least $\Omega(\sqrt{T})$ on this problem, our algorithm is optimal in terms of the scaling with $T$.

NeurIPS Conference 2010 Conference Paper

Distributed Dual Averaging In Networks

  • Alekh Agarwal
  • Martin Wainwright
  • John Duchi

The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and we provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks.
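
The per-round update can be sketched directly: mix the neighbors' dual variables through a doubly stochastic matrix P, add the local subgradient, and take a prox step (an unconstrained quadratic prox here, so $x_i = -z_i/\beta_t$ with $\beta_t \propto \sqrt{t}$). The interface `grad_fn(i, x)` and the schedules are assumptions of this sketch.

    import numpy as np

    def distributed_dual_averaging(P, grad_fn, d, T, gamma=1.0):
        """P: (n, n) doubly stochastic mixing matrix encoding the network;
        grad_fn(i, x): subgradient of node i's local function at x."""
        n = P.shape[0]
        Z = np.zeros((n, d))                 # dual variables, one row per node
        X = np.zeros((n, d))                 # primal iterates
        for t in range(1, T + 1):
            G = np.stack([grad_fn(i, X[i]) for i in range(n)])
            Z = P @ Z + G                    # mix neighbors' duals, add local subgradient
            X = -Z / (gamma * np.sqrt(t))    # prox step
            yield X.mean(axis=0)             # network-average iterate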

NeurIPS Conference 2010 Conference Paper

Fast global convergence rates of gradient methods for high-dimensional statistical recovery

  • Alekh Agarwal
  • Sahand Negahban
  • Martin Wainwright

Many statistical $M$-estimators are based on convex optimization problems formed by the weighted sum of a loss function with a norm-based regularizer. We analyze the convergence rates of first-order gradient methods for solving such problems within a high-dimensional framework that allows the data dimension $d$ to grow with (and possibly exceed) the sample size $n$. This high-dimensional structure precludes the usual global assumptions---namely, strong convexity and smoothness conditions---that underlie classical optimization analysis. We define appropriately restricted versions of these conditions, and show that they are satisfied with high probability for various statistical models. Under these conditions, our theory guarantees that Nesterov's first-order method~\cite{Nesterov07} has a globally geometric rate of convergence up to the statistical precision of the model, meaning the typical Euclidean distance between the true unknown parameter $\theta^*$ and the optimal solution $\widehat{\theta}$. This globally linear rate is substantially faster than previous analyses of global convergence for specific methods that yielded only sublinear rates. Our analysis applies to a wide range of $M$-estimators and statistical models, including sparse linear regression using Lasso ($\ell_1$-regularized regression), group Lasso, block sparsity, and low-rank matrix recovery using nuclear norm regularization. Overall, this result reveals an interesting connection between statistical precision and computational efficiency in high-dimensional estimation.
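
Composite (proximal) gradient descent for the Lasso is the kind of method the analysis covers: under the restricted strong convexity and smoothness conditions, the optimization error contracts geometrically until it reaches the statistical precision. A minimal sketch (step count and step size are illustrative):

    import numpy as np

    def ista_lasso(X, y, lam, steps=500):
        """Proximal gradient for (1/2n)||Xw - y||^2 + lam * ||w||_1."""
        n, d = X.shape
        step = n / (np.linalg.norm(X, 2) ** 2)               # 1/L for the squared loss
        w = np.zeros(d)
        for _ in range(steps):
            z = w - step * (X.T @ (X @ w - y) / n)           # gradient step
            w = np.sign(z) * np.maximum(np.abs(z) - lam * step, 0.0)  # soft-threshold
        return w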

JMLR Journal 2010 Journal Article

Message-passing for Graph-structured Linear Programs: Proximal Methods and Rounding Schemes

  • Pradeep Ravikumar
  • Alekh Agarwal
  • Martin J. Wainwright

The problem of computing a maximum a posteriori (MAP) configuration is a central computational challenge associated with Markov random fields. There has been some focus on "tree-based" linear programming (LP) relaxations for the MAP problem. This paper develops a family of super-linearly convergent algorithms for solving these LPs, based on proximal minimization schemes using Bregman divergences. As with standard message-passing on graphs, the algorithms are distributed and exploit the underlying graphical structure, and so scale well to large problems. Our algorithms have a double-loop character, with the outer loop corresponding to the proximal sequence, and an inner loop of cyclic Bregman projections used to compute each proximal update. We establish convergence guarantees for our algorithms, and illustrate their performance via some simulations. We also develop two classes of rounding schemes, deterministic and randomized, for obtaining integral configurations from the LP solutions. Our deterministic rounding schemes use a "re-parameterization" property of our algorithms so that when the LP solution is integral, the MAP solution can be obtained even before the LP-solver converges to the optimum. We also propose graph-structured randomized rounding schemes applicable to iterative LP-solving algorithms in general. We analyze the performance of and report simulations comparing these rounding schemes.

NeurIPS Conference 2009 Conference Paper

Information-theoretic lower bounds on the oracle complexity of convex optimization

  • Alekh Agarwal
  • Martin Wainwright
  • Peter Bartlett
  • Pradeep Ravikumar

Despite the large amount of literature on upper bounds on the complexity of convex optimization, surprisingly little is known about the fundamental hardness of these problems. The extensive use of convex optimization in machine learning and statistics makes such an understanding critical for characterizing the fundamental computational limits of learning and estimation. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for some function classes. We also discuss the implications of these results for understanding the inherent complexity of large-scale learning and estimation problems.

NeurIPS Conference 2007 Conference Paper

An Analysis of Inference with the Universum

  • Olivier Chapelle
  • Alekh Agarwal
  • Fabian Sinz
  • Bernhard Schölkopf

We study a pattern classification algorithm which has recently been proposed by Vapnik and coworkers. It builds on a new inductive principle which assumes that in addition to positive and negative data, a third class of data is available, termed the Universum. We assay the behavior of the algorithm by establishing links with Fisher discriminant analysis and oriented PCA, as well as with an SVM in a projected subspace (or, equivalently, with a data-dependent reduced kernel). We also provide experimental results.

ICML Conference 2007 Conference Paper

Learning random walks to rank nodes in graphs

  • Alekh Agarwal
  • Soumen Chakrabarti

Ranking nodes in graphs has attracted much recent interest. Edges, via the graph Laplacian, are used to encourage local smoothness of node scores in SVM-like formulations with generalization guarantees. In contrast, Pagerank variants are based on Markovian random walks. For directed graphs, there is no simple known correspondence between these views of scoring/ranking. Recent scalable algorithms for learning the Pagerank transition probabilities do not have generalization guarantees. In this paper we show some correspondence results between the Laplacian and the Pagerank approaches, and give new generalization guarantees for the latter. We enhance the Pagerank-learning approaches to use an additive margin. We also propose a general framework for rank-sensitive score-learning, and apply it to Laplacian smoothing. Experimental results are promising.
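
For reference, the scores being learned are fixed points of a random-walk recursion; in the learning setting the paper studies, the transition matrix's entries are parameterized (e.g. through edge features) and fit to preference data. A minimal power-iteration sketch with the transition matrix taken as given:

    import numpy as np

    def pagerank(P, alpha=0.85, iters=100):
        """P: (n, n) row-stochastic transition matrix; returns node scores."""
        n = P.shape[0]
        pi = np.full(n, 1.0 / n)
        for _ in range(iters):
            pi = alpha * (pi @ P) + (1.0 - alpha) / n   # walk step + uniform restart
        return pi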