Arrow Research search

Author name cluster

Peter Bartlett

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

48 papers
1 author row

Possible papers

48

JMLR Journal 2025 Journal Article

Contextual Bandits with Stage-wise Constraints

  • Aldo Pacchiano
  • Mohammad Ghavamzadeh
  • Peter Bartlett

We study contextual bandits in the presence of a stage-wise constraint when the constraint must be satisfied both with high probability and in expectation. We start with the linear case where both the reward function and the stage-wise constraint (cost function) are linear. In each of the high probability and in expectation settings, we propose an upper-confidence bound algorithm for the problem and prove a $T$-round regret bound for it. We also prove a lower-bound for this constrained problem, show how our algorithms and analyses can be extended to multiple constraints, and provide simulations to validate our theoretical results. In the high probability setting, we describe the minimum requirements for the action set for our algorithm to be tractable. In the setting that the constraint is in expectation, we specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting with regret analysis. Finally, we extend our results to the case where the reward and cost functions are both non-linear. We propose an algorithm for this case and prove a regret bound for it that characterize the function class complexity by the eluder dimension. [abs] [ pdf ][ bib ] &copy JMLR 2025. ( edit, beta )

NeurIPS Conference 2025 Conference Paper

Improved Scaling Laws in Linear Regression via Data Reuse

  • Licong Lin
  • Jingfeng Wu
  • Peter Bartlett

Neural scaling laws suggest that the test error of large language models trained online decreases polynomially as the model size and data size increase. However, such scaling can be unsustainable when running out of new data. In this work, we show that data reuse can improve existing scaling laws in linear regression. Specifically, we derive sharp test error bounds on $M$-dimensional linear models trained by multi-pass *stochastic gradient descent* (multi-pass SGD) on $N$ data with sketched features. Assuming that the data covariance has a power-law spectrum of degree $a$, and that the true parameter follows a prior with an aligned power-law spectrum of degree $b-a$ (with $a > b > 1$), we show that multi-pass SGD achieves a test error of $\Theta(M^{1-b} + L^{(1-b)/a})$, where $L \lesssim N^{a/b}$ is the number of iterations. In the same setting, one-pass SGD only attains a test error of $\Theta(M^{1-b} + N^{(1-b)/a})$ (see, e. g. , Lin et al. , 2024). This suggests an improved scaling law via data reuse (i. e. , choosing $L>N$) in data-constrained regimes. Numerical simulations are also provided to verify our theoretical findings.

NeurIPS Conference 2025 Conference Paper

Large Stepsizes Accelerate Gradient Descent for Regularized Logistic Regression

  • Jingfeng Wu
  • Pierre Marion
  • Peter Bartlett

We study *gradient descent* (GD) with a constant stepsize for $\ell_2$-regularized logistic regression with linearly separable data. Classical theory suggests small stepsizes to ensure monotonic reduction of the optimization objective, achieving exponential convergence in $\widetilde{\mathcal{O}}(\kappa)$ steps with $\kappa$ being the condition number. Surprisingly, we show that this can be *accelerated* to $\widetilde{\mathcal{O}}(\sqrt{\kappa})$ by simply using a large stepsize---for which the objective evolves *nonmonotonically*. The acceleration brought by large stepsizes extends to minimizing the population risk for separable distributions, improving on the best-known upper bounds on the number of steps to reach a near-optimum. Finally, we characterize the largest stepsize for the local convergence of GD, which also determines the global convergence in special scenarios. Our results extend the analysis of Wu et al. (2024) from convex settings with minimizers at infinity to strongly convex cases with finite minimizers.

NeurIPS Conference 2024 Conference Paper

Fast Best-of-N Decoding via Speculative Rejection

  • Hanshi Sun
  • Momin Haider
  • Ruiqi Zhang
  • Huitao Yang
  • Jiahao Qiu
  • Ming Yin
  • Mengdi Wang
  • Peter Bartlett

The safe and effective deployment of Large Language Models (LLMs) involves a critical step called alignment, which ensures that the model's responses are in accordance with human preferences. Prevalent alignment techniques, such as DPO, PPO and their variants, align LLMs by changing the pre-trained model weights during a phase called post-training. While predominant, these post-training methods add substantial complexity before LLMs can be deployed. Inference-time alignment methods avoid the complex post-training step and instead bias the generation towards responses that are aligned with human preferences. The best-known inference-time alignment method, called Best-of-N, is as effective as the state-of-the-art post-training procedures. Unfortunately, Best-of-N requires vastly more resources at inference time than standard decoding strategies, which makes it computationally not viable. In this work, we introduce Speculative Rejection, a computationally-viable inference-time alignment algorithm. It generates high-scoring responses according to a given reward model, like Best-of-N does, while being between 16 to 32 times more computationally efficient.

NeurIPS Conference 2024 Conference Paper

In-Context Learning of a Linear Transformer Block: Benefits of the MLP Component and One-Step GD Initialization

  • Ruiqi Zhang
  • Jingfeng Wu
  • Peter Bartlett

We study the \emph{in-context learning} (ICL) ability of a \emph{Linear Transformer Block} (LTB) that combines a linear attention component and a linear multi-layer perceptron (MLP) component. For ICL of linear regression with a Gaussian prior and a \emph{non-zero mean}, we show that LTB can achieve nearly Bayes optimal ICL risk. In contrast, using only linear attention must incur an irreducible additive approximation error. Furthermore, we establish a correspondence between LTB and one-step gradient descent estimators with learnable initialization ($\mathsf{GD}-\beta$), in the sense that every $\mathsf{GD}-\beta$ estimator can be implemented by an LTB estimator and every optimal LTB estimator that minimizes the in-class ICL risk is effectively a $\mathsf{GD}-\beta$ estimator. Finally, we show that $\mathsf{GD}-\beta$ estimators can be efficiently optimized with gradient flow, despite a non-convex training objective. Our results reveal that LTB achieves ICL by implementing $\mathsf{GD}-\beta$, and they highlight the role of MLP layers in reducing approximation error.

NeurIPS Conference 2024 Conference Paper

Large Stepsize Gradient Descent for Non-Homogeneous Two-Layer Networks: Margin Improvement and Fast Optimization

  • Yuhang Cai
  • Jingfeng Wu
  • Song Mei
  • Michael Lindsey
  • Peter Bartlett

The typical training of neural networks using large stepsize gradient descent (GD) under the logistic loss often involves two distinct phases, where the empirical risk oscillates in the first phase but decreases monotonically in the second phase. We investigate this phenomenon in two-layer networks that satisfy a near-homogeneity condition. We show that the second phase begins once the empirical risk falls below a certain threshold, dependent on the stepsize. Additionally, we show that the normalized margin grows nearly monotonically in the second phase, demonstrating an implicit bias of GD in training non-homogeneous predictors. If the dataset is linearly separable and the derivative of the activation function is bounded away from zero, we show that the average empirical risk decreases, implying that the first phase must stop in finite steps. Finally, we demonstrate that by choosing a suitably large stepsize, GD that undergoes this phase transition is more efficient than GD that monotonically decreases the risk. Our analysis applies to networks of any width, beyond the well-known neural tangent kernel and mean-field regimes.

JMLR Journal 2023 Journal Article

A Complete Characterization of Linear Estimators for Offline Policy Evaluation

  • Juan C. Perdomo
  • Akshay Krishnamurthy
  • Peter Bartlett
  • Sham Kakade

Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation, unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of offline policy evaluation. [abs] [ pdf ][ bib ] &copy JMLR 2023. ( edit, beta )

NeurIPS Conference 2023 Conference Paper

The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks

  • Spencer Frei
  • Gal Vardi
  • Peter Bartlett
  • Nati Srebro

In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are vulnerable to adversarial examples. Our results hold even in cases where the network is highly overparameterized. Despite the potential for harmful overfitting in such settings, we prove that the implicit bias of gradient flow prevents it. However, the implicit bias also leads to non-robust solutions (susceptible to small adversarial $\ell_2$-perturbations), even though robust networks that fit the data exist.

NeurIPS Conference 2021 Conference Paper

Adversarial Examples in Multi-Layer Random ReLU Networks

  • Peter Bartlett
  • Sebastien Bubeck
  • Yeshwanth Cherapanamjeri

We consider the phenomenon of adversarial examples in ReLU networks with independent Gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al (2021) for two-layer networks. Our proof shows that adversarial examples arise in these networks because the functions they compute are \emph{locally} very similar to random linear functions. Bottleneck layers play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.

NeurIPS Conference 2021 Conference Paper

Near Optimal Policy Optimization via REPS

  • Aldo Pacchiano
  • Jonathan N Lee
  • Peter Bartlett
  • Ofir Nachum

Since its introduction a decade ago, relative entropy policy search (REPS) has demonstrated successful policy learning on a number of simulated and real-world robotic domains, not to mention providing algorithmic components used by many recently proposed reinforcement learning (RL) algorithms. While REPS is commonly known in the community, there exist no guarantees on its performance when using stochastic and gradient-based solvers. In this paper we aim to fill this gap by providing guarantees and convergence rates for the sub-optimality of a policy learned using first-order optimization methods applied to the REPS objective. We first consider the setting in which we are given access to exact gradients and demonstrate how near-optimality of the objective translates to near-optimality of the policy. We then consider the practical setting of stochastic gradients, and introduce a technique that uses generative access to the underlying Markov decision process to compute parameter updates that maintain favorable convergence to the optimal regularized policy.

NeurIPS Conference 2021 Conference Paper

On the Theory of Reinforcement Learning with Once-per-Episode Feedback

  • Niladri Chatterji
  • Aldo Pacchiano
  • Peter Bartlett
  • Michael Jordan

We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either good'' or bad, '' but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret.

IJCAI Conference 2020 Conference Paper

Greedy Convex Ensemble

  • Thanh Tan Nguyen
  • Nan Ye
  • Peter Bartlett

We consider learning a convex combination of basis models, and present some new theoretical and empirical results that demonstrate the effectiveness of a greedy approach. Theoretically, we first consider whether we can use linear, instead of convex, combinations, and obtain generalization results similar to existing ones for learning from a convex hull. We obtain a negative result that even the linear hull of very simple basis functions can have unbounded capacity, and is thus prone to overfitting; on the other hand, convex hulls are still rich but have bounded capacities. Secondly, we obtain a generalization bound for a general class of Lipschitz loss functions. Empirically, we first discuss how a convex combination can be greedily learned with early stopping, and how a convex combination can be non-greedily learned when the number of basis models is known a priori. Our experiments suggest that the greedy scheme is competitive with or better than several baselines, including boosting and random forests. The greedy algorithm requires little effort in hyper-parameter tuning, and also seems able to adapt to the underlying complexity of the problem. Our code is available at https: //github. com/tan1889/gce.

NeurIPS Conference 2020 Conference Paper

Preference learning along multiple criteria: A game-theoretic perspective

  • Kush Bhatia
  • Ashwin Pananjady
  • Peter Bartlett
  • Anca Dragan
  • Martin J. Wainwright

The literature on ranking from ordinal data is vast, and there are several ways to aggregate overall preferences from pairwise comparisons between objects. In particular, it is well-known that any Nash equilibrium of the zero-sum game induced by the preference matrix defines a natural solution concept (winning distribution over objects) known as a von Neumann winner. Many real-world problems, however, are inevitably multi-criteria, with different pairwise preferences governing the different criteria. In this work, we generalize the notion of a von Neumann winner to the multi-criteria setting by taking inspiration from Blackwell’s approachability. Our framework allows for non-linear aggregation of preferences across criteria, and generalizes the linearization-based approach from multi-objective optimization. From a theoretical standpoint, we show that the Blackwell winner of a multi-criteria problem instance can be computed as the solution to a convex optimization problem. Furthermore, given random samples of pairwise comparisons, we show that a simple, "plug-in" estimator achieves (near-)optimal minimax sample complexity. Finally, we showcase the practical utility of our framework in a user study on autonomous driving, where we find that the Blackwell winner outperforms the von Neumann winner for the overall preferences.

NeurIPS Conference 2020 Conference Paper

Self-Distillation Amplifies Regularization in Hilbert Space

  • Hossein Mobahi
  • Mehrdad Farajtabar
  • Peter Bartlett

Knowledge distillation introduced in the deep learning context is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and iterate this loop possibly a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training. To the best of our knowledge, there is no rigorous understanding of why this happens. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is Hilbert space and fitting is subject to L2 regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.

NeurIPS Conference 2018 Conference Paper

Gen-Oja: Simple & Efficient Algorithm for Streaming Generalized Eigenvector Computation

  • Kush Bhatia
  • Aldo Pacchiano
  • Nicolas Flammarion
  • Peter Bartlett
  • Michael Jordan

In this paper, we study the problems of principle Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting. We propose a simple and efficient algorithm for these problems. We prove the global convergence of our algorithm, borrowing ideas from the theory of fast-mixing Markov chains and two-Time-Scale Stochastic Approximation, showing that it achieves the optimal rate of convergence. In the process, we develop tools for understanding stochastic processes with Markovian noise which might be of independent interest.

NeurIPS Conference 2018 Conference Paper

Horizon-Independent Minimax Linear Regression

  • Alan Malek
  • Peter Bartlett

We consider online linear regression: at each round, an adversary reveals a covariate vector, the learner predicts a real value, the adversary reveals a label, and the learner suffers the squared prediction error. The aim is to minimize the difference between the cumulative loss and that of the linear predictor that is best in hindsight. Previous work demonstrated that the minimax optimal strategy is easy to compute recursively from the end of the game; this requires the entire sequence of covariate vectors in advance. We show that, once provided with a measure of the scale of the problem, we can invert the recursion and play the minimax strategy without knowing the future covariates. Further, we show that this forward recursion remains optimal even against adaptively chosen labels and covariates, provided that the adversary adheres to a set of constraints that prevent misrepresentation of the scale of the problem. This strategy is horizon-independent in that the regret and minimax strategies depend on the size of the constraint set and not on the time-horizon, and hence it incurs no more regret than the optimal strategy that knows in advance the number of rounds of the game. We also provide an interpretation of the minimax algorithm as a follow-the-regularized-leader strategy with a data-dependent regularizer and obtain an explicit expression for the minimax regret.

NeurIPS Conference 2017 Conference Paper

Acceleration and Averaging in Stochastic Descent Dynamics

  • Walid Krichene
  • Peter Bartlett

We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions. Building on an averaging formulation of accelerated mirror descent, we propose a stochastic variant in which the gradient is contaminated by noise, and study the resulting stochastic differential equation. We prove a bound on the rate of change of an energy function associated with the problem, then use it to derive estimates of convergence rates of the function values (almost surely and in expectation), both for persistent and asymptotically vanishing noise. We discuss the interaction between the parameters of the dynamics (learning rate and averaging rates) and the covariation of the noise process. In particular, we show how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.

NeurIPS Conference 2017 Conference Paper

Alternating minimization for dictionary learning with random initialization

  • Niladri Chatterji
  • Peter Bartlett

We present theoretical guarantees for an alternating minimization algorithm for the dictionary learning/sparse coding problem. The dictionary learning problem is to factorize vector samples $y^{1}, y^{2}, \ldots, y^{n}$ into an appropriate basis (dictionary) $A^*$ and sparse vectors $x^{1*}, \ldots, x^{n*}$. Our algorithm is a simple alternating minimization procedure that switches between $\ell_1$ minimization and gradient descent in alternate steps. Dictionary learning and specifically alternating minimization algorithms for dictionary learning are well studied both theoretically and empirically. However, in contrast to previous theoretical analyses for this problem, we replace a condition on the operator norm (that is, the largest magnitude singular value) of the true underlying dictionary $A^*$ with a condition on the matrix infinity norm (that is, the largest magnitude term). This not only allows us to get convergence rates for the error of the estimated dictionary measured in the matrix infinity norm, but also ensures that a random initialization will provably converge to the global optimum. Our guarantees are under a reasonable generative model that allows for dictionaries with growing operator norms, and can handle an arbitrary level of overcompleteness, while having sparsity that is information theoretically optimal. We also establish upper bounds on the sample complexity of our algorithm.

AAAI Conference 2017 Conference Paper

Fast-Tracking Stationary MOMDPs for Adaptive Management Problems

  • Martin PŽron
  • Kai Becker
  • Peter Bartlett
  • Iadine Chads

Adaptive management is applied in conservation and natural resource management, and consists of making sequential decisions when the transition matrix is uncertain. Informally described as ’learning by doing’, this approach aims to trade off between decisions that help achieve the objective and decisions that will yield a better knowledge of the true transition matrix. When the true transition matrix is assumed to be an element of a finite set of possible matrices, solving a mixed observability Markov decision process (MOMDP) leads to an optimal trade-off but is very computationally demanding. Under the assumption (common in adaptive management) that the true transition matrix is stationary, we propose a polynomial-time algorithm to find a lower bound of the value function. In the corners of the domain of the value function (belief space), this lower bound is provably equal to the optimal value function. We also show that under further assumptions, it is a linear approximation of the optimal value function in a neighborhood around the corners. We evaluate the benefits of our approach by using it to initialize the solvers MO-SARSOP and Perseus on a novel computational sustainability problem and a recent adaptive management data challenge. Our approach leads to an improved initial value function and translates into significant computational gains for both solvers.

NeurIPS Conference 2017 Conference Paper

Near Minimax Optimal Players for the Finite-Time 3-Expert Prediction Problem

  • Yasin Abbasi Yadkori
  • Peter Bartlett
  • Victor Gabillon

We study minimax strategies for the online prediction problem with expert advice. It has been conjectured that a simple adversary strategy, called COMB, is near optimal in this game for any number of experts. Our results and new insights make progress in this direction by showing that, up to a small additive term, COMB is minimax optimal in the finite-time three expert problem. In addition, we provide for this setting a new near minimax optimal COMB-based learner. Prior to this work, in this problem, learners obtaining the optimal multiplicative constant in their regret rate were known only when $K=2$ or $K\rightarrow\infty$. We characterize, when $K=3$, the regret of the game scaling as $\sqrt{8/(9\pi)T}\pm \log(T)^2$ which gives for the first time the optimal constant in the leading ($\sqrt{T}$) term of the regret.

NeurIPS Conference 2017 Conference Paper

Spectrally-normalized margin bounds for neural networks

  • Peter Bartlett
  • Dylan Foster
  • Matus Telgarsky

This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized "spectral complexity": their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the MNIST and CIFAR10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and secondly that the presented bound is sensitive to this complexity.

NeurIPS Conference 2016 Conference Paper

Adaptive Averaging in Accelerated Descent Dynamics

  • Walid Krichene
  • Alexandre Bayen
  • Peter Bartlett

We study accelerated descent dynamics for constrained convex optimization. This dynamics can be described naturally as a coupling of a dual variable accumulating gradients at a given rate $\eta(t)$, and a primal variable obtained as the weighted average of the mirrored dual trajectory, with weights $w(t)$. Using a Lyapunov argument, we give sufficient conditions on $\eta$ and $w$ to achieve a desired convergence rate. As an example, we show that the replicator dynamics (an example of mirror descent on the simplex) can be accelerated using a simple averaging scheme. We then propose an adaptive averaging heuristic which adaptively computes the weights to speed up the decrease of the Lyapunov function. We provide guarantees on adaptive averaging in continuous-time, prove that it preserves the quadratic convergence rate of accelerated first-order methods in discrete-time, and give numerical experiments to compare it with existing heuristics, such as adaptive restarting. The experiments indicate that adaptive averaging performs at least as well as adaptive restarting, with significant improvements in some cases.

NeurIPS Conference 2015 Conference Paper

Accelerated Mirror Descent in Continuous and Discrete Time

  • Walid Krichene
  • Alexandre Bayen
  • Peter Bartlett

We study accelerated mirror descent dynamics in continuous and discrete time. Combining the original continuous-time motivation of mirror descent with a recent ODE interpretation of Nesterov's accelerated method, we propose a family of continuous-time descent dynamics for convex functions with Lipschitz gradients, such that the solution trajectories are guaranteed to converge to the optimum at a $O(1/t^2)$ rate. We then show that a large family of first-order accelerated methods can be obtained as a discretization of the ODE, and these methods converge at a $O(1/k^2)$ rate. This connection between accelerated mirror descent and the ODE provides an intuitive approach to the design and analysis of accelerated first-order algorithms.

NeurIPS Conference 2015 Conference Paper

Minimax Time Series Prediction

  • Wouter Koolen
  • Alan Malek
  • Peter Bartlett
  • Yasin Abbasi Yadkori

We consider an adversarial formulation of the problem ofpredicting a time series with square loss. The aim is to predictan arbitrary sequence of vectors almost as well as the bestsmooth comparator sequence in retrospect. Our approach allowsnatural measures of smoothness such as the squared norm ofincrements. More generally, we consider a linear time seriesmodel and penalize the comparator sequence through the energy ofthe implied driving noise terms. We derive the minimax strategyfor all problems of this type and show that it can be implementedefficiently. The optimal predictions are linear in the previousobservations. We obtain an explicit expression for the regret interms of the parameters defining the problem. For typical, simple definitions of smoothness, the computation of the optimalpredictions involves only sparse matrices. In the case ofnorm-constrained data, where the smoothness is defined in termsof the squared norm of the comparator's increments, we show thatthe regret grows as $T/\sqrt{\lambda_T}$, where $T$ is the lengthof the game and $\lambda_T$ is an increasing limit on comparatorsmoothness.

NeurIPS Conference 2014 Conference Paper

Efficient Minimax Strategies for Square Loss Games

  • Wouter Koolen
  • Alan Malek
  • Peter Bartlett

We consider online prediction problems where the loss between the prediction and the outcome is measured by the squared Euclidean distance and its generalization, the squared Mahalanobis distance. We derive the minimax solutions for the case where the prediction and action spaces are the simplex (this setup is sometimes called the Brier game) and the $\ell_2$ ball (this setup is related to Gaussian density estimation). We show that in both cases the value of each sub-game is a quadratic function of a simple statistic of the state, with coefficients that can be efficiently computed using an explicit recurrence relation. The resulting deterministic minimax strategy and randomized maximin strategy are linear functions of the statistic.

NeurIPS Conference 2014 Conference Paper

Large-Margin Convex Polytope Machine

  • Alex Kantchelian
  • Michael Tschantz
  • Ling Huang
  • Peter Bartlett
  • Anthony Joseph
  • J. D. Tygar

We present the Convex Polytope Machine (CPM), a novel non-linear learning algorithm for large-scale binary classification tasks. The CPM finds a large margin convex polytope separator which encloses one class. We develop a stochastic gradient descent based algorithm that is amenable to massive datasets, and augment it with a heuristic procedure to avoid sub-optimal local minima. Our experimental evaluations of the CPM on large-scale datasets from distinct domains (MNIST handwritten digit recognition, text topic, and web security) demonstrate that the CPM trains models faster, sometimes several orders of magnitude, than state-of-the-art similar approaches and kernel-SVM methods while achieving comparable or better classification performance. Our empirical results suggest that, unlike prior similar approaches, we do not need to control the number of sub-classifiers (sides of the polytope) to avoid overfitting.

NeurIPS Conference 2013 Conference Paper

How to Hedge an Option Against an Adversary: Black-Scholes Pricing is Minimax Optimal

  • Jacob Abernethy
  • Peter Bartlett
  • Rafael Frongillo
  • Andre Wibisono

We consider a popular problem in finance, option pricing, through the lens of an online learning game between Nature and an Investor. In the Black-Scholes option pricing model from 1973, the Investor can continuously hedge the risk of an option by trading the underlying asset, assuming that the asset's price fluctuates according to Geometric Brownian Motion (GBM). We consider a worst-case model, in which Nature chooses a sequence of price fluctuations under a cumulative quadratic volatility constraint, and the Investor can make a sequence of hedging decisions. Our main result is to show that the value of our proposed game, which is the regret'' of hedging strategy, converges to the Black-Scholes option price. We use significantly weaker assumptions than previous work---for instance, we allow large jumps in the asset price---and show that the Black-Scholes hedging strategy is near-optimal for the Investor even in this non-stochastic framework. "

NeurIPS Conference 2013 Conference Paper

Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

  • Yasin Abbasi Yadkori
  • Peter Bartlett
  • Varun Kanade
  • Yevgeny Seldin
  • Csaba Szepesvari

We study the problem of online learning Markov Decision Processes (MDPs) when both the transition distributions and loss functions are chosen by an adversary. We present an algorithm that, under a mixing assumption, achieves $O(\sqrt{T\log|\Pi|}+\log|\Pi|)$ regret with respect to a comparison set of policies $\Pi$. The regret is independent of the size of the state and action spaces. When expectations over sample paths can be computed efficiently and the comparison set $\Pi$ has polynomial size, this algorithm is efficient. We also consider the episodic adversarial online shortest path problem. Here, in each episode an adversary may choose a weighted directed acyclic graph with an identified start and finish node. The goal of the learning algorithm is to choose a path that minimizes the loss while traversing from the start to finish node. At the end of each episode the loss function (given by weights on the edges) is revealed to the learning algorithm. The goal is to minimize regret with respect to a fixed policy for selecting paths. This problem is a special case of the online MDP problem. For randomly chosen graphs and adversarial losses, this problem can be efficiently solved. We show that it also can be efficiently solved for adversarial graphs and randomly chosen losses. When both graphs and losses are adversarially chosen, we present an efficient algorithm whose regret scales linearly with the number of distinct graphs. Finally, we show that designing efficient algorithms for the adversarial online shortest path problem (and hence for the adversarial MDP problem) is as hard as learning parity with noise, a notoriously difficult problem that has been used to design efficient cryptographic schemes.

NeurIPS Conference 2009 Conference Paper

Information-theoretic lower bounds on the oracle complexity of convex optimization

  • Alekh Agarwal
  • Martin Wainwright
  • Peter Bartlett
  • Pradeep Ravikumar

Despite the large amount of literature on upper bounds on complexity of convex analysis, surprisingly little is known about the fundamental hardness of these problems. The extensive use of convex optimization in machine learning and statistics makes such an understanding critical to understand fundamental computational limits of learning and estimation. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for some function classes. We also discuss implications of these results to the understanding the inherent complexity of large-scale learning and estimation problems.

NeurIPS Conference 2007 Conference Paper

Adaptive Online Gradient Descent

  • Elad Hazan
  • Alexander Rakhlin
  • Peter Bartlett

We study the rates of growth of the regret in online convex optimization. First, we show that a simple extension of the algorithm of Hazan et al eliminates the need for a priori knowledge of the lower bound on the second derivatives of the observed functions. We then provide an algorithm, Adaptive Online Gradient Descent, which interpolates between the results of Zinkevich for linear functions and of Hazan et al for strongly convex functions, achieving intermediate rates T and log T. Furthermore, we show strong optimality of the algorithm. between Finally, we provide an extension of our results to general norms.

NeurIPS Conference 2007 Conference Paper

Optimistic Linear Programming gives Logarithmic Regret for Irreducible MDPs

  • Ambuj Tewari
  • Peter Bartlett

We present an algorithm called Optimistic Linear Programming (OLP) for learning to optimize average reward in an irreducible but otherwise unknown Markov decision process (MDP). OLP uses its experience so far to estimate the MDP. It chooses actions by optimistically maximizing estimated future rewards over a set of next-state transition probabilities that are close to the estimates: a computation that corresponds to solving linear programs. We show that the total expected reward obtained by OLP up to time $T$ is within $C(P)\log T$ of the reward obtained by the optimal policy, where $C(P)$ is an explicit, MDP-dependent constant. OLP is closely related to an algorithm proposed by Burnetas and Katehakis with four key differences: OLP is simpler, it does not require knowledge of the supports of transition probabilities and the proof of the regret bound is simpler, but our regret bound is a constant factor larger than the regret of their algorithm. OLP is also similar in flavor to an algorithm recently proposed by Auer and Ortner. But OLP is simpler and its regret bound has a better dependence on the size of the MDP.

NeurIPS Conference 2006 Conference Paper

AdaBoost is Consistent

  • Peter Bartlett
  • Mikhail Traskin

The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that provided AdaBoost is stopped after n iterations--for sample size n and 0.

NeurIPS Conference 2006 Conference Paper

Sample Complexity of Policy Search with Known Dynamics

  • Peter Bartlett
  • Ambuj Tewari

We consider methods that try to find a good policy for a Markov decision process by choosing one from a given class. The policy is chosen based on its empirical performance in simulations. We are interested in conditions on the complexity of the policy class that ensure the success of such simulation based policy search methods. We show that under bounds on the amount of computation involved in computing policies, transition dynamics and rewards, uniform convergence of empirical estimates to true value functions occurs. Previously, such results were derived by assuming boundedness of pseudodimension and Lipschitz continuity. These assumptions and ours are both stronger than the usual combinatorial complexity measures. We show, via minimax inequalities, that this is essential: boundedness of pseudodimension or fat-shattering dimension alone is not sufficient.

NeurIPS Conference 2006 Conference Paper

Shifting, One-Inclusion Mistake Bounds and Tight Multiclass Expected Risk Bounds

  • Benjamin Rubinstein
  • Peter Bartlett
  • J. Rubinstein

Under the prediction model of learning, a prediction strategy is presented with an i. i. d. sample of n - 1 points in X and corresponding labels from a concept f F, and aims to minimize the worst-case probability of erring on an nth point. By exploiting the structure of F, Haussler et al. achieved a VC(F )/n bound for the natural one-inclusion prediction strategy, improving on bounds implied by PAC-type results by a O(log n) factor. The key data structure in their result is the natural subgraph of the hypercube--the one-inclusion graph; the key step is a d = VC(F ) bound on one-inclunion graph density. The first main result of this s /n -1 paper is a density bound of n d-1 ( d ) < d, which positively resolves a conjecture of Kuzmin & Warmuth relating to their unlabeled Peeling compression scheme and also leads to an improved mistake bound for the randomized (deterministic) one-inclusion strategy for all d (for d (n)). The proof uses a new form of VC-invariant shifting and a group-theoretic symmetrization. Our second main result is a k -class analogue of the d/n mistake bound, replacing the VC-dimension by the Pollard pseudo-dimension and the one-inclusion strategy by its natural hypergraph generalization. This bound on expected risk improves on known PAC-based results by a factor of O(log n) and is shown to be optimal up to a O(log k ) factor. The combinatorial technique of shifting takes a central role in understanding the one-inclusion (hyper)graph and is a running theme throughout.

NeurIPS Conference 2004 Conference Paper

Exponentiated Gradient Algorithms for Large-margin Structured Classification

  • Peter Bartlett
  • Michael Collins
  • Ben Taskar
  • David McAllester

We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal struc- ture. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient—even in cases where the number of labels y is exponential in size—provided that certain expecta- tions under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the ap- plication of exponentiated gradient updates [7, 8] to quadratic programs.

JMLR Journal 2004 Journal Article

Learning the Kernel Matrix with Semidefinite Programming

  • Gert R.G. Lanckriet
  • Nello Cristianini
  • Peter Bartlett
  • Laurent El Ghaoui
  • Michael I. Jordan

Kernel-based learning algorithms work by embedding the data into a Euclidean space, and then searching for linear relations among the embedded data points. The embedding is performed implicitly, by specifying the inner products between each pair of points in the embedding space. This information is contained in the so-called kernel matrix, a symmetric and positive semidefinite matrix that encodes the relative positions of all points. Specifying this matrix amounts to specifying the geometry of the embedding space and inducing a notion of similarity in the input space---classical model selection problems in machine learning. In this paper we show how the kernel matrix can be learned from data via semidefinite programming (SDP) techniques. When applied to a kernel matrix associated with both training and test data this gives a powerful transductive algorithm---using the labeled part of the data one can learn an embedding also for the unlabeled part. The similarity between test points is inferred from training points and their labels. Importantly, these learning problems are convex, so we obtain a method for learning both the model class and the function without local minima. Furthermore, this approach leads directly to a convex method for learning the 2-norm soft margin parameter in support vector machines, solving an important open problem. [abs] [ pdf ][ ps.gz ][ ps ]

NeurIPS Conference 2003 Conference Paper

Large Margin Classifiers: Convex Loss, Low Noise, and Convergence Rates

  • Peter Bartlett
  • Michael Jordan
  • Jon McAuliffe

Many classification algorithms, including the support vector machine, boosting and logistic regression, can be viewed as minimum contrast methods that minimize a convex surrogate of the 0-1 loss function. We characterize the statistical consequences of using such a surrogate by pro- viding a general quantitative relationship between the risk as assessed us- ing the 0-1 loss and the risk as assessed using any nonnegative surrogate loss function. We show that this relationship gives nontrivial bounds un- der the weakest possible condition on the loss function—that it satisfy a pointwise form of Fisher consistency for classification. The relationship is based on a variational transformation of the loss function that is easy to compute in many applications. We also present a refined version of this result in the case of low noise. Finally, we present applications of our results to the estimation of convergence rates in the general setting of function classes that are scaled hulls of a finite-dimensional base class.

NeurIPS Conference 2001 Conference Paper

Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning

  • Evan Greensmith
  • Peter Bartlett
  • Jonathan Baxter

We consider the use of two additive control variate methods to reduce the variance of performance gradient estimates in reinforcement learn- ing problems. The first approach we consider is the baseline method, in which a function of the current state is added to the discounted value estimate. We relate the performance of these methods, which use sam- ple paths, to the variance of estimates based on iid data. We derive the baseline function that minimizes this variance, and we show that the vari- ance for any baseline is the sum of the optimal variance and a weighted squared distance to the optimal baseline. We show that the widely used average discounted value baseline (where the reward is replaced by the difference between the reward and its expectation) is suboptimal. The second approach we consider is the actor-critic method, which uses an approximate value function. We give bounds on the expected squared error of its estimates. We show that minimizing distance to the true value function is suboptimal in general; we provide an example for which the true value function gives an estimate with positive variance, but the op- timal value function gives an unbiased estimate with zero variance. Our bounds suggest algorithms to estimate the gradient of the performance of parameterized baseline or value functions. We present preliminary exper- iments that illustrate the performance improvements on a simple control problem. 1 Introduction, Background, and Preliminary Results In reinforcement learning problems, the aim is to select a controller that will maximize the average reward in some environment. We model the environment as a partially ob- servable Markov decision process (POMDP). Gradient ascent methods (e. g. , [7, 12, 15]) estimate the gradient of the average reward, usually using Monte Carlo techniques to cal- (cid: 3)Most of this work was performed while the authors were with the Research School of Information Sciences and Engineering at the Australian National University. culate an average over a sample path of the controlled POMDP. However such estimates tend to have a high variance, which means many steps are needed to obtain a good esti- mate. GPOMDP [4] is an algorithm for generating an estimate of the gradient in this way. Compared with other approaches, it is suitable for large systems, when the time between visits to a state is large but the mixing time of the controlled POMDP is short. However, it can suffer from the problem of producing high variance estimates. In this paper, we investi- gate techniques for variance reduction in GPOMDP. One generic approach to reducing the variance of Monte Carlo estimates of integrals is to use an additive control variate (see, for example, [6]). Suppose we wish to estimate the integral of f: X! R, and we know the integral of another function ’: X! R. Since RX f = RX (f (cid: 0) ’) +RX ’, the integral of f (cid: 0) ’ can be estimated instead. Obviously if ’ = f then the variance is zero. More generally, Var(f (cid: 0) ’) = Var(f ) (cid: 0) 2Cov(f; ’) + Var(’), so that if (cid: 30) and f are strongly correlated, the variance of the estimate is reduced. In this paper, we consider two approaches of this form. The first (Section 2) is the technique of adding a baseline. We find the optimal baseline and we show that the additional variance of a suboptimal baseline can be expressed as a weighted squared distance from the optimal baseline. Constant baselines, which do not depend on the state or observations, have been widely used [13, 15, 9, 11]. In particular, the expectation over all states of the discounted value of the state is a popular constant baseline (where, for example, the reward at each step is replaced by the difference between the reward and the expected reward). We give bounds on the estimation variance that show that, perhaps surprisingly, this may not be the best choice. The second approach (Section 3) is the use of an approximate value function. Such actor- critic methods have been investigated extensively [3, 1, 14, 10]. Generally the idea is to minimize some notion of distance between the fixed value function and the true value function. In this paper we show that this may not be the best approach: selecting the fixed value function to be equal to the true value function is not always the best choice. Even more surprisingly, we give an example for which the use of a fixed value function that is different from the true value function reduces the variance to zero, for no increase in bias. We give a bound on the expected squared error (that is, including the estimation variance) of the gradient estimate produced with a fixed value function. Our results suggest new algorithms to learn the optimum baseline, and to learn a fixed value function that minimizes the bound on the error of the estimate. In Section 5, we describe the results of preliminary experiments, which show that these algorithms give performance improvements. POMDP with Reactive, Parameterized Policy A partially observable Markov decision process (POMDP) consists of a state space, S, a control space, U, an observation space, Y, a set of transition probability matrices fP(u): u 2 Ug, each with components pij (u) for i; j 2 S; u 2 U, an observation pro- cess (cid: 23): S! PY, where PY is the space of probability distributions over Y, and a reward function r: S! R. We assume that S; U; Y are finite, although all our re- sults extend easily to infinite U and Y, and with more restrictive assumptions can be extended to infinite S. A reactive, parameterized policy for a POMDP is a set of map- pings f(cid: 22)((cid: 1); (cid: 18)): Y! PU j(cid: 18) 2 RKg. Together with the POMDP, this defines the con- trolled POMDP (S; U; Y; P; (cid: 23); r; (cid: 22)). The joint state, observation and control process, fXt; Yt; Utg, is Markov. The state process, fXtg, is also Markov, with transition prob- abilities pij ((cid: 18)) = Py2Y; u2U (cid: 23)y(i)(cid: 22)u(y; (cid: 18))pij (u), where (cid: 23)y(i) denotes the probability of observation y given the state i, and (cid: 22)u(y; (cid: 18)) denotes the probability of action u given pa- rameters (cid: 18) and observation y. The Markov chain M((cid: 18)) = (S; P((cid: 18))) then describes the behaviour of the process fXtg. Assumption 1 The controlled POMDP (S; U; Y; P; (cid: 23); r; (cid: 22)) satisfies: For all (cid: 18) 2 RK there exists a unique stationary distribution satisfying (cid: 25) 0((cid: 18)) P((cid: 18)) = (cid: 25)0((cid: 18)). There is an R < 1 such that, for all i 2 S, jr(i)j (cid: 20) R. There is a B < 1 such that, for all u 2 U, y 2 Y and (cid: 18) 2 RK the deriva- tives @(cid: 22)u(y; (cid: 18))=@(cid: 18)k (1 (cid: 20) k (cid: 20) K) exist, and the vector of these derivatives satisfies kr(cid: 22)u(y; (cid: 18))=(cid: 22)u(y; (cid: 18))k (cid: 20) B, where k (cid: 1) k denotes the Euclidean norm on RK. implies that this limit exists, and does not depend on the start state X0. The aim is to select a policy to maximize this quantity. Define the discounted value function, J(cid: 12)(i; (cid: 18)) def= We consider the average reward, (cid: 17)((cid: 18)) def= limT! 1 Eh 1 t=0 r(Xt)i. Assumption 1 X0 = i i. Throughout the rest of the paper, dependences limT! 1 EhPT (cid: 0)1 Var(A) = Eh(A (cid: 0) E [A])2i, where a2 denotes a0a, and a0 denotes the transpose of the upon (cid: 18) are assumed, and dropped in the notation. For a random vector A, we denote t=0 (cid: 12)tr(Xt)(cid: 12)(cid: 12)(cid: 12)

NeurIPS Conference 2000 Conference Paper

Sparse Greedy Gaussian Process Regression

  • Alex Smola
  • Peter Bartlett

We present a simple sparse greedy technique to approximate the maximum a posteriori estimate of Gaussian Processes with much improved scaling behaviour in the sample size m. In particular, computational requirements are O(n2m), storage is O(nm), the cost for prediction is 0 ( n) and the cost to compute confidence bounds is O(nm), where n «: m. We show how to compute a stopping criterion, give bounds on the approximation error, and show applications to large scale problems.

NeurIPS Conference 1999 Conference Paper

Boosting Algorithms as Gradient Descent

  • Llew Mason
  • Jonathan Baxter
  • Peter Bartlett
  • Marcus Frean

We provide an abstract characterization of boosting algorithms as gradient decsent on cost-functionals in an inner-product function space. We prove convergence of these functional-gradient-descent algorithms under quite weak conditions. Following previous theo(cid: 173) retical results bounding the generalization performance of convex combinations of classifiers in terms of general cost functions of the margin, we present a new algorithm (DOOM II) for performing a gradient descent optimization of such cost functions. Experiments on several data sets from the UC Irvine repository demonstrate that DOOM II generally outperforms AdaBoost, especially in high noise situations, and that the overfitting behaviour of AdaBoost is predicted by our cost functions.

NeurIPS Conference 1998 Conference Paper

Almost Linear VC Dimension Bounds for Piecewise Polynomial Networks

  • Peter Bartlett
  • Vitaly Maiorov
  • Ron Meir

We compute upper and lower bounds on the VC dimension of feedforward networks of units with piecewise polynomial activa(cid: 173) tion functions. We show that if the number of layers is fixed, then the VC dimension grows as W log W, where W is the number of parameters in the network. This result stands in opposition to the case where the number of layers is unbounded, in which case the VC dimension grows as W 2 •

NeurIPS Conference 1998 Conference Paper

Direct Optimization of Margins Improves Generalization in Combined Classifiers

  • Llew Mason
  • Peter Bartlett
  • Jonathan Baxter

Cumulative training margin dis(cid: 173) tributions for AdaBoost versus our "Direct Optimization Of Margins" (DOOM) algorithm. The dark curve is AdaBoost, the light curve is DOOM. DOOM sacrifices significant training er(cid: 173) ror for improved test error (hori(cid: 173) zontal marks on margin= 0 line)_ -1 -0. 8 -0. 6 -0. 4 -0. 2 0 0. 2 0. 4 0. 6 0. 8

NeurIPS Conference 1998 Conference Paper

Shrinking the Tube: A New Support Vector Regression Algorithm

  • Bernhard Schölkopf
  • Peter Bartlett
  • Alex Smola
  • Robert Williamson

A new algorithm for Support Vector regression is described. For a priori chosen 1/, it automatically adjusts a flexible tube of minimal radius to the data such that at most a fraction 1/ of the data points lie outside. More(cid: 173) over, it is shown how to use parametric tube shapes with non-constant radius. The algorithm is analysed theoretically and experimentally.

NeurIPS Conference 1997 Conference Paper

Generalization in Decision Trees and DNF: Does Size Matter?

  • Mostefa Golea
  • Peter Bartlett
  • Wee Sun Lee
  • Llew Mason

Recent theoretical results for pattern classification with thresh(cid: 173) olded real-valued functions (such as support vector machines, sig(cid: 173) moid networks, and boosting) give bounds on misclassification probability that do not depend on the size of the classifier, and hence can be considerably smaller than the bounds that follow from the VC theory. In this paper, we show that these techniques can be more widely applied, by representing other boolean functions as two-layer neural networks (thresholded convex combinations of boolean functions). For example, we show that with high probabil(cid: 173) ity any decision tree of depth no more than d that is consistent with m training examples has misclassification probability no more than o ( (~ (Neff VCdim(U) log2 m log d)) 1/2), where U is the class of node decision functions, and Neff: :; N can be thought of as the effective number of leaves (it becomes small as the distribution on the leaves induced by the training data gets far from uniform). This bound is qualitatively different from the VC bound and can be considerably smaller. We use the same technique to give similar results for DNF formulae. • Author to whom correspondence should be addressed 260 M. Golea, P Bartlett, W. S. Lee and L Mason

NeurIPS Conference 1997 Conference Paper

The Canonical Distortion Measure in Feature Space and 1-NN Classification

  • Jonathan Baxter
  • Peter Bartlett

We prove that the Canonical Distortion Measure (CDM) [2, 3] is the optimal distance measure to use for I nearest-neighbour (l-NN) classifi(cid: 173) cation, and show that it reduces to squared Euclidean distance in feature space for function classes that can be expressed as linear combinations of a fixed set of features. PAC-like bounds are given on the sample(cid: 173) complexity required to learn the CDM. An experiment is presented in which a neural network CDM was learnt for a Japanese OCR environ(cid: 173) ment and then used to do I-NN classification.

NeurIPS Conference 1996 Conference Paper

For Valid Generalization the Size of the Weights is More Important than the Size of the Network

  • Peter Bartlett

This paper shows that if a large neural network is used for a pattern classification problem, and the learning algorithm finds a network with small weights that has small squared error on the training patterns, then the generalization performance depends on the size of the weights rather than the number of weights. More specifi(cid: 173) cally, consider an i-layer feed-forward network of sigmoid units, in which the sum of the magnitudes of the weights associated with each unit is bounded by A. The misclassification probability con(cid: 173) verges to an error estimate (that is closely related to squared error on the training set) at rate O((cA)l(l+1)/2J(log n)jm) ignoring log factors, where m is the number of training patterns, n is the input dimension, and c is a constant. This may explain the gen(cid: 173) eralization performance of neural networks, particularly when the number of training examples is considerably smaller than the num(cid: 173) ber of weights. It also supports heuristics (such as weight decay and early stopping) that attempt to keep the weights small during training.

NeurIPS Conference 1995 Conference Paper

Examples of learning curves from a modified VC-formalism

  • Adam Kowalczyk
  • Jacek Szymanski
  • Peter Bartlett
  • Robert Williamson

We examine the issue of evaluation of model specific parameters in a modified VC-formalism. Two examples are analyzed: the 2-dimensional homogeneous perceptron and the I-dimensional higher order neuron. Both models are solved theoretically, and their learning curves are com(cid: 173) pared against true learning curves. It is shown that the formalism has the potential to generate a variety of learning curves, including ones displaying ''phase transitions. "

NeurIPS Conference 1991 Conference Paper

Splines, Rational Functions and Neural Networks

  • Robert Williamson
  • Peter Bartlett

Connections between spline approximation, approximation with rational functions, and feedforward neural networks are studied. The potential improvement in the degree of approximation in going from single to two hidden layer networks is examined. Some results of Birman and Solomjak regarding the degree of approximation achievable when knot positions are chosen on the basis of the probability distribution of examples rather than the function values are extended.