Author name cluster

John Duchi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

26 papers

1 author row

NeurIPS Conference 2025 Conference Paper

Sample-Conditional Coverage in Split-Conformal Prediction

John Duchi

We revisit the problem of constructing predictive confidence sets for which we wish to obtain some type of conditional validity. We provide new arguments showing how ``split conformal'' methods achieve near desired coverage levels with high probability, a guarantee conditional on the validation data rather than marginal over it. In addition, we directly consider (approximate) conditional coverage, where, e. g. , conditional on a covariate $X$ belonging to some group of interest, we seek a guarantee that a predictive set covers the true outcome $Y$. We show that the natural method of performing quantile regression on a held-out (validation) dataset yields minimax optimal guarantees of coverage in these cases. Complementing these positive results, we also provide experimental evidence highlighting work that remains to develop computationally efficient valid predictive inference methods.

NeurIPS Conference 2019 Conference Paper

Necessary and Sufficient Geometries for Gradient Methods

Daniel Levy
John Duchi

We study the impact of the constraint set and gradient geometry on the convergence of online and stochastic methods for convex optimization, providing a characterization of the geometries for which stochastic gradient and adaptive gradient methods are (minimax) optimal. In particular, we show that when the constraint set is quadratically convex, diagonally pre-conditioned stochastic gradient methods are minimax optimal. We further provide a converse that shows that when the constraints are not quadratically convex---for example, any $\ell_p$-ball for $p < 2$---the methods are far from optimal. Based on this, we can provide concrete recommendations for when one should use adaptive, mirror or stochastic gradient methods.

NeurIPS Conference 2019 Conference Paper

Unlabeled Data Improves Adversarial Robustness

Yair Carmon
Aditi Raghunathan
Ludwig Schmidt
John Duchi
Percy Liang

We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels.

JMLR Journal 2019 Journal Article

Variance-based Regularization with Convex Objectives

John Duchi
Hongseok Namkoong

We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems. [abs] [ pdf ][ bib ] &copy JMLR 2019. ( edit, beta )

NeurIPS Conference 2018 Conference Paper

Analysis of Krylov Subspace Solutions of Regularized Non-Convex Quadratic Problems

Yair Carmon
John Duchi

We provide convergence rates for Krylov subspace solutions to the trust-region and cubic-regularized (nonconvex) quadratic problems. Such solutions may be efficiently computed by the Lanczos method and have long been used in practice. We prove error bounds of the form $1/t^2$ and $e^{-4t/\sqrt{\kappa}}$, where $\kappa$ is a condition number for the problem, and $t$ is the Krylov subspace order (number of Lanczos iterations). We also provide lower bounds showing that our analysis is sharp.

NeurIPS Conference 2018 Conference Paper

Generalizing to Unseen Domains via Adversarial Data Augmentation

Riccardo Volpi
Hongseok Namkoong
Ozan Sener
John Duchi
Vittorio Murino
Silvio Savarese

We are concerned with learning models that generalize well to different unseen domains. We consider a worst-case formulation over data distributions that are near the source domain in the feature space. Only using training data from a single source distribution, we propose an iterative procedure that augments the dataset with examples from a fictitious target domain that is "hard" under the current model. We show that our iterative scheme is an adaptive data augmentation method where we append adversarial examples at each iteration. For softmax losses, we show that our method is a data-dependent regularization scheme that behaves differently from classical regularizers that regularize towards zero (e. g. , ridge or lasso). On digit recognition and semantic segmentation tasks, our method learns models improve performance across a range of a priori unknown target domains.

NeurIPS Conference 2018 Conference Paper

Scalable End-to-End Autonomous Vehicle Testing via Rare-event Simulation

Matthew O'Kelly
Aman Sinha
Hongseok Namkoong
Russ Tedrake
John Duchi

While recent developments in autonomous vehicle (AV) technology highlight substantial progress, we lack tools for rigorous and scalable testing. Real-world testing, the de facto evaluation environment, places the public in danger, and, due to the rare nature of accidents, will require billions of miles in order to statistically validate performance claims. We implement a simulation framework that can test an entire modern autonomous driving system, including, in particular, systems that employ deep-learning perception and control algorithms. Using adaptive importance-sampling methods to accelerate rare-event probability evaluation, we estimate the probability of an accident under a base distribution governing standard traffic behavior. We demonstrate our framework on a highway scenario, accelerating system evaluation by 2-20 times over naive Monte Carlo sampling methods and 10-300P times (where P is the number of processors) over real-world testing.

NeurIPS Conference 2017 Conference Paper

Unsupervised Transformation Learning via Convex Relaxations

Tatsunori Hashimoto
Percy Liang
John Duchi

Our goal is to extract meaningful transformations from raw images, such as varying the thickness of lines in handwriting or the lighting in a portrait. We propose an unsupervised approach to learn such transformations by attempting to reconstruct an image from a linear combination of transformations of its nearest neighbors. On handwritten digits and celebrity portraits, we show that even with linear transformations, our method generates visually high-quality modified images. Moreover, since our method is semiparametric and does not model the data distribution, the learned transformations extrapolate off the training data and can be applied to new types of images.

NeurIPS Conference 2017 Conference Paper

Variance-based Regularization with Convex Objectives

Hongseok Namkoong
John Duchi

We develop an approach to risk minimization and stochastic optimization that provides a convex surrogate for variance, allowing near-optimal and computationally efficient trading between approximation and estimation error. Our approach builds off of techniques for distributionally robust optimization and Owen's empirical likelihood, and we provide a number of finite-sample and asymptotic results characterizing the theoretical performance of the estimator. In particular, we show that our procedure comes with certificates of optimality, achieving (in some scenarios) faster rates of convergence than empirical risk minimization by virtue of automatically balancing bias and variance. We give corroborating empirical evidence showing that in practice, the estimator indeed trades between variance and absolute performance on a training sample, improving out-of-sample (test) performance over standard empirical risk minimization for a number of classification problems.

NeurIPS Conference 2016 Conference Paper

Learning Kernels with Random Features

Aman Sinha
John Duchi

Randomized features provide a computationally efficient way to approximate kernel machines in machine learning tasks. However, such methods require a user-defined kernel as input. We extend the randomized-feature approach to the task of learning a kernel (via its associated random features). Specifically, we present an efficient optimization problem that learns a kernel in a supervised manner. We prove the consistency of the estimated kernel as well as generalization bounds for the class of estimators induced by the optimized kernel, and we experimentally evaluate our technique on several datasets. Our approach is efficient and highly scalable, and we attain competitive results with a fraction of the training cost of other techniques.

NeurIPS Conference 2016 Conference Paper

Local Minimax Complexity of Stochastic Convex Optimization

Sabyasachi Chatterjee
John Duchi
John Lafferty
Yuancheng Zhu

We extend the traditional worst-case, minimax analysis of stochastic convex optimization by introducing a localized form of minimax complexity for individual functions. Our main result gives function-specific lower and upper bounds on the number of stochastic subgradient evaluations needed to optimize either the function or its ``hardest local alternative'' to a given numerical precision. The bounds are expressed in terms of a localized and computational analogue of the modulus of continuity that is central to statistical minimax analysis. We show how the computational modulus of continuity can be explicitly calculated in concrete cases, and relates to the curvature of the function at the optimum. We also prove a superefficiency result that demonstrates it is a meaningful benchmark, acting as a computational analogue of the Fisher information in statistical estimation. The nature and practical implications of the results are demonstrated in simulations.

NeurIPS Conference 2016 Conference Paper

Stochastic Gradient Methods for Distributionally Robust Optimization with f-divergences

Hongseok Namkoong
John Duchi

We develop efficient solution methods for a robust empirical risk minimization problem designed to give calibrated confidence intervals on performance and provide optimal tradeoffs between bias and variance. Our methods apply to distributionally robust optimization problems proposed by Ben-Tal et al. , which put more weight on observations inducing high loss via a worst-case approach over a non-parametric uncertainty set on the underlying data distribution. Our algorithm solves the resulting minimax problems with nearly the same computational cost of stochastic gradient descent through the use of several carefully designed data structures. For a sample of size n, the per-iteration cost of our method scales as O(log n), which allows us to give optimality certificates that distributionally robust optimization provides at little extra cost compared to empirical risk minimization and stochastic gradient methods.

NeurIPS Conference 2015 Conference Paper

Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care

Sorathan Chaturapruek
John Duchi
Christopher Ré

We show that asymptotically, completely asynchronous stochastic gradient procedures achieve optimal (even to constant factors) convergence rates for the solution of convex optimization problems under nearly the same conditions required for asymptotic optimality of standard stochastic gradient procedures. Roughly, the noise inherent to the stochastic approximation scheme dominates any noise from asynchrony. We also give empirical evidence demonstrating the strong performance of asynchronous, parallel stochastic optimization schemes, demonstrating that the robustness inherent to stochastic approximation problems allows substantially faster parallel and asynchronous solution methods. In short, we show that for many stochastic approximation problems, as Freddie Mercury sings in Queen's \emph{Bohemian Rhapsody}, ``Nothing really matters. ''

JMLR Journal 2015 Journal Article

Divide and Conquer Kernel Ridge Regression: A Distributed Algorithm with Minimax Optimal Rates

Yuchen Zhang
John Duchi
Martin Wainwright

We study a decomposition-based scalable approach to kernel ridge regression, and show that it achieves minimax optimal convergence rates under relatively mild conditions. The method is simple to describe: it randomly partitions a dataset of size $N$ into $m$ subsets of equal size, computes an independent kernel ridge regression estimator for each subset using a careful choice of the regularization parameter, then averages the local solutions into a global predictor. This partitioning leads to a substantial reduction in computation time versus the standard approach of performing kernel ridge regression on all $N$ samples. Our two main theorems establish that despite the computational speed-up, statistical optimality is retained: as long as $m$ is not too large, the partition-based estimator achieves the statistical minimax rate over all estimators using the set of $N$ samples. As concrete examples, our theory guarantees that the number of subsets $m$ may grow nearly linearly for finite-rank or Gaussian kernels and polynomially in $N$ for Sobolev spaces, which in turn allows for substantial reductions in computational cost. We conclude with experiments on both simulated data and a music-prediction task that complement our theoretical results, exhibiting the computational and statistical benefits of our approach. [abs] [ pdf ][ bib ] &copy JMLR 2015. ( edit, beta )

NeurIPS Conference 2013 Conference Paper

Estimation, Optimization, and Parallelism when Data is Sparse

John Duchi
Michael Jordan
Brendan McMahan

We study stochastic optimization problems when the \emph{data} is sparse, which is in a sense dual to the current understanding of high-dimensional statistical learning and optimization. We highlight both the difficulties---in terms of increased sample complexity that sparse data necessitates---and the potential benefits, in terms of allowing parallelism and asynchrony in the design of algorithms. Concretely, we derive matching upper and lower bounds on the minimax rate for optimization and learning with sparse data, and we exhibit algorithms achieving these rates. Our algorithms are adaptive: they achieve the best possible rate for the data observed. We also show how leveraging sparsity leads to (still minimax optimal) parallel and asynchronous algorithms, providing experimental evidence complementing our theoretical results on medium to large-scale learning tasks.

NeurIPS Conference 2013 Conference Paper

Information-theoretic lower bounds for distributed statistical estimation with communication constraints

Yuchen Zhang
John Duchi
Michael Jordan
Martin Wainwright

We establish minimax risk lower bounds for distributed statistical estimation given a budget $B$ of the total number of bits that may be communicated. Such lower bounds in turn reveal the minimum amount of communication required by any procedure to achieve the classical optimal rate for statistical estimation. We study two classes of protocols in which machines send messages either independently or interactively. The lower bounds are established for a variety of problems, from estimating the mean of a population to estimating parameters in linear regression or binary classification.

NeurIPS Conference 2013 Conference Paper

Local Privacy and Minimax Bounds: Sharp Rates for Probability Estimation

John Duchi
Martin Wainwright
Michael Jordan

We provide a detailed study of the estimation of probability distributions---discrete and continuous---in a stringent setting in which data is kept private even from the statistician. We give sharp minimax rates of convergence for estimation in these locally private settings, exhibiting fundamental tradeoffs between privacy and convergence rate, as well as providing tools to allow movement along the privacy-statistical efficiency continuum. One of the consequences of our results is that Warner's classical work on randomized response is an optimal way to perform survey sampling while maintaining privacy of the respondents.

NeurIPS Conference 2012 Conference Paper

Communication-Efficient Algorithms for Statistical Optimization

Yuchen Zhang
Martin Wainwright
John Duchi

We study two communication-efficient algorithms for distributed statistical optimization on large-scale data. The first algorithm is an averaging method that distributes the $N$ data samples evenly to $m$ machines, performs separate minimization on each subset, and then averages the estimates. We provide a sharp analysis of this average mixture algorithm, showing that under a reasonable set of conditions, the combined parameter achieves mean-squared error that decays as $\order(N^{-1}+(N/m)^{-2})$. Whenever $m \le \sqrt{N}$, this guarantee matches the best possible rate achievable by a centralized algorithm having access to all $N$ samples. The second algorithm is a novel method, based on an appropriate form of the bootstrap. Requiring only a single round of communication, it has mean-squared error that decays as $\order(N^{-1}+(N/m)^{-3})$, and so is more robust to the amount of parallelization. We complement our theoretical results with experiments on large-scale problems from the Microsoft Learning to Rank dataset.

NeurIPS Conference 2012 Conference Paper

Finite Sample Convergence Rates of Zero-Order Stochastic Optimization Methods

Andre Wibisono
Martin Wainwright
Michael Jordan
John Duchi

We consider derivative-free algorithms for stochastic optimization problems that use only noisy function values rather than gradients, analyzing their finite-sample convergence rates. We show that if pairs of function values are available, algorithms that use gradient estimates based on random perturbations suffer a factor of at most $\sqrt{\dim}$ in convergence rate over traditional stochastic gradient methods, where $\dim$ is the dimension of the problem. We complement our algorithmic development with information-theoretic lower bounds on the minimax convergence rate of such problems, which show that our bounds are sharp with respect to all problem-dependent quantities: they cannot be improved by more than constant factors.

NeurIPS Conference 2012 Conference Paper

Privacy Aware Learning

Martin Wainwright
Michael Jordan
John Duchi

We study statistical risk minimization problems under a version of privacy in which the data is kept confidential even from the learner. In this local privacy framework, we show sharp upper and lower bounds on the convergence rates of statistical estimation procedures. As a consequence, we exhibit a precise tradeoff between the amount of privacy the data preserves and the utility, measured by convergence rate, of any statistical estimator.

JMLR Journal 2011 Journal Article

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

John Duchi
Elad Hazan
Yoram Singer

We present a new family of subgradient methods that dynamically incorporate knowledge of the geometry of the data observed in earlier iterations to perform more informative gradient-based learning. Metaphorically, the adaptation allows us to find needles in haystacks in the form of very predictive but rarely seen features. Our paradigm stems from recent advances in stochastic optimization and online learning which employ proximal functions to control the gradient steps of the algorithm. We describe and analyze an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight. We give several efficient algorithms for empirical risk minimization problems with common and important regularization functions and domain constraints. We experimentally study our theoretical analysis and show that adaptive subgradient methods outperform state-of-the-art, yet non-adaptive, subgradient algorithms. [abs] [ pdf ][ bib ] &copy JMLR 2011. ( edit, beta )

NeurIPS Conference 2011 Conference Paper

Distributed Delayed Stochastic Optimization

Alekh Agarwal
John Duchi

We analyze the convergence of gradient-based optimization algorithms whose updates depend on delayed stochastic gradient information. The main application of our results is to the development of distributed minimization algorithms where a master node performs parameter updates while worker nodes compute stochastic gradients based on local information in parallel, which may give rise to delays due to asynchrony. Our main contribution is to show that for smooth stochastic problems, the delays are asymptotically negligible. In application to distributed optimization, we show $n$-node architectures whose optimization error in stochastic problems---in spite of asynchronous delays---scales asymptotically as $\order(1 / \sqrt{nT})$, which is known to be optimal even in the absence of delays.

NeurIPS Conference 2010 Conference Paper

Distributed Dual Averaging In Networks

Alekh Agarwal
Martin Wainwright
John Duchi

The goal of decentralized optimization over a network is to optimize a global objective formed by a sum of local (possibly nonsmooth) convex functions using only local computation and communication. We develop and analyze distributed algorithms based on dual averaging of subgradients, and we provide sharp bounds on their convergence rates as a function of the network size and topology. Our analysis clearly separates the convergence of the optimization algorithm itself from the effects of communication constraints arising from the network structure. We show that the number of iterations required by our algorithm scales inversely in the spectral gap of the network. The sharpness of this prediction is confirmed both by theoretical lower bounds and simulations for various networks.

NeurIPS Conference 2009 Conference Paper

Efficient Learning using Forward-Backward Splitting

Yoram Singer
John Duchi

We describe, analyze, and experiment with a new framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an {\em unconstrained} gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This yields a simple yet effective algorithm for both batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as $\ell_1$. We derive concrete and very simple algorithms for minimization of loss functions with $\ell_1$, $\ell_2$, $\ell_2^2$, and $\ell_\infty$ regularization. We also show how to construct efficient algorithms for mixed-norm $\ell_1/\ell_q$ regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in experiments with synthetic and natural datasets.

JMLR Journal 2009 Journal Article

Efficient Online and Batch Learning Using Forward Backward Splitting

John Duchi
Yoram Singer

We describe, analyze, and experiment with a framework for empirical loss minimization with regularization. Our algorithmic framework alternates between two phases. On each iteration we first perform an unconstrained gradient descent step. We then cast and solve an instantaneous optimization problem that trades off minimization of a regularization term while keeping close proximity to the result of the first phase. This view yields a simple yet effective algorithm that can be used for batch penalized risk minimization and online learning. Furthermore, the two phase approach enables sparse solutions when used in conjunction with regularization functions that promote sparsity, such as l 1. We derive concrete and very simple algorithms for minimization of loss functions with l 1, l 2, l 2 2, and l ∞ regularization. We also show how to construct efficient algorithms for mixed-norm l 1 / l q regularization. We further extend the algorithms and give efficient implementations for very high-dimensional data with sparsity. We demonstrate the potential of the proposed framework in a series of experiments with synthetic and natural data sets. [abs] [ pdf ][ bib ] &copy JMLR 2009. ( edit, beta )

NeurIPS Conference 2006 Conference Paper

Using Combinatorial Optimization within Max-Product Belief Propagation

Daniel Tarlow
Gal Elidan
Daphne Koller
John Duchi

In general, the problem of computing a maximum a posteriori (MAP) assignment in a Markov random field (MRF) is computationally intractable. However, in certain subclasses of MRF, an optimal or close-to-optimal assignment can be found very efficiently using combinatorial optimization algorithms: certain MRFs with mutual exclusion constraints can be solved using bipartite matching, and MRFs with regular potentials can be solved using minimum cut methods. However, these solutions do not apply to the many MRFs that contain such tractable components as sub-networks, but also other non-complying potentials. In this paper, we present a new method, called C O M P O S E, for exploiting combinatorial optimization for sub-networks within the context of a max-product belief propagation algorithm. C O M P O S E uses combinatorial optimization for computing exact maxmarginals for an entire sub-network; these can then be used for inference in the context of the network as a whole. We describe highly efficient methods for computing max-marginals for subnetworks corresponding both to bipartite matchings and to regular networks. We present results on both synthetic and real networks encoding correspondence problems between images, which involve both matching constraints and pairwise geometric constraints. We compare to a range of current methods, showing that the ability of C O M P O S E to transmit information globally across the network leads to improved convergence, decreased running time, and higher-scoring assignments.