Author name cluster

Yair Carmon

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

25 papers

2 author rows

TMLR Journal 2025 Journal Article

An Analytical Model for Overparameterized Learning Under Class Imbalance

Eliav Mor
Yair Carmon

We study class-imbalanced linear classification in a high-dimensional Gaussian mixture model. We develop a tight, closed form approximation for the test error of several practical learning methods, including logit adjustment and class dependent temperature. Our approximation allows us to analytically tune and compare these methods, highlighting how and when they overcome the pitfalls of standard cross-entropy minimization. We test our theoretical findings on simulated data and imbalanced CIFAR10, MNIST and FashionMNIST datasets.

PDF Details

NeurIPS Conference 2025 Conference Paper

Convergence of Clipped SGD on Convex $(L_0,L_1)$-Smooth Functions

Ofir Gaash
Kfir Y. Levy
Yair Carmon

We study stochastic gradient descent (SGD) with gradient clipping on convex functions under a generalized smoothness assumption called $(L_0, L_1)$-smoothness. Using gradient clipping, we establish a high probability convergence rate that matches the SGD rate in the $L$ smooth case up to polylogarithmic factors and additive terms. We also propose a variation of adaptive SGD with gradient clipping, which achieves the same guarantee. We perform empirical experiments to examine our theory and algorithmic choices.

PDF Details

NeurIPS Conference 2025 Conference Paper

Filter Like You Test: Data-Driven Data Filtering for CLIP Pretraining

Mikey Shechter
Yair Carmon

We introduce Filter Like You Test (FLYT), an algorithm for curating large-scale vision-language datasets that learns the usefulness of each data point as a pretraining example. FLYT trains a scoring model that learns to weigh each example's features using gradient signals from downstream tasks training sets. Based on FLYT, we implement Mixing-FLYT (M-FLYT), which takes the per-example scores generated by different scoring methods as features, and learns to unify them into a single score. FLYT naturally produces a distribution over the training examples, which we leverage through Soft Cap Sampling (SCS), a strategy for obtaining a filtered pretraining dataset from per-example probabilities that samples examples while preventing over-representation through a repetition penalty. Using these methods, we achieve 40. 1\% ImageNet zero-shot accuracy on the DataComp medium scale filtering benchmark, a 2\% absolute accuracy increase over all previous results and a 5. 5\% increase over results that---like us---use only public resources. Our approach also yields 37. 7\% on the average of 38 DataComp evaluation tasks, outperforming previous public-resource approaches by 0. 4\%.

PDF Details

SODA Conference 2024 Conference Paper

A Whole New Ball Game: A Primal Accelerated Method for Matrix Games and Minimizing the Maximum of Smooth Functions

Yair Carmon
Arun Jambulapati
Yujia Jin
Aaron Sidford

Details

NeurIPS Conference 2024 Conference Paper

DataComp-LM: In search of the next generation of training sets for language models

Jeffrey Li
Alex Fang
Georgios Smyrnis
Maor Ivgi
Matt Jordan
Samir Gadre
Hritik Bansal
Etash Guha

We introduce DataComp for Language Models, a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing atmodel scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline, enables training a 7B parameter language model from scratch to 63% 5-shot accuracy on MMLU with 2T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6 percentage point improvement on MMLU while being trained with half the compute. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation. We release the \dclm benchmark, framework, models, and datasets at https: //www. datacomp. ai/dclm/

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Resolving Discrepancies in Compute-Optimal Scaling of Language Models

Tomer Porian
Mitchell Wortsman
Jenia Jitsev
Ludwig Schmidt
Yair Carmon

Kaplan et al. and Hoffmann et al. developed influential scaling laws for the optimal model size as a function of the compute budget, but these laws yield substantially different predictions. We explain the discrepancy by reproducing the Kaplan scaling law on two datasets (OpenWebText2 and RefinedWeb) and identifying three factors causing the difference: last layer computational cost, warmup duration, and scale-dependent optimizer tuning. With these factors corrected, we obtain excellent agreement with the Hoffmann et al. (i. e. , "Chinchilla") scaling law. Counter to a hypothesis of Hoffmann et al. , we find that careful learning rate decay is not essential for the validity of their scaling law. As a secondary result, we derive scaling laws for the optimal learning rate and batch size, finding that tuning the AdamW $\beta_2$ parameter is essential at lower batch sizes.

PDF Details DOI

NeurIPS Conference 2023 Conference Paper

DataComp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre
Gabriel Ilharco
Alex Fang
Jonathan Hayase
Georgios Smyrnis
Thao Nguyen
Ryan Marten
Mitchell Wortsman

Multimodal datasets are a critical component in recent breakthroughs such as CLIP, Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12. 8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. Our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79. 2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3. 7 percentage points while using the same training procedure and compute. We release \datanet and all accompanying code at www. datacomp. ai.

PDF Details

ICML Conference 2023 Conference Paper

DoG is SGD's Best Friend: A Parameter-Free Dynamic Step Size Schedule

Maor Ivgi
Oliver Hinder
Yair Carmon

We propose a tuning-free dynamic SGD step size formula, which we call Distance over Gradients (DoG). The DoG step sizes depend on simple empirical quantities (distance from the initial point and norms of gradients) and have no “learning rate” parameter. Theoretically, we show that, for stochastic convex optimization, a slight variation of the DoG formula enjoys strong, high-probability parameter-free convergence guarantees and iterate movement bounds. Empirically, we consider a broad range of vision and language transfer learning tasks, and show that DoG’s performance is close to that of SGD with tuned learning rate. We also propose a per-layer variant of DoG that generally outperforms tuned SGD, approaching the performance of tuned Adam. A PyTorch implementation of our algorithms is available at https: //github. com/formll/dog.

Details

ICML Conference 2023 Conference Paper

Gradient Descent Monotonically Decreases the Sharpness of Gradient Flow Solutions in Scalar Networks and Beyond

Itai Kreisler
Mor Shpigel Nacson
Daniel Soudry
Yair Carmon

Recent research shows that when Gradient Descent (GD) is applied to neural networks, the loss almost never decreases monotonically. Instead, the loss oscillates as gradient descent converges to its “Edge of Stability” (EoS). Here, we find a quantity that does decrease monotonically throughout GD training: the sharpness attained by the gradient flow solution (GFS)—the solution that would be obtained if, from now until convergence, we train with an infinitesimal step size. Theoretically, we analyze scalar neural networks with the squared loss, perhaps the simplest setting where the EoS phenomena still occur. In this model, we prove that the GFS sharpness decreases monotonically. Using this result, we characterize settings where GD provably converges to the EoS in scalar networks. Empirically, we show that GD monotonically decreases the GFS sharpness in a squared regression model as well as practical neural network architectures.

Details

ICLR Conference 2023 Conference Paper

Malign Overfitting: Interpolation and Invariance are Fundamentally at Odds

Yoav Wald
Gal Yona
Uri Shalit
Yair Carmon

Learned classifiers should often possess certain invariance properties meant to encourage fairness, robustness, or out-of-distribution generalization. However, multiple recent works empirically demonstrate that common invariance-inducing regularizers are ineffective in the over-parameterized regime, in which classifiers perfectly fit (i.e. interpolate) the training data. This suggests that the phenomenon of ``benign overfitting," in which models generalize well despite interpolating, might not favorably extend to settings in which robustness or fairness are desirable. In this work, we provide a theoretical justification for these observations. We prove that---even in the simplest of settings---any interpolating learning rule (with an arbitrarily small margin) will not satisfy these invariance properties. We then propose and analyze an algorithm that---in the same setting---successfully learns a non-interpolating classifier that is provably invariant. We validate our theoretical observations on simulated data and the Waterbirds dataset.

Details

FOCS Conference 2023 Conference Paper

ReSQueing Parallel and Private Stochastic Convex Optimization

Yair Carmon
Arun Jambulapati
Yujia Jin
Yin Tat Lee
Daogao Liu
Aaron Sidford
Kevin Tian

We introduce a new tool for stochastic convex optimization (SCO): a Reweighted Stochastic Query (ReSQue) estimator for the gradient of a function convolved with a (Gaussian) probability density. Combining ReSQue with recent advances in ball oracle acceleration [CJJ+20], [ACJ+21], we develop algorithms achieving state-of-the-art complexities for SCO in parallel and private settings. For a SCO objective constrained to the unit ball in $\mathbb{R}^{d}$, we obtain the following results (up to polylogarithmic factors). 1)We give a parallel algorithm obtaining optimization error $\epsilon_{\text {opt} }$ with $d^{1 / 3} \epsilon_{\text {opt} }^{-2 / 3}$ gradient oracle query depth and $d^{1 / 3} \epsilon_{\text {opt} }^{-2 / 3}+\epsilon_{\text {opt} }^{-2}$ gradient queries in total, assuming access to a bounded-variance stochastic gradient estimator. For $\epsilon_{\text {opt} } \in\left[d^{-1}, d^{-1 / 4}\right]$, our algorithm matches the state-of-the-art oracle depth of [BJL+19] while maintaining the optimal total work of stochastic gradient descent. 2)Given n samples of Lipschitz loss functions, prior works [BFTT19], [BFGT20], [AFKT21], [KLL21] established that if $n \gt rsim d \epsilon_{\mathrm{dp}}^{-2}, \left(\epsilon_{\mathrm{dp}}, \delta\right)$-differential privacy is attained at no asymptotic cost to the SCO utility. However, these prior works all required a superlinear number of gradient queries. We close this gap for sufficiently large $n \gt rsim d^{2} \epsilon_{{\bf d p}}^{-3}$, by using ReSQue to design an algorithm with near-linear gradient query complexity in this regime.

Details

NeurIPS Conference 2022 Conference Paper

Distributionally Robust Optimization via Ball Oracle Acceleration

Yair Carmon
Danielle Hausler

We develop and analyze algorithms for distributionally robust optimization (DRO) of convex losses. In particular, we consider group-structured and bounded $f$-divergence uncertainty sets. Our approach relies on an accelerated method that queries a ball optimization oracle, i. e. , a subroutine that minimizes the objective within a small ball around the query point. Our main contribution is efficient implementations of this oracle for DRO objectives. For DRO with $N$ non-smooth loss functions, the resulting algorithms find an $\epsilon$-accurate solution with $\widetilde{O}\left(N\epsilon^{-2/3} + \epsilon^{-2}\right)$ first-order oracle queries to individual loss functions. Compared to existing algorithms for this problem, we improve complexity by a factor of up to $\epsilon^{-4/3}$.

PDF Details

ICML Conference 2022 Conference Paper

Model soups: averaging weights of multiple fine-tuned models improves accuracy without increasing inference time

Mitchell Wortsman
Gabriel Ilharco
Samir Yitzhak Gadre
Rebecca Roelofs
Raphael Gontijo Lopes
Ari S. Morcos
Hongseok Namkoong
Ali Farhadi

The conventional recipe for maximizing model accuracy is to (1) train multiple models with various hyperparameters and (2) pick the individual model which performs best on a held-out validation set, discarding the remainder. In this paper, we revisit the second step of this procedure in the context of fine-tuning large pre-trained models, where fine-tuned models often appear to lie in a single low error basin. We show that averaging the weights of multiple models fine-tuned with different hyperparameter configurations often improves accuracy and robustness. Unlike a conventional ensemble, we may average many models without incurring any additional inference or memory costs—we call the results “model soups. ” When fine-tuning large pre-trained models such as CLIP, ALIGN, and a ViT-G pre-trained on JFT, our soup recipe provides significant improvements over the best model in a hyperparameter sweep on ImageNet. The resulting ViT-G model, which attains 90. 94% top-1 accuracy on ImageNet, achieved a new state of the art. Furthermore, we show that the model soup approach extends to multiple image classification and natural language processing tasks, improves out-of-distribution performance, and improves zero-shot performance on new downstream tasks. Finally, we analytically relate the performance similarity of weight-averaging and logit-ensembling to flatness of the loss and confidence of the predictions, and validate this relation empirically. Code is available at https: //github. com/mlfoundations/model-soups.

Details

NeurIPS Conference 2022 Conference Paper

Optimal and Adaptive Monteiro-Svaiter Acceleration

Yair Carmon
Danielle Hausler
Arun Jambulapati
Yujia Jin
Aaron Sidford

We develop a variant of the Monteiro-Svaiter (MS) acceleration framework that removes the need to solve an expensive implicit equation at every iteration. Consequently, for any $p\ge 2$ we improve the complexity of convex optimization with Lipschitz $p$th derivative by a logarithmic factor, matching a lower bound. We also introduce an MS subproblem solver that requires no knowledge of problem parameters, and implement it as either a second- or first-order method by solving linear systems or applying MinRes, respectively. On logistic regression problems our method outperforms previous accelerated second-order methods, but under-performs Newton's method; simply iterating our first-order adaptive subproblem solver is competitive with L-BFGS.

PDF Details

ICML Conference 2022 Conference Paper

RECAPP: Crafting a More Efficient Catalyst for Convex Optimization

Yair Carmon
Arun Jambulapati
Yujia Jin
Aaron Sidford

The accelerated proximal point method (APPA), also known as "Catalyst", is a well-established reduction from convex optimization to approximate proximal point computation (i. e. , regularized minimization). This reduction is conceptually elegant and yields strong convergence rate guarantees. However, these rates feature an extraneous logarithmic term arising from the need to compute each proximal point to high accuracy. In this work, we propose a novel Relaxed Error Criterion for Accelerated Proximal Point (RECAPP) that eliminates the need for high accuracy subproblem solutions. We apply RECAPP to two canonical problems: finite-sum and max-structured minimization. For finite-sum problems, we match the best known complexity, previously obtained by carefully-designed problem-specific algorithms. For minimizing max_y f(x, y) where f is convex in x and strongly-concave in y, we improve on the best known (Catalyst-based) bound by a logarithmic factor.

Details

ICML Conference 2021 Conference Paper

Accuracy on the Line: on the Strong Correlation Between Out-of-Distribution and In-Distribution Generalization

John Miller 0001
Rohan Taori
Aditi Raghunathan
Shiori Sagawa
Pang Wei Koh
Vaishaal Shankar
Percy Liang
Yair Carmon

For machine learning systems to be reliable, we must understand their performance in unseen, out- of-distribution environments. In this paper, we empirically show that out-of-distribution performance is strongly correlated with in-distribution performance for a wide range of models and distribution shifts. Specifically, we demonstrate strong correlations between in-distribution and out-of- distribution performance on variants of CIFAR- 10 & ImageNet, a synthetic pose estimation task derived from YCB objects, FMoW-WILDS satellite imagery classification, and wildlife classification in iWildCam-WILDS. The correlation holds across model architectures, hyperparameters, training set size, and training duration, and is more precise than what is expected from existing domain adaptation theory. To complete the picture, we also investigate cases where the correlation is weaker, for instance some synthetic distribution shifts from CIFAR-10-C and the tissue classification dataset Camelyon17-WILDS. Finally, we provide a candidate theory based on a Gaussian data model that shows how changes in the data covariance arising from distribution shift can affect the observed correlations.

Details

NeurIPS Conference 2021 Conference Paper

Never Go Full Batch (in Stochastic Convex Optimization)

Idan Amir
Yair Carmon
Tomer Koren
Roi Livni

We study the generalization performance of $\text{\emph{full-batch}}$ optimization algorithms for stochastic convex optimization: these are first-order methods that only access the exact gradient of the empirical risk (rather than gradients with respect to individual data points), that include a wide range of algorithms such as gradient descent, mirror descent, and their regularized and/or accelerated variants. We provide a new separation result showing that, while algorithms such as stochastic gradient descent can generalize and optimize the population risk to within $\epsilon$ after $O(1/\epsilon^2)$ iterations, full-batch methods either need at least $\Omega(1/\epsilon^4)$ iterations or exhibit a dimension-dependent sample complexity.

PDF Details

NeurIPS Conference 2021 Conference Paper

Stochastic Bias-Reduced Gradient Methods

Hilal Asi
Yair Carmon
Arun Jambulapati
Yujia Jin
Aaron Sidford

We develop a new primitive for stochastic optimization: a low-bias, low-cost estimator of the minimizer $x_\star$ of any Lipschitz strongly-convex function $f$. In particular, we use a multilevel Monte-Carlo approach due to Blanchet and Glynn to turn any optimal stochastic gradient method into an estimator of $x_\star$ with bias $\delta$, variance $O(\log(1/\delta))$, and an expected sampling cost of $O(\log(1/\delta))$ stochastic gradient evaluations. As an immediate consequence, we obtain cheap and nearly unbiased gradient estimators for the Moreau envelope of any Lipschitz convex function. We demonstrate the potential of our estimator through four applications. First, we develop a method for minimizing the maximum of $N$ functions, improving on recent results and matching a lower bound up to logarithmic factors. Second and third, we recover state-of-the-art rates for projection-efficient and gradient-efficient optimization using simple algorithms with a transparent analysis. Finally, we show that an improved version of our estimator would yield a nearly linear-time, optimal-utility, differentially-private non-smooth stochastic optimization method.

PDF Details

NeurIPS Conference 2020 Conference Paper

Acceleration with a Ball Optimization Oracle

Yair Carmon
Arun Jambulapati
Qijia Jiang
Yujia Jin
Yin Tat Lee
Aaron Sidford
Kevin Tian

Consider an oracle which takes a point x and returns the minimizer of a convex function f in an l 2 ball of radius r around x. It is straightforward to show that roughly r^{-1}\log(1/epsilon) calls to the oracle suffice to find an \epsilon-approximate minimizer of f in an l 2 unit ball. Perhaps surprisingly, this is not optimal: we design an accelerated algorithm which attains an epsilon-approximate minimizer with roughly r^{-2/3} \log(1/epsilon) oracle queries, and give a matching lower bound. Further, we implement ball optimization oracles for functions with a locally stable Hessian using a variant of Newton's method and, in certain cases, stochastic first-order methods. The resulting algorithms apply to a number of problems of practical and theoretical import, improving upon previous results for logistic and l infinity regression and achieving guarantees comparable to the state-of-the-art for l p regression.

PDF Details

FOCS Conference 2020 Conference Paper

Coordinate Methods for Matrix Games

Yair Carmon
Yujia Jin
Aaron Sidford
Kevin Tian

We develop primal-dual coordinate methods for solving bilinear saddle-point problems of the form minx ∈ Xmaxy ∈ Yy T Ax which contain linear programming, classification, and regression as special cases. Our methods push existing fully stochastic sublinear methods and variance-reduced methods towards their limits in terms of per-iteration complexity and sample complexity. We obtain nearly-constant per-iteration complexity by designing efficient data structures leveraging Taylor approximations to the exponential and a binomial heap. We improve sample complexity via low-variance gradient estimators using dynamic sampling distributions that depend on both the iterates and the magnitude of the matrix entries. Our runtime bounds improve upon those of existing primal-dual methods by a factor depending on sparsity measures of the m by n matrix A. For example, when rows and columns have constant l1/l2 norm ratios, we offer improvements by a factor of m+n in the fully stochastic setting and √{m+n} in the variance-reduced setting. We apply our methods to computational geometry problems, i. e. minimum enclosing ball, maximum inscribed ball, and linear regression, and obtain improved complexity bounds. For linear regression with an elementwise nonnegative matrix, our guarantees improve on exact gradient methods by a factor of √{nnz(A)/(m+n)}.

Details

NeurIPS Conference 2020 Conference Paper

Large-Scale Methods for Distributionally Robust Optimization

Daniel Levy
Yair Carmon
John C. Duchi
Aaron Sidford

We propose and analyze algorithms for distributionally robust optimization of convex losses with conditional value at risk (CVaR) and $\chi^2$ divergence uncertainty sets. We prove that our algorithms require a number of gradient evaluations independent of training set size and number of parameters, making them suitable for large-scale applications. For $\chi^2$ uncertainty sets these are the first such guarantees in the literature, and for CVaR our guarantees scale linearly in the uncertainty level rather than quadratically as in previous work. We also provide lower bounds proving the worst-case optimality of our algorithms for CVaR and a penalized version of the $\chi^2$ problem. Our primary technical contributions are novel bounds on the bias of batch robust risk estimation and the variance of a multilevel Monte Carlo gradient estimator due to [Blanchet & Glynn, 2015]. Experiments on MNIST and ImageNet confirm the theoretical scaling of our algorithms, which are 9-36 times more efficient than full-batch methods.

PDF Details

NeurIPS Conference 2019 Conference Paper

Unlabeled Data Improves Adversarial Robustness

Yair Carmon
Aditi Raghunathan
Ludwig Schmidt
John Duchi
Percy Liang

We demonstrate, theoretically and empirically, that adversarial robustness can significantly benefit from semisupervised learning. Theoretically, we revisit the simple Gaussian model of Schmidt et al. that shows a sample complexity gap between standard and robust classification. We prove that unlabeled data bridges this gap: a simple semisupervised learning procedure (self-training) achieves high robust accuracy using the same number of labels required for achieving high standard accuracy. Empirically, we augment CIFAR-10 with 500K unlabeled images sourced from 80 Million Tiny Images and use robust self-training to outperform state-of-the-art robust accuracies by over 5 points in (i) $\ell_\infty$ robustness against several strong attacks via adversarial training and (ii) certified $\ell_2$ and $\ell_\infty$ robustness via randomized smoothing. On SVHN, adding the dataset's own extra training set with the labels removed provides gains of 4 to 10 points, within 1 point of the gain from using the extra labels.

PDF Details

NeurIPS Conference 2019 Conference Paper

Variance Reduction for Matrix Games

Yair Carmon
Yujia Jin
Aaron Sidford
Kevin Tian

We present a randomized primal-dual algorithm that solves the problem min x max y y^T A x to additive error epsilon in time nnz(A) + sqrt{nnz(A) n} / epsilon, for matrix A with larger dimension n and nnz(A) nonzero entries. This improves the best known exact gradient methods by a factor of sqrt{nnz(A) / n} and is faster than fully stochastic gradient methods in the accurate and/or sparse regime epsilon < sqrt{n / nnz(A)$. Our results hold for x, y in the simplex (matrix games, linear programming) and for x in an \ell_2 ball and y in the simplex (perceptron / SVM, minimum enclosing ball). Our algorithm combines the Nemirovski's "conceptual prox-method" and a novel reduced-variance gradient estimator based on "sampling from the difference" between the current iterate and a reference point.

PDF Details

NeurIPS Conference 2018 Conference Paper

Analysis of Krylov Subspace Solutions of Regularized Non-Convex Quadratic Problems

Yair Carmon
John Duchi

We provide convergence rates for Krylov subspace solutions to the trust-region and cubic-regularized (nonconvex) quadratic problems. Such solutions may be efficiently computed by the Lanczos method and have long been used in practice. We prove error bounds of the form $1/t^2$ and $e^{-4t/\sqrt{\kappa}}$, where $\kappa$ is a condition number for the problem, and $t$ is the Krylov subspace order (number of Lanczos iterations). We also provide lower bounds showing that our analysis is sharp.

PDF Details

ICML Conference 2017 Conference Paper

"Convex Until Proven Guilty": Dimension-Free Acceleration of Gradient Descent on Non-Convex Functions

Yair Carmon
John C. Duchi
Oliver Hinder
Aaron Sidford

We develop and analyze a variant of Nesterov’s accelerated gradient descent (AGD) for minimization of smooth non-convex functions. We prove that one of two cases occurs: either our AGD variant converges quickly, as if the function was convex, or we produce a certificate that the function is “guilty” of being non-convex. This non-convexity certificate allows us to exploit negative curvature and obtain deterministic, dimension-free acceleration of convergence for non-convex functions. For a function $f$ with Lipschitz continuous gradient and Hessian, we compute a point $x$ with $\|\nabla f(x)\| \le \epsilon$ in $O(\epsilon^{-7/4} \log(1/ \epsilon) )$ gradient and function evaluations. Assuming additionally that the third derivative is Lipschitz, we require only $O(\epsilon^{-5/3} \log(1/ \epsilon) )$ evaluations.

Details