Author name cluster

Dan Garber

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers

2 author rows

ICML Conference 2024 Conference Paper

Projection-Free Online Convex Optimization with Time-Varying Constraints

Dan Garber
Ben Kretzu

We consider the setting of online convex optimization with adversarial time-varying constraints in which actions must be feasible w.r.t. a fixed constraint set, and are also required on average to approximately satisfy additional time-varying constraints. Motivated by scenarios in which the fixed feasible set (hard constraint) is difficult to project on, we consider projection-free algorithms that access this set only through a linear optimization oracle (LOO). We present an algorithm that, on a sequence of length $T$ and using overall $T$ calls to the LOO, guarantees $\tilde{O}(T^{3/4})$ regret w.r.t. the losses and $O(T^{7/8})$ constraint violation (ignoring all quantities except for $T$). In particular, these bounds hold w.r.t. any interval of the sequence. This algorithm, however, also requires access to an oracle for minimizing a strongly convex nonsmooth function over a Euclidean ball. We present a more efficient algorithm that does not require the latter optimization oracle but only first-order access to the time-varying constraints, and achieves similar bounds w.r.t. the entire sequence. We extend the latter to the setting of bandit feedback and obtain similar bounds (as a function of $T$) in expectation.

Details

NeurIPS Conference 2022 Conference Paper

Frank-Wolfe-based Algorithms for Approximating Tyler's M-estimator

Lior Danon
Dan Garber

Tyler's M-estimator is a well-known procedure for robust and heavy-tailed covariance estimation. Tyler himself suggested an iterative fixed-point algorithm for computing his estimator; however, it requires super-linear (in the size of the data) runtime per iteration, which may be prohibitive at large scale. In this work we propose, to the best of our knowledge, the first Frank-Wolfe-based algorithms for computing Tyler's estimator. One variant uses standard Frank-Wolfe steps, the second also considers \textit{away-steps} (AFW), and the third is a \textit{geodesic} version of AFW (GAFW). AFW provably requires, up to a log factor, only linear time per iteration, while GAFW runs in linear time (up to a log factor) in a large $n$ (number of data-points) regime. All three variants are shown to provably converge to the optimal solution with sublinear rate, under standard assumptions, despite the fact that the underlying optimization problem is neither convex nor smooth. Under an additional fairly mild assumption, which holds with probability 1 when the (normalized) data-points are i.i.d. samples from a continuous distribution supported on the entire unit sphere, AFW and GAFW are proved to converge with linear rates. Importantly, all three variants are parameter-free and use adaptive step-sizes.

PDF Details

NeurIPS Conference 2022 Conference Paper

Local Linear Convergence of Gradient Methods for Subspace Optimization via Strict Complementarity

Ron Fisher
Dan Garber

We consider optimization problems in which the goal is to find a $k$-dimensional subspace of $\mathbb{R}^n$, $k \ll n$, which minimizes a convex and smooth loss. Such problems generalize the fundamental task of principal component analysis (PCA) to include robust and sparse counterparts, and logistic PCA for binary data, among others. This problem could be approached either via nonconvex gradient methods with highly efficient iterations, for which arguing about fast convergence to a global minimizer is difficult, or via a convex relaxation, for which arguing about convergence to a global minimizer is straightforward but the corresponding methods are often inefficient. In this work we bridge these two approaches under a strict complementarity assumption, which in particular implies that the optimal solution to the convex relaxation is unique and is also the optimal solution to the original nonconvex problem. Our main result is a proof that a natural nonconvex gradient method which is \textit{SVD-free} and requires only a single QR-factorization of an $n\times k$ matrix per iteration, converges locally with a linear rate. We also establish linear convergence results for the nonconvex projected gradient method, and the Frank-Wolfe method when applied to the convex relaxation.

PDF Details

NeurIPS Conference 2021 Conference Paper

Low-Rank Extragradient Method for Nonsmooth and Low-Rank Matrix Optimization Problems

Atara Kaplan
Dan Garber

Low-rank and nonsmooth matrix optimization problems capture many fundamental tasks in statistics and machine learning. While significant progress has been made in recent years in developing efficient methods for \textit{smooth} low-rank optimization problems that avoid maintaining high-rank matrices and computing expensive high-rank SVDs, advances for nonsmooth problems have been slow-paced. In this paper we consider standard convex relaxations for such problems. Mainly, we prove that under a natural \textit{generalized strict complementarity} condition and under the relatively mild assumption that the nonsmooth objective can be written as a maximum of smooth functions, the \textit{extragradient method}, when initialized with a "warm-start" point, converges to an optimal solution with rate $O(1/t)$ while requiring only two \textit{low-rank} SVDs per iteration. We give a precise trade-off between the rank of the SVDs required and the radius of the ball in which we need to initialize the method. We support our theoretical results with empirical experiments on several nonsmooth low-rank matrix recovery tasks, demonstrating that using simple initializations, the extragradient method produces exactly the same iterates when full-rank SVDs are replaced with SVDs of rank that matches the rank of the (low-rank) ground-truth matrix to be recovered.

PDF Details

ICML Conference 2020 Conference Paper

Online Convex Optimization in the Random Order Model

Dan Garber
Gal Korcia
Kfir Yehuda Levy

Online Convex Optimization (OCO) is a powerful framework for sequential prediction, portraying the natural uncertainty inherent in data-streams as though the data were generated by an almost omniscient adversary. However, this view, which is often too pessimistic for real-world data, comes with a price. The complexity of solving many important online tasks in this adversarial framework becomes much worse than that of their offline and even stochastic counterparts. In this work we consider a natural random-order version of the OCO model, in which the adversary can choose the set of loss functions, but does not get to choose the order in which they are supplied to the learner; instead, they are observed in uniformly random order. Focusing on two important families of online tasks, one in which the cumulative loss function is strongly convex (though individual loss functions may not even be convex), and the other being online $k$-PCA, we show that under standard well-conditioned-data assumptions, standard online gradient descent (OGD) methods become much more efficient in the random-order model. In particular, for the first group of tasks OGD guarantees poly-logarithmic regret. In the case of online $k$-PCA, OGD guarantees sublinear regret using only a rank-$k$ SVD on each iteration and memory linear in the size of the solution.

Details

NeurIPS Conference 2020 Conference Paper

Revisiting Frank-Wolfe for Polytopes: Strict Complementarity and Sparsity

Dan Garber

In recent years it was proved that simple modifications of the classical Frank-Wolfe algorithm (aka conditional gradient algorithm) for smooth convex minimization over convex and compact polytopes, converge with linear rate, assuming the objective function has the quadratic growth property. However, the rate of these methods depends explicitly on the dimension of the problem, which cannot explain their empirical success for large scale problems. In this paper we first demonstrate that already for very simple problems and even when the optimal solution lies on a low-dimensional face of the polytope, such dependence on the dimension cannot be avoided in the worst case. We then revisit the addition of a strict complementarity assumption already considered in Wolfe's classical book (Wolfe, 1970), and prove that under this condition, the Frank-Wolfe method with away-steps and line-search converges linearly with rate that depends explicitly only on the dimension of the optimal face, hence providing a significant improvement in case the optimal solution is sparse. We motivate this strict complementarity condition by proving that it implies sparsity-robustness of optimal solutions to noise.

PDF Details

JMLR Journal 2019 Journal Article

Stochastic Canonical Correlation Analysis

Chao Gao
Dan Garber
Nathan Srebro
Jialei Wang
Weiran Wang

We study the sample complexity of canonical correlation analysis (CCA), i.e., the number of samples needed to estimate the population canonical correlation and directions up to arbitrarily small error. With mild assumptions on the data distribution, we show that in order to achieve $\epsilon$-suboptimality in a properly defined measure of alignment between the estimated canonical directions and the population solution, we can solve the empirical objective exactly with $N(\epsilon, \Delta, \gamma)$ samples, where $\Delta$ is the singular value gap of the whitened cross-covariance matrix and $1/\gamma$ is an upper bound of the condition number of auto-covariance matrices. Moreover, we can achieve the same learning accuracy by drawing the same level of samples and solving the empirical objective approximately with a stochastic optimization algorithm; this algorithm is based on the shift-and-invert power iterations and only needs to process the dataset for $\mathcal{O} \left(\log \frac{1}{\epsilon} \right)$ passes. Finally, we show that, given an estimate of the canonical correlation, the streaming version of the shift-and-invert power iterations achieves the same learning accuracy with the same level of sample complexity, by processing the data only once.

PDF Details

ICML Conference 2017 Conference Paper

Communication-efficient Algorithms for Distributed Stochastic Principal Component Analysis

Dan Garber
Ohad Shamir
Nathan Srebro

We study the fundamental problem of Principal Component Analysis in a statistical distributed setting in which each machine out of $m$ stores a sample of $n$ points sampled i.i.d. from a single unknown distribution. We study algorithms for estimating the leading principal component of the population covariance matrix that are both communication-efficient and achieve estimation error of the order of the centralized ERM solution that uses all $mn$ samples. On the negative side, we show that in contrast to results obtained for distributed estimation under convexity assumptions, for the PCA objective, simply averaging the local ERM solutions cannot guarantee error that is consistent with the centralized ERM. We show that this unfortunate phenomenon can be remedied by performing a simple correction step which correlates between the individual solutions, and provides an estimator that is consistent with the centralized ERM for sufficiently large $n$. We also introduce an iterative distributed algorithm that is applicable in any regime of $n$, which is based on distributed matrix-vector products. The algorithm gives significant acceleration in terms of communication rounds over previous distributed algorithms, in a wide regime of parameters.
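The failure of plain averaging can be seen in a toy example. An eigenvector is only defined up to sign, so local solutions pointing in nearly opposite directions can cancel when averaged. The sketch below contrasts naive averaging with one possible correction, aligning every local vector's sign to a reference machine before averaging. The abstract does not spell out the correction step, so this sign alignment is an assumed interpretation, and the function names and the two local estimates are hypothetical.

```python
import math

def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(vi * vi for vi in v))
    return [vi / norm for vi in v]

def naive_average(local_vectors):
    """Average the local unit vectors coordinate-wise, then renormalize."""
    d = len(local_vectors[0])
    m = len(local_vectors)
    return normalize([sum(v[i] for v in local_vectors) / m for i in range(d)])

def sign_fixed_average(local_vectors):
    """Flip each local vector to correlate positively with machine 0, then average."""
    reference = local_vectors[0]
    aligned = []
    for v in local_vectors:
        inner = sum(vi * ri for vi, ri in zip(v, reference))
        sign = 1.0 if inner >= 0 else -1.0
        aligned.append([sign * vi for vi in v])
    return naive_average(aligned)

# Two hypothetical local estimates of the direction (1, 1)/sqrt(2);
# the second machine happened to return the opposite sign.
local_estimates = [[0.8, 0.6], [-0.6, -0.8]]

naive = naive_average(local_estimates)          # points along (1, -1): wrong direction
corrected = sign_fixed_average(local_estimates)  # points along (1, 1)
```

The alignment needs only one extra vector broadcast from the reference machine, which keeps the communication cost low in this sketch.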

Details

NeurIPS Conference 2017 Conference Paper

Efficient Online Linear Optimization with Approximation Algorithms

Dan Garber

We revisit the problem of Online Linear Optimization in case the set of feasible actions is accessible through an approximated linear optimization oracle with a factor $\alpha$ multiplicative approximation guarantee. This setting is in particular interesting since it captures natural online extensions of well-studied offline linear optimization problems which are NP-hard, yet admit efficient approximation algorithms. The goal here is to minimize the $\alpha$-regret, which is the natural extension of the standard regret in online learning to this setting. We present new algorithms with significantly improved oracle complexity for both the full information and bandit variants of the problem. Mainly, for both variants, we present $\alpha$-regret bounds of $O(T^{-1/3})$, where $T$ is the number of prediction rounds, using only $O(\log(T))$ calls to the approximation oracle per iteration, on average. These are the first results to obtain both average oracle complexity of $O(\log(T))$ (or even poly-logarithmic in $T$) and $\alpha$-regret bound $O(T^{-c})$ for a positive constant $c$, for both variants.

PDF Details

NeurIPS Conference 2016 Conference Paper

Efficient Globally Convergent Stochastic Optimization for Canonical Correlation Analysis

Weiran Wang
Jialei Wang
Dan Garber
Nati Srebro

We study the stochastic optimization of canonical correlation analysis (CCA), whose objective is nonconvex and does not decouple over training samples. Although several stochastic gradient based optimization algorithms have been recently proposed to solve this problem, no global convergence guarantee was provided by any of them. Inspired by the alternating least squares/power iterations formulation of CCA, and the shift-and-invert preconditioning method for PCA, we propose two globally convergent meta-algorithms for CCA, both of which transform the original problem into sequences of least squares problems that need only be solved approximately. We instantiate the meta-algorithms with state-of-the-art SGD methods and obtain time complexities that significantly improve upon that of previous work. Experimental results demonstrate their superior performance.

PDF Details

ICML Conference 2016 Conference Paper

Faster Eigenvector Computation via Shift-and-Invert Preconditioning

Dan Garber
Elad Hazan
Chi Jin
Sham M. Kakade
Cameron Musco
Praneeth Netrapalli
Aaron Sidford

We give faster algorithms and improved sample complexities for the fundamental problem of estimating the top eigenvector. Given an explicit matrix $A \in \mathbb{R}^{n \times d}$, we show how to compute an $\epsilon$-approximate top eigenvector of $A^TA$ in time $\tilde O\left( \left[\text{nnz}(A) + \frac{d \text{sr}(A)}{\text{gap}^2} \right] \cdot \log 1/\epsilon\right)$. Here $\text{nnz}(A)$ is the number of nonzeros in $A$, $\text{sr}(A)$ is the stable rank, and $\text{gap}$ is the relative eigengap. We also consider an online setting in which, given a stream of i.i.d. samples from a distribution $\mathcal{D}$ with covariance matrix $\Sigma$ and a vector $x_0$ which is an $O(\text{gap})$ approximate top eigenvector for $\Sigma$, we show how to refine $x_0$ to an $\epsilon$ approximation using $O \left( \frac{\text{var}(\mathcal{D})}{\text{gap} \cdot \epsilon}\right)$ samples from $\mathcal{D}$. Here $\text{var}(\mathcal{D})$ is a natural notion of variance. Combining our algorithm with previous work to initialize $x_0$, we obtain improved sample complexities and runtimes under a variety of assumptions on $\mathcal{D}$. We achieve our results via a robust analysis of the classic shift-and-invert preconditioning method. This technique lets us reduce eigenvector computation to approximately solving a series of linear systems with fast stochastic gradient methods.
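The reduction behind shift-and-invert can be sketched briefly. It runs power iteration on $(\lambda I - M)^{-1}$ for a shift $\lambda$ slightly above the top eigenvalue of $M$, so each step amounts to solving a linear system in the positive definite matrix $\lambda I - M$. The sketch below is a toy illustration, not the paper's algorithm. It solves the systems with plain conjugate gradient rather than stochastic gradient methods, and it assumes a suitable shift is already known. The function names, the $2 \times 2$ matrix, and the shift value are illustrative assumptions.

```python
import math

def matvec(M, x):
    """Dense matrix-vector product."""
    return [sum(mij * xj for mij, xj in zip(row, x)) for row in M]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def conjugate_gradient(B, rhs, num_iters=50, tol=1e-24):
    """Approximately solve B y = rhs for a symmetric positive definite B."""
    y = [0.0] * len(rhs)
    r = list(rhs)   # residual rhs - B y (y starts at zero)
    p = list(r)     # current search direction
    rs = dot(r, r)
    for _ in range(num_iters):
        if rs < tol:
            break
        Bp = matvec(B, p)
        alpha = rs / dot(p, Bp)
        y = [yi + alpha * pi for yi, pi in zip(y, p)]
        r = [ri - alpha * bi for ri, bi in zip(r, Bp)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return y

def shift_and_invert(M, shift, x0, num_iters=20):
    """Power iteration on (shift * I - M)^{-1}; assumes shift > lambda_max(M)."""
    n = len(M)
    B = [[(shift if i == j else 0.0) - M[i][j] for j in range(n)] for i in range(n)]
    x = list(x0)
    for _ in range(num_iters):
        y = conjugate_gradient(B, x)   # one linear-system solve per iteration
        norm = math.sqrt(dot(y, y))
        x = [yi / norm for yi in y]
    return x

# Toy symmetric matrix with eigenvalues 3 and 1; top eigenvector is (1, 1)/sqrt(2).
M = [[2.0, 1.0], [1.0, 2.0]]
top = shift_and_invert(M, shift=3.1, x0=[1.0, 0.0])
```

Moving the shift closer to the top eigenvalue widens the gap between the two largest eigenvalues of the inverted matrix, so fewer outer iterations are needed. The price is that each linear system becomes worse conditioned.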

Details

NeurIPS Conference 2016 Conference Paper

Faster Projection-free Convex Optimization over the Spectrahedron

Dan Garber

Minimizing a convex function over the spectrahedron, i.e., the set of all $d\times d$ positive semidefinite matrices with unit trace, is an important optimization task with many applications in optimization, machine learning, and signal processing. It is also notoriously difficult to solve at large scale since standard techniques require computing expensive matrix decompositions. An alternative is the conditional gradient method (aka Frank-Wolfe algorithm) that regained much interest in recent years, mostly due to its application to this specific setting. The key benefit of the CG method is that it avoids expensive matrix decompositions altogether, and simply requires a single eigenvector computation per iteration, which is much more efficient. On the downside, the CG method, in general, converges with an inferior rate. The error for minimizing a $\beta$-smooth function after $t$ iterations scales like $\beta/t$. This rate does not improve even if the function is also strongly convex. In this work we present a modification of the CG method tailored for the spectrahedron. The per-iteration complexity of the method is essentially identical to that of the standard CG method: only a single eigenvector computation is required. For minimizing an $\alpha$-strongly convex and $\beta$-smooth function, the \textit{expected} error of the method after $t$ iterations is: $O\left(\min\left\{\frac{\beta}{t}, \left(\frac{\beta\sqrt{\mathrm{rank}(\mathbf{X}^*)}}{\alpha^{1/4}t}\right)^{4/3}, \left(\frac{\beta}{\sqrt{\alpha}\,\lambda_{\min}(\mathbf{X}^*)\,t}\right)^{2}\right\}\right)$. Beyond the significant improvement in convergence rate, it also follows that when the optimum is low-rank, our method provides better accuracy-rank tradeoff than the standard CG method. To the best of our knowledge, this is the first result that attains provably faster convergence rates for a CG variant for optimization over the spectrahedron. We also present encouraging preliminary empirical results.

PDF Details

NeurIPS Conference 2016 Conference Paper

Linear-Memory and Decomposition-Invariant Linearly Convergent Conditional Gradient Algorithm for Structured Polytopes

Dan Garber
Ofer Meshi

Recently, several works have shown that natural modifications of the classical conditional gradient method (aka Frank-Wolfe algorithm) for constrained convex optimization, provably converge with a linear rate when the feasible set is a polytope, and the objective is smooth and strongly-convex. However, all of these results suffer from two significant shortcomings: i) large memory requirement due to the need to store an explicit convex decomposition of the current iterate, and as a consequence, large running-time overhead per iteration; ii) the worst case convergence rate depends unfavorably on the dimension. In this work we present a new conditional gradient variant and a corresponding analysis that improves on both of the above shortcomings. In particular, both memory and computation overheads are only linear in the dimension, and in addition, in case the optimal solution is sparse, the new convergence rate replaces a factor which is at least linear in the dimension in previous works, with a linear dependence on the number of non-zeros in the optimal solution. At the heart of our method, and corresponding analysis, is a novel way to compute decomposition-invariant away-steps. While our theoretical guarantees do not apply to any polytope, they apply to several important structured polytopes that capture central concepts such as paths in graphs, perfect matchings in bipartite graphs, marginal distributions that arise in structured prediction tasks, and more. Our theoretical findings are complemented by empirical evidence that shows that our method delivers state-of-the-art performance.

PDF Details

ICML Conference 2015 Conference Paper

Faster Rates for the Frank-Wolfe Method over Strongly-Convex Sets

Dan Garber
Elad Hazan

The Frank-Wolfe method (a.k.a. conditional gradient algorithm) for smooth optimization has regained much interest in recent years in the context of large scale optimization and machine learning. A key advantage of the method is that it avoids projections, the computational bottleneck in many applications, replacing them with a linear optimization step. Despite this advantage, the known convergence rates of the FW method fall behind standard first order methods for most settings of interest. It is an active line of research to derive faster linear optimization-based algorithms for various settings of convex optimization. In this paper we consider the special case of optimization over strongly convex sets, for which we prove that the vanilla FW method converges at a rate of $\frac{1}{t^2}$. This gives a quadratic improvement in convergence rate compared to the general case, in which convergence is of the order $\frac{1}{t}$, and known to be tight. We show that various balls induced by $\ell_p$ norms, Schatten norms and group norms are strongly convex on one hand and on the other hand, linear optimization over these sets is straightforward and admits a closed-form solution. We further show how several previous fast-rate results for the FW method follow easily from our analysis.
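As an illustration of the projection-free template, the sketch below runs vanilla Frank-Wolfe over the Euclidean unit ball, one of the strongly convex sets for which the linear optimization step has a closed form. It uses the textbook step size $2/(t+2)$, which is an assumption here; the paper's own step-size rule may differ. The quadratic objective, the point `b`, and the function names are likewise illustrative.

```python
import math

def lmo_l2_ball(grad, radius=1.0):
    """Closed-form linear minimization: argmin over ||v||_2 <= radius of <grad, v>."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # any feasible point is optimal for a zero gradient
    return [-radius * g / norm for g in grad]

def frank_wolfe(grad_fn, x0, lmo, num_iters):
    """Vanilla Frank-Wolfe: no projections, one linear minimization per iteration."""
    x = list(x0)
    for t in range(num_iters):
        v = lmo(grad_fn(x))          # extreme point minimizing the linearized objective
        eta = 2.0 / (t + 2)          # textbook step size (assumed, not from the paper)
        x = [(1 - eta) * xi + eta * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical objective: f(x) = 0.5 * ||x - b||^2 with b outside the unit ball.
b = [2.0, 1.0]

def objective(x):
    return 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))

def gradient(x):
    return [xi - bi for xi, bi in zip(x, b)]

x_fw = frank_wolfe(gradient, [0.0, 0.0], lmo_l2_ball, num_iters=200)

# The constrained minimizer is b / ||b||, giving f* = 0.5 * (||b|| - 1)^2.
b_norm = math.sqrt(sum(bi * bi for bi in b))
optimal_value = 0.5 * (b_norm - 1.0) ** 2
gap = objective(x_fw) - optimal_value
```

Every iterate is a convex combination of points in the ball, so feasibility holds without any projection. The only set-specific code is `lmo_l2_ball`.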

Details

ICML Conference 2015 Conference Paper

Online Learning of Eigenvectors

Dan Garber
Elad Hazan
Tengyu Ma

Computing the leading eigenvector of a symmetric real matrix is a fundamental primitive of numerical linear algebra with numerous applications. We consider a natural online extension of the leading eigenvector problem: a sequence of matrices is presented and the goal is to predict for each matrix a unit vector, with the overall goal of competing with the leading eigenvector of the cumulative matrix. Existing regret-minimization algorithms for this problem either require computing an \textit{eigen decomposition} every iteration, or suffer from a large dependency of the regret bound on the dimension. In both cases the algorithms are not practical for large scale applications. In this paper we present new algorithms that avoid both issues. On one hand they do not require any expensive matrix decompositions and on the other, they guarantee regret rates with at most a mild dependence on the dimension. In contrast to previous algorithms, our algorithms also admit implementations that can leverage sparsity in the data to further reduce computation. We extend our results to also handle non-symmetric matrices.

Details

SODA Conference 2015 Conference Paper

Online Principal Components Analysis

Christos Boutsidis
Dan Garber
Zohar Shay Karnin
Edo Liberty

Details

FOCS Conference 2013 Conference Paper

Playing Non-linear Games with Linear Oracles

Dan Garber
Elad Hazan

Linear optimization is often algorithmically simpler than non-linear convex optimization. Linear optimization over matroid polytopes, matching polytopes and path polytopes are examples of problems for which we have efficient combinatorial algorithms, but whose non-linear convex counterpart is harder and admits significantly less efficient algorithms. This motivates the computational model of online decision making and optimization using a linear optimization oracle. In this computational model we give the first efficient decision making algorithm with optimal regret guarantees, answering an open question of Kalai and Vempala, Hazan and Kale, in case the decision set is a polytope. We also give an extension of the algorithm for the partial information setting, i.e., the "bandit" model. Our method is based on a novel variant of the conditional gradient method, or Frank-Wolfe algorithm, that reduces the task of minimizing a smooth convex function over a domain to that of minimizing a linear objective. Whereas previous variants of this method give rise to approximation algorithms, we give such an algorithm that converges exponentially faster and thus runs in polynomial-time for a large class of convex optimization problems over polyhedral sets, a result of independent interest.

Details

NeurIPS Conference 2011 Conference Paper

Approximating Semidefinite Programs in Sublinear Time

Dan Garber
Elad Hazan

In recent years semidefinite optimization has become a tool of major importance in various optimization and machine learning problems. In many of these problems the amount of data in practice is so large that there is a constant need for faster algorithms. In this work we present the first sublinear time approximation algorithm for semidefinite programs, which we believe may be useful for such problems in which the size of data may cause even linear time algorithms to have prohibitive running times in practice. We present the algorithm and its analysis along with some theoretical lower bounds and an improved algorithm for the special problem of supervised learning of a distance metric.

PDF Details