STOC Conference 2025 Conference Paper
Breaking the Barrier of Self-Concordant Barriers: Faster Interior Point Methods for M-Matrices
- Adrian Vladu
Author name cluster
Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.
STOC Conference 2025 Conference Paper
NeurIPS Conference 2025 Conference Paper
In this paper, we study the problem $\min_{x\in R^{d}, Nx=v}\sum\_{i=1}^{n}f((Ax-b)_{i})$ for a quasi-self-concordant function $f: R\to R$, where $A, N$ are $n\times d$ and $m\times d$ matrices, $b, v$ are vectors of length $n$ and $m$ with $n\ge d. $ We show an algorithm based on a trust-region method with an oracle that can be implemented using $\widetilde{O}(d^{1/3})$ linear system solves, improving the $\widetilde{O}(n^{1/3})$ oracle by [Adil-Bullins-Sachdeva, NeurIPS 2021]. Our implementation of the oracle relies on solving the overdetermined $\ell\_{\infty}$-regression problem $\min\_{x\in R^{d}, Nx=v}\|Ax-b\|\_{\infty}$. We provide an algorithm that finds a $(1+\epsilon)$-approximate solution to this problem using $O((d^{1/3}/\epsilon+1/\epsilon^{2})\log(n/\epsilon))$ linear system solves. This algorithm leverages $\ell\_{\infty}$ Lewis weight overestimates and achieves this iteration complexity via a simple lightweight IRLS approach, inspired by the work of [Ene-Vladu, ICML 2019]. Experimentally, we demonstrate that our algorithm significantly improves the runtime of the standard CVX solver.
ICLR Conference 2023 Conference Paper
Deep neural networks (DNNs) often have to be compressed, via pruning and/or quantization, before they can be deployed in practical settings. In this work we propose a new compression-aware minimizer dubbed CrAM that modifies the optimization step in a principled way, in order to produce models whose local loss behavior is stable under compression operations such as pruning. Thus, dense models trained via CrAM should be compressible post-training, in a single step, without significant accuracy loss. Experimental results on standard benchmarks, such as residual networks for ImageNet classification and BERT models for language modelling, show that CrAM produces dense models that can be more accurate than the standard SGD/Adam-based baselines, but which are stable under weight pruning: specifically, we can prune models in one-shot to 70-80% sparsity with almost no accuracy loss, and to 90% with reasonable (∼ 1%) accuracy loss, which is competitive with gradual compression methods. Additionally, CrAM can produce sparse models which perform well for transfer learning, and it also works for semi-structured 2:4 pruning patterns supported by GPU hardware. The code for reproducing the results is available at: https://github.com/IST-DASLab/CrAM .
SODA Conference 2023 Conference Paper
ICML Conference 2023 Conference Paper
Communication-reduction techniques are a popular way to improve scalability in data-parallel training of deep neural networks (DNNs). The recent emergence of large language models such as GPT has created the need for new approaches to exploit data-parallelism. Among these, fully-sharded data parallel (FSDP) training is highly popular, yet it still encounters scalability bottlenecks. One reason is that applying compression techniques to FSDP is challenging: as the vast majority of the communication involves the model’s weights, direct compression alters convergence and leads to accuracy loss. We present QSDP, a variant of FSDP which supports both gradient and weight quantization with theoretical guarantees, is simple to implement and has essentially no overheads. To derive QSDP we prove that a natural modification of SGD achieves convergence even when we only maintain quantized weights, and thus the domain over which we train consists of quantized points and is, therefore, highly non-convex. We validate this approach by training GPT-family models with up to 1. 3 billion parameters on a multi-node cluster. Experiments show that QSDP preserves model accuracy, while completely removing the communication bottlenecks of FSDP, providing end-to-end speedups of up to 2. 2x.
NeurIPS Conference 2021 Conference Paper
The increasing computational requirements of deep neural networks (DNNs) have led to significant interest in obtaining DNN models that are sparse, yet accurate. Recent work has investigated the even harder case of sparse training, where the DNN weights are, for as much as possible, already sparse to reduce computational costs during training. Existing sparse training methods are often empirical and can have lower accuracy relative to the dense baseline. In this paper, we present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of DNNs, demonstrate convergence for a variant of the algorithm, and show that AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets; at high sparsity levels, AC/DC even outperforms existing methods that rely on accurate pre-trained dense models. An important property of AC/DC is that it allows co-training of dense and sparse models, yielding accurate sparse-dense model pairs at the end of the training process. This is useful in practice, where compressed variants may be desirable for deployment in resource-constrained settings without re-doing the entire training flow, and also provides us with insights into the accuracy gap between dense and compressed models.
AAAI Conference 2021 Conference Paper
We provide new adaptive first-order methods for constrained convex optimization. Our main algorithms ADAACSA and ADAAGD+ are accelerated methods, which are universal in the sense that they achieve nearly-optimal convergence rates for both smooth and non-smooth functions, even when they only have access to stochastic gradients. In addition, they do not require any prior knowledge on how the objective function is parametrized, since they automatically adjust their percoordinate learning rate. These can be seen as truly accelerated ADAGRAD methods for constrained optimization. We complement them with a simpler algorithm ADAGRAD+ which enjoys the same features, and achieves the standard non-accelerated convergence rate. We also present a set of new results involving adaptive methods for unconstrained optimization and monotone operators.
ICML Conference 2021 Conference Paper
This paper bridges discrete and continuous optimization approaches for decomposable submodular function minimization, in both the standard and parametric settings. We provide improved running times for this problem by reducing it to a number of calls to a maximum flow oracle. When each function in the decomposition acts on O(1) elements of the ground set V and is polynomially bounded, our running time is up to polylogarithmic factors equal to that of solving maximum flow in a sparse graph with O(|V|) vertices and polynomial integral capacities. We achieve this by providing a simple iterative method which can optimize to high precision any convex function defined on the submodular base polytope, provided we can efficiently minimize it on the base polytope corresponding to the cut function of a certain graph that we construct. We solve this minimization problem by lifting the solutions of a parametric cut problem, which we obtain via a new efficient combinatorial reduction to maximum flow. This reduction is of independent interest and implies some previously unknown bounds for the parametric minimum s, t-cut problem in multiple settings.
FOCS Conference 2021 Conference Paper
We give an $\tilde{O}(m^{3/2-1/762}\log(U+W))$ time algorithm for minimum cost flow with capacities bounded by $U$ and costs bounded by $W$. For sparse graphs with general capacities, this is the first algorithm to improve over the $\tilde{O}(m^{3/2}\log^{O(1)}(U+W))$ running time obtained by an appropriate instantiation of an interior point method [Daitch-Spielman, 2008]. Our approach is extending the framework put forth in [Gao-Liu-Peng, 2021] for computing the maximum flow in graphs with large capacities and, in particular, demonstrates how to reduce the problem of computing an electrical flow with general demands to the same problem on a sublinear-sized set of vertices—even if the demand is supported on the entire graph. Along the way, we develop new machinery to assess the importance of the graph's edges at each phase of the interior point method optimization process. This capability relies on establishing a new connections between the electrical flows arising inside that optimization process and vertex distances in the corresponding effective resistance metric.
AAAI Conference 2021 Conference Paper
We design differentially private algorithms for the bandit convex optimization problem in the projection-free setting. This setting is important whenever the decision set has a complex geometry, and access to it is done efficiently only through a linear optimization oracle, hence Euclidean projections are unavailable (e. g. matroid polytope, submodular base polytope). This is the first differentially-private algorithm for projection-free bandit optimization, and in fact our bound matches the best known non-private projection-free algorithm and the best known private algorithm, even for the weaker setting when projections are available.
FOCS Conference 2020 Conference Paper
We present an m 4/3+o(1) logW-time algorithm for solving the minimum cost flow problem in graphs with unit capacity, where W is the maximum absolute value of any edge weight. For sparse graphs, this improves over the best known running time for this problem and, by well-known reductions, also implies improved running times for the shortest path problem with negative weights, minimum cost bipartite b-matching when ||b||1=O(m), and recovers the running time of the currently fastest algorithm for maximum flow in graphs with unit capacities (Liu-Sidford, 2020). Our algorithm relies on developing an interior point method-based framework which acts on the space of circulations in the underlying graph. From the combinatorial point of view, this framework can be viewed as iteratively improving the cost of a suboptimal solution by pushing flow around circulations. These circulations are derived by computing a regularized version of the standard Newton step, which is partially inspired by previous work on the unit-capacity maximum flow problem (Liu-Sidford, 2019), and subsequently refined based on the very recent progress on this problem (Liu-Sidford, 2020). The resulting step problem can then be computed efficiently using the recent work on lp-norm minimizing flows (Kyng-Peng-Sachdeva-Wang, 2019). We obtain our faster algorithm by combining this new step primitive with a customized preconditioning method, which aims to ensure that the graph on which these circulations are computed has sufficiently large conductance.
ICML Conference 2019 Conference Paper
The iteratively reweighted least squares method (IRLS) is a popular technique used in practice for solving regression problems. Various versions of this method have been proposed, but their theoretical analyses failed to capture the good practical performance. In this paper we propose a simple and natural version of IRLS for solving $\ell_\infty$ and $\ell_1$ regression, which provably converges to a $(1+\epsilon)$-approximate solution in $O(m^{1/3}\log(1/\epsilon)/\epsilon^{2/3} + \log m/\epsilon^2)$ iterations, where $m$ is the number of rows of the input matrix. Interestingly, this running time is independent of the conditioning of the input, and the dominant term of the running time depends sublinearly in $\epsilon^{-1}$, which is atypical for the optimization of non-smooth functions. This improves upon the more complex algorithms of Chin et al. (ITCS ’12), and Christiano et al. (STOC ’11) by a factor of at least $1/\epsilon^2$, and yields a truly efficient natural algorithm for the slime mold dynamics (Straszak-Vishnoi, SODA ’16, ITCS ’16, ITCS ’17).
STOC Conference 2019 Conference Paper
We consider the problem of maximizing the multilinear extension of a submodular function subject a single matroid constraint or multiple packing constraints with a small number of adaptive rounds of evaluation queries. We obtain the first algorithms with low adaptivity for submodular maximization with a matroid constraint. Our algorithms achieve a 1−1/ e −є approximation for monotone functions and a 1/ e −є approximation for non-monotone functions, which nearly matches the best guarantees known in the fully adaptive setting. The number of rounds of adaptivity is O (log 2 n /є 3 ), which is an exponential speedup over the existing algorithms. We obtain the first parallel algorithm for non-monotone submodular maximization subject to packing constraints. Our algorithm achieves a 1/ e −є approximation using O (log( n /є) log(1/є) log( n + m )/ є 2 ) parallel rounds, which is again an exponential speedup in parallel time over the existing algorithms. For monotone functions, we obtain a 1−1/ e −є approximation in O (log( n /є)log m /є 2 ) parallel rounds. The number of parallel rounds of our algorithm matches that of the state of the art algorithm for solving packing LPs with a linear objective (Mahoney et al., 2016). Our results apply more generally to the problem of maximizing a diminishing returns submodular (DR-submodular) function.
STOC Conference 2017 Conference Paper
FOCS Conference 2017 Conference Paper
In this paper 1, we study matrix scaling and balancing, which are fundamental problems in scientific computing, with a long line of work on them that dates back to the 1960s. We provide algorithms for both these problems that, ignoring logarithmic factors involving the dimension of the input matrix and the size of its entries, both run in time Õ(m log κ log 2 (1/ε)) where ε is the amount of error we are willing to tolerate. Here, κ represents the ratio between the largest and the smallest entries of the optimal scalings. This implies that our algorithms run in nearly-linear time whenever κ is quasi-polynomial, which includes, in particular, the case of strictly positive matrices. We complement our results by providing a separate algorithm that uses an interior-point method and runs in time Õ(m 3/2 log(1/ε)). In order to establish these results, we develop a new secondorder optimization framework that enables us to treat both problems in a unified and principled manner. This framework identifies a certain generalization of linear system solving that we can use to efficiently minimize a broad class of functions, which we call second-order robust. We then show that in the context of the specific functions capturing matrix scaling and balancing, we can leverage and generalize the work on Laplacian system solving to make the algorithms obtained via this framework very efficient.
SODA Conference 2017 Conference Paper
ICML Conference 2017 Conference Paper
We present a deterministic nearly-linear time algorithm for approximating any point inside a convex polytope with a sparse convex combination of the polytope’s vertices. Our result provides a constructive proof for the Approximate Carathéodory Problem, which states that any point inside a polytope contained in the $\ell_p$ ball of radius $D$ can be approximated to within $\epsilon$ in $\ell_p$ norm by a convex combination of $O\left(D^2 p/\epsilon^2\right)$ vertices of the polytope for $p \geq 2$. While for the particular case of $p=2$, this can be achieved by the well-known Perceptron algorithm, we follow a more principled approach which generalizes to arbitrary $p\geq 2$; furthermore, this naturally extends to domains with more complicated geometry, as it is the case for providing an approximate Birkhoff-von Neumann decomposition. Secondly, we show that the sparsity bound is tight for $\ell_p$ norms, using an argument based on anti-concentration for the binomial distribution, thus resolving an open question posed by Barman. Experimentally, we verify that our deterministic optimization-based algorithms achieve in practice much better sparsity than previously known sampling-based algorithms. We also show how to apply our techniques to SVM training and rounding fractional points in matroid and flow polytopes.
FOCS Conference 2016 Conference Paper
In this paper, we provide faster algorithms for computing variousfundamental quantities associated with random walks on a directedgraph, including the stationary distribution, personalized PageRankvectors, hitting times, and escape probabilities. In particular, ona directed graph with n vertices and m edges, we show how tocompute each quantity in time Õ(m3/4n + mn2/3), wherethe Õ notation suppresses polylog factors in n, the desired accuracy, and the appropriate condition number (i. e. themixing time or restart probability). Our result improves upon the previous fastest running times for these problems, previous results either invoke a general purpose linearsystem solver on a n × n matrix with m non-zero entries, or depend polynomially on the desired error or natural condition numberassociated with the problem (i. e. the mixing time or restart probability). For sparse graphs, we obtain a running time of Õ(n7/4), breaking the O(n2) barrier of the best running time one couldhope to achieve using fast matrix multiplication. We achieve our result by providing a similar running time improvementfor solving directed Laplacian systems, a natural directedor asymmetric analog of the well studied symmetric or undirected Laplaciansystems. We show how to solve such systems in time Õ(m3/4n + mn2/3), and efficiently reduce a broad range of problems to solving Õ(1) directed Laplacian systems on Eulerian graphs. We hope these resultsand our analysis open the door for further study into directedspectral graph theory.