Author name cluster

Stephen J. Wright 0001

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers

1 author row

ICML Conference 2024 Conference Paper

Convex and Bilevel Optimization for Neural-Symbolic Inference and Learning

Charles Dickens
Changyu Gao
Connor Pryor
Stephen J. Wright 0001
Lise Getoor

We leverage convex and bilevel optimization techniques to develop a general gradient-based parameter learning framework for neural-symbolic (NeSy) systems. We demonstrate our framework with NeuPSL, a state-of-the-art NeSy architecture. To achieve this, we propose a smooth primal and dual formulation of NeuPSL inference and show learning gradients are functions of the optimal dual variables. Additionally, we develop a dual block coordinate descent algorithm for the new formulation that naturally exploits warm-starts. This leads to over $100 \times$ learning runtime improvements over the current best NeuPSL inference method. Finally, we provide extensive empirical evaluations across $8$ datasets covering a range of tasks and demonstrate our learning framework achieves up to a $16$% point prediction performance improvement over alternative learning methods.

Details

ICML Conference 2024 Conference Paper

How to Make the Gradients Small Privately: Improved Rates for Differentially Private Non-Convex Optimization

Andrew Lowy
Jonathan R. Ullman
Stephen J. Wright 0001

We provide a simple and flexible framework for designing differentially private algorithms to find approximate stationary points of non-convex loss functions. Our framework is based on using a private approximate risk minimizer to "warm start" another private algorithm for finding stationary points. We use this framework to obtain improved, and sometimes optimal, rates for several classes of non-convex loss functions. First, we obtain improved rates for finding stationary points of smooth non-convex empirical loss functions. Second, we specialize to quasar-convex functions, which generalize star-convex functions and arise in learning dynamical systems and training some neural nets. We achieve the optimal rate for this class. Third, we give an optimal algorithm for finding stationary points of functions satisfying the Kurdyka-Lojasiewicz (KL) condition. For example, over-parameterized neural networks often satisfy this condition. Fourth, we provide new state-of-the-art rates for stationary points of non-convex population loss functions. Fifth, we obtain improved rates for non-convex generalized linear models. A modification of our algorithm achieves nearly the same rates for second-order stationary points of functions with Lipschitz Hessian, improving over the previous state-of-the-art for each of the above problems.

Details

ICML Conference 2024 Conference Paper

Private Heterogeneous Federated Learning Without a Trusted Server Revisited: Error-Optimal and Communication-Efficient Algorithms for Convex Losses

Changyu Gao
Andrew Lowy
Xingyu Zhou 0001
Stephen J. Wright 0001

We revisit the problem of federated learning (FL) with private data from people who do not trust the server or other silos/clients. In this context, every silo (e. g. hospital) has data from several people (e. g. patients) and needs to protect the privacy of each person’s data (e. g. health records), even if the server and/or other silos try to uncover this data. Inter-Silo Record-Level Differential Privacy (ISRL-DP) prevents each silo’s data from being leaked, by requiring that silo $i$’s communications satisfy item-level differential privacy. Prior work (Lowy & Razaviyayn, 2023a) characterized the optimal excess risk bounds for ISRL-DP algorithms with homogeneous (i. i. d.) silo data and convex loss functions. However, two important questions were left open: 1) Can the same excess risk bounds be achieved with heterogeneous (non-i. i. d.) silo data? 2) Can the optimal risk bounds be achieved with fewer communication rounds? In this paper, we give positive answers to both questions. We provide novel ISRL-DP FL algorithms that achieve the optimal excess risk bounds in the presence of heterogeneous silo data. Moreover, our algorithms are more communication-efficient than the prior state-of-the-art. For smooth loss functions, our algorithm achieves the optimal excess risk bound and has communication complexity that matches the non-private lower bound. Additionally, our algorithms are more computationally efficient than the previous state-of-the-art.

Details

ICML Conference 2024 Conference Paper

Revisiting Inexact Fixed-Point Iterations for Min-Max Problems: Stochasticity and Structured Nonconvexity

Ahmet Alacaoglu
Donghwan Kim
Stephen J. Wright 0001

We focus on constrained, $L$-smooth, potentially stochastic and nonconvex-nonconcave min-max problems either satisfying $\rho$-cohypomonotonicity or admitting a solution to the $\rho$-weakly Minty Variational Inequality (MVI), where larger values of the parameter $\rho>0$ correspond to a greater degree of nonconvexity. These problem classes include examples in two player reinforcement learning, interaction dominant min-max problems, and certain synthetic test problems on which classical min-max algorithms fail. It has been conjectured that first-order methods can tolerate a value of $\rho$ no larger than $\frac{1}{L}$, but existing results in the literature have stagnated at the tighter requirement $\rho conic nonexpansiveness property of operators. Second, we provide a refined analysis for inexact Halpern iteration that relaxes the required inexactness level to improve some state-of-the-art complexity results even for constrained stochastic convex-concave min-max problems. Third, we analyze a stochastic inexact Krasnosel’skii-Mann iteration with a multilevel Monte Carlo estimator when the assumptions only hold with respect to a solution.

Details

ICML Conference 2023 Conference Paper

Cyclic Block Coordinate Descent With Variance Reduction for Composite Nonconvex Optimization

Xufeng Cai
Chaobing Song
Stephen J. Wright 0001
Jelena Diakonikolas

Nonconvex optimization is central in solving many machine learning problems, in which block-wise structure is commonly encountered. In this work, we propose cyclic block coordinate methods for nonconvex optimization problems with non-asymptotic gradient norm guarantees. Our convergence analysis is based on a gradient Lipschitz condition with respect to a Mahalanobis norm, inspired by a recent progress on cyclic block coordinate methods. In deterministic settings, our convergence guarantee matches the guarantee of (full-gradient) gradient descent, but with the gradient Lipschitz constant being defined w. r. t. a Mahalanobis norm. In stochastic settings, we use recursive variance reduction to decrease the per-iteration cost and match the arithmetic operation complexity of current optimal stochastic full-gradient methods, with a unified analysis for both finite-sum and infinite-sum cases. We prove a faster linear convergence result when a Polyak-Łojasiewicz (PŁ) condition holds. To our knowledge, this work is the first to provide non-asymptotic convergence guarantees — variance-reduced or not — for a cyclic block coordinate method in general composite (smooth + nonsmooth) nonconvex settings. Our experimental results demonstrate the efficacy of the proposed cyclic scheme in training deep neural nets.

Details

ICML Conference 2021 Conference Paper

Variance Reduction via Primal-Dual Accelerated Dual Averaging for Nonsmooth Convex Finite-Sums

Chaobing Song
Stephen J. Wright 0001
Jelena Diakonikolas

Structured nonsmooth convex finite-sum optimization appears in many machine learning applications, including support vector machines and least absolute deviation. For the primal-dual formulation of this problem, we propose a novel algorithm called \emph{Variance Reduction via Primal-Dual Accelerated Dual Averaging (\vrpda)}. In the nonsmooth and general convex setting, \vrpda has the overall complexity $O(nd\log\min \{1/\epsilon, n\} + d/\epsilon )$ in terms of the primal-dual gap, where $n$ denotes the number of samples, $d$ the dimension of the primal variables, and $\epsilon$ the desired accuracy. In the nonsmooth and strongly convex setting, the overall complexity of \vrpda becomes $O(nd\log\min\{1/\epsilon, n\} + d/\sqrt{\epsilon})$ in terms of both the primal-dual gap and the distance between iterate and optimal solution. Both these results for \vrpda improve significantly on state-of-the-art complexity estimates—which are $O(nd\log \min\{1/\epsilon, n\} + \sqrt{n}d/\epsilon)$ for the nonsmooth and general convex setting and $O(nd\log \min\{1/\epsilon, n\} + \sqrt{n}d/\sqrt{\epsilon})$ for the nonsmooth and strongly convex setting—with a simpler and more straightforward algorithm and analysis. Moreover, both complexities are better than \emph{lower} bounds for general convex finite-sum optimization, because our approach makes use of additional, commonly occurring structure. Numerical experiments reveal competitive performance of \vrpda compared to state-of-the-art approaches.

Details

ICML Conference 2019 Conference Paper

Bilinear Bandits with Low-rank Structure

Kwang-Sung Jun
Rebecca Willett
Stephen J. Wright 0001
Robert D. Nowak

We introduce the bilinear bandit problem with low-rank structure in which an action takes the form of a pair of arms from two different entity types, and the reward is a bilinear function of the known feature vectors of the arms. The unknown in the problem is a $d_1$ by $d_2$ matrix $\mathbf{\Theta}^*$ that defines the reward, and has low rank $r \ll \min\{d_1, d_2\}$. Determination of $\mathbf{\Theta}^*$ with this low-rank structure poses a significant challenge in finding the right exploration-exploitation tradeoff. In this work, we propose a new two-stage algorithm called “Explore-Subspace-Then-Refine” (ESTR). The first stage is an explicit subspace exploration, while the second stage is a linear bandit algorithm called “almost-low-dimensional OFUL” (LowOFUL) that exploits and further refines the estimated subspace via a regularization technique. We show that the regret of ESTR is $\widetilde{\mathcal{O}}((d_1+d_2)^{3/2} \sqrt{r T})$ where $\widetilde{\mathcal{O}}$ hides logarithmic factors and $T$ is the time horizon, which improves upon the regret of $\widetilde{\mathcal{O}}(d_1d_2\sqrt{T})$ attained for a naïve linear bandit reduction. We conjecture that the regret bound of ESTR is unimprovable up to polylogarithmic factors, and our preliminary experiment shows that ESTR outperforms a naïve linear bandit reduction.

Details

ICML Conference 2019 Conference Paper

Blended Conditonal Gradients

Gábor Braun
Sebastian Pokutta
Dan Tu
Stephen J. Wright 0001

We present a blended conditional gradient approach for minimizing a smooth convex function over a polytope P, combining the Frank{–}Wolfe algorithm (also called conditional gradient) with gradient-based steps, different from away steps and pairwise steps, but still achieving linear convergence for strongly convex functions, along with good practical performance. Our approach retains all favorable properties of conditional gradient algorithms, notably avoidance of projections onto P and maintenance of iterates as sparse convex combinations of a limited number of extreme points of P. The algorithm is lazy, making use of inexpensive inexact solutions of the linear programming subproblem that characterizes the conditional gradient approach. It decreases measures of optimality (primal and dual gaps) rapidly, both in the number of iterations and in wall-clock time, outperforming even the lazy conditional gradient algorithms of Braun et al. 2017. We also present a streamlined version of the algorithm that applies when P is the probability simplex.

Details

ICML Conference 2019 Conference Paper

First-Order Algorithms Converge Faster than $O(1/k)$ on Convex Problems

Ching-pei Lee
Stephen J. Wright 0001

It is well known that both gradient descent and stochastic coordinate descent achieve a global convergence rate of $O(1/k)$ in the objective value, when applied to a scheme for minimizing a Lipschitz-continuously differentiable, unconstrained convex function. In this work, we improve this rate to $o(1/k)$. We extend the result to proximal gradient and proximal coordinate descent on regularized problems to show similar $o(1/k)$ convergence rates. The result is tight in the sense that a rate of $O(1/k^{1+\epsilon})$ is not generally attainable for any $\epsilon>0$, for any of these methods.

Details

ICML Conference 2018 Conference Paper

Dissipativity Theory for Accelerating Stochastic Variance Reduction: A Unified Analysis of SVRG and Katyusha Using Semidefinite Programs

Bin Hu 0002
Stephen J. Wright 0001
Laurent Lessard

Techniques for reducing the variance of gradient estimates used in stochastic programming algorithms for convex finite-sum problems have received a great deal of attention in recent years. By leveraging dissipativity theory from control, we provide a new perspective on two important variance-reduction algorithms: SVRG and its direct accelerated variant Katyusha. Our perspective provides a physically intuitive understanding of the behavior of SVRG-like methods via a principle of energy conservation. The tools discussed here allow us to automate the convergence analysis of SVRG-like methods by capturing their essential properties in small semidefinite programs amenable to standard analysis and computational techniques. Our approach recovers existing convergence results for SVRG and Katyusha and generalizes the theory to alternative parameter choices. We also discuss how our approach complements the linear coupling technique. Our combination of perspectives leads to a better understanding of accelerated variance-reduced stochastic methods for finite-sum problems.

Details

UAI Conference 2018 Conference Paper

Stochastic Learning for Sparse Discrete Markov Random Fields with Controlled Gradient Approximation Error

Sinong Geng
Zhaobin Kuang
Jie Liu 0006
Stephen J. Wright 0001
David Page

We study the L1 -regularized maximum likelihood estimator/estimation (MLE) problem for discrete Markov random fields (MRFs), where efficient and scalable learning requires both sparse regularization and approximate inference. To address these challenges, we consider a stochastic learning framework called stochastic proximal gradient (SPG; Honorio 2012a, Atchade et al. 2014, Miasojedow and Rejchel 2016). SPG is an inexact proximal gradient algorithm [Schmidt et al. , 2011], whose inexactness stems from the stochastic oracle (Gibbs sampling) for gradient approximation – exact gradient evaluation is infeasible in general due to the NP-hard inference problem for discrete MRFs [Koller and Friedman, 2009]. Theoretically, we provide novel verifiable bounds to inspect and control the quality of gradient approximation. Empirically, we propose the tighten asymptotically (TAY) learning strategy based on the verifiable bounds to boost the performance of SPG.

Details

ICML Conference 2014 Conference Paper

An Asynchronous Parallel Stochastic Coordinate Descent Algorithm

Ji Liu 0002
Stephen J. Wright 0001
Christopher Ré
Victor Bittorf
Srikrishna Sridhar

We describe an asynchronous parallel stochastic coordinate descent algorithm for minimizing smooth unconstrained or separably constrained functions. The method achieves a linear convergence rate on functions that satisfy an essential strong convexity property and a sublinear rate (1/K) on general convex functions. Near-linear speedup on a multicore system can be expected if the number of processors is O(n^1/2) in unconstrained optimization and O(n^1/4) in the separable-constrained case, where n is the number of variables. We describe results from implementation on 40-core processors.

Details

ICML Conference 2011 Conference Paper

Manifold Identification of Dual Averaging Methods for Regularized Stochastic Online Learning

Sangkyun Lee 0002
Stephen J. Wright 0001

Details