Author name cluster

Hanbaek Lyu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

2 author rows

JMLR Journal 2026 Journal Article

Convergence and complexity of block majorization-minimization for constrained block-Riemannian optimization

Yuchen Li
Laura Balzano
Deanna Needell
Hanbaek Lyu

Block majorization-minimization (BMM) is a simple iterative algorithm for nonconvex optimization that sequentially minimizes a majorizing surrogate of the objective function in each block coordinate while the other block coordinates are held fixed. We consider a family of BMM algorithms for minimizing nonsmooth nonconvex objectives, where each parameter block is constrained within a subset of a Riemannian manifold. We establish that this algorithm converges asymptotically to the set of stationary points, and attains an $\epsilon$-stationary point within $\widetilde{O}(\epsilon^{-2})$ iterations. In particular, the assumptions for our complexity results are completely Euclidean when the underlying manifold is a product of Euclidean or Stiefel manifolds, although our analysis makes explicit use of the Riemannian geometry. Our general analysis applies to a wide range of algorithms with Riemannian constraints: Riemannian MM, block projected gradient descent, Bures-JKO scheme for Wasserstein variational inference, optimistic likelihood estimation, geodesically constrained subspace tracking, robust PCA, and Riemannian CP-dictionary-learning. We experimentally validate that our algorithm converges faster than standard Euclidean algorithms applied to the Riemannian setting. [abs] [ pdf ][ bib ] &copy JMLR 2026. ( edit, beta )

PDF Details

ICML Conference 2025 Conference Paper

Linear convergence of Sinkhorn's algorithm for generalized static Schrödinger bridge

Rahul Choudhary
Hanbaek Lyu

The classical static Schrödinger Bridge (SSB) problem, which seeks the most likely stochastic evolution between two marginal probability measures, has been studied extensively in the optimal transport and statistical physics communities, and more recently in machine learning communities in the surge of generative models. The standard approach to solve SSB is to first identify its Kantorovich dual and use Sinkhorn’s algorithm to find the optimal potential functions. While the original SSB is only a strictly convex minimization problem, this approach is known to warrant linear convergence under mild assumptions. In this work, we consider a generalized SSB allowing any strictly increasing divergence functional, far generalizing the entropy functional $x\log x$ in the standard SSB. This problem naturally arises in a wide range of seemingly unrelated problems in entropic optimal transport, random graphs/matrices, and combinatorics. We establish Kantorovich duality and linear convergence of Sinkhorn’s algorithm for the generalized SSB problem under mild conditions. Our results provide a new rigorous foundation for understanding Sinkhorn-type iterative methods in the context of large-scale generalized Schrödinger bridges.

Details

NeurIPS Conference 2025 Conference Paper

Offline Actor-Critic for Average Reward MDPs

William Powell
Jeongyeol Kwon
Qiaomin Xie
Hanbaek Lyu

We study offline policy optimization for infinite-horizon average-reward Markov decision processes (MDPs) with large or infinite state spaces. Specifically, we propose a pessimistic actor-critic algorithm that uses a computationally efficient linear function class for value function estimation. At the core of our method is a critic that computes a pessimistic estimate of the average reward under the current policy, as well as the corresponding policy gradient, by solving a fixed-point Bellman equation, rather than solving a successive sequence of regression problems as in finite horizon settings. This procedure reduces to solving a second-order cone program, which is computationally tractable. Our theoretical analysis is based on a weak partial data coverage assumption, which requires only that the offline data aligns well with the expected feature vector of a comparator policy. Under this condition, we show that our algorithm achieves the optimal sample complexity of O(\varepsilon^{-2}) for learning a near-optimal policy, up to model misspecification errors.

PDF Details

ICML Conference 2025 Conference Paper

Sample Complexity of Branch-length Estimation by Maximum Likelihood

David Clancy Jr.
Hanbaek Lyu
Sébastien Roch

We consider the branch-length estimation problem on a bifurcating tree: a character evolves along the edges of a binary tree according to a two-state symmetric Markov process, and we seek to recover the edge transition probabilities from repeated observations at the leaves. This problem arises in phylogenetics, and is related to latent tree graphical model inference. In general, the log-likelihood function is non-concave and may admit many critical points. Nevertheless, simple coordinate maximization has been known to perform well in practice, defying the complexity of the likelihood landscape. In this work, we provide the first theoretical guarantee as to why this might be the case. We show that deep inside the Kesten-Stigum reconstruction regime, provided with polynomially many $m$ samples (assuming the tree is balanced), there exists a universal parameter regime (independent of the size of the tree) where the log-likelihood function is strongly concave and smooth with high probability. On this high-probability likelihood landscape event, we show that the standard coordinate maximization algorithm converges exponentially fast to the maximum likelihood estimator, which is within $O(1/\sqrt{m})$ from the true parameter, provided a sufficiently close initial point.

Details

ICML Conference 2024 Conference Paper

Convergence and Complexity Guarantee for Inexact First-order Riemannian Optimization Algorithms

Yuchen Li
Laura Balzano
Deanna Needell
Hanbaek Lyu

We analyze inexact Riemannian gradient descent (RGD) where Riemannian gradients and retractions are inexactly (and cheaply) computed. Our focus is on understanding when inexact RGD converges and what is the complexity in the general nonconvex and constrained setting. We answer these questions in a general framework of tangential Block Majorization-Minimization (tBMM). We establish that tBMM converges to an $\epsilon$-stationary point within $O(\epsilon^{-2})$ iterations. Under a mild assumption, the results still hold when the subproblem is solved inexactly in each iteration provided the total optimality gap is bounded. Our general analysis applies to a wide range of classical algorithms with Riemannian constraints including inexact RGD and proximal gradient method on Stiefel manifolds. We numerically validate that tBMM shows improved performance over existing methods when applied to various problems, including nonnegative tensor decomposition with Riemannian constraints, regularized nonnegative matrix factorization, and low-rank matrix recovery problems.

Details

ICML Conference 2024 Conference Paper

On The Complexity of First-Order Methods in Stochastic Bilevel Optimization

Jeongyeol Kwon
Dohyun Kwon 0002
Hanbaek Lyu

We consider the problem of finding stationary points in Bilevel optimization when the lower-level problem is unconstrained and strongly convex. The problem has been extensively studied in recent years; the main technical challenge is to keep track of lower-level solutions $y^*(x)$ in response to the changes in the upper-level variables $x$. Subsequently, all existing approaches tie their analyses to a genie algorithm that knows lower-level solutions and, therefore, need not query any points far from them. We consider a dual question to such approaches: suppose we have an oracle, which we call $y^*$-aware, that returns an $O(\epsilon)$-estimate of the lower-level solution, in addition to first-order gradient estimators locally unbiased within the $\Theta(\epsilon)$-ball around $y^*(x)$. We study the complexity of finding stationary points with such an $y^*$-aware oracle: we propose a simple first-order method that converges to an $\epsilon$ stationary point using $O(\epsilon^{-6}), O(\epsilon^{-4})$ access to first-order $y^*$-aware oracles. Our upper bounds also apply to standard unbiased first-order oracles, improving the best-known complexity of first-order methods by $O(\epsilon)$ with minimal assumptions. We then provide the matching $\Omega(\epsilon^{-6})$, $\Omega(\epsilon^{-4})$ lower bounds without and with an additional smoothness assumption, respectively. Our results imply that any approach that simulates an algorithm with an $y^*$-aware oracle must suffer the same lower bounds.

Details

ICML Conference 2024 Conference Paper

Stochastic Optimization with Arbitrary Recurrent Data Sampling

William G. Powell
Hanbaek Lyu

For obtaining optimal first-order convergence guarantees for stochastic optimization, it is necessary to use a recurrent data sampling algorithm that samples every data point with sufficient frequency. Most commonly used data sampling algorithms (e. g. , i. i. d. , MCMC, random reshuffling) are indeed recurrent under mild assumptions. In this work, we show that for a particular class of stochastic optimization algorithms, we do not need any further property (e. g. , independence, exponential mixing, and reshuffling) beyond recurrence in data sampling to guarantee optimal rate of first-order convergence. Namely, using regularized versions of Minimization by Incremental Surrogate Optimization (MISO), we show that for non-convex and possibly non-smooth objective functions with constraints, the expected optimality gap converges at an optimal rate $O(n^{-1/2})$ under general recurrent sampling schemes. Furthermore, the implied constant depends explicitly on the ’speed of recurrence’, measured by the expected amount of time to visit a data point, either averaged (’target time’) or supremized (’hitting time’) over the starting locations. We discuss applications of our general framework to decentralized optimization and distributed non-negative matrix factorization.

Details

JMLR Journal 2024 Journal Article

Stochastic Regularized Majorization-Minimization with weakly convex and multi-convex surrogates

Hanbaek Lyu

Stochastic majorization-minimization (SMM) is a class of stochastic optimization algorithms that proceed by sampling new data points and minimizing a recursive average of surrogate functions of an objective function. The surrogates are required to be strongly convex and the existing convergence rate analysis for the general non-convex setting was not available. In this paper, we propose an extension of SMM where surrogates are allowed to be only weakly convex or block multi-convex, and the averaged surrogates are approximately minimized with proximal regularization or block-minimized within diminishing radii, respectively. For the general nonconvex constrained setting with non-i.i.d. data samples, we show that the first-order optimality gap of the proposed algorithm decays at the rate $\widetilde{O}(n^{-1/4})$ for the empirical loss and $\widetilde{O}(n^{-1/8})$ for the expected loss, where $n$ denotes the number of data samples processed. Under some additional assumption, the latter convergence rate can be improved to $\widetilde{O}(n^{-1/4})$. As a corollary, we obtain the first convergence rate bounds for various optimization methods under general nonconvex non-i.i.d. data setting: Double-averaging projected gradient descent and its generalizations, proximal point empirical risk minimization, and online matrix/tensor decomposition algorithms. We also provide experimental validation of our results. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2024. ( edit, beta )

PDF Details

ICML Conference 2024 Conference Paper

Supervised Matrix Factorization: Local Landscape Analysis and Applications

Joowon Lee
Hanbaek Lyu
Weixin Yao

Supervised matrix factorization (SMF) is a classical machine learning method that seeks low-dimensional feature extraction and classification tasks at the same time. Training an SMF model involves solving a non-convex and factor-wise constrained optimization problem with at least three blocks of parameters. Due to the high non-convexity and constraints, theoretical understanding of the optimization landscape of SMF has been limited. In this paper, we provide an extensive local landscape analysis for SMF and derive several theoretical and practical applications. Analyzing diagonal blocks of the Hessian naturally leads to a block coordinate descent (BCD) algorithm with adaptive step sizes. We provide global convergence and iteration complexity guarantees for this algorithm. Full Hessian analysis gives minimum $L_{2}$-regularization to guarantee local strong convexity and robustness of parameters. We establish a local estimation guarantee under a statistical SMF model. We also propose a novel GPU-friendly neural implementation of the BCD algorithm and validate our theoretical findings through numerical experiments. Our work contributes to a deeper understanding of SMF optimization, offering insights into the optimization landscape and providing practical solutions to enhance its performance.

Details

ICML Conference 2023 Conference Paper

Complexity of Block Coordinate Descent with Proximal Regularization and Applications to Wasserstein CP-dictionary Learning

Dohyun Kwon 0002
Hanbaek Lyu

We consider the block coordinate descent methods of Gauss-Seidel type with proximal regularization (BCD-PR), which is a classical method of minimizing general nonconvex objectives under constraints that has a wide range of practical applications. We theoretically establish the worst-case complexity bound for this algorithm. Namely, we show that for general nonconvex smooth objectives with block-wise constraints, the classical BCD-PR algorithm converges to an epsilon-stationary point within O(1/epsilon) iterations. Under a mild condition, this result still holds even if the algorithm is executed inexactly in each step. As an application, we propose a provable and efficient algorithm for ‘Wasserstein CP-dictionary learning’, which seeks a set of elementary probability distributions that can well-approximate a given set of d-dimensional joint probability distributions. Our algorithm is a version of BCD-PR that operates in the dual space, where the primal problem is regularized both entropically and proximally.

Details

ICML Conference 2023 Conference Paper

Convergence of First-Order Methods for Constrained Nonconvex Optimization with Dependent Data

Ahmet Alacaoglu
Hanbaek Lyu

We focus on analyzing the classical stochastic projected gradient methods under a general dependent data sampling scheme for constrained smooth nonconvex optimization. We show the worst-case rate of convergence $\tilde{O}(t^{-1/4})$ and complexity $\tilde{O}(\varepsilon^{-4})$ for achieving an $\varepsilon$-near stationary point in terms of the norm of the gradient of Moreau envelope and gradient mapping. While classical convergence guarantee requires i. i. d. data sampling from the target distribution, we only require a mild mixing condition of the conditional distribution, which holds for a wide class of Markov chain sampling algorithms. This improves the existing complexity for the constrained smooth nonconvex optimization with dependent data from $\tilde{O}(\varepsilon^{-8})$ to $\tilde{O}(\varepsilon^{-4})$ with a significantly simpler analysis. We illustrate the generality of our approach by deriving convergence results with dependent data for stochastic proximal gradient methods, adaptive stochastic gradient algorithm AdaGrad and stochastic gradient algorithm with heavy ball momentum. As an application, we obtain first online nonnegative matrix factorization algorithms for dependent data based on stochastic projected gradient methods with adaptive step sizes and optimal rate of convergence.

Details

NeurIPS Conference 2023 Conference Paper

Exponentially Convergent Algorithms for Supervised Matrix Factorization

Joowon Lee
Hanbaek Lyu
Weixin Yao

Supervised matrix factorization (SMF) is a classical machine learning method that simultaneously seeks feature extraction and classification tasks, which are not necessarily a priori aligned objectives. Our goal is to use SMF to learn low-rank latent factors that offer interpretable, data-reconstructive, and class-discriminative features, addressing challenges posed by high-dimensional data. Training SMF model involves solving a nonconvex and possibly constrained optimization with at least three blocks of parameters. Known algorithms are either heuristic or provide weak convergence guarantees for special cases. In this paper, we provide a novel framework that `lifts' SMF as a low-rank matrix estimation problem in a combined factor space and propose an efficient algorithm that provably converges exponentially fast to a global minimizer of the objective with arbitrary initialization under mild assumptions. Our framework applies to a wide range of SMF-type problems for multi-class classification with auxiliary features. To showcase an application, we demonstrate that our algorithm successfully identified well-known cancer-associated gene groups for various cancers.

PDF Details

JMLR Journal 2023 Journal Article

Sampling random graph homomorphisms and applications to network data analysis

Hanbaek Lyu
Facundo Memoli
David Sivakoff

A graph homomorphism is a map between two graphs that preserves adjacency relations. We consider the problem of sampling a random graph homomorphism from a graph into a large network. We propose two complementary MCMC algorithms for sampling random graph homomorphisms and establish bounds on their mixing times and the concentration of their time averages. Based on our sampling algorithms, we propose a novel framework for network data analysis that circumvents some of the drawbacks in methods based on independent and neighborhood sampling. Various time averages of the MCMC trajectory give us various computable observables, including well-known ones such as homomorphism density and average clustering coefficient and their generalizations. Furthermore, we show that these network observables are stable with respect to a suitably renormalized cut distance between networks. We provide various examples and simulations demonstrating our framework through synthetic networks. We also \commHL{demonstrate the performance of} our framework on the tasks of network clustering and subgraph classification on the Facebook100 dataset and on Word Adjacency Networks of a set of classic novels. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2023. ( edit, beta )

PDF Details

JMLR Journal 2022 Journal Article

Online Nonnegative CP-dictionary Learning for Markovian Data

Hanbaek Lyu
Christopher Strohmeier
Deanna Needell

Online Tensor Factorization (OTF) is a fundamental tool in learning low-dimensional interpretable features from streaming multi-modal data. While various algorithmic and theoretical aspects of OTF have been investigated recently, a general convergence guarantee to stationary points of the objective function without any incoherence or sparsity assumptions is still lacking even for the i.i.d. case. In this work, we introduce a novel algorithm that learns a CANDECOMP/PARAFAC (CP) basis from a given stream of tensor-valued data under general constraints, including nonnegativity constraints that induce interpretability of the learned CP basis. We prove that our algorithm converges almost surely to the set of stationary points of the objective function under the hypothesis that the sequence of data tensors is generated by an underlying Markov chain. Our setting covers the classical i.i.d. case as well as a wide range of application contexts including data streams generated by independent or MCMC sampling. Our result closes a gap between OTF and Online Matrix Factorization in global convergence analysis for CP-decompositions. Experimentally, we show that our algorithm converges much faster than standard algorithms for nonnegative tensor factorization tasks on both synthetic and real-world data. Also, we demonstrate the utility of our algorithm on a diverse set of examples from image, video, and time-series data, illustrating how one may learn qualitatively different CP-dictionaries from the same tensor data by exploiting the tensor structure in multiple ways. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2022. ( edit, beta )

PDF Details

JMLR Journal 2020 Journal Article

Online matrix factorization for Markovian data and applications to Network Dictionary Learning

Hanbaek Lyu
Deanna Needell
Laura Balzano

Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this paper, we show that a non-convex generalization of the well-known OMF algorithm for i.i.d. stream of data in Mairal et al. converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. This allows one to extract features more efficiently from dependent data streams, as there is no need to subsample the data sequence to approximately satisfy the independence assumption. As the main application, by combining online non-negative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning, which extracts “network dictionary patches” from a given network in an online manner that encodes main features of the network. We demonstrate this technique and its application to network denoising problems on real-world network data. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2020. ( edit, beta )

PDF Details