
Author name cluster

Hancheng Min

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

9 papers
2 author rows

Possible papers (9)

TMLR 2025 Journal Article

A Local Polyak-Łojasiewicz and Descent Lemma of Gradient Descent For Overparametrized Linear Models

  • Ziqing Xu
  • Hancheng Min
  • Salma Tarmoun
  • Enrique Mallada
  • Rene Vidal

Most prior work on the convergence of gradient descent (GD) for overparameterized neural networks relies on strong assumptions on the step size (infinitesimal), the hidden-layer width (infinite), or the initialization (large, spectral, balanced). Recent efforts to relax these assumptions focus on two-layer linear networks trained with the squared loss. In this work, we derive a linear convergence rate for training two-layer linear neural networks with GD for general losses and under relaxed assumptions on the step size, width, and initialization. A key challenge in deriving this result is that classical ingredients for deriving convergence rates for nonconvex problems, such as the Polyak-Łojasiewicz (PL) condition and Descent Lemma, do not hold globally for overparameterized neural networks. Here, we prove that these two conditions hold locally with local constants that depend on the weights. Then, we provide bounds on these local constants, which depend on the initialization of the weights, the current loss, and the global PL and smoothness constants of the non-overparameterized model. Based on these bounds, we derive a linear convergence rate for GD. Our convergence analysis not only improves upon prior results but also suggests a better choice for the step size, as verified through our numerical experiments.
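
To make the setting concrete, here is a minimal NumPy sketch of the training problem the abstract analyzes: gradient descent on a two-layer linear model with the squared loss (one instance of the general losses the paper covers). The dimensions, initialization scale, and fixed step size are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (not the authors' code): GD on f(x) = x @ W1 @ W2 with the
# squared loss.  Dimensions, init scale, and step size are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 200, 10, 50                      # samples, input dim, hidden width
X = rng.standard_normal((n, d))
w_star = rng.standard_normal((d, 1))
y = X @ w_star                             # noiseless targets

W1 = 0.1 * rng.standard_normal((d, h))     # overparametrized factors
W2 = 0.1 * rng.standard_normal((h, 1))

eta = 1e-2                                 # fixed step size (assumed)
for t in range(2000):
    r = X @ W1 @ W2 - y                    # residual
    loss = 0.5 * np.mean(r ** 2)
    g = X.T @ r / n                        # gradient w.r.t. the product W1 @ W2
    W1, W2 = W1 - eta * g @ W2.T, W2 - eta * W1.T @ g
    if t % 500 == 0:
        print(f"step {t:4d}  loss {loss:.3e}")
```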

NeurIPS 2025 Conference Paper

Convergence Rates for Gradient Descent on the Edge of Stability for Overparametrised Least Squares

  • Lachlan MacDonald
  • Hancheng Min
  • Leandro Palma
  • Salma Tarmoun
  • Ziqing Xu
  • Rene Vidal

Classical optimisation theory guarantees monotonic objective decrease for gradient descent (GD) when employed in a small step size, or "stable", regime. In contrast, gradient descent on neural networks is frequently performed in a large step size regime called the "edge of stability", in which the objective decreases non-monotonically with an observed implicit bias towards flat minima. In this paper, we take a step toward quantifying this phenomenon by providing convergence rates for gradient descent with large learning rates in an overparametrised least squares setting. The key insight behind our analysis is that, as a consequence of overparametrisation, the set of global minimisers forms a Riemannian manifold $M$, which enables the decomposition of the GD dynamics into components parallel and orthogonal to $M$. The parallel component corresponds to Riemannian gradient descent on the objective sharpness, while the orthogonal component corresponds to a quadratic dynamical system. This insight allows us to derive convergence rates in three regimes characterised by the learning rate size: the subcritical regime, in which transient instability is overcome in finite time before linear convergence to a suboptimally flat global minimum; the critical regime, in which instability persists for all time with a power-law convergence toward the optimally flat global minimum; the supercritical regime, in which instability persists for all time with linear convergence to an oscillation of period two centred on the optimally flat global minimum.
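
As a toy illustration of the regimes described above (my own scalar example, not the paper's setting), consider GD on the overparametrised least-squares loss $L(u,v)=\tfrac{1}{2}(uv-1)^2$. The flattest global minimum $u=v=1$ has sharpness $u^2+v^2=2$, so the classical stability threshold for the step size is $1$; the values below are assumptions chosen to land in the critical regime.

```python
# Toy illustration (not the paper's exact setting): GD on L(u, v) = 0.5*(u*v - 1)^2.
# With eta = 1.0 (critical) the loss decreases non-monotonically while the
# iterates oscillate ever closer to the flattest minimum; eta = 1.2 instead
# settles into a persistent period-two oscillation.  All numbers are assumptions.
eta = 1.0
u, v = 1.5, 0.0
for t in range(60):
    r = u * v - 1.0                              # residual
    u, v = u - eta * r * v, v - eta * r * u      # simultaneous GD update
    if t % 10 == 0:
        # u^2 + v^2 is the nonzero Hessian eigenvalue (sharpness) at the minimizers
        print(f"t={t:2d}  loss={0.5 * r * r:.4f}  u^2+v^2={u*u + v*v:.3f}")
```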

ICML 2025 Conference Paper

Gradient Flow Provably Learns Robust Classifiers for Orthonormal GMMs

  • Hancheng Min
  • René Vidal

Deep learning-based classifiers are known to be vulnerable to adversarial attacks. Existing methods for defending against such attacks require adding a defense mechanism or modifying the learning procedure (e.g., by adding adversarial examples). This paper shows that for certain data distributions one can learn a provably robust classifier using standard learning methods and without adding a defense mechanism. More specifically, this paper addresses the problem of finding a robust classifier for a binary classification problem in which the data comes from an isotropic mixture of Gaussians with orthonormal cluster centers. First, we characterize the largest $\ell_2$-attack any classifier can defend against while maintaining high accuracy, and show the existence of optimal robust classifiers achieving this maximum $\ell_2$-robustness. Next, we show that given data from the orthonormal Gaussian mixture model, gradient flow on a two-layer network with a polynomial ReLU activation and without adversarial examples provably finds an optimal robust classifier.
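
A minimal sketch of the data model and architecture described above, under explicit assumptions: isotropic Gaussian noise around two orthonormal cluster centers, and a polynomial ReLU read as $\max(z,0)^p$ (the paper's exact definition of pReLU may differ). This only sets up the model; it does not reproduce the gradient-flow analysis.

```python
# Sketch of the assumed data model and network, not the authors' code.
import numpy as np

rng = np.random.default_rng(1)
d, n, sigma, p = 50, 400, 0.3, 3

# Orthonormal cluster centers: two columns of a random orthogonal matrix.
Q, _ = np.linalg.qr(rng.standard_normal((d, 2)))
mu_pos, mu_neg = Q[:, 0], Q[:, 1]

y = rng.choice([-1, 1], size=n)
centers = np.where(y[:, None] == 1, mu_pos, mu_neg)
X = centers + sigma * rng.standard_normal((n, d))    # isotropic Gaussian noise

# Two-layer network with h hidden pReLU units: f(x) = sum_j v_j * prelu(w_j . x)
h = 20
W = 0.01 * rng.standard_normal((h, d))               # small first-layer init
v = 0.01 * rng.standard_normal(h)                    # second-layer weights

def prelu(z, p=p):
    return np.maximum(z, 0.0) ** p                   # assumed "polynomial ReLU"

def f(X, W, v):
    return prelu(X @ W.T) @ v

print("accuracy at initialization:", np.mean(np.sign(f(X, W, v)) == y))
```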

NeurIPS 2025 Conference Paper

Neural Collapse under Gradient Flow on Shallow ReLU Networks for Orthogonally Separable Data

  • Hancheng Min
  • Zhihui Zhu
  • Rene Vidal

Among the many mysteries behind the success of deep networks lies the exceptional discriminative power of their learned representations, as manifested by the intriguing Neural Collapse (NC) phenomenon, where simple feature structures emerge at the last layer of a trained neural network. Prior work on the theoretical understanding of NC has focused on analyzing the optimization landscape of matrix-factorization-like problems by considering the last-layer features as unconstrained free optimization variables and showing that their global minima exhibit NC. In this paper, we show that gradient flow on a two-layer ReLU network for classifying orthogonally separable data provably exhibits NC, thereby advancing prior results in two ways: First, we relax the assumption of unconstrained features, showing the effect of data structure and nonlinear activations on NC characterizations. Second, we reveal the role of the implicit bias of the training dynamics in facilitating the emergence of NC.
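
For readers who want to probe NC empirically, one common scalar summary of within-class collapse (NC1) can be computed from last-layer features as below; this is a standard diagnostic written under a balanced-class assumption, not code from the paper.

```python
# Standard NC1-style diagnostic (not the paper's code): within-class scatter
# of last-layer features measured against the scatter of the class means.
import numpy as np

def nc1_metric(H, y):
    """H: (n, d) last-layer features, y: (n,) integer class labels."""
    classes = np.unique(y)
    mu_global = H.mean(axis=0)
    mu_class = np.stack([H[y == c].mean(axis=0) for c in classes])
    # Within-class scatter, weighted by class frequency.
    Sw = sum(np.cov(H[y == c].T, bias=True) * np.mean(y == c) for c in classes)
    # Between-class scatter of the class means around the global mean
    # (balanced classes assumed).
    Sb = (mu_class - mu_global).T @ (mu_class - mu_global) / len(classes)
    # Collapses toward 0 as features within each class concentrate on their mean.
    return np.trace(Sw @ np.linalg.pinv(Sb)) / len(classes)
```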

ICML 2024 Conference Paper

Can Implicit Bias Imply Adversarial Robustness?

  • Hancheng Min
  • René Vidal

The implicit bias of gradient-based training algorithms has been considered mostly beneficial as it leads to trained networks that often generalize well. However, Frei et al. (2023) show that such implicit bias can harm adversarial robustness. Specifically, they show that if the data consists of clusters with small inter-cluster correlation, a shallow (two-layer) ReLU network trained by gradient flow generalizes well, but it is not robust to adversarial attacks of small radius. Moreover, this phenomenon occurs despite the existence of a much more robust classifier that can be explicitly constructed from a shallow network. In this paper, we extend recent analyses of neuron alignment to show that a shallow network with a polynomial ReLU activation (pReLU) trained by gradient flow not only generalizes well but is also robust to adversarial attacks. Our results highlight the importance of the interplay between data structure and architecture design in the implicit bias and robustness of trained networks.
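
As an illustration of what "robust to adversarial attacks" means operationally, here is a sketch of an empirical check under a single-step $\ell_2$ attack on a two-layer pReLU network, written as $f(x)=\sum_j v_j \max(w_j^\top x, 0)^p$ (an assumed form). This is an illustrative evaluation utility, not the paper's methodology or proof technique.

```python
# Sketch (assumed network form): empirical robustness to a one-step l2 attack
# of radius eps that perturbs each input against its label's margin.
import numpy as np

def grad_f_x(x, W, v, p=3):
    """Gradient of f(x) = sum_j v_j * max(w_j . x, 0)**p with respect to x."""
    pre = W @ x                                    # (h,)
    act = np.maximum(pre, 0.0) ** (p - 1)
    return (v * p * act) @ W                       # (d,)

def robust_accuracy(X, y, W, v, eps, p=3):
    correct = 0
    for x, label in zip(X, y):
        g = grad_f_x(x, W, v, p)
        # Move against the label's margin by eps in l2 norm.
        delta = -label * eps * g / (np.linalg.norm(g) + 1e-12)
        f_adv = np.maximum((x + delta) @ W.T, 0.0) ** p @ v
        correct += int(np.sign(f_adv) == label)
    return correct / len(y)
```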

ICLR 2024 Conference Paper

Early Neuron Alignment in Two-layer ReLU Networks with Small Initialization

  • Hancheng Min
  • Enrique Mallada
  • René Vidal

This paper studies the problem of training a two-layer ReLU network for binary classification using gradient flow with small initialization. We consider a training dataset with well-separated input vectors: Any pair of input data with the same label are positively correlated, and any pair with different labels are negatively correlated. Our analysis shows that, during the early phase of training, neurons in the first layer try to align with either the positive data or the negative data, depending on their corresponding weights on the second layer. A careful analysis of the neurons' directional dynamics allows us to provide an $\mathcal{O}(\frac{\log n}{\sqrt{\mu}})$ upper bound on the time it takes for all neurons to achieve good alignment with the input data, where $n$ is the number of data points and $\mu$ measures how well the data are separated. After the early alignment phase, the loss converges to zero at an $\mathcal{O}(\frac{1}{t})$ rate, and the weight matrix on the first layer is approximately low-rank. Numerical experiments on the MNIST dataset illustrate our theoretical findings.
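
The data assumption can be checked directly on a dataset; below is a small utility that verifies the pairwise-correlation condition and reports a separation value. Taking the separation to be the smallest pairwise $|\cos|$ is only one plausible reading of $\mu$; the paper's definition may differ.

```python
# Sketch (assumed reading of the separation measure, not the paper's exact mu).
import numpy as np

def separation(X, y):
    """X: (n, d) inputs, y: (n,) labels in {-1, +1}."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    C = Xn @ Xn.T                                   # pairwise cosine similarities
    same = np.equal.outer(y, y)
    off_diag = ~np.eye(len(y), dtype=bool)
    # Same-label pairs must be positively correlated, different-label negatively.
    ok = np.all(C[same & off_diag] > 0) and np.all(C[~same] < 0)
    mu = np.min(np.abs(C[off_diag])) if ok else None
    return ok, mu
```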

ICML 2023 Conference Paper

On the Convergence of Gradient Flow on Multi-layer Linear Models

  • Hancheng Min
  • René Vidal
  • Enrique Mallada

In this paper, we analyze the convergence of gradient flow on a multi-layer linear model with a loss function of the form $f(W_1W_2\cdots W_L)$. We show that when $f$ satisfies the gradient dominance property, proper weight initialization leads to exponential convergence of the gradient flow to a global minimum of the loss. Moreover, the convergence rate depends on two trajectory-specific quantities that are controlled by the weight initialization: the imbalance matrices, which measure the difference between the weights of adjacent layers, and the least singular value of the weight product $W=W_1W_2\cdots W_L$. Our analysis exploits the fact that the gradient of the overparameterized loss can be written as the composition of the non-overparametrized gradient with a time-varying (weight-dependent) linear operator whose smallest eigenvalue controls the convergence rate. The key challenge we address is to derive a uniform lower bound for this time-varying eigenvalue that leads to improved rates for several multi-layer network models studied in the literature.
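
A small numerical sanity check of the imbalance quantities mentioned above, under an assumed convention $D_\ell = W_{\ell+1}W_{\ell+1}^\top - W_\ell^\top W_\ell$ and a toy loss $f(W)=\tfrac12\|W-\mathrm{target}\|_F^2$ of my choosing: under small-step GD, which approximates gradient flow, these matrices stay nearly constant along the trajectory.

```python
# Sketch (assumed conventions, not the authors' code): near-conservation of the
# layer imbalance matrices under small-step GD on a deep linear model.
import numpy as np

rng = np.random.default_rng(2)
dims = [6, 8, 8, 4]                     # d0 -> d1 -> d2 -> d3, L = 3 layers
Ws = [0.3 * rng.standard_normal((dims[i], dims[i + 1])) for i in range(3)]
target = rng.standard_normal((dims[0], dims[-1]))

def product(Ws):
    P = Ws[0]
    for W in Ws[1:]:
        P = P @ W
    return P

def imbalance(Ws):
    # D_l = W_{l+1} W_{l+1}^T - W_l^T W_l for each adjacent pair of layers
    return [Ws[l + 1] @ Ws[l + 1].T - Ws[l].T @ Ws[l] for l in range(len(Ws) - 1)]

D0 = imbalance(Ws)
eta = 1e-3
for _ in range(3000):
    G = product(Ws) - target            # gradient of f(W) = 0.5 * ||W - target||_F^2
    grads = []
    for l in range(3):
        left = product(Ws[:l]) if l > 0 else np.eye(dims[0])
        right = product(Ws[l + 1:]) if l < 2 else np.eye(dims[-1])
        grads.append(left.T @ G @ right.T)
    Ws = [W - eta * g for W, g in zip(Ws, grads)]

drift = max(np.linalg.norm(D - D) if False else np.linalg.norm(Dn - Do)
            for Dn, Do in zip(imbalance(Ws), D0))
print("max imbalance drift:", drift)    # stays small: GD approximates gradient flow
```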

ICML 2021 Conference Paper

On the Explicit Role of Initialization on the Convergence and Implicit Bias of Overparametrized Linear Networks

  • Hancheng Min
  • Salma Tarmoun
  • René Vidal
  • Enrique Mallada

Neural networks trained via gradient descent with random initialization and without any regularization enjoy good generalization performance in practice despite being highly overparametrized. A promising direction to explain this phenomenon is to study how initialization and overparametrization affect convergence and implicit bias of training algorithms. In this paper, we present a novel analysis of single-hidden-layer linear networks trained under gradient flow, which connects initialization, optimization, and overparametrization. Firstly, we show that the squared loss converges exponentially to its optimum at a rate that depends on the level of imbalance of the initialization. Secondly, we show that proper initialization constrains the dynamics of the network parameters to lie within an invariant set. In turn, minimizing the loss over this set leads to the min-norm solution. Finally, we show that large hidden layer width, together with (properly scaled) random initialization, ensures proximity to such an invariant set during training, allowing us to derive a novel non-asymptotic upper bound on the distance between the trained network and the min-norm solution.
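
A toy instance of the claim about the min-norm solution (my own assumed dimensions and hyperparameters, not the paper's experiments): train a single-hidden-layer linear network on an underdetermined least-squares problem from a small random initialization and compare the learned end-to-end map with the minimum-norm interpolator.

```python
# Sketch (assumed toy instance): single-hidden-layer linear network U V trained
# with GD on an underdetermined least-squares problem.
import numpy as np

rng = np.random.default_rng(3)
n, d, h = 20, 60, 100                       # fewer samples than input dimensions
X = rng.standard_normal((n, d))
y = rng.standard_normal((n, 1))
w_minnorm = np.linalg.pinv(X) @ y           # minimum-l2-norm interpolator

scale = 0.05                                # small initialization (assumed)
U = scale * rng.standard_normal((d, h))
V = scale * rng.standard_normal((h, 1))

eta = 1e-2
for _ in range(20000):
    g = X.T @ (X @ U @ V - y) / n           # gradient w.r.t. the product U @ V
    U, V = U - eta * g @ V.T, V - eta * U.T @ g

gap = np.linalg.norm(U @ V - w_minnorm) / np.linalg.norm(w_minnorm)
print("relative distance to the min-norm solution:", gap)
# Shrinking `scale` and enlarging `h` should shrink this gap, in line with the
# abstract's non-asymptotic bound.
```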

ICRA 2018 Conference Paper

Voronoi-Based Coverage Control of Pan/Tilt/Zoom Camera Networks

  • Ömür Arslan
  • Hancheng Min
  • Daniel E. Koditschek

A challenge of pan/tilt/zoom (PTZ) camera networks for efficient and flexible visual monitoring is automated active network reconfiguration in response to environmental stimuli. In this paper, given an event/activity distribution over a convex environment, we propose a new provably correct reactive coverage control algorithm for PTZ camera networks that continuously (re)configures camera orientations and zoom levels (i.e., angles of view) in order to locally maximize their total coverage quality. Our construction is based on careful modeling of visual sensing quality that is consistent with the physical nature of cameras, and we introduce a new notion of conic Voronoi diagrams, based on our sensing quality measures, to solve the camera network allocation problem: that is, to determine where each camera should focus in its field of view given all the other cameras' configurations. Accordingly, we design simple greedy gradient algorithms for both continuous- and discrete-time first-order PTZ camera dynamics that asymptotically converge to a locally optimal coverage configuration. Finally, we provide numerical and experimental evidence demonstrating the effectiveness of the proposed coverage algorithms.
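
To give a feel for the reactive reconfiguration idea, here is a deliberately simplified toy (my own model, far from the paper's conic Voronoi construction): cameras adjust only their pan angles by gradient ascent on a crude coverage score in which each event point is credited to whichever camera currently senses it best.

```python
# Toy sketch (my own simplification, not the paper's model or algorithm).
import numpy as np

rng = np.random.default_rng(4)
cams = np.array([[0.0, 0.0], [4.0, 0.0]])            # fixed camera positions
pan = np.array([0.2, 2.5])                           # pan angles (radians)
events = rng.uniform([0, 1], [4, 3], size=(100, 2))  # event points in the environment

def quality(pan, half_fov=0.5):
    """Total coverage: each event point scored by its best camera."""
    vec = events[None, :, :] - cams[:, None, :]           # (n_cam, n_pts, 2)
    bearing = np.arctan2(vec[..., 1], vec[..., 0])        # angle to each point
    offset = np.abs(np.angle(np.exp(1j * (bearing - pan[:, None]))))
    score = np.clip(1.0 - offset / half_fov, 0.0, None)   # 1 on-axis, 0 outside FOV
    return np.max(score, axis=0).sum()                    # best camera per point

step, h = 0.05, 1e-4
for _ in range(200):                                      # greedy gradient ascent on pan
    grad = np.array([(quality(pan + h * e) - quality(pan - h * e)) / (2 * h)
                     for e in np.eye(2)])
    pan = pan + step * grad

print("final pan angles:", pan, " coverage:", quality(pan))
```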