Author name cluster

Thomas Pethick

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

2 author rows

ICLR Conference 2025 Conference Paper

Efficient Interpolation between Extragradient and Proximal Methods for Weak MVIs

Thomas Pethick
Ioannis Mavrothalassitis
Volkan Cevher

We study nonmonotone games satisfying the weak Minty variational inequality (MVI) with parameter $\rho \in (-\tfrac{1}{L}, \infty)$, where $L$ is the Lipschitz constant of the gradient operator. An error corrected version of the inexact proximal point algorithm is proposed, with which we establish the first $\mathcal O(1/\epsilon)$ rate for the entire range $\rho \in (-\tfrac{1}{L}, \infty)$, thus removing a logarithmic factor compared with the complexity of existing methods. The scheme automatically selects the needed accuracy for the proximal computation, and can recover the relaxed extragradient method when $\rho > -\tfrac{1}{2L}$ and the relaxed proximal point algorithm (rPPA) when $\rho > -\tfrac{1}{L}$. Due to the error correction, the scheme inherits the strong properties of the _exact_ rPPA. Specifically, we show that linear convergence is automatically achieved under appropriate conditions. Tightness for the range of $\rho$ is established through a lower bound for rPPA. Central to the algorithmic construction is a halfspace projection, where the key insight is that the allowed error tolerance can both be used to correct for the proximal approximation and to enlarge the problem class.

Details

NeurIPS Conference 2025 Conference Paper

Generalized Gradient Norm Clipping & Non-Euclidean $(L_0,L_1)$-Smoothness

Thomas Pethick
Wanyun Xie
Mete Erdogan
Kimon Antonakopoulos
Antonio Silveti-Falls
Volkan Cevher

This work introduces a hybrid non-Euclidean optimization method which generalizes gradient norm clipping by combining steepest descent and conditional gradient approaches. The method achieves the best of both worlds by establishing a descent property under a generalized notion of ($L_0$, $L_1$)-smoothness. Weight decay is incorporated in a principled manner by identifying a connection to the Frank-Wolfe short step. In the stochastic case, we show an order optimal $O(n^{-1/4})$ convergence rate by leveraging a momentum based gradient estimator. We discuss how to instantiate the algorithms for deep learning, which we dub Clipped Scion, and demonstrate their properties on image classification and language modeling. The code is available at https: //github. com/LIONS-EPFL/ClippedScion.

PDF Details

ICML Conference 2025 Conference Paper

Training Deep Learning Models with Norm-Constrained LMOs

Thomas Pethick
Wanyun Xie
Kimon Antonakopoulos
Zhenyu Zhu
Antonio Silveti-Falls
Volkan Cevher

In this work, we study optimization methods that leverage the linear minimization oracle (LMO) over a norm-ball. We propose a new stochastic family of algorithms that uses the LMO to adapt to the geometry of the problem and, perhaps surprisingly, show that they can be applied to unconstrained problems. The resulting update rule unifies several existing optimization methods under a single framework. Furthermore, we propose an explicit choice of norm for deep architectures, which, as a side benefit, leads to the transferability of hyperparameters across model sizes. Experimentally, we demonstrate significant speedups on nanoGPT training without any reliance on Adam. The proposed method is memory-efficient, requiring only one set of model weights and one set of gradients, which can be stored in half-precision.

Details

TMLR Journal 2025 Journal Article

νSAM: Memory-Efficient Sharpness-Aware Minimization via Nuclear Norm Constraints

Thomas Pethick
Parameswaran Raman
Lenon Minorics
Mingyi Hong
Shoham Sabach
Volkan Cevher

Sharpness-aware minimization (SAM) has been shown to improve the generalization of neural networks. However, the method comes at the expense of storing a perturbation of the model parameters, which can be restrictive when memory bound. We design a variant of SAM, called $\nu$SAM, which obtains a low-rank perturbation by modifying the perturbation constraint. The update almost entirely removes the memory footprint of the perturbation without increasing the computational complexity, thus achieving close to a $1/3$ memory saving regarding the parameters when using SGD as the base optimizer. We demonstrate comparable performance of $\nu$SAM with SAM on vision transformers both when training models from scratch and for fine-tuning. Interestingly, $\nu$SAM seems to significantly improve performance for MLP-Mixer architectures across both settings. The results are corroborated theoretically, where we show that SAM with an \emph{arbitrary} norm choice (which includes $\nu$SAM) can converge even with fixed perturbation radius.

PDF Details

ICML Conference 2024 Conference Paper

Improving SAM Requires Rethinking its Optimization Formulation

Wanyun Xie
Fabian Latorre
Kimon Antonakopoulos
Thomas Pethick
Volkan Cevher

This paper rethinks Sharpness-Aware Minimization (SAM), which is originally formulated as a zero-sum game where the weights of a network and a bounded perturbation try to minimize/maximize, respectively, the same differentiable loss. To fundamentally improve this design, we argue that SAM should instead be reformulated using the 0-1 loss. As a continuous relaxation, we follow the simple conventional approach where the minimizing (maximizing) player uses an upper bound (lower bound) surrogate to the 0-1 loss. This leads to a novel formulation of SAM as a bilevel optimization problem, dubbed as BiSAM. BiSAM with newly designed lower-bound surrogate loss indeed constructs stronger perturbation. Through numerical evidence, we show that BiSAM consistently results in improved performance when compared to the original SAM and variants, while enjoying similar computational complexity. Our code is available at https: //github. com/LIONS-EPFL/BiSAM.

Details

TMLR Journal 2024 Journal Article

Mixed Nash for Robust Federated Learning

Wanyun Xie
Thomas Pethick
Ali Ramezani-Kebrya
Volkan Cevher

We study robust federated learning (FL) within a game theoretic framework to alleviate the server vulnerabilities to even an informed adversary who can tailor training-time attacks. Specifically, we introduce RobustTailor, a simulation-based framework that prevents the adversary from being omniscient and derives its convergence guarantees. RobustTailor improves robustness to training-time attacks significantly while preserving almost the same privacy guarantees as standard robust aggregation schemes in FL. Empirical results under challenging attacks show that RobustTailor performs close to an upper bound with perfect knowledge of honest clients.

PDF Details

NeurIPS Conference 2024 Conference Paper

SAMPa: Sharpness-aware Minimization Parallelized

Wanyun Xie
Thomas Pethick
Volkan Cevher

Sharpness-aware minimization (SAM) has been shown to improve the generalization of neural networks. However, each SAM update requires sequentially computing two gradients, effectively doubling the per-iteration cost compared to base optimizers like SGD. We propose a simple modification of SAM, termed SAMPa, which allows us to fully parallelize the two gradient computations. SAMPa achieves a twofold speedup of SAM under the assumption that communication costs between devices are negligible. Empirical results show that SAMPa ranks among the most efficient variants of SAM in terms of computational time. Additionally, our method consistently outperforms SAM across both vision and language tasks. Notably, SAMPa theoretically maintains convergence guarantees even for fixed perturbation sizes, which is established through a novel Lyapunov function. We in fact arrive at SAMPa by treating this convergence guarantee as a hard requirement---an approach we believe is promising for developing SAM-based methods in general. Our code is available at https: //github. com/LIONS-EPFL/SAMPa.

PDF Details DOI

TMLR Journal 2023 Journal Article

Federated Learning under Covariate Shifts with Generalization Guarantees

Ali Ramezani-Kebrya
Fanghui Liu
Thomas Pethick
Grigorios Chrysos
Volkan Cevher

This paper addresses intra-client and inter-client covariate shifts in federated learning (FL) with a focus on the overall generalization performance. To handle covariate shifts, we formulate a new global model training paradigm and propose Federated Importance-Weighted Empirical Risk Minimization (FTW-ERM) along with improving density ratio matching methods without requiring perfect knowledge of the supremum over true ratios. We also propose the communication-efficient variant FITW-ERM with the same level of privacy guarantees as those of classical ERM in FL. We theoretically show that FTW-ERM achieves smaller generalization error than classical ERM under certain settings. Experimental results demonstrate the superiority of FTW-ERM over existing FL baselines in challenging imbalanced federated settings in terms of data distribution shifts across clients.

PDF Details

ICLR Conference 2023 Conference Paper

Finding Actual Descent Directions for Adversarial Training

Fabian Latorre
Igor Krawczuk
Leello Tadesse Dadi
Thomas Pethick
Volkan Cevher

Adversarial Training using a strong first-order adversary (PGD) is the gold standard for training Deep Neural Networks that are robust to adversarial examples. We show that, contrary to the general understanding of the method, the gradient at an optimal adversarial example may increase, rather than decrease, the adversarially robust loss. This holds independently of the learning rate. More precisely, we provide a counterexample to a corollary of Danskin's Theorem presented in the seminal paper of Madry et al. (2018) which states that a solution of the inner maximization problem can yield a descent direction for the adversarially robust loss. Based on a correct interpretation of Danskin's Theorem, we propose Danskin's Descent Direction (DDi) and we verify experimentally that it provides better directions than those obtained by a PGD adversary. Using the CIFAR10 dataset we further provide a real world example showing that our method achieves a steeper increase in robustness levels in the early stages of training, and is more stable than the PGD baseline. As a limitation, PGD training of ReLU+BatchNorm networks still performs better, but current theory is unable to explain this.

Details

TMLR Journal 2023 Journal Article

Revisiting adversarial training for the worst-performing class

Thomas Pethick
Grigorios Chrysos
Volkan Cevher

Despite progress in adversarial training (AT), there is a substantial gap between the top-performing and worst-performing classes in many datasets. For example, on CIFAR10, the accuracies for the best and worst classes are 74% and 23%, respectively. We argue that this gap can be reduced by explicitly optimizing for the worst-performing class, resulting in a min-max-max optimization formulation. Our method, called class focused online learning (CFOL), includes high probability convergence guarantees for the worst class loss and can be easily integrated into existing training setups with minimal computational overhead. We demonstrate an improvement to 32% in the worst class accuracy on CIFAR10, and we observe consistent behavior across CIFAR100 and STL10. Our study highlights the importance of moving beyond average accuracy, which is particularly important in safety-critical applications.

PDF Details

ICLR Conference 2023 Conference Paper

Solving stochastic weak Minty variational inequalities without increasing batch size

Thomas Pethick
Olivier Fercoq
Puya Latafat
Panagiotis Patrinos
Volkan Cevher

This paper introduces a family of stochastic extragradient-type algorithms for a class of nonconvex-nonconcave problems characterized by the weak Minty variational inequality (MVI). Unlike existing results on extragradient methods in the monotone setting, employing diminishing stepsizes is no longer possible in the weak MVI setting. This has led to approaches such as increasing batch sizes per iteration which can however be prohibitively expensive. In contrast, our proposed methods involves two stepsizes and only requires one additional oracle evaluation per iteration. We show that it is possible to keep one fixed stepsize while it is only the second stepsize that is taken to be diminishing, making it interesting even in the monotone setting. Almost sure convergence is established and we provide a unified analysis for this family of schemes which contains a nonlinear generalization of the celebrated primal dual hybrid gradient algorithm.

Details

NeurIPS Conference 2023 Conference Paper

Stable Nonconvex-Nonconcave Training via Linear Interpolation

Thomas Pethick
Wanyun Xie
Volkan Cevher

This paper presents a theoretical analysis of linear interpolation as a principled method for stabilizing (large-scale) neural network training. We argue that instabilities in the optimization process are often caused by the nonmonotonicity of the loss landscape and show how linear interpolation can help by leveraging the theory of nonexpansive operators. We construct a new optimization scheme called relaxed approximate proximal point (RAPP), which is the first 1-SCLI method to achieve last iterate convergence rates for $\rho$-comonotone problems while only requiring $\rho > -\tfrac{1}{2L}$. The construction extends to constrained and regularized settings. By replacing the inner optimizer in RAPP we rediscover the family of Lookahead algorithms for which we establish convergence in cohypomonotone problems even when the base optimizer is taken to be gradient descent ascent. The range of cohypomonotone problems in which Lookahead converges is further expanded by exploiting that Lookahead inherits the properties of the base optimizer. We corroborate the results with experiments on generative adversarial networks which demonstrates the benefits of the linear interpolation present in both RAPP and Lookahead.

PDF Details

ICLR Conference 2022 Conference Paper

Escaping limit cycles: Global convergence for constrained nonconvex-nonconcave minimax problems

Thomas Pethick
Puya Latafat
Panagiotis Patrinos
Olivier Fercoq
Volkan Cevher

This paper introduces a new extragradient-type algorithm for a class of nonconvex-nonconcave minimax problems. It is well-known that finding a local solution for general minimax problems is computationally intractable. This observation has recently motivated the study of structures sufficient for convergence of first order methods in the more general setting of variational inequalities when the so-called weak Minty variational inequality (MVI) holds. This problem class captures non-trivial structures as we demonstrate with examples, for which a large family of existing algorithms provably converge to limit cycles. Our results require a less restrictive parameter range in the weak MVI compared to what is previously known, thus extending the applicability of our scheme. The proposed algorithm is applicable to constrained and regularized problems, and involves an adaptive stepsize allowing for potentially larger stepsizes. Our scheme also converges globally even in settings where the underlying operator exhibits limit cycles.

Details

NeurIPS Conference 2021 Conference Paper

Sifting through the noise: Universal first-order methods for stochastic variational inequalities

Kimon Antonakopoulos
Thomas Pethick
Ali Kavis
Panayotis Mertikopoulos
Volkan Cevher

We examine a flexible algorithmic framework for solving monotone variational inequalities in the presence of randomness and uncertainty. The proposed template encompasses a wide range of popular first-order methods, including dual averaging, dual extrapolation and optimistic gradient algorithms – both adaptive and non-adaptive. Our first result is that the algorithm achieves the optimal rates of convergence for cocoercive problems when the profile of the randomness is known to the optimizer: $\mathcal{O}(1/\sqrt{T})$ for absolute noise profiles, and $\mathcal{O}(1/T)$ for relative ones. Subsequently, we drop all prior knowledge requirements (the absolute/relative variance of the randomness affecting the problem, the operator's cocoercivity constant, etc. ), and we analyze an adaptive instance of the method that gracefully interpolates between the above rates – i. e. it achieves $\mathcal{O}(1/\sqrt{T})$ and $\mathcal{O}(1/T)$ in the absolute and relative cases, respectively. To our knowledge, this is the first universality result of its kind in the literature and, somewhat surprisingly, it shows that an extra-gradient proxy step is not required to achieve optimal rates.

PDF Details

NeurIPS Conference 2021 Conference Paper

Subquadratic Overparameterization for Shallow Neural Networks

ChaeHwan Song
Ali Ramezani-Kebrya
Thomas Pethick
Armin Eftekhari
Volkan Cevher

Overparameterization refers to the important phenomenon where the width of a neural network is chosen such that learning algorithms can provably attain zero loss in nonconvex training. The existing theory establishes such global convergence using various initialization strategies, training modifications, and width scalings. In particular, the state-of-the-art results require the width to scale quadratically with the number of training data under standard initialization strategies used in practice for best generalization performance. In contrast, the most recent results obtain linear scaling either with requiring initializations that lead to the "lazy-training", or training only a single layer. In this work, we provide an analytical framework that allows us to adopt standard initialization strategies, possibly avoid lazy training, and train all layers simultaneously in basic shallow neural networks while attaining a desirable subquadratic scaling on the network width. We achieve the desiderata via Polyak-Lojasiewicz condition, smoothness, and standard assumptions on data, and use tools from random matrix theory.

PDF Details