Author name cluster

Ali Jadbabaie

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers
2 author rows

Possible papers (30)

ICML Conference 2025 Conference Paper

Fast Tensor Completion via Approximate Richardson Iteration

  • Mehrdad Ghadiri
  • Matthew Fahrbach
  • Yunbum Kook
  • Ali Jadbabaie

We study tensor completion (TC) through the lens of low-rank tensor decomposition (TD). Many TD algorithms use fast alternating minimization methods to solve highly structured linear regression problems at each step (e.g., for CP, Tucker, and tensor-train decompositions). However, such algebraic structure is often lost in TC regression problems, making direct extensions unclear. This work proposes a novel lifting method for approximately solving TC regression problems using structured TD regression algorithms as black-box subroutines, enabling sublinear-time methods. We analyze the convergence rate of our approximate Richardson iteration-based algorithm, and our empirical study shows that it can be 100x faster than direct methods for CP completion on real-world tensors.
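
The workhorse named here is approximate Richardson iteration. Purely as a point of reference, a minimal NumPy sketch of the classical Richardson update for a linear system (not the paper's lifted tensor-completion algorithm, whose regression subroutines are not detailed in the abstract) might look like:

    import numpy as np

    def richardson(A, b, omega=0.2, iters=200):
        # Classical Richardson iteration: x_{k+1} = x_k + omega * (b - A x_k).
        # Converges when the spectral radius of (I - omega * A) is below 1.
        x = np.zeros_like(b, dtype=float)
        for _ in range(iters):
            x = x + omega * (b - A @ x)
        return x

    # Tiny symmetric positive-definite example (illustrative only).
    A = np.array([[4.0, 1.0], [1.0, 3.0]])
    b = np.array([1.0, 2.0])
    x = richardson(A, b)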

NeurIPS Conference 2025 Conference Paper

Is Your Diffusion Model Actually Denoising?

  • Daniel Pfrommer
  • Zehao Dou
  • Christopher Scarvelis
  • Max Simchowitz
  • Ali Jadbabaie

We study the inductive biases of diffusion models with a conditioning variable, which have seen widespread application as both text-conditioned generative image models and observation-conditioned continuous control policies. We observe that when these models are queried conditionally, their generations consistently deviate from the idealized "denoising" process upon which diffusion models are formulated, inducing disagreement between popular sampling algorithms (e.g., DDPM, DDIM). We introduce Schedule Deviation, a rigorous measure which captures the rate of deviation from a standard denoising process, and provide a methodology to compute it. Crucially, we demonstrate that the deviation from an idealized denoising process occurs irrespective of the model capacity or amount of training data. We posit that this phenomenon occurs due to the difficulty of bridging distinct denoising flows across different parts of the conditioning space and show theoretically how such a phenomenon can arise through an inductive bias towards smoothness.

ICML Conference 2025 Conference Paper

On the Emergence of Position Bias in Transformers

  • Xinyi Wu 0003
  • Yifei Wang 0001
  • Stefanie Jegelka
  • Ali Jadbabaie

Recent studies have revealed various manifestations of position bias in transformer architectures, from the "lost-in-the-middle" phenomenon to attention sinks, yet a comprehensive theoretical understanding of how attention masks and positional encodings shape these biases remains elusive. This paper presents a graph-theoretic framework for analyzing position bias in multi-layer attention. Modeling attention masks as directed graphs, we quantify how tokens interact with contextual information based on their sequential positions. We uncover two key insights: First, causal masking inherently biases attention toward earlier positions, as tokens in deeper layers attend to increasingly more contextualized representations of earlier tokens. Second, we characterize the competing effects of the causal mask and relative positional encodings, such as the decay mask and rotary positional encoding (RoPE): while both mechanisms introduce distance-based decay within individual attention maps, their aggregate effect across multiple attention layers—coupled with the causal mask—leads to a trade-off between the long-term decay effects and the cumulative importance of early sequence positions. Through controlled numerical experiments, we not only validate our theoretical findings but also reproduce position biases observed in real-world LLMs. Our framework offers a principled foundation for understanding positional biases in transformers, shedding light on the complex interplay of attention mechanism components and guiding more informed architectural design.
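
As a toy illustration of the graph-theoretic view (a hypothetical construction, not the paper's formal framework), a causal mask over n token positions can be read as the adjacency matrix of a directed graph in which every position links to itself and to all earlier positions:

    import numpy as np

    n = 5  # number of token positions (illustrative)
    # Entry [i, j] = 1 means position i may attend to position j (j <= i).
    causal_mask = np.tril(np.ones((n, n)))

    # Earlier positions are attended to by more tokens, i.e. they have higher
    # in-degree in the directed graph; this asymmetry underlies the bias
    # toward early positions discussed above.
    in_degree = causal_mask.sum(axis=0)
    print(in_degree)  # [5. 4. 3. 2. 1.]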

ICLR Conference 2025 Conference Paper

Residual Connections and Normalization Can Provably Prevent Oversmoothing in GNNs

  • Michael Scholkemper
  • Xinyi Wu 0003
  • Ali Jadbabaie
  • Michael T. Schaub

Residual connections and normalization layers have become standard design choices for graph neural networks (GNNs), and were proposed as solutions to mitigate the oversmoothing problem in GNNs. However, how exactly these methods help alleviate the oversmoothing problem from a theoretical perspective is not well understood. In this work, we provide a formal and precise characterization of (linearized) GNNs with residual connections and normalization layers. We establish that (a) for residual connections, the incorporation of the initial features at each layer can prevent the signal from becoming too smooth, and determines the subspace of possible node representations; (b) batch normalization prevents a complete collapse of the output embedding space to a one-dimensional subspace through the individual rescaling of each column of the feature matrix. This results in the convergence of node representations to the top-k eigenspace of the message-passing operator; (c) moreover, we show that the centering step of a normalization layer — which can be understood as a projection — alters the graph signal in message-passing in such a way that relevant information can become harder to extract. Building on the last theoretical insight, we introduce GraphNormv2, a novel and principled normalization layer. GraphNormv2 features a learnable centering step designed to preserve the integrity of the original graph signal. Experimental results corroborate the effectiveness of our method, demonstrating improved performance across various GNN architectures and tasks.
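
To make the setting concrete, here is a minimal sketch of one linearized message-passing step with a residual connection back to the initial features, in the spirit of the models analyzed above; the mixing weight alpha and all shapes are illustrative assumptions, not the paper's exact parameterization:

    import numpy as np

    def linearized_layer(A_hat, X_l, X_0, W, alpha=0.1):
        # One linearized GNN step: propagate with the normalized message-passing
        # operator A_hat, then mix the initial features X_0 back in, which is the
        # mechanism credited above with preventing the signal from oversmoothing.
        return (1.0 - alpha) * (A_hat @ X_l @ W) + alpha * X_0

    # Toy usage on a 4-node graph with 3-dimensional features (illustrative only).
    rng = np.random.default_rng(0)
    A_hat = np.full((4, 4), 0.25)   # stand-in for a normalized adjacency matrix
    X_0 = rng.standard_normal((4, 3))
    W = np.eye(3)
    X_1 = linearized_layer(A_hat, X_0, X_0, W)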

ICML Conference 2024 Conference Paper

How to Escape Sharp Minima with Random Perturbations

  • Kwangjun Ahn
  • Ali Jadbabaie
  • Suvrit Sra

Modern machine learning applications have witnessed the remarkable success of optimization algorithms that are designed to find flat minima. Motivated by this design choice, we undertake a formal study that (i) formulates the notion of flat minima, and (ii) studies the complexity of finding them. Specifically, we adopt the trace of the Hessian of the cost function as a measure of flatness, and use it to formally define the notion of approximate flat minima. Under this notion, we then analyze algorithms that find approximate flat minima efficiently. For general cost functions, we discuss a gradient-based algorithm that finds an approximate flat local minimum efficiently. The main component of the algorithm is to use gradients computed from randomly perturbed iterates to estimate a direction that leads to flatter minima. For the setting where the cost function is an empirical risk over training data, we present a faster algorithm that is inspired by a recently proposed practical algorithm called sharpness-aware minimization, supporting its success in practice.
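
The key primitive described above is estimating a direction from gradients taken at randomly perturbed iterates. A minimal sketch of that perturb-and-average step is given below; the noise scale and sample count are illustrative choices, not the paper's schedule:

    import numpy as np

    def perturbed_gradient(grad_f, x, sigma=1e-2, samples=8, seed=0):
        # Average gradients evaluated at randomly perturbed copies of x, the
        # mechanism used above to steer iterates toward flatter minima.
        rng = np.random.default_rng(seed)
        g = np.zeros_like(x)
        for _ in range(samples):
            g += grad_f(x + sigma * rng.standard_normal(x.shape))
        return g / samples

    # Example with f(x) = 0.5 * ||x||^2, whose gradient is simply x.
    g = perturbed_gradient(lambda x: x, np.ones(3))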

ICLR Conference 2024 Conference Paper

Linear attention is (maybe) all you need (to understand Transformer optimization)

  • Kwangjun Ahn
  • Xiang Cheng
  • Minhak Song
  • Chulhee Yun
  • Ali Jadbabaie
  • Suvrit Sra

Transformer training is notoriously difficult, requiring a careful design of optimizers and use of various heuristics. We make progress towards understanding the subtleties of training Transformers by carefully studying a simple yet canonical linearized *shallow* Transformer model. Specifically, we train linear Transformers to solve regression tasks, inspired by J. von Oswald et al. (ICML 2023), and K. Ahn et al. (NeurIPS 2023). Most importantly, we observe that our proposed linearized models can reproduce several prominent aspects of Transformer training dynamics. Consequently, the results obtained in this paper suggest that a simple linearized Transformer model could actually be a valuable, realistic abstraction for understanding Transformer optimization.
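
For orientation, a generic linear (softmax-free) self-attention layer, the kind of simplification studied above, can be sketched as follows; the exact parameterization used for the in-context regression experiments is not reproduced here:

    import numpy as np

    def linear_attention(X, W_q, W_k, W_v):
        # Self-attention with the softmax removed: token outputs are weighted
        # sums of values, with weights given directly by Q K^T.
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        return (Q @ K.T) @ V

    # Toy shapes: 4 tokens of dimension 3 (illustrative only).
    rng = np.random.default_rng(0)
    X = rng.standard_normal((4, 3))
    W_q, W_k, W_v = (rng.standard_normal((3, 3)) for _ in range(3))
    out = linear_attention(X, W_q, W_k, W_v)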

NeurIPS Conference 2024 Conference Paper

On the Role of Attention Masks and LayerNorm in Transformers

  • Xinyi Wu
  • Amir Ajorlou
  • Yifei Wang
  • Stefanie Jegelka
  • Ali Jadbabaie

Self-attention is the key mechanism of transformers, which are the essential building blocks of modern foundation models. Recent studies have shown that pure self-attention suffers from an increasing degree of rank collapse as depth increases, limiting model expressivity and further utilization of model depth. The existing literature on rank collapse, however, has mostly overlooked other critical components in transformers that may alleviate the rank collapse issue. In this paper, we provide a general analysis of rank collapse under self-attention, taking into account the effects of attention masks and layer normalization (LayerNorm). In particular, we find that although pure masked attention still suffers from exponential collapse to a rank one subspace, sparse or local masked attention can provably slow down the collapse rate. In the case of self-attention with LayerNorm, we first show that for certain classes of value matrices, collapse to a rank one subspace still happens exponentially. However, through construction of nontrivial counterexamples, we then establish that with proper choice of value matrices, a general class of sequences may not converge to a rank one subspace, and the self-attention dynamics with LayerNorm can simultaneously possess a rich set of equilibria with any possible rank between one and full. Our result refutes the previous hypothesis that LayerNorm plays no role in the rank collapse of self-attention and suggests that self-attention with LayerNorm constitutes a much more expressive, versatile nonlinear dynamical system than what was originally thought.

ICLR Conference 2023 Conference Paper

A Non-Asymptotic Analysis of Oversmoothing in Graph Neural Networks

  • Xinyi Wu 0003
  • Zhengdao Chen
  • William Wei Wang
  • Ali Jadbabaie

Oversmoothing is a central challenge of building more powerful Graph Neural Networks (GNNs). While previous works have only demonstrated that oversmoothing is inevitable when the number of graph convolutions tends to infinity, in this paper, we precisely characterize the mechanism behind the phenomenon via a non-asymptotic analysis. Specifically, we distinguish between two different effects when applying graph convolutions—an undesirable mixing effect that homogenizes node representations in different classes, and a desirable denoising effect that homogenizes node representations in the same class. By quantifying these two effects on random graphs sampled from the Contextual Stochastic Block Model (CSBM), we show that oversmoothing happens once the mixing effect starts to dominate the denoising effect, and the number of layers required for this transition is $O(\log N/\log (\log N))$ for sufficiently dense graphs with $N$ nodes. We also extend our analysis to study the effects of Personalized PageRank (PPR), or equivalently, the effects of initial residual connections on oversmoothing. Our results suggest that while PPR mitigates oversmoothing at deeper layers, PPR-based architectures still achieve their best performance at a shallow depth and are outperformed by the graph convolution approach on certain graphs. Finally, we support our theoretical results with numerical experiments, which further suggest that the oversmoothing phenomenon observed in practice can be magnified by the difficulty of optimizing deep GNN models.

NeurIPS Conference 2023 Conference Paper

Convergence of Adam Under Relaxed Assumptions

  • Haochuan Li
  • Alexander Rakhlin
  • Ali Jadbabaie

In this paper, we provide a rigorous proof of convergence of the Adaptive Moment Estimation (Adam) algorithm for a wide class of optimization objectives. Despite the popularity and efficiency of the Adam algorithm in training deep neural networks, its theoretical properties are not yet fully understood, and existing convergence proofs require unrealistically strong assumptions, such as globally bounded gradients, to show the convergence to stationary points. In this paper, we show that Adam provably converges to $\epsilon$-stationary points with $\mathcal{O}(\epsilon^{-4})$ gradient complexity under far more realistic conditions. The key to our analysis is a new proof of boundedness of gradients along the optimization trajectory of Adam, under a generalized smoothness assumption according to which the local smoothness (i.e., Hessian norm when it exists) is bounded by a sub-quadratic function of the gradient norm. Moreover, we propose a variance-reduced version of Adam with an accelerated gradient complexity of $\mathcal{O}(\epsilon^{-3})$.
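
For reference, the update being analyzed is the textbook Adam step. A minimal NumPy sketch with standard default hyperparameters is below; this is the generic algorithm, not the paper's variance-reduced variant:

    import numpy as np

    def adam_step(x, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
        # One Adam step: exponential moving averages of the gradient and its
        # square, bias correction, then a coordinate-wise scaled update.
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        m_hat = m / (1 - b1 ** t)
        v_hat = v / (1 - b2 ** t)
        return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

    # Toy usage on f(x) = 0.5 * ||x||^2, whose gradient at x is x.
    x, m, v = np.ones(3), np.zeros(3), np.zeros(3)
    for t in range(1, 101):
        x, m, v = adam_step(x, x, m, v, t)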

NeurIPS Conference 2023 Conference Paper

Convex and Non-convex Optimization Under Generalized Smoothness

  • Haochuan Li
  • Jian Qian
  • Yi Tian
  • Alexander Rakhlin
  • Ali Jadbabaie

Classical analysis of convex and non-convex optimization methods often requires the Lipschitz continuity of the gradient, which limits the analysis to functions bounded by quadratics. Recent work relaxed this requirement to a non-uniform smoothness condition with the Hessian norm bounded by an affine function of the gradient norm, and proved convergence in the non-convex setting via gradient clipping, assuming bounded noise. In this paper, we further generalize this non-uniform smoothness condition and develop a simple, yet powerful analysis technique that bounds the gradients along the trajectory, thereby leading to stronger results for both convex and non-convex optimization problems. In particular, we obtain the classical convergence rates for (stochastic) gradient descent and Nesterov's accelerated gradient method in the convex and/or non-convex setting under this general smoothness condition. The new analysis approach does not require gradient clipping and allows heavy-tailed noise with bounded variance in the stochastic setting.
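
The affine condition referred to above, often written as $(L_0, L_1)$-smoothness in this line of work, requires (wherever the Hessian exists)

$\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|,$

so that classical Lipschitz smoothness is recovered when $L_1 = 0$; the further generalization developed in the paper replaces the right-hand side with a more general function of the gradient norm.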

NeurIPS Conference 2023 Conference Paper

Demystifying Oversmoothing in Attention-Based Graph Neural Networks

  • Xinyi Wu
  • Amir Ajorlou
  • Zihui Wu
  • Ali Jadbabaie

Oversmoothing in Graph Neural Networks (GNNs) refers to the phenomenon where increasing network depth leads to homogeneous node representations. While previous work has established that Graph Convolutional Networks (GCNs) exponentially lose expressive power, it remains controversial whether the graph attention mechanism can mitigate oversmoothing. In this work, we provide a definitive answer to this question through a rigorous mathematical analysis, by viewing attention-based GNNs as nonlinear time-varying dynamical systems and incorporating tools and techniques from the theory of products of inhomogeneous matrices and the joint spectral radius. We establish that, contrary to popular belief, the graph attention mechanism cannot prevent oversmoothing and loses expressive power exponentially. The proposed framework extends the existing results on oversmoothing for symmetric GCNs to a significantly broader class of GNN models, including random walk GCNs, Graph Attention Networks (GATs) and (graph) transformers. In particular, our analysis accounts for asymmetric, state-dependent and time-varying aggregation operators and a wide range of common nonlinear activation functions, such as ReLU, LeakyReLU, GELU and SiLU.

NeurIPS Conference 2023 Conference Paper

Provable Guarantees for Generative Behavior Cloning: Bridging Low-Level Stability and High-Level Behavior

  • Adam Block
  • Ali Jadbabaie
  • Daniel Pfrommer
  • Max Simchowitz
  • Russ Tedrake

We propose a theoretical framework for studying behavior cloning of complex expert demonstrations using generative modeling. Our framework invokes low-level controllers - either learned or implicit in position-command control - to stabilize imitation around expert demonstrations. We show that with (a) a suitable low-level stability guarantee and (b) a powerful enough generative model as our imitation learner, pure supervised behavior cloning can generate trajectories matching the per-time step distribution of essentially arbitrary expert trajectories in an optimal transport cost. Our analysis relies on a stochastic continuity property of the learned policy we call "total variation continuity" (TVC). We then show that TVC can be ensured with minimal degradation of accuracy by combining a popular data-augmentation regimen with a novel algorithmic trick: adding augmentation noise at execution time. We instantiate our guarantees for policies parameterized by diffusion models and prove that if the learner accurately estimates the score of the (noise-augmented) expert policy, then the distribution of imitator trajectories is close to the demonstrator distribution in a natural optimal transport distance. Our analysis constructs intricate couplings between noise-augmented trajectories, a technique that may be of independent interest. We conclude by empirically validating our algorithmic recommendations, and discussing implications for future research directions for better behavior cloning with generative modeling.
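
One plain reading of the "augmentation noise at execution time" trick mentioned above (a hedged sketch; the function name, noise scale, and policy are illustrative, not the paper's implementation) is to perturb the policy input at deployment with the same kind of noise used for training-time augmentation:

    import numpy as np

    def act_with_execution_noise(policy, observation, sigma=0.05, seed=0):
        # Inject augmentation-style noise into the observation at execution
        # time before querying the learned policy.
        rng = np.random.default_rng(seed)
        noisy_obs = observation + sigma * rng.standard_normal(observation.shape)
        return policy(noisy_obs)

    # Toy usage with a linear "policy" (illustrative only).
    action = act_with_execution_noise(lambda o: -0.5 * o, np.array([1.0, 2.0]))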

ICML Conference 2022 Conference Paper

Beyond Worst-Case Analysis in Stochastic Approximation: Moment Estimation Improves Instance Complexity

  • Jingzhao Zhang
  • Hongzhou Lin
  • Subhro Das
  • Suvrit Sra
  • Ali Jadbabaie

We study oracle complexity of gradient based methods for stochastic approximation problems. Though in many settings optimal algorithms and tight lower bounds are known for such problems, these optimal algorithms do not achieve the best performance when used in practice. We address this theory-practice gap by focusing on instance-dependent complexity instead of worst case complexity. In particular, we first summarize known instance-dependent complexity results and categorize them into three levels. We identify the domination relation between different levels and propose a fourth instance-dependent bound that dominates existing ones. We then provide a sufficient condition according to which an adaptive algorithm with moment estimation can achieve the proposed bound without knowledge of noise levels. Our proposed algorithm and its analysis provide a theoretical justification for the success of moment estimation as it achieves improved instance complexity.

ICML Conference 2022 Conference Paper

Neural Network Weights Do Not Converge to Stationary Points: An Invariant Measure Perspective

  • Jingzhao Zhang
  • Haochuan Li
  • Suvrit Sra
  • Ali Jadbabaie

This work examines the deep disconnect between existing theoretical analyses of gradient-based algorithms and the practice of training deep neural networks. Specifically, we provide numerical evidence that in large-scale neural network training (e.g., ImageNet + ResNet101 and WT103 + TransformerXL models), the neural network’s weights do not converge to stationary points where the gradient of the loss is zero. Remarkably, however, we observe that even though the weights do not converge to stationary points, the progress in minimizing the loss function halts and training loss stabilizes. Inspired by this observation, we propose a new perspective based on ergodic theory of dynamical systems to explain it. Rather than studying the evolution of weights, we study the evolution of the distribution of weights. We prove convergence of the distribution of weights to an approximate invariant measure, thereby explaining how the training loss can stabilize without weights necessarily converging to stationary points. We further discuss how this perspective can better align optimization theory with empirical observations in machine learning practice.

ICML Conference 2022 Conference Paper

On Convergence of Gradient Descent Ascent: A Tight Local Analysis

  • Haochuan Li
  • Farzan Farnia
  • Subhro Das
  • Ali Jadbabaie

Gradient Descent Ascent (GDA) methods are the mainstream algorithms for minimax optimization in generative adversarial networks (GANs). Convergence properties of GDA have drawn significant interest in the recent literature. Specifically, for $\min_{x} \max_{y} f(x, y)$ where $f$ is strongly-concave in $y$ and possibly nonconvex in $x$, (Lin et al., 2020) proved the convergence of GDA with a stepsize ratio $\eta_y/\eta_x=\Theta(\kappa^2)$ where $\eta_x$ and $\eta_y$ are the stepsizes for $x$ and $y$ and $\kappa$ is the condition number for $y$. While this stepsize ratio suggests a slow training of the min player, practical GAN algorithms typically adopt similar stepsizes for both variables, indicating a wide gap between theoretical and empirical results. In this paper, we aim to bridge this gap by analyzing the local convergence of general nonconvex-nonconcave minimax problems. We demonstrate that a stepsize ratio of $\Theta(\kappa)$ is necessary and sufficient for local convergence of GDA to a Stackelberg Equilibrium, where $\kappa$ is the local condition number for $y$. We prove a nearly tight convergence rate with a matching lower bound. We further extend the convergence guarantees to stochastic GDA and extra-gradient methods (EG). Finally, we conduct several numerical experiments to support our theoretical findings.
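
To pin down the object of study, here is a minimal sketch of simultaneous Gradient Descent Ascent with separate stepsizes, whose ratio is the quantity the analysis above concerns; the toy objective is purely illustrative:

    import numpy as np

    def gda(grad_x, grad_y, x, y, eta_x, eta_y, iters=500):
        # Simultaneous GDA: gradient descent on x, gradient ascent on y,
        # with independent stepsizes whose ratio eta_y / eta_x is the knob
        # analyzed above.
        for _ in range(iters):
            x, y = x - eta_x * grad_x(x, y), y + eta_y * grad_y(x, y)
        return x, y

    # Toy objective f(x, y) = 0.5*x**2 + x*y - 0.5*y**2 (illustrative only).
    x, y = gda(lambda x, y: x + y, lambda x, y: x - y,
               x=1.0, y=1.0, eta_x=0.05, eta_y=0.05)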

NeurIPS Conference 2021 Conference Paper

Complexity Lower Bounds for Nonconvex-Strongly-Concave Min-Max Optimization

  • Haochuan Li
  • Yi Tian
  • Jingzhao Zhang
  • Ali Jadbabaie

We provide a first-order oracle complexity lower bound for finding stationary points of min-max optimization problems where the objective function is smooth, nonconvex in the minimization variable, and strongly concave in the maximization variable. We establish a lower bound of $\Omega\left(\sqrt{\kappa}\epsilon^{-2}\right)$ for deterministic oracles, where $\epsilon$ defines the level of approximate stationarity and $\kappa$ is the condition number. Our lower bound matches the best existing upper bound in the $\epsilon$ and $\kappa$ dependence up to logarithmic factors. For stochastic oracles, we provide a lower bound of $\Omega\left(\sqrt{\kappa}\epsilon^{-2} + \kappa^{1/3}\epsilon^{-4}\right)$. It suggests that there is a gap between the best existing upper bound $\mathcal{O}(\kappa^3 \epsilon^{-4})$ and our lower bound in the condition number dependence.

ICML Conference 2020 Conference Paper

Complexity of Finding Stationary Points of Nonconvex Nonsmooth Functions

  • Jingzhao Zhang
  • Hongzhou Lin
  • Stefanie Jegelka
  • Suvrit Sra
  • Ali Jadbabaie

We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class contains important examples such as ReLU neural networks and others with non-differentiable activation functions. First, we show that finding an $\epsilon$-stationary point with first-order methods is impossible in finite time. Therefore, we introduce the notion of $(\delta, \epsilon)$-stationarity, a generalization that allows for a point to be within distance $\delta$ of an $\epsilon$-stationary point and reduces to $\epsilon$-stationarity for smooth functions. We propose a series of randomized first-order methods and analyze their complexity of finding a $(\delta, \epsilon)$-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on $\delta$. Empirically, our methods perform well for training ReLU neural networks.
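
In the smooth special case, the relaxed notion above can be read informally as: a point $x$ is $(\delta, \epsilon)$-stationary if some nearby point is $\epsilon$-stationary, i.e.

$\exists\, x' \ \text{with} \ \|x - x'\| \le \delta \ \text{and} \ \|\nabla f(x')\| \le \epsilon;$

the paper's precise definition for nonsmooth functions is more delicate and is not reproduced in the abstract.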

NeurIPS Conference 2020 Conference Paper

Estimation of Skill Distribution from a Tournament

  • Ali Jadbabaie
  • Anuran Makur
  • Devavrat Shah

In this paper, we study the problem of learning the skill distribution of a population of agents from observations of pairwise games in a tournament. These games are played among randomly drawn agents from the population. The agents in our model can be individuals, sports teams, or Wall Street fund managers. Formally, we postulate that the likelihoods of outcomes of games are governed by the parametric Bradley-Terry-Luce (or multinomial logit) model, where the probability of an agent beating another is the ratio between its skill level and the pairwise sum of skill levels, and the skill parameters are drawn from an unknown, non-parametric skill density of interest. The problem is, in essence, to learn a distribution from noisy, quantized observations. We propose a surprisingly simple and tractable algorithm that learns the skill density with near-optimal minimax mean squared error scaling as $n^{-1+\varepsilon}$, for any $\varepsilon>0$, so long as the density is smooth. Our approach brings together prior work on learning skill parameters from pairwise comparisons with kernel density estimation from non-parametric statistics. Furthermore, we prove information theoretic lower bounds which establish minimax optimality of the skill parameter estimation technique used in our algorithm. These bounds utilize a continuum version of Fano's method along with a careful covering argument. We apply our algorithm to various soccer leagues and world cups, cricket world cups, and mutual funds. We find that the entropy of a learnt distribution provides a quantitative measure of skill, which in turn provides rigorous explanations for popular beliefs about perceived qualities of sporting events, e.g., soccer league rankings. Finally, we apply our method to assess the skill distributions of mutual funds. Our results shed light on the abundance of low quality funds prior to the Great Recession of 2008, and the domination of the industry by more skilled funds after the financial crisis.
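
Writing the skill parameters as $w_i > 0$ (notation chosen here for illustration), the Bradley-Terry-Luce likelihood described above is

$\Pr[\text{agent } i \text{ beats agent } j] = \frac{w_i}{w_i + w_j},$

and the estimation task is to recover the density from which the $w_i$ are drawn using only the observed win/loss outcomes.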

NeurIPS Conference 2020 Conference Paper

Robust Federated Learning: The Case of Affine Distribution Shifts

  • Amirhossein Reisizadeh
  • Farzan Farnia
  • Ramtin Pedarsani
  • Ali Jadbabaie

Federated learning is a distributed paradigm that aims at training models using samples distributed across multiple users in a network while keeping the samples on users’ devices, with the aim of efficiency and protecting users' privacy. In such settings, the training data is often statistically heterogeneous and manifests various distribution shifts across users, which degrades the performance of the learnt model. The primary goal of this paper is to develop a robust federated learning algorithm that achieves satisfactory performance against distribution shifts in users' samples. To achieve this goal, we first consider a structured affine distribution shift in users' data that captures the device-dependent data heterogeneity in federated settings. This perturbation model is applicable to various federated learning problems such as image classification, where the images undergo device-dependent imperfections, e.g., different intensity, contrast, and brightness. To address affine distribution shifts across users, we propose a Federated Learning framework Robust to Affine distribution shifts (FLRA) that is provably robust against affine Wasserstein shifts to the distribution of observed samples. To solve FLRA's distributed minimax optimization problem, we propose a fast and efficient optimization method and provide convergence and performance guarantees via a Gradient Descent Ascent (GDA) method. We further prove generalization error bounds for the learnt classifier to show proper generalization from the empirical distribution of samples to the true underlying distribution. We perform several numerical experiments to empirically support FLRA. We show that an affine distribution shift indeed suffices to significantly decrease the performance of the learnt classifier in a new test user, and our proposed algorithm achieves a significant gain in comparison to standard federated learning and adversarial training methods.
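
One concrete reading of the structured affine shift described above (hypothetical notation; the paper's exact formulation is not reproduced in the abstract) is that user $i$ observes perturbed inputs

$\tilde{x} = \Lambda_i x + \delta_i,$

where the device-dependent pair $(\Lambda_i, \delta_i)$ captures effects such as the intensity, contrast, and brightness changes mentioned above.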

ICLR Conference 2020 Conference Paper

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

  • Jingzhao Zhang
  • Tianxing He
  • Suvrit Sra
  • Ali Jadbabaie

We provide a theoretical explanation for the effectiveness of gradient clipping in training deep neural networks. The key ingredient is a new smoothness condition derived from practical neural network training examples. We observe that gradient smoothness, a concept central to the analysis of first-order optimization algorithms that is often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks. Further, this smoothness positively correlates with the gradient norm, and contrary to standard assumptions in the literature, it can grow with the norm of the gradient. These empirical observations limit the applicability of existing theoretical analyses of algorithms that rely on a fixed bound on smoothness. These observations motivate us to introduce a novel relaxation of gradient smoothness that is weaker than the commonly used Lipschitz smoothness assumption. Under the new condition, we prove that two popular methods, namely, gradient clipping and normalized gradient, converge arbitrarily faster than gradient descent with fixed stepsize. We further explain why such adaptively scaled gradient methods can accelerate empirical convergence and verify our results empirically in popular neural network training settings.
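
The clipped update in question is the standard one; a minimal sketch follows, with the threshold and stepsize as illustrative choices rather than values from the paper:

    import numpy as np

    def clipped_gd_step(x, g, lr=0.1, threshold=1.0):
        # Gradient clipping: rescale the gradient whenever its norm exceeds the
        # threshold, so the effective stepsize shrinks where gradients are large.
        norm = np.linalg.norm(g)
        if norm > threshold:
            g = g * (threshold / norm)
        return x - lr * g

    # Toy usage on f(x) = 0.5 * ||x||^2, whose gradient at x is x.
    x = np.array([5.0, -3.0])
    for _ in range(100):
        x = clipped_gd_step(x, x)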

NeurIPS Conference 2019 Conference Paper

Are deep ResNets provably better than linear predictors?

  • Chulhee Yun
  • Suvrit Sra
  • Ali Jadbabaie

Recent results in the literature indicate that a residual network (ResNet) composed of a single residual block outperforms linear predictors, in the sense that all local minima in its optimization landscape are at least as good as the best linear predictor. However, these results are limited to a single residual block (i.e., shallow ResNets), instead of the deep ResNets composed of multiple residual blocks. We take a step towards extending this result to deep ResNets. We start with two motivating examples. First, we show that there exist datasets for which all local minima of a fully-connected ReLU network are no better than the best linear predictor, whereas a ResNet has strictly better local minima. Second, we show that even at the global minimum, the representation obtained from the residual block outputs of a 2-block ResNet does not necessarily improve monotonically over subsequent blocks, which highlights a fundamental difficulty in analyzing deep ResNets. Our main theorem on deep ResNets shows, under simple geometric conditions, that any critical point in the optimization landscape is either (i) at least as good as the best linear predictor; or (ii) the Hessian at this critical point has a strictly negative eigenvalue. Notably, our theorem shows that a chain of multiple skip-connections can improve the optimization landscape, whereas existing results study direct skip-connections to the last hidden layer or output layer. Finally, we complement our results by showing benign properties of the "near-identity regions" of deep ResNets, showing depth-independent upper bounds for the risk attained at critical points as well as the Rademacher complexity.

NeurIPS Conference 2019 Conference Paper

Small ReLU networks are powerful memorizers: a tight analysis of memorization capacity

  • Chulhee Yun
  • Suvrit Sra
  • Ali Jadbabaie

We study finite sample expressivity, i.e., memorization power of ReLU networks. Recent results require $N$ hidden nodes to memorize/interpolate arbitrary $N$ data points. In contrast, by exploiting depth, we show that 3-layer ReLU networks with $\Omega(\sqrt{N})$ hidden nodes can perfectly memorize most datasets with $N$ points. We also prove that width $\Theta(\sqrt{N})$ is necessary and sufficient for memorizing $N$ data points, proving tight bounds on memorization capacity. The sufficiency result can be extended to deeper networks; we show that an $L$-layer network with $W$ parameters in the hidden layers can memorize $N$ data points if $W = \Omega(N)$. Combined with a recent upper bound $O(WL\log W)$ on VC dimension, our construction is nearly tight for any fixed $L$. Subsequently, we analyze memorization capacity of residual networks under a general position assumption; we prove results that substantially reduce the known requirement of $N$ hidden nodes. Finally, we study the dynamics of stochastic gradient descent (SGD), and show that when initialized near a memorizing global minimum of the empirical risk, SGD quickly finds a nearby point with much smaller empirical risk.

NeurIPS Conference 2018 Conference Paper

Direct Runge-Kutta Discretization Achieves Acceleration

  • Jingzhao Zhang
  • Aryan Mokhtari
  • Suvrit Sra
  • Ali Jadbabaie

We study gradient-based optimization methods obtained by directly discretizing a second-order ordinary differential equation (ODE) related to the continuous limit of Nesterov's accelerated gradient method. When the function is smooth enough, we show that acceleration can be achieved by a stable discretization of this ODE using standard Runge-Kutta integrators. Specifically, we prove that under Lipschitz-gradient, convexity and order-$(s+2)$ differentiability assumptions, the sequence of iterates generated by discretizing the proposed second-order ODE converges to the optimal solution at a rate of $\mathcal{O}({N^{-2\frac{s}{s+1}}})$, where $s$ is the order of the Runge-Kutta numerical integrator. Furthermore, we introduce a new local flatness condition on the objective, under which rates even faster than $\mathcal{O}(N^{-2})$ can be achieved with low-order integrators and only gradient information. Notably, this flatness condition is satisfied by several standard loss functions used in machine learning. We provide numerical experiments that verify the theoretical rates predicted by our results.
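
For context, the ODE most commonly cited as the continuous limit of Nesterov's method is $\ddot{x} + \frac{3}{t}\dot{x} + \nabla f(x) = 0$. A minimal sketch of discretizing such an ODE with a classical fourth-order Runge-Kutta integrator is below; the paper's exact ODE, integrator order $s$, and stepsize choices are not taken from the abstract:

    import numpy as np

    def rk4_step(f, t, z, h):
        # One classical fourth-order Runge-Kutta step for z' = f(t, z).
        k1 = f(t, z)
        k2 = f(t + h / 2, z + h / 2 * k1)
        k3 = f(t + h / 2, z + h / 2 * k2)
        k4 = f(t + h, z + h * k3)
        return z + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

    def nesterov_ode(grad_f):
        # Second-order ODE x'' + (3/t) x' + grad_f(x) = 0, written as a
        # first-order system in z = (x, v).
        def f(t, z):
            x, v = z
            return np.array([v, -(3.0 / t) * v - grad_f(x)])
        return f

    # Toy run on f(x) = 0.5 * x**2 (scalar), starting away from the optimum.
    f = nesterov_ode(lambda x: x)
    t, h = 1.0, 0.05
    z = np.array([1.0, 0.0])
    for _ in range(200):
        z = rk4_step(f, t, z, h)
        t += h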

NeurIPS Conference 2018 Conference Paper

Escaping Saddle Points in Constrained Optimization

  • Aryan Mokhtari
  • Asuman Ozdaglar
  • Ali Jadbabaie

In this paper, we study the problem of escaping from saddle points in smooth nonconvex optimization problems subject to a convex set $\mathcal{C}$. We propose a generic framework that yields convergence to a second-order stationary point of the problem, if the convex set $\mathcal{C}$ is simple for a quadratic objective function. Specifically, our results hold if one can find a $\rho$-approximate solution of a quadratic program subject to $\mathcal{C}$ in polynomial time, where $\rho<1$ is a positive constant that depends on the structure of the set $\mathcal{C}$. Under this condition, we show that the sequence of iterates generated by the proposed framework reaches an $(\epsilon, \gamma)$-second order stationary point (SOSP) in at most $\mathcal{O}(\max\{\epsilon^{-2}, \rho^{-3}\gamma^{-3}\})$ iterations. We further characterize the overall complexity of reaching an SOSP when the convex set $\mathcal{C}$ can be written as a set of quadratic constraints and the objective function Hessian has a specific structure over the convex $\mathcal{C}$. Finally, we extend our results to the stochastic setting and characterize the number of stochastic gradient and Hessian evaluations to reach an $(\epsilon, \gamma)$-SOSP.

NeurIPS Conference 2013 Conference Paper

Online Learning of Dynamic Parameters in Social Networks

  • Shahin Shahrampour
  • Sasha Rakhlin
  • Ali Jadbabaie

This paper addresses the problem of online learning in a dynamic setting. We consider a social network in which each individual observes a private signal about the underlying state of the world and communicates with her neighbors at each time period. Unlike many existing approaches, the underlying state is dynamic, and evolves according to a geometric random walk. We view the scenario as an optimization problem where agents aim to learn the true state while suffering the smallest possible loss. Based on the decomposition of the global loss function, we introduce two update mechanisms, each of which generates an estimate of the true state. We establish a tight bound on the rate of change of the underlying state, under which individuals can track the parameter with a bounded variance. Then, we characterize explicit expressions for the steady-state mean-square deviation (MSD) of the estimates from the truth, per individual. We observe that only one of the estimators recovers the optimal MSD, which underscores the impact of the objective function decomposition on the learning quality. Finally, we provide an upper bound on the regret of the proposed methods, measured as an average of errors in estimating the parameter in a finite time.

ICRA Conference 2010 Conference Paper

A duality approach to path planning for multiple robots

  • Nader Motee
  • Ali Jadbabaie
  • George J. Pappas

In this paper, we propose an optimization-based framework for path planning for multiple robots in the presence of obstacles. The objective is to find multiple fixed-length paths for multiple robots that satisfy the following constraints: (i) bounded curvature, (ii) obstacle avoidance, and (iii) collision avoidance. First, we formulate a relaxation of the path planning problem using polygonal approximations. We show that the path planning problem for multiple robots under various constraints and missions, such as curvature and obstacle avoidance constraints as well as rendezvous and maximal total area coverage, can be cast as a nonconvex optimization problem. Then, we propose an alternative dual formulation that results in no duality gap. We show that the alternative dual function can be interpreted as the minimum potential energy of a multi-particle system with discontinuous spring-like forces. Finally, we show that using the proposed duality-based framework, an approximation of the minimal length path planning problem (also known as Dubins' problem) in the presence of obstacles can be solved efficiently using primal-dual interior-point methods.

ICRA Conference 2010 Conference Paper

Towards simplicial coverage repair for mobile robot teams

  • Jason C. Derenick
  • Vijay Kumar 0001
  • Ali Jadbabaie

In this note, we present initial results towards developing a distributed algorithm for repairing topological holes in the sensor cover of a mobile robot team. Central to our approach is the melding of recent advances in the application of computational homology (a sub-discipline of algebraic topology) to static sensor networks with relative metric information (i.e., relative pose). More precisely, we consider a greedy, hybrid (discrete-continuous) algorithm whereby a desired Čech complex, the simplicial complex that captures the underlying topology of the sensing cover, is iteratively generated using local rules (between multi-hop neighbors) and agents are driven towards achieving this topology via a gradient-ascent simplicial control law. Convergence of the proposed algorithm is established as a function of the convergence of the underlying simplicial control law, and the relationship of the latter to the spectrum of the combinatorial Laplacian is considered. Simulation results for teams operating in ℝ² are presented.

ICRA Conference 2009 Conference Paper

Multi-vehicle path planning in dynamically changing environments

  • Ali Ahmadzadeh
  • Nader Motee
  • Ali Jadbabaie
  • George J. Pappas

In this paper, we propose a path planning method for a nonholonomic multi-vehicle system in the presence of moving obstacles. The objective is to find multiple fixed-length paths for multiple vehicles with the following properties: (i) bounded curvature, (ii) obstacle avoidance, and (iii) collision avoidance. Our approach is based on polygonal approximation of a continuous curve. Using this idea, we formulate an arbitrarily fine relaxation of the path planning problem as a nonconvex feasibility optimization problem. Then, we propound a nonsmooth dynamical systems approach to find feasible solutions of this optimization problem. It is shown that the trajectories of the nonsmooth dynamical system always converge to some equilibria that correspond to the set of feasible solutions of the relaxed problem. The proposed framework can handle more complex mission scenarios for multi-vehicle systems such as rendezvous and area coverage.

ICRA Conference 2008 Conference Paper

Connectivity management in mobile robot teams

  • Ethan Stump
  • Ali Jadbabaie
  • Vijay Kumar 0001

We develop a framework for controlling a team of robots to maintain and improve a communication bridge between a stationary robot and an independently exploring robot in a walled environment. We make use of two metrics for characterizing the communication: the Fiedler value of the weighted Laplacian describing the communication interactions of all the robots in the system, and the k-connectivity matrix that expresses which robots can interact through k or fewer intermediary robots. At each step, we move in such a way as to improve the Fiedler value as much as possible while keeping the number of intermediary robots between the two robots of interest below a desired value. We demonstrate the use of this framework in a scenario where the hop-count constraint cannot be satisfied, but show that communication quality is maintained anyway.
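
For reference, the Fiedler value used above is the second-smallest eigenvalue of the weighted graph Laplacian. A minimal sketch of computing it from a symmetric weight matrix follows; the toy weights stand in for communication quality and are not the paper's model:

    import numpy as np

    def fiedler_value(W):
        # Weighted graph Laplacian L = D - W; the Fiedler value is its
        # second-smallest eigenvalue and is positive iff the graph is connected.
        L = np.diag(W.sum(axis=1)) - W
        eigvals = np.sort(np.linalg.eigvalsh(L))
        return eigvals[1]

    # Toy 3-robot line topology with link weights as communication quality.
    W = np.array([[0.0, 1.0, 0.0],
                  [1.0, 0.0, 0.5],
                  [0.0, 0.5, 0.0]])
    print(fiedler_value(W))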

ICRA Conference 2006 Conference Paper

Vision-based Control Laws for Distributed Flocking of Nonholonomic Agents

  • Nima Moshtagh
  • Ali Jadbabaie
  • Kostas Daniilidis

We study the problem of vision-based flocking and coordination of a group of kinematic agents in 2 and 3 dimensions. It is shown that in the absence of communication among agents, and by using only visual information, a group of mobile agents can align their velocity vectors and move in a formation. A coordinate-free control law is used to develop a vision-based input for each nonholonomic agent. The vision-based input does not rely on heading measurements, but only requires measurements of bearing, optical flow and time-to-collision, all of which can be efficiently measured.