Author name cluster

Yatin Dandi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers

2 author rows

ICML Conference 2025 Conference Paper

Fundamental limits of learning in sequence multi-index models and deep attention networks: high-dimensional asymptotics and sharp thresholds

Emanuele Troiani
Hugo Cui
Yatin Dandi
Florent Krzakala
Lenka Zdeborová

In this manuscript, we study the learning of deep attention neural networks, defined as the composition of multiple self-attention layers, with tied and low-rank weights. We first establish a mapping of such models to sequence multi-index models, a generalization of the widely studied multi-index model to sequential covariates, for which we establish a number of general results. In the context of Bayes-optimal learning, in the limit of large dimension $D$ and proportionally large number of samples $N$, we derive a sharp asymptotic characterization of the optimal performance as well as the performance of the best-known polynomial-time algorithm for this setting –namely approximate message-passing–, and characterize sharp thresholds on the minimal sample complexity required for better-than-random prediction performance. Our analysis uncovers, in particular, how the different layers are learned sequentially. Finally, we discuss how this sequential learning can also be observed in a realistic setup.

Details

NeurIPS Conference 2025 Conference Paper

Optimal Spectral Transitions in High-Dimensional Multi-Index Models

Leonardo Defilippis
Yatin Dandi
Pierre Mergny
Florent Krzakala
Bruno Loureiro

We consider the problem of how many samples from a Gaussian multi-index model are required to weakly reconstruct the relevant index subspace. Despite its increasing popularity as a testbed for investigating the computational complexity of neural networks, results beyond the single-index setting remain elusive. In this work, we introduce spectral algorithms based on the linearization of a message passing scheme tailored to this problem. Our main contribution is to show that the proposed methods achieve the optimal reconstruction threshold. Leveraging a high-dimensional characterization of the algorithms, we show that above the critical threshold the leading eigenvector correlates with the relevant index subspace, a phenomenon reminiscent of the Baik–Ben Arous–Peche (BBP) transition in spiked models arising in random matrix theory. Supported by numerical experiments and a rigorous theoretical framework, our work bridges critical gaps in the computational limits of weak learnability in multi-index model.

PDF Details

NeurIPS Conference 2025 Conference Paper

The Computational Advantage of Depth in Learning High-Dimensional Hierarchical Targets

Yatin Dandi
Luca Pesce
Lenka Zdeborová
Florent Krzakala

Understanding the advantages of deep neural networks trained by gradient descent (GD) compared to shallow models remains an open theoretical challenge. In this paper, we introduce a class of target functions (single and multi-index Gaussian hierarchical targets) that incorporate a hierarchy of latent subspace dimensionalities. This framework enables us to analytically study the learning dynamics and generalization performance of deep networks compared to shallow ones in the high-dimensional limit. Specifically, our main theorem shows that feature learning with GD successively reduces the effective dimensionality, transforming a high-dimensional problem into a sequence of lower-dimensional ones. This enables learning the target function with drastically less samples than with shallow networks. While the results are proven in a controlled training setting, we also discuss more common training procedures and argue that they learn through the same mechanisms. These findings open the way to further quantitative studies of the crucial role of depth in learning hierarchical structures with deep networks.

PDF Details

ICML Conference 2024 Conference Paper

Asymptotics of feature learning in two-layer networks after one gradient-step

Hugo Cui
Luca Pesce
Yatin Dandi
Florent Krzakala
Yue M. Lu
Lenka Zdeborová
Bruno Loureiro

In this manuscript, we investigate the problem of how two-layer neural networks learn features from data, and improve over the kernel regime, after being trained with a single gradient descent step. Leveraging the insight from (Ba et al. , 2022), we model the trained network by a spiked Random Features (sRF) model. Further building on recent progress on Gaussian universality (Dandi et al. , 2023), we provide an exact asymptotic description of the generalization error of the sRF in the high-dimensional limit where the number of samples, the width, and the input dimension grow at a proportional rate. The resulting characterization for sRFs also captures closely the learning curves of the original network model. This enables us to understand how adapting to the data is crucial for the network to efficiently learn non-linear functions in the direction of the gradient - where at initialization it can only express linear functions in this regime.

Details

JMLR Journal 2024 Journal Article

How Two-Layer Neural Networks Learn, One (Giant) Step at a Time

Yatin Dandi
Florent Krzakala
Bruno Loureiro
Luca Pesce
Ludovic Stephan

For high-dimensional Gaussian data, we investigate theoretically how the features of a two-layer neural network adapt to the structure of the target function through a few large batch gradient descent steps, leading to an improvement in the approximation capacity with respect to the initialization. First, we compare the influence of batch size to that of multiple (but finitely many) steps. For a single gradient step, a batch of size $n = O(d)$ is both necessary and sufficient to align with the target function, although only a single direction can be learned. In contrast, $n = O(d^2)$ is essential for neurons to specialize in multiple relevant directions of the target with a single gradient step. Even in this case, we show there might exist “hard” directions requiring $n = O(d^\ell)$ samples to be learned, where $\ell$ is known as the leap index of the target. Second, we show that the picture drastically improves over multiple gradient steps: a batch size of $n = O(d)$ is indeed sufficient to learn multiple target directions satisfying a staircase property, where more and more directions can be learned over time. Finally, we discuss how these directions allow for a drastic improvement in the approximation capacity and generalization error over the initialization, illustrating a separation of scale between the random features/lazy regime and the feature learning regime. Our technical analysis leverages a combination of techniques related to concentration, projection-based conditioning, and Gaussian equivalence, which we believe are of independent interest. By pinning down the conditions necessary for specialization and learning, our results highlight the intertwined role of the structure of the task to learn, the detail of the algorithm (the batch size), and the architecture (i.e., the number of hidden neurons), shedding new light on how neural networks adapt to the feature and learn complex task from data over time. [abs] [ pdf ][ bib ] &copy JMLR 2024. ( edit, beta )

PDF Details

ICML Conference 2024 Conference Paper

Online Learning and Information Exponents: The Importance of Batch size & Time/Complexity Tradeoffs

Luca Arnaboldi 0002
Yatin Dandi
Florent Krzakala
Bruno Loureiro
Luca Pesce
Ludovic Stephan

We study the impact of the batch size $n_b$ on the iteration time $T$ of training two-layer neural networks with one-pass stochastic gradient descent (SGD) on multi-index target functions of isotropic covariates. We characterize the optimal batch size minimizing the iteration time as a function of the hardness of the target, as characterized by the information exponents. We show that performing gradient updates with large batches $n_b \lesssim d^{\frac{\ell}{2}}$ minimizes the training time without changing the total sample complexity, where $\ell$ is the information exponent of the target to be learned and $d$ is the input dimension. However, larger batch sizes than $n_b \gg d^{\frac{\ell}{2}}$ are detrimental for improving the time complexity of SGD. We provably overcome this fundamental limitation via a different training protocol, Correlation loss SGD, which suppresses the auto-correlation terms in the loss function. We show that one can track the training progress by a system of low-dimensional ordinary differential equations (ODEs). Finally, we validate our theoretical results with numerical experiments.

Details

ICML Conference 2024 Conference Paper

The Benefits of Reusing Batches for Gradient Descent in Two-Layer Networks: Breaking the Curse of Information and Leap Exponents

Yatin Dandi
Emanuele Troiani
Luca Arnaboldi 0002
Luca Pesce
Lenka Zdeborová
Florent Krzakala

We investigate the training dynamics of two-layer neural networks when learning multi-index target functions. We focus on multi-pass gradient descent (GD) that reuses the batches multiple times and show that it significantly changes the conclusion about which functions are learnable compared to single-pass gradient descent. In particular, multi-pass GD with finite stepsize is found to overcome the limitations of gradient flow and single-pass GD given by the information exponent (Ben Arous et al. , 2021) and leap exponent (Abbe et al. , 2023) of the target function. We show that upon re-using batches, the network achieves in just two time steps an overlap with the target subspace even for functions not satisfying the staircase property (Abbe et al. , 2021). We characterize the (broad) class of functions efficiently learned in finite time. The proof of our results is based on the analysis of the Dynamical Mean-Field Theory (DMFT). We further provide a closed-form description of the dynamical process of the low-dimensional projections of the weights, and numerical experiments illustrating the theory.

Details

NeurIPS Conference 2023 Conference Paper

Universality laws for Gaussian mixtures in generalized linear models

Yatin Dandi
Ludovic Stephan
Florent Krzakala
Bruno Loureiro
Lenka Zdeborová

A recent line of work in high-dimensional statistics working under the Gaussian mixture hypothesis has led to a number of results in the context of empirical risk minimization, Bayesian uncertainty quantification, separation of kernel methods and neural networks, ensembling and fluctuation of random features. We provide rigorous proofs for the applicability of these results to a general class of datasets $(\mathbf{x_i}, y_i, {i=1, \dots, n})$ containing independent samples from a mixture distribution $\sum_{c\in\mathcal{C}} \rho_{c}P_{c}^{\mathbf{x}}$. Specifically, we consider the hypothesis class of generalized linear models $\hat{y} = F(\mathbf{\Theta}^{\top}\mathbf{x})$ and investigate the asymptotic joint statistics of a family of generalized linear estimators $(\mathbf{\Theta}^{(1)}, \dots, \mathbf{\Theta}^{(M)})$, obtained either from (a) minimizing an empirical risk $\hat{R_n}^{(m)}(\mathbf{\Theta}^{(m)}; \mathbf{X}, \mathbf{y})$ or (b) sampling from the associated Gibbs measure $\exp(-\beta n \hat{R_n}^{(m)}(\mathbf{\Theta}^{(m)}; \mathbf{X}, \mathbf{y}))$. Our main contribution is to characterize under which conditions the asymptotic joint statistics of this family depends (on a weak sense) only on the means and covariances of the class conditional features distribution $P_{c}^{\mathbf{x}}$. This allows us to prove the universality of different quantities of interest, including training, generalization errors, as well as the geometrical properties and correlations of the estimators.

PDF Details

AAAI Conference 2022 Conference Paper

Implicit Gradient Alignment in Distributed and Federated Learning

Yatin Dandi
Luis Barba
Martin Jaggi

A major obstacle to achieving global convergence in distributed and federated learning is the misalignment of gradients across clients, or mini-batches due to heterogeneity and stochasticity of the distributed data. In this work, we show that data heterogeneity can in fact be exploited to improve generalization performance through implicit regularization. One way to alleviate the effects of heterogeneity is to encourage the alignment of gradients across different clients throughout training. Our analysis reveals that this goal can be accomplished by utilizing the right optimization method that replicates the implicit regularization effect of SGD, leading to gradient alignment as well as improvements in test accuracies. Since the existence of this regularization in SGD completely relies on the sequential use of different mini-batches during training, it is inherently absent when training with large mini-batches. To obtain the generalization benefits of this regularization while increasing parallelism, we propose a novel GradAlign algorithm that induces the same implicit regularization while allowing the use of arbitrarily large batches in each update. We experimentally validate the benefits of our algorithm in different distributed and federated learning settings.

PDF Details

AAAI Conference 2021 Conference Paper

Generalized Adversarially Learned Inference

Yatin Dandi
Homanga Bharadhwaj
Abhishek Kumar
Piyush Rai

Allowing effective inference of latent vectors while training GANs can greatly increase their applicability in various downstream tasks. Recent approaches, such as ALI and BiGAN frameworks, develop methods of inference of latent variables in GANs by adversarially training an image generator along with an encoder to match two joint distributions of image and latent vector pairs. We generalize these approaches to incorporate multiple layers of feedback on reconstructions, self-supervision, and other forms of supervision based on prior or learned knowledge about the desired solutions. We achieve this by modifying the discriminator’s objective to correctly identify more than two joint distributions of tuples of an arbitrary number of random variables consisting of images, latent vectors, and other variables generated through auxiliary tasks, such as reconstruction and inpainting or as outputs of suitable pre-trained models. We design a non-saturating maximization objective for the generator-encoder pair and prove that the resulting adversarial game corresponds to a global optimum that simultaneously matches all the distributions. Within our proposed framework, we introduce a novel set of techniques for providing self-supervised feedback to the model based on properties, such as patch-level correspondence and cycle consistency of reconstructions. Through comprehensive experiments, we demonstrate the efficacy, scalability, and flexibility of the proposed approach for a variety of tasks. The appendix of the paper can be found at the following link: https: //drive. google. com/file/ d/1i99e682CqYWMEDXlnqkqrctGLVA9viiz/view? usp= sharing

PDF Details