Author name cluster

Haim Sompolinsky

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

13 papers

2 author rows

ICLR Conference 2025 Conference Paper

When narrower is better: the narrow width limit of Bayesian parallel branching neural networks

Zechen Zhang
Haim Sompolinsky

The infinite width limit of random neural networks is known to result in Neural Networks as Gaussian Process (NNGP) (Lee et al. (2018)), characterized by task-independent kernels. It is widely accepted that larger network widths contribute to improved generalization (Park et al. (2019)). However, this work challenges this notion by investigating the narrow width limit of the Bayesian Parallel Branching Neural Network (BPB-NN), an architecture that resembles neural networks with residual blocks. We demonstrate that when the width of a BPB-NN is significantly smaller compared to the number of training examples, each branch exhibits more robust learning due to a symmetry breaking of branches in kernel renormalization. Surprisingly, the performance of a BPB-NN in the narrow width limit is generally superior to or comparable to that achieved in the wide width limit in bias-limited scenarios. Furthermore, the readout norms of each branch in the narrow width limit are mostly independent of the architectural hyperparameters but generally reflective of the nature of the data. We demonstrate such phenomenon primarily in the branching graph neural networks, where each branch represents a different order of convolutions of the graph; we also extend the results to other more general architectures such as the residual-MLP and demonstrate that the narrow width effect is a general feature of the branching networks. Our results characterize a newly defined narrow-width regime for parallel branching networks in general.

Details

NeurIPS Conference 2024 Conference Paper

Dissecting the Interplay of Attention Paths in a Statistical Mechanics Theory of Transformers

Lorenzo Tiberi
Francesca Mignacco
Kazuki Irie
Haim Sompolinsky

Despite the remarkable empirical performance of Transformers, their theoretical understanding remains elusive. Here, we consider a deep multi-head self-attention network, that is closely related to Transformers yet analytically tractable. We develop a statistical mechanics theory of Bayesian learning in this model, deriving exact equations for the network's predictor statistics under the finite-width thermodynamic limit, i. e. , $N, P\rightarrow\infty$, $P/N=\mathcal{O}(1)$, where $N$ is the network width and $P$ is the number of training examples. Our theory shows that the predictor statistics are expressed as a sum of independent kernels, each one pairing different "attention paths", defined as information pathways through different attention heads across layers. The kernels are weighted according to a "task-relevant kernel combination" mechanism that aligns the total kernel with the task labels. As a consequence, this interplay between attention paths enhances generalization performance. Experiments confirm our findings on both synthetic and real-world sequence classification tasks. Finally, our theory explicitly relates the kernel combination mechanism to properties of the learned weights, allowing for a qualitative transfer of its insights to models trained via gradient descent. As an illustration, we demonstrate an efficient size reduction of the network, by pruning those attention heads that are deemed less relevant by our theory.

PDF Details DOI

NeurIPS Conference 2022 Conference Paper

A theory of weight distribution-constrained learning

Weishun Zhong
Ben Sorscher
Daniel Lee
Haim Sompolinsky

A central question in computational neuroscience is how structure determines function in neural networks. Recent large-scale connectomic studies have started to provide a wealth of structural information such as the distribution of excitatory/inhibitory cell and synapse types as well as the distribution of synaptic weights in the brains of different species. The emerging high-quality large structural datasets raise the question of what general functional principles can be gleaned from them. Motivated by this question, we developed a statistical mechanical theory of learning in neural networks that incorporates structural information as constraints. We derived an analytical solution for the memory capacity of the perceptron, a basic feedforward model of supervised learning, with constraint on the distribution of its weights. Interestingly, the theory predicts that the reduction in capacity due to the constrained weight-distribution is related to the Wasserstein distance between the cumulative distribution function of the constrained weights and that of the standard normal distribution. To test the theoretical predictions, we use optimal transport theory and information geometry to develop an SGD-based algorithm to find weights that simultaneously learn the input-output task and satisfy the distribution constraint. We show that training in our algorithm can be interpreted as geodesic flows in the Wasserstein space of probability distributions. Given a parameterized family of weight distributions, our theory predicts the shape of the distribution with optimal parameters. We apply our theory to map out the experimental parameter landscape for the estimated distribution of synaptic weights in mammalian cortex and show that our theory’s prediction for optimal distribution is close to the experimentally measured value. We further developed a statistical mechanical theory for teacher-student perceptron rule learning and ask for the best way for the student to incorporate prior knowledge of the rule (i. e. , the teacher). Our theory shows that it is beneficial for the learner to adopt different prior weight distributions during learning, and shows that distribution-constrained learning outperforms unconstrained and sign-constrained learning. Our theory and algorithm provide novel strategies for incorporating prior knowledge about weights into learning, and reveal a powerful connection between structure and function in neural networks.

PDF Details

NeurIPS Conference 2022 Conference Paper

Globally Gated Deep Linear Networks

Qianyi Li
Haim Sompolinsky

Recently proposed Gated Linear Networks (GLNs) present a tractable nonlinear network architecture, and exhibit interesting capabilities such as learning with local error signals and reduced forgetting in sequential learning. In this work, we introduce a novel gating architecture, named Globally Gated Deep Linear Networks (GGDLNs) where gating units are shared among all processing units in each layer, thereby decoupling the architectures of the nonlinear but unlearned gating and the learned linear processing motifs. We derive exact equations for the generalization properties of Bayesian Learning in these networks in the finite-width thermodynamic limit, defined by $N, P\rightarrow\infty$ while $P/N=O(1)$ where $N$ and $P$ are the hidden layers' width and size of training data sets respectfully. We find that the statistics of the network predictor can be expressed in terms of kernels that undergo shape renormalization through a data-dependent order-parameter matrix compared to the infinite-width Gaussian Process (GP) kernels. Our theory accurately captures the behavior of finite width GGDLNs trained with gradient descent (GD) dynamics. We show that kernel shape renormalization gives rise to rich generalization properties w. r. t. network width, depth, and $L_2$ regularization amplitude. Interestingly, networks with a large number of gating units behave similarly to standard ReLU architectures. Although gating units in the model do not participate in supervised learning, we show the utility of unsupervised learning of the gating parameters. Additionally, our theory allows the evaluation of the network capacity for learning multiple tasks by incorporating task-relevant information into the gating units. In summary, our work is the first exact theoretical solution of learning in a family of nonlinear networks with finite width. The rich and diverse behavior of the GGDLNs suggests that they are helpful analytically tractable models of learning single and multiple tasks, in finite-width nonlinear deep networks.

PDF Details

NeurIPS Conference 2016 Conference Paper

Optimal Architectures in a Solvable Model of Deep Networks

Jonathan Kadmon
Haim Sompolinsky

Deep neural networks have received a considerable attention due to the success of their training for real world machine learning applications. They are also of great interest to the understanding of sensory processing in cortical sensory hierarchies. The purpose of this work is to advance our theoretical understanding of the computational benefits of these architectures. Using a simple model of clustered noisy inputs and a simple learning rule, we provide analytically derived recursion relations describing the propagation of the signals along the deep network. By analysis of these equations, and defining performance measures, we show that these model networks have optimal depths. We further explore the dependence of the optimal architecture on the system parameters.

PDF Details

NeurIPS Conference 2010 Conference Paper

Inferring Stimulus Selectivity from the Spatial Structure of Neural Network Dynamics

Kanaka Rajan
L Abbott
Haim Sompolinsky

How are the spatial patterns of spontaneous and evoked population responses related? We study the impact of connectivity on the spatial pattern of fluctuations in the input-generated response of a neural network, by comparing the distribution of evoked and intrinsically generated activity across the different units. We develop a complementary approach to principal component analysis in which separate high-variance directions are typically derived for each input condition. We analyze subspace angles to compute the difference between the shapes of trajectories corresponding to different network states, and the orientation of the low-dimensional subspaces that driven trajectories occupy within the full space of neuronal activity. In addition to revealing how the spatiotemporal structure of spontaneous activity affects input-evoked responses, these methods can be used to infer input selectivity induced by network dynamics from experimentally accessible measures of spontaneous activity (e. g. from voltage- or calcium-sensitive optical imaging experiments). We conclude that the absence of a detailed spatial map of afferent inputs and cortical connectivity does not limit our ability to design spatially extended stimuli that evoke strong responses.

PDF Details

NeurIPS Conference 2010 Conference Paper

Short-term memory in neuronal networks through dynamical compressed sensing

Surya Ganguli
Haim Sompolinsky

Recent proposals suggest that large, generic neuronal networks could store memory traces of past input sequences in their instantaneous state. Such a proposal raises important theoretical questions about the duration of these memory traces and their dependence on network size, connectivity and signal statistics. Prior work, in the case of gaussian input sequences and linear neuronal networks, shows that the duration of memory traces in a network cannot exceed the number of neurons (in units of the neuronal time constant), and that no network can out-perform an equivalent feedforward network. However a more ethologically relevant scenario is that of sparse input sequences. In this scenario, we show how linear neural networks can essentially perform compressed sensing (CS) of past inputs, thereby attaining a memory capacity that {\it exceeds} the number of neurons. This enhanced capacity is achieved by a class of ``orthogonal recurrent networks and not by feedforward networks or generic recurrent networks. We exploit techniques from the statistical physics of disordered systems to analytically compute the decay of memory traces in such networks as a function of network size, signal sparsity and integration time. Alternately, viewed purely from the perspective of CS, this work introduces a new ensemble of measurement matrices derived from dynamical systems, and provides a theoretical analysis of their asymptotic performance. "

PDF Details

NeurIPS Conference 2001 Conference Paper

Correlation Codes in Neuronal Populations

Maoz Shamir
Haim Sompolinsky

Population codes often rely on the tuning of the mean responses to the stimulus parameters. However, this information can be greatly sup- pressed by long range correlations. Here we study the efﬁciency of cod- ing information in the second order statistics of the population responses. We show that the Fisher Information of this system grows linearly with the size of the system. We propose a bilinear readout model for extract- ing information from correlation codes, and evaluate its performance in discrimination and estimation tasks. It is shown that the main source of information in this system is the stimulus dependence of the variances of the single neuron responses.

PDF Details

NeurIPS Conference 2000 Conference Paper

An Information Maximization Approach to Overcomplete and Recurrent Representations

Oren Shriki
Haim Sompolinsky
Daniel Lee

The principle of maximizing mutual information is applied to learning overcomplete and recurrent representations. The underlying model con(cid: 173) sists of a network of input units driving a larger number of output units with recurrent interactions. In the limit of zero noise, the network is de(cid: 173) terministic and the mutual information can be related to the entropy of the output units. Maximizing this entropy with respect to both the feed(cid: 173) forward connections as well as the recurrent interactions results in simple learning rules for both sets of parameters. The conventional independent components (ICA) learning algorithm can be recovered as a special case where there is an equal number of output units and no recurrent con(cid: 173) nections. The application of these new learning rules is illustrated on a simple two-dimensional input example.

PDF Details

NeurIPS Conference 1999 Conference Paper

Algorithms for Independent Components Analysis and Higher Order Statistics

Daniel Lee
Uri Rokni
Haim Sompolinsky

A latent variable generative model with finite noise is used to de(cid: 173) scribe several different algorithms for Independent Components Anal(cid: 173) ysis (lCA). In particular, the Fixed Point ICA algorithm is shown to be equivalent to the Expectation-Maximization algorithm for maximum likelihood under certain constraints, allowing the conditions for global convergence to be elucidated. The algorithms can also be explained by their generic behavior near a singular point where the size of the opti(cid: 173) mal generative bases vanishes. An expansion of the likelihood about this singular point indicates the role of higher order correlations in determin(cid: 173) ing the features discovered by ICA. The application and convergence of these algorithms are demonstrated on a simple illustrative example.

PDF Details

NeurIPS Conference 1998 Conference Paper

Learning a Continuous Hidden Variable Model for Binary Data

Daniel Lee
Haim Sompolinsky

A directed generative model for binary data using a small number of hidden continuous units is investigated. A clipping nonlinear(cid: 173) ity distinguishes the model from conventional principal components analysis. The relationships between the correlations of the underly(cid: 173) ing continuous Gaussian variables and the binary output variables are utilized to learn the appropriate weights of the network. The advantages of this approach are illustrated on a translationally in(cid: 173) variant binary distribution and on handwritten digit images.

PDF Details

NeurIPS Conference 1998 Conference Paper

The Effect of Correlations on the Fisher Information of Population Codes

Hyoungsoo Yoon
Haim Sompolinsky

We study the effect of correlated noise on the accuracy of popu(cid: 173) lation coding using a model of a population of neurons that are broadly tuned to an angle in two-dimension. The fluctuations in the neuronal activity is modeled as a Gaussian noise with pairwise correlations which decays exponentially with the difference between the preferred orientations of the pair. By calculating the Fisher in(cid: 173) formation of the system, we show that in the biologically relevant regime of parameters positive correlations decrease the estimation capability of the network relative to the uncorrelated population. Moreover strong positive correlations result in information capac(cid: 173) ity which saturates to a finite value as the number of cells in the population grows. In contrast, negative correlations substantially increase the information capacity of the neuronal population.

PDF Details

NeurIPS Conference 1993 Conference Paper

Correlation Functions in a Large Stochastic Neural Network

Iris Ginzburg
Haim Sompolinsky

Most theoretical investigations of large recurrent networks focus on the properties of the macroscopic order parameters such as popu(cid: 173) lation averaged activities or average overlaps with memories. How(cid: 173) ever, the statistics of the fluctuations in the local activities may be an important testing ground for comparison between models and observed cortical dynamics. We evaluated the neuronal cor(cid: 173) relation functions in a stochastic network comprising of excitatory and inhibitory populations. We show that when the network is in a stationary state, the cross-correlations are relatively weak, i. e. , their amplitude relative to that of the auto-correlations are of or(cid: 173) der of 1/ N, N being the size of the interacting population. This holds except in the neighborhoods of bifurcations to nonstationary states. As a bifurcation point is approached the amplitude of the cross-correlations grows and becomes of order 1 and the decay time(cid: 173) constant diverges. This behavior is analogous to the phenomenon of critical slowing down in systems at thermal equilibrium near a critical point. Near a Hopf bifurcation the cross-correlations ex(cid: 173) hibit damped oscillations.

PDF Details