Arrow Research search

Author name cluster

Navin Goyal

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

22 papers
2 author rows

Possible papers


NeurIPS 2025 Conference Paper

For Better or for Worse, Transformers Seek Patterns for Memorization

  • Madhur Panwar
  • Gail Weiss
  • Navin Goyal
  • Antoine Bosselut

Memorization in language models is a critical yet poorly understood phenomenon. In this work, we investigate memorization in transformer-based language models by analyzing their memorization dynamics during training over multiple epochs. We find that memorization is neither a constant accumulation of sequences nor simply dictated by the recency of exposure to these sequences. Instead, much like generalization, memorization appears to be driven by pattern recognition. Tracking memorization dynamics in mixed datasets, we observe that models memorize different sub-datasets in distinct bursts, suggesting that each subset is associated with unique underlying patterns, and that the model prefers to learn these patterns in a consistent order. We also find that easily learnable patterns tend to support generalization on unseen data, while more complex patterns do not. Furthermore, in datasets with weak or absent patterns, larger models may delay memorization relative to smaller ones, a behavior we term $\textit{overthinking}$. Our results show that the subset of sequences memorized by a model over time is not arbitrary, and give insights into the internal processes a model goes through during training. Our code is available at: https://github.com/mdrpanwar/memorization-patterns.
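The abstract does not spell out its memorization criterion; a common operationalization, sketched here with an illustrative bigram-table stand-in for the language model (all names hypothetical), counts a training sequence as memorized when greedy decoding from its prefix reproduces it exactly:

```python
from collections import Counter, defaultdict

def greedy_complete(model, prefix, length):
    """Extend `prefix` token by token using the model's greedy next-token choice."""
    seq = list(prefix)
    for _ in range(length):
        seq.append(model.predict(tuple(seq)))
    return seq

def is_memorized(model, sequence, prefix_len):
    """Count `sequence` as memorized if greedy decoding from its prefix reproduces it."""
    out = greedy_complete(model, sequence[:prefix_len], len(sequence) - prefix_len)
    return out == list(sequence)

class BigramTable:
    """Toy stand-in for a trained LM: most-frequent next token per previous token."""
    def __init__(self, corpus):
        counts = defaultdict(Counter)
        for seq in corpus:
            for a, b in zip(seq, seq[1:]):
                counts[a][b] += 1
        self.table = {a: c.most_common(1)[0][0] for a, c in counts.items()}

    def predict(self, context):
        return self.table.get(context[-1], 0)

corpus = [(1, 2, 3, 4), (5, 6, 7, 8)]
model = BigramTable(corpus)
memorized = [seq for seq in corpus if is_memorized(model, seq, prefix_len=1)]
```

Tracking which sequences pass this check after each epoch yields the per-subset memorization curves the paper studies.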

ICLR 2024 Conference Paper

In-Context Learning through the Bayesian Prism

  • Madhur Panwar
  • Kabir Ahuja
  • Navin Goyal

In-context learning (ICL) is one of the surprising and useful features of large language models and a subject of intense research. Recently, stylized meta-learning-like ICL setups have been devised that train transformers on sequences of input-output pairs $(x, f(x))$. The function $f$ comes from a function class and generalization is checked by evaluating on sequences generated from unseen functions from the same class. One of the main discoveries in this line of research has been that for several function classes, such as linear regression, transformers successfully generalize to new functions in the class. However, the inductive biases of these models resulting in this behavior are not clearly understood. A model with unlimited training data and compute is a Bayesian predictor: it learns the pretraining distribution. In this paper we empirically examine how far this Bayesian perspective can help us understand ICL. To this end, we generalize the previous meta-ICL setup to a hierarchical meta-ICL setup which involves unions of multiple task families. We instantiate this setup on a diverse range of linear and nonlinear function families and find that transformers can do ICL in this setting as well. Where Bayesian inference is tractable, we find evidence that high-capacity transformers mimic the Bayesian predictor. The Bayesian perspective provides insights into the inductive bias of ICL and how transformers perform a particular task when they are trained on multiple tasks. We also find that transformers can learn to generalize to new function classes that were not seen during pretraining. This involves deviation from the Bayesian predictor. We examine these deviations in more depth, offering new insights and hypotheses.
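A minimal sketch of the kind of training data a hierarchical meta-ICL setup involves, assuming two illustrative task families (linear and sinusoidal; the paper's exact families and sampling scheme may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear_task(d):
    # f(x) = w . x with w drawn fresh for each prompt
    w = rng.normal(size=d)
    return lambda x: x @ w

def sample_sine_task(d):
    # a simple nonlinear family for contrast (hypothetical choice)
    freq = rng.uniform(0.5, 2.0)
    return lambda x: np.sin(freq * x.sum())

def hierarchical_meta_icl_prompt(d=4, n_points=8):
    """Hierarchical meta-ICL: first sample a task family, then a function f
    within it, then emit the (x, f(x)) sequence that forms one training prompt."""
    families = [sample_linear_task, sample_sine_task]
    f = families[rng.integers(len(families))](d)
    xs = rng.normal(size=(n_points, d))
    ys = np.array([f(x) for x in xs])
    return xs, ys

xs, ys = hierarchical_meta_icl_prompt()
```

The transformer is trained on many such prompts; the Bayesian view asks whether its in-context predictions match the posterior over families and functions implied by this generator.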

NeurIPS 2024 Conference Paper

InversionView: A General-Purpose Method for Reading Information from Neural Activations

  • Xinting Huang
  • Madhur Panwar
  • Navin Goyal
  • Michael Hahn

The inner workings of neural networks can be better understood if we can fully decipher the information encoded in neural activations. In this paper, we argue that this information is embodied by the subset of inputs that give rise to similar activations. We propose InversionView, which allows us to practically inspect this subset by sampling from a trained decoder model conditioned on activations. This helps uncover the information content of activation vectors, and facilitates understanding of the algorithms implemented by transformer models. We present four case studies where we investigate models ranging from small transformers to GPT-2. In these studies, we show that InversionView can reveal clear information contained in activations, including basic information about tokens appearing in the context, as well as more complex information, such as the count of certain tokens, their relative positions, and abstract knowledge about the subject. We also provide causally verified circuits to confirm the decoded information.

NeurIPS 2023 Conference Paper

Monitor-Guided Decoding of Code LMs with Static Analysis of Repository Context

  • Lakshya A Agrawal
  • Aditya Kanade
  • Navin Goyal
  • Shuvendu Lahiri
  • Sriram Rajamani

Language models of code (LMs) work well when the surrounding code provides sufficient context. This is not true when it becomes necessary to use types, functionality or APIs defined elsewhere in the repository or a linked library, especially those not seen during training. LMs suffer from limited awareness of such global context and end up hallucinating. Integrated development environments (IDEs) assist developers in understanding repository context using static analysis. We extend this assistance, enjoyed by developers, to LMs. We propose monitor-guided decoding (MGD) where a monitor uses static analysis to guide the decoding. We construct a repository-level dataset PragmaticCode for method-completion in Java and evaluate MGD on it. On models of varying parameter scale, by monitoring for type-consistent object dereferences, MGD consistently improves compilation rates and agreement with ground truth. Further, LMs with fewer parameters, when augmented with MGD, can outperform larger LMs. With MGD, SantaCoder-1.1B achieves better compilation rate and next-identifier match than the much larger text-davinci-003 model. We also conduct a generalizability study to evaluate the ability of MGD to generalize to multiple programming languages (Java, C# and Rust), coding scenarios (e.g., correct number of arguments to method calls), and to enforce richer semantic constraints (e.g., stateful API protocols). Our data and implementation are available at https://github.com/microsoft/monitors4codegen.
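A toy sketch of the core MGD idea, restricting decoding to monitor-approved tokens; the type, member sets, vocabulary, and scores below are all made up, and a real monitor would query a static analyzer rather than a lookup table:

```python
# Stand-in for what a static analyzer would report at a dereference point
# (the type and its member set here are hypothetical).
TYPE_MEMBERS = {
    "List<String>": {"add", "get", "size", "clear"},
}

def monitor_guided_step(scores, vocab, allowed):
    """One greedy decoding step, restricted to monitor-approved tokens."""
    candidates = [(s, tok) for s, tok in zip(scores, vocab) if tok in allowed]
    return max(candidates)[1]

vocab = ["push", "add", "length", "size"]
scores = [0.9, 0.3, 0.8, 0.1]   # raw LM preferences; "push"/"length" would hallucinate
token = monitor_guided_step(scores, vocab, TYPE_MEMBERS["List<String>"])
```

The unconstrained model would emit "push", which does not exist on the dereferenced type; the monitor forces the highest-scoring valid member instead.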

UAI 2022 Conference Paper

Robust identifiability in linear structural equation models of causal inference

  • Karthik Abinav Sankararaman
  • Anand Louis
  • Navin Goyal

We consider the problem of robust parameter estimation from observational data in the context of linear structural equation models (LSEMs). Under various conditions on LSEMs and the model parameters the prior work provides efficient algorithms to recover the parameters. However, these results are often about generic identifiability. In practice, generic identifiability is not sufficient and we need robust identifiability: small changes in the observational data should not affect the parameters by a huge amount. Robust identifiability has received far less attention and remains poorly understood. Sankararaman et al. (2019) recently provided a set of sufficient conditions on parameters under which robust identifiability is feasible. However, a limitation of their work is that their results only apply to a small sub-class of LSEMs, called “bow-free paths.” In this work, we show that for any “bow-free model”, in all but $\frac{1}{\mathrm{poly}(n)}$-measure of instances robust identifiability holds. Moreover, whenever an instance is robustly identifiable, the algorithm proposed in Foygel et al. (2012) can be used to recover the parameters in a robust fashion. In contrast, for generic identifiability Foygel et al. (2012) proved that with measure $1$, instances are generically identifiable. Thus, we show that robust identifiability is a strictly harder problem than generic identifiability. Finally, we validate our results on both simulated and real-world datasets.

NeurIPS 2021 Conference Paper

Learning and Generalization in RNNs

  • Abhishek Panigrahi
  • Navin Goyal

Simple recurrent neural networks (RNNs) and their more advanced cousins such as LSTMs have been very successful in sequence modeling. Their theoretical understanding, however, is lacking and has not kept pace with the progress for feedforward networks, where a reasonably complete understanding has emerged in the special case of highly overparametrized one-hidden-layer networks. In this paper, we make progress towards remedying this situation by proving that RNNs can learn functions of sequences. In contrast to previous work, which could only deal with functions of sequences that are sums of functions of individual tokens in the sequence, we allow general functions. Conceptually and technically, we introduce new ideas which enable us to extract information from the hidden state of the RNN in our proofs, addressing a crucial weakness in previous work. We illustrate our results on some regular language recognition problems.

ICLR 2020 Conference Paper

Effect of Activation Functions on the Training of Overparametrized Neural Nets

  • Abhishek Panigrahi
  • Abhishek Shetty
  • Navin Goyal

It is well-known that overparametrized neural networks trained using gradient-based methods quickly achieve small training error with appropriate hyperparameter settings. Recent papers have proved this statement theoretically for highly overparametrized networks under reasonable assumptions. These results either assume that the activation function is ReLU or they depend on the minimum eigenvalue of a certain Gram matrix. In the latter case, existing works only prove that this minimum eigenvalue is non-zero and do not provide quantitative bounds, which require that this eigenvalue be large. Empirically, a number of alternative activation functions have been proposed which tend to perform better than ReLU at least in some settings, but no clear understanding has emerged. This state of affairs underscores the importance of theoretically understanding the impact of activation functions on training. In the present paper, we provide theoretical results about the effect of the activation function on the training of highly overparametrized 2-layer neural networks. A crucial property that governs the performance of an activation is whether or not it is smooth:

  • For non-smooth activations such as ReLU, SELU and ELU, which are not smooth because there is a point where either the first-order or second-order derivative is discontinuous, all eigenvalues of the associated Gram matrix are large under minimal assumptions on the data.
  • For smooth activations such as tanh, swish and polynomials, which have derivatives of all orders at all points, the situation is more complex: if the subspace spanned by the data has small dimension then the minimum eigenvalue of the Gram matrix can be small, leading to slow training. But if the dimension is large and the data satisfies another mild condition, then the eigenvalues are large. If we allow deep networks, then the small data dimension is not a limitation provided that the depth is sufficient.

We discuss a number of extensions and applications of these results.
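A small numerical sketch of the central object, the minimum eigenvalue of a Gram matrix built from random first-layer features; this is only a finite-width proxy for the kernels analyzed in the paper, and the sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def feature_gram(X, activation, width=2000):
    """Gram matrix G_ij = <phi(x_i), phi(x_j)> / width for random first-layer
    features phi(x) = activation(W x); a finite-width proxy for the kernels
    appearing in this line of analysis."""
    W = rng.normal(size=(width, X.shape[1]))
    Phi = activation(X @ W.T)
    return Phi @ Phi.T / width

n, d = 20, 10
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # unit-norm data points

relu = lambda z: np.maximum(z, 0.0)
lam_relu = np.linalg.eigvalsh(feature_gram(X, relu)).min()
lam_tanh = np.linalg.eigvalsh(feature_gram(X, np.tanh)).min()
# both Grams are positive semidefinite by construction; the theory concerns how
# far above zero the minimum eigenvalue stays for each activation
```

Repeating the experiment with data confined to a low-dimensional subspace is one way to probe the smooth-activation degeneracy the paper describes.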

UAI 2019 Conference Paper

Stability of Linear Structural Equation Models of Causal Inference

  • Karthik Abinav Sankararaman
  • Anand Louis
  • Navin Goyal

We consider numerical stability of the parameter recovery problem in Linear Structural Equation Model (LSEM) of causal inference. Numerical stability is essential for the recovered parameters to be reliable. A long line of work starting from Wright (1920) has focused on understanding which sub-classes of LSEM allow for efficient parameter recovery. Despite decades of study, this question is not yet fully resolved. The goal of the present paper is complementary to this line of work: we want to understand the stability of the recovery problem in the cases when efficient recovery is possible. Numerical stability of Pearl’s notion of causality was first studied in Schulman and Srivastava (2016) using the concept of condition number where they provide ill-conditioned examples. In this work, we provide a condition number analysis for the LSEM. First we prove that under a sufficient condition, for a certain sub-class of LSEM that are bow-free (Brito and Pearl (2002)), parameter recovery is numerically stable. We further prove that randomly chosen input parameters for this family satisfy the condition with a substantial probability. Hence for this family, on a large subset of parameter space, recovery is stable. Next we construct an example of LSEM on four vertices with unbounded condition number. We then corroborate our theoretical findings via simulations as well as real-world experiments for a sociology application. Finally, we provide a general heuristic for estimating the condition number of any LSEM instance.
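The paper's condition-number viewpoint can be illustrated on the smallest bow-free LSEM, two variables with X2 = λ·X1 + noise; the perturbation heuristic below is a generic numerical sketch, not the paper's estimator:

```python
import numpy as np

def recover_lambda(Sigma):
    """Parameter recovery for the two-variable LSEM X2 = lam * X1 + noise:
    lam = Cov(X1, X2) / Var(X1)."""
    return Sigma[0, 1] / Sigma[0, 0]

def condition_estimate(Sigma, eps=1e-6, trials=200, seed=0):
    """Heuristic condition number: worst observed relative parameter change
    over random relative perturbations of the observed covariance."""
    rng = np.random.default_rng(seed)
    lam = recover_lambda(Sigma)
    worst = 0.0
    for _ in range(trials):
        E = rng.normal(size=(2, 2))
        E = (E + E.T) / 2                      # keep the perturbation symmetric
        P = Sigma + eps * np.linalg.norm(Sigma) * E / np.linalg.norm(E)
        rel_change = abs(recover_lambda(P) - lam) / abs(lam)
        worst = max(worst, rel_change / eps)
    return worst

lam_true = 0.8
Sigma = np.array([[1.0, lam_true], [lam_true, lam_true**2 + 1.0]])
kappa = condition_estimate(Sigma)
```

A well-conditioned instance keeps this ratio modest; the paper's four-vertex counterexample is one where the analogous quantity is unbounded.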

AAAI 2017 Conference Paper

Heavy-Tailed Analogues of the Covariance Matrix for ICA

  • Joseph Anderson
  • Navin Goyal
  • Anupama Nandi
  • Luis Rademacher

Independent Component Analysis (ICA) is the problem of learning a square matrix A, given samples of X = AS, where S is a random vector with independent coordinates. Most existing algorithms are provably efficient only when each Si has finite and moderately valued fourth moment. However, there are practical applications where this assumption need not be true, such as speech and finance. Algorithms have been proposed for heavy-tailed ICA, but they are not practical, using random walks and the full power of the ellipsoid algorithm multiple times. The main contributions of this paper are: (1) A practical algorithm for heavy-tailed ICA that we call HTICA. We provide theoretical guarantees and show that it outperforms other algorithms in some heavy-tailed regimes, both on real and synthetic data. Like the current state-of-the-art, the new algorithm is based on the centroid body (a first moment analogue of the covariance matrix). Unlike the state-of-the-art, our algorithm is practically efficient. To achieve this, we use explicit analytic representations of the centroid body, which bypasses the use of the ellipsoid method and random walks. (2) We study how heavy tails affect different ICA algorithms, including HTICA. Somewhat surprisingly, we show that some algorithms that use the covariance matrix or higher moments can successfully solve a range of ICA instances with infinite second moment. We study this theoretically and experimentally, with both synthetic and real-world heavy-tailed data.

ICML 2016 Conference Paper

Non-negative Matrix Factorization under Heavy Noise

  • Chiranjib Bhattacharyya
  • Navin Goyal
  • Ravindran Kannan
  • Jagdeep Pani

The Noisy Non-negative Matrix Factorization (NMF) problem is: given a data matrix A (d × n), find non-negative matrices B, C (d × k and k × n respectively) so that A = BC + N, where N is a noise matrix. Existing polynomial time algorithms with proven error guarantees require EACH column N_{·j} to have l1 norm much smaller than ||(BC)_{·j}||_1, which could be very restrictive. In important applications of NMF such as Topic Modeling, as well as under theoretical noise models (e.g., Gaussian with high sigma), almost EVERY column of N violates this condition. We introduce the heavy noise model, which only requires the average noise over large subsets of columns to be small. We initiate a study of Noisy NMF under the heavy noise model. We show that our noise model subsumes noise models of theoretical and practical interest (e.g., Gaussian noise of maximum possible sigma). We then devise an algorithm TSVDNMF which, under certain assumptions on B and C, solves the problem under heavy noise. Our error guarantees match those of previous algorithms. Our running time of O(k(d+n)^2) is substantially better than the O(dn^3) of the previous best. Our assumption on B is weaker than the “Separability” assumption made by all previous results. We provide empirical justification for our assumptions on C. We also provide the first proof of identifiability (uniqueness of B) for noisy NMF which is not based on separability and does not use hard-to-check geometric conditions. Our algorithm outperforms earlier polynomial time algorithms both in time and error, particularly in the presence of high noise.
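The heavy noise model's key point, that per-column noise can be large while averages over large column subsets are small, can be checked numerically for Gaussian noise (sizes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 50, 1000
N = rng.normal(size=(d, n))                  # dense Gaussian noise matrix

per_column = np.abs(N).sum(axis=0)           # l1 noise of each column: large
avg_noise = np.abs(N.mean(axis=1)).sum()     # l1 norm of the column average

# Every single column carries l1 noise of order d, yet the average over all n
# columns concentrates and shrinks like 1/sqrt(n) -- exactly the regime the
# heavy noise model is designed to tolerate.
```

This is why the per-column condition of earlier algorithms fails for Gaussian noise while the averaged condition holds.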

FOCS 2015 Conference Paper

Heavy-Tailed Independent Component Analysis

  • Joseph Anderson
  • Navin Goyal
  • Anupama Nandi
  • Luis Rademacher

Independent component analysis (ICA) is the problem of efficiently recovering a matrix A ∈ ℝ^{n×n} from i.i.d. observations of X = AS, where S ∈ ℝ^n is a random vector with mutually independent coordinates. This problem has been intensively studied, but all existing efficient algorithms with provable guarantees require that the coordinates S_i have finite fourth moments. We consider the heavy-tailed ICA problem where we do not make this assumption. This problem has also received considerable attention in the applied literature. In the present work, we first give a provably efficient algorithm that works under the assumption that for some constant γ > 0, each S_i has a finite (1+γ)-moment, thus substantially weakening the moment condition required for the ICA problem to be solvable. We then give an algorithm that works under the assumption that the matrix A has orthogonal columns but requires no moment assumptions. Our techniques draw ideas from convex geometry and exploit standard properties of the multivariate spherical Gaussian distribution in a novel way.

SODA 2014 Conference Paper

Annotations for Sparse Data Streams

  • Amit Chakrabarti
  • Graham Cormode
  • Navin Goyal
  • Justin Thaler

Motivated by the surging popularity of commercial cloud computing services, a number of recent works have studied annotated data streams and variants thereof. In this setting, a computationally weak verifier (cloud user), lacking the resources to store and manipulate his massive input locally, accesses a powerful but untrusted prover (cloud service). The verifier must work within the restrictive data streaming paradigm. The prover, who can annotate the data stream as it is read, must not just supply the final answer but also convince the verifier of its correctness. Ideally, both the amount of annotation from the prover and the space used by the verifier should be sublinear in the relevant input size parameters. A rich theory of such algorithms—which we call schemes—has started to emerge. Prior work has shown how to leverage the prover's power to efficiently solve problems that have no non-trivial standard data stream algorithms. However, even though optimal schemes are now known for several basic problems, such optimality holds only for streams whose length is commensurate with the size of the data universe. In contrast, many real-world data sets are relatively sparse, including graphs that contain only o(n^2) edges, and IP traffic streams that contain far fewer than the total number of possible IP addresses, 2^128 in IPv6. Here we design the first annotation schemes that allow both the annotation and the space usage to be sublinear in the total number of stream updates rather than the size of the data universe. We solve significant problems, including variations of INDEX, SET-DISJOINTNESS, and FREQUENCY-MOMENTS, plus several natural problems on graphs. On the other hand, we give a new lower bound that, for the first time, rules out smooth tradeoffs between annotation and space usage for a specific problem.
Our technique brings out new nuances in Merlin-Arthur communication complexity models, and provides a separation between online versions of the MA and AMA models.

ICML 2013 Conference Paper

Thompson Sampling for Contextual Bandits with Linear Payoffs

  • Shipra Agrawal
  • Navin Goyal

Thompson Sampling is one of the oldest heuristics for multi-armed bandit problems. It is a randomized algorithm based on Bayesian ideas, and has recently generated significant interest after several studies demonstrated it to have better empirical performance than state-of-the-art methods. However, many questions regarding its theoretical performance remained open. In this paper, we design and analyze a Thompson Sampling algorithm for the stochastic contextual multi-armed bandit problem with linear payoff functions, when the contexts are provided by an adaptive adversary. This is among the most important and widely studied versions of the contextual bandits problem. We prove a high probability regret bound of $\tilde{O}\left(\frac{d}{\sqrt{\varepsilon}}\sqrt{T^{1+\varepsilon}}\right)$ in time $T$ for any $\varepsilon \in (0, 1)$, where $d$ is the dimension of each context vector and $\varepsilon$ is a parameter used by the algorithm. Our results provide the first theoretical guarantees for the contextual version of Thompson Sampling, and are close to the lower bound of $\Omega(\sqrt{dT})$ for this problem. This essentially solves the COLT open problem of Chapelle and Li [COLT 2012] regarding regret bounds for Thompson Sampling for the contextual bandits problem with linear payoff functions. Our version of Thompson Sampling uses a Gaussian prior and Gaussian likelihood function. Our novel martingale-based analysis techniques also allow easy extensions to the use of other distributions satisfying certain general conditions.
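A minimal sketch of Thompson Sampling with a Gaussian prior and likelihood for linear contextual bandits, following the standard template (the scaling constant v and the toy instance below are illustrative, not the paper's exact parameterization):

```python
import numpy as np

def linear_thompson_sampling(contexts_fn, reward_fn, d, T, v=0.5, seed=0):
    """Thompson Sampling for linear contextual bandits: each round, sample
    mu_t ~ N(mu_hat, v^2 B^{-1}), play the arm whose context maximizes
    b . mu_t, then update the Gaussian posterior."""
    rng = np.random.default_rng(seed)
    B = np.eye(d)                  # posterior precision matrix
    f = np.zeros(d)                # running sum of reward-weighted contexts
    total = 0.0
    for t in range(T):
        arms = contexts_fn(t)      # one context vector per arm, shape (K, d)
        mu_hat = np.linalg.solve(B, f)
        mu_sample = rng.multivariate_normal(mu_hat, v**2 * np.linalg.inv(B))
        a = int(np.argmax(arms @ mu_sample))
        r = reward_fn(t, arms[a])
        B += np.outer(arms[a], arms[a])
        f += r * arms[a]
        total += r
    return total

# Hypothetical toy instance: rewards are b . mu_star plus small Gaussian noise.
d, K, T = 5, 10, 500
env_rng = np.random.default_rng(1)
mu_star = env_rng.normal(size=d)
mu_star /= np.linalg.norm(mu_star)
score = linear_thompson_sampling(
    lambda t: env_rng.normal(size=(K, d)),
    lambda t, b: b @ mu_star + 0.1 * env_rng.normal(),
    d, T,
)
```

The posterior-sampling step is what distinguishes this from UCB-style algorithms; the martingale analysis in the paper bounds the regret of exactly this kind of update loop.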

STOC 2008 Conference Paper

The VPN conjecture is true

  • Navin Goyal
  • Neil Olver
  • F. Bruce Shepherd

We consider the following network design problem. We are given an undirected graph G = (V, E) with edge costs c(e) and a set of terminal nodes W. A hose demand matrix for W is any symmetric matrix [D_ij] such that for each i, ∑_{j≠i} D_ij ≤ 1. We must compute the minimum cost edge capacities that are able to support the oblivious routing of every hose matrix in the network. An oblivious routing template, in this context, is a simple path P_ij for each pair i, j ∈ W. Given such a template, if we are to route a demand matrix D, then for each i, j we send D_ij units of flow along each P_ij. Fingerhut et al. and Gupta et al. obtained a 2-approximation for this problem, using a solution template in the form of a tree. It has been widely asked and subsequently conjectured [Italiano 2006] that this solution actually results in the optimal capacity for the single path VPN design problem; this has become known as the VPN conjecture. The conjecture has previously been proven for some restricted classes of graphs [Hurkens 2005, Grandoni 2007, Fiorini 2007]. Our main theorem establishes that this conjecture is true in general graphs. This also gives the first polynomial time algorithm for the single path VPN problem. We also show that the multipath version of the conjecture is false.

FOCS 2006 Conference Paper

Lower bounds for circuits with MOD_m gates

  • Arkadev Chattopadhyay
  • Navin Goyal
  • Pavel Pudlák
  • Denis Thérien

Let CC^{o(n)}[m] be the class of circuits that have size o(n) and in which all gates are MOD_m gates. We show that CC[m] circuits cannot compute MOD_q in sub-linear size when m, q > 1 are co-prime integers. No non-trivial lower bounds were known before on the size of constant-depth CC[m] circuits for computing MOD_q. On the other hand, our results show that circuits of type MAJ ∘ CC^{o(n)}[m] need exponential size to compute MOD_q. Using Bourgain's recent breakthrough result on estimates of exponential sums, we extend our bound to the case where small fan-in AND gates are allowed at the bottom of such circuits, i.e., circuits of type MAJ ∘ CC[m] ∘ AND_{ε log n}, where ε > 0 is a sufficiently small constant. CC[m] circuits of constant depth need a superlinear number of wires to compute both the AND and MOD_q functions. To prove this, we show that any circuit computing such functions has a certain connectivity property similar to that of superconcentrators. We show a superlinear lower bound on the number of edges of such graphs, extending results on superconcentrators.

FOCS 2005 Conference Paper

Lower Bounds for the Noisy Broadcast Problem

  • Navin Goyal
  • Guy Kindler
  • Michael E. Saks

We prove the first nontrivial (superlinear) lower bound in the noisy broadcast model of distributed computation. In this model, there are n + 1 processors P_0, P_1, ..., P_n. Each P_i, for i ≥ 1, initially has a private bit x_i and the goal is for P_0 to learn f(x_1, ..., x_n) for some specified function f. At each time step, a designated processor broadcasts some function of its private bit and the bits it has heard so far. Each broadcast is received by the other processors but each reception may be corrupted by noise. In this model, Gallager (1988) gave a noise-resistant protocol that allows P_0 to learn the entire input in O(n log log n) broadcasts. We prove that Gallager's protocol is optimal up to a constant factor. Our lower bound follows from a lower bound in a new model, the generalized noisy decision tree model, which may be of independent interest.