Arrow Research search

Author name cluster

Atri Rudra

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers
2 author rows

Possible papers (36)

ICLR Conference 2025 Conference Paper

Towards Learning High-Precision Least Squares Algorithms with Sequence Models

  • Jerry Weihong Liu
  • Jessica Grogan
  • Owen M. Dugan
  • Ashish Rao
  • Simran Arora
  • Atri Rudra
  • Christopher Ré

This paper investigates whether sequence models can learn to perform numerical algorithms, e.g. gradient descent, on the fundamental problem of least squares. Our goal is to inherit two properties of standard algorithms from numerical analysis: (1) machine precision, i.e. we want to obtain solutions that are accurate to near floating point error, and (2) numerical generality, i.e. we want them to apply broadly across problem instances. We find that prior approaches using Transformers fail to meet these criteria, and identify limitations present in existing architectures and training procedures. First, we show that softmax Transformers struggle to perform high-precision multiplications, which prevents them from precisely learning numerical algorithms. Second, we identify an alternate class of architectures, comprised entirely of polynomials, that can efficiently represent high-precision gradient descent iterates. Finally, we investigate precision bottlenecks during training and address them via a high-precision training recipe that reduces stochastic gradient noise. Our recipe enables us to train two polynomial architectures, gated convolutions and linear attention, to perform gradient descent iterates on least squares problems. For the first time, we demonstrate the ability to train to near machine precision. Applied iteratively, our models obtain $100,000\times$ lower MSE than standard Transformers trained end-to-end and they incur a $10,000\times$ smaller generalization gap on out-of-distribution problems. We make progress towards end-to-end learning of numerical algorithms for least squares.
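
For reference, a minimal numpy sketch of the target computation, plain gradient descent on least squares, is shown below. This is not the paper's sequence-model method; it only illustrates that each iterate $x_{k+1} = x_k - \eta A^\top(Ax_k - b)$ is a polynomial in the problem data, which is the property the polynomial architectures above exploit, and that iterating long enough reaches near floating point error.

```python
# Plain gradient descent on 0.5 * ||Ax - b||^2 (reference computation, not the model).
import numpy as np

def least_squares_gd(A, b, steps=2000, eta=None):
    """Run gradient descent and return the final iterate."""
    if eta is None:
        # A step size of 1 / sigma_max(A)^2 keeps the iteration stable.
        eta = 1.0 / np.linalg.norm(A, 2) ** 2
    x = np.zeros(A.shape[1])
    for _ in range(steps):
        x = x - eta * A.T @ (A @ x - b)   # each update is polynomial in (A, b, x)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((32, 8))
b = rng.standard_normal(32)
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.max(np.abs(least_squares_gd(A, b) - x_star)))  # shrinks toward machine precision
```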

ICML Conference 2024 Conference Paper

Simple linear attention language models balance the recall-throughput tradeoff

  • Simran Arora
  • Sabri Eyuboglu
  • Michael Zhang
  • Aman Timalsina
  • Silas Alberti
  • James Y. Zou
  • Atri Rudra
  • Christopher Ré

Recent work has shown that attention-based language models excel at "recall", the ability to ground generations in tokens previously seen in context. However, the efficiency of attention-based models is bottlenecked during inference by the KV-cache's aggressive memory consumption. In this work, we explore whether we can improve language model efficiency (e.g., by reducing memory consumption) without compromising on recall. By applying experiments and theory to a broad set of architectures, we identify a key tradeoff between a model's recurrent state size and recall ability. We show that efficient alternatives to attention (e.g., H3, Mamba, RWKV) maintain a fixed-size recurrent state, but struggle at recall. We propose BASED, a simple architecture combining linear and sliding window attention. By varying BASED window size and linear attention feature dimension, we can dial the state size and traverse the Pareto frontier of the recall-memory tradeoff curve, recovering the full quality of attention on one end and the small state size of attention-alternatives on the other. We train language models up to $1.3$B parameters and show that BASED matches the strongest sub-quadratic models (e.g., Mamba) in perplexity and outperforms them on real-world recall-intensive tasks by 10.36 accuracy points. We further develop IO-aware algorithms that enable BASED to provide 24× higher throughput on language generation than FlashAttention-2, when generating 1024 tokens using 1.3B parameter models. Overall, BASED expands the Pareto frontier of the throughput-recall tradeoff space beyond prior architectures.
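
For context, a hedged sketch of the two primitives the abstract combines follows; the shapes, the feature map (a simple nonnegative map standing in for the paper's Taylor-style map), and the per-token loops are illustrative rather than the released BASED code. The point is the fixed-size recurrent state: linear attention carries only a $d_{\text{feat}} \times d_v$ matrix, and the sliding window keeps only the last $w$ key/value pairs, so total state is independent of sequence length.

```python
import numpy as np

def causal_linear_attention(q, k, v, feature_map):
    """Causal linear attention with an explicit running state (O(n * d_feat * d_v))."""
    n, d_v = v.shape
    phi_q, phi_k = feature_map(q), feature_map(k)   # (n, d_feat)
    S = np.zeros((phi_q.shape[1], d_v))             # running sum of outer(phi(k_t), v_t)
    z = np.zeros(phi_q.shape[1])                    # running normalizer
    out = np.zeros((n, d_v))
    for t in range(n):
        S += np.outer(phi_k[t], v[t])
        z += phi_k[t]
        out[t] = phi_q[t] @ S / (phi_q[t] @ z + 1e-6)
    return out

def sliding_window_attention(q, k, v, w):
    """Exact causal softmax attention restricted to the last w tokens."""
    n, d = q.shape
    out = np.zeros_like(v)
    for t in range(n):
        lo = max(0, t - w + 1)
        scores = q[t] @ k[lo:t + 1].T / np.sqrt(d)
        p = np.exp(scores - scores.max())
        out[t] = (p / p.sum()) @ v[lo:t + 1]
    return out

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 128, 16))
global_part = causal_linear_attention(q, k, v, feature_map=lambda x: np.maximum(x, 0.0) + 1e-3)
local_part = sliding_window_attention(q, k, v, w=8)
```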

ICLR Conference 2024 Conference Paper

Zoology: Measuring and Improving Recall in Efficient Language Models

  • Simran Arora
  • Sabri Eyuboglu
  • Aman Timalsina
  • Isys Johnson
  • Michael Poli
  • James Y. Zou
  • Atri Rudra
  • Christopher Ré

Attention-free language models that combine gating and convolutions are growing in popularity due to their efficiency and increasingly competitive performance. To better understand these architectures, we pretrain a suite of 17 attention and gated-convolution language models, finding that SoTA gated-convolution architectures still underperform attention by up to 2.1 perplexity points on the Pile. In fine-grained analysis, we find 82% of the gap is explained by each model's ability to recall information that is previously mentioned in-context, e.g. "Hakuna Matata means no worries Hakuna Matata it means no" -> ??. On this task, termed "associative recall", we find that attention outperforms gated-convolutions by a large margin: a 70M parameter attention model outperforms a 1.4 billion parameter gated-convolution model on associative recall. This is surprising because prior work shows gated convolutions can perfectly solve synthetic tests for AR capability. To close the gap between synthetics and real language, we develop a new formalization of the task called multi-query associative recall (MQAR) that better reflects actual language. We perform an empirical and theoretical study of MQAR that elucidates differences in the parameter-efficiency of attention and gated-convolution recall. Informed by our analysis, we evaluate simple convolution-attention hybrids and show that hybrids with input-dependent sparse attention patterns can close 97.4% of the gap to attention, while maintaining sub-quadratic scaling. Code is at: https://github.com/HazyResearch/zoology.
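
A hedged sketch of a multi-query associative recall (MQAR) style example follows; the exact token layout, vocabulary split, and sampling used in the paper's generator (see the linked repository) may differ. The sequence lists key-value pairs and then repeats several of the keys; the model must produce the paired value at each query position.

```python
import numpy as np

def make_mqar_example(num_pairs=8, num_queries=4, vocab=64, seed=0):
    """Return (tokens, targets): 'k1 v1 k2 v2 ... q1 q2 ...' plus the expected values."""
    rng = np.random.default_rng(seed)
    keys = rng.choice(vocab, size=num_pairs, replace=False)           # distinct keys
    values = rng.integers(vocab, 2 * vocab, size=num_pairs)           # disjoint value range
    context = np.ravel(np.column_stack([keys, values]))               # interleave k, v
    query_idx = rng.choice(num_pairs, size=num_queries, replace=False)
    return np.concatenate([context, keys[query_idx]]), values[query_idx]

tokens, targets = make_mqar_example()
print(tokens, targets)   # the model should emit targets[i] right after the i-th query token
```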

ICLR Conference 2023 Conference Paper

How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections

  • Albert Gu
  • Isys Johnson
  • Aman Timalsina
  • Atri Rudra
  • Christopher Ré

Linear time-invariant state space models (SSM) are a classical model from engineering and statistics, that have recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. However, the specific matrix that S4 uses was actually derived in previous work for a particular *time-varying* dynamical system, and the use of this matrix as a *time-invariant* SSM had no known mathematical interpretation. Consequently, the theoretical mechanism by which S4 models long-range dependencies actually remains unexplained. We derive a more general and intuitive formulation of the HiPPO framework, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long dependencies. Our generalization introduces a theoretically rich class of SSMs that also lets us derive more intuitive S4 variants for other bases such as the Fourier basis, and explains other aspects of training S4, such as how to initialize the important timescale parameter. These insights improve S4's performance to 86% on the Long Range Arena benchmark, with 96% on the most difficult Path-X task.

ICLR Conference 2023 Conference Paper

Hungry Hungry Hippos: Towards Language Modeling with State Space Models

  • Daniel Y. Fu
  • Tri Dao
  • Khaled Kamal Saab
  • Armin W. Thomas
  • Atri Rudra
  • Christopher Ré

State space models (SSMs) have demonstrated state-of-the-art sequence modeling performance in some modalities, but underperform attention in language modeling. Moreover, despite scaling nearly linearly in sequence length instead of quadratically, SSMs are still slower than Transformers due to poor hardware utilization. In this paper, we make progress on understanding the expressivity gap between SSMs and attention in language modeling, and on reducing the hardware barrier between SSMs and attention. First, we use synthetic language modeling tasks to understand the gap between SSMs and attention. We find that existing SSMs struggle with two capabilities: recalling earlier tokens in the sequence and comparing tokens across the sequence. To understand the impact on language modeling, we propose a new SSM layer, H3, that is explicitly designed for these abilities. H3 matches attention on the synthetic languages and comes within 0.4 PPL of Transformers on OpenWebText. Furthermore, a hybrid 125M-parameter H3-attention model that retains two attention layers surprisingly outperforms Transformers on OpenWebText by 1.0 PPL. Next, to improve the efficiency of training SSMs on modern hardware, we propose FlashConv. FlashConv uses a fused block FFT algorithm to improve efficiency on sequences up to 8K, and introduces a novel state passing algorithm that exploits the recurrent properties of SSMs to scale to longer sequences. FlashConv yields 2$\times$ speedup on the long-range arena benchmark and allows hybrid language models to generate text 2.4$\times$ faster than Transformers. Using FlashConv, we scale hybrid H3-attention language models up to 2.7B parameters on the Pile and find promising initial results, achieving lower perplexity than Transformers and outperforming Transformers in zero- and few-shot learning on a majority of tasks in the SuperGLUE benchmark.

NeurIPS Conference 2023 Conference Paper

Laughing Hyena Distillery: Extracting Compact Recurrences From Convolutions

  • Stefano Massaroli
  • Michael Poli
  • Dan Fu
  • Hermann Kumbong
  • Rom Parnichkun
  • David Romero
  • Aman Timalsina
  • Quinn McIntyre

Recent advances in attention-free sequence models rely on convolutions as alternatives to the attention operator at the core of Transformers. In particular, long convolution sequence models have achieved state-of-the-art performance in many domains, but incur a significant cost during auto-regressive inference workloads -- naively requiring a full pass (or caching of activations) over the input sequence for each generated token -- similarly to attention-based models. In this paper, we seek to enable $\mathcal O(1)$ compute and memory cost per token in any pre-trained long convolution architecture to reduce memory footprint and increase throughput during generation. Concretely, our methods consist in extracting low-dimensional linear state-space models from each convolution layer, building upon rational interpolation and model-order reduction techniques. We further introduce architectural improvements to convolution-based layers such as Hyena: by weight-tying the filters across channels into heads, we achieve higher pre-training quality and reduce the number of filters to be distilled. The resulting model achieves 10x higher throughput than Transformers and 1.5x higher than Hyena at 1.3B parameters, without any loss in quality after distillation.

NeurIPS Conference 2023 Conference Paper

Monarch Mixer: A Simple Sub-Quadratic GEMM-Based Architecture

  • Dan Fu
  • Simran Arora
  • Jessica Grogan
  • Isys Johnson
  • Evan Sabri Eyuboglu
  • Armin Thomas
  • Benjamin Spector
  • Michael Poli

Machine learning models are increasingly being scaled in both sequence length and model dimension to reach longer contexts and better performance. However, existing architectures such as Transformers scale quadratically along both these axes. We ask: are there performant architectures that can scale sub-quadratically along sequence length and model dimension? We introduce Monarch Mixer (M2), a new architecture that uses the same sub-quadratic primitive along both sequence length and model dimension: Monarch matrices, a simple class of expressive structured matrices that captures many linear transforms, achieves high hardware efficiency on GPUs, and scales sub-quadratically. As a proof of concept, we explore the performance of M2 in three domains: non-causal BERT-style language modeling, ViT-style image classification, and causal GPT-style language modeling. For non-causal BERT-style modeling, M2 matches BERT-base and BERT-large in downstream GLUE quality with up to 27% fewer parameters, and achieves up to 9.1$\times$ higher throughput at sequence length 4K. On ImageNet, M2 outperforms ViT-b by 1% in accuracy, with only half the parameters. Causal GPT-style models introduce a technical challenge: enforcing causality via masking introduces a quadratic bottleneck. To alleviate this bottleneck, we develop a novel theoretical view of Monarch matrices based on multivariate polynomial evaluation and interpolation, which lets us parameterize M2 to be causal while remaining sub-quadratic. Using this parameterization, M2 matches GPT-style Transformers at 360M parameters in pretraining perplexity on The PILE—showing for the first time that it may be possible to match Transformer quality without attention or MLPs.

ICML Conference 2023 Conference Paper

Simple Hardware-Efficient Long Convolutions for Sequence Modeling

  • Daniel Y. Fu
  • Elliot L. Epstein
  • Eric Nguyen
  • Armin W. Thomas
  • Michael Zhang
  • Tri Dao
  • Atri Rudra
  • Christopher Ré

State space models (SSMs) have high performance on long sequence modeling but require sophisticated initialization techniques and specialized implementations for high quality and runtime performance. We study whether a simple alternative can match SSMs in performance and efficiency: directly learning long convolutions over the sequence. We find that a key requirement to achieving high performance is keeping the convolution kernels smooth. We find that simple interventions, such as squashing the kernel weights, result in smooth kernels and recover SSM performance on a range of tasks including the long range arena, image classification, language modeling, and brain data modeling. Next, we develop FlashButterfly, an IO-aware algorithm to improve the runtime performance of long convolutions. FlashButterfly appeals to classic Butterfly decompositions of the convolution to reduce GPU memory IO and increase FLOP utilization. FlashButterfly speeds up convolutions by 2.2$\times$, and allows us to train on Path256, a challenging task with sequence length 64K, where we set state-of-the-art by 29.1 points while training 7.2$\times$ faster than prior work. Lastly, we introduce an extension to FlashButterfly that learns the coefficients of the Butterfly decomposition, increasing expressivity without increasing runtime. Using this extension, we outperform a Transformer on WikiText103 by 0.2 PPL with 30% fewer parameters.
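
As a point of reference, a hedged sketch of the core computation (a causal long convolution evaluated via FFT, with a simple soft-threshold standing in for the abstract's kernel squashing) is shown below; the paper's exact squashing operator and the FlashButterfly kernel are more involved than this.

```python
import numpy as np

def causal_fft_conv(u, k):
    """Causal convolution of a length-n signal u with a length-n kernel k in O(n log n)."""
    n = u.shape[-1]
    fft_len = 2 * n                                  # zero-pad to avoid circular wraparound
    return np.fft.irfft(np.fft.rfft(u, fft_len) * np.fft.rfft(k, fft_len), fft_len)[..., :n]

def squash(k, lam=0.002):
    """Soft-threshold small kernel weights toward zero (illustrative 'squashing')."""
    return np.sign(k) * np.maximum(np.abs(k) - lam, 0.0)

rng = np.random.default_rng(0)
n = 4096
u = rng.standard_normal(n)
k = squash(rng.standard_normal(n) * np.exp(-np.arange(n) / 512))     # decaying random kernel
y = causal_fft_conv(u, k)

# cross-check against direct convolution on a short prefix
m = 64
assert np.allclose(causal_fft_conv(u[:m], k[:m]), np.convolve(u[:m], k[:m])[:m])
```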

NeurIPS Conference 2022 Conference Paper

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness

  • Tri Dao
  • Dan Fu
  • Stefano Ermon
  • Atri Rudra
  • Christopher Ré

Transformers are slow and memory-hungry on long sequences, since the time and memory complexity of self-attention are quadratic in sequence length. Approximate attention methods have attempted to address this problem by trading off model quality to reduce the compute complexity, but often do not achieve wall-clock speedup. We argue that a missing principle is making attention algorithms IO-aware---accounting for reads and writes between levels of GPU memory. We propose FlashAttention, an IO-aware exact attention algorithm that uses tiling to reduce the number of memory reads/writes between GPU high bandwidth memory (HBM) and GPU on-chip SRAM. We analyze the IO complexity of FlashAttention, showing that it requires fewer HBM accesses than standard attention, and is optimal for a range of SRAM sizes. We also extend FlashAttention to block-sparse attention, yielding an approximate attention algorithm that is faster than any existing approximate attention method. FlashAttention trains Transformers faster than existing baselines: 3x speedup on GPT-2 (seq. length 1K) and 2.4x speedup on long-range arena (seq. length 1K-4K). FlashAttention also enables longer context in Transformers, yielding higher quality models (0.7 better perplexity on GPT-2 and 6.4 points of lift on long-document classification) and entirely new capabilities: the first Transformers to achieve better-than-chance performance on the Path-X challenge (seq. length 16K, 61.4% accuracy) and Path-256 (seq. length 64K, 63.1% accuracy).
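
To make the tiling idea concrete, below is a hedged numpy sketch of block-by-block exact attention with an online softmax: key/value blocks are visited one at a time while running max and normalizer statistics are maintained, so the full $n \times n$ score matrix is never materialized. The real FlashAttention kernel does this inside on-chip SRAM with fused reads and writes (and handles masking, dropout, and the backward pass); this is only the numerically equivalent recurrence.

```python
import numpy as np

def full_attention(q, k, v):
    s = q @ k.T / np.sqrt(q.shape[1])
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def blockwise_attention(q, k, v, block=64):
    """Exact (non-causal) attention computed one key/value block at a time."""
    n, d = q.shape
    acc = np.zeros_like(v)               # unnormalized output accumulator
    m = np.full(n, -np.inf)              # running row-wise max of the scores
    l = np.zeros(n)                      # running softmax normalizer
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = q @ kb.T / np.sqrt(d)                    # scores for this block only
        m_new = np.maximum(m, s.max(axis=1))
        alpha = np.exp(m - m_new)                    # rescale what was accumulated so far
        p = np.exp(s - m_new[:, None])
        acc = acc * alpha[:, None] + p @ vb
        l = l * alpha + p.sum(axis=1)
        m = m_new
    return acc / l[:, None]

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 256, 32))
assert np.allclose(blockwise_attention(q, k, v), full_attention(q, k, v))
```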

ICML Conference 2022 Conference Paper

Monarch: Expressive Structured Matrices for Efficient and Accurate Training

  • Tri Dao
  • Beidi Chen
  • Nimit Sharad Sohoni
  • Arjun D. Desai
  • Michael Poli
  • Jessica Grogan
  • Alexander Liu
  • Aniruddh Rao

Large neural networks excel in many domains, but they are expensive to train and fine-tune. A popular approach to reduce their compute or memory requirements is to replace dense weight matrices with structured ones (e.g., sparse, low-rank, Fourier transform). These methods have not seen widespread adoption (1) in end-to-end training due to unfavorable efficiency–quality tradeoffs, and (2) in dense-to-sparse fine-tuning due to lack of tractable algorithms to approximate a given dense weight matrix. To address these issues, we propose a class of matrices (Monarch) that is hardware-efficient (they are parameterized as products of two block-diagonal matrices for better hardware utilization) and expressive (they can represent many commonly used transforms). Surprisingly, the problem of approximating a dense weight matrix with a Monarch matrix, though nonconvex, has an analytical optimal solution. These properties of Monarch matrices unlock new ways to train and fine-tune sparse and dense models. We empirically validate that Monarch can achieve favorable accuracy-efficiency tradeoffs in several end-to-end sparse training applications: speeding up ViT and GPT-2 training on ImageNet classification and Wikitext-103 language modeling by 2x with comparable model quality, and reducing the error on PDE solving and MRI reconstruction tasks by 40%. In sparse-to-dense training, with a simple technique called "reverse sparsification," Monarch matrices serve as a useful intermediate representation to speed up GPT-2 pretraining on OpenWebText by 2x without quality drop. The same technique brings 23% faster BERT pretraining than even the very optimized implementation from Nvidia that set the MLPerf 1.1 record. In dense-to-sparse fine-tuning, as a proof-of-concept, our Monarch approximation algorithm speeds up BERT fine-tuning on GLUE by 1.7x with comparable accuracy.
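
For intuition, a hedged sketch of a Monarch-style matrix-vector product for dimension $N = b^2$ follows: two batches of small dense blocks applied along the two axes of a reshaped input, with the transpose acting as the fixed permutation between them (the "product of two block-diagonal matrices" description, roughly $M = P^\top L P R$). The paper's parameterization, rectangular variants, and the dense-to-Monarch projection algorithm are more general than this.

```python
import numpy as np

def monarch_matvec(x, L_blocks, R_blocks):
    """x: (N,) with N = b*b; L_blocks, R_blocks: (b, b, b) batches of b x b blocks."""
    b = L_blocks.shape[0]
    X = x.reshape(b, b)
    X = np.einsum('ijk,ik->ij', R_blocks, X)   # block-diagonal R: one block per row chunk
    X = X.T                                    # fixed permutation (reshape transpose)
    X = np.einsum('ijk,ik->ij', L_blocks, X)   # block-diagonal L on the permuted vector
    return X.T.reshape(-1)

b = 8                                          # N = 64, roughly O(N^1.5) parameters and FLOPs
rng = np.random.default_rng(0)
L_blocks, R_blocks = rng.standard_normal((2, b, b, b))
x = rng.standard_normal(b * b)
y = monarch_matvec(x, L_blocks, R_blocks)
print(y.shape)                                 # (64,)
```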

ICLR Conference 2022 Conference Paper

Pixelated Butterfly: Simple and Efficient Sparse Training for Neural Network Models

  • Beidi Chen
  • Tri Dao
  • Kaizhao Liang
  • Jiaming Yang
  • Zhao Song 0002
  • Atri Rudra
  • Christopher Ré

Overparameterized neural networks generalize well but are expensive to train. Ideally one would like to reduce their computational cost while retaining their generalization benefits. Sparse model training is a simple and promising approach to achieve this, but there remain challenges as existing methods struggle with accuracy loss, slow training runtime, or difficulty in sparsifying all model components. The core problem is that searching for a sparsity mask over a discrete set of sparse matrices is difficult and expensive. To address this, our main insight is to optimize over a continuous superset of sparse matrices with a fixed structure known as products of butterfly matrices. As butterfly matrices are not hardware efficient, we propose simple variants of butterfly (block and flat) to take advantage of modern hardware. Our method (Pixelated Butterfly) uses a simple fixed sparsity pattern based on flat block butterfly and low-rank matrices to sparsify most network layers (e.g., attention, MLP). We empirically validate that Pixelated Butterfly is $3\times$ faster than Butterfly and speeds up training to achieve favorable accuracy--efficiency tradeoffs. On the ImageNet classification and WikiText-103 language modeling tasks, our sparse models train up to 2.3$\times$ faster than the dense MLP-Mixer, Vision Transformer, and GPT-2 small with no drop in accuracy.

NeurIPS Conference 2021 Conference Paper

Combining Recurrent, Convolutional, and Continuous-time Models with Linear State Space Layers

  • Albert Gu
  • Isys Johnson
  • Karan Goel
  • Khaled Saab
  • Tri Dao
  • Atri Rudra
  • Christopher Ré

Recurrent neural networks (RNNs), temporal convolutions, and neural differential equations (NDEs) are popular families of deep learning models for time-series data, each with unique strengths and tradeoffs in modeling power and computational efficiency. We introduce a simple sequence model inspired by control systems that generalizes these approaches while addressing their shortcomings. The Linear State-Space Layer (LSSL) maps a sequence $u \mapsto y$ by simply simulating a linear continuous-time state-space representation $\dot{x} = Ax + Bu, y = Cx + Du$. Theoretically, we show that LSSL models are closely related to the three aforementioned families of models and inherit their strengths. For example, they generalize convolutions to continuous-time, explain common RNN heuristics, and share features of NDEs such as time-scale adaptation. We then incorporate and generalize recent theory on continuous-time memorization to introduce a trainable subset of structured matrices $A$ that endow LSSLs with long-range memory. Empirically, stacking LSSL layers into a simple deep neural network obtains state-of-the-art results across time series benchmarks for long dependencies in sequential image classification, real-world healthcare regression tasks, and speech. On a difficult speech classification task with length-16000 sequences, LSSL outperforms prior approaches by 24 accuracy points, and even outperforms baselines that use hand-crafted features on 100x shorter sequences.
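
For reference, the computation the abstract describes in both of its views: discretizing the continuous state space with step size $\Delta$ (here via the bilinear transform commonly used in this line of work) gives an equivalent linear recurrence, and unrolling that recurrence gives a convolution whose kernel is built from the discretized matrices (the skip term $Du$ is left out of the kernel).

$$
\dot{x} = Ax + Bu,\quad y = Cx + Du
\qquad\Longrightarrow\qquad
\bar{A} = \Big(I - \tfrac{\Delta}{2}A\Big)^{-1}\Big(I + \tfrac{\Delta}{2}A\Big),\quad
\bar{B} = \Big(I - \tfrac{\Delta}{2}A\Big)^{-1}\Delta B
$$

$$
x_k = \bar{A}x_{k-1} + \bar{B}u_k,\quad y_k = C x_k + D u_k
\qquad\Longleftrightarrow\qquad
y = \bar{K} * u,\quad \bar{K} = \big(C\bar{B},\, C\bar{A}\bar{B},\, C\bar{A}^{2}\bar{B},\, \dots\big)
$$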

NeurIPS Conference 2021 Conference Paper

Scatterbrain: Unifying Sparse and Low-rank Attention

  • Beidi Chen
  • Tri Dao
  • Eric Winsor
  • Zhao Song
  • Atri Rudra
  • Christopher Ré

Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences. However, it is still challenging to balance the trade-off between model quality and efficiency to perform a one-size-fits-all approximation for different tasks. To better understand this trade-off, we observe that sparse and low-rank approximations excel in different regimes, determined by the softmax temperature in attention, and sparse + low-rank can outperform each individually. Inspired by the classical robust-PCA algorithm for sparse and low-rank decomposition, we propose Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and low-rank (via kernel feature map) attention for accurate and efficient approximation. The estimation is unbiased with provably low error. We empirically show that Scatterbrain can achieve $2.1\times$ lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT. On a pre-trained T2T Vision transformer, even without fine-tuning, Scatterbrain can reduce $98\%$ of attention memory at the cost of only $1\%$ drop in accuracy. We demonstrate Scatterbrain for end-to-end training with up to $4$ points better perplexity and 5 points better average accuracy than sparse or low-rank efficient transformers on language modeling and long-range-arena tasks.
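
A hedged illustration of the sparse + low-rank idea follows: approximate the (unnormalized) attention matrix by a low-rank term plus an exact correction on a small set of entries. For simplicity the low-rank part here is a truncated SVD and the sparse support is the largest residual entries; Scatterbrain itself obtains the low-rank part from a kernel feature map and the sparse support from locality sensitive hashing, which is what makes it practical without ever forming the full matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, rank, nnz = 256, 32, 16, 2048
q, k = rng.standard_normal((2, n, d)) / d ** 0.25
A = np.exp(q @ k.T)                                   # unnormalized attention scores

U, s, Vt = np.linalg.svd(A)
low_rank = U[:, :rank] * s[:rank] @ Vt[:rank]         # low-rank part (stand-in for kernel features)

resid = A - low_rank
idx = np.unravel_index(np.argsort(-np.abs(resid), axis=None)[:nnz], resid.shape)
sparse = np.zeros_like(A)
sparse[idx] = resid[idx]                              # exact correction on the chosen support

err_lr = np.linalg.norm(A - low_rank) / np.linalg.norm(A)
err_both = np.linalg.norm(A - (low_rank + sparse)) / np.linalg.norm(A)
print(err_lr, err_both)                               # sparse + low-rank gives strictly lower error
```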

NeurIPS Conference 2020 Conference Paper

HiPPO: Recurrent Memory with Optimal Polynomial Projections

  • Albert Gu
  • Tri Dao
  • Stefano Ermon
  • Atri Rudra
  • Christopher Ré

A central problem in learning from sequential data is representing cumulative history in an incremental fashion as more data is processed. We introduce a general framework (HiPPO) for the online compression of continuous signals and discrete time series by projection onto polynomial bases. Given a measure that specifies the importance of each time step in the past, HiPPO produces an optimal solution to a natural online function approximation problem. As special cases, our framework yields a short derivation of the recent Legendre Memory Unit (LMU) from first principles, and generalizes the ubiquitous gating mechanism of recurrent neural networks such as GRUs. This formal framework yields a new memory update mechanism (HiPPO-LegS) that scales through time to remember all history, avoiding priors on the timescale. HiPPO-LegS enjoys the theoretical benefits of timescale robustness, fast updates, and bounded gradients. By incorporating the memory dynamics into recurrent neural networks, HiPPO RNNs can empirically capture complex temporal dependencies. On the benchmark permuted MNIST dataset, HiPPO-LegS sets a new state-of-the-art accuracy of 98.3%. Finally, on a novel trajectory classification task testing robustness to out-of-distribution timescales and missing data, HiPPO-LegS outperforms RNN and neural ODE baselines by 25-40% accuracy.
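
As a concrete instance, the scaled Legendre (HiPPO-LegS) update tracks the coefficients $c(t) \in \mathbb{R}^N$ of the history's projection onto the first $N$ Legendre polynomials; a hedged transcription of its dynamics and one simple discretization is below (conventions such as signs and normalization may differ slightly from the paper).

$$
\frac{d}{dt}c(t) = -\frac{1}{t}A\,c(t) + \frac{1}{t}B\,f(t),
\qquad
c_{k+1} = \Big(I - \frac{A}{k}\Big)c_k + \frac{1}{k}B\,f_k,
$$

$$
A_{nk} =
\begin{cases}
\sqrt{(2n+1)(2k+1)} & n > k\\
n+1 & n = k\\
0 & n < k,
\end{cases}
\qquad
B_n = \sqrt{2n+1}.
$$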

ICLR Conference 2020 Conference Paper

Kaleidoscope: An Efficient, Learnable Representation For All Structured Linear Maps

  • Tri Dao
  • Nimit Sharad Sohoni
  • Albert Gu
  • Matthew Eichhorn
  • Amit Blonder
  • Megan Leszczynski
  • Atri Rudra
  • Christopher Ré

Modern neural network architectures use structured linear transformations, such as low-rank matrices, sparse matrices, permutations, and the Fourier transform, to improve inference speed and reduce memory usage compared to general linear maps. However, choosing which of the myriad structured transformations to use (and its associated parameterization) is a laborious task that requires trading off speed, space, and accuracy. We consider a different approach: we introduce a family of matrices called kaleidoscope matrices (K-matrices) that provably capture any structured matrix with near-optimal space (parameter) and time (arithmetic operation) complexity. We empirically validate that K-matrices can be automatically learned within end-to-end pipelines to replace hand-crafted procedures, in order to improve model quality. For example, replacing channel shuffles in ShuffleNet improves classification accuracy on ImageNet by up to 5%. K-matrices can also simplify hand-engineered pipelines---we replace filter bank feature computation in speech data preprocessing with a learnable kaleidoscope layer, resulting in only 0.4% loss in accuracy on the TIMIT speech recognition task. In addition, K-matrices can capture latent structure in models: for a challenging permuted image classification task, adding a K-matrix to a standard convolutional architecture can enable learning the latent permutation and improve accuracy by over 8 points. We provide a practically efficient implementation of our approach, and use K-matrices in a Transformer network to attain 36% faster end-to-end inference speed on a language translation task.

ICML Conference 2019 Conference Paper

Learning Fast Algorithms for Linear Transforms Using Butterfly Factorizations

  • Tri Dao
  • Albert Gu
  • Matthew Eichhorn
  • Atri Rudra
  • Christopher Ré

Fast linear transforms are ubiquitous in machine learning, including the discrete Fourier transform, discrete cosine transform, and other structured transformations such as convolutions. All of these transforms can be represented by dense matrix-vector multiplication, yet each has a specialized and highly efficient (subquadratic) algorithm. We ask to what extent hand-crafting these algorithms and implementations is necessary, what structural prior they encode, and how much knowledge is required to automatically learn a fast algorithm for a provided structured transform. Motivated by a characterization of fast matrix-vector multiplication as products of sparse matrices, we introduce a parameterization of divide-and-conquer methods that is capable of representing a large class of transforms. This generic formulation can automatically learn an efficient algorithm for many important transforms; for example, it recovers the $O(N \log N)$ Cooley-Tukey FFT algorithm to machine precision, for dimensions $N$ up to $1024$. Furthermore, our method can be incorporated as a lightweight replacement of generic matrices in machine learning pipelines to learn efficient and compressible transformations. On a standard task of compressing a single hidden-layer network, our method exceeds the classification accuracy of unconstrained matrices on CIFAR-10 by 3.9 points—the first time a structured approach has done so—with 4X faster inference speed and 40X fewer parameters.
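
For intuition about the structural prior being learned, the sketch below builds the DFT matrix from its Cooley-Tukey butterfly factorization: $O(\log N)$ sparse factors, each with two nonzeros per row, interleaved with an even/odd permutation, and checks it against numpy's FFT. The paper learns factors with this sparsity pattern rather than fixing them to the FFT values.

```python
import numpy as np

def dft_butterfly(n):
    """Return the n x n DFT matrix (n a power of two) built from butterfly factors."""
    if n == 1:
        return np.ones((1, 1), dtype=complex)
    half = n // 2
    omega = np.exp(-2j * np.pi * np.arange(half) / n)
    butterfly = np.block([                            # sparse factor: 2 nonzeros per row
        [np.eye(half), np.diag(omega)],
        [np.eye(half), -np.diag(omega)],
    ])
    perm = np.zeros((n, n))
    perm[np.arange(half), np.arange(0, n, 2)] = 1     # even-indexed inputs first,
    perm[np.arange(half, n), np.arange(1, n, 2)] = 1  # odd-indexed inputs second
    sub = dft_butterfly(half)
    block_diag = np.block([
        [sub, np.zeros((half, half))],
        [np.zeros((half, half)), sub],
    ])
    return butterfly @ block_diag @ perm

n = 16
assert np.allclose(dft_butterfly(n), np.fft.fft(np.eye(n)), atol=1e-10)
```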

SODA Conference 2018 Conference Paper

A Two-pronged Progress in Structured Dense Matrix Vector Multiplication

  • Christopher De Sa
  • Albert Gu
  • Rohan Puttagunta
  • Christopher Ré
  • Atri Rudra

Matrix-vector multiplication is one of the most fundamental computing primitives. Given a matrix $A$ and a vector $b$, it is known that in the worst case $\Theta(N^2)$ operations over $\mathbb{F}$ are needed to compute $Ab$. Many types of structured matrices do admit faster multiplication. However, even given a matrix $A$ that is known to have this property, it is hard in general to recover a representation of $A$ exposing the actual fast multiplication algorithm. Additionally, it is not known in general whether the inverses of such structured matrices can be computed or multiplied quickly. A broad question is thus to identify classes of structured dense matrices that can be represented with $O(N)$ parameters, and for which matrix-vector multiplication (and ideally other operations such as solvers) can be performed in a sub-quadratic number of operations. One such class of structured matrices that admit near-linear matrix-vector multiplication are the orthogonal polynomial transforms whose rows correspond to a family of orthogonal polynomials. Other well-known classes include the Toeplitz, Hankel, Vandermonde, and Cauchy matrices and their extensions (e.g., confluent Cauchy-like matrices) that are all special cases of a low displacement rank property. In this paper, we make progress on two fronts: (1) We introduce the notion of recurrence width of matrices. For matrices $A$ with constant recurrence width, we design algorithms to compute both $Ab$ and $A^T b$ with a near-linear number of operations. This notion of width is finer than all the above classes of structured matrices and thus we can compute near-linear matrix-vector multiplication for all of them using the same core algorithm. Furthermore, we show that it is possible to solve the harder problems of recovering the structured parameterization of a matrix with low recurrence width, and computing matrix-vector product with its inverse in near-linear time. (2) We additionally adapt our algorithm to a matrix-vector multiplication algorithm for a much more general class of matrices with displacement structure: those with low displacement rank with respect to quasiseparable matrices. This result is a novel connection between matrices with displacement structure and those with rank structure, two large but previously separate classes of structured matrices. This class includes Toeplitz-plus-Hankel-like matrices, the Discrete Trigonometric Transforms, and more, and captures all previously known matrices with displacement structure under a unified parameterization and algorithm. Our work unifies, generalizes, and simplifies existing state-of-the-art results in structured matrix-vector multiplication. Finally, we show how applications in areas such as multipoint evaluations of multivariate polynomials can be reduced to problems involving low recurrence width matrices.
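
To make "recurrence width" concrete: an orthogonal polynomial transform is the evaluation matrix $A_{ij} = p_i(x_j)$ of a polynomial family obeying the classical three-term recurrence below, and recurrence width $t$ loosely relaxes this to a width-$t$ recurrence with polynomial coefficients (the notation here is a simplified paraphrase of the paper's definition, not its exact statement).

$$
p_{i+1}(X) = (a_i X + b_i)\,p_i(X) + c_i\,p_{i-1}(X),
\qquad\text{width } t:\quad
p_i(X) = \sum_{j=1}^{t} g_{i,j}(X)\,p_{i-j}(X).
$$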

NeurIPS Conference 2018 Conference Paper

Learning Compressed Transforms with Low Displacement Rank

  • Anna Thomas
  • Albert Gu
  • Tri Dao
  • Atri Rudra
  • Christopher Ré

The low displacement rank (LDR) framework for structured matrices represents a matrix through two displacement operators and a low-rank residual. Existing use of LDR matrices in deep learning has applied fixed displacement operators encoding forms of shift invariance akin to convolutions. We introduce a rich class of LDR matrices with more general displacement operators, and explicitly learn over both the operators and the low-rank component. This class generalizes several previous constructions while preserving compression and efficient computation. We prove bounds on the VC dimension of multi-layer neural networks with structured weight matrices and show empirically that our compact parameterization can reduce the sample complexity of learning. When replacing weight layers in fully-connected, convolutional, and recurrent neural networks for image classification and language modeling tasks, our new classes exceed the accuracy of existing compression approaches, and on some tasks even outperform general unstructured layers while using more than 20x fewer parameters.
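
For reference, the structure the LDR framework builds on: a matrix $M$ is represented through two displacement operators $A$, $B$ and a low-rank residual (Sylvester-type displacement shown; a Stein-type variant $M - AMB$ is also common). For example, Toeplitz matrices have displacement rank at most 2 with respect to shift operators; the paper learns the operators and the low-rank factors jointly.

$$
\nabla_{A,B}(M) \;=\; AM - MB \;=\; G H^{\top}, \qquad \operatorname{rank}\big(GH^{\top}\big) = r \ll N.
$$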

SODA Conference 2017 Conference Paper

Tight Network Topology Dependent Bounds on Rounds of Communication

  • Arkadev Chattopadhyay
  • Michael Langberg
  • Shi Li 0001
  • Atri Rudra

We prove tight network topology dependent bounds on the round complexity of computing well-studied $k$-party functions such as set disjointness and element distinctness. Unlike the usual case in the CONGEST model in distributed computing, we fix the function and then vary the underlying network topology. This complements the recent such results on total communication that have received some attention. We also present some applications to distributed graph computation problems. Our main contribution is a proof technique that allows us to reduce the problem on a general graph topology to a relevant two-party communication complexity problem. However, unlike many previous works that also used the same high-level strategy, we do not reason about a two-party communication problem that is induced by a cut in the graph. To 'stitch' back the various lower bounds from the two-party communication problems, we use the notion of a timed graph that has seen prior use in network coding. Our reductions use some tools from Steiner tree packing and multi-commodity flow problems that have a delay constraint.

STOC Conference 2014 Conference Paper

Every list-decodable code for high noise has abundant near-optimal rate puncturings

  • Atri Rudra
  • Mary Wootters

We show that any $q$-ary code with sufficiently good distance can be randomly punctured to obtain, with high probability, a code that is list decodable up to radius $1 - 1/q - \varepsilon$ with near-optimal rate and list sizes. Our results imply that "most" Reed-Solomon codes are list decodable beyond the Johnson bound, settling the longstanding open question of whether any Reed-Solomon codes meet this criterion. More precisely, we show that a Reed-Solomon code with random evaluation points is, with high probability, list decodable up to radius $1 - \varepsilon$ with list sizes $O(1/\varepsilon)$ and rate $\Omega(\varepsilon)$. As a second corollary of our argument, we obtain improved bounds on the list decodability of random linear codes over large fields. Our approach exploits techniques from high dimensional probability. Previous work used similar tools to obtain bounds on the list decodability of random linear codes, but the bounds did not scale with the size of the alphabet. In this paper, we use a chaining argument to deal with large alphabet sizes.

FOCS Conference 2014 Conference Paper

Topology Matters in Communication

  • Arkadev Chattopadhyay
  • Jaikumar Radhakrishnan
  • Atri Rudra

We consider the communication cost of computing functions when inputs are distributed among the vertices of an undirected graph. The communication is assumed to be point-to-point: a processor sends messages only to its neighbors. The processors in the graph act according to a pre-determined protocol, which can be randomized and may err with some small probability. The communication cost of the protocol is the total number of bits exchanged in the worst case. Extending recent work that assumed that the graph was the complete graph (with unit edge lengths), we develop a methodology for showing lower bounds that are sensitive to the graph topology. In particular, for a broad class of graphs, we obtain a lower bound of the form $\Omega(k^2 n)$ for computing a function of $k$ inputs, each of which is $n$ bits long and located at a different vertex. Previous works obtained lower bounds of the form $\Omega(kn)$. This methodology yields a variety of other results, including the following: a tight lower bound (ignoring poly-log factors) for Element Distinctness, settling a question of Phillips, Verbin and Zhang (SODA '12); a distributed XOR lemma; a lower bound for composed functions, settling a question of Phillips et al.; and new topology-dependent bounds for several natural graph problems considered by Woodruff and Zhang (DISC '13). To obtain these results we use tools from the theory of metric embeddings and represent the topological constraints imposed by the graph as a collection of cuts, each cut providing a setting where our understanding of two-party communication complexity can be effectively deployed.

TCS Journal 2011 Journal Article

Pricing commodities

  • Robert Krauthgamer
  • Aranyak Mehta
  • Atri Rudra

How should a seller price her goods in a market where each buyer prefers a single good among his desired goods, and will buy the cheapest such good, as long as it is within his budget? We provide efficient algorithms that compute near-optimal prices for this problem, focusing on a commodity market, where the range of buyer budgets is small. We also show that our LP rounding based technique easily extends to a different scenario, in which the buyers want to buy all the desired goods, as long as they are within budget.

MFCS Conference 2011 Conference Paper

Symmetric Functions Capture General Functions

  • Richard J. Lipton
  • Kenneth W. Regan
  • Atri Rudra

We show that the set of all functions is equivalent to the set of all symmetric functions (possibly over a larger domain) up to deterministic time complexity. In particular, for any function $f$, there is an equivalent symmetric function $f_{\mathrm{sym}}$ such that $f$ can be computed from $f_{\mathrm{sym}}$ and vice versa (modulo an extra deterministic linear time computation). For $f$ over finite fields, $f_{\mathrm{sym}}$ is (necessarily) over an extension field. This reduction is optimal in the size of the extension field. For polynomial functions, the degree of $f_{\mathrm{sym}}$ is not optimal. We present another reduction that has optimal degree "blowup" but is worse in the other parameters.

STOC Conference 2007 Conference Paper

Lower bounds for randomized read/write stream algorithms

  • Paul Beame
  • T. S. Jayram
  • Atri Rudra

Motivated by the capabilities of modern storage architectures, we consider the following generalization of the data stream model where the algorithm has sequential access to multiple streams. Unlike the data stream model, where the stream is read-only, in this new model (introduced in [8,9]) the algorithms can also write onto streams. There is no limit on the size of the streams but the number of passes made on the streams is restricted. On the other hand, the amount of internal memory used by the algorithm is scarce, similar to the data stream model. We resolve the main open problem in [7] of proving lower bounds in this model for algorithms that are allowed to have 2-sided error. Previously, such lower bounds were shown only for deterministic and 1-sided error randomized algorithms [9,7]. We consider the classical set disjointness problem that has proved to be invaluable for deriving lower bounds for many other problems involving data streams and other randomized models of computation. For this problem, we show a near-linear lower bound on the size of the internal memory used by a randomized algorithm with 2-sided error that is allowed to have $o(\log N/\log\log N)$ passes over the streams. This bound is almost optimal since there is a simple algorithm that can solve this problem using logarithmic memory if a logarithmic number of passes over the streams is allowed. Applications include near-linear lower bounds on the internal memory for well-known problems in the literature: (1) approximately counting the number of distinct elements in the input ($F_0$); (2) approximating the frequency of the mode of an input sequence ($F^*_\infty$); (3) computing the join of two relations; and (4) deciding if some node of an XML document matches an XQuery (or XPath) query. Our techniques involve a novel direct-sum type of argument that yields lower bounds for many other problems. Our results asymptotically improve previously known bounds for any problem even in deterministic and 1-sided error models of computation.

STOC Conference 2006 Conference Paper

Explicit capacity-achieving list-decodable codes

  • Venkatesan Guruswami
  • Atri Rudra

For every $0 < R < 1$ and $\varepsilon > 0$, we present an explicit construction of error-correcting codes of rate $R$ that can be list decoded in polynomial time up to a fraction $(1-R-\varepsilon)$ of errors. These codes achieve the "capacity" for decoding from adversarial errors, i.e., achieve the optimal trade-off between rate and error-correction radius. At least theoretically, this meets one of the central challenges in coding theory. Prior to this work, explicit codes achieving capacity were not known for any rate $R$. In fact, our codes are the first to beat the error-correction radius of $1-\sqrt{R}$, that was achieved for Reed-Solomon (RS) codes in [9], for all rates $R$. (For rates $R < 1/16$, Parvaresh and Vardy [12] had recently improved upon the $1-\sqrt{R}$ bound; for $R \to 0$, their algorithm can decode a fraction $1-O(R\log(1/R))$ of errors.) Our codes are simple to describe --- they are certain folded Reed-Solomon codes, which are in fact exactly RS codes, but viewed as a code over a larger alphabet by careful bundling of codeword symbols. Given the ubiquity of RS codes, this is an appealing feature of our result, since the codes we propose are not too far from the ones in actual use. The main insight in our work is that some carefully chosen folded RS codes are "compressed" versions of a related family of Parvaresh-Vardy codes. Further, the decoding of the folded RS codes can be reduced to list decoding the related Parvaresh-Vardy codes. The alphabet size of these folded RS codes is polynomial in the block length. This can be reduced to a constant that depends on the distance $\varepsilon$ to capacity using ideas concerning "list recovering" and expander-based codes from [7, 8]. Concatenating the folded RS codes with suitable inner codes also gives us polytime constructible binary codes that can be efficiently list decoded up to the Zyablov bound.
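
For concreteness, the folding operation described above (standard definition, with $\gamma$ a generator of $\mathbb{F}_q^{*}$ and folding parameter $m$ dividing the block length $n$): consecutive Reed-Solomon symbols are bundled into single symbols over the larger alphabet $\mathbb{F}_q^{m}$, which is exactly the "careful bundling of codeword symbols" mentioned above.

$$
\mathrm{FRS}_m(f) \;=\; \Big(\big(f(\gamma^{jm}),\, f(\gamma^{jm+1}),\, \ldots,\, f(\gamma^{jm+m-1})\big)\Big)_{j=0}^{\,n/m-1},
\qquad f \in \mathbb{F}_q[X],\ \deg f < Rn.
$$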

FOCS Conference 2004 Conference Paper

Testing Low-Degree Polynomials over Prime Fields

  • Charanjit S. Jutla
  • Anindya C. Patthak
  • Atri Rudra
  • David Zuckerman

We present an efficient randomized algorithm to test if a given function $f: \mathbb{F}_p^n \rightarrow \mathbb{F}_p$ (where $p$ is a prime) is a low-degree polynomial. This gives a local test for generalized Reed-Muller codes over prime fields. For a given integer $t$ and a given real $\varepsilon > 0$, the algorithm queries $f$ at $1/\varepsilon + t \cdot p^{2r/(p-1)+O(1)}$ points to determine whether $f$ can be described by a polynomial of degree at most $t$. If $f$ is indeed a polynomial of degree at most $t$, our algorithm always accepts, and if $f$ has a relative distance at least $\varepsilon$ from every degree $t$ polynomial, then our algorithm rejects $f$ with probability at least $1/2$. Our result is almost optimal since any such algorithm must query $f$ on at least $\Omega(1/\varepsilon + p^{(r+1)/(p-1)})$ points.