Arrow Research search

Author name cluster

Majid Daliri

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
2 author rows

Possible papers

3

ICLR Conference 2025 Conference Paper

Matrix Product Sketching via Coordinated Sampling

  • Majid Daliri
  • Juliana Freire
  • Danrong Li
  • Christopher Musco

We revisit the well-studied problem of approximating a matrix product, $\bv{A}^T\bv{B}$, based on small space sketches $\mathcal{S}(\bv{A})$ and $\mathcal{S}(\bv{B})$ of $\bv{A} \in \R^{n \times d}$ and $\bv{B}\in \R^{n \times m}$. We are interested in the setting where the sketches must be computed independently of each other, except for the use of a shared random seed. We prove that, when $\bv{A}$ and $\bv{B}$ are sparse, methods based on \emph{coordinated random sampling} can outperform classical linear sketching approaches, like Johnson-Lindenstrauss Projection or CountSketch. For example, to obtain Frobenius norm error $\epsilon\|\bv{A}\|_F\|\bv{B}\|_F$, coordinated sampling requires sketches of size $O(s/\epsilon^2)$ when $\bv{A}$ and $\bv{B}$ have at most $s \leq d,m$ non-zeros per row. In contrast, linear sketching leads to sketches of size $O(d/\epsilon^2)$ and $O(m/\epsilon^2)$ for $\bv{A}$ and $\bv{B}$. We empirically evaluate our approach on two applications: 1) distributed linear regression in databases, a problem motivated by tasks like dataset discovery and augmentation, and 2) approximating attention matrices in transformer-based language models. In both cases, our sampling algorithms yield an order of magnitude improvement over linear sketching.

AAAI Conference 2025 Conference Paper

QJL: 1-Bit Quantized JL Transform for KV Cache Quantization with Zero Overhead

  • Amir Zandieh
  • Majid Daliri
  • Insu Han

Serving LLMs requires substantial memory due to the storage requirements of Key-Value (KV) embeddings in the KV cache, which grows with sequence length. An effective approach to compress KV cache is quantization. However, traditional quantization methods face significant memory overhead due to the need to store quantization constants (at least a zero point and a scale) in full precision per data block. Depending on the block size, this overhead can add 1 or 2 bits per quantized number. We introduce QJL, a new quantization approach that consists of a Johnson-Lindenstrauss (JL) transform followed by sign-bit quantization. In contrast to existing methods, QJL eliminates memory overheads by removing the need for storing quantization constants. We propose an asymmetric estimator for the inner product of two vectors and demonstrate that applying QJL to one vector and a standard JL transform without quantization to the other provides an unbiased estimator with minimal distortion. We have developed an efficient implementation of the QJL sketch and its corresponding inner product estimator, incorporating a lightweight CUDA kernel for optimized computation. When applied across various LLMs and NLP tasks to quantize the KV cache to only 3 bits, QJL demonstrates a more than fivefold reduction in KV cache memory usage without compromising accuracy, all while achieving faster runtime.

ICML Conference 2023 Conference Paper

KDEformer: Accelerating Transformers via Kernel Density Estimation

  • Amir Zandieh
  • Insu Han
  • Majid Daliri
  • Amin Karbasi

Dot-product attention mechanism plays a crucial role in modern deep architectures (e. g. , Transformer) for sequence modeling, however, naïve exact computation of this model incurs quadratic time and memory complexities in sequence length, hindering the training of long-sequence models. Critical bottlenecks are due to the computation of partition functions in the denominator of softmax function as well as the multiplication of the softmax matrix with the matrix of values. Our key observation is that the former can be reduced to a variant of the kernel density estimation (KDE) problem, and an efficient KDE solver can be further utilized to accelerate the latter via subsampling-based fast matrix products. Our proposed KDEformer can approximate the attention in sub-quadratic time with provable spectral norm bounds, while all prior results merely provide entry-wise error bounds. Empirically, we verify that KDEformer outperforms other attention approximations in terms of accuracy, memory, and arithmetic operations on various pre-trained models. For instance, on BigGAN image generation we achieve better generative scores than the exact computation with over 4× speedup. For ImageNet classification with T2T-ViT, KDEformer shows over 18× speedup while the accuracy drop is less than 0. 5%.