Arrow Research search

Author name cluster

Yury Polyanskiy

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

12 papers
2 author rows

Possible papers

12

JMLR Journal 2025 Journal Article

Density Estimation Using the Perceptron

  • Patrik Róbert Gerber
  • Tianze Jiang
  • Yury Polyanskiy
  • Rui Sun

We propose a new density estimation algorithm. Given $n$ i.i.d. observations from a distribution belonging to a class of densities on $\mathbb{R}^d$, our estimator outputs any density in the class whose “perceptron discrepancy” with the empirical distribution is at most $O(\sqrt{d/n})$. The perceptron discrepancy is defined as the largest difference in mass that two distributions place on any halfspace. It is shown that the expected total variation distance between this estimator and the truth is almost minimax optimal over the class of densities with bounded Sobolev norm and over Gaussian mixtures. This suggests that the regularity of the prior distribution could explain the efficiency of the ubiquitous step in machine learning that replaces optimization over large function spaces with simpler parametric classes (such as the discriminators of GANs). We also show that replacing the perceptron discrepancy with the generalized energy distance of Székely and Rizzo (2013) further improves the total variation loss. The generalized energy distance between empirical distributions is easily computable and differentiable, which makes it especially useful for fitting generative models. To the best of our knowledge, it is the first “simple” distance with such properties that yields minimax optimal statistical guarantees. In addition, we shed light on the ubiquitous method of representing discrete data on the domain $[k]$ via embedding vectors on a unit ball in $\mathbb{R}^d$. We show that taking $d \asymp \log(k)$ allows one to use simple linear probing to evaluate and estimate the total variation distance, and to recover the minimax optimal sample complexity for the class of discrete distributions on $[k]$.
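The energy distance mentioned in the abstract is easy to estimate from samples. As a hedged illustration (not the authors' implementation), the sketch below computes the plain energy distance of Székely and Rizzo between two empirical samples with NumPy; the function names and the use of a V-statistic (diagonal terms included) are our choices.

```python
import numpy as np

def energy_distance(x, y):
    """Empirical energy distance between samples x (n, d) and y (m, d).

    D^2(P, Q) = 2 E||X - Y|| - E||X - X'|| - E||Y - Y'||, estimated by
    averaging pairwise Euclidean distances (V-statistic, slightly biased).
    """
    def mean_pdist(a, b):
        # mean of ||a_i - b_j|| over all pairs
        diff = a[:, None, :] - b[None, :, :]
        return np.linalg.norm(diff, axis=-1).mean()

    return 2 * mean_pdist(x, y) - mean_pdist(x, x) - mean_pdist(y, y)

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, size=(500, 3))
q = rng.normal(0.5, 1.0, size=(500, 3))
print(energy_distance(p, q))   # strictly positive when the two distributions differ
```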

NeurIPS Conference 2025 Conference Paper

Global Minimizers of Sigmoid Contrastive Loss

  • Kiril Bangachev
  • Guy Bresler
  • Iliyas Noman
  • Yury Polyanskiy

The meta-task of obtaining and aligning representations through contrastive pretraining is steadily gaining importance since its introduction in CLIP and ALIGN. In this paper we theoretically explain the advantages of synchronizing with trainable inverse temperature and bias under the sigmoid loss, as implemented in the recent SigLIP and SigLIP2 models of Google DeepMind. Temperature and bias can drive the loss function to zero for a rich class of configurations that we call $(\mathsf{m}, \mathsf{br})$-Constellations. $(\mathsf{m}, \mathsf{br})$-Constellations are a novel combinatorial object related to spherical codes and are parametrized by a margin $\mathsf{m}$ and relative bias $\mathsf{br}$. We use our characterization of constellations to theoretically justify the success of SigLIP on retrieval, to explain the modality gap present in SigLIP, and to identify the necessary dimension for producing high-quality representations. Finally, we propose a reparameterization of the sigmoid loss with explicit relative bias, which improves training dynamics in experiments with synthetic data.
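For reference, here is a minimal sketch of the pairwise sigmoid contrastive loss with a trainable inverse temperature and bias, the form used in SigLIP-style training; the function name, tensor shapes, and NumPy formulation are ours, not code from the paper.

```python
import numpy as np

def sigmoid_contrastive_loss(img_emb, txt_emb, log_t, b):
    """Pairwise sigmoid loss over a batch of (image, text) embeddings.

    img_emb, txt_emb: (n, d) L2-normalized embeddings; pair i consists of row i
    of each matrix.  log_t is the log inverse temperature and b the bias, both
    trainable scalars in SigLIP-style training.
    """
    n = img_emb.shape[0]
    logits = np.exp(log_t) * img_emb @ txt_emb.T + b   # (n, n) pairwise scores
    labels = 2.0 * np.eye(n) - 1.0                     # +1 for matching pairs, -1 otherwise
    # loss_ij = -log sigmoid(labels_ij * logits_ij), summed and averaged over the batch
    return np.logaddexp(0.0, -labels * logits).sum() / n
```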

ICML Conference 2025 Conference Paper

NestQuant: nested lattice quantization for matrix products and LLMs

  • Semyon Savkin
  • Eitan Porat
  • Or Ordentlich
  • Yury Polyanskiy

Post-training quantization (PTQ) has emerged as a critical technique for efficient deployment of large language models (LLMs). This work proposes NestQuant, a novel PTQ scheme for weights and activations that is based on self-similar nested lattices. Recent works have mathematically shown such quantizers to be information-theoretically optimal for low-precision matrix multiplication. We implement a practical low-complexity version of NestQuant based on the Gosset lattice, making it a drop-in quantizer for any matrix multiplication step (e.g., in self-attention, MLPs, etc.). For example, NestQuant quantizes the weights, KV-cache, and activations of Llama-3-8B to 4 bits, achieving a perplexity of 6.6 on wikitext2. This represents a more than 55% reduction in the perplexity gap with respect to the unquantized model (perplexity 6.14), compared to the state-of-the-art SpinQuant from Meta (perplexity 7.3), OstQuant (7.3), and QuaRot (8.2). Comparisons on bigger models (up to 70B) and on various LLM evaluation benchmarks confirm the uniform superiority of NestQuant.
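The round-then-wrap structure of a self-similar nested quantizer can be illustrated with a toy one-dimensional lattice (the integer grid) in place of the Gosset lattice used by NestQuant; the sketch below is only a schematic of that structure under assumptions we choose, not the paper's algorithm.

```python
import numpy as np

def nested_quantize(x, delta, q):
    """Toy self-similar nested quantizer on the integer lattice (not the Gosset lattice).

    Fine lattice: delta * Z.  Coarse lattice: (q * delta) * Z, a scaled copy of the
    fine one.  Each coordinate is stored as its fine-lattice index modulo q, i.e.
    log2(q) bits; values outside the coarse Voronoi cell wrap around, which real
    schemes avoid by normalizing the input first.
    """
    fine = np.round(np.asarray(x) / delta).astype(np.int64)
    return np.mod(fine, q)

def nested_dequantize(idx, delta, q):
    # centered coset representative inside the coarse cell [-q*delta/2, q*delta/2)
    return (((idx + q // 2) % q) - q // 2) * delta

w = np.array([0.31, -0.07, 0.24])
codes = nested_quantize(w, delta=0.05, q=16)       # 4-bit code per weight
print(nested_dequantize(codes, delta=0.05, q=16))  # close to w while |w| < q*delta/2 = 0.4
```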

NeurIPS Conference 2025 Conference Paper

Normalization in Attention Dynamics

  • Nikita Karagodin
  • Shu Ge
  • Yury Polyanskiy
  • Philippe Rigollet

We study the effect of normalization schemes on token representations in deep transformers. Modeling their evolution as interacting particles on the sphere, we show that normalization acts as a form of speed regulation. This perspective enables a unified analysis of several schemes (including Post-LN, Pre-LN, Mix-LN, Peri-LN, and nGPT), revealing how they influence clustering dynamics and representation collapse. Our framework clarifies how different schemes shape token representations across layers and provides a principled basis for comparing them, identifying Peri-LN as a particularly effective choice.
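A hedged toy simulation of the interacting-particle picture: tokens live on the unit sphere and are pushed by attention, and the placement of the normalization step (here a simplified Post-LN vs. Pre-LN contrast only) changes how fast they move and cluster. The update rules below are our simplification with an identity value matrix, not the paper's equations.

```python
import numpy as np

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def attention_step(x, beta=1.0):
    """One simplified attention update with identity value matrix."""
    logits = beta * x @ x.T
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

def run(x, layers=50, scheme="post"):
    for _ in range(layers):
        if scheme == "post":       # normalize after the residual update (Post-LN-like)
            x = normalize(x + attention_step(x))
        else:                      # normalize only the input to attention (Pre-LN-like)
            x = x + attention_step(normalize(x))
    return x

rng = np.random.default_rng(1)
x0 = normalize(rng.normal(size=(32, 8)))
for scheme in ("post", "pre"):
    xf = run(x0.copy(), scheme=scheme)
    spread = np.ptp(normalize(xf) @ normalize(xf).T)   # small spread = strong clustering
    print(scheme, round(float(spread), 3))
```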

NeurIPS Conference 2024 Conference Paper

Clustering in Causal Attention Masking

  • Nikita Karagodin
  • Yury Polyanskiy
  • Philippe Rigollet

This work presents a modification of the self-attention dynamics proposed in Geshkovski et al. to better reflect the practically relevant, causally masked attention used in transformer architectures for generative AI. This modification translates into an interacting particle system that cannot be interpreted as a mean-field gradient flow. Despite this loss of structure, we significantly strengthen the results of Geshkovski et al. in this context: while previous rigorous results focused on cases where all three matrices (key, query, and value) were scaled identities, we prove asymptotic convergence to a single cluster for arbitrary key-query matrices and a value matrix equal to the identity. Additionally, we establish a connection to the classical Rényi parking problem from combinatorial geometry to take initial theoretical steps towards demonstrating the existence of meta-stable states.
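A toy numerical version of the causally masked dynamics described above, with identity value matrix and an arbitrary scalar inverse temperature; the discretized update rule is our own simplification, not the paper's.

```python
import numpy as np

def causal_attention_step(x, beta=4.0):
    """Tokens on the sphere; token i attends only to tokens j <= i."""
    n = x.shape[0]
    logits = beta * x @ x.T
    logits[np.triu_indices(n, k=1)] = -np.inf        # causal mask: no attention to the future
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    x = x + weights @ x                              # value matrix = identity
    return x / np.linalg.norm(x, axis=1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3))
x /= np.linalg.norm(x, axis=1, keepdims=True)
for _ in range(200):
    x = causal_attention_step(x)
print(np.round(x @ x.T, 2))   # entries near 1.0 indicate collapse toward a single cluster
```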

FOCS Conference 2023 Conference Paper

Algorithmic Decorrelation and Planted Clique in Dependent Random Graphs: The Case of Extra Triangles

  • Guy Bresler
  • Chenghao Guo
  • Yury Polyanskiy

We aim to understand the extent to which the noise distribution in a planted signal-plus-noise problem impacts its computational complexity. To that end, we consider the planted clique and planted dense subgraph problems, but in a different ambient graph. Instead of Erdős-Rényi $G(n, p)$, which has independent edges, we take the ambient graph to be the random graph with triangles (RGT) obtained by adding triangles to $G(n, p)$. We show that the RGT can be efficiently mapped to the corresponding $G(n, p)$, and moreover, that the planted clique (or dense subgraph) is approximately preserved under this mapping. This constitutes the first average-case reduction transforming dependent noise to independent noise. Together with the easier direction of mapping the ambient graph from Erdős-Rényi to RGT, our results yield a strong equivalence between models. In order to prove our results, we develop a new general framework for reasoning about the validity of average-case reductions based on low sensitivity to perturbations.
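As a concrete picture of the ambient-graph model, here is a hedged sketch of one natural way to sample a "random graph with triangles": start from Erdős-Rényi $G(n, p)$ and overlay independently chosen triangles. The sampling rates and the generator itself are illustrative; the paper's exact RGT parametrization may differ.

```python
import numpy as np
from itertools import combinations

def sample_rgt(n, p, num_triangles, seed=0):
    """Adjacency matrix of G(n, p) with extra random triangles overlaid."""
    rng = np.random.default_rng(seed)
    adj = rng.random((n, n)) < p
    adj = np.triu(adj, k=1)
    adj = adj | adj.T                                   # symmetric Erdos-Renyi base graph
    for _ in range(num_triangles):
        i, j, k = rng.choice(n, size=3, replace=False)  # add the three edges of a random triangle
        for a, b in combinations((i, j, k), 2):
            adj[a, b] = adj[b, a] = True
    return adj

g = sample_rgt(n=200, p=0.05, num_triangles=300)
print(g.sum() // 2, "edges")
```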

NeurIPS Conference 2023 Conference Paper

Kernel-Based Tests for Likelihood-Free Hypothesis Testing

  • Patrik Robert Gerber
  • Tianze Jiang
  • Yury Polyanskiy
  • Rui Sun

Given $n$ observations from two balanced classes, consider the task of labeling an additional $m$ inputs that are known to all belong to one of the two classes. Special cases of this problem are well known: with complete knowledge of class distributions ($n=\infty$) the problem is solved optimally by the likelihood-ratio test; when $m=1$ it corresponds to binary classification; and when $m\approx n$ it is equivalent to two-sample testing. The intermediate settings occur in the field of likelihood-free inference, where labeled samples are obtained by running forward simulations and the unlabeled sample is collected experimentally. In recent work it was discovered that there is a fundamental trade-off between $m$ and $n$: increasing the data sample $m$ reduces the amount $n$ of training/simulation data needed. In this work we (a) introduce a generalization where unlabeled samples come from a mixture of the two classes, a case often encountered in practice; (b) study the minimax sample complexity for non-parametric classes of densities under maximum mean discrepancy (MMD) separation; and (c) investigate the empirical performance of kernels parameterized by neural networks on two tasks: detection of the Higgs boson and detection of planted DDPM-generated images amidst CIFAR-10 images. For both problems we confirm the existence of the theoretically predicted asymmetric $m$ vs. $n$ trade-off.
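For concreteness, here is a hedged sketch of the unbiased MMD$^2$ statistic with a Gaussian kernel, the quantity underlying MMD-separation two-sample tests; the bandwidth choice and function names are ours, and the paper additionally considers kernels parameterized by neural networks.

```python
import numpy as np

def gaussian_kernel(a, b, bandwidth):
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def mmd2_unbiased(x, y, bandwidth=1.0):
    """Unbiased estimate of MMD^2 between samples x (n, d) and y (m, d)."""
    n, m = len(x), len(y)
    kxx = gaussian_kernel(x, x, bandwidth)
    kyy = gaussian_kernel(y, y, bandwidth)
    kxy = gaussian_kernel(x, y, bandwidth)
    term_x = (kxx.sum() - np.trace(kxx)) / (n * (n - 1))   # drop diagonal for unbiasedness
    term_y = (kyy.sum() - np.trace(kyy)) / (m * (m - 1))
    return term_x + term_y - 2 * kxy.mean()

rng = np.random.default_rng(0)
print(mmd2_unbiased(rng.normal(0, 1, (400, 2)), rng.normal(0.3, 1, (400, 2))))
```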

NeurIPS Conference 2023 Conference Paper

Score-based Source Separation with Applications to Digital Communication Signals

  • Tejas Jayashankar
  • Gary C. F. Lee
  • Alejandro Lancho
  • Amir Weiss
  • Yury Polyanskiy
  • Gregory Wornell

We propose a new method for separating superimposed sources using diffusion-based generative models. Our method relies only on separately trained statistical priors of independent sources to establish a new objective function guided by maximum a posteriori estimation with an $\alpha$-posterior, across multiple levels of Gaussian smoothing. Motivated by applications in radio-frequency (RF) systems, we are interested in sources with an underlying discrete nature and in the recovery of encoded bits from a signal of interest, as measured by the bit error rate (BER). Experimental results with RF mixtures demonstrate that our method results in a BER reduction of 95% over classical and existing learning-based methods. Our analysis demonstrates that our proposed method yields solutions that asymptotically approach the modes of an underlying discrete distribution. Furthermore, our method can be viewed as a multi-source extension of the recently proposed score distillation sampling scheme, shedding additional light on its use beyond conditional sampling. The project webpage is available at https://alpha-rgs.github.io.
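A heavily hedged schematic of the kind of update such a method performs: given pretrained score models for the two sources and a mixture y = x1 + x2, one source estimate is refined using both scores at decreasing Gaussian smoothing levels, with a weight omega standing in for the effect of the $\alpha$-posterior on the interference term. The callables score1/score2, the schedule, and the step size are placeholders; this is not the authors' released code.

```python
import numpy as np

def separate(y, score1, score2, omega=2.0, steps=1000, lr=1e-2, seed=0):
    """Schematic alpha-posterior-style separation of y = x1 + x2 (x1, x2 shaped like y)."""
    rng = np.random.default_rng(seed)
    x1 = y / 2.0                                   # crude initialization
    sigmas = np.geomspace(1.0, 0.01, num=steps)    # decreasing Gaussian smoothing levels
    for sigma in sigmas:
        noise = sigma * rng.standard_normal(y.shape)
        # move x1 uphill on the smoothed prior of source 1 ...
        grad = score1(x1 + noise, sigma)
        # ... while keeping the residual y - x1 plausible under the prior of source 2
        grad += omega * score2(y - x1 + noise, sigma)
        x1 = x1 + lr * grad
    return x1, y - x1
```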

NeurIPS Conference 2023 Conference Paper

The emergence of clusters in self-attention dynamics

  • Borjan Geshkovski
  • Cyril Letrouit
  • Yury Polyanskiy
  • Philippe Rigollet

Viewing Transformers as interacting particle systems, we describe the geometry of learned representations when the weights are not time-dependent. We show that particles, representing tokens, tend to cluster toward particular limiting objects as time tends to infinity. Using techniques from dynamical systems and partial differential equations, we show that the type of limiting object that emerges depends on the spectrum of the value matrix. Additionally, in the one-dimensional case we prove that the self-attention matrix converges to a low-rank Boolean matrix. The combination of these results mathematically confirms the empirical observation made by Vaswani et al. [VSP'17] that leaders appear in a sequence of tokens when processed by Transformers.
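Schematically, for key and query matrices equal to $\beta$ times the identity and value matrix $V$, the interacting-particle dynamics studied in this line of work take the form (the precise normalization conventions vary between papers)

$$\dot{x}_i(t) \;=\; \mathrm{P}^{\perp}_{x_i(t)}\!\left(\frac{1}{Z_i(t)}\sum_{j=1}^{n} e^{\beta \langle x_i(t),\, x_j(t)\rangle}\, V x_j(t)\right), \qquad Z_i(t) = \sum_{k=1}^{n} e^{\beta \langle x_i(t),\, x_k(t)\rangle},$$

where $\mathrm{P}^{\perp}_{x} y = y - \langle x, y\rangle x$ projects onto the tangent space of the unit sphere at $x$, so every token stays on the sphere.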

JMLR Journal 2022 Journal Article

Intrinsic Dimension Estimation Using Wasserstein Distance

  • Adam Block
  • Zeyu Jia
  • Yury Polyanskiy
  • Alexander Rakhlin

It has long been thought that high-dimensional data encountered in many practical machine learning tasks have low-dimensional structure, i.e., the manifold hypothesis holds. A natural question, thus, is to estimate the intrinsic dimension of a given population distribution from a finite sample. We introduce a new estimator of the intrinsic dimension and provide finite sample, non-asymptotic guarantees. We then apply our techniques to get new sample complexity bounds for Generative Adversarial Networks (GANs) depending only on the intrinsic dimension of the data.
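A hedged sketch of the statistical idea behind Wasserstein-based dimension estimators: the 1-Wasserstein distance between two independent empirical samples of size $n$ from a $d$-dimensional distribution decays roughly like $n^{-1/d}$ (for $d \ge 3$), so the decay rate read off a log-log fit yields a dimension estimate. The code below uses an exact assignment-based $W_1$ between equal-size samples and illustrates the scaling argument only; it is not the paper's estimator.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def w1(a, b):
    """Exact 1-Wasserstein distance between two equal-size empirical samples."""
    cost = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()

def estimate_dimension(data, sizes=(100, 200, 400, 800), seed=0):
    rng = np.random.default_rng(seed)
    logs_n, logs_w = [], []
    for n in sizes:
        idx = rng.choice(len(data), size=2 * n, replace=False)
        dist = w1(data[idx[:n]], data[idx[n:]])     # W1 between two disjoint subsamples
        logs_n.append(np.log(n))
        logs_w.append(np.log(dist))
    slope = np.polyfit(logs_n, logs_w, 1)[0]        # expect slope close to -1/d
    return -1.0 / slope

# data with intrinsic dimension 3, embedded in R^10
rng = np.random.default_rng(1)
cube = rng.uniform(size=(4000, 3))
data = np.concatenate([cube, np.zeros((4000, 7))], axis=1)
print(estimate_dimension(data))   # a noisy value in the vicinity of 3 at these sample sizes
```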

ICML Conference 2019 Conference Paper

Estimating Information Flow in Deep Neural Networks

  • Ziv Goldfeld
  • Ewout van den Berg
  • Kristjan H. Greenewald
  • Igor Melnyk
  • Nam Nguyen
  • Brian Kingsbury
  • Yury Polyanskiy

We study the estimation of the mutual information $I(X; T_\ell)$ between the input $X$ to a deep neural network (DNN) and the output vector $T_\ell$ of its $\ell$-th hidden layer (an “internal representation”). Focusing on feedforward networks with fixed weights and noisy internal representations, we develop a rigorous framework for accurate estimation of $I(X; T_\ell)$. By relating $I(X; T_\ell)$ to information transmission over additive white Gaussian noise channels, we reveal that compression, i.e., reduction in $I(X; T_\ell)$ over the course of training, is driven by progressive geometric clustering of the representations of samples from the same class. Experimental results verify this connection. Finally, we shift focus to purely deterministic DNNs, where $I(X; T_\ell)$ is provably vacuous, and show that nevertheless these models also cluster inputs belonging to the same class. The binning-based approximation of $I(X; T_\ell)$ employed in past works to measure compression is identified as a measure of clustering, thus clarifying that these experiments were in fact tracking the same clustering phenomenon. Leveraging the clustering perspective, we provide new evidence that compression and generalization may not be causally related and discuss potential future research ideas.
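A hedged sketch of the quantity being estimated: for a deterministic layer map $f$ with additive Gaussian noise, $T = f(X) + Z$, the law of $T$ given each training input is a known Gaussian, so $I(X; T)$ over the empirical input distribution can be Monte Carlo estimated from the resulting Gaussian mixture. The toy layer outputs, sample sizes, and function name below are our assumptions, not the paper's estimator or experiments.

```python
import numpy as np

def mi_gaussian_mixture(fx, sigma, n_mc=200, seed=0):
    """Monte Carlo estimate of I(X; T) for T = f(X) + sigma * Z, Z ~ N(0, I).

    X is uniform over the rows of fx (the layer outputs f(x_i)); T is then a
    mixture of n isotropic Gaussians centered at those rows.
    """
    rng = np.random.default_rng(seed)
    n, k = fx.shape
    total = 0.0
    for i in range(n):
        z = sigma * rng.standard_normal((n_mc, k))
        t = fx[i] + z                                            # samples of T given X = x_i
        # log p(T | X = x_i): isotropic Gaussian around fx[i] (constants cancel below)
        log_cond = -((t - fx[i]) ** 2).sum(1) / (2 * sigma ** 2)
        # log p(T): average of the n Gaussian mixture components
        d2 = ((t[:, None, :] - fx[None, :, :]) ** 2).sum(-1)     # (n_mc, n)
        log_marg = np.log(np.exp(-d2 / (2 * sigma ** 2)).mean(1))
        total += (log_cond - log_marg).mean()
    return total / n                                             # in nats

# toy "layer": two well-separated classes produce tightly clustered representations
fx = np.concatenate([np.zeros((50, 4)), 3 * np.ones((50, 4))])
print(mi_gaussian_mixture(fx, sigma=0.5))   # near log(2): clustering compresses I(X; T)
                                            # far below the log(100) of fully separated points
```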