Author name cluster

Sitan Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers
2 author rows

Possible papers

36

ICML Conference 2025 Conference Paper

Blink of an eye: a simple theory for feature localization in generative models

  • Marvin Li
  • Aayush Karan
  • Sitan Chen

Large language models can exhibit unexpected behavior in the blink of an eye. In a recent computer use demo, a language model switched from coding to Googling pictures of Yellowstone, and these sudden shifts in behavior have also been observed in reasoning patterns and jailbreaks. This phenomenon is not unique to autoregressive models: in diffusion models, key features of the final output are decided in narrow “critical windows” of the generation process. In this work we develop a simple, unifying theory to explain this phenomenon. Using the formalism of stochastic localization for generative models, we show that it emerges generically as the generation process localizes to a sub-population of the distribution it models. While critical windows have been studied at length in diffusion models, existing theory heavily relies on strong distributional assumptions and the particulars of Gaussian diffusion. In contrast to existing work, our theory (1) applies to autoregressive and diffusion models; (2) makes very few distributional assumptions; (3) quantitatively improves previous bounds even when specialized to diffusions; and (4) requires basic mathematical tools. Finally, we validate our predictions empirically for LLMs and find that critical windows often coincide with failures in problem solving for various math and reasoning benchmarks.
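
One way to make the abstract's mechanism concrete is the martingale view of stochastic localization; the following is a reading of the abstract in standard notation, not the paper's own definitions. Generation (autoregressive decoding or reverse diffusion) induces the posterior over the final output given everything produced so far,

$$\mu_t(\cdot) = \mathbb{P}\big[X_{\mathrm{final}} \in \cdot \mid \mathcal{F}_t\big], \qquad \mathbb{E}[\mu_{t+s} \mid \mathcal{F}_t] = \mu_t, \qquad \mu_T = \delta_{X_{\mathrm{final}}},$$

a measure-valued martingale that localizes to a point mass. For any sub-population $S$ (an image class, a jailbreak mode), $m_t = \mu_t(S)$ is then a bounded scalar martingale that must travel from $\mu_0(S)$ to $0$ or $1$, and a critical window for $S$ is a short interval in which $m_t$ makes most of that trip.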

ICLR Conference 2025 Conference Paper

Faster Diffusion Sampling with Randomized Midpoints: Sequential and Parallel

  • Shivam Gupta 0002
  • Linda Cai
  • Sitan Chen

Sampling algorithms play an important role in controlling the quality and runtime of diffusion model inference. In recent years, a number of works (Chen et al., 2023c;b; Benton et al., 2023; Lee et al., 2022) have analyzed algorithms for diffusion sampling with provable guarantees; these works show that for essentially any data distribution, one can approximately sample in polynomial time given a sufficiently accurate estimate of its score functions at different noise levels. In this work, we propose a new scheme inspired by Shen and Lee's randomized midpoint method for log-concave sampling (Shen & Lee, 2019). We prove that this approach achieves the best known dimension dependence for sampling from arbitrary smooth distributions in total variation distance ($\widetilde O(d^{5/12})$ compared to $\widetilde O(\sqrt{d})$ from prior work). We also show that our algorithm can be parallelized to run in only $\widetilde O(\log^2 d)$ parallel rounds, constituting the first provable guarantees for parallel sampling with diffusion models. As a byproduct of our methods, for the well-studied problem of log-concave sampling in total variation distance, we give an algorithm and simple analysis achieving dimension dependence $\widetilde O(d^{5/12})$ compared to $\widetilde O(\sqrt{d})$ from prior work.
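
Below is a minimal sketch of the classical randomized midpoint discretization of Shen & Lee for a generic ODE $dx/dt = v(x,t)$; the paper's diffusion samplers build on this primitive but are more involved, and the function names here are illustrative only.

```python
import numpy as np

def randomized_midpoint_step(x, t, h, drift, rng):
    """One randomized-midpoint step for the ODE dx/dt = drift(x, t).

    The drift is evaluated at a uniformly random point inside the step
    rather than at its left endpoint, so each step integrates the drift
    exactly in expectation; this is the source of the method's improved
    discretization error over Euler-type schemes.
    """
    alpha = rng.uniform()                       # random location in [t, t+h]
    x_mid = x + alpha * h * drift(x, t)         # coarse Euler predictor
    return x + h * drift(x_mid, t + alpha * h)  # corrected full step

# Toy usage: integrate dx/dt = -x from x(0) = 1 over [0, 1].
rng = np.random.default_rng(0)
x, t, h = 1.0, 0.0, 0.01
for _ in range(100):
    x = randomized_midpoint_step(x, t, h, lambda x, t: -x, rng)
    t += h
print(x)  # close to exp(-1) ~ 0.3679
```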

STOC Conference 2025 Conference Paper

Provably Learning a Multi-head Attention Layer

  • Sitan Chen
  • Yuanzhi Li

The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. Given a sequence length $k$, attention matrices $\Sigma_1,\ldots,\Sigma_m \in \mathbb{R}^{d\times d}$, and projection matrices $W_1,\ldots,W_m \in \mathbb{R}^{d\times d}$, the corresponding multi-head attention layer $F: \mathbb{R}^{k\times d} \to \mathbb{R}^{k\times d}$ transforms length-$k$ sequences of $d$-dimensional tokens $X \in \mathbb{R}^{k\times d}$ via $F(X) \triangleq \sum_{i=1}^{m} \mathrm{softmax}(X\Sigma_i X^\top)XW_i$. In this work, we initiate the study of provably learning a multi-head attention layer from random examples and give the first nontrivial upper and lower bounds for this problem. Provided $\{W_i, \Sigma_i\}$ satisfy certain non-degeneracy conditions, we give a $(dk)^{O(m^3)}$-time algorithm that learns $F$ to small error given random labeled examples drawn uniformly from $\{\pm 1\}^{k\times d}$. We also prove computational lower bounds showing that in the worst case, exponential dependence on the number of heads $m$ is unavoidable. We chose to focus on Boolean $X$ to mimic the discrete nature of tokens in large language models, though our techniques naturally extend to standard continuous settings, e.g. Gaussian. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning feed-forward networks, which predominantly exploit fine-grained algebraic and rotation invariance properties of the Gaussian distribution. In contrast, our analysis is more flexible as it primarily relies on various upper and lower tail bounds for the input distribution and “slices” thereof.
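
The target function here is explicit enough to implement directly; a minimal NumPy rendering of $F$ (the object being learned, not the learning algorithm):

```python
import numpy as np

def softmax_rows(A):
    """Row-wise softmax with the usual max-subtraction for stability."""
    A = A - A.max(axis=-1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=-1, keepdims=True)

def multi_head_attention(X, Sigmas, Ws):
    """F(X) = sum_i softmax(X Sigma_i X^T) X W_i, as defined in the abstract.

    X: (k, d) sequence of k d-dimensional tokens; Sigmas, Ws: (d, d) each.
    """
    return sum(softmax_rows(X @ S @ X.T) @ X @ W for S, W in zip(Sigmas, Ws))

# Boolean inputs X in {+1, -1}^{k x d}, matching the paper's input model.
rng = np.random.default_rng(0)
k, d, m = 5, 8, 2
X = rng.choice([-1.0, 1.0], size=(k, d))
Sigmas = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(m)]
print(multi_head_attention(X, Sigmas, Ws).shape)  # (5, 8)
```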

ICML Conference 2025 Conference Paper

S4S: Solving for a Fast Diffusion Model Solver

  • Eric Frankel
  • Sitan Chen
  • Jerry Li 0001
  • Pang Wei Koh
  • Lillian J. Ratliff
  • Sewoong Oh

Diffusion models (DMs) create samples from a data distribution by starting from random noise and iteratively solving a reverse-time ordinary differential equation (ODE). Because each step in the iterative solution requires an expensive neural function evaluation (NFE), there has been significant interest in approximately solving these diffusion ODEs with only a few NFEs without modifying the underlying model. However, in the few NFE regime, we observe that tracking the true ODE evolution is fundamentally impossible using traditional ODE solvers. In this work, we propose a new method that learns a good solver for the DM, which we call Solving for the Solver (S4S). S4S directly optimizes a solver to obtain good generation quality by learning to match the output of a strong teacher solver. We evaluate S4S on six different pre-trained DMs, including pixel-space and latent-space DMs for both conditional and unconditional sampling. In all settings, S4S uniformly improves the sample quality relative to traditional ODE solvers. Moreover, our method is lightweight, data-free, and can be plugged in black-box on top of any discretization schedule or architecture to improve performance. Building on top of this, we also propose S4S-Alt, which optimizes both the solver and the discretization schedule. By exploiting the full design space of DM solvers, with 5 NFEs, we achieve an FID of 3.73 on CIFAR10 and 13.26 on MS-COCO, representing a $1.5\times$ improvement over previous training-free ODE methods.
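
The "solve for the solver" idea can be illustrated on a toy ODE where both teacher and student are cheap; this sketch only mirrors the objective (match a many-NFE teacher with a few-NFE student whose step coefficients are learned) and none of the paper's actual parameterization, which is an assumption on our part.

```python
import numpy as np

def v(x, t):                       # toy stand-in for a diffusion ODE's drift
    return -x / (1.0 + t)

def solve(x, ts, coeffs):
    """Euler-like scheme whose per-step coefficients are learnable."""
    for i in range(len(ts) - 1):
        x = x + coeffs[i] * (ts[i + 1] - ts[i]) * v(x, ts[i])
    return x

rng = np.random.default_rng(0)
x0s = rng.standard_normal(64)

# "Teacher": the same ODE solved with 1000 steps (many NFEs), computed once.
targets = solve(x0s, np.linspace(0.0, 1.0, 1000), np.ones(999))

# Student: 5 steps (5 NFEs); fit its coefficients to match the teacher.
ts5, coeffs = np.linspace(0.0, 1.0, 6), np.ones(5)

def loss(c):
    return np.mean((solve(x0s, ts5, c) - targets) ** 2)

for _ in range(200):               # finite-difference gradient descent
    g = np.array([(loss(coeffs + 1e-4 * e) - loss(coeffs)) / 1e-4
                  for e in np.eye(5)])
    coeffs -= 2.0 * g
print(np.round(coeffs, 3), loss(coeffs))  # tuned steps beat plain Euler (all ones)
```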

STOC Conference 2025 Conference Paper

Stabilizer Bootstrapping: A Recipe for Efficient Agnostic Tomography and Magic Estimation

  • Sitan Chen
  • Weiyuan Gong
  • Qi Ye 0005
  • Zhihan Zhang 0006

We study the task of agnostic tomography: given copies of an unknown $n$-qubit state $\rho$ which has fidelity $\tau$ with some state in a given class $C$, find a state which has fidelity $\geq \tau - \epsilon$ with $\rho$. We give a new framework, stabilizer bootstrapping, for designing computationally efficient protocols for this task, and use this to get new agnostic tomography protocols for the following classes:

  • Stabilizer states: we give a protocol that runs in time $\mathrm{poly}(n, 1/\epsilon)\cdot(1/\tau)^{O(\log(1/\tau))}$, answering an open question posed by Grewal, Iyer, Kretschmer, and Liang and by Anshu and Arunachalam. Previous protocols ran in time $\exp(\Theta(n))$ or required $\tau > \cos^2(\pi/8)$.
  • States with stabilizer dimension $n - t$: we give a protocol that runs in time $n^3 \cdot (2^t/\tau)^{O(\log(1/\epsilon))}$, extending recent work on learning quantum states prepared by circuits with few non-Clifford gates, which only applied in the realizable setting where $\tau = 1$.
  • Discrete product states: if $C = K^{\otimes n}$ for some $\mu$-separated discrete set $K$ of single-qubit states, we give a protocol that runs in time $(n/\mu)^{O((1 + \log(1/\tau))/\mu)}/\epsilon^2$. This strictly generalizes a prior guarantee which applied to stabilizer product states. For stabilizer product states, we give a further improved protocol that runs in time $(n^2/\epsilon^2)\cdot(1/\tau)^{O(\log(1/\tau))}$.

As a corollary, we give the first protocol for estimating stabilizer fidelity, a standard measure of magic for quantum states, to error $\epsilon$ in time $n^3 \cdot \mathrm{quasipoly}(1/\epsilon)$.
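
Spelled out, the guarantee has a compact form; a minimal formalization, under our assumption that fidelity here means squared overlap with a pure state in the class:

$$\max_{|\phi\rangle \in C} \langle \phi|\rho|\phi\rangle \;\geq\; \tau \quad\Longrightarrow\quad \text{output a state } |\hat\phi\rangle \text{ with } \langle \hat\phi|\rho|\hat\phi\rangle \;\geq\; \tau - \epsilon,$$

with the realizable setting ($\tau = 1$) recovered as the special case where $\rho$ itself lies in $C$.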

ICML Conference 2025 Conference Paper

Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions

  • Jaeyeon Kim
  • Kulin Shah
  • Vasilis Kontonis
  • Sham M. Kakade
  • Sitan Chen

In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from $<7$% to $\approx 90$%, even outperforming ARMs that were explicitly trained via teacher forcing to learn the right order of decoding.
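
One concrete instance of "adaptively choosing the token decoding order" is a greedy most-confident-first rule; this is a toy sketch consistent with the abstract, with a dummy model standing in for the MDM, and is not claimed to be the paper's exact inference strategy.

```python
import numpy as np

def adaptive_decode(probs_fn, seq, mask_id):
    """Decode masked positions in order of model confidence.

    probs_fn(seq) -> (L, V) array of per-position token distributions.
    Each step commits the masked position with the highest max-probability,
    i.e. the easiest subproblem first, rather than a fixed order.
    """
    seq = list(seq)
    while mask_id in seq:
        probs = probs_fn(seq)
        masked = [i for i, tok in enumerate(seq) if tok == mask_id]
        i = max(masked, key=lambda j: probs[j].max())
        seq[i] = int(np.argmax(probs[i]))   # greedy; sampling also works
    return seq

# Dummy "model": vocabulary {0, 1}, strongly prefers alternating tokens.
V, MASK = 2, -1
def dummy_probs(seq):
    p = np.full((len(seq), V), 0.5)
    for i in range(1, len(seq)):
        if seq[i - 1] != MASK:
            p[i] = [0.1, 0.9] if seq[i - 1] == 0 else [0.9, 0.1]
    return p

print(adaptive_decode(dummy_probs, [0] + [MASK] * 5, MASK))  # [0, 1, 0, 1, 0, 1]
```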

ICML Conference 2024 Conference Paper

Critical windows: non-asymptotic theory for feature emergence in diffusion models

  • Marvin Li
  • Sitan Chen

We develop theory to understand an intriguing property of diffusion models for image generation that we term critical windows. Empirically, it has been observed that there are narrow time intervals in sampling during which particular features of the final image emerge, e.g. the image class or background color (Ho et al., 2020b; Meng et al., 2022; Choi et al., 2022; Raya & Ambrogioni, 2023; Georgiev et al., 2023; Sclocchi et al., 2024; Biroli et al., 2024). While this is advantageous for interpretability as it implies one can localize properties of the generation to a small segment of the trajectory, it seems at odds with the continuous nature of the diffusion. We propose a formal framework for studying these windows and show that for data coming from a mixture of strongly log-concave densities, these windows can be provably bounded in terms of certain measures of inter- and intra-group separation. We also instantiate these bounds for concrete examples like well-conditioned Gaussian mixtures. Finally, we use our bounds to give a rigorous interpretation of diffusion models as hierarchical samplers that progressively “decide” output features over a discrete sequence of times. We validate our bounds with experiments on synthetic data and show that critical windows may serve as a useful tool for diagnosing fairness and privacy violations in real-world diffusion models.
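
The phenomenon is easy to reproduce in one dimension, where the score of a noised two-component Gaussian mixture is available in closed form; a self-contained toy illustration, not the paper's general setting:

```python
import numpy as np

rng = np.random.default_rng(0)

# Data: even mixture of N(-a, 1) and N(+a, 1). Under the OU forward process
# the noised law stays a mixture of N(+/- m(t), 1) with m(t) = a * exp(-t).
a, T, n_steps = 4.0, 5.0, 2000
h = T / n_steps
m = lambda t: a * np.exp(-t)

def score(x, t):            # exact score of the noised mixture p_t
    return -x + m(t) * np.tanh(m(t) * x)

def w_plus(x, t):           # posterior probability of the "+a" component
    return 1.0 / (1.0 + np.exp(-2.0 * m(t) * x))

# Reverse SDE via Euler-Maruyama, started from the stationary N(0, 1).
x = rng.standard_normal()
for i in range(n_steps):
    t = T - i * h
    x += h * (x + 2.0 * score(x, t)) + np.sqrt(2.0 * h) * rng.standard_normal()
    if i % 200 == 0:
        print(f"t = {t:4.2f}   P[+ component | x_t] = {w_plus(x, t):.3f}")
# The posterior hovers near 0.5 early in the reverse process, then snaps to
# ~0 or ~1 within a narrow band of times: the critical window for the class.
```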

FOCS Conference 2024 Conference Paper

Optimal Tradeoffs for Estimating Pauli Observables

  • Sitan Chen
  • Weiyuan Gong
  • Qi Ye 0005

We revisit the problem of Pauli shadow tomography: given copies of an unknown $n$-qubit quantum state $\rho$, estimate $\mathrm{Tr}(P\rho)$ for every $P$ in some set $A$ of Pauli operators, to within additive error $\epsilon$. This has been a popular testbed for exploring the advantage of protocols with quantum memory over those without: with enough memory to measure two copies at a time, one can use Bell sampling to estimate $|\mathrm{Tr}(P\rho)|$ for all $P$ using $O(n/\epsilon^{4})$ copies, but with $k\leq n$ qubits of memory, $\Omega(2^{(n-k)/3})$ copies are needed. These results leave open several natural questions. How does this picture change in the physically relevant setting where one only needs to estimate a certain subset of Paulis? What is the optimal dependence on $\epsilon$? What is the optimal tradeoff between quantum memory and sample complexity? We answer all of these questions:

  • For any subset $A$ of Paulis and any family of measurement strategies, we completely characterize the optimal sample complexity, up to $\log |A|$ factors.
  • We show any protocol that makes $\mathrm{poly}(n)$-copy measurements must make $\Omega(1/\epsilon^{4})$ measurements.
  • For any protocol that makes $\mathrm{poly}(n)$-copy measurements and only has $k < n$ qubits of memory, we show that $\tilde{\Theta}(\min\{2^{n}/\epsilon^{2}, 2^{n-k}/\epsilon^{4}\})$ copies are necessary and sufficient.

The protocols we propose can also estimate the actual values $\mathrm{Tr}(P\rho)$, rather than just their absolute values as in prior work. Additionally, as a byproduct of our techniques, we establish tight bounds for the task of purity testing and show that it exhibits an intriguing phase transition not present in the memory-sample tradeoff for Pauli shadow tomography.
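
For contrast with the protocols studied here, the baseline memoryless strategy for a single Pauli is simply repeated single-copy measurement in $P$'s eigenbasis, which costs on the order of $1/\epsilon^2$ copies per observable; a small simulation of that baseline (illustrative only, not the paper's protocols):

```python
import numpy as np
from functools import reduce

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)

def pauli(s):  # e.g. "XX" -> X tensor X
    return reduce(np.kron, [{'I': I2, 'X': X, 'Y': Y, 'Z': Z}[c] for c in s])

def measure_pauli(rho, P, shots, rng):
    """Estimate Tr(P rho) by measuring P's eigenbasis on single copies.

    Each copy yields a +/-1 outcome with mean Tr(P rho), so a number of
    shots on the order of 1/eps^2 gives additive error eps.
    """
    vals, vecs = np.linalg.eigh(P)
    probs = np.real([v.conj() @ rho @ v for v in vecs.T])  # Born rule
    probs = np.clip(probs, 0.0, None)
    return rng.choice(vals, size=shots, p=probs / probs.sum()).mean()

rng = np.random.default_rng(0)
psi = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)   # Bell state
rho = np.outer(psi, psi.conj())
print(measure_pauli(rho, pauli("XX"), 10_000, rng),
      "vs exact", np.real(np.trace(pauli("XX") @ rho)))    # both ~ 1.0
```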

NeurIPS Conference 2024 Conference Paper

Unrolled denoising networks provably learn to perform optimal Bayesian inference

  • Aayush Karan
  • Kulin Shah
  • Sitan Chen
  • Yonina C. Eldar

Much of Bayesian inference centers around the design of estimators for inverse problems which are optimal assuming the data comes from a known prior. But what do these optimality guarantees mean if the prior is unknown? In recent years, algorithm unrolling has emerged as deep learning's answer to this age-old question: design a neural network whose layers can in principle simulate iterations of inference algorithms and train on data generated by the unknown prior. Despite its empirical success, however, it has remained unclear whether this method can provably recover the performance of its optimal, prior-aware counterparts. In this work, we prove the first rigorous learning guarantees for neural networks based on unrolling approximate message passing (AMP). For compressed sensing, we prove that when trained on data drawn from a product prior, the layers of the network approximately converge to the same denoisers used in Bayes AMP. We also provide extensive numerical experiments for compressed sensing and rank-one matrix estimation demonstrating the advantages of our unrolled architecture --- in addition to being able to obliviously adapt to general priors, it exhibits improvements over Bayes AMP in more general settings of low dimensions, non-Gaussian designs, and non-product priors.
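
For readers unfamiliar with AMP, here is a minimal compressed-sensing AMP with a soft-threshold denoiser; Bayes AMP would replace `soft` with the posterior-mean denoiser of the prior, which is the component the unrolled network is shown to learn. A sketch with generic parameter choices, not the paper's experimental setup:

```python
import numpy as np

def soft(u, lam):
    """Soft-threshold denoiser eta(u; lam)."""
    return np.sign(u) * np.maximum(np.abs(u) - lam, 0.0)

def amp(y, A, lam=0.1, iters=30):
    """Approximate message passing for y = A x + noise.

    The Onsager term z * mean(eta') / delta distinguishes AMP from plain
    iterative soft thresholding; it keeps the effective noise in each
    iteration approximately Gaussian, which is what the Bayes-optimality
    theory relies on.
    """
    m_, n = A.shape
    delta = m_ / n
    x, z = np.zeros(n), y.copy()
    for _ in range(iters):
        x = soft(x + A.T @ z, lam)
        onsager = z * np.mean(np.abs(x) > 0) / delta  # mean of eta'
        z = y - A @ x + onsager
    return x

# Toy usage: recover a sparse vector from m < n Gaussian measurements.
rng = np.random.default_rng(0)
n, m_, k = 500, 250, 25
x0 = np.zeros(n); x0[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
A = rng.standard_normal((m_, n)) / np.sqrt(m_)
y = A @ x0 + 0.01 * rng.standard_normal(m_)
print(np.linalg.norm(amp(y, A) - x0) / np.linalg.norm(x0))  # small relative error
```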

NeurIPS Conference 2024 Conference Paper

What does guidance do? A fine-grained analysis in a simple setting

  • Muthu Chidambaram
  • Khashayar Gatmiry
  • Sitan Chen
  • Holden Lee
  • Jianfeng Lu

The use of guidance in diffusion models was originally motivated by the premise that the guidance-modified score is that of the data distribution tilted by a conditional likelihood raised to some power. In this work we clarify this misconception by rigorously proving that guidance fails to sample from the intended tilted distribution. Our main result is to give a fine-grained characterization of the dynamics of guidance in two cases, (1) mixtures of compactly supported distributions and (2) mixtures of Gaussians, which reflect salient properties of guidance that manifest on real-world data. In both cases, we prove that as the guidance parameter increases, the guided model samples more heavily from the boundary of the support of the conditional distribution. We also prove that for any nonzero level of score estimation error, sufficiently large guidance will result in sampling away from the support, theoretically justifying the empirical finding that large guidance results in distorted generations. In addition to verifying these results empirically in synthetic settings, we also show how our theoretical insights can offer useful prescriptions for practical deployment.
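
For context, here is the standard guidance construction and the distribution it was presumed to target (standard background, not this paper's notation): the guided score

$$\hat s_{\gamma,t}(x) = \nabla\log p_t(x) + \gamma\big(\nabla\log p_t(x\mid c) - \nabla\log p_t(x)\big) = \nabla\log\big[p_t(x)\,p_t(c\mid x)^{\gamma}\big]$$

is, at each noise level $t$, the score of the noisy distribution tilted by $p_t(c\mid x)^\gamma$. The catch is that tilting does not commute with noising: these tilted marginals are not the diffusion marginals of the tilted data distribution $p_0(x)\,p_0(c\mid x)^\gamma$, so running the sampler with $\hat s_{\gamma,t}$ need not produce the intended tilted distribution, consistent with the negative result above.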

NeurIPS Conference 2023 Conference Paper

Learning Mixtures of Gaussians Using the DDPM Objective

  • Kulin Shah
  • Sitan Chen
  • Adam Klivans

Recent works have shown that diffusion models can learn essentially any distribution provided one can perform score estimation. Yet it remains poorly understood under what settings score estimation is possible, let alone when practical gradient-based algorithms for this task can provably succeed. In this work, we give the first provably efficient results for one of the most fundamental distribution families, Gaussian mixture models. We prove that GD on the denoising diffusion probabilistic model (DDPM) objective can efficiently recover the ground truth parameters of the mixture model in the following two settings: 1. We show GD with random initialization learns mixtures of two spherical Gaussians in $d$ dimensions with $1/\text{poly}(d)$-separated centers. 2. We show GD with a warm start learns mixtures of $K$ spherical Gaussians with $\Omega(\sqrt{\log(\min(K, d))})$-separated centers. A key ingredient in our proofs is a new connection between score-based methods and two other approaches to distribution learning, EM and spectral methods.
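
For reference, the DDPM objective the result concerns, in its standard form (standard background notation, not the paper's): with noising $x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon$,

$$\min_\theta\; \mathbb{E}_{t,\;x_0\sim q,\;\epsilon\sim\mathcal N(0,I)}\Big[\big\|\epsilon - \epsilon_\theta(x_t, t)\big\|^2\Big].$$

For a mixture of spherical Gaussians, the population minimizer of this loss has a closed form given by a softmax-weighted average over the mixture components, which is plausibly where the abstract's connection between score-based methods and EM enters.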

ICML Conference 2023 Conference Paper

Restoration-Degradation Beyond Linear Diffusions: A Non-Asymptotic Analysis For DDIM-type Samplers

  • Sitan Chen
  • Giannis Daras
  • Alexandros G. Dimakis

We develop a framework for non-asymptotic analysis of deterministic samplers used for diffusion generative modeling. Several recent works have analyzed stochastic samplers using tools like Girsanov’s theorem and a chain rule variant of the interpolation argument. Unfortunately, these techniques give vacuous bounds when applied to deterministic samplers. We give a new operational interpretation for deterministic sampling by showing that one step along the probability flow ODE can be expressed as two steps: 1) a restoration step that runs gradient ascent on the conditional log-likelihood at some infinitesimally previous time, and 2) a degradation step that runs the forward process using noise pointing back towards the current iterate. This perspective allows us to extend denoising diffusion implicit models to general, non-linear forward processes. We then develop the first polynomial convergence bounds for these samplers under mild conditions on the data distribution.
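
In the familiar Gaussian-diffusion special case, this decomposition can be written explicitly (standard DDIM algebra, shown here as an illustration rather than the paper's general construction): with $\hat x_0(x_t)$ the model's denoised estimate of $\mathbb{E}[x_0 \mid x_t]$,

$$x_{t-1} = \underbrace{\sqrt{\bar\alpha_{t-1}}\;\hat x_0(x_t)}_{\text{restoration}} \;+\; \underbrace{\sqrt{1-\bar\alpha_{t-1}}\;\frac{x_t - \sqrt{\bar\alpha_t}\,\hat x_0(x_t)}{\sqrt{1-\bar\alpha_t}}}_{\text{degradation, with noise pointing back at } x_t}.$$

The first term moves toward higher conditional likelihood; the second re-runs the forward process, but with the particular noise direction that points back toward the current iterate.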

ICLR Conference 2023 Conference Paper

Sampling is as easy as learning the score: theory for diffusion models with minimal data assumptions

  • Sitan Chen
  • Sinho Chewi
  • Jerry Li 0001
  • Yuanzhi Li
  • Adil Salim
  • Anru R. Zhang

We provide theoretical convergence guarantees for score-based generative models (SGMs) such as denoising diffusion probabilistic models (DDPMs), which constitute the backbone of large-scale real-world generative models such as DALL$\cdot$E 2. Our main result is that, assuming accurate score estimates, such SGMs can efficiently sample from essentially any realistic data distribution. In contrast to prior works, our results (1) hold for an $L^2$-accurate score estimate (rather than $L^\infty$-accurate); (2) do not require restrictive functional inequality conditions that preclude substantial non-log-concavity; (3) scale polynomially in all relevant problem parameters; and (4) match state-of-the-art complexity guarantees for discretization of the Langevin diffusion, provided that the score error is sufficiently small. We view this as strong theoretical justification for the empirical success of SGMs. We also examine SGMs based on the critically damped Langevin diffusion (CLD). Contrary to conventional wisdom, we provide evidence that the use of the CLD does *not* reduce the complexity of SGMs.
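
A minimal statement of the objects involved (standard SGM background consistent with the abstract): for an Ornstein-Uhlenbeck forward process, DDPM discretizes the reverse-time SDE

$$dX_t = -X_t\,dt + \sqrt{2}\,dB_t \;\;\text{(forward)}, \qquad dY_s = \big[Y_s + 2\nabla\log p_{T-s}(Y_s)\big]\,ds + \sqrt{2}\,d\bar B_s \;\;\text{(reverse)},$$

with the true score $\nabla\log p_t$ replaced by the learned $s_\theta(\cdot, t)$. The point of condition (1) is that an $L^2(p_t)$ bound on $s_\theta - \nabla\log p_t$ suffices, which is what score matching actually controls, rather than a pointwise $L^\infty$ bound.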

NeurIPS Conference 2023 Conference Paper

The probability flow ODE is provably fast

  • Sitan Chen
  • Sinho Chewi
  • Holden Lee
  • Yuanzhi Li
  • Jianfeng Lu
  • Adil Salim

We provide the first polynomial-time convergence guarantees for the probability flow ODE implementation (together with a corrector step) of score-based generative modeling. Our analysis is carried out in the wake of recent results obtaining such guarantees for the SDE-based implementation (i.e., denoising diffusion probabilistic modeling or DDPM), but requires the development of novel techniques for studying deterministic dynamics without contractivity. Through the use of a specially chosen corrector step based on the underdamped Langevin diffusion, we obtain better dimension dependence than prior works on DDPM ($O(\sqrt d)$ vs. $O(d)$, assuming smoothness of the data distribution), highlighting potential advantages of the ODE framework.
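
For comparison with the reverse SDE of the previous entry (standard background, not the paper's notation): under the same Ornstein-Uhlenbeck forward process, the probability flow ODE transports the same marginals $p_t$ deterministically,

$$\frac{dY_s}{ds} = Y_s + \nabla\log p_{T-s}(Y_s),$$

carrying half the score weight of the reverse SDE and no noise; the corrector interleaves underdamped Langevin steps targeting the current $p_t$ to control the error of the deterministic flow.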

FOCS Conference 2023 Conference Paper

When Does Adaptivity Help for Quantum State Learning?

  • Sitan Chen
  • Brice Huang
  • Jerry Li 0001
  • Allen Liu
  • Mark Sellke

We consider the classic question of state tomography: given copies of an unknown quantum state $\rho \in \mathbb{C}^{d \times d}$, output $\widehat{\rho}$ which is close to $\rho$ in some sense, e.g. trace distance or fidelity. When one is allowed to make coherent measurements entangled across all copies, $\Theta\left(d^{2} / \varepsilon^{2}\right)$ copies are necessary and sufficient to get trace distance $\varepsilon$ [18], [29]. Unfortunately, the protocols achieving this rate incur large quantum memory overheads that preclude implementation on near-term devices. On the other hand, the best known protocol using incoherent (single-copy) measurements uses $O\left(d^{3} / \varepsilon^{2}\right)$ copies [24], and multiple papers have posed it as an open question to understand whether or not this rate is tight [6], [18]. In this work, we fully resolve this question, by showing that any protocol using incoherent measurements, even if they are chosen adaptively, requires $\Omega\left(d^{3} / \varepsilon^{2}\right)$ copies, matching the upper bound of [24]. We do so by a new proof technique which directly bounds the “tilt” of the posterior distribution after measurements, which yields a surprisingly short proof of our lower bound, and which we believe may be of independent interest. While this implies that adaptivity does not help for tomography with respect to trace distance, we show that it actually does help for tomography with respect to infidelity. We give an adaptive algorithm that outputs a state which is $\gamma$-close in infidelity to $\rho$ using only $\widetilde{O}\left(d^{3} / \gamma\right)$ copies, which is optimal for incoherent measurements. In contrast, it is known [18] that any nonadaptive algorithm requires $\Omega\left(d^{3} / \gamma^{2}\right)$ copies. While it is folklore that in 2 dimensions, one can achieve a scaling of $O(1 / \gamma)$, to the best of our knowledge, our algorithm is the first to achieve the optimal rate in all dimensions.

NeurIPS Conference 2022 Conference Paper

Hardness of Noise-Free Learning for Two-Hidden-Layer Neural Networks

  • Sitan Chen
  • Aravind Gollakota
  • Adam Klivans
  • Raghu Meka

We give superpolynomial statistical query (SQ) lower bounds for learning two-hidden-layer ReLU networks with respect to Gaussian inputs in the standard (noise-free) model. No general SQ lower bounds were known for learning ReLU networks of any depth in this setting: previous SQ lower bounds held only for adversarial noise models (agnostic learning) (Kothari and Klivans 2014, Goel et al. 2020a, Diakonikolas et al. 2020a) or restricted models such as correlational SQ (Goel et al. 2020b, Diakonikolas et al. 2020b). Prior work hinted at the impossibility of our result: Vempala and Wilmes (2019) showed that general SQ lower bounds cannot apply to any real-valued family of functions that satisfies a simple non-degeneracy condition. To circumvent their result, we refine a lifting procedure due to Daniely and Vardi (2021) that reduces Boolean PAC learning problems to Gaussian ones. We show how to extend their technique to other learning models and, in many well-studied cases, obtain a more efficient reduction. As such, we also prove new cryptographic hardness results for PAC learning two-hidden-layer ReLU networks, as well as new lower bounds for learning constant-depth ReLU networks from membership queries.

NeurIPS Conference 2022 Conference Paper

Learning (Very) Simple Generative Models Is Hard

  • Sitan Chen
  • Jerry Li
  • Yuanzhi Li

Motivated by the recent empirical successes of deep generative models, we study the computational complexity of the following unsupervised learning problem. For an unknown neural network $F: \mathbb{R}^d\to\mathbb{R}^{d'}$, let $D$ be the distribution over $\mathbb{R}^{d'}$ given by pushing the standard Gaussian $\mathcal{N}(0, \textrm{Id}_d)$ through $F$. Given i.i.d. samples from $D$, the goal is to output *any* distribution close to $D$ in statistical distance. We show under the statistical query (SQ) model that no polynomial-time algorithm can solve this problem even when the output coordinates of $F$ are one-hidden-layer ReLU networks with $\log(d)$ neurons. Previously, the best lower bounds for this problem simply followed from lower bounds for *supervised learning* and required at least two hidden layers and $\textrm{poly}(d)$ neurons [Daniely-Vardi '21, Chen-Gollakota-Klivans-Meka '22]. The key ingredient in our proof is an ODE-based construction of a compactly supported, piecewise-linear function $f$ with polynomially-bounded slopes such that the pushforward of $\mathcal{N}(0, 1)$ under $f$ matches all low-degree moments of $\mathcal{N}(0, 1)$.

ICLR Conference 2022 Conference Paper

Minimax Optimality (Probably) Doesn't Imply Distribution Learning for GANs

  • Sitan Chen
  • Jerry Li 0001
  • Yuanzhi Li
  • Raghu Meka

Arguably the most fundamental question in the theory of generative adversarial networks (GANs) is to understand when GANs can actually learn the underlying distribution. Theoretical and empirical evidence (see e.g. Arora-Risteski-Zhang '18) suggest local optimality of the empirical training objective is insufficient, yet it does not rule out the possibility that achieving a true population minimax optimal solution might imply distribution learning. In this paper, we show that standard cryptographic assumptions imply that this stronger condition is still insufficient. Namely, we show that if local pseudorandom generators (PRGs) exist, then for a large family of natural target distributions, there are ReLU network generators of constant depth and poly size which take Gaussian random seeds so that (i) the output is far in Wasserstein distance from the target distribution, but (ii) no polynomially large Lipschitz discriminator ReLU network can detect this. This implies that even achieving a population minimax optimal solution to the Wasserstein GAN objective is likely insufficient for distribution learning. Our techniques reveal a deep connection between GANs and PRGs, which we believe will lead to further insights into the computational landscape of GANs.

FOCS Conference 2022 Conference Paper

Tight Bounds for Quantum State Certification with Incoherent Measurements

  • Sitan Chen
  • Jerry Li 0001
  • Brice Huang
  • Allen Liu

We consider the problem of quantum state certification, where we are given the description of a mixed state $\sigma\in\mathbb{C}^{d\times d}$, $n$ copies of a mixed state $\rho\in\mathbb{C}^{d\times d}$, and $\varepsilon > 0$, and we are asked to determine whether $\rho=\sigma$ or whether $\|\rho-\sigma\|_{1} > \varepsilon$. When $\sigma$ is the maximally mixed state $\frac{1}{d}I_{d}$, this is known as mixedness testing. We focus on algorithms which use incoherent measurements, i.e., which only measure one copy of $\rho$ at a time. Unlike those that use entangled, multi-copy measurements, these can be implemented without persistent quantum memory and thus represent a large class of protocols that can be run on current or near-term devices. For mixedness testing, there is a folklore algorithm which uses incoherent measurements and only needs $O(d^{3/2}/\varepsilon^{2})$ copies. The algorithm is non-adaptive, that is, its measurements are fixed ahead of time, and is known to be optimal for non-adaptive algorithms. However, when the algorithm can make arbitrary incoherent measurements, the best known lower bound is only $\Omega(d^{4/3}/\varepsilon^{2})$ [5], and it has been an outstanding open problem to close this polynomial gap. In this work:

  • We settle the copy complexity of mixedness testing with incoherent measurements and show that $\Omega(d^{3/2}/\varepsilon^{2})$ copies are necessary. This fully resolves open questions of [15] and [5].
  • We show that the instance-optimal bounds for state certification to general $\sigma$ first derived in [7] for non-adaptive measurements also hold for arbitrary incoherent measurements.

Qualitatively, our results say that adaptivity does not help at all for these problems. Our results are based on new techniques that allow us to reduce the problem to understanding the concentration of certain matrix martingales, which we believe may be of independent interest.

NeurIPS Conference 2021 Conference Paper

Efficiently Learning One Hidden Layer ReLU Networks From Queries

  • Sitan Chen
  • Adam Klivans
  • Raghu Meka

While the problem of PAC learning neural networks from samples has received considerable attention in recent years, in certain settings like model extraction attacks, it is reasonable to imagine having more than just the ability to observe random labeled examples. Motivated by this, we consider the following problem: given \emph{black-box query access} to a neural network $F$, recover $F$ up to some error. Formally, we show that if $F$ is an arbitrary one hidden layer neural network with ReLU activations, there is an algorithm with query complexity and runtime polynomial in all parameters which outputs a network $F'$ achieving low square loss relative to $F$ with respect to the Gaussian measure. While a number of works in the security literature have proposed and empirically demonstrated the effectiveness of certain algorithms for this problem, ours is to the best of our knowledge the first provable guarantee in this vein.
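
A toy illustration of why queries are more powerful than random samples here, which is not the paper's algorithm: restricted to any line $x(s) = u + sv$, a one-hidden-layer ReLU network is piecewise linear in $s$, so black-box queries can locate the points where individual ReLUs switch on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden one-hidden-layer ReLU network (unknown to the learner).
d, r = 10, 4
W, b, c = rng.standard_normal((r, d)), rng.standard_normal(r), rng.standard_normal(r)
F = lambda x: c @ np.maximum(W @ x + b, 0.0)

# Query F along a random line x(s) = u + s*v; F is piecewise linear in s,
# and kinks (spikes in the second difference) mark ReLU sign changes.
u, v = rng.standard_normal(d), rng.standard_normal(d)
s = np.linspace(-10, 10, 20001)
vals = np.array([F(u + si * v) for si in s])
kinks = s[1:-1][np.abs(np.diff(vals, 2)) > 1e-6]  # 1-2 grid points per kink

true_kinks = np.sort(-(W @ u + b) / (W @ v))      # solves w_i . x(s) + b_i = 0
true_kinks = true_kinks[np.abs(true_kinks) < 10]  # those inside the scan range
print("true kinks:    ", np.round(true_kinks, 3))
print("detected near: ", np.round(kinks, 3))
```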

FOCS Conference 2021 Conference Paper

Exponential Separations Between Learning With and Without Quantum Memory

  • Sitan Chen
  • Jordan Cotler
  • Hsin-Yuan Huang
  • Jerry Li 0001

We study the power of quantum memory for learning properties of quantum systems and dynamics, which is of great importance in physics and chemistry. Many state-of-the-art learning algorithms require access to an additional external quantum memory. While such a quantum memory is not required a priori, in many cases, algorithms that do not utilize quantum memory require much more data than those which do. We show that this trade-off is inherent in a wide range of learning problems. Our results include the following:

  • We show that to perform shadow tomography on an $n$-qubit state $\rho$ with $M$ observables, any algorithm without quantum memory requires $\tilde{\Omega}(\min(M, 2^{n}))$ samples of $\rho$ in the worst case. Up to log factors, this matches the upper bound of [1], and completely resolves an open question in [2], [3].
  • We establish exponential separations between algorithms with and without quantum memory for purity testing, distinguishing scrambling and depolarizing evolutions, and uncovering symmetry in physical dynamics. Our separations improve and generalize prior work of [4] by allowing for a broader class of algorithms without quantum memory.
  • We give the first tradeoff between quantum memory and sample complexity. More precisely, we prove that to estimate absolute values of all $n$-qubit Pauli observables, algorithms with $k < n$ qubits of quantum memory require at least $\Omega(2^{(n-k)/3})$ samples, but there is an algorithm using $n$-qubit quantum memory which only requires $\mathcal{O}(n)$ samples.

The separations we show are sufficiently large and could already be evident, for instance, with tens of qubits. This provides a concrete path towards demonstrating real-world advantage for learning algorithms with quantum memory.

FOCS Conference 2021 Conference Paper

Learning Deep ReLU Networks Is Fixed-Parameter Tractable

  • Sitan Chen
  • Adam R. Klivans
  • Raghu Meka

We consider the problem of learning an unknown ReLU network with respect to Gaussian inputs and obtain the first nontrivial results for networks of depth more than two. We give an algorithm whose running time is a fixed polynomial in the ambient dimension and some (exponentially large) function of only the network's parameters. Our results provably cannot be obtained using gradient-based methods and give the first example of a class of efficiently learnable neural networks that gradient descent will fail to learn. Our bounds depend on the number of hidden units, depth, spectral norm of the weight matrices, and Lipschitz constant of the overall network (we show that some dependence on the Lipschitz constant is necessary). We also give a bound that is doubly exponential in the size of the network but is independent of spectral norm. In contrast, prior work for learning networks of depth three or higher requires exponential time in the ambient dimension, even when the above parameters are bounded by a constant. Additionally, all prior work for the depth-two case requires well-conditioned weights and/or positive coefficients to obtain efficient run-times. Our algorithm does not require these assumptions. Our main technical tool is a type of filtered PCA that can be used to iteratively recover an approximate basis for the subspace spanned by the hidden units in the first layer. Our analysis leverages new structural results on lattice polynomials from tropical geometry.

ICLR Conference 2021 Conference Paper

On InstaHide, Phase Retrieval, and Sparse Matrix Factorization

  • Sitan Chen
  • Xiaoxiao Li
  • Zhao Song 0002
  • Danyang Zhuo

In this work, we examine the security of InstaHide, a scheme recently proposed by [HSLA20] for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the distributed learners, InstaHide takes a convex combination of private feature vectors and randomly flips the sign of each entry of the resulting vector with probability 1/2. A salient question is whether this scheme is secure in any provable sense, perhaps under a plausible complexity-theoretic assumption. The answer to this turns out to be quite subtle and closely related to the average-case complexity of a multi-task, missing-data version of the classic problem of phase retrieval that is interesting in its own right. Motivated by this connection, under the standard distributional assumption that the public/private feature vectors are isotropic Gaussian, we design an algorithm that can actually recover a private vector using only the public vectors and a sequence of synthetic vectors generated by InstaHide.
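
The encoding itself is simple enough to state in code; a minimal sketch of the scheme as described in this abstract (the deployed InstaHide also mixes in public data and uses specific weight choices, which we elide here):

```python
import numpy as np

def instahide_encode(private_vectors, rng):
    """Convex combination of feature vectors, then an independent random
    sign flip on every coordinate, per the description in the abstract."""
    lam = rng.dirichlet(np.ones(len(private_vectors)))   # convex weights
    mixed = lam @ np.asarray(private_vectors)
    signs = rng.choice([-1.0, 1.0], size=mixed.shape)    # flip w.p. 1/2
    return signs * mixed

rng = np.random.default_rng(0)
private = rng.standard_normal((4, 16))    # four private feature vectors
print(instahide_encode(private, rng))     # one synthetic training example
```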

FOCS Conference 2021 Conference Paper

Online and Distribution-Free Robustness: Regression and Contextual Bandits with Huber Contamination

  • Sitan Chen
  • Frederic Koehler
  • Ankur Moitra
  • Morris Yau

In this work we revisit two classic high-dimensional online learning problems, namely linear regression and contextual bandits, from the perspective of adversarial robustness. Existing works in algorithmic robust statistics make strong distributional assumptions that ensure that the input data is evenly spread out or comes from a nice generative model. Is it possible to achieve strong robustness guarantees even without distributional assumptions altogether, where the sequence of tasks we are asked to solve is adaptively and adversarially chosen? We answer this question in the affirmative for both linear regression and contextual bandits. In fact, our algorithms succeed where conventional methods fail. In particular we show strong lower bounds against Huber regression and more generally any convex $M$-estimator. Our approach is based on a novel alternating minimization scheme that interleaves ordinary least-squares with a simple convex program that finds the optimal reweighting of the distribution under a spectral constraint. Our results obtain essentially optimal dependence on the contamination level $\eta$, reach the optimal breakdown point, and naturally apply to infinite dimensional settings where the feature vectors are represented implicitly via a kernel map.
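
To convey the flavor of the alternation (and only the flavor: the paper's reweighting step solves a convex program under a spectral constraint, whereas this sketch substitutes naive hard trimming by residual and assumes the contamination level $\eta$ is known):

```python
import numpy as np

def alternating_robust_regression(X, y, eta, iters=10):
    """Alternate least squares with a reweighting of the data (sketch).

    Simplification of the abstract's scheme: the reweighting here just
    keeps the (1 - eta)-fraction of points with smallest residuals,
    instead of solving the paper's spectrally-constrained convex program.
    """
    n = len(y)
    keep = np.arange(n)
    for _ in range(iters):
        w, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        keep = np.argsort((X @ w - y) ** 2)[: int((1 - eta) * n)]
    return w

rng = np.random.default_rng(0)
n, d, eta = 500, 5, 0.2
w_true = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = X @ w_true + 0.1 * rng.standard_normal(n)
y[rng.choice(n, int(eta * n), replace=False)] += 20.0   # Huber contamination
print(np.linalg.norm(alternating_robust_regression(X, y, eta) - w_true))  # small
```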

NeurIPS Conference 2020 Conference Paper

Classification Under Misspecification: Halfspaces, Generalized Linear Models, and Evolvability

  • Sitan Chen
  • Frederic Koehler
  • Ankur Moitra
  • Morris Yau

In this paper, we revisit the problem of distribution-independently learning halfspaces under Massart noise with rate $\eta$. Recent work resolved a long-standing problem in this model of efficiently learning to error $\eta + \epsilon$ for any $\epsilon > 0$, by giving an improper learner that partitions space into $\text{poly}(d, 1/\epsilon)$ regions. Here we give a much simpler algorithm and settle a number of outstanding open questions: (1) We give the first \emph{proper} learner for Massart halfspaces that achieves $\eta + \epsilon$. (2) Based on (1), we develop a blackbox knowledge distillation procedure to convert an arbitrarily complex classifier to an equally good proper classifier. (3) By leveraging a simple but overlooked connection to \emph{evolvability}, we show any SQ algorithm requires super-polynomially many queries to achieve $\mathsf{OPT} + \epsilon$. We then zoom out to study generalized linear models and give an efficient algorithm for learning under a challenging new corruption model generalizing Massart noise. Finally we study our algorithm for learning halfspaces under Massart noise empirically and find that it exhibits some appealing fairness properties as a byproduct of its strong provable robustness guarantees.

FOCS Conference 2020 Conference Paper

Entanglement is Necessary for Optimal Quantum Property Testing

  • Sébastien Bubeck
  • Sitan Chen
  • Jerry Li 0001

There has been a surge of progress in recent years in developing algorithms for testing and learning quantum states that achieve optimal copy complexity [1]–[6]. Unfortunately, they require the use of entangled measurements across many copies of the underlying state and thus remain outside the realm of what is currently experimentally feasible. A natural question is whether one can match the copy complexity of such algorithms using only independent, but possibly adaptively chosen, measurements on individual copies. We answer this in the negative for arguably the most basic quantum testing problem: deciding whether a given $d$-dimensional quantum state is equal to or $\epsilon$-far in trace distance from the maximally mixed state. While it is known how to achieve optimal $O(d/\epsilon^{2})$ copy complexity using entangled measurements, we show that with independent measurements, $\Omega(d^{4/3}/\epsilon^{2})$ is necessary, even if the measurements are chosen adaptively. This resolves a question posed in [7]. To obtain this lower bound, we develop several new techniques, including a chain-rule style proof of Paninski's lower bound for classical uniformity testing, which may be of independent interest.

NeurIPS Conference 2020 Conference Paper

Learning Structured Distributions From Untrusted Batches: Faster and Simpler

  • Sitan Chen
  • Jerry Li
  • Ankur Moitra

We revisit the problem of learning from untrusted batches introduced by Qiao and Valiant [QV17]. Recently, Jain and Orlitsky [JO19] gave a simple semidefinite programming approach based on the cut-norm that achieves essentially information-theoretically optimal error in polynomial time. Concurrently, Chen et al. [CLM19] considered a variant of the problem where the underlying distribution $\mu$ is assumed to be structured, e.g. log-concave, monotone hazard rate, $t$-modal, etc. In this case, it is possible to achieve the same error with sample complexity sublinear in $n$, and they exhibited a quasi-polynomial time algorithm for doing so using Haar wavelets. In this paper, we find an appealing way to synthesize the techniques of [JO19] and [CLM19] to give the best of both worlds: an algorithm which runs in polynomial time and can exploit structure in the underlying distribution to achieve sublinear sample complexity. Along the way, we simplify the approach of [JO19] by avoiding the need for SDP rounding and giving a more direct interpretation of it through the lens of soft filtering, a powerful recent technique in high-dimensional robust estimation. We validate the usefulness of our algorithms in preliminary experimental evaluations.