FOCS 2025
Implicit High-Order Moment Tensor Estimation and Learning Latent Variable Models
Abstract
We study the general task of learning latent-variable models on ℝ d with k hidden parameters. A common technique to address this task algorithmically is (some version of) the method of moments. Unfortunately, moment-based approaches are often hampered by the fact that the moment tensors of super-constant degree cannot even be written down in polynomial time. Motivated by such learning applications, we develop a general efficient algorithm for implicit moment tensor computation. Roughly speaking, our algorithm computes in poly(d, k) time a succinct approximate description of tensors of the form ${M_m} = \sum\nolimits_{i = 1}^k {{w_i}} v_i^{ \otimes m}$, for w i ∈ ℝ + —even for m = ω(1)—assuming that there exists an unbiased estimator for M m with small variance that takes an appropriately nice form. Our framework broadly generalizes, both conceptually and technically, the work of [1] which developed an efficient algorithm for the specific moment tensors that arise in the task of clustering mixtures of spherical Gaussians. By leveraging our implicit moment estimation algorithm, we obtain the first poly(d, k)-time learning algorithms for the following classical latent-variable models—thereby resolving or making significant progress towards a number of important open problems in the literature. • Mixtures of Linear Regressions Given i. i. d. samples (x, y) with x ∼ N(0, I) and such that the joint distribution on (x, y) is an unknown k-mixture of linear regressions on ℝ d+1 corrupted with Gaussian noise, the goal is to learn the underlying distribution in total variation distance. We give a poly(d, k, 1/ϵ)-time algorithm for this task, where ϵ is the desired error. The previously best algorithm has super-polynomial complexity in k. • Mixtures of Spherical Gaussians Given i. i. d. samples from a k-mixture of identity covariance Gaussians on ℝ d, the goal is to learn the target mixture. For density estimation, we give a poly(d, k, 1/ ϵ)-time learning algorithm, where ϵ is the desired total variation error, under the condition that the means lie in a ball of radius $O(\sqrt {\log k} )$. Prior algorithms incur super-polynomial complexity in k. For parameter estimation, we give a poly(d, k, 1/ ϵ)-time algorithm where ϵ is the target accuracy, under the optimal mean separation of Ω(log 1/2 (k/ϵ)) and the condition that the largest distance is comparable to the smallest. Prior polynomial-time parameter estimation algorithms require separation Ω(log 1/2+c (k/ϵ)), for c > 0. • Positive Linear Combinations of Non-Linear Activations Given i. i. d. samples (x, y) with x ∼ N(0, I) and y = F(x), where F is a positive linear combination of k reasonable non-linear activations on ℝ d, the goal is to learn the target function in L 2 -norm. Our main result is a general algorithm for this task with complexity poly(d, k)g(ϵ), where ϵ is the desired error and the function g depends on the Hermite concentration of the target class of functions. Specifically, for positive linear combinations of ReLU activations, our algorithm has complexity poly(d, k)2 poly(1/ϵ). This is the first algorithm for this class that runs in poly(d, k) time for sub-constant values of ϵ = o k, d (1). Finally, for positive linear combinations of cosine activations with bounded frequency, our algorithm runs in poly(d, k, 1/ ϵ) time.
Authors
Keywords
Context
- Venue
- IEEE Symposium on Foundations of Computer Science
- Archive span
- 1975-2025
- Indexed papers
- 3809
- Paper id
- 375866144525197317