Arrow Research search

Author name cluster

Mario Beraha

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

3 papers
1 author row

Possible papers

3

JMLR Journal 2024 Journal Article

Random measure priors in Bayesian recovery from sketches

  • Mario Beraha
  • Stefano Favaro
  • Matteo Sesia

This paper introduces a Bayesian nonparametric approach to frequency recovery from lossy-compressed discrete data, leveraging all information contained in a sketch obtained through random hashing. By modeling the data points as random samples from an unknown discrete distribution endowed with a Poisson-Kingman prior, we derive the posterior distribution of a symbol's empirical frequency given the sketch. This leads to principled frequency estimates through mean functionals, e.g., the posterior mean, median and mode. We highlight applications of this general result to Dirichlet process and Pitman-Yor process priors. Notably, we prove that the former prior uniquely satisfies a sufficiency property that simplifies the posterior distribution, while the latter enables a convenient large-sample asymptotic approximation. Additionally, we extend our approach to the problem of cardinality recovery, estimating the number of distinct symbols in the sketched dataset. Our approach to frequency recovery also adapts to a more general “traits” setting, where each data point has integer levels of association with multiple symbols, typically referred to as “traits”. By employing a generalized Indian buffet process, we compute the posterior distribution of a trait's frequency using both the Poisson and Bernoulli distributions for the trait association levels, respectively yielding exact and approximate posterior frequency distributions. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2024. ( edit, beta )

JMLR Journal 2022 Journal Article

Projected Statistical Methods for Distributional Data on the Real Line with the Wasserstein Metric

  • Matteo Pegoraro
  • Mario Beraha

We present a novel class of projected methods to perform statistical analysis on a data set of probability distributions on the real line, with the 2-Wasserstein metric. We focus in particular on Principal Component Analysis (PCA) and regression. To define these models, we exploit a representation of the Wasserstein space closely related to its weak Riemannian structure by mapping the data to a suitable linear space and using a metric projection operator to constrain the results in the Wasserstein space. By carefully choosing the tangent point, we are able to derive fast empirical methods, exploiting a constrained B-spline approximation. As a byproduct of our approach, we are also able to derive faster routines for previous work on PCA for distributions. By means of simulation studies, we compare our approaches to previously proposed methods, showing that our projected PCA has similar performance for a fraction of the computational cost and that the projected regression is extremely flexible even under misspecification. Several theoretical properties of the models are investigated, and asymptotic consistency is proven. Two real world applications to Covid-19 mortality in the US and wind speed forecasting are discussed. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2022. ( edit, beta )

AAAI Conference 2021 Conference Paper

Fast PCA in 1-D Wasserstein Spaces via B-splines Representation and Metric Projection

  • Matteo Pegoraro
  • Mario Beraha

We address the problem of performing Principal Component Analysis over a family of probability measures on the real line, using the Wasserstein geometry. We present a novel representation of the 2-Wasserstein space, based on a well known isometric bijection and a B-spline expansion. Thanks to this representation, we are able to reinterpret previous work and derive more efficient optimization routines for existing approaches. As shown in our simulations, the solution of these optimization problems can be costly in practice and thus pose a limit to their usage. We propose a novel definition of Principal Component Analysis in the Wasserstein space that, when used in combination with the B-spline representation, yields a straightforward optimization problem that is extremely fast to compute. Through extensive simulation studies, we show how our PCA performs similarly to the ones already proposed in the literature while retaining a much smaller computational cost. We apply our method to a real dataset of mortality rates due to Covid-19 in the US, concluding that our analyses are consistent with the current scientific consensus on the disease.