Arrow Research search

Author name cluster

Arthur Gretton

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

106 papers
2 author rows

Possible papers

106

JMLR Journal 2025 Journal Article

(De)-regularized Maximum Mean Discrepancy Gradient Flow

  • Zonghao Chen
  • Aratrika Mustafi
  • Pierre Glaser
  • Anna Korba
  • Arthur Gretton
  • Bharath K. Sriperumbudur

We introduce a (de)-regularization of the Maximum Mean Discrepancy (DrMMD) and its Wasserstein gradient flow. Existing gradient flows that transport samples from a source distribution to a target distribution using only target samples either lack a tractable numerical implementation ($f$-divergence flows) or require strong assumptions and modifications, such as noise injection, to ensure convergence (Maximum Mean Discrepancy flows). In contrast, the DrMMD flow can simultaneously (i) guarantee near-global convergence for a broad class of targets in both continuous and discrete time, and (ii) be implemented in closed form using only samples. The former is achieved by leveraging the connection between the DrMMD and the $\chi^2$-divergence, while the latter comes from treating DrMMD as MMD with a de-regularized kernel. Our numerical scheme employs an adaptive de-regularization schedule throughout the flow to optimally balance the trade-off between discretization errors and deviations from the $\chi^2$ regime. The potential of the DrMMD flow is demonstrated across several numerical experiments, including a large-scale setting of training student/teacher networks.
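
For orientation, the sketch below implements a plain MMD particle gradient flow, the baseline that DrMMD modifies via kernel de-regularization; it is a minimal illustration, not the paper's scheme, and the Gaussian kernel, bandwidth, step size and step count are assumed values.

    import numpy as np

    def gaussian_witness_grad(x, particles, target, sigma):
        """Gradient of the MMD witness f(x) = mean_j k(x, x_j) - mean_j k(x, y_j)
        at each row of x, for a Gaussian kernel with bandwidth sigma."""
        def term(a, b):
            diff = a[:, None, :] - b[None, :, :]                        # (n_a, n_b, d)
            k = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))         # (n_a, n_b)
            return (-(diff / sigma ** 2) * k[..., None]).mean(axis=1)   # (n_a, d)
        return term(x, particles) - term(x, target)

    def mmd_flow(source, target, sigma=1.5, step=0.2, n_steps=300):
        """Move source particles along the negative witness gradient."""
        x = source.copy()
        for _ in range(n_steps):
            x -= step * gaussian_witness_grad(x, x, target, sigma)
        return x

    rng = np.random.default_rng(0)
    src = rng.normal(-1.0, 0.5, size=(300, 1))
    tgt = rng.normal(2.0, 0.5, size=(300, 1))
    print(mmd_flow(src, tgt).mean())   # particles drift towards the target mean (~2)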

UAI Conference 2025 Conference Paper

A Unified Data Representation Learning for Non-parametric Two-sample Testing

  • Xunye Tian
  • Liuhua Peng
  • Zhijian Zhou
  • Mingming Gong
  • Arthur Gretton
  • Feng Liu 0003

Learning effective data representations has been crucial in non-parametric two-sample testing. Common approaches first split the data into training and test sets and then learn data representations purely on the training set. However, recent theoretical studies have shown that, as long as the sample indexes are not used during the learning process, the whole dataset can be used to learn data representations while still ensuring control of Type-I errors. This fact motivates us to use the test set (but without sample indexes) to facilitate data representation learning for testing. To this end, we propose a representation-learning two-sample testing (RL-TST) framework. RL-TST first performs purely self-supervised representation learning on the entire dataset to capture inherent representations (IRs) that reflect the underlying data manifold. A discriminative model is then trained on these IRs to learn discriminative representations (DRs), enabling the framework to leverage both the rich structural information from IRs and the discriminative power of DRs. Extensive experiments demonstrate that RL-TST outperforms representative approaches by simultaneously using data manifold information in the test set and enhancing test power by finding the DRs with the training set.

ICML Conference 2025 Conference Paper

Accelerated Diffusion Models via Speculative Sampling

  • Valentin De Bortoli
  • Alexandre Galashov
  • Arthur Gretton
  • Arnaud Doucet

Speculative sampling is a popular technique for accelerating inference in Large Language Models by generating candidate tokens using a fast draft model and then accepting or rejecting them based on the target model’s distribution. While speculative sampling was previously limited to discrete sequences, we extend it to diffusion models, which generate samples via continuous, vector-valued Markov chains. In this context, the target model is a high-quality but computationally expensive diffusion model. We propose various drafting strategies, including a simple and effective approach that does not require training a draft model and is applicable out-of-the-box to any diffusion model. We demonstrate significant generation speedup on various diffusion models, halving the number of function evaluations while generating exact samples from the target model. Finally, we also show how this procedure can be used to accelerate Langevin diffusions to sample unnormalized distributions.
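
As background, the sketch below shows the standard token-level acceptance rule used by speculative sampling in the discrete (LLM) setting that the paper starts from; the extension to continuous diffusion transitions described in the abstract is not reproduced here, and the function name and interface are illustrative.

    import numpy as np

    def speculative_accept(p_target, q_draft, draft_token, rng):
        """Accept a token sampled from the draft distribution q with probability
        min(1, p/q); on rejection, resample from the residual max(p - q, 0),
        renormalised. The returned token is exactly distributed according to p."""
        ratio = p_target[draft_token] / q_draft[draft_token]
        if rng.uniform() < min(1.0, ratio):
            return draft_token
        residual = np.clip(p_target - q_draft, 0.0, None)
        return rng.choice(len(p_target), p=residual / residual.sum())

    rng = np.random.default_rng(0)
    p = np.array([0.1, 0.6, 0.3])   # target model's next-token distribution
    q = np.array([0.3, 0.3, 0.4])   # fast draft model's distribution
    token = rng.choice(3, p=q)      # draft proposal
    print(speculative_accept(p, q, token, rng))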

JMLR Journal 2025 Journal Article

Composite Goodness-of-fit Tests with Kernels

  • Oscar Key
  • Arthur Gretton
  • François-Xavier Briol
  • Tamara Fernandez

We propose kernel-based hypothesis tests for the challenging composite testing problem, where we are interested in whether the data comes from any distribution in some parametric family. Our tests make use of minimum distance estimators based on kernel-based distances such as the maximum mean discrepancy. As our main result, we show that we are able to estimate the parameter and conduct our test on the same data (without data splitting), while maintaining a correct test level. We also prove that the popular wild bootstrap will lead to an overly conservative test, and show that the parametric bootstrap is consistent and can lead to significantly improved performance in practice. Our approach is illustrated on a range of problems, including testing for goodness-of-fit of a non-parametric density model, and an intractable generative model of a biological cellular network.

ICLR Conference 2025 Conference Paper

Deep MMD Gradient Flow without adversarial training

  • Alexandre Galashov
  • Valentin De Bortoli
  • Arthur Gretton

We propose a gradient flow procedure for generative modeling by transporting particles from an initial source distribution to a target distribution, where the gradient field on the particles is given by a noise-adaptive Wasserstein Gradient of the Maximum Mean Discrepancy (MMD). The noise adaptive MMD is trained on data distributions corrupted by increasing levels of noise, obtained via a forward diffusion process, as commonly used in denoising diffusion probabilistic models. The result is a generalization of MMD Gradient Flow, which we call Diffusion-MMD-Gradient Flow or DMMD. The divergence training procedure is related to discriminator training in Generative Adversarial Networks (GAN), but does not require adversarial training. We obtain competitive empirical performance in unconditional image generation on CIFAR10, MNIST, CELEB-A (64 x 64) and LSUN Church (64 x 64). Furthermore, we demonstrate the validity of the approach when MMD is replaced by a lower bound on the KL divergence.

NeurIPS Conference 2025 Conference Paper

Demystifying Spectral Feature Learning for Instrumental Variable Regression

  • Dimitri Meunier
  • Antoine Moulin
  • Jakub Wornbard
  • Vladimir Kostic
  • Arthur Gretton

We address the problem of causal effect estimation in the presence of hidden confounders, using nonparametric instrumental variable (IV) regression. A leading strategy employs \emph{spectral features} - that is, learned features spanning the top eigensubspaces of the operator linking treatments to instruments. We derive a generalization error bound for a two-stage least squares estimator based on spectral features, and gain insights into the method's performance and failure modes. We show that performance depends on two key factors, leading to a clear taxonomy of outcomes. In a \emph{good} scenario, the approach is optimal. This occurs with strong \emph{spectral alignment}, meaning the structural function is well-represented by the top eigenfunctions of the conditional operator, coupled with this operator's slow eigenvalue decay, indicating a strong instrument. Performance degrades in a \emph{bad} scenario: spectral alignment remains strong, but rapid eigenvalue decay (indicating a weaker instrument) demands significantly more samples for effective feature learning. Finally, in the \emph{ugly} scenario, weak spectral alignment causes the method to fail, regardless of the eigenvalues' characteristics. Our synthetic experiments empirically validate this taxonomy. We further introduce a practical procedure to estimate these spectral properties from data, allowing practitioners to diagnose which regime a given problem falls into. We apply this method to the dSprites dataset, demonstrating its utility.

NeurIPS Conference 2025 Conference Paper

Density Ratio-Free Doubly Robust Proxy Causal Learning

  • Bariscan Bozkurt
  • Houssam Zenati
  • Dimitri Meunier
  • Liyuan Xu
  • Arthur Gretton

We study the problem of causal function estimation in the Proxy Causal Learning (PCL) framework, where confounders are not observed but proxies for the confounders are available. Two main approaches have been proposed: outcome bridge-based and treatment bridge-based methods. In this work, we propose two kernel-based doubly robust estimators that combine the strengths of both approaches, and naturally handle continuous and high-dimensional variables. Our identification strategy builds on a recent density ratio-free method for treatment bridge-based PCL; furthermore, in contrast to previous approaches, it does not require indicator functions or kernel smoothing over the treatment variable. These properties make it especially well-suited for continuous or high-dimensional treatments. By using kernel mean embeddings, we propose the first density-ratio free doubly robust estimators for proxy causal learning, which have closed form solutions and strong uniform consistency guarantees. Our estimators outperform existing methods on PCL benchmarks, including a prior doubly robust method that requires both kernel smoothing and density ratio estimation.

ICML Conference 2025 Conference Paper

Distributional Diffusion Models with Scoring Rules

  • Valentin De Bortoli
  • Alexandre Galashov
  • J. Swaroop Guntupalli
  • Guangyao Zhou
  • Kevin Murphy 0002
  • Arthur Gretton
  • Arnaud Doucet

Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively "denoises" a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to speed up sample generation by learning the posterior distribution of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.
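
To make the scoring-rule idea concrete, here is a hedged sketch of an energy-score loss for a stochastic denoiser that produces several candidate clean samples per noisy input; the particular rule, weighting and sampling scheme used in the paper may differ, and the function name and eps smoothing are assumptions.

    import numpy as np

    def energy_score(samples, target, eps=1e-12):
        """Energy score for one example: E||X - y|| - 0.5 * E||X - X'||, where X, X'
        are independent model samples and y is the true clean sample. Lower is better;
        the energy score is a strictly proper scoring rule."""
        m = samples.shape[0]
        term1 = np.linalg.norm(samples - target, axis=-1).mean()   # confinement term
        diffs = samples[:, None, :] - samples[None, :, :]
        pair_norms = np.sqrt((diffs ** 2).sum(-1) + eps)           # eps: numerical smoothing
        term2 = pair_norms.sum() / (m * (m - 1))                   # off-diagonal pairwise mean
        return term1 - 0.5 * term2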

NeurIPS Conference 2025 Conference Paper

Doubly-Robust Estimation of Counterfactual Policy Mean Embeddings

  • Houssam Zenati
  • Bariscan Bozkurt
  • Arthur Gretton

Estimating the distribution of outcomes under counterfactual policies is critical for decision-making in domains such as recommendation, advertising, and healthcare. We propose and analyze a novel framework—Counterfactual Policy Mean Embedding (CPME)—that represents the entire counterfactual outcome distribution in a reproducing kernel Hilbert space (RKHS), enabling flexible and nonparametric distributional off-policy evaluation. We introduce both a plug-in estimator and a doubly robust estimator; the latter enjoys improved convergence rates by correcting for bias in both the outcome embedding and propensity models. Building on this, we develop a doubly robust kernel test statistic for hypothesis testing, which achieves asymptotic normality and thus enables computationally efficient testing and straightforward construction of confidence intervals. Our framework also supports sampling from the counterfactual distribution. Numerical simulations illustrate the practical benefits of CPME over existing methods.

ICML Conference 2025 Conference Paper

Learning-Order Autoregressive Models with Application to Molecular Graph Generation

  • Zhe Wang 0055
  • Jiaxin Shi
  • Nicolas Heess
  • Arthur Gretton
  • Michalis K. Titsias

Autoregressive models (ARMs) have become the workhorse for sequence generation tasks, since many problems can be modeled as next-token prediction. While there appears to be a natural ordering for text (i.e., left-to-right), for many data types, such as graphs, the canonical ordering is less obvious. To address this problem, we introduce a variant of ARM that generates high-dimensional data using a probabilistic ordering that is sequentially inferred from data. This model incorporates a trainable probability distribution, referred to as an order-policy, that dynamically decides the autoregressive order in a state-dependent manner. To train the model, we introduce a variational lower bound on the exact log-likelihood, which we optimize with stochastic gradient estimation. We demonstrate experimentally that our method can learn meaningful autoregressive orderings in image and graph generation. On the challenging domain of molecular graph generation, we achieve state-of-the-art results on the QM9 and ZINC250k benchmarks, evaluated using the Fréchet ChemNet Distance (FCD), Synthetic Accessibility Score (SAS), and Quantitative Estimate of Drug-likeness (QED).

NeurIPS Conference 2025 Conference Paper

On the Hardness of Conditional Independence Testing In Practice

  • Zheng He
  • Roman Pogodin
  • Yazhe Li
  • Namrata Deka
  • Arthur Gretton
  • Danica J. Sutherland

Tests of conditional independence (CI) underpin a number of important problems in machine learning and statistics, from causal discovery to evaluation of predictor fairness and out-of-distribution robustness. Shah and Peters (2020) showed that, contrary to the unconditional case, no universally finite-sample valid test can ever achieve nontrivial power. While informative, this result (based on “hiding” dependence) does not seem to explain the frequent practical failures observed with popular CI tests. We investigate the Kernel-based Conditional Independence (KCI) test – of which we show the Generalized Covariance Measure underlying many recent tests is nearly a special case – and identify the major factors underlying its practical behavior. We highlight the key role of errors in the conditional mean embedding estimate for the Type I error, while pointing out the importance of selecting an appropriate conditioning kernel (not recognized in previous work) as being necessary for good test power but also tending to inflate Type I error.

ICLR Conference 2025 Conference Paper

Optimality and Adaptivity of Deep Neural Features for Instrumental Variable Regression

  • Juno Kim
  • Dimitri Meunier
  • Arthur Gretton
  • Taiji Suzuki
  • Zhu Li

We provide a convergence analysis of \emph{deep feature instrumental variable} (DFIV) regression (Xu et al., 2021), a nonparametric approach to IV regression using data-adaptive features learned by deep neural networks in two stages. We prove that the DFIV algorithm achieves the minimax optimal learning rate when the target structural function lies in a Besov space. This is shown under standard nonparametric IV assumptions, and an additional smoothness assumption on the regularity of the conditional distribution of the covariate given the instrument, which controls the difficulty of Stage 1. We further demonstrate that DFIV, as a data-adaptive algorithm, is superior to fixed-feature (kernel or sieve) IV methods in two ways. First, when the target function possesses low spatial homogeneity (i.e., it has both smooth and spiky/discontinuous regions), DFIV still achieves the optimal rate, while fixed-feature methods are shown to be strictly suboptimal. Second, comparing with kernel-based two-stage regression estimators, DFIV is provably more data efficient in the Stage 1 samples.

NeurIPS Conference 2025 Conference Paper

Regularized least squares learning with heavy-tailed noise is minimax optimal

  • Mattes Mollenhauer
  • Nicole Muecke
  • Dimitri Meunier
  • Arthur Gretton

This paper examines the performance of ridge regression in reproducing kernel Hilbert spaces in the presence of noise that exhibits a finite number of higher moments. We establish excess risk bounds consisting of subgaussian and polynomial terms based on the well known integral operator framework. The dominant subgaussian component allows us to achieve convergence rates that have previously only been derived under subexponential noise—a prevalent assumption in related work from the last two decades. These rates are optimal under standard eigenvalue decay conditions, demonstrating the asymptotic robustness of regularized least squares against heavy-tailed noise. Our derivations are based on a Fuk–Nagaev inequality for Hilbert-space valued random variables.
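
For reference, a minimal sketch of the kernel ridge regression estimator whose excess risk is analysed here; the Gaussian kernel, bandwidth and regularisation strength are illustrative choices.

    import numpy as np

    def gaussian_gram(a, b, sigma=1.0):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def krr_fit_predict(x_train, y_train, x_test, lam=1e-2, sigma=1.0):
        """Ridge regression in an RKHS: f(x) = k(x, X) (K + n*lam*I)^{-1} y."""
        n = len(x_train)
        K = gaussian_gram(x_train, x_train, sigma)
        alpha = np.linalg.solve(K + n * lam * np.eye(n), y_train)
        return gaussian_gram(x_test, x_train, sigma) @ alpha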

EWRL Workshop 2024 Workshop Paper

A Distributional Analogue to the Successor Representation

  • Harley Wiltzer
  • Jesse Farebrother
  • Arthur Gretton
  • Yunhao Tang
  • Andre Barreto
  • Will Dabney
  • Marc G Bellemare
  • Mark Rowland

This paper contributes a new approach for distributional reinforcement learning which elucidates a clean separation of transition structure and reward in the learning process. Analogous to how the successor representation (SR) describes the expected consequences of behaving according to a given policy, our distributional successor measure (SM) describes the distributional consequences of this behaviour. We formulate the distributional SM as a distribution over distributions and provide theory connecting it with distributional and model-based reinforcement learning. Moreover, we propose an algorithm that learns the distributional SM from data by minimizing a two-level maximum mean discrepancy. Key to our method are a number of algorithmic techniques that are independently valuable for learning generative models of state. As an illustration of the usefulness of the distributional SM, we show that it enables zero-shot risk-sensitive policy evaluation in a way that was not previously possible.

ICML Conference 2024 Conference Paper

A Distributional Analogue to the Successor Representation

  • Harley Wiltzer
  • Jesse Farebrother
  • Arthur Gretton
  • Yunhao Tang
  • André Barreto 0001
  • Will Dabney
  • Marc G. Bellemare
  • Mark Rowland 0001

This paper contributes a new approach for distributional reinforcement learning which elucidates a clean separation of transition structure and reward in the learning process. Analogous to how the successor representation (SR) describes the expected consequences of behaving according to a given policy, our distributional successor measure (SM) describes the distributional consequences of this behaviour. We formulate the distributional SM as a distribution over distributions and provide theory connecting it with distributional and model-based reinforcement learning. Moreover, we propose an algorithm that learns the distributional SM from data by minimizing a two-level maximum mean discrepancy. Key to our method are a number of algorithmic techniques that are independently valuable for learning generative models of state. As an illustration of the usefulness of the distributional SM, we show that it enables zero-shot risk-sensitive policy evaluation in a way that was not previously possible.

UAI Conference 2024 Conference Paper

Conditional Bayesian Quadrature

  • Zonghao Chen
  • Masha Naslidnyk
  • Arthur Gretton
  • François-Xavier Briol

We propose a novel approach for estimating conditional or parametric expectations in the setting where obtaining samples or evaluating integrands is costly. Through the framework of probabilistic numerical methods (such as Bayesian quadrature), our approach allows us to incorporate prior information about the integrands, in particular prior knowledge about the smoothness of the integrands and of the conditional expectation. As a result, our approach provides a way of quantifying uncertainty and leads to a fast convergence rate, which is confirmed both theoretically and empirically on challenging tasks in Bayesian sensitivity analysis, computational finance and decision making under uncertainty.

ICML Conference 2024 Conference Paper

Distributional Bellman Operators over Mean Embeddings

  • Li Kevin Wenliang
  • Grégoire Delétang
  • Matthew Aitchison
  • Marcus Hutter
  • Anian Ruoss
  • Arthur Gretton
  • Mark Rowland 0001

We propose a novel algorithmic framework for distributional reinforcement learning, based on learning finite-dimensional mean embeddings of return distributions. The framework reveals a wide variety of new algorithms for dynamic programming and temporal-difference learning that rely on the sketch Bellman operator, which updates mean embeddings with simple linear-algebraic computations. We provide asymptotic convergence theory, and examine the empirical performance of the algorithms on a suite of tabular tasks. Further, we show that this approach can be straightforwardly combined with deep reinforcement learning.

NeurIPS Conference 2024 Conference Paper

Foundations of Multivariate Distributional Reinforcement Learning

  • Harley Wiltzer
  • Jesse Farebrother
  • Arthur Gretton
  • Mark Rowland

In reinforcement learning (RL), the consideration of multivariate reward signals has led to fundamental advancements in multi-objective decision-making, transfer learning, and representation learning. This work introduces the first oracle-free and computationally-tractable algorithms for provably convergent multivariate *distributional* dynamic programming and temporal difference learning. Our convergence rates match the familiar rates in the scalar reward setting, and additionally provide new insights into the fidelity of approximate return distribution representations as a function of the reward dimension. Surprisingly, when the reward dimension is larger than $1$, we show that standard analysis of categorical TD learning fails, which we resolve with a novel projection onto the space of mass-$1$ signed measures. Finally, with the aid of our technical results and simulations, we identify tradeoffs between distribution representations that influence the performance of multivariate distributional RL in practice.

NeurIPS Conference 2024 Conference Paper

Mind the Graph When Balancing Data for Fairness or Robustness

  • Jessica Schrouff
  • Alexis Bellot
  • Amal Rannen-Triki
  • Alan Malek
  • Isabela Albuquerque
  • Arthur Gretton
  • Alexander D'Amour
  • Silvia Chiappa

Failures of fairness or robustness in machine learning predictive settings can be due to undesired dependencies between covariates, outcomes and auxiliary factors of variation. A common strategy to mitigate these failures is data balancing, which attempts to remove those undesired dependencies. In this work, we define conditions on the training distribution for data balancing to lead to fair or robust models. Our results display that in many cases, the balanced distribution does not correspond to selectively removing the undesired dependencies in a causal graph of the task, leading to multiple failure modes and even interference with other mitigation techniques such as regularization. Overall, our results highlight the importance of taking the causal graph into account before performing data balancing.

NeurIPS Conference 2024 Conference Paper

Near-Optimality of Contrastive Divergence Algorithms

  • Pierre Glaser
  • Kevin Han Huang
  • Arthur Gretton

We provide a non-asymptotic analysis of the contrastive divergence (CD) algorithm, a training method for unnormalized models. While prior work has established that (for exponential family distributions) the CD iterates asymptotically converge at an $O(n^{-1 / 3})$ rate to the true parameter of the data distribution, we show that CD can achieve the parametric rate $O(n^{-1 / 2})$. Our analysis provides results for various data batching schemes, including fully online and minibatch. We additionally show that CD is near-optimal, in the sense that its asymptotic variance is close to the Cramér-Rao lower bound.
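
For context, a hedged sketch of contrastive divergence with k sampling steps (CD-k) for an unnormalised exponential-family model p_theta(x) ∝ h(x) exp(<theta, T(x)>); the unadjusted Langevin negative phase, step sizes and the toy Gaussian example are illustrative assumptions rather than the exact scheme analysed in the paper.

    import numpy as np

    def cd_k_gradient(theta, data, suff_stat, grad_log_p, k=5, mcmc_step=0.05, rng=None):
        """Approximate the log-likelihood gradient E_data[T(x)] - E_model[T(x)],
        replacing the model expectation with k unadjusted Langevin steps started at the data."""
        rng = rng or np.random.default_rng()
        x = data.copy()
        for _ in range(k):
            x = x + 0.5 * mcmc_step * grad_log_p(theta, x) \
                  + np.sqrt(mcmc_step) * rng.standard_normal(x.shape)
        return suff_stat(data).mean(axis=0) - suff_stat(x).mean(axis=0)

    # Toy example: unit-variance Gaussian with unknown mean,
    # p_theta(x) ∝ exp(theta·x - ||x||^2 / 2), so T(x) = x and grad_x log p = theta - x.
    theta = np.zeros(2)
    data = np.random.default_rng(0).normal(1.0, 1.0, size=(1000, 2))
    for _ in range(200):
        theta = theta + 0.1 * cd_k_gradient(theta, data, lambda z: z, lambda t, z: t - z)
    print(theta)   # drifts towards the data mean (~[1, 1])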

NeurIPS Conference 2024 Conference Paper

Optimal Rates for Vector-Valued Spectral Regularization Learning Algorithms

  • Dimitri Meunier
  • Zikai Shen
  • Mattes Mollenhauer
  • Arthur Gretton
  • Zhu Li

We study theoretical properties of a broad class of regularized algorithms with vector-valued output. These spectral algorithms include kernel ridge regression, kernel principal component regression and various implementations of gradient descent. Our contributions are twofold. First, we rigorously confirm the so-called saturation effect for ridge regression with vector-valued output by deriving a novel lower bound on learning rates; this bound is shown to be suboptimal when the smoothness of the regression function exceeds a certain level. Second, we present an upper bound on the finite sample risk for general vector-valued spectral algorithms, applicable to both well-specified and misspecified scenarios (where the true regression function lies outside of the hypothesis space), and show that this bound is minimax optimal in various regimes. All of our results explicitly allow the case of infinite-dimensional output variables, proving consistency of recent practical applications.

JMLR Journal 2024 Journal Article

Towards Optimal Sobolev Norm Rates for the Vector-Valued Regularized Least-Squares Algorithm

  • Zhu Li
  • Dimitri Meunier
  • Mattes Mollenhauer
  • Arthur Gretton

We present the first optimal rates for infinite-dimensional vector-valued ridge regression on a continuous scale of norms that interpolate between L2 and the hypothesis space, which we consider as a vector-valued reproducing kernel Hilbert space. These rates allow us to treat the misspecified case in which the true regression function is not contained in the hypothesis space. We combine standard assumptions on the capacity of the hypothesis space with a novel tensor product construction of vector-valued interpolation spaces in order to characterize the smoothness of the regression function. Our upper bound not only attains the same rate as real-valued kernel ridge regression, but also removes the assumption that the target regression function is bounded. For the lower bound, we reduce the problem to the scalar setting using a projection argument. We show that these rates are optimal in most cases and independent of the dimension of the output space. We illustrate our results for the special case of vector-valued Sobolev spaces.

ICML Conference 2023 Conference Paper

A Kernel Stein Test of Goodness of Fit for Sequential Models

  • Jerome Baum
  • Heishiro Kanagawa
  • Arthur Gretton

We propose a goodness-of-fit measure for probability densities modeling observations with varying dimensionality, such as text documents of differing lengths or variable-length sequences. The proposed measure is an instance of the kernel Stein discrepancy (KSD), which has been used to construct goodness-of-fit tests for unnormalized densities. The KSD is defined by its Stein operator: current operators used in testing apply to fixed-dimensional spaces. As our main contribution, we extend the KSD to the variable-dimension setting by identifying appropriate Stein operators, and propose a novel KSD goodness-of-fit test. As with the previous variants, the proposed KSD does not require the density to be normalized, allowing the evaluation of a large class of models. Our test is shown to perform well in practice on discrete sequential data benchmarks.

ICLR Conference 2023 Conference Paper

A Neural Mean Embedding Approach for Back-door and Front-door Adjustment

  • Liyuan Xu
  • Arthur Gretton

We consider the estimation of average and counterfactual treatment effects, under two settings: back-door adjustment and front-door adjustment. The goal in both cases is to recover the treatment effect without having access to a hidden confounder. This objective is attained by first estimating the conditional mean of the desired outcome variable given relevant covariates (the "first stage" regression), and then taking the (conditional) expectation of this function as a "second stage" procedure. We propose to compute these conditional expectations directly using a regression function to the learned input features of the first stage, thus avoiding the need for sampling or density estimation. All functions and features (and in particular, the output features in the second stage) are neural networks learned adaptively from data, with the sole requirement that the final layer of the first stage should be linear. The proposed method is shown to converge to the true causal parameter, and outperforms the recent state-of-the-art methods on challenging causal benchmarks, including settings involving high-dimensional image data.

ICLR Conference 2023 Conference Paper

Efficient Conditionally Invariant Representation Learning

  • Roman Pogodin
  • Namrata Deka
  • Yazhe Li
  • Danica J. Sutherland
  • Victor Veitch
  • Arthur Gretton

We introduce the Conditional Independence Regression CovariancE (CIRCE), a measure of conditional independence for multivariate continuous-valued variables. CIRCE applies as a regularizer in settings where we wish to learn neural features $\varphi(X)$ of data $X$ to estimate a target $Y$, while being conditionally independent of a distractor $Z$ given $Y$. Both $Z$ and $Y$ are assumed to be continuous-valued but relatively low dimensional, whereas $X$ and its features may be complex and high dimensional. Relevant settings include domain-invariant learning, fairness, and causal learning. The procedure requires just a single ridge regression from $Y$ to kernelized features of $Z$, which can be done in advance. It is then only necessary to enforce independence of $\varphi(X)$ from residuals of this regression, which is possible with attractive estimation properties and consistency guarantees. By contrast, earlier measures of conditional feature dependence require multiple regressions for each step of feature learning, resulting in more severe bias and variance, and greater computational cost. When sufficiently rich features are used, we establish that CIRCE is zero if and only if $\varphi(X) \perp \!\!\! \perp Z \mid Y$. In experiments, we show superior performance to previous methods on challenging benchmarks, including learning conditionally invariant image features. Code for image data experiments is available at github.com/namratadeka/circe.
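
To make the two-step structure concrete, the sketch below is a simplified, hedged analogue of the CIRCE recipe: ridge-regress kernel features of Z on features of Y (which can be precomputed), then penalise the empirical cross-covariance between the learned features of X and the regression residuals. The feature maps, ridge parameter and plain Frobenius-norm penalty are illustrative simplifications, not the paper's exact estimator.

    import numpy as np

    def circe_style_penalty(phi_x, psi_z, feat_y, lam=1e-2):
        """phi_x: (n, p) learned features of X; psi_z: (n, q) kernel features of Z;
        feat_y: (n, r) features of Y (e.g. random Fourier features)."""
        n = len(phi_x)
        # ridge regression from features of Y to kernel features of Z (precomputable)
        W = np.linalg.solve(feat_y.T @ feat_y + lam * np.eye(feat_y.shape[1]),
                            feat_y.T @ psi_z)
        resid = psi_z - feat_y @ W
        # penalise feature-level dependence via the empirical cross-covariance norm
        phi_c = phi_x - phi_x.mean(axis=0)
        cov = phi_c.T @ resid / n
        return (cov ** 2).sum()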

UAI Conference 2023 Conference Paper

Fast and scalable score-based kernel calibration tests

  • Pierre Glaser
  • David Widmann
  • Fredrik Lindsten
  • Arthur Gretton

We introduce the Kernel Calibration Conditional Stein Discrepancy test (KCCSD test), a nonparametric, kernel-based test for assessing the calibration of probabilistic models with well-defined scores. In contrast to previous methods, our test avoids the need for possibly expensive expectation approximations while providing control over its type-I error. We achieve these improvements by using a new family of kernels for score-based probabilities that can be estimated without probability density samples, and by using a Conditional Goodness of Fit criterion for the KCCSD test’s U-statistic. We demonstrate the properties of our test on various synthetic settings.

JMLR Journal 2023 Journal Article

MMD Aggregated Two-Sample Test

  • Antonin Schrab
  • Ilmun Kim
  • Mélisande Albert
  • Béatrice Laurent
  • Benjamin Guedj
  • Arthur Gretton

We propose two novel nonparametric two-sample kernel tests based on the Maximum Mean Discrepancy (MMD). First, for a fixed kernel, we construct an MMD test using either permutations or a wild bootstrap, two popular numerical procedures to determine the test threshold. We prove that this test controls the probability of type I error non-asymptotically. Hence, it can be used reliably even in settings with small sample sizes as it remains well-calibrated, which differs from previous MMD tests which only guarantee correct test level asymptotically. When the difference in densities lies in a Sobolev ball, we prove minimax optimality of our MMD test with a specific kernel depending on the smoothness parameter of the Sobolev ball. In practice, this parameter is unknown and, hence, the optimal MMD test with this particular kernel cannot be used. To overcome this issue, we construct an aggregated test, called MMDAgg, which is adaptive to the smoothness parameter. The test power is maximised over the collection of kernels used, without requiring held-out data for kernel selection (which results in a loss of test power), or arbitrary kernel choices such as the median heuristic. We prove that MMDAgg still controls the level non-asymptotically, and achieves the minimax rate over Sobolev balls, up to an iterated logarithmic term. Our guarantees are not restricted to a specific type of kernel, but hold for any product of one-dimensional translation invariant characteristic kernels. We provide a user-friendly parameter-free implementation of MMDAgg using an adaptive collection of bandwidths. We demonstrate that MMDAgg significantly outperforms alternative state-of-the-art MMD-based two-sample tests on synthetic data satisfying the Sobolev smoothness assumption, and that, on real-world image data, MMDAgg closely matches the power of tests leveraging the use of models such as neural networks.
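
For context, a minimal sketch of the single fixed-kernel MMD permutation test that MMDAgg aggregates over a collection of bandwidths; the Gaussian kernel, bandwidth, permutation count and level are illustrative, and the Gram matrix is recomputed per permutation purely for clarity.

    import numpy as np

    def mmd2_biased(x, y, sigma):
        """Biased quadratic-time estimate of MMD^2 with a Gaussian kernel."""
        z = np.concatenate([x, y])
        d2 = ((z[:, None, :] - z[None, :, :]) ** 2).sum(-1)
        K = np.exp(-d2 / (2 * sigma ** 2))
        n = len(x)
        return K[:n, :n].mean() + K[n:, n:].mean() - 2 * K[:n, n:].mean()

    def mmd_permutation_test(x, y, sigma=1.0, n_perm=500, level=0.05, seed=0):
        rng = np.random.default_rng(seed)
        z, n = np.concatenate([x, y]), len(x)
        stat = mmd2_biased(x, y, sigma)
        null = []
        for _ in range(n_perm):
            perm = rng.permutation(len(z))
            null.append(mmd2_biased(z[perm[:n]], z[perm[n:]], sigma))
        p_value = (1 + np.sum(np.array(null) >= stat)) / (1 + n_perm)
        return stat, p_value, p_value <= level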

NeurIPS Conference 2023 Conference Paper

MMD-Fuse: Learning and Combining Kernels for Two-Sample Testing Without Data Splitting

  • Felix Biggs
  • Antonin Schrab
  • Arthur Gretton

We propose novel statistics which maximise the power of a two-sample test based on the Maximum Mean Discrepancy (MMD), by adapting over the set of kernels used in defining it. For finite sets, this reduces to combining (normalised) MMD values under each of these kernels via a weighted soft maximum. Exponential concentration bounds are proved for our proposed statistics under the null and alternative. We further show how these kernels can be chosen in a data-dependent but permutation-independent way, in a well-calibrated test, avoiding data splitting. This technique applies more broadly to general permutation-based MMD testing, and includes the use of deep kernels with features learnt using unsupervised models such as auto-encoders. We highlight the applicability of our MMD-Fuse tests on both synthetic low-dimensional and real-world high-dimensional data, and compare its performance in terms of power against current state-of-the-art kernel tests.

UAI Conference 2022 Conference Paper

Causal inference with treatment measurement error: a nonparametric instrumental variable approach

  • Yuchen Zhu
  • Limor Gultchin
  • Arthur Gretton
  • Matt J. Kusner
  • Ricardo Silva 0001

We propose a kernel-based nonparametric estimator for the causal effect when the cause is corrupted by error. We do so by generalizing estimation in the instrumental variable setting. Despite significant work on regression with measurement error, additionally handling unobserved confounding in the continuous setting is non-trivial: we have seen little prior work. As a by-product of our investigation, we clarify a connection between mean embeddings and characteristic functions, and how learning one simultaneously allows one to learn the other. This opens the way for kernel method research to leverage existing results in characteristic function estimation. Finally, we empirically show that our proposed method, MEKIV, improves over baselines and is robust under changes in the strength of measurement error and to the type of error distributions.

NeurIPS Conference 2022 Conference Paper

Efficient Aggregated Kernel Tests using Incomplete $U$-statistics

  • Antonin Schrab
  • Ilmun Kim
  • Benjamin Guedj
  • Arthur Gretton

We propose a series of computationally efficient, nonparametric tests for the two-sample, independence and goodness-of-fit problems, using the Maximum Mean Discrepancy (MMD), Hilbert Schmidt Independence Criterion (HSIC), and Kernel Stein Discrepancy (KSD), respectively. Our test statistics are incomplete $U$-statistics, with a computational cost that interpolates between linear time in the number of samples, and quadratic time, as associated with classical $U$-statistic tests. The three proposed tests aggregate over several kernel bandwidths to detect departures from the null on various scales: we call the resulting tests MMDAggInc, HSICAggInc and KSDAggInc. This procedure provides a solution to the fundamental kernel selection problem as we can aggregate a large number of kernels with several bandwidths without incurring a significant loss of test power. For the test thresholds, we derive a quantile bound for wild bootstrapped incomplete $U$-statistics, which is of independent interest. We derive non-asymptotic uniform separation rates for MMDAggInc and HSICAggInc, and quantify exactly the trade-off between computational efficiency and the attainable rates: this result is novel for tests based on incomplete $U$-statistics, to our knowledge. We further show that in the quadratic-time case, the wild bootstrap incurs no penalty to test power over more widespread permutation-based approaches, since both attain the same minimax optimal rates (which in turn match the rates that use oracle quantiles). We support our claims with numerical experiments on the trade-off between computational efficiency and test power. In all three testing frameworks, our proposed linear-time tests outperform the current linear-time state-of-the-art tests (or at least match their test power).

ICML Conference 2022 Conference Paper

Importance Weighted Kernel Bayes' Rule

  • Liyuan Xu
  • Yutian Chen
  • Arnaud Doucet
  • Arthur Gretton

We study a nonparametric approach to Bayesian computation via feature means, where the expectation of prior features is updated to yield expected posterior features, based on regression from kernel or neural net features of the observations. All quantities involved in the Bayesian update are learned from observed data, making the method entirely model-free. The resulting algorithm is a novel instance of a kernel Bayes’ rule (KBR). Our approach is based on importance weighting, which results in superior numerical stability to the existing approach to KBR, which requires operator inversion. We show the convergence of the estimator using a novel consistency analysis on the importance weighting estimator in the infinity norm. We evaluate our KBR on challenging synthetic benchmarks, including a filtering problem with a state-space model involving high dimensional image observations. The proposed method yields uniformly better empirical performance than the existing KBR, and competitive performance with other competing methods.

NeurIPS Conference 2022 Conference Paper

KSD Aggregated Goodness-of-fit Test

  • Antonin Schrab
  • Benjamin Guedj
  • Arthur Gretton

We investigate properties of goodness-of-fit tests based on the Kernel Stein Discrepancy (KSD). We introduce a strategy to construct a test, called KSDAgg, which aggregates multiple tests with different kernels. KSDAgg avoids splitting the data to perform kernel selection (which leads to a loss in test power), and rather maximises the test power over a collection of kernels. We provide theoretical guarantees on the power of KSDAgg: we show it achieves the smallest uniform separation rate of the collection, up to a logarithmic term. For compactly supported densities with bounded score function for the model, we derive the rate for KSDAgg over restricted Sobolev balls; this rate corresponds to the minimax optimal rate over unrestricted Sobolev balls, up to an iterated logarithmic term. KSDAgg can be computed exactly in practice as it relies either on a parametric bootstrap or on a wild bootstrap to estimate the quantiles and the level corrections. In particular, for the crucial choice of bandwidth of a fixed kernel, it avoids resorting to arbitrary heuristics (such as median or standard deviation) or to data splitting. We find on both synthetic and real-world data that KSDAgg outperforms other state-of-the-art quadratic-time adaptive KSD-based goodness-of-fit testing procedures.

JMLR Journal 2022 Journal Article

On Instrumental Variable Regression for Deep Offline Policy Evaluation

  • Yutian Chen
  • Liyuan Xu
  • Caglar Gulcehre
  • Tom Le Paine
  • Arthur Gretton
  • Nando de Freitas
  • Arnaud Doucet

We show that the popular reinforcement learning (RL) strategy of estimating the state-action value (Q-function) by minimizing the mean squared Bellman error leads to a regression problem with confounding, the inputs and output noise being correlated. Hence, direct minimization of the Bellman error can result in significantly biased Q-function estimates. We explain why fixing the target Q-network in Deep Q-Networks and Fitted Q Evaluation provides a way of overcoming this confounding, thus shedding new light on this popular but not well understood trick in the deep RL literature. An alternative approach to address confounding is to leverage techniques developed in the causality literature, notably instrumental variables (IV). We bring together here the literature on IV and RL by investigating whether IV approaches can lead to improved Q-function estimates. This paper analyzes and compares a wide range of recent IV methods in the context of offline policy evaluation (OPE), where the goal is to estimate the value of a policy using logged data only. By applying different IV techniques to OPE, we are not only able to recover previously proposed OPE methods such as model-based techniques but also to obtain competitive new techniques. We find empirically that state-of-the-art OPE methods are closely matched in performance by some IV methods such as AGMM, which were not developed for OPE. We open-source all our code and datasets at https://github.com/liyuan9988/IVOPEwithACME.

NeurIPS Conference 2022 Conference Paper

Optimal Rates for Regularized Conditional Mean Embedding Learning

  • Zhu Li
  • Dimitri Meunier
  • Mattes Mollenhauer
  • Arthur Gretton

We address the consistency of a kernel ridge regression estimate of the conditional mean embedding (CME), which is an embedding of the conditional distribution of $Y$ given $X$ into a target reproducing kernel Hilbert space $\mathcal{H}_Y$. The CME allows us to take conditional expectations of target RKHS functions, and has been employed in nonparametric causal and Bayesian inference. We address the misspecified setting, where the target CME is in the space of Hilbert-Schmidt operators acting from an input interpolation space between $\mathcal{H}_X$ and $L_2$, to $\mathcal{H}_Y$. This space of operators is shown to be isomorphic to a newly defined vector-valued interpolation space. Using this isomorphism, we derive a novel and adaptive statistical learning rate for the empirical CME estimator under the misspecified setting. Our analysis reveals that our rates match the optimal $O(\log n / n)$ rates without assuming $\mathcal{H}_Y$ to be finite dimensional. We further establish a lower bound on the learning rate, which shows that the obtained upper bound is optimal.
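
For reference, the kernel ridge estimate of the conditional mean embedding analysed here can be written as follows (a sketch in common notation, with $k_X$ the input kernel, $\phi_Y$ the output feature map and $\lambda$ the ridge parameter):

$$\hat{\mu}_{Y \mid X = x} = \sum_{i=1}^{n} \beta_i(x)\, \phi_Y(y_i), \qquad \beta(x) = (K_X + n\lambda I)^{-1} k_X(x),$$

where $K_X = \big(k_X(x_i, x_j)\big)_{i,j=1}^{n}$ and $k_X(x) = \big(k_X(x_1, x), \dots, k_X(x_n, x)\big)^{\top}$.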

UAI Conference 2021 Conference Paper

A weaker faithfulness assumption based on triple interactions

  • Alexander Marx 0001
  • Arthur Gretton
  • Joris M. Mooij

One of the core assumptions in causal discovery is the faithfulness assumption—i.e., assuming that independencies found in the data are due to separations in the true causal graph. This assumption can, however, be violated in many ways, including XOR connections, deterministic functions or cancelling paths. In this work, we propose a weaker assumption that we call 2-adjacency faithfulness. In contrast to adjacency faithfulness, which assumes that there is no conditional independence between each pair of variables that are connected in the causal graph, we only require no conditional independence between a node and a subset of its Markov blanket that can contain up to two nodes. Equivalently, we adapt orientation faithfulness to this setting. We further propose a sound orientation rule for causal discovery that applies under weaker assumptions. As a proof of concept, we derive a modified Grow and Shrink algorithm that recovers the Markov blanket of a target node and prove its correctness under strictly weaker assumptions than the standard faithfulness assumption.

NeurIPS Conference 2021 Conference Paper

Deep Proxy Causal Learning and its Application to Confounded Bandit Policy Evaluation

  • Liyuan Xu
  • Heishiro Kanagawa
  • Arthur Gretton

Proxy causal learning (PCL) is a method for estimating the causal effect of treatments on outcomes in the presence of unobserved confounding, using proxies (structured side information) for the confounder. This is achieved via two-stage regression: in the first stage, we model relations among the treatment and proxies; in the second stage, we use this model to learn the effect of treatment on the outcome, given the context provided by the proxies. PCL guarantees recovery of the true causal effect, subject to identifiability conditions. We propose a novel method for PCL, the deep feature proxy variable method (DFPV), to address the case where the proxies, treatments, and outcomes are high-dimensional and have nonlinear complex relationships, as represented by deep neural network features. We show that DFPV outperforms recent state-of-the-art PCL methods on challenging synthetic benchmarks, including settings involving high dimensional image data. Furthermore, we show that PCL can be applied to off-policy evaluation for the confounded bandit problem, in which DFPV also exhibits competitive performance.

ICLR Conference 2021 Conference Paper

Efficient Wasserstein Natural Gradients for Reinforcement Learning

  • Ted Moskovitz
  • Michael Arbel
  • Ferenc Huszar
  • Arthur Gretton

A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient \emph{Wasserstein natural gradient} (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization. This method follows the recent theme in RL of including divergence penalties in the objective to establish trust regions. Experiments on challenging tasks demonstrate improvements in both computational cost and performance over advanced baselines.

ICLR Conference 2021 Conference Paper

Generalized Energy Based Models

  • Michael Arbel
  • Liang Zhou
  • Arthur Gretton

We introduce the Generalized Energy Based Model (GEBM) for generative modelling. These models combine two trained components: a base distribution (generally an implicit model), which can learn the support of data with low intrinsic dimension in a high dimensional space; and an energy function, to refine the probability mass on the learned support. Both the energy function and base jointly constitute the final model, unlike GANs, which retain only the base distribution (the "generator"). GEBMs are trained by alternating between learning the energy and the base. We show that both training stages are well-defined: the energy is learned by maximising a generalized likelihood, and the resulting energy-based loss provides informative gradients for learning the base. Samples from the posterior on the latent space of the trained model can be obtained via MCMC, thus finding regions in this space that produce better quality samples. Empirically, the GEBM samples on image-generation tasks are of much better quality than those from the learned generator alone, indicating that all else being equal, the GEBM will outperform a GAN of the same complexity. When using normalizing flows as base measures, GEBMs succeed on density modelling tasks returning comparable performance to direct maximum likelihood of the same networks.

NeurIPS Conference 2021 Conference Paper

KALE Flow: A Relaxed KL Gradient Flow for Probabilities with Disjoint Support

  • Pierre Glaser
  • Michael Arbel
  • Arthur Gretton

We study the gradient flow for a relaxed approximation to the Kullback-Leibler (KL) divergence between a moving source and a fixed target distribution. This approximation, termed the KALE (KL approximate lower-bound estimator), solves a regularized version of the Fenchel dual problem defining the KL over a restricted class of functions. When using a Reproducing Kernel Hilbert Space (RKHS) to define the function class, we show that the KALE continuously interpolates between the KL and the Maximum Mean Discrepancy (MMD). Like the MMD and other Integral Probability Metrics, the KALE remains well defined for mutually singular distributions. Nonetheless, the KALE inherits from the limiting KL a greater sensitivity to mismatch in the support of the distributions, compared with the MMD. These two properties make the KALE gradient flow particularly well suited when the target distribution is supported on a low-dimensional manifold. Under an assumption of sufficient smoothness of the trajectories, we show the global convergence of the KALE flow. We propose a particle implementation of the flow given initial samples from the source and the target distribution, which we use to empirically confirm the KALE's properties.

ICLR Conference 2021 Conference Paper

Learning Deep Features in Instrumental Variable Regression

  • Liyuan Xu
  • Yutian Chen 0001
  • Siddarth Srinivasan
  • Nando de Freitas
  • Arnaud Doucet
  • Arthur Gretton

Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables from observational data by using an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. We propose a novel method, deep feature instrumental variable regression (DFIV), to address the case where relations between instruments, treatments, and outcomes may be nonlinear. In this case, deep neural nets are trained to define informative nonlinear features on the instruments and treatments. We propose an alternating training regime for these features to ensure good end-to-end performance when composing stages 1 and 2, thus obtaining highly flexible feature maps in a computationally efficient manner. DFIV outperforms recent state-of-the-art methods on challenging IV benchmarks, including settings involving high dimensional image data. DFIV also exhibits competitive performance in off-policy policy evaluation for reinforcement learning, which can be understood as an IV regression task.
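
For background, a minimal sketch of classical linear two-stage least squares, the template that DFIV generalises by replacing the linear maps with adaptively learned neural features; nothing here reflects the paper's actual architecture or training procedure, and the function name and interface are illustrative.

    import numpy as np

    def two_stage_least_squares(instrument, treatment, outcome):
        """Stage 1: regress the treatment on the instrument; Stage 2: regress the
        outcome on the predicted treatment. Inputs are (n, d_*) design matrices
        (add an intercept column if needed)."""
        beta1, *_ = np.linalg.lstsq(instrument, treatment, rcond=None)   # stage 1
        treatment_hat = instrument @ beta1
        beta2, *_ = np.linalg.lstsq(treatment_hat, outcome, rcond=None)  # stage 2
        return beta2   # estimated causal effect of treatment on outcome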

ICML Conference 2021 Conference Paper

Proximal Causal Learning with Kernels: Two-Stage Estimation and Moment Restriction

  • Afsaneh Mastouri
  • Yuchen Zhu
  • Limor Gultchin
  • Anna Korba
  • Ricardo Silva 0001
  • Matt J. Kusner
  • Arthur Gretton
  • Krikamol Muandet

We address the problem of causal effect estimation in the presence of unobserved confounding, but where proxies for the latent confounder(s) are observed. We propose two kernel-based methods for nonlinear causal effect estimation in this setting: (a) a two-stage regression approach, and (b) a maximum moment restriction approach. We focus on the proximal causal learning setting, but our methods can be used to solve a wider class of inverse problems characterised by a Fredholm integral equation. In particular, we provide a unifying view of two-stage and moment restriction approaches for solving this problem in a nonlinear setting. We provide consistency guarantees for each algorithm, and demonstrate that these approaches achieve competitive results on synthetic data and data simulating a real-world task. In particular, our approach outperforms earlier methods that are not suited to leveraging proxy variables.

NeurIPS Conference 2021 Conference Paper

Self-Supervised Learning with Kernel Dependence Maximization

  • Yazhe Li
  • Roman Pogodin
  • Danica J. Sutherland
  • Arthur Gretton

We approach self-supervised learning of image representations from a statistical dependence perspective, proposing Self-Supervised Learning with the Hilbert-Schmidt Independence Criterion (SSL-HSIC). SSL-HSIC maximizes dependence between representations of transformations of an image and the image identity, while minimizing the kernelized variance of those representations. This framework yields a new understanding of InfoNCE, a variational lower bound on the mutual information (MI) between different transformations. While the MI itself is known to have pathologies which can result in learning meaningless representations, its bound is much better behaved: we show that it implicitly approximates SSL-HSIC (with a slightly different regularizer). Our approach also gives us insight into BYOL, a negative-free SSL method, since SSL-HSIC similarly learns local neighborhoods of samples. SSL-HSIC allows us to directly optimize statistical dependence in time linear in the batch size, without restrictive data assumptions or indirect mutual information estimators. Trained with or without a target network, SSL-HSIC matches the current state-of-the-art for standard linear evaluation on ImageNet, semi-supervised learning and transfer to other classification and vision tasks such as semantic segmentation, depth estimation and object recognition. Code is available at https://github.com/deepmind/ssl_hsic.
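
As a reference point, a minimal sketch of the standard (biased) empirical HSIC estimator that the SSL-HSIC objective builds on; the full objective also involves a kernelised-variance penalty and linear-time estimators, which are not shown.

    import numpy as np

    def hsic_biased(K, L):
        """Biased empirical HSIC from two (n, n) Gram matrices:
        trace(K H L H) / (n - 1)^2, where H is the centring matrix."""
        n = K.shape[0]
        H = np.eye(n) - np.ones((n, n)) / n
        return np.trace(K @ H @ L @ H) / (n - 1) ** 2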

NeurIPS Conference 2020 Conference Paper

A kernel test for quasi-independence

  • Tamara Fernandez
  • Wenkai Xu
  • Marc Ditzhaus
  • Arthur Gretton

We consider settings in which the data of interest correspond to pairs of ordered times, e.g., the birth times of the first and second child, the times at which a new user creates an account and makes the first purchase on a website, and the entry and survival times of patients in a clinical trial. In these settings, the two times are not independent (the second occurs after the first), yet it is still of interest to determine whether there exists significant dependence "beyond" their ordering in time. We refer to this notion as "quasi-(in)dependence." For instance, in a clinical trial, to avoid biased selection, we might wish to verify that recruitment times are quasi-independent of survival times, where dependencies might arise due to seasonal effects. In this paper, we propose a nonparametric statistical test of quasi-independence. Our test considers a potentially infinite space of alternatives, making it suitable for complex data where the nature of the possible quasi-dependence is not known in advance. Standard parametric approaches are recovered as special cases, such as the classical conditional Kendall's tau, and log-rank tests. The tests apply in the right-censored setting: an essential feature in clinical trials, where patients can withdraw from the study. We provide an asymptotic analysis of our test-statistic, and demonstrate in experiments that our test obtains better power than existing approaches, while being more computationally efficient.

NeurIPS Conference 2020 Conference Paper

A Non-Asymptotic Analysis for Stein Variational Gradient Descent

  • Anna Korba
  • Adil Salim
  • Michael Arbel
  • Giulia Luise
  • Arthur Gretton

We study the Stein Variational Gradient Descent (SVGD) algorithm, which optimises a set of particles to approximate a target probability distribution $\pi\propto e^{-V}$ on $\mathbb{R}^d$. In the population limit, SVGD performs gradient descent in the space of probability distributions on the KL divergence with respect to $\pi$, where the gradient is smoothed through a kernel integral operator. In this paper, we provide a novel finite time analysis for the SVGD algorithm. We provide a descent lemma establishing that the algorithm decreases the objective at each iteration, and rates of convergence. We also provide a convergence result of the finite particle system corresponding to the practical implementation of SVGD to its population version.
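
For reference, a minimal sketch of one SVGD particle update, the iteration whose descent and convergence properties the paper analyses; the Gaussian kernel, fixed bandwidth, step size and the toy Gaussian target are illustrative choices rather than the settings of the analysis.

    import numpy as np

    def svgd_step(particles, grad_log_pi, step=0.1, sigma=1.0):
        """One SVGD update: x_i <- x_i + step * (1/n) * sum_j [ k(x_j, x_i) grad log pi(x_j)
        + grad_{x_j} k(x_j, x_i) ], with a Gaussian kernel of bandwidth sigma."""
        x = particles
        diff = x[:, None, :] - x[None, :, :]                        # x_i - x_j
        K = np.exp(-(diff ** 2).sum(-1) / (2 * sigma ** 2))         # (n, n) Gram matrix
        drift = K @ grad_log_pi(x)                                  # attraction towards high density
        repulsion = (K[..., None] * diff / sigma ** 2).sum(axis=1)  # keeps particles spread out
        return x + step * (drift + repulsion) / len(x)

    # Toy target pi = N(0, I): grad log pi(x) = -x.
    rng = np.random.default_rng(0)
    x = rng.normal(3.0, 1.0, size=(200, 2))
    for _ in range(500):
        x = svgd_step(x, lambda z: -z)
    print(x.mean(axis=0))   # particle mean approaches 0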

ICML Conference 2020 Conference Paper

Kernelized Stein Discrepancy Tests of Goodness-of-fit for Time-to-Event Data

  • Tamara Fernandez
  • Nicolás Rivera
  • Wenkai Xu
  • Arthur Gretton

Survival Analysis and Reliability Theory are concerned with the analysis of time-to-event data, in which observations correspond to waiting times until an event of interest such as death from a particular disease or failure of a component in a mechanical system. This type of data is unique due to the presence of censoring, a type of missing data that occurs when we do not observe the actual time of the event of interest but, instead, we have access to an approximation for it given by a random interval to which the observation is known to belong. Most traditional methods are not designed to deal with censoring, and thus we need to adapt them to censored time-to-event data. In this paper, we focus on non-parametric goodness-of-fit testing procedures based on combining Stein's method and kernelized discrepancies. While for uncensored data, there is a natural way of implementing a kernelized Stein discrepancy test, for censored data there are several options, each of them with different advantages and disadvantages. In this paper, we propose a collection of kernelized Stein discrepancy tests for time-to-event data, and we study each of them theoretically and empirically; our experimental results show that our proposed methods perform better than existing tests, including previous tests based on a kernelized maximum mean discrepancy.

ICLR Conference 2020 Conference Paper

Kernelized Wasserstein Natural Gradient

  • Michael Arbel
  • Arthur Gretton
  • Wuchen Li
  • Guido Montúfar

Many machine learning problems can be expressed as the optimization of some cost functional over a parametric family of probability distributions. It is often beneficial to solve such optimization problems using natural gradient methods. These methods are invariant to the parametrization of the family, and thus can yield more effective optimization. Unfortunately, computing the natural gradient is challenging as it requires inverting a high dimensional matrix at each iteration. We propose a general framework to approximate the natural gradient for the Wasserstein metric, by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach leads to an estimator for gradient direction that can trade-off accuracy and computational cost, with theoretical guarantees. We verify its accuracy on simple examples, and show the advantage of using such an estimator in classification tasks on \texttt{Cifar10} and \texttt{Cifar100} empirically.

ICML Conference 2020 Conference Paper

Learning Deep Kernels for Non-Parametric Two-Sample Tests

  • Feng Liu 0003
  • Wenkai Xu
  • Jie Lu 0001
  • Guangquan Zhang 0001
  • Arthur Gretton
  • Danica J. Sutherland

We propose a class of kernel-based two-sample tests, which aim to determine whether two sets of samples are drawn from the same distribution. Our tests are constructed from kernels parameterized by deep neural nets, trained to maximize test power. These tests adapt to variations in distribution smoothness and shape over space, and are especially suited to high dimensions and complex data. By contrast, the simpler kernels used in prior kernel testing work are spatially homogeneous, and adaptive only in lengthscale. We explain how this scheme includes popular classifier-based two-sample tests as a special case, but improves on them in general. We provide the first proof of consistency for the proposed adaptation method, which applies both to kernels on deep features and to simpler radial basis kernels or multiple kernel learning. In experiments, we establish the superior performance of our deep kernels in hypothesis testing on benchmark and real-world data. The code of our deep-kernel-based two-sample tests is available at github.com/fengliu90/DK-for-TST.

NeurIPS Conference 2019 Conference Paper

Exponential Family Estimation via Adversarial Dynamics Embedding

  • Bo Dai
  • Zhen Liu
  • Hanjun Dai
  • Niao He
  • Arthur Gretton
  • Le Song
  • Dale Schuurmans

We present an efficient algorithm for maximum likelihood estimation (MLE) of exponential family models, with a general parametrization of the energy function that includes neural networks. We exploit the primal-dual view of the MLE with a kinetics augmented model to obtain an estimate associated with an adversarial dual sampler. To represent this sampler, we introduce a novel neural architecture, dynamics embedding, that generalizes Hamiltonian Monte-Carlo (HMC). The proposed approach inherits the flexibility of HMC while enabling tractable entropy estimation for the augmented model. By learning both a dual sampler and the primal model simultaneously, and sharing parameters between them, we obviate the requirement to design a separate sampling procedure once the model has been trained, leading to more effective learning. We show that many existing estimators, such as contrastive divergence, pseudo/composite-likelihood, score matching, minimum Stein discrepancy estimator, non-local contrastive objectives, noise-contrastive estimation, and minimum probability flow, are special cases of the proposed approach, each expressed by a different (fixed) dual sampler. An empirical investigation shows that adapting the sampler during MLE can significantly improve on state-of-the-art estimators.

NeurIPS Conference 2019 Conference Paper

Kernel Instrumental Variable Regression

  • Rahul Singh
  • Maneesh Sahani
  • Arthur Gretton

Instrumental variable (IV) regression is a strategy for learning causal relationships in observational data. If measurements of input X and output Y are confounded, the causal relationship can nonetheless be identified if an instrumental variable Z is available that influences X directly, but is conditionally independent of Y given X and the unmeasured confounder. The classic two-stage least squares algorithm (2SLS) simplifies the estimation problem by modeling all relationships as linear functions. We propose kernel instrumental variable regression (KIV), a nonparametric generalization of 2SLS, modeling relations among X, Y, and Z as nonlinear functions in reproducing kernel Hilbert spaces (RKHSs). We prove the consistency of KIV under mild assumptions, and derive conditions under which convergence occurs at the minimax optimal rate for unconfounded, single-stage RKHS regression. In doing so, we obtain an efficient ratio between training sample sizes used in the algorithm's first and second stages. In experiments, KIV outperforms state-of-the-art alternatives for nonparametric IV regression.

ICML Conference 2019 Conference Paper

Learning deep kernels for exponential family densities

  • Wenliang Li
  • Danica J. Sutherland
  • Heiko Strathmann
  • Arthur Gretton

The kernel exponential family is a rich class of distributions, which can be fit efficiently and with statistical guarantees by score matching. Being required to choose a priori a simple kernel such as the Gaussian, however, limits its practical applicability. We provide a scheme for learning a kernel parameterized by a deep network, which can find complex location-dependent local features of the data geometry. This gives a very rich class of density models, capable of fitting complex structures on moderate-dimensional problems. Compared to deep density models fit via maximum likelihood, our approach provides a complementary set of strengths and tradeoffs: in empirical studies, the former can yield higher likelihoods, whereas the latter gives better estimates of the gradient of the log density, the score, which describes the distribution’s shape.

NeurIPS Conference 2019 Conference Paper

Maximum Mean Discrepancy Gradient Flow

  • Michael Arbel
  • Anna Korba
  • Adil Salim
  • Arthur Gretton

We construct a Wasserstein gradient flow of the maximum mean discrepancy (MMD) and study its convergence properties. The MMD is an integral probability metric defined for a reproducing kernel Hilbert space (RKHS), and serves as a metric on probability measures for a sufficiently rich RKHS. We obtain conditions for convergence of the gradient flow towards a global optimum, that can be related to particle transport when optimizing neural networks. We also propose a way to regularize this MMD flow, based on an injection of noise in the gradient. This algorithmic fix comes with theoretical and empirical evidence. The practical implementation of the flow is straightforward, since both the MMD and its gradient have simple closed-form expressions, which can be easily estimated with samples.
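
Since both the MMD and its gradient are closed-form, a particle discretization is short to write down. The sketch below takes noisy Euler steps along the negative gradient of the MMD witness function against a fixed target sample; the bandwidth, step size, noise level, and helper names are arbitrary illustrative choices, and the noise injection is a simplified stand-in for the regularization analyzed in the paper.

```python
import numpy as np

def rbf_grad_x(X, Y, sigma):
    """grad_x k(x_i, y_j) for the RBF kernel, returned with shape (n, m, d)."""
    diff = X[:, None, :] - Y[None, :, :]
    K = np.exp(-np.sum(diff**2, axis=2) / (2 * sigma**2))
    return -diff / sigma**2 * K[:, :, None]

def mmd_flow_step(X, Y, step=0.5, noise=0.2, sigma=2.0, rng=None):
    """One noisy Euler step of the MMD flow.  Particles X descend the gradient of
    the witness function f(x) = mean_j k(x, x_j) - mean_j k(x, y_j), where Y are
    fixed target samples; the gradient is evaluated at noise-perturbed positions,
    a simplified version of the noise injection discussed in the paper."""
    if rng is None:
        rng = np.random.default_rng()
    Xn = X + noise * rng.normal(size=X.shape)
    grad_f = rbf_grad_x(Xn, X, sigma).mean(axis=1) - rbf_grad_x(Xn, Y, sigma).mean(axis=1)
    return X - step * grad_f

rng = np.random.default_rng(0)
X = rng.normal(loc=3.0, size=(200, 2))   # source particles
Y = rng.normal(loc=0.0, size=(200, 2))   # fixed samples from the target
for _ in range(300):
    X = mmd_flow_step(X, Y, rng=rng)
print(X.mean(axis=0))                    # drifts towards the target mean (0, 0)
```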

NeurIPS Conference 2018 Conference Paper

BRUNO: A Deep Recurrent Model for Exchangeable Data

  • Iryna Korshunova
  • Jonas Degrave
  • Ferenc Huszar
  • Yarin Gal
  • Arthur Gretton
  • Joni Dambre

We present a novel model architecture which leverages deep learning tools to perform exact Bayesian inference on sets of high dimensional, complex observations. Our model is provably exchangeable, meaning that the joint distribution over observations is invariant under permutation: this property lies at the heart of Bayesian inference. The model does not require variational approximations to train, and new samples can be generated conditional on previous samples, with cost linear in the size of the conditioning set. The advantages of our architecture are demonstrated on learning tasks that require generalisation from short observed sequences while modelling sequence variability, such as conditional image generation, few-shot learning, and anomaly detection.

NeurIPS Conference 2018 Conference Paper

Informative Features for Model Comparison

  • Wittawat Jitkrittum
  • Heishiro Kanagawa
  • Patsorn Sangkloy
  • James Hays
  • Bernhard Schölkopf
  • Arthur Gretton

Given two candidate models, and a set of target observations, we address the problem of measuring the relative goodness of fit of the two models. We propose two new statistical tests which are nonparametric, computationally efficient (runtime complexity is linear in the sample size), and interpretable. As a unique advantage, our tests can produce a set of examples (informative features) indicating the regions in the data domain where one model fits significantly better than the other. In a real-world problem of comparing GAN models, the test power of our new test matches that of the state-of-the-art test of relative goodness of fit, while being one order of magnitude faster.

NeurIPS Conference 2018 Conference Paper

On gradient regularizers for MMD GANs

  • Michael Arbel
  • Dougal Sutherland
  • Mikołaj Bińkowski
  • Arthur Gretton

We propose a principled method for gradient-based regularization of the critic of GAN-like models trained by adversarially optimizing the kernel of a Maximum Mean Discrepancy (MMD). We show that controlling the gradient of the critic is vital to having a sensible loss function, and devise a method to enforce exact, analytical gradient constraints at no additional cost compared to existing approximate techniques based on additive regularizers. The new loss function is provably continuous, and experiments show that it stabilizes and accelerates training, giving image generation models that outperform state-of-the-art methods on $160 \times 160$ CelebA and $64 \times 64$ unconditional ImageNet.

NeurIPS Conference 2017 Conference Paper

A Linear-Time Kernel Goodness-of-Fit Test

  • Wittawat Jitkrittum
  • Wenkai Xu
  • Zoltan Szabo
  • Kenji Fukumizu
  • Arthur Gretton

We propose a novel adaptive test of goodness-of-fit, with computational cost linear in the number of samples. We learn the test features that best indicate the differences between observed samples and a reference model, by minimizing the false negative rate. These features are constructed via Stein's method, meaning that it is not necessary to compute the normalising constant of the model. We analyse the asymptotic Bahadur efficiency of the new test, and prove that under a mean-shift alternative, our test always has greater relative efficiency than a previous linear-time kernel test, regardless of the choice of parameters for that test. In experiments, the performance of our method exceeds that of the earlier linear-time test, and matches or exceeds the power of a quadratic-time kernel test. In high dimensions and where model structure may be exploited, our goodness of fit test performs far better than a quadratic-time two-sample test based on the Maximum Mean Discrepancy, with samples drawn from the model.

ICML Conference 2017 Conference Paper

An Adaptive Test of Independence with Analytic Kernel Embeddings

  • Wittawat Jitkrittum
  • Zoltán Szabó 0001
  • Arthur Gretton

A new computationally efficient dependence measure, and an adaptive statistical test of independence, are proposed. The dependence measure is the difference between analytic embeddings of the joint distribution and the product of the marginals, evaluated at a finite set of locations (features). These features are chosen so as to maximize a lower bound on the test power, resulting in a test that is data-efficient, and that runs in linear time (with respect to the sample size n). The optimized features can be interpreted as evidence to reject the null hypothesis, indicating regions in the joint domain where the joint distribution and the product of the marginals differ most. Consistency of the independence test is established, for an appropriate choice of features. In real-world benchmarks, independence tests using the optimized features perform comparably to the state-of-the-art quadratic-time HSIC test, and outperform competing O(n) and O(n log n) tests.

JMLR Journal 2017 Journal Article

Density Estimation in Infinite Dimensional Exponential Families

  • Bharath Sriperumbudur
  • Kenji Fukumizu
  • Arthur Gretton
  • Aapo Hyvärinen
  • Revant Kumar

In this paper, we consider an infinite dimensional exponential family $\mathcal{P}$ of probability densities, which are parametrized by functions in a reproducing kernel Hilbert space $\mathcal{H}$, and show it to be quite rich in the sense that a broad class of densities on $\mathbb{R}^d$ can be approximated arbitrarily well in Kullback-Leibler (KL) divergence by elements in $\mathcal{P}$. Motivated by this approximation property, the paper addresses the question of estimating an unknown density $p_0$ through an element in $\mathcal{P}$. Standard techniques like maximum likelihood estimation (MLE) or pseudo MLE (based on the method of sieves), which are based on minimizing the KL divergence between $p_0$ and $\mathcal{P}$, do not yield practically useful estimators because of their inability to efficiently handle the log-partition function. We propose an estimator $\hat{p}_n$ based on minimizing the Fisher divergence, $J(p_0\Vert p)$ between $p_0$ and $p\in \mathcal{P}$, which involves solving a simple finite-dimensional linear system. When $p_0\in\mathcal{P}$, we show that the proposed estimator is consistent, and provide a convergence rate of $n^{-\min\left\{\frac{2}{3},\frac{2\beta+1}{2\beta+2}\right\}}$ in Fisher divergence under the smoothness assumption that $\log p_0\in\mathcal{R}(C^\beta)$ for some $\beta\ge 0$, where $C$ is a certain Hilbert-Schmidt operator on $\mathcal{H}$ and $\mathcal{R}(C^\beta)$ denotes the image of $C^\beta$. We also investigate the misspecified case of $p_0\notin\mathcal{P}$ and show that $J(p_0\Vert\hat{p}_n)\rightarrow \inf_{p\in\mathcal{P}}J(p_0\Vert p)$ as $n\rightarrow \infty$, and provide a rate for this convergence under a similar smoothness condition as above. Through numerical simulations we demonstrate that the proposed estimator outperforms the nonparametric kernel density estimator, and that the advantage of the proposed estimator grows as $d$ increases.

UAI Conference 2016 Conference Paper

A Kernel Test for Three-Variable Interactions with Random Processes

  • Paul K. Rubenstein
  • Kacper Chwialkowski
  • Arthur Gretton

We apply a wild bootstrap method to the Lancaster three-variable interaction measure in order to detect factorisation of the joint distribution on three variables forming a stationary random process, for which the existing permutation bootstrap method fails. As in the i.i.d. case, the Lancaster test is found to outperform existing tests in cases for which two independent variables individually have a weak influence on a third, but when considered jointly the influence is strong. The main contributions of this paper are twofold: first, we prove that the Lancaster statistic satisfies the conditions required to estimate the quantiles of the null distribution using the wild bootstrap; second, the manner in which this is proved is novel, simpler than existing methods, and can further be applied to other statistics.

ICML Conference 2016 Conference Paper

A Kernel Test of Goodness of Fit

  • Kacper Chwialkowski
  • Heiko Strathmann
  • Arthur Gretton

We propose a nonparametric statistical test for goodness-of-fit: given a set of samples, the test determines how likely it is that these were generated from a target density function. The measure of goodness-of-fit is a divergence constructed via Stein’s method using functions from a Reproducing Kernel Hilbert Space. Our test statistic is based on an empirical estimate of this divergence, taking the form of a V-statistic in terms of the log gradients of the target density and the kernel. We derive a statistical test, both for i.i.d. and non-i.i.d. samples, where we estimate the null distribution quantiles using a wild bootstrap procedure. We apply our test to quantifying convergence of approximate Markov Chain Monte Carlo methods, statistical model criticism, and evaluating quality of fit vs model complexity in nonparametric density estimation.
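
As a concrete illustration, the V-statistic form of a kernelized Stein discrepancy for i.i.d. samples is only a few lines of numpy; the wild-bootstrap threshold for non-i.i.d. samples is omitted here, the RBF bandwidth is an arbitrary choice, and the helper name ksd_vstat is hypothetical.

```python
import numpy as np

def ksd_vstat(X, score, sigma=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy with an RBF
    kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).  `score(X)` returns the
    log-density gradient at each row of X; the normalising constant is never needed."""
    n, d = X.shape
    S = score(X)                                          # (n, d) scores
    diff = X[:, None, :] - X[None, :, :]                  # (n, n, d)
    sqd = np.sum(diff**2, axis=2)
    K = np.exp(-sqd / (2 * sigma**2))
    grad_x = -diff / sigma**2 * K[:, :, None]             # grad of k wrt its first argument
    grad_y = -grad_x                                      # grad of k wrt its second argument
    term1 = (S @ S.T) * K
    term2 = np.einsum('id,ijd->ij', S, grad_y)
    term3 = np.einsum('jd,ijd->ij', S, grad_x)
    term4 = K * (d / sigma**2 - sqd / sigma**4)
    return (term1 + term2 + term3 + term4).mean()

# toy usage: check samples against a standard normal model (score = -x)
rng = np.random.default_rng(0)
score = lambda X: -X
X_good = rng.normal(size=(300, 2))
X_bad = rng.normal(loc=1.0, size=(300, 2))
print(ksd_vstat(X_good, score), ksd_vstat(X_bad, score))  # the second value should be larger
```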

NeurIPS Conference 2016 Conference Paper

Interpretable Distribution Features with Maximum Testing Power

  • Wittawat Jitkrittum
  • Zoltán Szabó
  • Kacper Chwialkowski
  • Arthur Gretton

Two semimetrics on probability distributions are proposed, given as the sum of differences of expectations of analytic functions evaluated at spatial or frequency locations (i.e., features). The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound on test power for a statistical test using these features. The result is a parsimonious and interpretable indication of how and where two distributions differ locally. An empirical estimate of the test power criterion converges with increasing sample size, ensuring the quality of the returned features. In real-world benchmarks on high-dimensional text and image data, linear-time tests using the proposed semimetrics achieve comparable performance to the state-of-the-art quadratic-time maximum mean discrepancy test, while returning human-interpretable features that explain the test results.
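
A minimal sketch of the mean-embedding (spatial-feature) variant of such a test is given below, with randomly drawn rather than power-optimized test locations; the covariance regularization and the location optimization are simplified relative to the paper, and the helper name me_test_stat is hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def me_test_stat(X, Y, V, sigma=1.0, reg=1e-5):
    """Mean-embedding two-sample statistic at J test locations V (equal sample sizes):
    z_i[j] = k(x_i, v_j) - k(y_i, v_j),  stat = n * zbar' (Sigma + reg I)^{-1} zbar,
    approximately chi-squared with J degrees of freedom under H0."""
    def feat(A):
        d2 = np.sum((A[:, None, :] - V[None, :, :])**2, axis=2)
        return np.exp(-d2 / (2 * sigma**2))               # (n, J) analytic features
    Z = feat(X) - feat(Y)
    n, J = Z.shape
    zbar = Z.mean(axis=0)
    Sigma = np.cov(Z, rowvar=False) + reg * np.eye(J)
    stat = n * zbar @ np.linalg.solve(Sigma, zbar)
    return stat, 1.0 - chi2.cdf(stat, df=J)

# toy usage with random (not power-optimised) test locations
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
Y = rng.normal(loc=0.5, size=(500, 2))
V = rng.normal(size=(3, 2))                                # J = 3 spatial locations
print(me_test_stat(X, Y, V))                               # small p-value expected
```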

JMLR Journal 2016 Journal Article

Kernel Mean Shrinkage Estimators

  • Krikamol Muandet
  • Bharath Sriperumbudur
  • Kenji Fukumizu
  • Arthur Gretton
  • Bernhard Schölkopf

A mean function in a reproducing kernel Hilbert space (RKHS), or a kernel mean, is central to kernel methods in that it is used by many classical algorithms such as kernel principal component analysis, and it also forms the core inference step of modern kernel methods that rely on embedding probability distributions in RKHSs. Given a finite sample, an empirical average has been used commonly as a standard estimator of the true kernel mean. Despite a widespread use of this estimator, we show that it can be improved thanks to the well-known Stein phenomenon. We propose a new family of estimators called kernel mean shrinkage estimators (KMSEs), which benefit from both theoretical justifications and good empirical performance. The results demonstrate that the proposed estimators outperform the standard one, especially in a "large $d$, small $n$" paradigm.

JMLR Journal 2016 Journal Article

Learning Theory for Distribution Regression

  • Zoltán Szabó
  • Bharath K. Sriperumbudur
  • Barnabás Póczos
  • Arthur Gretton

We focus on the distribution regression problem: regressing to vector-valued outputs from probability measures. Many important machine learning and statistical tasks fit into this framework, including multi-instance learning and point estimation problems without analytical solution (such as hyperparameter or entropy estimation). Despite the large number of available heuristics in the literature, the inherent two-stage sampled nature of the problem makes the theoretical analysis quite challenging, since in practice only samples from sampled distributions are observable, and the estimates have to rely on similarities computed between sets of points. To the best of our knowledge, the only existing technique with consistency guarantees for distribution regression requires kernel density estimation as an intermediate step (which often performs poorly in practice), and the domain of the distributions to be compact Euclidean. In this paper, we study a simple, analytically computable, ridge regression-based alternative to distribution regression, where we embed the distributions to a reproducing kernel Hilbert space, and learn the regressor from the embeddings to the outputs. Our main contribution is to prove that this scheme is consistent in the two-stage sampled setup under mild conditions (on separable topological domains enriched with kernels): we present an exact computational-statistical efficiency trade-off analysis showing that our estimator is able to match the one-stage sampled minimax optimal rate (Caponnetto and De Vito, 2007; Steinwart et al., 2009). This result answers a $17$-year-old open question, establishing the consistency of the classical set kernel (Haussler, 1999; Gärtner et al., 2002) in regression. We also cover consistency for more recent kernels on distributions, including those due to Christmann and Steinwart (2010).
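
The two-stage construction analysed here is straightforward to sketch: embed each bag of samples through its empirical kernel mean, take inner products of embeddings (the classical set kernel) as the outer kernel, and run kernel ridge regression from bags to outputs. The bandwidth, ridge parameter, and helper names below are illustrative assumptions for a minimal example.

```python
import numpy as np

def set_kernel(bags_a, bags_b, sigma=1.0):
    """K[i, j] = < mu_hat(A_i), mu_hat(B_j) >: the average RBF kernel value over
    all pairs of points from the two bags (the classical set kernel)."""
    K = np.empty((len(bags_a), len(bags_b)))
    for i, A in enumerate(bags_a):
        for j, B in enumerate(bags_b):
            d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=2)
            K[i, j] = np.exp(-d2 / (2 * sigma**2)).mean()
    return K

def fit_distribution_regression(bags, y, lam=1e-3, sigma=1.0):
    """Kernel ridge regression on the empirical mean embeddings of the bags."""
    K = set_kernel(bags, bags, sigma)
    alpha = np.linalg.solve(K + lam * len(bags) * np.eye(len(bags)), y)
    return lambda new_bags: set_kernel(new_bags, bags, sigma) @ alpha

# toy usage: each bag is a sample from N(m, 1) and the regression target is m
rng = np.random.default_rng(0)
means = rng.uniform(-2, 2, size=40)
bags = [rng.normal(m, 1.0, size=(50, 1)) for m in means]
predict = fit_distribution_regression(bags, means)
test_bag = [rng.normal(1.0, 1.0, size=(50, 1))]
print(predict(test_bag))    # should be roughly 1.0
```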

ICML Conference 2015 Conference Paper

A low variance consistent test of relative dependency

  • Wacha Bounliphone
  • Arthur Gretton
  • Arthur Tenenhaus
  • Matthew B. Blaschko

We describe a novel non-parametric statistical hypothesis test of relative dependence between a source variable and two candidate target variables. Such a test enables us to determine whether one source variable is significantly more dependent on a first target variable or a second. Dependence is measured via the Hilbert-Schmidt Independence Criterion (HSIC), resulting in a pair of empirical dependence measures (source-target 1, source-target 2). We test whether the first dependence measure is significantly larger than the second. Modeling the covariance between these HSIC statistics leads to a provably more powerful test than the construction of independent HSIC statistics by sub-sampling. The resulting test is consistent and unbiased, and (being based on U-statistics) has favorable convergence properties. The test can be computed in quadratic time, matching the computational complexity of standard empirical HSIC estimators. The effectiveness of the test is demonstrated on several real-world problems: we identify language groups from a multilingual corpus, and we prove that tumor location is more dependent on gene expression than chromosomal imbalances. Source code is available for download at https://github.com/wbounliphone/reldep/.

NeurIPS Conference 2015 Conference Paper

Fast Two-Sample Testing with Analytic Representations of Probability Measures

  • Kacper Chwialkowski
  • Aaditya Ramdas
  • Dino Sejdinovic
  • Arthur Gretton

We propose a class of nonparametric two-sample tests with a cost linear in the sample size. Two tests are given, both based on an ensemble of distances between analytic functions representing each of the distributions. The first test uses smoothed empirical characteristic functions to represent the distributions, the second uses distribution embeddings in a reproducing kernel Hilbert space. Analyticity implies that differences in the distributions may be detected almost surely at a finite number of randomly chosen locations/frequencies. The new tests are consistent against a larger class of alternatives than the previous linear-time tests based on the (non-smoothed) empirical characteristic functions, while being much faster than the current state-of-the-art quadratic-time kernel-based or energy distance-based tests. Experiments on artificial benchmarks and on challenging real-world testing problems demonstrate that our tests give a better power/time tradeoff than competing approaches, and in some cases, better outright power than even the most expensive quadratic-time tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable with low order statistics.

NeurIPS Conference 2015 Conference Paper

Gradient-free Hamiltonian Monte Carlo with Efficient Kernel Exponential Families

  • Heiko Strathmann
  • Dino Sejdinovic
  • Samuel Livingstone
  • Zoltan Szabo
  • Arthur Gretton

We propose Kernel Hamiltonian Monte Carlo (KMC), a gradient-free adaptive MCMC algorithm based on Hamiltonian Monte Carlo (HMC). On target densities where classical HMC is not an option due to intractable gradients, KMC adaptively learns the target's gradient structure by fitting an exponential family model in a Reproducing Kernel Hilbert Space. Computational costs are reduced by two novel efficient approximations to this gradient. While being asymptotically exact, KMC mimics HMC in terms of sampling efficiency, and offers substantial mixing improvements over state-of-the-art gradient-free samplers. We support our claims with experimental studies on both toy and real-world applications, including Approximate Bayesian Computation and exact-approximate MCMC.

UAI Conference 2015 Conference Paper

Kernel-Based Just-In-Time Learning for Passing Expectation Propagation Messages

  • Wittawat Jitkrittum
  • Arthur Gretton
  • Nicolas Heess
  • S. M. Ali Eslami
  • Balaji Lakshminarayanan
  • Dino Sejdinovic
  • Zoltán Szabó 0001

We propose an efficient nonparametric strategy for learning a message operator in expectation propagation (EP), which takes as input the set of incoming messages to a factor node, and produces an outgoing message as output. This learned operator replaces the multivariate integral required in classical EP, which may not have an analytic expression. We use kernel-based regression, which is trained on a set of probability distributions representing the incoming messages, and the associated outgoing messages. The kernel approach has two main advantages: first, it is fast, as it is implemented using a novel two-layer random feature representation of the input message distributions; second, it has principled uncertainty estimates, and can be cheaply updated online, meaning it can request and incorporate new training data when it encounters inputs on which it is uncertain. In experiments, our approach is able to solve learning problems where a single message operator is required for multiple, substantially different data sets (logistic regression for a variety of classification problems), where it is essential to accurately assess uncertainty and to efficiently and robustly update the message operator.

ICML Conference 2014 Conference Paper

A Kernel Independence Test for Random Processes

  • Kacper Chwialkowski
  • Arthur Gretton

A non-parametric approach to the problem of testing the independence of two random processes is developed. The test statistic is the Hilbert-Schmidt Independence Criterion (HSIC), which was used previously in testing independence for i.i.d. pairs of variables. The asymptotic behaviour of HSIC is established when computed from samples drawn from random processes. It is shown that earlier bootstrap procedures which worked in the i.i.d. case will fail for random processes, and an alternative consistent estimate of the p-values is proposed. Tests on artificial data and real-world forex data indicate that the new test procedure discovers dependence which is missed by linear approaches, while the earlier bootstrap procedure returns an elevated number of false positives.

NeurIPS Conference 2014 Conference Paper

A Wild Bootstrap for Degenerate Kernel Tests

  • Kacper Chwialkowski
  • Dino Sejdinovic
  • Arthur Gretton

A wild bootstrap method for nonparametric hypothesis tests based on kernel distribution embeddings is proposed. This bootstrap method is used to construct provably consistent tests that apply to random processes, for which the naive permutation-based bootstrap fails. It applies to a large group of kernel tests based on V-statistics, which are degenerate under the null hypothesis, and non-degenerate elsewhere. To illustrate this approach, we construct a two-sample test, an instantaneous independence test and a multiple lag independence test for time series. In experiments, the wild bootstrap gives strong performance on synthetic examples, on audio data, and in performance benchmarking for the Gibbs sampler. The code is available at https://github.com/kacperChwialkowski/wildBootstrap.

ICML Conference 2014 Conference Paper

Kernel Adaptive Metropolis-Hastings

  • Dino Sejdinovic
  • Heiko Strathmann
  • Maria Lomeli Garcia
  • Christophe Andrieu
  • Arthur Gretton

A Kernel Adaptive Metropolis-Hastings algorithm is introduced, for the purpose of sampling from a target distribution with strongly nonlinear support. The algorithm embeds the trajectory of the Markov chain into a reproducing kernel Hilbert space (RKHS), such that the feature space covariance of the samples informs the choice of proposal. The procedure is computationally efficient and straightforward to implement, since the RKHS moves can be integrated out analytically: our proposal distribution in the original space is a normal distribution whose mean and covariance depend on where the current sample lies in the support of the target distribution, and adapts to its local covariance structure. Furthermore, the procedure requires neither gradients nor any other higher order information about the target, making it particularly attractive for contexts such as Pseudo-Marginal MCMC. Kernel Adaptive Metropolis-Hastings outperforms competing fixed and adaptive samplers on multivariate, highly nonlinear target distributions, arising in both real-world and synthetic examples.

ICML Conference 2014 Conference Paper

Kernel Mean Estimation and Stein Effect

  • Krikamol Muandet
  • Kenji Fukumizu
  • Bharath K. Sriperumbudur
  • Arthur Gretton
  • Bernhard Schölkopf

A mean function in reproducing kernel Hilbert space (RKHS), or a kernel mean, is an important part of many algorithms ranging from kernel principal component analysis to Hilbert-space embedding of distributions. Given a finite sample, an empirical average is the standard estimate for the true kernel mean. We show that this estimator can be improved due to a well-known phenomenon in statistics called the Stein phenomenon. Our theoretical analysis reveals the existence of a wide class of estimators that are better than the standard one. Focusing on a subset of this class, we propose efficient shrinkage estimators for the kernel mean. Empirical evaluations on several applications clearly demonstrate that the proposed estimators outperform the standard kernel mean estimator.

AAAI Conference 2014 Conference Paper

Monte Carlo Filtering Using Kernel Embedding of Distributions

  • Motonobu Kanagawa
  • Yu Nishiyama
  • Arthur Gretton
  • Kenji Fukumizu

Recent advances of kernel methods have yielded a framework for representing probabilities using a reproducing kernel Hilbert space, called kernel embedding of distributions. In this paper, we propose a Monte Carlo filtering algorithm based on kernel embeddings. The proposed method is applied to state-space models where sampling from the transition model is possible, while the observation model is to be learned from training samples without assuming a parametric model. As a theoretical basis of the proposed method, we prove consistency of the Monte Carlo method combined with kernel embeddings. Experimental results on synthetic models and real vision-based robot localization confirm the effectiveness of the proposed approach.

NeurIPS Conference 2013 Conference Paper

A Kernel Test for Three-Variable Interactions

  • Dino Sejdinovic
  • Arthur Gretton
  • Wicher Bergsma

We introduce kernel nonparametric tests for Lancaster three-variable interaction and for total independence, using embeddings of signed measures into a reproducing kernel Hilbert space. The resulting test statistics are straightforward to compute, and are used in powerful three-variable interaction tests, which are consistent against all alternatives for a large family of reproducing kernels. We show the Lancaster test to be sensitive to cases where two independent causes individually have weak influence on a third dependent variable, but their combined effect has a strong influence. This makes the Lancaster test especially suited to finding structure in directed graphical models, where it outperforms competing nonparametric tests in detecting such V-structures.

NeurIPS Conference 2013 Conference Paper

B-test: A Non-parametric, Low Variance Kernel Two-sample Test

  • Wojciech Zaremba
  • Arthur Gretton
  • Matthew Blaschko

We propose a family of maximum mean discrepancy (MMD) kernel two-sample tests that have low sample complexity and are consistent. The test has a hyperparameter that allows one to control the tradeoff between sample complexity and computational time. Our family of tests, which we denote as B-tests, is both computationally and statistically efficient, combining favorable properties of previously proposed MMD two-sample tests. It does so by better leveraging samples to produce low variance estimates in the finite sample case, while avoiding a quadratic number of kernel evaluations and complex null-hypothesis approximation as would be required by tests relying on one sample U-statistics. The B-test uses a smaller than quadratic number of kernel evaluations and avoids completely the computational burden of complex null-hypothesis approximation while maintaining consistency and probabilistically conservative thresholds on Type I error. Finally, recent results of combining multiple kernels transfer seamlessly to our hypothesis test, allowing a further increase in discriminative power and decrease in sample complexity.
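
A minimal numpy sketch of the block idea follows: compute the unbiased quadratic-time MMD^2 within each block, average across blocks, and compare against a normal approximation driven by the empirical block variance. The kernel, block size, one-sided threshold, and helper names are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np
from scipy.stats import norm

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased quadratic-time MMD^2 estimate with an RBF kernel."""
    def gram(A, B):
        d2 = np.sum((A[:, None, :] - B[None, :, :])**2, axis=2)
        return np.exp(-d2 / (2 * sigma**2))
    n, m = len(X), len(Y)
    Kxx, Kyy, Kxy = gram(X, X), gram(Y, Y), gram(X, Y)
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

def b_test(X, Y, block=50, sigma=1.0):
    """B-test sketch: average the unbiased MMD^2 over disjoint blocks and use a
    normal approximation across blocks, with variance taken from the block estimates."""
    n_blocks = min(len(X), len(Y)) // block
    stats = np.array([mmd2_unbiased(X[i * block:(i + 1) * block],
                                    Y[i * block:(i + 1) * block], sigma)
                      for i in range(n_blocks)])
    z = stats.mean() / (stats.std(ddof=1) / np.sqrt(n_blocks))
    return stats.mean(), 1.0 - norm.cdf(z)                 # one-sided p-value

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
Y = rng.normal(loc=0.5, size=(1000, 2))
print(b_test(X, Y))                                         # small p-value expected
```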

UAI Conference 2013 Conference Paper

Hilbert Space Embeddings of Predictive State Representations

  • Byron Boots
  • Geoffrey J. Gordon
  • Arthur Gretton

Predictive State Representations (PSRs) are an expressive class of models for controlled stochastic processes. PSRs represent state as a set of predictions of future observable events. Because PSRs are defined entirely in terms of observable data, statistically consistent estimates of PSR parameters can be learned efficiently by manipulating moments of observed training data. Most learning algorithms for PSRs have assumed that actions and observations are finite with low cardinality. In this paper, we generalize PSRs to infinite sets of observations and actions, using the recent concept of Hilbert space embeddings of distributions. The essence is to represent the state as one or more nonparametric conditional embedding operators in a Reproducing Kernel Hilbert Space (RKHS) and leverage recent work in kernel methods to estimate, predict, and update the representation. We show that these Hilbert space embeddings of PSRs are able to gracefully handle continuous actions and observations, and that our learned models outperform competing system identification algorithms on several prediction benchmarks.

JMLR Journal 2013 Journal Article

Kernel Bayes' Rule: Bayesian Inference with Positive Definite Kernels

  • Kenji Fukumizu
  • Le Song
  • Arthur Gretton

A kernel method for realizing Bayes' rule is proposed, based on representations of probabilities in reproducing kernel Hilbert spaces. Probabilities are uniquely characterized by the mean of the canonical map to the RKHS. The prior and conditional probabilities are expressed in terms of RKHS functions of an empirical sample: no explicit parametric model is needed for these quantities. The posterior is likewise an RKHS mean of a weighted sample. The estimator for the expectation of a function of the posterior is derived, and rates of consistency are shown. Some representative applications of the kernel Bayes' rule are presented, including Bayesian computation without likelihood and filtering with a nonparametric state-space model.

ICML Conference 2013 Conference Paper

Smooth Operators

  • Steffen Grünewälder
  • Arthur Gretton
  • John Shawe-Taylor

We develop a generic approach to form smooth versions of basic mathematical operations like multiplication, composition, change of measure, and conditional expectation, among others. Operations which result in functions outside the reproducing kernel Hilbert space (such as the product of two RKHS functions) are approximated via a natural cost function, such that the solution is guaranteed to be in the targeted RKHS. This approximation problem is reduced to a regression problem using an adjoint trick, and solved in a vector-valued RKHS, consisting of continuous, linear, smooth operators which map from an input, real-valued RKHS to the desired target RKHS. Important constraints, such as an almost everywhere positive density, can be enforced or approximated naturally in this framework, using convex constraints on the operators. Finally, smooth operators can be composed to accomplish more complex machine learning tasks, such as the sum rule and kernelized approximate Bayesian inference, where state-of-the-art convergence rates are obtained.

JMLR Journal 2012 Journal Article

A Kernel Two-Sample Test

  • Arthur Gretton
  • Karsten M. Borgwardt
  • Malte J. Rasch
  • Bernhard Schölkopf
  • Alexander Smola

We propose a framework for analyzing and comparing distributions, which we use to construct statistical tests to determine if two samples are drawn from different distributions. Our test statistic is the largest difference in expectations over functions in the unit ball of a reproducing kernel Hilbert space (RKHS), and is called the maximum mean discrepancy (MMD). We present two distribution-free tests based on large deviation bounds for the MMD, and a third test based on the asymptotic distribution of this statistic. The MMD can be computed in quadratic time, although efficient linear time approximations are available. Our statistic is an instance of an integral probability metric, and various classical metrics on distributions are obtained when alternative function classes are used in place of an RKHS. We apply our two-sample tests to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where they perform strongly. Excellent performance is also obtained when comparing distributions over graphs, for which these are the first such tests.
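
For concreteness, the biased MMD^2 estimator is a few lines of numpy. The sketch below pairs it with a permutation threshold as a simple stand-in; the paper itself derives distribution-free bounds and the asymptotic null distribution rather than using permutations, and the bandwidth and helper names here are illustrative.

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    """Biased MMD^2: squared RKHS distance between the empirical mean embeddings."""
    Z = np.vstack([X, Y])
    d2 = np.sum((Z[:, None, :] - Z[None, :, :])**2, axis=2)
    K = np.exp(-d2 / (2 * sigma**2))
    n = len(X)
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

def mmd_permutation_test(X, Y, sigma=1.0, n_perm=200, rng=None):
    """Two-sample test: p-value of the observed MMD^2 under random relabelling.
    (The paper derives distribution-free and asymptotic thresholds instead; the
    permutation null here is only a simple stand-in.)"""
    if rng is None:
        rng = np.random.default_rng()
    observed = mmd2_biased(X, Y, sigma)
    Z, n, count = np.vstack([X, Y]), len(X), 0
    for _ in range(n_perm):
        p = rng.permutation(len(Z))
        count += mmd2_biased(Z[p[:n]], Z[p[n:]], sigma) >= observed
    return observed, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
Y = rng.normal(loc=0.5, size=(100, 3))
print(mmd_permutation_test(X, Y, rng=rng))      # small p-value expected
```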

JMLR Journal 2012 Journal Article

Feature Selection via Dependence Maximization

  • Le Song
  • Alex Smola
  • Arthur Gretton
  • Justin Bedo
  • Karsten Borgwardt

We introduce a framework for feature selection based on dependence maximization between the selected features and the labels of an estimation problem, using the Hilbert-Schmidt Independence Criterion. The key idea is that good features should be highly dependent on the labels. Our approach leads to a greedy procedure for feature selection. We show that a number of existing feature selectors are special cases of this framework. Experiments on both artificial and real-world data show that our feature selector works well in practice.
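
A toy sketch of HSIC-driven backward elimination is given below; it removes one feature at a time, whereas the paper's procedure removes a fraction of features per step and considers several kernels, and the label kernel, bandwidth, and helper names are illustrative assumptions.

```python
import numpy as np

def rbf_gram(X, sigma=1.0):
    d2 = np.sum((X[:, None, :] - X[None, :, :])**2, axis=2)
    return np.exp(-d2 / (2 * sigma**2))

def hsic_biased(K, L):
    """Biased HSIC estimate tr(K H L H) / (m - 1)^2, with H the centring matrix."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / (m - 1)**2

def backward_hsic_selection(X, y, n_keep):
    """Greedy backward elimination: repeatedly drop the feature whose removal leaves
    the largest HSIC between the remaining features and the labels."""
    L = (y[:, None] == y[None, :]).astype(float)       # simple label kernel
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        scores = [hsic_biased(rbf_gram(X[:, [f for f in kept if f != drop]]), L)
                  for drop in kept]
        kept.pop(int(np.argmax(scores)))               # drop the least useful feature
    return kept

# toy usage: only feature 0 carries the class signal
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=120)
X = rng.normal(size=(120, 5))
X[:, 0] += 2.0 * y
print(backward_hsic_selection(X, y, n_keep=2))          # feature 0 should survive
```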

UAI Conference 2012 Conference Paper

Hilbert Space Embeddings of POMDPs

  • Yu Nishiyama
  • Abdeslam Boularias
  • Arthur Gretton
  • Kenji Fukumizu

A nonparametric approach for policy learning for POMDPs is proposed. The approach represents distributions over the states, observations, and actions as embeddings in feature spaces, which are reproducing kernel Hilbert spaces. Distributions over states given the observations are obtained by applying the kernel Bayes’ rule to these distribution embeddings. Policies and value functions are defined on the feature space over states, which leads to a feature space expression for the Bellman equation. Value iteration may then be used to estimate the optimal value function and associated policy. Experimental results confirm that the correct policy is learned using the feature space representation.

NeurIPS Conference 2012 Conference Paper

Optimal kernel choice for large-scale two-sample tests

  • Arthur Gretton
  • Dino Sejdinovic
  • Heiko Strathmann
  • Sivaraman Balakrishnan
  • Massimiliano Pontil
  • Kenji Fukumizu
  • Bharath Sriperumbudur

Given samples from distributions $p$ and $q$, a two-sample test determines whether to reject the null hypothesis that $p=q$, based on the value of a test statistic measuring the distance between the samples. One choice of test statistic is the maximum mean discrepancy (MMD), which is a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is thus critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability. A means of parameter selection for the two-sample test based on the MMD is proposed. For a given test level (an upper bound on the probability of making a Type I error), the kernel is chosen so as to maximize the test power, and minimize the probability of making a Type II error. The test statistic, test threshold, and optimization over the kernel parameters are obtained with cost linear in the sample size. These properties make the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory. In experiments, the new kernel selection approach yields a more powerful test than earlier kernel selection heuristics.
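
A minimal sketch of the streaming (linear-time) statistic and of a ratio-based bandwidth choice is shown below; the paper's criterion additionally uses held-out data for the selection and a more careful variance estimate, so this is an illustration of the idea rather than the exact procedure, and the helper names are hypothetical.

```python
import numpy as np

def linear_time_mmd(X, Y, sigma):
    """Streaming MMD estimate: average of h over disjoint sample pairs, where
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1).  Returns the estimate and
    the standard deviation of h, both computable in a single pass."""
    def k(a, b):
        return np.exp(-np.sum((a - b)**2, axis=1) / (2 * sigma**2))
    m = (min(len(X), len(Y)) // 2) * 2
    x1, x2, y1, y2 = X[0:m:2], X[1:m:2], Y[0:m:2], Y[1:m:2]
    h = k(x1, x2) + k(y1, y2) - k(x1, y2) - k(x2, y1)
    return h.mean(), h.std(ddof=1)

def choose_bandwidth(X, Y, candidates, lam=1e-4):
    """Pick the bandwidth maximising the power proxy MMD_l / (std_l + lam); the paper
    performs this selection on data held out from the final test, which is omitted here."""
    scores = []
    for s in candidates:
        stat, sd = linear_time_mmd(X, Y, s)
        scores.append(stat / (sd + lam))
    return candidates[int(np.argmax(scores))]

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))
Y = rng.normal(loc=0.5, size=(2000, 2))
print(choose_bandwidth(X, Y, candidates=[0.1, 0.5, 1.0, 2.0, 5.0]))
```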

NeurIPS Conference 2011 Conference Paper

Kernel Bayes' Rule

  • Kenji Fukumizu
  • Le Song
  • Arthur Gretton

A nonparametric kernel-based method for realizing Bayes' rule is proposed, based on kernel representations of probabilities in reproducing kernel Hilbert spaces. The prior and conditional probabilities are expressed as empirical kernel mean and covariance operators, respectively, and the kernel mean of the posterior distribution is computed in the form of a weighted sample. The kernel Bayes' rule can be applied to a wide variety of Bayesian inference problems: we demonstrate Bayesian computation without likelihood, and filtering with a nonparametric state-space model. A consistency rate for the posterior estimate is established.

JMLR Journal 2010 Journal Article

Consistent Nonparametric Tests of Independence

  • Arthur Gretton
  • László Györfi

Three simple and explicit procedures for testing the independence of two multi-dimensional random variables are described. Two of the associated test statistics ($L_1$, log-likelihood) are defined when the empirical distribution of the variables is restricted to finite partitions. A third test statistic is defined as a kernel-based independence measure. Two kinds of tests are provided. Distribution-free strong consistent tests are derived on the basis of large deviation bounds on the test statistics: these tests make almost surely no Type I or Type II error after a random sample size. Asymptotically α-level tests are obtained from the limiting distribution of the test statistics. For the latter tests, the Type I error converges to a fixed non-zero value α, and the Type II error drops to zero, for increasing sample size. All tests reject the null hypothesis of independence if the test statistics become large. The performance of the tests is evaluated experimentally on benchmark data.

JMLR Journal 2010 Journal Article

Hilbert Space Embeddings and Metrics on Probability Measures

  • Bharath K. Sriperumbudur
  • Arthur Gretton
  • Kenji Fukumizu
  • Bernhard Schölkopf
  • Gert R.G. Lanckriet

A Hilbert space embedding for probability measures has recently been proposed, with applications including dimensionality reduction, homogeneity testing, and independence testing. This embedding represents any probability measure as a mean element in a reproducing kernel Hilbert space (RKHS). A pseudometric on the space of probability measures can be defined as the distance between distribution embeddings: we denote this as $\gamma_k$, indexed by the kernel function $k$ that defines the inner product in the RKHS. We present three theoretical properties of $\gamma_k$. First, we consider the question of determining the conditions on the kernel $k$ for which $\gamma_k$ is a metric: such $k$ are denoted characteristic kernels. Unlike pseudometrics, a metric is zero only when two distributions coincide, thus ensuring the RKHS embedding maps all distributions uniquely (i.e., the embedding is injective). While previously published conditions may apply only in restricted circumstances (e.g., on compact domains), and are difficult to check, our conditions are straightforward and intuitive: integrally strictly positive definite kernels are characteristic. Alternatively, if a bounded continuous kernel is translation-invariant on $\mathbb{R}^d$, then it is characteristic if and only if the support of its Fourier transform is the entire $\mathbb{R}^d$. Second, we show that the distance between distributions under $\gamma_k$ results from an interplay between the properties of the kernel and the distributions, by demonstrating that distributions are close in the embedding space when their differences occur at higher frequencies. Third, to understand the nature of the topology induced by $\gamma_k$, we relate $\gamma_k$ to other popular metrics on probability measures, and present conditions on the kernel $k$ under which $\gamma_k$ metrizes the weak topology.

NeurIPS Conference 2009 Conference Paper

A Fast, Consistent Kernel Two-Sample Test

  • Arthur Gretton
  • Kenji Fukumizu
  • Zaïd Harchaoui
  • Bharath Sriperumbudur

A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P=Q) an infinite weighted sum of $\chi^2$ random variables. The main result of the present work is a novel, consistent estimate of this null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q. This estimate may be computed faster than a previous consistent estimate based on the bootstrap. Another prior approach was to compute the null distribution based on fitting a parametric family with the low order moments of the test statistic: unlike the present work, this heuristic has no guarantee of being accurate or consistent. We verify the performance of our null distribution estimate on both an artificial example and on high dimensional multivariate data.

NeurIPS Conference 2009 Conference Paper

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

  • Kenji Fukumizu
  • Arthur Gretton
  • Gert Lanckriet
  • Bernhard Schölkopf
  • Bharath Sriperumbudur

Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.

NeurIPS Conference 2009 Conference Paper

Nonlinear directed acyclic structure learning with weakly additive noise models

  • Arthur Gretton
  • Peter Spirtes
  • Robert Tillman

The recently proposed \emph{additive noise model} has advantages over previous structure learning algorithms, when attempting to recover some true data generating mechanism, since it (i) does not assume linearity or Gaussianity and (ii) can recover a unique DAG rather than an equivalence class. However, its original extension to the multivariate case required enumerating all possible DAGs, and for some special distributions, e.g., linear Gaussian, the model is invertible and thus cannot be used for structure learning. We present a new approach which combines a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. Experiments with synthetic and real data show that this method is more accurate than previous methods when data are nonlinear and/or non-Gaussian.

NeurIPS Conference 2008 Conference Paper

Characteristic Kernels on Groups and Semigroups

  • Kenji Fukumizu
  • Arthur Gretton
  • Bernhard Schölkopf
  • Bharath Sriperumbudur

Embeddings of random variables in reproducing kernel Hilbert spaces (RKHSs) may be used to conduct statistical inference based on higher order moments. For sufficiently rich (characteristic) RKHSs, each probability distribution has a unique embedding, allowing all statistical properties of the distribution to be taken into consideration. Necessary and sufficient conditions for an RKHS to be characteristic exist for $\mathbb{R}^n$. In the present work, conditions are established for an RKHS to be characteristic on groups and semigroups. Illustrative examples are provided, including characteristic kernels on periodic domains, rotation matrices, and $\mathbb{R}^n_+$.

NeurIPS Conference 2008 Conference Paper

Kernel Measures of Independence for non-iid Data

  • Xinhua Zhang
  • Le Song
  • Arthur Gretton
  • Alex Smola

Many machine learning algorithms can be formulated in the framework of statistical independence such as the Hilbert-Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering.

NeurIPS Conference 2008 Conference Paper

Learning Taxonomies by Dependence Maximization

  • Matthew Blaschko
  • Arthur Gretton

We introduce a family of unsupervised algorithms, numerical taxonomy clustering, to simultaneously cluster data, and to learn a taxonomy that encodes the relationship between the clusters. The algorithms work by maximizing the dependence between the taxonomy and the original data. The resulting taxonomy is a more informative visualization of complex data than simple clustering; in addition, taking into account the relations between different clusters is shown to substantially improve the quality of the clustering, when compared with state-of-the-art algorithms in the literature (both spectral clustering and a previous dependence maximization approach). We demonstrate our algorithm on image and text data.

AAAI Conference 2007 Conference Paper

A Kernel Approach to Comparing Distributions

  • Arthur Gretton
  • Malte Rasch

We describe a technique for comparing distributions without the need for density estimation as an intermediate step. Our approach relies on mapping the distributions into a Reproducing Kernel Hilbert Space. We apply this technique to construct a two-sample test, which is used for determining whether two sets of observations arise from the same distribution. We use this test in attribute matching for databases using the Hungarian marriage method, where it performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.

NeurIPS Conference 2007 Conference Paper

A Kernel Statistical Test of Independence

  • Arthur Gretton
  • Kenji Fukumizu
  • Choon Teo
  • Le Song
  • Bernhard Schölkopf
  • Alex Smola

Although kernel measures of independence have been widely applied in machine learning (notably in kernel ICA), there is as yet no method to determine whether they have detected statistically significant dependence. We provide a novel test of the independence hypothesis for one particular kernel independence measure, the Hilbert-Schmidt independence criterion (HSIC). The resulting test costs $O(m^2)$, where m is the sample size. We demonstrate that this test outperforms established contingency table and functional correlation-based tests, and that this advantage is greater for multivariate data. Finally, we show the HSIC test also applies to text (and to structured data more generally), for which no other independence test presently exists.
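
The HSIC statistic itself is short to compute. The sketch below uses a permutation null as a simple stand-in for the null approximation developed in the paper; the RBF bandwidths and the helper names are illustrative assumptions.

```python
import numpy as np

def hsic_biased(X, Y, sigma=1.0):
    """Biased HSIC estimate tr(K H L H) / m^2, with RBF kernels on X and Y."""
    m = len(X)
    def gram(A):
        d2 = np.sum((A[:, None, :] - A[None, :, :])**2, axis=2)
        return np.exp(-d2 / (2 * sigma**2))
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(gram(X) @ H @ gram(Y) @ H) / m**2

def hsic_permutation_test(X, Y, n_perm=200, rng=None):
    """Independence test: compare HSIC(X, Y) with a null obtained by permuting Y.
    (The paper instead works with the null distribution of the O(m^2) statistic
    directly; the permutation null here is only a simple stand-in.)"""
    if rng is None:
        rng = np.random.default_rng()
    observed = hsic_biased(X, Y)
    count = sum(hsic_biased(X, Y[rng.permutation(len(Y))]) >= observed
                for _ in range(n_perm))
    return observed, (count + 1) / (n_perm + 1)

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 1))
Y = np.sin(3.0 * X) + 0.1 * rng.normal(size=(150, 1))   # nonlinear dependence
print(hsic_permutation_test(X, Y, rng=rng))              # small p-value expected
```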

NeurIPS Conference 2007 Conference Paper

Colored Maximum Variance Unfolding

  • Le Song
  • Arthur Gretton
  • Karsten Borgwardt
  • Alex Smola

Maximum variance unfolding (MVU) is an effective heuristic for dimensionality reduction. It produces a low-dimensional representation of the data by maximizing the variance of their embeddings while preserving the local distances of the original data. We show that MVU also optimizes a statistical dependence measure which aims to retain the identity of individual observations under the distance-preserving constraints. This general view allows us to design “colored” variants of MVU, which produce low-dimensional representations for a given task, e.g., subject to class labels or other side information.

NeurIPS Conference 2007 Conference Paper

Kernel Measures of Conditional Dependence

  • Kenji Fukumizu
  • Arthur Gretton
  • Xiaohai Sun
  • Bernhard Schölkopf

We propose a new measure of conditional dependence of random variables, based on normalized cross-covariance operators on reproducing kernel Hilbert spaces. Unlike previous kernel dependence measures, the proposed criterion does not depend on the choice of kernel in the limit of infinite data, for a wide class of kernels. At the same time, it has a straightforward empirical estimate with good convergence behaviour. We discuss the theoretical properties of the measure, and demonstrate its application in experiments.

JMLR Journal 2007 Journal Article

Statistical Consistency of Kernel Canonical Correlation Analysis

  • Kenji Fukumizu
  • Francis R. Bach
  • Arthur Gretton

While kernel canonical correlation analysis (CCA) has been applied in many contexts, the convergence of finite sample estimates of the associated functions to their population counterparts has not yet been established. This paper gives a mathematical proof of the statistical convergence of kernel CCA, providing a theoretical justification for the method. The proof uses covariance operators defined on reproducing kernel Hilbert spaces, and analyzes the convergence of their empirical estimates of finite rank to their population counterparts, which can have infinite rank. The result also gives a sufficient condition for convergence on the regularization coefficient involved in kernel CCA: this should decrease as $n^{-1/3}$, where n is the number of data points. [abs] [ pdf ][ bib ] © JMLR 2007.
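
For reference, the regularized empirical problem whose convergence is analyzed can be sketched as follows (notation illustrative):

    $$
    \max_{f \in \mathcal{H}_X,\ g \in \mathcal{H}_Y}
    \frac{\widehat{\operatorname{Cov}}\,[f(X), g(Y)]}
    {\big(\widehat{\operatorname{Var}}\,[f(X)] + \varepsilon_n \|f\|_{\mathcal{H}_X}^{2}\big)^{1/2}
     \big(\widehat{\operatorname{Var}}\,[g(Y)] + \varepsilon_n \|g\|_{\mathcal{H}_Y}^{2}\big)^{1/2}},
    $$

with the abstract's sufficient condition referring to the decay of the regularization coefficient $\varepsilon_n$ at the rate $n^{-1/3}$.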

NeurIPS Conference 2006 Conference Paper

A Kernel Method for the Two-Sample-Problem

  • Arthur Gretton
  • Karsten Borgwardt
  • Malte Rasch
  • Bernhard Schölkopf
  • Alex Smola

We propose two statistical tests to determine if two samples are from different distributions. Our test statistic is in both cases the distance between the means of the two samples mapped into a reproducing kernel Hilbert space (RKHS). The first test is based on a large deviation bound for the test statistic, while the second is based on the asymptotic distribution of this statistic. The test statistic can be computed in $O(m^2)$ time. We apply our approach to a variety of problems, including attribute matching for databases using the Hungarian marriage method, where our test performs strongly. We also demonstrate excellent performance when comparing distributions over graphs, for which no alternative tests currently exist.
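
A minimal, hedged Python sketch of the quadratic-time statistic described above, i.e. the distance between the two sample means after mapping into an RKHS; the Gaussian kernel and bandwidth are illustrative choices.

    import numpy as np

    def rbf_kernel(A, B, sigma=1.0):
        # Cross Gram matrix between sample sets A and B under a Gaussian kernel.
        d2 = (np.sum(A ** 2, axis=1)[:, None]
              + np.sum(B ** 2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def mmd2_biased(X, Y, sigma=1.0):
        # Biased estimate of the squared MMD, i.e. the squared RKHS distance between
        # the mean embeddings of X and Y, from three Gram blocks in O(m^2) time.
        Kxx = rbf_kernel(X, X, sigma)
        Kyy = rbf_kernel(Y, Y, sigma)
        Kxy = rbf_kernel(X, Y, sigma)
        return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()

A test then compares this statistic to a threshold derived either from a large deviation bound or from the asymptotic null distribution, as in the two tests described in the abstract.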

NeurIPS Conference 2006 Conference Paper

Correcting Sample Selection Bias by Unlabeled Data

  • Jiayuan Huang
  • Arthur Gretton
  • Karsten Borgwardt
  • Bernhard Schölkopf
  • Alex Smola

We consider the scenario where training and test data are drawn from different distributions, commonly referred to as sample selection bias. Most algorithms for this setting try to first recover sampling distributions and then make appropriate corrections based on the distribution estimate. We present a nonparametric method which directly produces resampling weights without distribution estimation. Our method works by matching distributions between training and testing sets in feature space. Experimental results demonstrate that our method works well in practice.
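
A hedged sketch of the reweighting idea described above, matching the weighted training kernel mean to the test kernel mean in feature space; the kernel, the box bound beta_max, the tolerance eps, and the use of scipy's SLSQP solver are illustrative choices rather than details taken from the paper.

    import numpy as np
    from scipy.optimize import minimize

    def rbf(A, B, sigma=1.0):
        # Gaussian kernel matrix between sample sets A and B.
        d2 = (np.sum(A ** 2, axis=1)[:, None]
              + np.sum(B ** 2, axis=1)[None, :]
              - 2.0 * A @ B.T)
        return np.exp(-d2 / (2.0 * sigma ** 2))

    def kernel_mean_matching(X_tr, X_te, sigma=1.0, beta_max=10.0, eps=0.1):
        # Resampling weights beta on training points so that the beta-weighted
        # training mean in feature space approximates the test mean:
        #   minimize 0.5 * beta' K beta - kappa' beta
        #   subject to 0 <= beta_i <= beta_max and sum(beta) close to n_tr.
        n_tr, n_te = len(X_tr), len(X_te)
        K = rbf(X_tr, X_tr, sigma)
        kappa = (n_tr / n_te) * rbf(X_tr, X_te, sigma).sum(axis=1)
        objective = lambda b: 0.5 * b @ K @ b - kappa @ b
        gradient = lambda b: K @ b - kappa
        constraints = [
            {"type": "ineq", "fun": lambda b: n_tr * (1.0 + eps) - b.sum()},
            {"type": "ineq", "fun": lambda b: b.sum() - n_tr * (1.0 - eps)},
        ]
        res = minimize(objective, np.ones(n_tr), jac=gradient, method="SLSQP",
                       bounds=[(0.0, beta_max)] * n_tr, constraints=constraints)
        return res.x  # weights to attach to the training examples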

JMLR Journal 2005 Journal Article

Kernel Methods for Measuring Independence

  • Arthur Gretton
  • Ralf Herbrich
  • Alexander Smola
  • Olivier Bousquet
  • Bernhard Schölkopf

We introduce two new functionals, the constrained covariance and the kernel mutual information, to measure the degree of independence of random variables. These quantities are both based on the covariance between functions of the random variables in reproducing kernel Hilbert spaces (RKHSs). We prove that when the RKHSs are universal, both functionals are zero if and only if the random variables are pairwise independent. We also show that the kernel mutual information is an upper bound near independence on the Parzen window estimate of the mutual information. Analogous results apply for two correlation-based dependence functionals introduced earlier: we show the kernel canonical correlation and the kernel generalised variance to be independence measures for universal kernels, and prove the latter to be an upper bound on the mutual information near independence. The performance of the kernel dependence functionals in measuring independence is verified in the context of independent component analysis. [abs] [ pdf ][ bib ] © JMLR 2005.
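
As a hedged sketch of the first functional named above, the constrained covariance takes the form

    $$
    \mathrm{COCO}(X, Y) \;=\; \sup_{\|f\|_{\mathcal{F}} \le 1,\ \|g\|_{\mathcal{G}} \le 1} \operatorname{Cov}\big[f(X),\, g(Y)\big],
    $$

where the suprema are taken over the unit balls of the two RKHSs; by the result stated in the abstract, this is zero if and only if $X$ and $Y$ are independent when both kernels are universal.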

NeurIPS Conference 2005 Conference Paper

Statistical Convergence of Kernel CCA

  • Kenji Fukumizu
  • Arthur Gretton
  • Francis Bach

While kernel canonical correlation analysis (kernel CCA) has been applied in many problems, the asymptotic convergence of the functions estimated from a finite sample to the true functions has not yet been established. This paper gives a rigorous proof of the statistical convergence of kernel CCA and a related method (NOCCO), which provides a theoretical justification for these methods. The result also gives a sufficient condition on the decay of the regularization coefficient in the methods to ensure convergence.

NeurIPS Conference 2003 Conference Paper

Ranking on Data Manifolds

  • Dengyong Zhou
  • Jason Weston
  • Arthur Gretton
  • Olivier Bousquet
  • Bernhard Schölkopf

The Google search engine has enjoyed huge success with its web page ranking algorithm, which exploits global, rather than local, hyperlink structure of the web using random walks. Here we propose a simple universal ranking algorithm for data lying in a Euclidean space, such as text or image data. The core idea of our method is to rank the data with respect to the intrinsic manifold structure collectively revealed by a large amount of data. Encouraging experimental results on synthetic, image, and text data illustrate the validity of our method.
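
A hedged Python sketch of a manifold-ranking iteration of the kind described above (a Gaussian affinity with zero diagonal, symmetric normalization, then propagation of the query scores); the specific affinity, alpha, and iteration count are illustrative choices rather than details from the paper.

    import numpy as np

    def manifold_ranking(X, query_idx, sigma=1.0, alpha=0.99, n_iter=100):
        # Affinity matrix with zero diagonal (no self-loops).
        sq = np.sum(X ** 2, axis=1)
        d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
        W = np.exp(-d2 / (2.0 * sigma ** 2))
        np.fill_diagonal(W, 0.0)

        # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
        d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
        S = W * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

        # Propagate query scores over the manifold: f <- alpha * S f + (1 - alpha) * y.
        y = np.zeros(len(X))
        y[query_idx] = 1.0
        f = y.copy()
        for _ in range(n_iter):
            f = alpha * S @ f + (1.0 - alpha) * y
        return np.argsort(-f)  # indices ranked by relevance to the query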