Author name cluster

Marco Cuturi

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

60 papers

2 author rows

TMLR Journal 2026 Journal Article

Multivariate Conformal Prediction using Optimal Transport

Michal Klein
Louis Béthune
Eugene Ndiaye
Marco Cuturi

Conformal prediction (CP) provides distribution-free uncertainty quantification by constructing prediction sets whose validity relies on ranking conformity scores. Because ranking requires an ordering, most CP methods use univariate scores; extending them to multivariate settings, where no canonical order for vectors exists, remains challenging. We build on the theory of Monge--Kantorovich quantiles and ranks to propose a geometry-aware scalarization of vector-valued scores: we transport multivariate conformity scores to the spherical uniform distribution on the unit ball via an entropic optimal transport (OT) map and use the transported radius as a scalar score. Standard split conformal calibration then applies directly, preserving finite-sample marginal coverage. The resulting method, OTCP, produces prediction regions that adapt to the empirical geometry of the score distribution, going beyond the ellipsoidal sets imposed by norm-based scalarizations. Across a benchmark of 24 multivariate regression datasets, OTCP improves efficiency and conditional-coverage metrics mainly in low output dimensions ($d \leq 4$), while we also study the computational and statistical trade-offs involved in estimating entropic OT maps.

ICML Conference 2025 Conference Paper

Addressing Misspecification in Simulation-based Inference through Data-driven Calibration

Antoine Wehenkel
Juan L. Gamella
Ozan Sener
Jens Behrmann
Guillermo Sapiro
Jörn-Henrik Jacobsen
Marco Cuturi

Driven by steady progress in deep generative modeling, simulation-based inference (SBI) has emerged as the workhorse for inferring the parameters of stochastic simulators. However, recent work has demonstrated that model misspecification can harm SBI’s reliability, preventing its adoption in important applications where only misspecified simulators are available. This work introduces robust posterior estimation (RoPE), a framework that overcomes model misspecification with a small real-world calibration set of ground truth parameter measurements. We formalize the misspecification gap as the solution of an optimal transport (OT) problem between learned representations of real-world and simulated observations, allowing RoPE to learn a model of the misspecification without placing additional assumptions on its nature. RoPE shows how the calibration set and OT together offer a controllable balance between calibrated uncertainty and informative inference even under severely misspecified simulators. Results on four synthetic tasks and two real-world problems with ground-truth labels demonstrate that RoPE outperforms baselines and consistently returns informative and calibrated credible intervals.

ICLR Conference 2025 Conference Paper

Controlling Language and Diffusion Models by Transporting Activations

Pau Rodríguez
Arno Blaas
Michal Klein
Luca Zappella
Nicholas Apostoloff
Marco Cuturi
Xavier Suau

The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.

ICLR Conference 2025 Conference Paper

Disentangled Representation Learning with the Gromov-Monge Gap

Théo Uscidda
Luca Eyring
Karsten Roth
Fabian J. Theis
Zeynep Akata
Marco Cuturi

Learning disentangled representations from unlabelled data is a fundamental challenge in machine learning. Solving it may unlock other problems, such as generalization, interpretability, or fairness. Although remarkably challenging to solve in theory, disentanglement is often achieved in practice through prior matching. Furthermore, recent works have shown that prior matching approaches can be enhanced by leveraging geometrical considerations, e.g., by learning representations that preserve geometric features of the data, such as distances or angles between points. However, matching the prior while preserving geometric features is challenging, as a mapping that *fully* preserves these features while aligning the data distribution with the prior does not exist in general. To address these challenges, we introduce a novel approach to disentangled representation learning based on quadratic optimal transport. We formulate the problem using Gromov-Monge maps that transport one distribution onto another with minimal distortion of predefined geometric features, preserving them *as much as can be achieved*. To compute such maps, we propose the Gromov-Monge-Gap (GMG), a regularizer quantifying whether a map moves a reference distribution with minimal geometry distortion. We demonstrate the effectiveness of our approach for disentanglement across four standard benchmarks, outperforming other methods leveraging geometric considerations.

NeurIPS Conference 2025 Conference Paper

LinEAS: End-to-end Learning of Activation Steering with a Distributional Loss

Pau Rodriguez
Michal Klein
Eleonora Gualdoni
Valentino Maiorca
Arno Blaas
Luca Zappella
Marco Cuturi
Xavier Suau

The growing use of generative models in daily life calls for efficient mechanisms to control their generation, to e. g. produce safe content or provide users with tools to explore style changes. Ideally, such mechanisms should require low volume of unpaired data (\ie without explicit preference), and should be cheap, both at train and inference time, while preserving output quality. Recent research has shown that such mechanisms can be obtained by intervening exclusively on model activations, with the goal of correcting distributional differences between activations seen when using prompts from a source vs. a target set (e. g. toxic and non-toxic sentences). While cheap, these fast methods are inherently crude: their maps are tuned locally, not accounting for their impact on downstream layers, resulting in interventions that cause unintended shifts when used out-of-sample. We propose in this work linear end-to-end activation steering (LinEAS), an approach trained with a global loss that accounts simultaneously for all layer-wise distributional shifts. In addition to being more robust, the loss used to train LinEAS can be regularized with sparsifying norms, which can automatically carry out neuron selection. LinEAS only requires a handful of unpaired samples to be effective, and beats similar baselines on toxicity mitigation in language models, becoming competitive with oracle-dependent methods that have access to strong supervision. LinEAS is modality-agnostic and we empirically find that it outperforms existing activation steering methods at mitigating and including new concepts at the output of single-step text-to-image generation models.

NeurIPS Conference 2025 Conference Paper

Sample and Map from a Single Convex Potential: Generation using Conjugate Moment Measures

Nina Vesseron
Louis Bethune
Marco Cuturi

The canonical approach in generative modeling is to split model fitting into two blocks: define first how to sample noise (e. g. Gaussian) and choose next what to do with it (e. g. using a single map or flows). We explore in this work an alternative route that ties sampling and mapping. We find inspiration in moment measures, a result that states that for any measure $\rho$, there exists a unique convex potential $u$ such that $\rho=\nabla u \sharp e^{-u}$. While this does seem to tie effectively sampling (from log-concave distribution $e^{-u}$) and action (pushing particles through $\nabla u$), we observe on simple examples (e. g. , Gaussians or 1D distributions) that this choice is ill-suited for practical tasks. We study an alternative factorization, where $\rho$ is factorized as $\nabla w^* \sharp e^{-w}$, where $w^\ast$ is the convex conjugate of a convex potential $w$. We call this approach conjugate moment measures, and show far more intuitive results on these examples. Because $\nabla w^*$ is the Monge map between the log-concave distribution $e^{-w}$ and $\rho$, we rely on optimal transport solvers to propose an algorithm to recover $w$ from samples of $\rho$, and parameterize $w$ as an input-convex neural network. We also address the common sampling scenario in which the density of $\rho$ is known only up to a normalizing constant, and propose an algorithm to learn $w$ in this setting.

ICML Conference 2025 Conference Paper

Scaling Laws for Forgetting during Finetuning with Pretraining Data Injection

Louis Béthune
David Grangier
Dan Busbridge
Eleonora Gualdoni
Marco Cuturi
Pierre Ablin

A widespread strategy to obtain a language model that performs well on a target domain is to finetune a pretrained model to perform unsupervised next-token prediction on data from that target domain. Finetuning presents two challenges: (i) if the amount of target data is limited, as in most practical applications, the model will quickly overfit, and (ii) the model will drift away from the original model, forgetting the pretraining data and the generic knowledge that comes with it. Our goal is to derive scaling laws that quantify these two phenomena for various target domains, amounts of available target data, and model scales. We measure the efficiency of injecting pretraining data into the finetuning data mixture to avoid forgetting and mitigate overfitting. A key practical takeaway from our study is that injecting as little as $1%$ of pretraining data in the finetuning data mixture prevents the model from forgetting the pretraining set.

ICML Conference 2025 Conference Paper

Shielded Diffusion: Generating Novel and Diverse Images using Sparse Repellency

Michael Kirchhof 0002
James Thornton
Louis Béthune
Pierre Ablin
Eugène Ndiaye
Marco Cuturi

The adoption of text-to-image diffusion models raises concerns over reliability, drawing scrutiny under the lens of various metrics like calibration, fairness, or compute efficiency. We focus in this work on two issues that arise when deploying these models: a lack of diversity when prompting images, and a tendency to recreate images from the training set. To solve both problems, we propose a method that coaxes the sampled trajectories of pretrained diffusion models to land on images that fall outside of a reference set. We achieve this by adding repellency terms to the diffusion SDE throughout the generation trajectory, which are triggered whenever the path is expected to land too closely to an image in the shielded reference set. Our method is sparse in the sense that these repellency terms are zero and inactive most of the time, and even more so towards the end of the generation trajectory. Our method, named SPELL for sparse repellency, can be used either with a static reference set that contains protected images, or dynamically, by updating the set at each timestep with the expected images concurrently generated within a batch, and with the images of previously generated batches. We show that adding SPELL to popular diffusion models improves their diversity while impacting their FID only marginally, and performs comparatively better than other recent training-free diversity methods. We also demonstrate how SPELL can ensure a shielded generation away from a very large set of protected images by considering all 1. 2M images from ImageNet as the protected set.

ICLR Conference 2025 Conference Paper

Simple ReFlow: Improved Techniques for Fast Flow Models

Beomsu Kim
Yu-Guan Hsieh
Michal Klein
Marco Cuturi
Jong Chul Ye
Bahjat Kawar
James Thornton

Diffusion and flow-matching models achieve remarkable generative performance but at the cost of many neural function evaluations (NFE), which slows inference and limits applicability to time-critical tasks. The ReFlow procedure can accelerate sampling by straightening generation trajectories. But it is an iterative procedure, typically requiring training on simulated data, and results in reduced sample quality. To mitigate sample deterioration, we examine the design space of ReFlow and highlight potential pitfalls in prior heuristic practices. We then propose seven improvements for training dynamics, learning and inference, which are verified with thorough ablation studies on CIFAR10 $32 \times 32$, AFHQv2 $64 \times 64$, and FFHQ $64 \times 64$. Combining all our techniques, we achieve state-of-the-art FID scores (without / with guidance, resp.) for fast generation via neural ODEs: $2.23$ / $1.98$ on CIFAR10, $2.30$ / $1.91$ on AFHQv2, $2.84$ / $2.67$ on FFHQ, and $3.49$ / $1.74$ on ImageNet-64, all with merely $9$ NFEs.

ICML Conference 2024 Conference Paper

Careful with that Scalpel: Improving Gradient Surgery with an EMA

Yu-Guan Hsieh
James Thornton
Eugène Ndiaye
Michal Klein
Marco Cuturi
Pierre Ablin

Beyond minimizing a single training loss, many deep learning estimation pipelines rely on an auxiliary objective to quantify and encourage desirable properties of the model (e. g. performance on another dataset, robustness, agreement with a prior). Although the simplest approach to incorporating an auxiliary loss is to sum it with the training loss as a regularizer, recent works have shown that one can improve performance by blending the gradients beyond a simple sum; this is known as gradient surgery. We cast the problem as a constrained minimization problem where the auxiliary objective is minimized among the set of minimizers of the training loss. To solve this bilevel problem, we follow a parameter update direction that combines the training loss gradient and the orthogonal projection of the auxiliary gradient to the training gradient. In a setting where gradients come from mini-batches, we explain how, using a moving average of the training loss gradients, we can carefully maintain this critical orthogonality property. We demonstrate that our method, Bloop, can lead to much better performances on NLP and vision experiments than other gradient surgery methods without EMA.

ICML Conference 2024 Conference Paper

Contrasting Multiple Representations with the Multi-Marginal Matching Gap

Zoe Piran
Michal Klein
James Thornton
Marco Cuturi

Learning meaningful representations of complex objects that can be seen through multiple ($k\geq 3$) views or modalities is a core task in machine learning. Existing methods use losses originally intended for paired views, and extend them to $k$ views, either by instantiating $\tfrac12k(k-1)$ loss-pairs, or by using reduced embeddings, following a one vs. average-of-rest strategy. We propose the multi-marginal matching gap (M3G), a loss that borrows tools from multi-marginal optimal transport (MM-OT) theory to simultaneously incorporate all $k$ views. Given a batch of $n$ points, each seen as a $k$-tuple of views subsequently transformed into $k$ embeddings, our loss contrasts the cost of matching these $n$ ground-truth $k$-tuples with the MM-OT polymatching cost, which seeks $n$ optimally arranged $k$-tuples chosen within these $n\times k$ vectors. While the exponential complexity $O(n^k$) of the MM-OT problem may seem daunting, we show in experiments that a suitable generalization of the Sinkhorn algorithm for that problem can scale to, e. g. , $k=3\sim 6$ views using mini-batches of size $64 \sim128$. Our experiments demonstrate improved performance over multiview extensions of pairwise losses, for both self-supervised and multimodal tasks.

NeurIPS Conference 2024 Conference Paper

GENOT: Entropic (Gromov) Wasserstein Flow Matching with Applications to Single-Cell Genomics

Dominik Klein
Théo Uscidda
Fabian Theis
Marco Cuturi

Single-cell genomics has significantly advanced our understanding of cellular behavior, catalyzing innovations in treatments and precision medicine. However, single-cell sequencing technologies are inherently destructive and can only measure a limited array of data modalities simultaneously. This limitation underscoresthe need for new methods capable of realigning cells. Optimal transport (OT)has emerged as a potent solution, but traditional discrete solvers are hampered byscalability, privacy, and out-of-sample estimation issues. These challenges havespurred the development of neural network-based solvers, known as neural OTsolvers, that parameterize OT maps. Yet, these models often lack the flexibilityneeded for broader life science applications. To address these deficiencies, ourapproach learns stochastic maps (i. e. transport plans), allows for any cost function, relaxes mass conservation constraints and integrates quadratic solvers to tackle thecomplex challenges posed by the (Fused) Gromov-Wasserstein problem. Utilizingflow matching as a backbone, our method offers a flexible and effective framework. We demonstrate its versatility and robustness through applications in cell development studies, cellular drug response modeling, and cross-modality cell translation, illustrating significant potential for enhancing therapeutic strategies.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Learning Elastic Costs to Shape Monge Displacements

Michal Klein
Aram-Alexandre Pooladian
Pierre Ablin
Eugène Ndiaye
Jonathan Niles-Weed
Marco Cuturi

Given a source and a target probability measure, the Monge problem studies efficient ways to map the former onto the latter. This efficiency is quantified by defining a *cost* function between source and target data. Such a cost is often set by default in the machine learning literature to the squared-Euclidean distance, $\ell^2\_2(\mathbf{x}, \mathbf{y}): =\tfrac12\|\mathbf{x}-\mathbf{y}\|\_2^2$. The benefits of using *elastic* costs, defined using a regularizer $\tau$ as $c(\mathbf{x}, \mathbf{y}): =\ell^2_2(\mathbf{x}, \mathbf{y})+\tau(\mathbf{x}-\mathbf{y})$, was recently highlighted in (Cuturi et al. 2023). Such costs shape the *displacements* of Monge maps $T$, namely the difference between a source point and its image $T(\mathbf{x})-\mathbf{x}$, by giving them a structure that matches that of the proximal operator of $\tau$. In this work, we make two important contributions to the study of elastic costs: *(i)* For any elastic cost, we propose a numerical method to compute Monge maps that are provably optimal. This provides a much-needed routine to create synthetic problems where the ground-truth OT map is known, by analogy to the Brenier theorem, which states that the gradient of any convex potential is always a valid Monge map for the $\ell_2^2$ cost; *(ii)* We propose a loss to *learn* the parameter $\theta$ of a parameterized regularizer $\tau_\theta$, and apply it in the case where $\tau_{A}({\bf z}): =\|A^\perp {\bf z}\|^2_2$. This regularizer promotes displacements that lie on a low-dimensional subspace of $\mathbb{R}^d$, spanned by the $p$ rows of $A\in\mathbb{R}^{p\times d}$. We illustrate the soundness of our procedure on synthetic data, generated using our first contribution, in which we show near-perfect recovery of $A$'s subspace using only samples. We demonstrate the applicability of this method by showing predictive improvements on single-cell data tasks.

PDF Details DOI

ICML Conference 2024 Conference Paper

On a Neural Implementation of Brenier's Polar Factorization

Nina Vesseron
Marco Cuturi

In 1991, Brenier proved a theorem that generalizes the polar decomposition for square matrices – factored as PSD $\times$ unitary – to any vector field $F: \mathbb{R}^d\rightarrow \mathbb{R}^d$. The theorem, known as the polar factorization theorem, states that any field $F$ can be recovered as the composition of the gradient of a convex function $u$ with a measure-preserving map $M$, namely $F=\nabla u \circ M$. We propose a practical implementation of this far-reaching theoretical result, and explore possible uses within machine learning. The theorem is closely related to optimal transport (OT) theory, and we borrow from recent advances in the field of neural optimal transport to parameterize the potential $u$ as an input convex neural network. The map $M$ can be either evaluated pointwise using $u^*$, the convex conjugate of $u$, through the identity $M=\nabla u^* \circ F$, or learned as an auxiliary network. Because $M$ is, in general, not injective, we consider the additional task of estimating the ill-posed inverse map that can approximate the pre-image measure $M^{-1}$ using a stochastic generator. We illustrate possible applications of Brenier’s polar factorization to non-convex optimization problems, as well as sampling of densities that are not log-concave.

NeurIPS Conference 2024 Conference Paper

Progressive Entropic Optimal Transport Solvers

Parnian Kassraie
Aram-Alexandre Pooladian
Michal Klein
James Thornton
Jonathan Niles-Weed
Marco Cuturi

Optimal transport (OT) has profoundly impacted machine learning by providing theoretical and computational tools to realign datasets. In this context, given two large point clouds of sizes $n$ and $m$ in $\mathbb{R}^d$, entropic OT (EOT) solvers have emerged as the most reliable tool to either solve the Kantorovich problem and output a $n\times m$ coupling matrix, or to solve the Monge problem and learn a vector-valued push-forward map. While the robustness of EOT couplings/maps makes them a go-to choice in practical applications, EOT solvers remain difficult to tune because of a small but influential set of hyperparameters, notably the omnipresent entropic regularization strength $\varepsilon$. Setting $\varepsilon$ can be difficult, as it simultaneously impacts various performance metrics, such as compute speed, statistical performance, generalization, and bias. In this work, we propose a new class of EOT solvers (ProgOT), that can estimate both plans and transport maps. We take advantage of several opportunities to optimize the computation of EOT solutions by *dividing* mass displacement using a time discretization, borrowing inspiration from dynamic OT formulations, and *conquering* each of these steps using EOT with properly scheduled parameters. We provide experimental evidence demonstrating that ProgOT is a faster and more robust alternative to *standard solvers* when computing couplings at large scales, even outperforming neural network-based approaches. We also prove statistical consistency of our approach for estimating OT maps.

PDF Details DOI

ICML Conference 2023 Conference Paper

Monge, Bregman and Occam: Interpretable Optimal Transport in High-Dimensions with Feature-Sparse Maps

Marco Cuturi
Michal Klein
Pierre Ablin

Optimal transport (OT) theory focuses, among all maps $T: \mathbb{R}^d\rightarrow \mathbb{R}^d$ that can morph a probability measure $\mu$ onto another $\nu$, on those that are the “thriftiest”, i. e. such that the average cost $c(x, T(x))$ between $x$ and its image $T(x)$ is as small as possible. Many computational approaches have been proposed to estimate such Monge maps when $c$ is the squared-Euclidean distance, e. g. , using entropic maps [Pooladian+2021], or input convex neural networks [Makkuva+2020, Korotin+2020]. We propose a new research direction, that leverages a specific translation invariant cost $c(x, y): =h(x-y)$ inspired by the elastic net. Here, $h: =\tfrac{1}{2}\|\cdot\|_2^2+\tau(\cdot)$, where $\tau$ is a convex function. We highlight a surprising link tying together a generalized entropic map for $h$, Bregman centroids induced by $h$, and the proximal operator of $\tau$. We show how setting $\tau$ to be a sparsity-inducing norm results in the first application of Occam ’s razor to transport. These maps yield, mechanically, displacement vectors $\Delta(x): = T(x)-x$ that are sparse, with sparsity patterns that vary depending on $x$. We showcase the ability of our method to estimate meaningful OT maps for high-dimensional single-cell transcription data. We use our methods in the $34000$-d space of gene counts for cells, without using a prior dimensionality reduction, thus retaining the ability to interpret all displacements at the gene level.

ICML Conference 2023 Conference Paper

The Monge Gap: A Regularizer to Learn All Transport Maps

Théo Uscidda
Marco Cuturi

Optimal transport (OT) theory has been used in machine learning to study and characterize maps that can push-forward efficiently a probability measure onto another. Recent works have drawn inspiration from Brenier’s theorem, which states that when the ground cost is the squared-Euclidean distance, the “best” map to morph a continuous measure in $\mathcal{P}(\mathbb{R}^d)$ into another must be the gradient of a convex function. To exploit that result, Makkuva et. al (2020); Korotin et. al (2020) consider maps $T=\nabla f_\theta$, where $f_\theta$ is an input convex neural network (ICNN), as defined by Amos et. al (2017), and fit $\theta$ with SGD using samples. Despite their mathematical elegance, fitting OT maps with ICNNs raises many challenges, due notably to the many constraints imposed on $\theta$; the need to approximate the conjugate of $f_\theta$; or the limitation that they only work for the squared-Euclidean cost. More generally, we question the relevance of using Brenier’s result, which only applies to densities, to constrain the architecture of candidate maps fitted on samples. Motivated by these limitations, we propose a radically different approach to estimating OT maps: Given a cost $c$ and a reference measure $\rho$, we introduce a regularizer, the Monge gap $\mathcal{M}^c_{\rho}(T)$ of a map $T$. That gap quantifies how far a map $T$ deviates from the ideal properties we expect from a $c$-OT map. In practice, we drop all architecture requirements for $T$ and simply minimize a distance (e. g. , the Sinkhorn divergence) between $T\sharp\mu$ and $\nu$, regularized by $\mathcal{M}^c_\rho(T)$. We study $\mathcal{M}^c_{\rho}$ and show how our simple pipeline significantly outperforms other baselines in practice.

NeurIPS Conference 2023 Conference Paper

Unbalanced Low-rank Optimal Transport Solvers

Meyer Scetbon
Michal Klein
Giovanni Palla
Marco Cuturi

The relevance of optimal transport methods to machine learning has long been hindered by two salient limitations. First, the $O(n^3)$ computational cost of standard sample-based solvers (when used on batches of $n$ samples) is prohibitive. Second, the mass conservation constraint makes OT solvers too rigid in practice: because they must match \textit{all} points from both measures, their output can be heavily influenced by outliers. A flurry of recent works in OT has addressed these computational and modelling limitations, but has resulted in two separate strains of methods: While the computational outlook was much improved by entropic regularization, more recent $O(n)$ linear-time \textit{low-rank} solvers hold the promise to scale up OT further. On the other hand, modelling rigidities have been eased owing to unbalanced variants of OT, that rely on penalization terms to promote, rather than impose, mass conservation. The goal of this paper is to merge these two strains, to achieve the promise of \textit{both} versatile/scalable unbalanced/low-rank OT solvers. We propose custom algorithms to implement these extensions for the linear OT problem and its Fused-Gromov-Wasserstein generalization, and demonstrate their practical relevance to challenging spatial transcriptomics matching problems.

ICML Conference 2022 Conference Paper

Debiaser Beware: Pitfalls of Centering Regularized Transport Maps

Aram-Alexandre Pooladian
Marco Cuturi
Jonathan Niles-Weed

Estimating optimal transport (OT) maps (a. k. a. Monge maps) between two measures P and Q is a problem fraught with computational and statistical challenges. A promising approach lies in using the dual potential functions obtained when solving an entropy-regularized OT problem between samples P_n and Q_n, which can be used to recover an approximately optimal map. The negentropy penalization in that scheme introduces, however, an estimation bias that grows with the regularization strength. A well-known remedy to debias such estimates, which has gained wide popularity among practitioners of regularized OT, is to center them, by subtracting auxiliary problems involving P_n and itself, as well as Q_n and itself. We do prove that, under favorable conditions on P and Q, debiasing can yield better approximations to the Monge map. However, and perhaps surprisingly, we present a few cases in which debiasing is provably detrimental in a statistical sense, notably when the regularization strength is large or the number of samples is small. These claims are validated experimentally on synthetic and real datasets, and should reopen the debate on whether debiasing is needed when using entropic OT.

NeurIPS Conference 2022 Conference Paper

Efficient and Modular Implicit Differentiation

Mathieu Blondel
Quentin Berthet
Marco Cuturi
Roy Frostig
Stephan Hoyer
Felipe Llinares-Lopez
Fabian Pedregosa
Jean-Philippe Vert

Automatic differentiation (autodiff) has revolutionized machine learning. Itallows to express complex computations by composing elementary ones in creativeways and removes the burden of computing their derivatives by hand. Morerecently, differentiation of optimization problem solutions has attractedwidespread attention with applications such as optimization layers, and inbi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation remained difficult to use forpractitioners, as it often required case-by-case tedious mathematicalderivations and implementations. In this paper, we proposeautomatic implicit differentiation, an efficientand modular approach for implicit differentiation of optimization problems. Inour approach, the user defines directly in Python a function $F$ capturing theoptimality conditions of the problem to be differentiated. Once this is done, weleverage autodiff of $F$ and the implicit function theorem to automaticallydifferentiate the optimization problem. Our approach thus combines the benefitsof implicit differentiation and autodiff. It is efficient as it can be added ontop of any state-of-the-art solver and modular as the optimality conditionspecification is decoupled from the implicit differentiation mechanism. We showthat seemingly simple principles allow to recover many existing implicitdifferentiation methods and create new ones easily. We demonstrate the ease offormulating and solving bi-level optimization problems using our framework. Wealso showcase an application to the sensitivity analysis of molecular dynamics.

ICML Conference 2022 Conference Paper

Linear-Time Gromov Wasserstein Distances using Low Rank Couplings and Costs

Meyer Scetbon
Gabriel Peyré
Marco Cuturi

The ability to align points across two related yet incomparable point clouds (e. g. living in different spaces) plays an important role in machine learning. The Gromov-Wasserstein (GW) framework provides an increasingly popular answer to such problems, by seeking a low-distortion, geometry-preserving assignment between these points. As a non-convex, quadratic generalization of optimal transport (OT), GW is NP-hard. While practitioners often resort to solving GW approximately as a nested sequence of entropy-regularized OT problems, the cubic complexity (in the number $n$ of samples) of that approach is a roadblock. We show in this work how a recent variant of the OT problem that restricts the set of admissible couplings to those having a low-rank factorization is remarkably well suited to the resolution of GW: when applied to GW, we show that this approach is not only able to compute a stationary point of the GW problem in time $O(n^2)$, but also uniquely positioned to benefit from the knowledge that the initial cost matrices are low-rank, to yield a linear time $O(n)$ GW approximation. Our approach yields similar results, yet orders of magnitude faster computation than the SoTA entropic GW approaches, on both simulated and real data.

NeurIPS Conference 2022 Conference Paper

Low-rank Optimal Transport: Approximation, Statistics and Debiasing

Meyer Scetbon
Marco Cuturi

The matching principles behind optimal transport (OT) play an increasingly important role in machine learning, a trend which can be observed when OT is used to disambiguate datasets in applications (e. g. single-cell genomics) or used to improve more complex methods (e. g. balanced attention in transformers or self-supervised learning). To scale to more challenging problems, there is a growing consensus that OT requires solvers that can operate on millions, not thousands, of points. The low-rank optimal transport (LOT) approach advocated in \cite{scetbon2021lowrank} holds several promises in that regard, and was shown to complement more established entropic regularization approaches, being able to insert itself in more complex pipelines, such as quadratic OT. LOT restricts the search for low-cost couplings to those that have a low-nonnegative rank, yielding linear time algorithms in cases of interest. However, these promises can only be fulfilled if the LOT approach is seen as a legitimate contender to entropic regularization when compared on properties of interest, where the scorecard typically includes theoretical properties (statistical complexity and relation to other methods) or practical aspects (debiasing, hyperparameter tuning, initialization). We target each of these areas in this paper in order to cement the impact of low-rank approaches in computational OT.

JMLR Journal 2022 Journal Article

On the Complexity of Approximating Multimarginal Optimal Transport

Tianyi Lin
Nhat Ho
Marco Cuturi
Michael I. Jordan

We study the complexity of approximating the multimarginal optimal transport (MOT) distance, a generalization of the classical optimal transport distance, considered here between $m$ discrete probability distributions supported each on $n$ support points. First, we show that the standard linear programming (LP) representation of the MOT problem is not a minimum-cost flow problem when $m \geq 3$. This negative result implies that some combinatorial algorithms, e.g., network simplex method, are not suitable for approximating the MOT problem, while the worst-case complexity bound for the deterministic interior-point algorithm remains a quantity of $\tilde{\mathcal{O}}(n^{3m})$. We then propose two simple and deterministic algorithms for approximating the MOT problem. The first algorithm, which we refer to as multimarginal Sinkhorn algorithm, is a provably efficient multimarginal generalization of the Sinkhorn algorithm. We show that it achieves a complexity bound of $\tilde{\mathcal{O}}(m^3n^m\varepsilon^{-2})$ for a tolerance $\varepsilon \in (0, 1)$. This provides a first near-linear time complexity bound guarantee for approximating the MOT problem and matches the best known complexity bound for the Sinkhorn algorithm in the classical OT setting when $m = 2$. The second algorithm, which we refer to as accelerated multimarginal Sinkhorn algorithm, achieves the acceleration by incorporating an estimate sequence and the complexity bound is $\tilde{\mathcal{O}}(m^3n^{m+1/3}\varepsilon^{-4/3})$. This bound is better than that of the first algorithm in terms of $1/\varepsilon$, and accelerated alternating minimization algorithm (Tupitsa et al., 2020) in terms of $n$. Finally, we compare our new algorithms with the commercial LP solver Gurobi. Preliminary results on synthetic data and real images demonstrate the effectiveness and efficiency of our algorithms. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

NeurIPS Conference 2022 Conference Paper

Supervised Training of Conditional Monge Maps

Charlotte Bunne
Andreas Krause
Marco Cuturi

Optimal transport (OT) theory describes general principles to define and select, among many possible choices, the most efficient way to map a probability measure onto another. That theory has been mostly used to estimate, given a pair of source and target probability measures $(\mu, \nu)$, a parameterized map $T_\theta$ that can efficiently map $\mu$ onto $\nu$. In many applications, such as predicting cell responses to treatments, pairs of input/output data measures $(\mu, \nu)$ that define optimal transport problems do not arise in isolation but are associated with a context $c$, as for instance a treatment when comparing populations of untreated and treated cells. To account for that context in OT estimation, we introduce CondOT, a multi-task approach to estimate a family of OT maps conditioned on a context variable, using several pairs of measures $(\mu_i, \nu_i)$ tagged with a context label $c_i$. CondOT learns a global map $\mathcal{T}_{\theta}$ conditioned on context that is not only expected to fit all labeled pairs in the dataset $\{(c_i, (\mu_i, \nu_i))\}$, i. e. , $\mathcal{T}_{\theta}(c_i) \sharp\mu_i \approx \nu_i$, but should also generalize to produce meaningful maps $\mathcal{T}_{\theta}(c_{\text{new}})$ when conditioned on unseen contexts $c_{\text{new}}$. Our approach harnesses and provides a novel usage for partially input convex neural networks, for which we introduce a robust and efficient initialization strategy inspired by Gaussian approximations. We demonstrate the ability of CondOT to infer the effect of an arbitrary combination of genetic or therapeutic perturbations on single cells, using only observations of the effects of said perturbations separately.

ICML Conference 2021 Conference Paper

Low-Rank Sinkhorn Factorization

Meyer Scetbon
Marco Cuturi
Gabriel Peyré

Several recent applications of optimal transport (OT) theory to machine learning have relied on regularization, notably entropy and the Sinkhorn algorithm. Because matrix-vector products are pervasive in the Sinkhorn algorithm, several works have proposed to \textit{approximate} kernel matrices appearing in its iterations using low-rank factors. Another route lies instead in imposing low-nonnegative rank constraints on the feasible set of couplings considered in OT problems, with no approximations on cost nor kernel matrices. This route was first explored by \citet{forrow2018statistical}, who proposed an algorithm tailored for the squared Euclidean ground cost, using a proxy objective that can be solved through the machinery of regularized 2-Wasserstein barycenters. Building on this, we introduce in this work a generic approach that aims at solving, in full generality, the OT problem under low-nonnegative rank constraints with arbitrary costs. Our algorithm relies on an explicit factorization of low-rank couplings as a product of \textit{sub-coupling} factors linked by a common marginal; similar to an NMF approach, we alternatively updates these factors. We prove the non-asymptotic stationary convergence of this algorithm and illustrate its efficiency on benchmark experiments.

ICML Conference 2020 Conference Paper

Debiased Sinkhorn barycenters

Hicham Janati
Marco Cuturi
Alexandre Gramfort

Entropy regularization in optimal transport (OT) has been the driver of many recent interests for Wasserstein metrics and barycenters in machine learning. It allows to keep the appealing geometrical properties of the unregularized Wasserstein distance while having a significantly lower complexity thanks to Sinkhorn’s algorithm. However, entropy brings some inherent smoothing bias, resulting for example in blurred barycenters. This side effect has prompted an increasing temptation in the community to settle for a slower algorithm such as log-domain stabilized Sinkhorn which breaks the parallel structure that can be leveraged on GPUs, or even go back to unregularized OT. Here we show how this bias is tightly linked to the reference measure that defines the entropy regularizer and propose debiased Sinkhorn barycenters that preserve the best of worlds: fast Sinkhorn-like iterations without entropy smoothing. Theoretically, we prove that this debiasing is perfect for Gaussian distributions with equal variance. Empirically, we illustrate the reduced blurring and the computational advantage.

NeurIPS Conference 2020 Conference Paper

Entropic Optimal Transport between Unbalanced Gaussian Measures has a Closed Form

Hicham Janati
Boris Muzellec
Gabriel Peyré
Marco Cuturi

Although optimal transport (OT) problems admit closed form solutions in a very few notable cases, e. g. in 1D or between Gaussians, these closed forms have proved extremely fecund for practitioners to define tools inspired from the OT geometry. On the other hand, the numerical resolution of OT problems using entropic regularization has given rise to many applications, but because there are no known closed-form solutions for entropic regularized OT problems, these approaches are mostly algorithmic, not informed by elegant closed forms. In this paper, we propose to fill the void at the intersection between these two schools of thought in OT by proving that the entropy-regularized optimal transport problem between two Gaussian measures admits a closed form. Contrary to the unregularized case, for which the explicit form is given by the Wasserstein-Bures distance, the closed form we obtain is differentiable everywhere, even for Gaussians with degenerate covariance matrices. We obtain this closed form solution by solving the fixed-point equation behind Sinkhorn's algorithm, the default method for computing entropic regularized OT. Remarkably, this approach extends to the generalized unbalanced case --- where Gaussian measures are scaled by positive constants. This extension leads to a closed form expression for unbalanced Gaussians as well, and highlights the mass transportation / destruction trade-off seen in unbalanced optimal transport. Moreover, in both settings, we show that the optimal transportation plans are (scaled) Gaussians and provide analytical formulas of their parameters. These formulas constitute the first non-trivial closed forms for entropy-regularized optimal transport, thus providing a ground truth for the analysis of entropic OT and Sinkhorn's algorithm.

NeurIPS Conference 2020 Conference Paper

Fixed-Support Wasserstein Barycenters: Computational Hardness and Fast Algorithm

Tianyi Lin
Nhat Ho
Xi Chen
Marco Cuturi
Michael Jordan

We study the fixed-support Wasserstein barycenter problem (FS-WBP), which consists in computing the Wasserstein barycenter of $m$ discrete probability measures supported on a finite metric space of size $n$. We show first that the constraint matrix arising from the standard linear programming (LP) representation of the FS-WBP is \textit{not totally unimodular} when $m \geq 3$ and $n \geq 3$. This result resolves an open question pertaining to the relationship between the FS-WBP and the minimum-cost flow (MCF) problem since it proves that the FS-WBP in the standard LP form is not an MCF problem when $m \geq 3$ and $n \geq 3$. We also develop a provably fast \textit{deterministic} variant of the celebrated iterative Bregman projection (IBP) algorithm, named \textsc{FastIBP}, with a complexity bound of $\tilde{O}(mn^{7/3}\varepsilon^{-4/3})$, where $\varepsilon \in (0, 1)$ is the desired tolerance. This complexity bound is better than the best known complexity bound of $\tilde{O}(mn^2\varepsilon^{-2})$ for the IBP algorithm in terms of $\varepsilon$, and that of $\tilde{O}(mn^{5/2}\varepsilon^{-1})$ from accelerated alternating minimization algorithm or accelerated primal-dual adaptive gradient algorithm in terms of $n$. Finally, we conduct extensive experiments with both synthetic data and real images and demonstrate the favorable performance of the \textsc{FastIBP} algorithm in practice.

NeurIPS Conference 2020 Conference Paper

Learning with Differentiable Pertubed Optimizers

Quentin Berthet
Mathieu Blondel
Olivier Teboul
Marco Cuturi
Jean-Philippe Vert
Francis Bach

Machine learning pipelines often rely on optimizers procedures to make discrete decisions (e. g. , sorting, picking closest neighbors, or shortest paths). Although these discrete decisions are easily computed in a forward manner, they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform optimizers into operations that are differentiable and never locally constant. Our approach relies on stochastically perturbed optimizers, and can be used readily within existing solvers. Their derivatives can be evaluated efficiently, and smoothness tuned via the chosen noise amplitude. We also show how this framework can be connected to a family of losses developed in structured prediction, and give theoretical guarantees for their use in learning tasks. We demonstrate experimentally the performance of our approach on various tasks.

NeurIPS Conference 2020 Conference Paper

Linear Time Sinkhorn Divergences using Positive Features

Meyer Scetbon
Marco Cuturi

Although Sinkhorn divergences are now routinely used in data sciences to compare probability distributions, the computational effort required to compute them remains expensive, growing in general quadratically in the size $n$ of the support of these distributions. Indeed, solving optimal transport (OT) with an entropic regularization requires computing a $n\times n$ kernel matrix (the neg-exponential of a $n\times n$ pairwise ground cost matrix) that is repeatedly applied to a vector. We propose to use instead ground costs of the form $c(x, y)=-\log\dotp{\varphi(x)}{\varphi(y)}$ where $\varphi$ is a map from the ground space onto the positive orthant $\RR^r_+$, with $r\ll n$. This choice yields, equivalently, a kernel $k(x, y)=\dotp{\varphi(x)}{\varphi(y)}$, and ensures that the cost of Sinkhorn iterations scales as $O(nr)$. We show that usual cost functions can be approximated using this form. Additionaly, we take advantage of the fact that our approach yields approximation that remain fully differentiable with respect to input distributions, as opposed to previously proposed adaptive low-rank approximations of the kernel matrix, to train a faster variant of OT-GAN~\cite{salimans2018improving}.

ICML Conference 2020 Conference Paper

Missing Data Imputation using Optimal Transport

Boris Muzellec
Julie Josse
Claire Boyer
Marco Cuturi

Missing data is a crucial issue when applying machine learning algorithms to real-world datasets. Starting from the simple assumption that two batches extracted randomly from the same dataset should share the same distribution, we leverage optimal transport distances to quantify that criterion and turn it into a loss function to impute missing data values. We propose practical methods to minimize these losses using end-to-end learning, that can exploit or not parametric assumptions on the underlying distributions of values. We evaluate our methods on datasets from the UCI repository, in MCAR, MAR and MNAR settings. These experiments show that OT-based methods match or out-perform state-of-the-art imputation methods, even for high percentages of missing values.

YNIMG Journal 2020 Journal Article

Multi-subject MEG/EEG source imaging with sparse multi-task regression

Hicham Janati
Thomas Bazeille
Bertrand Thirion
Marco Cuturi
Alexandre Gramfort

Magnetoencephalography and electroencephalography (M/EEG) are non-invasive modalities that measure the weak electromagnetic fields generated by neural activity. Estimating the location and magnitude of the current sources that generated these electromagnetic fields is an inverse problem. Although it can be cast as a linear regression, this problem is severely ill-posed as the number of observations, which equals the number of sensors, is small. When considering a group study, a common approach consists in carrying out the regression tasks independently for each subject using techniques such as MNE or sLORETA. An alternative is to jointly localize sources for all subjects taken together, while enforcing some similarity between them. By pooling S subjects in a single joint regression, the number of observations is S times larger, potentially making the problem better posed and offering the ability to identify more sources with greater precision. Here we show how the coupling of the different regression problems can be done through a multi-task regularization that promotes focal source estimates. To take into account intersubject variabilities, we propose the Minimum Wasserstein Estimates (MWE). Thanks to a new joint regression method based on optimal transport (OT) metrics, MWE does not enforce perfect overlap of activation foci for all subjects but rather promotes spatial proximity on the cortical mantle. Besides, by estimating the noise level of each subject, MWE copes with the subject-specific signal-to-noise ratios with only one regularization parameter. On realistic simulations, MWE decreases the localization error by up to 4 mm per source compared to individual solutions. Experiments on the Cam-CAN dataset show improvements in spatial specificity in population imaging compared to individual models such as dSPM as well as a state-of-the-art Bayesian group level model. Our analysis of a multimodal dataset shows how multi-subject source localization reduces the gap between MEG and fMRI for brain mapping.

NeurIPS Conference 2020 Conference Paper

Projection Robust Wasserstein Distance and Riemannian Optimization

Tianyi Lin
Chenyou Fan
Nhat Ho
Marco Cuturi
Michael Jordan

Projection robust Wasserstein (PRW) distance, or Wasserstein projection pursuit (WPP), is a robust variant of the Wasserstein distance. Recent work suggests that this quantity is more robust than the standard Wasserstein distance, in particular when comparing probability measures in high-dimensions. However, it is ruled out for practical application because the optimization model is essentially non-convex and non-smooth which makes the computation intractable. Our contribution in this paper is to revisit the original motivation behind WPP/PRW, but take the hard route of showing that, despite its non-convexity and lack of nonsmoothness, and even despite some hardness results proved by~\citet{Niles-2019-Estimation} in a minimax sense, the original formulation for PRW/WPP \textit{can} be efficiently computed in practice using Riemannian optimization, yielding in relevant cases better behavior than its convex relaxation. More specifically, we provide three simple algorithms with solid theoretical guarantee on their complexity bound (one in the appendix), and demonstrate their effectiveness and efficiency by conducing extensive experiments on synthetic and real data. This paper provides a first step into a computational theory of the PRW distance and provides the links between optimal transport and Riemannian optimization.

ICML Conference 2020 Conference Paper

Regularized Optimal Transport is Ground Cost Adversarial

François-Pierre Paty
Marco Cuturi

Regularizing the optimal transport (OT) problem has proven crucial for OT theory to impact the field of machine learning. For instance, it is known that regularizing OT problems with entropy leads to faster computations and better differentiation using the Sinkhorn algorithm, as well as better sample complexity bounds than classic OT. In this work we depart from this practical perspective and propose a new interpretation of regularization as a robust mechanism, and show using Fenchel duality that any convex regularization of OT can be interpreted as ground cost adversarial. This incidentally gives access to a robust dissimilarity measure on the ground space, which can in turn be used in other applications. We propose algorithms to compute this robust cost, and illustrate the interest of this approach empirically.

ICML Conference 2020 Conference Paper

Supervised Quantile Normalization for Low Rank Matrix Factorization

Marco Cuturi
Olivier Teboul
Jonathan Niles-Weed
Jean-Philippe Vert

Low rank matrix factorization is a fundamental building block in machine learning, used for instance to summarize gene expression profile data or word-document counts. To be robust to outliers and differences in scale across features, a matrix factorization step is usually preceded by ad-hoc feature normalization steps, such as tf-idf scaling or data whitening. We propose in this work to learn these normalization operators jointly with the factorization itself. More precisely, given a $d\times n$ matrix $X$ of $d$ features measured on $n$ individuals, we propose to learn the parameters of quantile normalization operators that can operate row-wise on the values of $X$ and/or of its factorization $UV$ to improve the quality of the low-rank representation of $X$ itself. This optimization is facilitated by the introduction of a new differentiable quantile normalization operator built using optimal transport, providing new results on top of existing work by Cuturi et al. (2019). We demonstrate the applicability of these techniques on synthetic and genomics datasets.

NeurIPS Conference 2019 Conference Paper

Differentiable Ranking and Sorting using Optimal Transport

Marco Cuturi
Olivier Teboul
Jean-Philippe Vert

Sorting is used pervasively in machine learning, either to define elementary algorithms, such as $k$-nearest neighbors ($k$-NN) rules, or to define test-time metrics, such as top-$k$ classification accuracy or ranking losses. Sorting is however a poor match for the end-to-end, automatically differentiable pipelines of deep learning. Indeed, sorting procedures output two vectors, neither of which is differentiable: the vector of sorted values is piecewise linear, while the sorting permutation itself (or its inverse, the vector of ranks) has no differentiable properties to speak of, since it is integer-valued. We propose in this paper to replace the usual \texttt{sort} procedure with a differentiable proxy. Our proxy builds upon the fact that sorting can be seen as an optimal assignment problem, one in which the $n$ values to be sorted are matched to an \emph{auxiliary} probability measure supported on any \emph{increasing} family of $n$ target values. From this observation, we propose extended rank and sort operators by considering optimal transport (OT) problems (the natural relaxation for assignments) where the auxiliary measure can be any weighted measure supported on $m$ increasing values, where $m \ne n$. We recover differentiable operators by regularizing these OT problems with an entropic penalty, and solve them by applying Sinkhorn iterations. Using these smoothed rank and sort operators, we propose differentiable proxies for the classification 0/1 loss as well as for the quantile regression loss.

ICML Conference 2019 Conference Paper

Stochastic Deep Networks

Gwendoline de Bie
Gabriel Peyré
Marco Cuturi

Machine learning is increasingly targeting areas where input data cannot be accurately described by a single vector, but can be modeled instead using the more flexible concept of random vectors, namely probability measures or more simply point clouds of varying cardinality. Using deep architectures on measures poses, however, many challenging issues. Indeed, deep architectures are originally designed to handle fixed-length vectors, or, using recursive mechanisms, ordered sequences thereof. In sharp contrast, measures describe a varying number of weighted observations with no particular order. We propose in this work a deep framework designed to handle crucial aspects of measures, namely permutation invariances, variations in weights and cardinality. Architectures derived from this pipeline can (i) map measures to measures - using the concept of push-forward operators; (ii) bridge the gap between measures and Euclidean spaces - through integration steps. This allows to design discriminative networks (to classify or reduce the dimensionality of input measures), generative architectures (to synthesize measures) and recurrent pipelines (to predict measure dynamics). We provide a theoretical analysis of these building blocks, review our architectures’ approximation abilities and robustness w. r. t. perturbation, and try them on various discriminative and generative tasks.

NeurIPS Conference 2019 Conference Paper

Subspace Detours: Building Transport Plans that are Optimal on Subspace Projections

Boris Muzellec
Marco Cuturi

Computing optimal transport (OT) between measures in high dimensions is doomed by the curse of dimensionality. A popular approach to avoid this curse is to project input measures on lower-dimensional subspaces (1D lines in the case of sliced Wasserstein distances), solve the OT problem between these reduced measures, and settle for the Wasserstein distance between these reductions, rather than that between the original measures. This approach is however difficult to extend to the case in which one wants to compute an OT map (a Monge map) between the original measures. Since computations are carried out on lower-dimensional projections, classical map estimation techniques can only produce maps operating in these reduced dimensions. We propose in this work two methods to extrapolate, from an transport map that is optimal on a subspace, one that is nearly optimal in the entire space. We prove that the best optimal transport plan that takes such "subspace detours" is a generalization of the Knothe-Rosenblatt transport. We show that these plans can be explicitly formulated when comparing Gaussian measures (between which the Wasserstein distance is commonly referred to as the Bures or Fréchet distance). We provide an algorithm to select optimal subspaces given pairs of Gaussian measures, and study scenarios in which that mediating subspace can be selected using prior information. We consider applications to semantic mediation between elliptic word embeddings and domain adaptation with Gaussian mixture models.

ICML Conference 2019 Conference Paper

Subspace Robust Wasserstein Distances

François-Pierre Paty
Marco Cuturi

Making sense of Wasserstein distances between discrete measures in high-dimensional settings remains a challenge. Recent work has advocated a two-step approach to improve robustness and facilitate the computation of optimal transport, using for instance projections on random real lines, or a preliminary quantization of the measures to reduce the size of their support. We propose in this work a “max-min” robust variant of the Wasserstein distance by considering the maximal possible distance that can be realized between two measures, assuming they can be projected orthogonally on a lower k-dimensional subspace. Alternatively, we show that the corresponding “min-max” OT problem has a tight convex relaxation which can be cast as that of finding an optimal transport plan with a low transportation cost, where the cost is alternatively defined as the sum of the k largest eigenvalues of the second order moment matrix of the displacements (or matchings) corresponding to that plan (the usual OT definition only considers the trace of that matrix). We show that both quantities inherit several favorable properties from the OT geometry. We propose two algorithms to compute the latter formulation using entropic regularization, and illustrate the interest of this approach empirically.

NeurIPS Conference 2019 Conference Paper

Tree-Sliced Variants of Wasserstein Distances

Tam Le
Makoto Yamada
Kenji Fukumizu
Marco Cuturi

Optimal transport (\OT) theory defines a powerful set of tools to compare probability distributions. \OT~suffers however from a few drawbacks, computational and statistical, which have encouraged the proposal of several regularized variants of OT in the recent literature, one of the most notable being the \textit{sliced} formulation, which exploits the closed-form formula between univariate distributions by projecting high-dimensional measures onto random lines. We consider in this work a more general family of ground metrics, namely \textit{tree metrics}, which also yield fast closed-form computations and negative definite, and of which the sliced-Wasserstein distance is a particular case (the tree is a chain). We propose the tree-sliced Wasserstein distance, computed by averaging the Wasserstein distance between these measures using random tree metrics, built adaptively in either low or high-dimensional spaces. Exploiting the negative definiteness of that distance, we also propose a positive definite kernel, and test it against other baselines on a few benchmark tasks.

NeurIPS Conference 2018 Conference Paper

Generalizing Point Embeddings using the Wasserstein Space of Elliptical Distributions

Boris Muzellec
Marco Cuturi

Embedding complex objects as vectors in low dimensional spaces is a longstanding problem in machine learning. We propose in this work an extension of that approach, which consists in embedding objects as elliptical probability distributions, namely distributions whose densities have elliptical level sets. We endow these measures with the 2-Wasserstein metric, with two important benefits: (i) For such measures, the squared 2-Wasserstein metric has a closed form, equal to a weighted sum of the squared Euclidean distance between means and the squared Bures metric between covariance matrices. The latter is a Riemannian metric between positive semi-definite matrices, which turns out to be Euclidean on a suitable factor representation of such matrices, which is valid on the entire geodesic between these matrices. (ii) The 2-Wasserstein distance boils down to the usual Euclidean metric when comparing Diracs, and therefore provides a natural framework to extend point embeddings. We show that for these reasons Wasserstein elliptical embeddings are more intuitive and yield tools that are better behaved numerically than the alternative choice of Gaussian embeddings with the Kullback-Leibler divergence. In particular, and unlike previous work based on the KL geometry, we learn elliptical distributions that are not necessarily diagonal. We demonstrate the advantages of elliptical embeddings by using them for visualization, to compute embeddings of words, and to reflect entailment or hypernymy.

NeurIPS Conference 2018 Conference Paper

Large Scale computation of Means and Clusters for Persistence Diagrams using Optimal Transport

Theo Lacombe
Marco Cuturi
Steve OUDOT

Persistence diagrams (PDs) are now routinely used to summarize the underlying topology of complex data. Despite several appealing properties, incorporating PDs in learning pipelines can be challenging because their natural geometry is not Hilbertian. Indeed, this was recently exemplified in a string of papers which show that the simple task of averaging a few PDs can be computationally prohibitive. We propose in this article a tractable framework to carry out standard tasks on PDs at scale, notably evaluating distances, estimating barycenters and performing clustering. This framework builds upon a reformulation of PD metrics as optimal transport (OT) problems. Doing so, we can exploit recent computational advances: the OT problem on a planar grid, when regularized with entropy, is convex can be solved in linear time using the Sinkhorn algorithm and convolutions. This results in scalable computations that can stream on GPUs. We demonstrate the efficiency of our approach by carrying out clustering with diagrams metrics on several thousands of PDs, a scale never seen before in the literature.

ICML Conference 2017 Conference Paper

Sliced Wasserstein Kernel for Persistence Diagrams

Mathieu Carrière
Marco Cuturi
Steve Y. Oudot

Persistence diagrams (PDs) play a key role in topological data analysis (TDA), in which they are routinely used to describe succinctly complex topological properties of complicated shapes. PDs enjoy strong stability properties and have proven their utility in various learning contexts. They do not, however, live in a space naturally endowed with a Hilbert structure and are usually compared with specific distances, such as the bottleneck distance. To incorporate PDs in a learning pipeline, several kernels have been proposed for PDs with a strong emphasis on the stability of the RKHS distance w. r. t. perturbations of the PDs. In this article, we use the Sliced Wasserstein approximation of the Wasserstein distance to define a new kernel for PDs, which is not only provably stable but also provably discriminative w. r. t. the Wasserstein distance $W^1_\infty$ between PDs. We also demonstrate its practicality, by developing an approximation technique to reduce kernel computation time, and show that our proposal compares favorably to existing kernels for PDs on several benchmarks.

ICML Conference 2017 Conference Paper

Soft-DTW: a Differentiable Loss Function for Time-Series

Marco Cuturi
Mathieu Blondel

We propose in this paper a differentiable learning loss between time series, building upon the celebrated dynamic time warping (DTW) discrepancy. Unlike the Euclidean distance, DTW can compare time series of variable size and is robust to shifts or dilatations across the time dimension. To compute DTW, one typically solves a minimal-cost alignment problem between two time series using dynamic programming. Our work takes advantage of a smoothed formulation of DTW, called soft-DTW, that computes the soft-minimum of all alignment costs. We show in this paper that soft-DTW is a differentiable loss function, and that both its value and gradient can be computed with quadratic time/space complexity (DTW has quadratic time but linear space complexity). We show that this regularization is particularly well suited to average and cluster time series under the DTW geometry, a task for which our proposal significantly outperforms existing baselines (Petitjean et al. , 2011). Next, we propose to tune the parameters of a machine that outputs time series by minimizing its fit with ground-truth labels in a soft-DTW sense. Source code is available at https: //github. com/mblondel/soft-dtw

ICML Conference 2016 Conference Paper

Gromov-Wasserstein Averaging of Kernel and Distance Matrices

Gabriel Peyré
Marco Cuturi
Justin Solomon 0001

This paper presents a new technique for computing the barycenter of a set of distance or kernel matrices. These matrices, which define the inter-relationships between points sampled from individual domains, are not required to have the same size or to be in row-by-row correspondence. We compare these matrices using the softassign criterion, which measures the minimum distortion induced by a probabilistic map from the rows of one similarity matrix to the rows of another; this criterion amounts to a regularized version of the Gromov-Wasserstein (GW) distance between metric-measure spaces. The barycenter is then defined as a Fréchet mean of the input matrices with respect to this criterion, minimizing a weighted sum of softassign values. We provide a fast iterative algorithm for the resulting nonconvex optimization problem, built upon state-of- the-art tools for regularized optimal transportation. We demonstrate its application to the computation of shape barycenters and to the prediction of energy levels from molecular configurations in quantum chemistry.

NeurIPS Conference 2016 Conference Paper

Stochastic Optimization for Large-scale Optimal Transport

Aude Genevay
Marco Cuturi
Gabriel Peyré
Francis Bach

Optimal transport (OT) defines a powerful framework to compare probability distributions in a geometrically faithful way. However, the practical impact of OT is still limited because of its computational burden. We propose a new class of stochastic optimization algorithms to cope with large-scale problems routinely encountered in machine learning applications. These methods are able to manipulate arbitrary distributions (either discrete or continuous) by simply requiring to be able to draw samples from them, which is the typical setup in high-dimensional learning problems. This alleviates the need to discretize these densities, while giving access to provably convergent methods that output the correct distance without discretization error. These algorithms rely on two main ideas: (a) the dual OT problem can be re-cast as the maximization of an expectation; (b) entropic regularization of the primal OT problem results in a smooth dual optimization optimization which can be addressed with algorithms that have a provably faster convergence. We instantiate these ideas in three different computational setups: (i) when comparing a discrete distribution to another, we show that incremental stochastic optimization schemes can beat the current state of the art finite dimensional OT solver (Sinkhorn's algorithm); (ii) when comparing a discrete distribution to a continuous density, a re-formulation (semi-discrete) of the dual program is amenable to averaged stochastic gradient descent, leading to better performance than approximately solving the problem by discretization; (iii) when dealing with two continuous densities, we propose a stochastic gradient descent over a reproducing kernel Hilbert space (RKHS). This is currently the only known method to solve this problem, and is more efficient than discretizing beforehand the two densities. We backup these claims on a set of discrete, semi-discrete and continuous benchmark problems.

NeurIPS Conference 2016 Conference Paper

Wasserstein Training of Restricted Boltzmann Machines

Grégoire Montavon
Klaus-Robert Müller
Marco Cuturi

Boltzmann machines are able to learn highly complex, multimodal, structured and multiscale real-world data distributions. Parameters of the model are usually learned by minimizing the Kullback-Leibler (KL) divergence from training samples to the learned model. We propose in this work a novel approach for Boltzmann machine training which assumes that a meaningful metric between observations is known. This metric between observations can then be used to define the Wasserstein distance between the distribution induced by the Boltzmann machine on the one hand, and that given by the training sample on the other hand. We derive a gradient of that distance with respect to the model parameters. Minimization of this new objective leads to generative models with different statistical properties. We demonstrate their practical potential on data completion and denoising, for which the metric between observations plays a crucial role.

NeurIPS Conference 2015 Conference Paper

Principal Geodesic Analysis for Probability Measures under the Optimal Transport Metric

Vivien Seguy
Marco Cuturi

We consider in this work the space of probability measures $P(X)$ on a Hilbert space $X$ endowed with the 2-Wasserstein metric. Given a finite family of probability measures in $P(X)$, we propose an iterative approach to compute geodesic principal components that summarize efficiently that dataset. The 2-Wasserstein metric provides $P(X)$ with a Riemannian structure and associated concepts (Fr\'echet mean, geodesics, tangent vectors) which prove crucial to follow the intuitive approach laid out by standard principal component analysis. To make our approach feasible, we propose to use an alternative parameterization of geodesics proposed by \citet[\S 9. 2]{ambrosio2006gradient}. These \textit{generalized} geodesics are parameterized with two velocity fields defined on the support of the Wasserstein mean of the data, each pointing towards an ending point of the generalized geodesic. The resulting optimization problem of finding principal components is solved by adapting a projected gradient descend method. Experiment results show the ability of the computed principal components to capture axes of variability on histograms and probability measures data.

ICML Conference 2015 Conference Paper

Unsupervised Riemannian Metric Learning for Histograms Using Aitchison Transformations

Tam Le
Marco Cuturi

Many applications in machine learning handle bags of features or histograms rather than simple vectors. In that context, defining a proper geometry to compare histograms can be crucial for many machine learning algorithms. While one might be tempted to use a default metric such as the Euclidean metric, empirical evidence shows this may not be the best choice when dealing with observations that lie in the probability simplex. Additionally, it might be desirable to choose a metric adaptively based on data. We consider in this paper the problem of learning a Riemannian metric on the simplex given unlabeled histogram data. We follow the approach of Lebanon(2006), who proposed to estimate such a metric within a parametric family by maximizing the inverse volume of a given data set of points under that metric. The metrics we consider on the multinomial simplex are pull-back metrics of the Fisher information parameterized by operations within the simplex known as Aitchison(1982) transformations. We propose an algorithmic approach to maximize inverse volumes using sampling and contrastive divergences. We provide experimental evidence that the metric obtained under our proposal outperforms alternative approaches.

ICML Conference 2014 Conference Paper

Fast Computation of Wasserstein Barycenters

Marco Cuturi
Arnaud Doucet

We present new algorithms to compute the mean of a set of $N$ empirical probability measures under the optimal transport metric. This mean, known as the Wasserstein barycenter (Agueh and Carlier, 2011; Rabin et al, 2012), is the measure that minimizes the sum of its Wasserstein distances to each element in that set. We argue through a simple example that Wasserstein barycenters have appealing properties that differentiate them from other barycenters proposed recently, which all build on kernel smoothing and/or Bregman divergences. Two original algorithms are proposed that require the repeated computation of primal and dual optimal solutions of transport problems. However direct implementation of these algorithms is too costly as optimal transports are notoriously computationally expensive. Extending the work of Cuturi (2013), we smooth both the primal and dual of the optimal transport problem to recover fast approximations of the primal and dual optimal solutions. We apply these algorithms to the visualization of perturbed images and to a clustering problem.

JMLR Journal 2014 Journal Article

Ground Metric Learning

Marco Cuturi
David Avis

Optimal transport distances have been used for more than a decade in machine learning to compare histograms of features. They have one parameter: the ground metric, which can be any metric between the features themselves. As is the case for all parameterized distances, optimal transport distances can only prove useful in practice when this parameter is carefully chosen. To date, the only option available to practitioners to set the ground metric parameter was to rely on a priori knowledge of the features, which limited considerably the scope of application of optimal transport distances. We propose to lift this limitation and consider instead algorithms that can learn the ground metric using only a training set of labeled histograms. We call this approach ground metric learning. We formulate the problem of learning the ground metric as the minimization of the difference of two convex polyhedral functions over a convex set of metric matrices. We follow the presentation of our algorithms with promising experimental results which show that this approach is useful both for retrieval and binary/multiclass classification tasks. [abs] [ pdf ][ bib ] &copy JMLR 2014. ( edit, beta )

ICML Conference 2013 Conference Paper

Mean Reversion with a Variance Threshold

Marco Cuturi
Alexandre d'Aspremont

Starting from a multivariate data set, we study several techniques to isolate affine combinations of the variables with a maximum amount of mean reversion, while constraining the variance to be larger than a given threshold. We show that many of the optimization problems arising in this context can be solved exactly using semidefinite programming and some variant of the \mathcalS-lemma. In finance, these methods are used to isolate statistical arbitrage opportunities, i. e. mean reverting portfolios with enough variance to overcome market friction. In a more general setting, mean reversion and its generalizations are also used as a proxy for stationarity, while variance simply measures signal strength.

NeurIPS Conference 2013 Conference Paper

Sinkhorn Distances: Lightspeed Computation of Optimal Transport

Marco Cuturi

Optimal transportation distances are a fundamental family of parameterized distances for histograms in the probability simplex. Despite their appealing theoretical properties, excellent performance and intuitive formulation, their computation involves the resolution of a linear program whose cost is prohibitive whenever the histograms' dimension exceeds a few hundreds. We propose in this work a new family of optimal transportation distances that look at transportation problems from a maximum-entropy perspective. We smooth the classical optimal transportation problem with an entropic regularization term, and show that the resulting optimum is also a distance which can be computed through Sinkhorn's matrix scaling algorithm at a speed that is several orders of magnitude faster than that of transportation solvers. We also report improved performance on the MNIST benchmark problem over competing distances.

ICML Conference 2011 Conference Paper

Fast Global Alignment Kernels

Marco Cuturi

ICML Conference 2011 Conference Paper

Mapping kernels for trees

Kilho Shin 0001
Marco Cuturi
Tetsuji Kuboyama

NeurIPS Conference 2009 Conference Paper

White Functionals for Anomaly Detection in Dynamical Systems

Marco Cuturi
Jean-Philippe Vert
Alexandre d'Aspremont

We propose new methodologies to detect anomalies in discrete-time processes taking values in a set. The method is based on the inference of functionals whose evaluations on successive states visited by the process have low autocorrelations. Deviations from this behavior are used to flag anomalies. The candidate functionals are estimated in a subset of a reproducing kernel Hilbert space associated with the set where the process takes values. We provide experimental results which show that these techniques compare favorably with other algorithms.

IJCAI Conference 2007 Conference Paper

Marco Cuturi

For two integral histograms r=(r 1, .. ., r d ) and c=(c 1, .. ., c d ) of equal sum N, the Monge-Kantorovich distance d MK (r, c) between r and c parameterized by a d-times-d distance matrix T is the minimum of all costs <F, T> taken over matrices F of the transportation polytope U(r, c). Recent results suggest that this distance is not negative definite, and hence, through Schoenberg's well-known result, exp(-t d MK ) may not be a positive definite kernel for all t > 0. Rather than using directly d MK to define a similarity between r and c, we propose in this paper to investigate kernels on r and c based on the whole transportation polytope U(r, c). We prove that when r and c have binary counts, which is equivalent to stating that r and c represent clouds of points of equal size, the permanent of an adequate Gram matrix induced by the distance matrix T is a positive definite kernel under favorable conditions on T. We also show that the volume of the polytope U(r, c), that is the number of integral transportation plans, is a positive definite quantity in r and c through the Robinson-Schensted-Knuth correspondence between transportation matrices and Young Tableaux. We follow by proposing a family of positive definite kernels related to the generating function of the polytope through recent results obtained separately by A. Barvinok on the one hand, and C. Berg and A. J. Duran on the other hand. We finally present preliminary results led on a subset of the MNIST database to compare clouds of points through the permanent kernel.

NeurIPS Conference 2006 Conference Paper

Kernels on Structured Objects Through Nested Histograms

Marco Cuturi
Kenji Fukumizu

We propose a family of kernels for structured objects which is based on the bag-ofcomponents paradigm. However, rather than decomposing each complex object into the single histogram of its components, we use for each object a family of nested histograms, where each histogram in this hierarchy describes the object seen from an increasingly granular perspective. We use this hierarchy of histograms to define elementary kernels which can detect coarse and fine similarities between the objects. We compute through an efficient averaging trick a mixture of such specific kernels, to propose a final kernel value which weights efficiently local and global matches. We propose experimental results on an image retrieval experiment which show that this mixture is an effective template procedure to be used with kernels on histograms.

JMLR Journal 2005 Journal Article

Semigroup Kernels on Measures

Marco Cuturi
Kenji Fukumizu
Jean-Philippe Vert

We present a family of positive definite kernels on measures, characterized by the fact that the value of the kernel between two measures is a function of their sum. These kernels can be used to derive kernels on structured objects, such as images and texts, by representing these objects as sets of components, such as pixels or words, or more generally as measures on the space of components. Several kernels studied in this work make use of common quantities defined on measures such as entropy or generalized variance to detect similarities. Given an a priori kernel on the space of components itself, the approach is further extended by restating the previous results in a more efficient and flexible framework using the "kernel trick". Finally, a constructive approach to such positive definite kernels through an integral representation theorem is proved, before presenting experimental results on a benchmark experiment of handwritten digits classification to illustrate the validity of the approach. [abs] [ pdf ][ bib ] &copy JMLR 2005. ( edit, beta )

NeurIPS Conference 2004 Conference Paper

Semigroup Kernels on Finite Sets

Marco Cuturi
Jean-Philippe Vert

Complex objects can often be conveniently represented by finite sets of simpler components, such as images by sets of patches or texts by bags of words. We study the class of positive definite (p. d. ) kernels for two such objects that can be expressed as a function of the merger of their respective sets of components. We prove a general integral representa- tion of such kernels and present two particular examples. One of them leads to a kernel for sets of points living in a space endowed itself with a positive definite kernel. We provide experimental results on a benchmark experiment of handwritten digits image classification which illustrate the validity of the approach.