Author name cluster

Mathieu Blondel

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

30 papers

2 author rows

ICML Conference 2025 Conference Paper

Joint Learning of Energy-based Models and their Partition Function

Michael Eli Sander
Vincent Roulet
Tianlin Liu
Mathieu Blondel

Energy-based models (EBMs) offer a flexible framework for parameterizing probability distributions using neural networks. However, learning EBMs by exact maximum likelihood estimation (MLE) is generally intractable, due to the need to compute the partition function. In this paper, we propose a novel min-min formulation for approximately learning probabilistic EBMs in combinatorially-large discrete spaces, such as sets or permutations. Our key idea is to jointly learn both an energy model and its log-partition, parameterized as a neural network. Our approach not only provides a novel tractable objective criterion to learn EBMs by stochastic gradient descent (without relying on MCMC), but also a novel means to estimate the log-partition function on unseen data points. On the theoretical side, we show that our approach recovers the optimal MLE solution when optimizing in the space of continuous functions. Furthermore, we show that our approach naturally extends to the broader family of Fenchel-Young losses, allowing us to obtain the first tractable method for optimizing the sparsemax loss in combinatorially-large spaces. We demonstrate our approach on multilabel classification and label ranking.

ICML Conference 2025 Conference Paper

Loss Functions and Operators Generated by f-Divergences

Vincent Roulet
Tianlin Liu
Nino Vieillard
Michael Eli Sander
Mathieu Blondel

The logistic loss (a. k. a. cross-entropy loss) is one of the most popular loss functions used for multiclass classification. It is also the loss function of choice for next-token prediction in language modeling. It is associated with the Kullback-Leibler (KL) divergence and the softargmax operator. In this work, we propose to construct new convex loss functions based on $f$-divergences. Our loss functions generalize the logistic loss in two directions: i) by replacing the KL divergence with $f$-divergences and ii) by allowing non-uniform reference measures. We instantiate our framework for numerous $f$-divergences, recovering existing losses and creating new ones. By analogy with the logistic loss, the loss function generated by an $f$-divergence is associated with an operator, that we dub $f$-softargmax. We derive a novel parallelizable bisection algorithm for computing the $f$-softargmax associated with any $f$-divergence. On the empirical side, one of the goals of this paper is to determine the effectiveness of loss functions beyond the classical cross-entropy in a language model setting, including on pre-training, post-training (SFT) and distillation. We show that the loss function generated by the $\alpha$-divergence (which is equivalent to Tsallis $\alpha$-negentropy in the case of unit reference measures) with $\alpha=1. 5$ performs well across several tasks.

ICML Conference 2025 Conference Paper

On Teacher Hacking in Language Model Distillation

Daniil Tiapkin
Daniele Calandriello
Johan Ferret
Sarah Perrin
Nino Vieillard
Alexandre Ramé
Mathieu Blondel

Post-training of language models (LMs) increasingly relies on the following two stages: (i) knowledge distillation, where the LM is trained to imitate a larger teacher LM, and (ii) reinforcement learning from human feedback (RLHF), where the LM is aligned by optimizing a reward model. In the second RLHF stage, a well-known challenge is reward hacking, where the LM over-optimizes the reward model, leading to degraded performance on the true objective, in line with Goodhart’s law. In this paper, we investigate whether a similar phenomenon, that we call teacher hacking, can occur during knowledge distillation. This could arise because the teacher LM is itself an imperfect approximation of the true distribution. To study this, we propose a controlled experimental setup involving: (i) an oracle LM representing the ground-truth distribution, (ii) a teacher LM distilled from the oracle, and (iii) a student LM distilled from the teacher. Our experiments reveal the following insights. When using a fixed offline dataset for distillation, teacher hacking occurs; moreover, we can detect it by observing when the optimization process deviates from polynomial convergence laws. In contrast, employing online data generation techniques effectively mitigates teacher hacking. More precisely, we identify data diversity as the key factor in preventing hacking. Overall, our findings provide a deeper understanding of the benefits and limitations of distillation for building robust LMs.

ICML Conference 2024 Conference Paper

Decoding-time Realignment of Language Models

Tianlin Liu
Shangmin Guo
Leonardo Bianco
Daniele Calandriello
Quentin Berthet
Felipe Llinares-López
Jessica Hoffmann
Lucas Dixon

Aligning language models with human preferences is crucial for reducing errors and biases in these models. Alignment techniques, such as reinforcement learning from human feedback (RLHF), are typically cast as optimizing a tradeoff between human preference rewards and a proximity regularization term that encourages staying close to the unaligned model. Selecting an appropriate level of regularization is critical: insufficient regularization can lead to reduced model capabilities due to reward hacking, whereas excessive regularization hinders alignment. Traditional methods for finding the optimal regularization level require retraining multiple models with varying regularization strengths. This process, however, is resource-intensive, especially for large models. To address this challenge, we propose decoding-time realignment (DeRa), a simple method to explore and evaluate different regularization strengths in aligned models without retraining. DeRa enables control over the degree of alignment, allowing users to smoothly transition between unaligned and aligned models. It also enhances the efficiency of hyperparameter tuning by enabling the identification of effective regularization strengths using a validation dataset.

ICML Conference 2024 Conference Paper

How do Transformers Perform In-Context Autoregressive Learning?

Michael Eli Sander
Raja Giryes
Taiji Suzuki
Mathieu Blondel
Gabriel Peyré

Transformers have achieved state-of-the-art performance in language modeling tasks. However, the reasons behind their tremendous success are still unclear. In this paper, towards a better understanding, we train a Transformer model on a simple next token prediction task, where sequences are generated as a first-order autoregressive process $s_{t+1} = W s_t$. We show how a trained Transformer predicts the next token by first learning $W$ in-context, then applying a prediction mapping. We call the resulting procedure in-context autoregressive learning. More precisely, focusing on commuting orthogonal matrices $W$, we first show that a trained one-layer linear Transformer implements one step of gradient descent for the minimization of an inner objective function, when considering augmented tokens. When the tokens are not augmented, we characterize the global minima of a one-layer diagonal linear multi-head Transformer. Importantly, we exhibit orthogonality between heads and show that positional encoding captures trigonometric relations in the data. On the experimental side, we consider the general case of non-commuting orthogonal matrices and generalize our theoretical findings.

NeurIPS Conference 2024 Conference Paper

Learning with Fitzpatrick Losses

Seta Rakotomandimby
Jean-Philippe Chancelier
Michel De Lara
Mathieu Blondel

Fenchel-Young losses are a family of loss functions, encompassing the squared, logistic and sparsemax losses, among others. They are convex w. r. t. the modeloutput and the target, separately. Each Fenchel-Young loss is implicitly associatedwith a link function, that maps model outputs to predictions. For instance, thelogistic loss is associated with the soft argmax link function. Can we build newloss functions associated with the same link function as Fenchel-Young losses? In this paper, we introduce Fitzpatrick losses, a new family of separately convexloss functions based on the Fitzpatrick function. A well-known theoretical tool inmaximal monotone operator theory, the Fitzpatrick function naturally leads to arefined Fenchel-Young inequality, making Fitzpatrick losses tighter than Fenchel-Young losses, while maintaining the same link function for prediction. As anexample, we introduce the Fitzpatrick logistic loss and the Fitzpatrick sparsemaxloss, counterparts of the logistic and the sparsemax losses. This yields two newtighter losses associated with the soft argmax and the sparse argmax, two of themost ubiquitous output layers used in machine learning. We study in details theproperties of Fitzpatrick losses and, in particular, we show that they can be seen asFenchel-Young losses using a modified, target-dependent generating function. Wedemonstrate the effectiveness of Fitzpatrick losses for label proportion estimation.

PDF Details DOI

TMLR Journal 2024 Journal Article

Routers in Vision Mixture of Experts: An Empirical Study

Tianlin Liu
Mathieu Blondel
Carlos Riquelme Ruiz
Joan Puigcerver

Mixture-of-Experts (MoE) models are a promising way to scale up model capacity without significantly increasing computational cost. A key component of MoEs is the router, which decides which subset of parameters (experts) process which feature embeddings (tokens). In this paper, we present a comprehensive study of routers in MoEs for computer vision tasks. We introduce a unified MoE formulation that subsumes different MoEs with two parametric routing tensors. This formulation covers both sparse MoE, which uses a binary or hard assignment between experts and tokens, and soft MoE, which uses a soft assignment between experts and weighted combinations of tokens. Routers for sparse MoEs can be further grouped into two variants: Token Choice, which matches experts to each token, and Expert Choice, which matches tokens to each expert. We conduct head-to-head experiments with 6 different routers, including existing routers from prior work and new ones we introduce. We show that (i) many routers originally developed for language modeling can be adapted to perform strongly in vision tasks, (ii) in sparse MoE, Expert Choice routers generally outperform Token Choice routers, and (iii) soft MoEs generally outperform sparse MoEs with a fixed compute budget. These results provide new insights regarding the crucial role of routers in vision MoE models.

NeurIPS Conference 2024 Conference Paper

Stepping on the Edge: Curvature Aware Learning Rate Tuners

Vincent Roulet
Atish Agarwala
Jean-Bastien Grill
Grzegorz Swirszcz
Mathieu Blondel
Fabian Pedregosa

Curvature information -- particularly, the largest eigenvalue of the lossHessian, known as the sharpness -- often forms the basis for learning ratetuners. However, recent work has shown that the curvature information undergoescomplex dynamics during training, going from a phase of increasing sharpness toeventual stabilization. We analyze the closed-loop feedback effect betweenlearning rate tuning and curvature. We find that classical learning rate tunersmay yield greater one-step loss reduction, yet they ultimately underperform inthe long term when compared to constant learning rates in the full batch regime. These models break the stabilization of the sharpness, which we explain using asimplified model of the joint dynamics of the learning rate and the curvature. To further investigate these effects, we introduce a new learning rate tuningmethod, Curvature Dynamics Aware Tuning (CDAT), which prioritizes long termcurvature stabilization over instantaneous progress on the objective. In thefull batch regime, CDAT shows behavior akin to prefixed warm-up schedules on deeplearning objectives, outperforming tuned constant learning rates. In the minibatch regime, we observe that stochasticity introduces confounding effects thatexplain the previous success of some learning rate tuners at appropriate batchsizes. Our findings highlight the critical role of understanding the jointdynamics of the learning rate and curvature, beyond greedy minimization, todiagnose failures and design effective adaptive learning rate tuners.

PDF Details DOI

ICML Conference 2023 Conference Paper

Fast, Differentiable and Sparse Top-k: a Convex Analysis Perspective

Michael Eli Sander
Joan Puigcerver
Josip Djolonga
Gabriel Peyré
Mathieu Blondel

The top-$k$ operator returns a $k$-sparse vector, where the non-zero values correspond to the $k$ largest values of the input. Unfortunately, because it is a discontinuous function, it is difficult to incorporate in neural networks trained end-to-end with backpropagation. Recent works have considered differentiable relaxations, based either on regularization or perturbation techniques. However, to date, no approach is fully differentiable and sparse. In this paper, we propose new differentiable and sparse top-$k$ operators. We view the top-$k$ operator as a linear program over the permutahedron, the convex hull of permutations. We then introduce a $p$-norm regularization term to smooth out the operator, and show that its computation can be reduced to isotonic optimization. Our framework is significantly more general than the existing one and allows for example to express top-$k$ operators that select values in magnitude. On the algorithmic side, in addition to pool adjacent violator (PAV) algorithms, we propose a new GPU/TPU-friendly Dykstra algorithm to solve isotonic optimization problems. We successfully use our operators to prune weights in neural networks, to fine-tune vision transformers, and as a router in sparse mixture of experts.

ICLR Conference 2023 Conference Paper

Sparsity-Constrained Optimal Transport

Tianlin Liu
Joan Puigcerver
Mathieu Blondel

Regularized optimal transport (OT) is now increasingly used as a loss or as a matching layer in neural networks. Entropy-regularized OT can be computed using the Sinkhorn algorithm but it leads to fully-dense transportation plans, meaning that all sources are (fractionally) matched with all targets. To address this issue, several works have investigated quadratic regularization instead. This regularization preserves sparsity and leads to unconstrained and smooth (semi) dual objectives, that can be solved with off-the-shelf gradient methods. Unfortunately, quadratic regularization does not give direct control over the cardinality (number of nonzeros) of the transportation plan. We propose in this paper a new approach for OT with explicit cardinality constraints on the transportation plan. Our work is motivated by an application to sparse mixture of experts, where OT can be used to match input tokens such as image patches with expert models such as neural networks. Cardinality constraints ensure that at most $k$ tokens are matched with an expert, which is crucial for computational performance reasons. Despite the nonconvexity of cardinality constraints, we show that the corresponding (semi) dual problems are tractable and can be solved with first-order gradient methods. Our method can be thought as a middle ground between unregularized OT (recovered in the limit case $k=1$) and quadratically-regularized OT (recovered when $k$ is large enough). The smoothness of the objectives increases as $k$ increases, giving rise to a trade-off between convergence speed and sparsity of the optimal plan.

NeurIPS Conference 2022 Conference Paper

Efficient and Modular Implicit Differentiation

Mathieu Blondel
Quentin Berthet
Marco Cuturi
Roy Frostig
Stephan Hoyer
Felipe Llinares-Lopez
Fabian Pedregosa
Jean-Philippe Vert

Automatic differentiation (autodiff) has revolutionized machine learning. Itallows to express complex computations by composing elementary ones in creativeways and removes the burden of computing their derivatives by hand. Morerecently, differentiation of optimization problem solutions has attractedwidespread attention with applications such as optimization layers, and inbi-level problems such as hyper-parameter optimization and meta-learning. However, so far, implicit differentiation remained difficult to use forpractitioners, as it often required case-by-case tedious mathematicalderivations and implementations. In this paper, we proposeautomatic implicit differentiation, an efficientand modular approach for implicit differentiation of optimization problems. Inour approach, the user defines directly in Python a function $F$ capturing theoptimality conditions of the problem to be differentiated. Once this is done, weleverage autodiff of $F$ and the implicit function theorem to automaticallydifferentiate the optimization problem. Our approach thus combines the benefitsof implicit differentiation and autodiff. It is efficient as it can be added ontop of any state-of-the-art solver and modular as the optimality conditionspecification is decoupled from the implicit differentiation mechanism. We showthat seemingly simple principles allow to recover many existing implicitdifferentiation methods and create new ones easily. We demonstrate the ease offormulating and solving bi-level optimization problems using our framework. Wealso showcase an application to the sensitivity analysis of molecular dynamics.

JMLR Journal 2022 Journal Article

Implicit Differentiation for Fast Hyperparameter Selection in Non-Smooth Convex Learning

Quentin Bertrand
Quentin Klopfenstein
Mathurin Massias
Mathieu Blondel
Samuel Vaiter
Alexandre Gramfort
Joseph Salmon

Finding the optimal hyperparameters of a model can be cast as a bilevel optimization problem, typically solved using zero-order techniques. In this work we study first-order methods when the inner optimization problem is convex but non-smooth. We show that the forward-mode differentiation of proximal gradient descent and proximal coordinate descent yield sequences of Jacobians converging toward the exact Jacobian. Using implicit differentiation, we show it is possible to leverage the non-smoothness of the inner problem to speed up the computation. Finally, we provide a bound on the error made on the hypergradient when the inner optimization problem is solved approximately. Results on regression and classification problems reveal computational benefits for hyperparameter optimization, especially when multiple hyperparameters are required. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2022. ( edit, beta )

NeurIPS Conference 2022 Conference Paper

Learning Energy Networks with Generalized Fenchel-Young Losses

Mathieu Blondel
Felipe Llinares-Lopez
Robert Dadashi
Leonard Hussenot
Matthieu Geist

Energy-based models, a. k. a. energy networks, perform inference by optimizing an energy function, typically parametrized by a neural network. This allows one to capture potentially complex relationships between inputs andoutputs. To learn the parameters of the energy function, the solution to thatoptimization problem is typically fed into a loss function. The key challenge for training energy networks lies in computing loss gradients, as this typically requires argmin/argmax differentiation. In this paper, building upon a generalized notion of conjugate function, which replaces the usual bilinear pairing with a general energy function, we propose generalized Fenchel-Young losses, a natural loss construction forlearning energy networks. Our losses enjoy many desirable properties and theirgradients can be computed efficiently without argmin/argmax differentiation. We also prove the calibration of their excess risk in the case of linear-concaveenergies. We demonstrate our losses on multilabel classification and imitation learning tasks.

JMLR Journal 2022 Journal Article

Sparse Continuous Distributions and Fenchel-Young Losses

André F. T. Martins
Marcos Treviso
António Farinhas
Pedro M. Q. Aguiar
Mário A. T. Figueiredo
Mathieu Blondel
Vlad Niculae

Exponential families are widely used in machine learning, including many distributions in continuous and discrete domains (e.g., Gaussian, Dirichlet, Poisson, and categorical distributions via the softmax transformation). Distributions in each of these families have fixed support. In contrast, for finite domains, recent work on sparse alternatives to softmax (e.g., sparsemax, $\alpha$-entmax, and fusedmax), has led to distributions with varying support. This paper develops sparse alternatives to continuous distributions, based on several technical contributions: First, we define $\Omega$-regularized prediction maps and Fenchel-Young losses for arbitrary domains (possibly countably infinite or continuous). For linearly parametrized families, we show that minimization of Fenchel-Young losses is equivalent to moment matching of the statistics, generalizing a fundamental property of exponential families. When $\Omega$ is a Tsallis negentropy with parameter $\alpha$, we obtain “deformed exponential families,” which include $\alpha$-entmax and sparsemax ($\alpha=2$) as particular cases. For quadratic energy functions, the resulting densities are $\beta$-Gaussians, an instance of elliptical distributions that contain as particular cases the Gaussian, biweight, triweight, and Epanechnikov densities, and for which we derive closed-form expressions for the variance, Tsallis entropy, and Fenchel-Young loss. When $\Omega$ is a total variation or Sobolev regularizer, we obtain a continuous version of the fusedmax. Finally, we introduce continuous-domain attention mechanisms, deriving efficient gradient backpropagation algorithms for $\alpha \in \{1,\frac{4}{3}, \frac{3}{2}, 2\}$. Using these algorithms, we demonstrate our sparse continuous distributions for attention-based audio classification and visual question answering, showing that they allow attending to time intervals and compact regions. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2022. ( edit, beta )

ICML Conference 2021 Conference Paper

Momentum Residual Neural Networks

Michael Eli Sander
Pierre Ablin
Mathieu Blondel
Gabriel Peyré

The training of deep residual neural networks (ResNets) with backpropagation has a memory cost that increases linearly with respect to the depth of the network. A simple way to circumvent this issue is to use reversible architectures. In this paper, we propose to change the forward rule of a ResNet by adding a momentum term. The resulting networks, momentum residual neural networks (MomentumNets), are invertible. Unlike previous invertible architectures, they can be used as a drop-in replacement for any existing ResNet block. We show that MomentumNets can be interpreted in the infinitesimal step size regime as second-order ordinary differential equations (ODEs) and exactly characterize how adding momentum progressively increases the representation capabilities of MomentumNets: they can learn any linear mapping up to a multiplicative factor, while ResNets cannot. In a learning to optimize setting, where convergence to a fixed point is required, we show theoretically and empirically that our method succeeds while existing invertible architectures fail. We show on CIFAR and ImageNet that MomentumNets have the same accuracy as ResNets, while having a much smaller memory footprint, and show that pre-trained MomentumNets are promising for fine-tuning models.

ICML Conference 2020 Conference Paper

Fast Differentiable Sorting and Ranking

Mathieu Blondel
Olivier Teboul
Quentin Berthet
Josip Djolonga

The sorting operation is one of the most commonly used building blocks in computer programming. In machine learning, it is often used for robust statistics. However, seen as a function, it is piecewise linear and as a result includes many kinks where it is non-differentiable. More problematic is the related ranking operator, often used for order statistics and ranking metrics. It is a piecewise constant function, meaning that its derivatives are null or undefined. While numerous works have proposed differentiable proxies to sorting and ranking, they do not achieve the $O(n \log n)$ time complexity one would expect from sorting and ranking operations. In this paper, we propose the first differentiable sorting and ranking operators with $O(n \log n)$ time and $O(n)$ space complexity. Our proposal in addition enjoys exact computation and differentiation. We achieve this feat by constructing differentiable operators as projections onto the permutahedron, the convex hull of permutations, and using a reduction to isotonic optimization. Empirically, we confirm that our approach is an order of magnitude faster than existing approaches and showcase two novel applications: differentiable Spearman’s rank correlation coefficient and least trimmed squares.

ICML Conference 2020 Conference Paper

Implicit differentiation of Lasso-type models for hyperparameter optimization

Quentin Bertrand
Quentin Klopfenstein
Mathieu Blondel
Samuel Vaiter
Alexandre Gramfort
Joseph Salmon

Setting regularization parameters for Lasso-type estimators is notoriously difficult, though crucial for obtaining the best accuracy. The most popular hyperparameter optimization approach is grid-search on a held-out dataset. However, grid-search requires to choose a predefined grid of parameters and scales exponentially in the number of parameters. Another class of approaches casts hyperparameter optimization as a bi-level optimization problem, typically solved by gradient descent. The key challenge for these approaches is the estimation of the gradient w. r. t. the hyperparameters. Computing that gradient via forward or backward automatic differentiation usually suffers from high memory consumption, while implicit differentiation typically involves solving a linear system which can be prohibitive and numerically unstable. In addition, implicit differentiation usually assumes smooth loss functions, which is not the case of Lasso-type problems. This work introduces an efficient implicit differentiation algorithm, without matrix inversion, tailored for Lasso-type problems. Our proposal scales to high-dimensional data by leveraging the sparsity of the solutions. Empirically, we demonstrate that the proposed method outperforms a large number of standard methods for hyperparameter optimization.

NeurIPS Conference 2020 Conference Paper

Learning with Differentiable Pertubed Optimizers

Quentin Berthet
Mathieu Blondel
Olivier Teboul
Marco Cuturi
Jean-Philippe Vert
Francis Bach

Machine learning pipelines often rely on optimizers procedures to make discrete decisions (e. g. , sorting, picking closest neighbors, or shortest paths). Although these discrete decisions are easily computed in a forward manner, they break the back-propagation of computational graphs. In order to expand the scope of learning problems that can be solved in an end-to-end fashion, we propose a systematic method to transform optimizers into operations that are differentiable and never locally constant. Our approach relies on stochastically perturbed optimizers, and can be used readily within existing solvers. Their derivatives can be evaluated efficiently, and smoothness tuned via the chosen noise amplitude. We also show how this framework can be connected to a family of losses developed in structured prediction, and give theoretical guarantees for their use in learning tasks. We demonstrate experimentally the performance of our approach on various tasks.

JMLR Journal 2020 Journal Article

Learning with Fenchel-Young losses

Mathieu Blondel
André F.T. Martins
Vlad Niculae

Over the past decades, numerous loss functions have been been proposed for a variety of supervised learning tasks, including regression, classification, ranking, and more generally structured prediction. Understanding the core principles and theoretical properties underpinning these losses is key to choose the right loss for the right problem, as well as to create new losses which combine their strengths. In this paper, we introduce Fenchel-Young losses, a generic way to construct a convex loss function for a regularized prediction function. We provide an in-depth study of their properties in a very broad setting, covering all the aforementioned supervised learning tasks, and revealing new connections between sparsity, generalized entropies, and separation margins. We show that Fenchel-Young losses unify many well-known loss functions and allow to create useful new ones easily. Finally, we derive efficient predictive and training algorithms, making Fenchel-Young losses appealing both in theory and practice. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2020. ( edit, beta )

ICML Conference 2019 Conference Paper

Geometric Losses for Distributional Learning

Arthur Mensch
Mathieu Blondel
Gabriel Peyré

Building upon recent advances in entropy-regularized optimal transport, and upon Fenchel duality between measures and continuous functions, we propose a generalization of the logistic loss that incorporates a metric or cost between classes. Unlike previous attempts to use optimal transport distances for learning, our loss results in unconstrained convex objective functions, supports infinite (or very large) class spaces, and naturally defines a geometric generalization of the softmax operator. The geometric properties of this loss make it suitable for predicting sparse and singular distributions, for instance supported on curves or hyper-surfaces. We study the theoretical properties of our loss and showcase its effectiveness on two applications: ordinal regression and drawing generation.

NeurIPS Conference 2019 Conference Paper

Structured Prediction with Projection Oracles

Mathieu Blondel

We propose in this paper a general framework for deriving loss functions for structured prediction. In our framework, the user chooses a convex set including the output space and provides an oracle for projecting onto that set. Given that oracle, our framework automatically generates a corresponding convex and smooth loss function. As we show, adding a projection as output layer provably makes the loss smaller. We identify the marginal polytope, the output space's convex hull, as the best convex set on which to project. However, because the projection onto the marginal polytope can sometimes be expensive to compute, we allow to use any convex superset instead, with potentially cheaper-to-compute projection. Since efficient projection algorithms are available for numerous convex sets, this allows us to construct loss functions for a variety of tasks. On the theoretical side, when combined with calibrated decoding, we prove that our loss functions can be used as a consistent surrogate for a (potentially non-convex) target loss function of interest. We demonstrate our losses on label ranking, ordinal regression and multilabel classification, confirming the improved accuracy enabled by projections.

ICML Conference 2018 Conference Paper

Differentiable Dynamic Programming for Structured Prediction and Attention

Arthur Mensch
Mathieu Blondel

Dynamic programming (DP) solves a variety of structured combinatorial problems by iteratively breaking them down into smaller subproblems. In spite of their versatility, many DP algorithms are non-differentiable, which hampers their use as a layer in neural networks trained by backpropagation. To address this issue, we propose to smooth the max operator in the dynamic programming recursion, using a strongly convex regularizer. This allows to relax both the optimal value and solution of the original combinatorial problem, and turns a broad class of DP algorithms into differentiable operators. Theoretically, we provide a new probabilistic perspective on backpropagating through these DP operators, and relate them to inference in graphical models. We derive two particular instantiations of our framework, a smoothed Viterbi algorithm for sequence prediction and a smoothed DTW algorithm for time-series alignment. We showcase these instantiations on structured prediction (audio-to-score alignment, NER) and on structured and sparse attention for translation.

ICML Conference 2018 Conference Paper

SparseMAP: Differentiable Sparse Structured Inference

Vlad Niculae
André F. T. Martins
Mathieu Blondel
Claire Cardie

Structured prediction requires searching over a combinatorial number of structures. To tackle it, we introduce SparseMAP, a new method for sparse structured inference, together with corresponding loss functions. SparseMAP inference is able to automatically select only a few global structures: it is situated between MAP inference, which picks a single structure, and marginal inference, which assigns probability mass to all structures, including implausible ones. Importantly, SparseMAP can be computed using only calls to a MAP oracle, hence it is applicable even to problems where marginal inference is intractable, such as linear assignment. Moreover, thanks to the solution sparsity, gradient backpropagation is efficient regardless of the structure. SparseMAP thus enables us to augment deep neural networks with generic and sparse structured hidden layers. Experiments in dependency parsing and natural language inference reveal competitive accuracy, improved interpretability, and the ability to capture natural language ambiguities, which is attractive for pipeline systems.

NeurIPS Conference 2017 Conference Paper

A Regularized Framework for Sparse and Structured Neural Attention

Vlad Niculae
Mathieu Blondel

Modern neural networks are often augmented with an attention mechanism, which tells the network where to focus within the input. We propose in this paper a new framework for sparse and structured attention, building upon a smoothed max operator. We show that the gradient of this operator defines a mapping from real values to probabilities, suitable as an attention mechanism. Our framework includes softmax and a slight generalization of the recently-proposed sparsemax as special cases. However, we also show how our framework can incorporate modern structured penalties, resulting in more interpretable attention mechanisms, that focus on entire segments or groups of an input. We derive efficient algorithms to compute the forward and backward passes of our attention mechanisms, enabling their use in a neural network trained with backpropagation. To showcase their potential as a drop-in replacement for existing ones, we evaluate our attention mechanisms on three large-scale tasks: textual entailment, machine translation, and sentence summarization. Our attention mechanisms improve interpretability without sacrificing performance; notably, on textual entailment and summarization, we outperform the standard attention mechanisms based on softmax and sparsemax.

NeurIPS Conference 2017 Conference Paper

Multi-output Polynomial Networks and Factorization Machines

Mathieu Blondel
Vlad Niculae
Takuma Otsuka
Naonori Ueda

Factorization machines and polynomial networks are supervised polynomial models based on an efficient low-rank decomposition. We extend these models to the multi-output setting, i. e. , for learning vector-valued functions, with application to multi-class or multi-task problems. We cast this as the problem of learning a 3-way tensor whose slices share a common basis and propose a convex formulation of that problem. We then develop an efficient conditional gradient algorithm and prove its global convergence, despite the fact that it involves a non-convex basis selection step. On classification tasks, we show that our algorithm achieves excellent accuracy with much sparser models than existing methods. On recommendation system tasks, we show how to combine our algorithm with a reduction from ordinal regression to multi-output classification and show that the resulting algorithm outperforms simple baselines in terms of ranking accuracy.

ICML Conference 2017 Conference Paper

Soft-DTW: a Differentiable Loss Function for Time-Series

Marco Cuturi
Mathieu Blondel

We propose in this paper a differentiable learning loss between time series, building upon the celebrated dynamic time warping (DTW) discrepancy. Unlike the Euclidean distance, DTW can compare time series of variable size and is robust to shifts or dilatations across the time dimension. To compute DTW, one typically solves a minimal-cost alignment problem between two time series using dynamic programming. Our work takes advantage of a smoothed formulation of DTW, called soft-DTW, that computes the soft-minimum of all alignment costs. We show in this paper that soft-DTW is a differentiable loss function, and that both its value and gradient can be computed with quadratic time/space complexity (DTW has quadratic time but linear space complexity). We show that this regularization is particularly well suited to average and cluster time series under the DTW geometry, a task for which our proposal significantly outperforms existing baselines (Petitjean et al. , 2011). Next, we propose to tune the parameters of a machine that outputs time series by minimizing its fit with ground-truth labels in a soft-DTW sense. Source code is available at https: //github. com/mblondel/soft-dtw

IJCAI Conference 2017 Conference Paper

SVD-Based Screening for the Graphical Lasso

Yasuhiro Fujiwara
Naoki Marumo
Mathieu Blondel
Koh Takeuchi
Hideaki Kim
Tomoharu Iwata
Naonori Ueda

The graphical lasso is the most popular approach to estimating the inverse covariance matrix of high-dimension data. It iteratively estimates each row and column of the matrix in a round-robin style until convergence. However, the graphical lasso is infeasible due to its high computation cost for large size of datasets. This paper proposes Sting, a fast approach to the graphical lasso. In order to reduce the computation cost, it efficiently identifies blocks in the estimated matrix that have nonzero elements before entering the iterations by exploiting the singular value decomposition of data matrix. In addition, it selectively updates elements of the estimated matrix expected to have nonzero values. Theoretically, it guarantees to converge to the same result as the original algorithm of the graphical lasso. Experiments show that our approach is faster than existing approaches.

NeurIPS Conference 2016 Conference Paper

Higher-Order Factorization Machines

Mathieu Blondel
Akinori Fujino
Naonori Ueda
Masakazu Ishihata

Factorization machines (FMs) are a supervised learning approach that can use second-order feature combinations even when the data is very high-dimensional. Unfortunately, despite increasing interest in FMs, there exists to date no efficient training algorithm for higher-order FMs (HOFMs). In this paper, we present the first generic yet efficient algorithms for training arbitrary-order HOFMs. We also present new variants of HOFMs with shared parameters, which greatly reduce model size and prediction times while maintaining similar accuracy. We demonstrate the proposed approaches on four different link prediction tasks.

ICML Conference 2016 Conference Paper

Polynomial Networks and Factorization Machines: New Insights and Efficient Training Algorithms

Mathieu Blondel
Masakazu Ishihata
Akinori Fujino
Naonori Ueda

Polynomial networks and factorization machines are two recently-proposed models that can efficiently use feature interactions in classification and regression tasks. In this paper, we revisit both models from a unified perspective. Based on this new view, we study the properties of both models and propose new efficient training algorithms. Key to our approach is to cast parameter learning as a low-rank symmetric tensor estimation problem, which we solve by multi-convex optimization. We demonstrate our approach on regression and recommender system tasks.

JMLR Journal 2011 Journal Article

Scikit-learn: Machine Learning in Python

Fabian Pedregosa
Gaël Varoquaux
Alexandre Gramfort
Vincent Michel
Bertrand Thirion
Olivier Grisel
Mathieu Blondel
Peter Prettenhofer

Scikit-learn is a Python module integrating a wide range of state-of-the-art machine learning algorithms for medium-scale supervised and unsupervised problems. This package focuses on bringing machine learning to non-specialists using a general-purpose high-level language. Emphasis is put on ease of use, performance, documentation, and API consistency. It has minimal dependencies and is distributed under the simplified BSD license, encouraging its use in both academic and commercial settings. Source code, binaries, and documentation can be downloaded from http://scikit-learn.sourceforge.net. [abs] [ pdf ][ bib ] [ code ] &copy JMLR 2011. ( edit, beta )