Author name cluster

Stephan Clémençon

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers

1 author row

AAAI Conference 2026 Conference Paper

Best Arm Identification with Biased Contexts

James Cheshire
Stephan Clémençon

We study active mitigation of selection bias in statistical learning. That is sequential maximization over a set A of the expectation of a reward function R(a,X) w.r.t. a r.v. X drawn from a target distribution PT possibly different from the (supposedly dominating) source distribution PS under which rewards are observed. The importance function dPT/dPS (x) with which the sequentially observed biased rewards should be ideally weighted being unknown in practice, auxiliary information is assumed to be available in the form of known moments of the target distribution PT for debiasing purposes. In the batch setting, this problem has already been studied and can be solved under certain conditions in two successive steps: 1) identify a weight function so as to approximate the moments 2) maximize the resulting (empirical version of the) weighted reward. In the active setting, if the problem boils down to identifying the best arm in a stochastic multi armed bandit (MAB) model, the presence of selection bias strongly affects the complexity of the sequential optimization problem and requires the development of a new algorithmic approach, as we show here. In a fixed confidence setting, we introduce a novel notion of complexity, which accounts for the balance between arm evaluation and (parametric) weight function estimation, establish lower bounds and propose an algorithm proved to be near optimal. Theoretical guarantees are backed up by numerical results.

PDF Details DOI

TMLR Journal 2026 Journal Article

On Gossip Algorithms for Machine Learning with Pairwise Objectives

Igor Colin
Aurélien Bellet
Stephan Clémençon
Joseph Salmon

In the IoT era, information is more and more frequently picked up by connected smart sensors with increasing, though limited, storage, communication and computation abilities. Whether due to privacy constraints or to the structure of the distributed system, the development of statistical learning methods dedicated to data that are shared over a network is now a major issue. Gossip-based algorithms have been developed for the purpose of solving a wide variety of statistical learning tasks, ranging from data aggregation over sensor networks to decentralized multi-agent optimization. Whereas the vast majority of contributions consider situations where the function to be estimated or optimized is a basic average of individual observations, it is the goal of this article to investigate the case where the latter is of pairwise nature, taking the form of a $U$-statistic of degree two. Motivated by various problems such as similarity learning, ranking or clustering for instance, we revisit gossip algorithms specifically designed for pairwise objective functions and provide a comprehensive theoretical framework for their convergence. This analysis fills a gap in the literature by establishing conditions under which these methods succeed, and by identifying the graph properties that critically affect their efficiency. In particular, a refined analysis of the convergence upper and lower bounds is performed.

PDF Details

NeurIPS Conference 2025 Conference Paper

Robust Distributed Estimation: Extending Gossip Algorithms to Ranking and Trimmed Means

Anna van Elst
Igor Colin
Stephan Clémençon

This paper addresses the problem of robust estimation in gossip algorithms over arbitrary communication graphs. Gossip algorithms are fully decentralized, relying only on local neighbor-to-neighbor communication, making them well-suited for situations where communication is constrained. A fundamental challenge in existing mean-based gossip algorithms is their vulnerability to malicious or corrupted nodes. In this paper, we show that an outlier-robust mean can be computed by globally estimating a robust statistic. More specifically, we propose a novel gossip algorithm for rank estimation, referred to as \textsc{GoRank}, and leverage it to design a gossip procedure dedicated to trimmed mean estimation, coined \textsc{GoTrim}. In addition to a detailed description of the proposed methods, a key contribution of our work is a precise convergence analysis: we establish an $\mathcal{O}(1/t)$ rate for rank estimation and an $\mathcal{O}(1 / {t})$ rate for trimmed mean estimation, where by $t$ is meant the number of iterations. Moreover, we provide a breakdown point analysis of \textsc{GoTrim}. We empirically validate our theoretical results through experiments on diverse network topologies, data distributions and contamination schemes.

PDF Details

TMLR Journal 2024 Journal Article

A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions

Guillaume Staerman
Pavlo Mozharovskyi
Pierre Colombo
Stephan Clémençon
Florence d'Alché-Buc

The design of a metric between probability distributions is a longstanding problem motivated by numerous applications in machine learning. Focusing on continuous probability distributions in the Euclidean space $\mathbb{R}^d$, we introduce a novel pseudo-metric between probability distributions by leveraging the extension of univariate quantiles to multivariate spaces. Data depth is a nonparametric statistical tool that measures the centrality of any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a dataset. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Thus, its upper-level sets---the depth-trimmed regions---give rise to a definition of multivariate quantiles. The new pseudo-metric relies on the average of the Hausdorff distance between the depth-based quantile regions for each distribution. Its good behavior regarding major transformation groups, as well as its ability to factor out translations, are depicted. Robustness, an appealing feature of this pseudo-metric, is studied through the finite sample breakdown point. Moreover, we propose an efficient approximation method with linear time complexity w.r.t. the size of the dataset and its dimension. The quality of this approximation and the performance of the proposed approach are illustrated in numerical experiments.

PDF Details

NeurIPS Conference 2023 Conference Paper

Active Bipartite Ranking

James Cheshire
Vincent Laurent
Stephan Clémençon

In this paper, we develop an active learning framework for the bipartite ranking problem. Motivated by numerous applications, ranging from supervised anomaly detection to credit-scoring through the design of medical diagnosis support systems, and usually formulated as the problem of optimizing (a scalar summary of) the ROC curve, bipartite ranking has been the subject of much attention in the passive context. Various dedicated algorithms have been recently proposed and studied by the machine-learning community. In contrast, active bipartite ranking rule is poorly documented in the literature. Due to its global nature, a strategy for labeling sequentially data points that are difficult to rank w. r. t. to the others is required. This learning task is much more complex than binary classification, for which many active algorithms have been designed. It is the goal of this article to provide a rigorous formulation of such a selective sampling approach. We propose a dedicated algorithm, referred to as active-rank, which aims to minimise the distance between the ROC curve of the ranking function built and the optimal one, w. r. t. the sup norm. We show that, for a fixed confidence level $\epsilon$ and probability $\delta$, active-rank is PAC$(\epsilon, \delta)$. In addition, we provide a problem dependent upper bound on the expected sampling time of active-rank and also demonstrate a problem dependent lower bound on the expected sampling time of any PAC$(\epsilon, \delta)$ algorithm. Beyond the theoretical analysis carried out, numerical results are presented, providing strong empirical evidence of the performance of the algorithm proposed, which compares favorably with more naive approaches.

PDF Details

JMLR Journal 2022 Journal Article

Empirical Risk Minimization under Random Censorship

Guillaume Ausset
Stephan Clémençon
François Portier

We consider the classic supervised learning problem where a continuous non-negative random label $Y$ (e.g. a random duration) is to be predicted based upon observing a random vector $X$ valued in $\mathbb{R}^d$ with $d\geq 1$ by means of a regression rule with minimum least square error. In various applications, ranging from industrial quality control to public health through credit risk analysis for instance, training observations can be right censored, meaning that, rather than on independent copies of $(X,Y)$, statistical learning relies on a collection of $n\geq 1$ independent realizations of the triplet $(X, \; \min\{Y,\; C\},\; \delta)$, where $C$ is a nonnegative random variable with unknown distribution, modelling censoring and $\delta=\mathbb{I}\{Y\leq C\}$ indicates whether the duration is right censored or not. As ignoring censoring in the risk computation may clearly lead to a severe underestimation of the target duration and jeopardize prediction, we consider a plug-in estimate of the true risk based on a Kaplan-Meier estimator of the conditional survival function of the censoring $C$ given $X$, referred to as Beran risk, in order to perform empirical risk minimization. It is established, under mild conditions, that the learning rate of minimizers of this biased/weighted empirical risk functional is of order $O_{\mathbb{P}}(\sqrt{\log(n)/n})$ when ignoring model bias issues inherent to plug-in estimation, as can be attained in absence of censoring. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed. [abs] [ pdf ][ bib ] &copy JMLR 2022. ( edit, beta )

PDF Details

NeurIPS Conference 2022 Conference Paper

What are the best Systems? New Perspectives on NLP Benchmarking

Pierre Colombo
Nathan Noiry
Ekhine Irurozki
Stephan Clémençon

In Machine Learning, a benchmark refers to an ensemble of datasets associated with one or multiple metrics together with a way to aggregate different systems performances. They are instrumental in {\it (i)} assessing the progress of new methods along different axes and {\it (ii)} selecting the best systems for practical use. This is particularly the case for NLP with the development of large pre-trained models (\textit{e. g. } GPT, BERT) that are expected to generalize well on a variety of tasks. While the community mainly focused on developing new datasets and metrics, there has been little interest in the aggregation procedure, which is often reduced to a simple average over various performance measures. However, this procedure can be problematic when the metrics are on a different scale, which may lead to spurious conclusions. This paper proposes a new procedure to rank systems based on their performance across different tasks. Motivated by the social choice theory, the final system ordering is obtained through aggregating the rankings induced by each task and is theoretically grounded. We conduct extensive numerical experiments (on over 270k scores) to assess the soundness of our approach both on synthetic and real scores (\textit{e. g. } GLUE, EXTREM, SEVAL, TAC, FLICKR). In particular, we show that our method yields different conclusions on state-of-the-art systems than the mean-aggregation procedure while being both more reliable and robust.

PDF Details

NeurIPS Conference 2018 Conference Paper

On Binary Classification in Extreme Regions

Hamid Jalalzai
Stephan Clémençon
Anne Sabourin

In pattern recognition, a random label Y is to be predicted based upon observing a random vector X valued in $\mathbb{R}^d$ with d>1 by means of a classification rule with minimum probability of error. In a wide variety of applications, ranging from finance/insurance to environmental sciences through teletraffic data analysis for instance, extreme (i. e. very large) observations X are of crucial importance, while contributing in a negligible manner to the (empirical) error however, simply because of their rarity. As a consequence, empirical risk minimizers generally perform very poorly in extreme regions. It is the purpose of this paper to develop a general framework for classification in the extremes. Precisely, under non-parametric heavy-tail assumptions for the class distributions, we prove that a natural and asymptotic notion of risk, accounting for predictive performance in extreme regions of the input space, can be defined and show that minimizers of an empirical version of a non-asymptotic approximant of this dedicated risk, based on a fraction of the largest observations, lead to classification rules with good generalization capacity, by means of maximal deviation inequalities in low probability regions. Beyond theoretical results, numerical experiments are presented in order to illustrate the relevance of the approach developed.

PDF Details

NeurIPS Conference 2016 Conference Paper

On Graph Reconstruction via Empirical Risk Minimization: Fast Learning Rates and Scalability

Guillaume Papa
Aurélien Bellet
Stephan Clémençon

The problem of predicting connections between a set of data points finds many applications, in systems biology and social network analysis among others. This paper focuses on the \textit{graph reconstruction} problem, where the prediction rule is obtained by minimizing the average error over all n(n-1)/2 possible pairs of the n nodes of a training graph. Our first contribution is to derive learning rates of order O(log n / n) for this problem, significantly improving upon the slow rates of order O(1/√n) established in the seminal work of Biau & Bleakley (2006). Strikingly, these fast rates are universal, in contrast to similar results known for other statistical learning problems (e. g. , classification, density level set estimation, ranking, clustering) which require strong assumptions on the distribution of the data. Motivated by applications to large graphs, our second contribution deals with the computational complexity of graph reconstruction. Specifically, we investigate to which extent the learning rates can be preserved when replacing the empirical reconstruction risk by a computationally cheaper Monte-Carlo version, obtained by sampling with replacement B << n² pairs of nodes. Finally, we illustrate our theoretical results by numerical experiments on synthetic and real graphs.

PDF Details

JMLR Journal 2016 Journal Article

Scaling-up Empirical Risk Minimization: Optimization of Incomplete $U$-statistics

Stephan Clémençon
Igor Colin
Aurélien Bellet

In a wide range of statistical learning problems such as ranking, clustering or metric learning among others, the risk is accurately estimated by $U$-statistics of degree $d\geq 1$, i.e. functionals of the training data with low variance that take the form of averages over $k$-tuples. From a computational perspective, the calculation of such statistics is highly expensive even for a moderate sample size $n$, as it requires averaging $O(n^d)$ terms. This makes learning procedures relying on the optimization of such data functionals hardly feasible in practice. It is the major goal of this paper to show that, strikingly, such empirical risks can be replaced by drastically computationally simpler Monte-Carlo estimates based on $O(n)$ terms only, usually referred to as incomplete $U$-statistics, without damaging the $O_{\mathbb{P}}(1/\sqrt{n})$ learning rate of Empirical Risk Minimization (ERM) procedures. For this purpose, we establish uniform deviation results describing the error made when approximating a $U$-process by its incomplete version under appropriate complexity assumptions. Extensions to model selection, fast rate situations and various sampling techniques are also considered, as well as an application to stochastic gradient descent for ERM. Finally, numerical examples are displayed in order to provide strong empirical evidence that the approach we promote largely surpasses more naive subsampling techniques. [abs] [ pdf ][ bib ] &copy JMLR 2016. ( edit, beta )

PDF Details