Arrow Research search

Author name cluster

Samory Kpotufe

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers (20)

NeurIPS 2025 Conference Paper

Mixed-Sample SGD: an End-to-end Analysis of Supervised Transfer Learning

  • Yuyang Deng
  • Samory Kpotufe

Theoretical works on supervised transfer learning (STL)---where the learner has access to labeled samples from both source and target distributions---have for the most part focused on statistical aspects of the problem, while efficient optimization has received less attention. We consider the problem of designing an SGD procedure for STL that alternates sampling between source and target data, while maintaining statistical transfer guarantees without prior knowledge of the quality of the source data. A main algorithmic difficulty is in understanding how to design such an adaptive sub-sampling mechanism at each SGD step, to automatically gain from the source when it is informative, or bias towards the target and avoid negative transfer when the source is less informative. We show that such a mixed-sample SGD procedure is feasible for general prediction tasks with convex losses, rooted in tracking an abstract sequence of constrained convex programs that serve to maintain the desired transfer guarantees. We instantiate these results in the concrete setting of linear regression with square loss, and show that the procedure converges, with $1/\sqrt{T}$ rate, to a solution whose statistical performance on the target is adaptive to the a priori unknown quality of the source. Experiments with synthetic and real datasets support the theory.
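
To make the alternating-sampling idea concrete, here is a minimal numpy sketch for the linear-regression-with-square-loss instantiation. It fixes a source-sampling probability `p_source` in advance, whereas the paper's contribution is an adaptive sub-sampling/constraint mechanism that does not need this knowledge; all names and parameters below are hypothetical illustration, not the authors' procedure.

```python
import numpy as np

def mixed_sample_sgd(Xs, ys, Xt, yt, p_source=0.5, lr=0.01, T=10_000, seed=0):
    """SGD for linear regression with square loss that, at each step, draws the
    stochastic gradient from the source with probability p_source and from the
    target otherwise.  Illustrative only: the paper adapts the sampling scheme
    to the unknown source quality instead of fixing p_source."""
    rng = np.random.default_rng(seed)
    d = Xt.shape[1]
    w = np.zeros(d)
    avg = np.zeros(d)                          # averaged iterate, standard for 1/sqrt(T) rates
    for t in range(1, T + 1):
        if rng.random() < p_source:
            i = rng.integers(len(Xs))
            x, y = Xs[i], ys[i]
        else:
            i = rng.integers(len(Xt))
            x, y = Xt[i], yt[i]
        grad = 2.0 * (x @ w - y) * x           # gradient of the square loss at (x, y)
        w -= (lr / np.sqrt(t)) * grad          # decaying step size
        avg += (w - avg) / t                   # running average of iterates
    return avg
```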

JMLR 2024 Journal Article

Regimes of No Gain in Multi-class Active Learning

  • Gan Yuan
  • Yunfan Zhao
  • Samory Kpotufe

We consider nonparametric classification with smooth regression functions, where it is well known that notions of margin in $\mathbb{P}(Y=y|X=x)$ determine fast or slow rates in both active and passive learning. Here we elucidate a striking distinction---most relevant in multi-class settings---between active and passive learning. Namely, we show that some seemingly benign nuances in notions of margin---involving the uniqueness of the Bayes classes, which have no apparent effect on rates in passive learning---determine whether or not any active learner can outperform passive learning rates. While a shorter conference version of this work already alluded to these nuances, it focused on the binary case and thus failed to be conclusive as to the source of difficulty in the multi-class setting: we show here that it suffices that the Bayes classifier fails to be unique, as opposed to needing all classes to be Bayes optimal, for active learning to yield no gain over passive learning.

ICLR 2024 Conference Paper

Tight Rates in Supervised Outlier Transfer Learning

  • Mohammadreza M. Kalan
  • Samory Kpotufe

A critical barrier to learning an accurate decision rule for outlier detection is the scarcity of outlier data. As such, practitioners often turn to the use of similar but imperfect outlier data from which they might \emph{transfer} information to the target outlier detection task. Despite the recent empirical success of transfer learning in outlier detection, a fundamental understanding of when and how knowledge can be transferred from a source to a target in outlier detection remains elusive. In this work, we adopt the traditional framework of Neyman-Pearson classification---which formalizes \emph{supervised outlier detection}, i.e., unbalanced classification---with the added assumption that we have access to both source and (some or no) target outlier data. Our main results are then as follows: We first determine the information-theoretic limits of the problem under a measure of discrepancy that extends some existing notions from traditional balanced classification; interestingly, unlike in balanced classification, seemingly very dissimilar sources can provide much information about a target, thus resulting in fast transfer. We then show that, in principle, these information-theoretic limits are achievable by \emph{adaptive} procedures, i.e., procedures with no a priori information on the discrepancy between source and target distributions.

NeurIPS 2023 Conference Paper

Tracking Most Significant Shifts in Nonparametric Contextual Bandits

  • Joe Suk
  • Samory Kpotufe

We study nonparametric contextual bandits where Lipschitz mean reward functions may change over time. We first establish the minimax dynamic regret rate in this less understood setting in terms of number of changes $L$ and total-variation $V$, both capturing all changes in distribution over context space, and argue that state-of-the-art procedures are suboptimal in this setting. Next, we turn to the question of _adaptivity_ for this setting, i.e., achieving the minimax rate without knowledge of $L$ or $V$. Quite importantly, we posit that the bandit problem, viewed locally at a given context $X_t$, should not be affected by reward changes in other parts of context space $\cal X$. We therefore propose a notion of _change_, which we term _experienced significant shifts_, that better accounts for locality, and thus counts considerably fewer changes than $L$ and $V$. Furthermore, similar to recent work on non-stationary MAB (Suk & Kpotufe, 2022), _experienced significant shifts_ only count the most _significant_ changes in mean rewards, e.g., severe best-arm changes relevant to observed contexts. Our main result is to show that this more tolerant notion of change can in fact be adapted to.

TMLR 2022 Journal Article

An Efficient One-Class SVM for Novelty Detection in IoT

  • Kun Yang
  • Samory Kpotufe
  • Nick Feamster

One-Class Support Vector Machines (OCSVM) are a common approach for novelty detection, due to their flexibility in fitting complex nonlinear boundaries between \emph{normal} and \emph{novel} data. Novelty detection is important in the Internet of Things ("IoT") due to the threats these devices can present, and OCSVM often performs well in these environments due to the variety of devices, traffic patterns, and anomalies that IoT devices present. Unfortunately, conventional OCSVMs can introduce prohibitive memory and computational overhead at detection time. This work designs, implements and evaluates an efficient OCSVM for such practical settings. We extend Nyström and (Gaussian) Sketching approaches to OCSVM, combining these methods with clustering and Gaussian mixture models to achieve 15-30x speedup in prediction time and 30-40x reduction in memory requirements without sacrificing detection accuracy. Here, the very nature of IoT devices is crucial: they tend to admit few modes of \emph{normal} operation, allowing for efficient pattern compression.
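
To see the general flavor of the speedup being targeted, the sketch below uses off-the-shelf scikit-learn machinery: a Nyström approximation of the RBF kernel feature map followed by a linear one-class SVM, so prediction touches only a small number of components rather than all support vectors. This is generic library code under assumed synthetic data, not the paper's pipeline; in particular the clustering and Gaussian-mixture compression steps it describes are omitted.

```python
import numpy as np
from sklearn.kernel_approximation import Nystroem
from sklearn.linear_model import SGDOneClassSVM
from sklearn.pipeline import make_pipeline

# Stand-in data: train only on "normal" traffic features, then score a mix
# of normal and clearly shifted ("novel") test points.
rng = np.random.default_rng(0)
X_normal = rng.normal(size=(5000, 20))
X_test = np.vstack([rng.normal(size=(100, 20)),
                    rng.normal(loc=4.0, size=(10, 20))])   # last 10 rows are novel

# Nystroem maps inputs into a low-rank approximation of the RBF feature space,
# so the downstream one-class model is linear and cheap at detection time.
detector = make_pipeline(
    Nystroem(kernel="rbf", gamma=0.1, n_components=100, random_state=0),
    SGDOneClassSVM(nu=0.05, random_state=0),
)
detector.fit(X_normal)
pred = detector.predict(X_test)                              # +1 = normal, -1 = novelty
print((pred == -1).sum(), "points flagged as novel")
```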

NeurIPS 2019 Conference Paper

On the Value of Target Data in Transfer Learning

  • Steve Hanneke
  • Samory Kpotufe

We aim to understand the value of additional labeled or unlabeled target data in transfer learning, for any given amount of source data; this is motivated by practical questions around minimizing sampling costs, whereby target data is usually harder or costlier to acquire than source data, but can yield better accuracy. To this aim, we establish the first minimax-rates in terms of both source and target sample sizes, and show that performance limits are captured by new notions of discrepancy between source and target, which we refer to as transfer exponents. Interestingly, we find that attaining minimax performance is akin to ignoring one of the source or target samples, provided distributional parameters were known a priori. Moreover, we show that practical decisions -- w.r.t. minimizing sampling costs -- can be made in a minimax-optimal way without knowledge or estimation of distributional parameters or of the discrepancy between source and target.

NeurIPS 2018 Conference Paper

PAC-Bayes Tree: weighted subtrees with guarantees

  • Tin Nguyen
  • Samory Kpotufe

We present a weighted-majority classification approach over subtrees of a fixed tree, which provably achieves excess-risk of the same order as the best tree-pruning. Furthermore, the computational efficiency of pruning is maintained at both training and testing time despite having to aggregate over an exponential number of subtrees. We believe this is the first subtree aggregation approach with such guarantees.

ICML 2018 Conference Paper

Quickshift++: Provably Good Initializations for Sample-Based Mean Shift

  • Heinrich Jiang
  • Jennifer Jang
  • Samory Kpotufe

We provide initial seedings to the Quick Shift clustering algorithm, which approximate the locally high-density regions of the data. Such seedings act as more stable and expressive cluster-cores than the singleton modes found by Quick Shift. We establish statistical consistency guarantees for this modification. We then show strong clustering performance on real datasets as well as promising applications to image segmentation.
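
The sketch below gives a loose paraphrase of the two-stage idea: first find connected high-density regions ("cluster cores") from a k-NN density estimate, then attach the remaining points to those cores. It uses a global density threshold and nearest-core assignment as simplifications of the paper's local thresholds and hill-climbing step; the function name, `k`, and `beta` are illustrative choices, not the authors' code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial import cKDTree

def cluster_cores(X, k=20, beta=0.3):
    """Loose Quickshift++-style sketch: (1) mark points whose k-NN density is
    within a (1 - beta) factor of the maximum as 'core' and group them through
    the k-NN graph; (2) attach every other point to its nearest core's label."""
    n = len(X)
    dist, idx = cKDTree(X).query(X, k=k + 1)      # column 0 is the point itself
    density = 1.0 / (dist[:, -1] + 1e-12)          # monotone proxy for k-NN density
    core = density >= (1.0 - beta) * density.max()

    # connected components of the k-NN graph restricted to core points
    rows, cols = [], []
    for i in np.where(core)[0]:
        for j in idx[i, 1:]:
            if core[j]:
                rows.append(i)
                cols.append(j)
    graph = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    _, comp = connected_components(graph, directed=False)
    labels = np.where(core, comp, -1)

    # assign each non-core point the label of its nearest core point
    core_pts = np.where(core)[0]
    _, nearest = cKDTree(X[core_pts]).query(X[~core])
    labels[~core] = labels[core_pts[nearest]]
    return labels
```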

JMLR 2017 Journal Article

Time-Accuracy Tradeoffs in Kernel Prediction: Controlling Prediction Quality

  • Samory Kpotufe
  • Nakul Verma

Kernel regression or classification (also referred to as weighted $\epsilon$-NN methods in Machine Learning) is appealing for its simplicity and therefore ubiquitous in data analysis. However, practical implementations of kernel regression or classification consist of quantizing or sub-sampling data for improving time efficiency, often at the cost of prediction quality. While such tradeoffs are necessary in practice, their statistical implications are generally not well understood, hence practical implementations come with few performance guarantees. In particular, it is unclear whether it is possible to maintain the statistical accuracy of kernel prediction---crucial in some applications---while improving prediction time. The present work provides guiding principles for combining kernel prediction with data-quantization so as to guarantee good tradeoffs between prediction time and accuracy, and in particular so as to approximately maintain the good accuracy of vanilla kernel prediction. Furthermore, our tradeoff guarantees are worked out explicitly in terms of a tuning parameter which acts as a knob that favors either time or accuracy depending on practical needs. On one end of the knob, prediction time is of the same order as that of single-nearest-neighbor prediction (which is statistically inconsistent) while maintaining consistency; on the other end of the knob, the prediction risk is nearly minimax-optimal (in terms of the original data size) while still reducing time complexity. The analysis thus reveals the interaction between the data-quantization approach and the kernel prediction method, and most importantly gives explicit control of the tradeoff to the practitioner rather than fixing the tradeoff in advance or leaving it opaque. The theoretical results are validated on data from a range of real-world application domains; in particular we demonstrate that the theoretical knob performs as expected.
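
As a rough illustration of what "kernel prediction over quantized data" looks like, the sketch below replaces the n training points by m weighted k-means prototypes and runs Nadaraya-Watson prediction over the prototypes only. Here m plays the role of the paper's knob, but how to choose it so that accuracy is approximately preserved is exactly what the paper analyzes; the quantizer, the bandwidth, and the names below are assumptions of this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def quantized_kernel_regressor(X, y, m=200, bandwidth=0.5, seed=0):
    """Quantize the data into m prototypes, then do Gaussian-kernel
    (Nadaraya-Watson) regression over the prototypes instead of all n points."""
    km = KMeans(n_clusters=m, n_init=10, random_state=seed).fit(X)
    centers = km.cluster_centers_
    counts = np.bincount(km.labels_, minlength=m).astype(float)
    y_means = np.bincount(km.labels_, weights=y, minlength=m) / np.maximum(counts, 1)

    def predict(x):
        # prototype weights = cluster size times kernel value at the query
        w = counts * np.exp(-np.sum((centers - x) ** 2, axis=1) / (2 * bandwidth ** 2))
        return (w * y_means).sum() / max(w.sum(), 1e-12)

    return predict
```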

JMLR 2016 Journal Article

Gradients Weights improve Regression and Classification

  • Samory Kpotufe
  • Abdeslam Boularias
  • Thomas Schultz
  • Kyoungok Kim

In regression problems over $\mathbb{R}^d$, the unknown function $f$ often varies more in some coordinates than in others. We show that weighting each coordinate $i$ according to an estimate of the variation of $f$ along coordinate $i$ -- e.g. the $L_1$ norm of the $i$th-directional derivative of $f$ -- is an efficient way to significantly improve the performance of distance-based regressors such as kernel and $k$-NN regressors. The approach, termed Gradient Weighting (GW), consists of a first pass regression estimate $f_n$ which serves to evaluate the directional derivatives of $f$, and a second-pass regression estimate on the re-weighted data. The GW approach can be instantiated for both regression and classification, and is grounded in strong theoretical principles having to do with the way regression bias and variance are affected by a generic feature-weighting scheme. These theoretical principles provide further technical foundation for some existing feature-weighting heuristics that have proved successful in practice. We propose a simple estimator of these derivative norms and prove its consistency. The proposed estimator computes efficiently and easily extends to run online. We then derive a classification version of the GW approach which evaluates on real-world datasets with as much success as its regression counterpart.
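
A minimal two-pass sketch of the GW idea follows: fit a first-pass regressor $f_n$, estimate how much $f$ varies along each coordinate by finite differences of $f_n$, rescale coordinates by those estimates, and refit. The derivative-norm estimator here (symmetric finite differences of a k-NN fit, hypothetical step size `h`) is a crude stand-in for the estimator studied in the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def gradient_weighted_knn(X, y, k=10, h=0.5):
    """Two-pass gradient-weighting sketch: reweight coordinates by estimated
    directional-derivative norms of a first-pass regressor, then refit."""
    first = KNeighborsRegressor(n_neighbors=k).fit(X, y)

    # crude estimate of E|df/dx_i| via symmetric finite differences of f_n
    d = X.shape[1]
    weights = np.empty(d)
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        weights[i] = np.mean(np.abs(first.predict(X + e) - first.predict(X - e))) / (2 * h)
    weights /= (weights.mean() + 1e-12)            # keep the overall scale comparable

    second = KNeighborsRegressor(n_neighbors=k).fit(X * weights, y)
    return (lambda Xq: second.predict(Xq * weights)), weights
```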

UAI 2014 Conference Paper

A Consistent Estimator of the Expected Gradient Outerproduct

  • Shubhendu Trivedi
  • Jialei Wang
  • Samory Kpotufe
  • Gregory Shakhnarovich

In high-dimensional classification or regression problems, the expected gradient outerproduct (EGOP) of the unknown regression function $f$, namely $\mathbb{E}_X\left[\nabla f(X) \cdot \nabla f(X)^\top\right]$, is known to recover those directions $v \in \mathbb{R}^d$ most relevant to predicting the output $Y$. However, just as in gradient estimation, optimal estimators of the EGOP can be expensive in practice. We show that a simple rough estimator, much cheaper in practice, suffices to obtain significant improvements on real-world nonparametric classification and regression tasks. Furthermore, we prove that, despite its simplicity, this rough estimator remains statistically consistent under mild conditions.
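
In the same spirit as the gradient-weighting sketch above, one cheap way to form an EGOP-style matrix is to take coordinate-wise finite differences of a first-pass regressor at each sample and average the outer products; the top eigenvectors of the result then indicate the most predictive directions. The estimator below is only the shape of that idea, with hypothetical parameters, not the rough estimator the paper proves consistent.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def rough_egop(X, y, k=10, h=0.5):
    """Average outer products of finite-difference gradient estimates of a
    first-pass k-NN regressor; leading eigenvectors approximate the most
    relevant directions for predicting y."""
    n, d = X.shape
    f = KNeighborsRegressor(n_neighbors=k).fit(X, y)
    grads = np.empty((n, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = h
        grads[:, i] = (f.predict(X + e) - f.predict(X - e)) / (2 * h)
    egop = grads.T @ grads / n                    # (d, d) averaged outer product
    eigvals, eigvecs = np.linalg.eigh(egop)       # ascending eigenvalues
    return egop, eigvecs[:, ::-1]                 # columns ordered most relevant first
```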

ICML 2014 Conference Paper

Consistency of Causal Inference under the Additive Noise Model

  • Samory Kpotufe
  • Eleni Sgouritsa
  • Dominik Janzing
  • Bernhard Schölkopf

We analyze a family of methods for statistical causal inference from samples under the so-called Additive Noise Model. While most work on the subject has concentrated on establishing the soundness of the Additive Noise Model, the statistical consistency of the resulting inference methods has received little attention. We derive general conditions under which the given family of inference methods consistently infers the causal direction in a nonparametric setting.
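
For readers unfamiliar with the family of methods being analyzed, the usual additive-noise recipe is: regress each variable on the other nonparametrically and prefer the direction in which the residuals look more independent of the input. The sketch below uses a k-NN regressor and a small biased HSIC statistic as illustrative choices; these are not the specific estimators whose consistency the paper establishes.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def hsic(a, b, sigma=1.0):
    """Biased empirical HSIC with Gaussian kernels (fine for a small sample)."""
    n = len(a)
    def gram(v):
        d2 = (v[:, None] - v[None, :]) ** 2
        return np.exp(-d2 / (2.0 * sigma ** 2))
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(gram(a) @ H @ gram(b) @ H) / (n - 1) ** 2

def anm_direction(x, y, k=10):
    """Fit Y on X and X on Y; report the direction with the more
    input-independent residuals (smaller HSIC)."""
    def residual_dependence(a, b):
        f = KNeighborsRegressor(n_neighbors=k).fit(a.reshape(-1, 1), b)
        return hsic(a, b - f.predict(a.reshape(-1, 1)))
    return "X -> Y" if residual_dependence(x, y) < residual_dependence(y, x) else "Y -> X"
```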

NeurIPS 2014 Conference Paper

Optimal rates for k-NN density and mode estimation

  • Sanjoy Dasgupta
  • Samory Kpotufe

We present two related contributions of independent interest: (1) high-probability finite sample rates for $k$-NN density estimation, and (2) practical mode estimators -- based on $k$-NN -- which attain minimax-optimal rates under surprisingly general distributional conditions.
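
For reference, the objects the rates concern are the textbook estimators sketched below: the k-NN density estimate $\hat{f}(x) = k / (n \cdot \mathrm{vol}(B(x, r_k(x))))$ and the simple mode estimate "the sample point with the smallest k-NN radius" (i.e., highest estimated density). The paper is about finite-sample and minimax rates for these, not about a new algorithm; the helper below is just a scipy/numpy rendering.

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density_and_mode(X, k=20):
    """k-NN density estimate at the sample points, plus the highest-density
    sample point as a mode estimate."""
    n, d = X.shape
    rk = cKDTree(X).query(X, k=k + 1)[0][:, -1]                 # distance to k-th neighbor
    ball_vol = np.pi ** (d / 2) / gamma(d / 2 + 1) * rk ** d    # volume of B(x, r_k(x))
    density = k / (n * ball_vol)
    mode = X[np.argmin(rk)]                                     # smallest k-NN radius
    return density, mode
```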

NeurIPS 2013 Conference Paper

Adaptivity to Local Smoothness and Dimension in Kernel Regression

  • Samory Kpotufe
  • Vikas Garg

We present the first result for kernel regression where the procedure adapts locally at a point $x$ to both the unknown local dimension of the metric and the unknown Hölder-continuity of the regression function at $x$. The result holds with high probability simultaneously at all points $x$ in a metric space of unknown structure.

NeurIPS 2013 Conference Paper

Regression-tree Tuning in a Streaming Setting

  • Samory Kpotufe
  • Francesco Orabona

We consider the problem of maintaining the data-structures of a partition-based regression procedure in a setting where the training data arrives sequentially over time. We prove that it is possible to maintain such a structure in time $O(\log n)$ at any time step $n$ while achieving a nearly-optimal regression rate of $\tilde{O}(n^{-2/(2+d)})$ in terms of the unknown metric dimension $d$. Finally we prove a new regression lower-bound which is independent of a given data size, and hence is more appropriate for the streaming setting.

NeurIPS 2012 Conference Paper

Gradient Weights help Nonparametric Regressors

  • Samory Kpotufe
  • Abdeslam Boularias

In regression problems over $\mathbb{R}^d$, the unknown function $f$ often varies more in some coordinates than in others. We show that weighting each coordinate $i$ with the estimated norm of the $i$th derivative of $f$ is an efficient way to significantly improve the performance of distance-based regressors, e.g., kernel and $k$-NN regressors. We propose a simple estimator of these derivative norms and prove its consistency. Moreover, the proposed estimator is efficiently learned online.

NeurIPS 2011 Conference Paper

k-NN Regression Adapts to Local Intrinsic Dimension

  • Samory Kpotufe

Many nonparametric regressors were recently shown to converge at rates that depend only on the intrinsic dimension of data. These regressors thus escape the curse of dimension when high-dimensional data has low intrinsic dimension (e.g., a manifold). We show that $k$-NN regression is also adaptive to intrinsic dimension. In particular our rates are local to a query $x$ and depend only on the way masses of balls centered at $x$ vary with radius. Furthermore, we show a simple way to choose $k = k(x)$ locally at any $x$ so as to nearly achieve the minimax rate at $x$ in terms of the unknown intrinsic dimension in the vicinity of $x$. We also establish that the minimax rate does not depend on a particular choice of metric space or distribution, but rather that this minimax rate holds for any metric space and doubling measure.

NeurIPS 2009 Conference Paper

Fast, smooth and adaptive regression in metric spaces

  • Samory Kpotufe

It was recently shown that certain nonparametric regressors can escape the curse of dimensionality in the sense that their convergence rates adapt to the intrinsic dimension of data (\cite{BL:65, SK:77}). We prove some stronger results in more general settings. In particular, we consider a regressor which, by combining aspects of both tree-based regression and kernel regression, operates on a general metric space, yields a smooth function, and evaluates in time $O(\log n)$. We derive a tight convergence rate of the form $n^{-2/(2+d)}$ where $d$ is the Assouad dimension of the input space.

UAI 2009 Conference Paper

Which Spatial Partition Trees are Adaptive to Intrinsic Dimension?

  • Nakul Verma
  • Samory Kpotufe
  • Sanjoy Dasgupta

Recent theory work has found that a special type of spatial partition tree – called a random projection tree – is adaptive to the intrinsic dimension of the data from which it is built. Here we examine this same question, with a combination of theory and experiments, for a broader class of trees that includes k-d trees, dyadic trees, and PCA trees. Our motivation is to get a feel for (i) the kind of intrinsic low dimensional structure that can be empirically verified, (ii) the extent to which a spatial partition can exploit such structure, and (iii) the implications for standard statistical tasks such as regression, vector quantization, and nearest neighbor search.
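
For context, a random projection tree of the kind whose adaptivity is analyzed here can be built in a few lines: split the data at the median of its projections onto a random direction and recurse. The sketch below omits the refinements used in the analyzed construction (jittered split points and occasional splits by distance to the mean); names and parameters are hypothetical.

```python
import numpy as np

def build_rp_tree(X, min_leaf=20, rng=None):
    """Minimal random-projection-tree sketch: median split along a random
    direction, recursing until leaves hold at most min_leaf points."""
    rng = np.random.default_rng() if rng is None else rng
    if len(X) <= min_leaf:
        return {"leaf": True, "points": X}
    direction = rng.normal(size=X.shape[1])
    direction /= np.linalg.norm(direction)
    proj = X @ direction
    threshold = np.median(proj)
    left, right = X[proj <= threshold], X[proj > threshold]
    if len(left) == 0 or len(right) == 0:          # degenerate split; stop here
        return {"leaf": True, "points": X}
    return {"leaf": False, "direction": direction, "threshold": threshold,
            "left": build_rp_tree(left, min_leaf, rng),
            "right": build_rp_tree(right, min_leaf, rng)}
```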