Arrow Research search

Author name cluster

Bin Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

47 papers
2 author rows

Possible papers

47

AAAI Conference 2026 Conference Paper

T4NMTD: Transition-Centric Reinforcement Learning for Non-Markovian Task Decomposition

  • Ruixuan Miao
  • Xu Lu
  • Cong Tian
  • Bin Yu
  • Zhenhua Duan

Non-Markovian Tasks (NMTs) are distinguished by their dependence on long-term memory and state-dependent dynamics, setting them apart from the traditional Markovian models typically employed in Reinforcement Learning (RL). NMTs not only suffer from reward sparseness but also rely on historical information, making their resolution considerably more challenging. In this paper, we propose a novel RL framework T4NMTD (Transition-centric framework for NMT Decomposition), designed specifically for learning NMTs that are specified by temporal logic. The core of T4NMTD is a task decomposition mechanism along with a parallel training approach for NMTs. An NMT is first decomposed into basic units based on the transitions of the automata which are derived from temporal logic formulae. The units are then modularized into sub-tasks according to their semantic similarity under logical interpretation. The training strategy of T4NMTD adopts a dual-level structure: the high-level learns to shape the boundaries and coordinate the arrangement of the sub-tasks from a global perspective, while the low-level learns those sub-tasks in parallel. In addition, we introduce a dynamic policy intervention scheme to mitigate the policy myopia issue during parallel training. A comprehensive evaluation is conducted on benchmark problems with respect to various metrics. The experimental results demonstrate that T4NMTD effectively addresses NMTs, achieving significant performance improvements compared with related studies.

ICML Conference 2025 Conference Paper

Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

  • Jingfeng Wu
  • Peter L. Bartlett
  • Matus Telgarsky
  • Bin Yu

In overparameterized logistic regression, gradient descent (GD) iterates diverge in norm while converging in direction to the maximum $\ell_2$-margin solution—a phenomenon known as the implicit bias of GD. This work investigates additional regularization effects induced by early stopping in well-specified high-dimensional logistic regression. We first demonstrate that the excess logistic risk vanishes for early-stopped GD but diverges to infinity for GD iterates at convergence. This suggests that early-stopped GD is well-calibrated, whereas asymptotic GD is statistically inconsistent. Second, we show that to attain a small excess zero-one risk, polynomially many samples are sufficient for early-stopped GD, while exponentially many samples are necessary for any interpolating estimator, including asymptotic GD. This separation underscores the statistical benefits of early stopping in the overparameterized regime. Finally, we establish nonasymptotic bounds on the norm and angular differences between early-stopped GD and the $\ell_2$-regularized empirical risk minimizer, thereby connecting the implicit regularization of GD with explicit $\ell_2$-regularization.
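
For illustration only (this is not the paper's experimental setup, and all sizes and learning rates below are arbitrary), a minimal numpy sketch of the phenomenon: in a well-specified overparameterized logistic model, the GD iterate norm keeps growing while an early checkpoint can already attain a small zero-one risk.

```python
# Minimal sketch (not the paper's setup): gradient descent on overparameterized
# logistic regression. The iterate norm keeps growing, while an early checkpoint
# can already generalize well; all sizes and learning rates are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                              # overparameterized: d > n
w_star = rng.normal(size=d) / np.sqrt(d)    # well-specified ground truth
X = rng.normal(size=(n, d))
y = (rng.random(n) < 1 / (1 + np.exp(-X @ w_star))).astype(float)

X_test = rng.normal(size=(1000, d))
y_test = (rng.random(1000) < 1 / (1 + np.exp(-X_test @ w_star))).astype(float)

def zero_one_risk(w):
    return np.mean((X_test @ w > 0) != y_test)

w, lr = np.zeros(d), 0.5
for t in range(1, 20001):
    p = 1 / (1 + np.exp(-np.clip(X @ w, -30, 30)))   # clip for numerical stability
    w -= lr * X.T @ (p - y) / n                      # gradient of the logistic loss
    if t in (100, 1000, 20000):
        print(f"t={t:6d}  ||w||={np.linalg.norm(w):8.2f}  test 0-1 risk={zero_one_risk(w):.3f}")
```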

JMLR Journal 2025 Journal Article

Instability, Computational Efficiency and Statistical Accuracy

  • Nhat Ho
  • Koulik Khamaru
  • Raaz Dwivedi
  • Martin J. Wainwright
  • Michael I. Jordan
  • Bin Yu

Many statistical estimators are defined as the fixed point of a data-dependent operator, with estimators based on minimizing a cost function being an important special case. The limiting performance of such estimators depends on the properties of the population-level operator in the idealized limit of infinitely many samples. We develop a general framework that yields bounds on statistical accuracy based on the interplay between the deterministic convergence rate of the algorithm at the population level, and its degree of (in)stability when applied to an empirical object based on $n$ samples. Using this framework, we analyze both stable forms of gradient descent and some higher-order and unstable algorithms, including Newton's method and its cubic-regularized variant, as well as the EM algorithm. We provide applications of our general results to several concrete classes of models, including Gaussian mixture estimation, non-linear regression models, and informative non-response models. We exhibit cases in which an unstable algorithm can achieve the same statistical accuracy as a stable algorithm in exponentially fewer steps---namely, with the number of iterations being reduced from polynomial to logarithmic in sample size $n$.

JMLR Journal 2025 Journal Article

Prominent Roles of Conditionally Invariant Components in Domain Adaptation: Theory and Algorithms

  • Keru Wu
  • Yuansi Chen
  • Wooseok Ha
  • Bin Yu

Domain adaptation (DA) is a statistical learning problem that arises when the distribution of the source data used to train a model differs from that of the target data used to evaluate the model. While many DA algorithms have demonstrated considerable empirical success, blindly applying these algorithms can often lead to worse performance on new datasets. To address this, it is crucial to clarify the assumptions under which a DA algorithm has good target performance. In this work, we focus on the assumption of the presence of conditionally invariant components (CICs), which are relevant for prediction and remain conditionally invariant across the source and target data. We demonstrate that CICs, which can be estimated through conditional invariant penalty (CIP), play three prominent roles in providing target risk guarantees in DA. First, we propose a new algorithm based on CICs, importance-weighted conditional invariant penalty (IW-CIP), which has target risk guarantees beyond simple settings such as covariate shift and label shift. Second, we show that CICs help identify large discrepancies between source and target risks of other DA algorithms. Finally, we demonstrate that incorporating CICs into the domain invariant projection (DIP) algorithm can address its failure scenario caused by label-flipping features. We support our new algorithms and theoretical findings via numerical experiments on synthetic data, MNIST, CelebA, Camelyon17, and DomainNet datasets.

NeurIPS Conference 2025 Conference Paper

ProxySPEX: Inference-Efficient Interpretability via Sparse Feature Interactions in LLMs

  • Landon Butler
  • Abhineet Agarwal
  • Justin Kang
  • Yigit Efe Erginbas
  • Bin Yu
  • Kannan Ramchandran

Large Language Models (LLMs) have achieved remarkable performance by capturing complex interactions between input features. To identify these interactions, most existing approaches require enumerating all possible combinations of features up to a given order, causing them to scale poorly with the number of inputs $n$. Recently, Kang et al. (2025) proposed SPEX, an information-theoretic approach that uses interaction sparsity to scale to $n \approx 10^3$ features. SPEX greatly improves upon prior methods but requires tens of thousands of model inferences, which can be prohibitive for large models. In this paper, we observe that LLM feature interactions are often *hierarchical*—higher-order interactions are accompanied by their lower-order subsets—which enables more efficient discovery. To exploit this hierarchy, we propose ProxySPEX, an interaction attribution algorithm that first fits gradient boosted trees to masked LLM outputs and then extracts the important interactions. Experiments across four challenging high-dimensional datasets show that ProxySPEX reconstructs LLM outputs 20% more faithfully than marginal attribution approaches while using *$10\times$ fewer inferences* than SPEX. By accounting for interactions, ProxySPEX efficiently identifies the most influential features, providing a scalable approximation of their Shapley values. Further, we apply ProxySPEX to two interpretability tasks: *data attribution*, where we identify interactions among CIFAR-10 training samples that influence test predictions, and *mechanistic interpretability*, where we uncover interactions between attention heads, both within and across layers, on a question-answering task. The ProxySPEX algorithm is available at.

JMLR Journal 2025 Journal Article

The Effect of SGD Batch Size on Autoencoder Learning: Sparsity, Sharpness, and Feature Learning

  • Nikhil Ghosh
  • Spencer Frei
  • Wooseok Ha
  • Bin Yu

In this work, we investigate the dynamics of stochastic gradient descent (SGD) when training a single-neuron autoencoder with linear or ReLU activation on orthogonal data. We show that for this non-convex problem, randomly initialized SGD with a constant step size successfully finds a global minimum for any batch size choice. However, the particular global minimum found depends upon the batch size. In the full-batch setting, we show that the solution is dense (i.e., not sparse) and is highly aligned with its initialized direction, showing that relatively little feature learning occurs. On the other hand, for any batch size strictly smaller than the number of samples, SGD finds a global minimum that is sparse and nearly orthogonal to its initialization, showing that the randomness of stochastic gradients induces a qualitatively different type of "feature selection" in this setting. Moreover, if we measure the sharpness of the minimum by the trace of the Hessian, the minima found with full-batch gradient descent are flatter than those found with strictly smaller batch sizes, in contrast to previous works which suggest that large batches lead to sharper minima. To prove convergence of SGD with a constant step size, we introduce a powerful tool from the theory of non-homogeneous random walks which may be of independent interest.
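
A toy sketch of the setting, assuming a weight-tied single-neuron linear autoencoder $\hat{x} = w\,(w^\top x)$ trained on orthogonal data (not necessarily the paper's exact parameterization; step size, step count, and the sparsity threshold are all illustrative), comparing full-batch updates with batch size 1:

```python
# Toy sketch (illustrative only): a single-neuron weight-tied linear autoencoder
# x_hat = w (w^T x), trained on orthogonal data. Full-batch updates vs. batch-size-1
# SGD from the same initialization; we compare alignment with the initialization
# and how many coordinates of the final weight are large.
import numpy as np

rng = np.random.default_rng(1)
n = 20
X = np.eye(n)                     # n orthogonal samples in R^n
w0 = rng.normal(size=n)
w0 /= np.linalg.norm(w0)

def grad(w, batch):
    g = np.zeros_like(w)
    for x in batch:
        r = w * (w @ x) - x       # reconstruction residual for this sample
        g += 2 * ((w @ x) * r + (r @ w) * x)
    return g / len(batch)

def train(batch_size, steps=20000, lr=0.05):
    w = w0.copy()
    for _ in range(steps):
        idx = rng.choice(n, size=batch_size, replace=False)
        w -= lr * grad(w, X[idx])
    return w

for bs in (n, 1):                 # full batch vs. batch size 1
    w = train(bs)
    print(f"batch={bs:2d}  alignment with init={abs(w0 @ w) / np.linalg.norm(w):.2f}  "
          f"#large coords={(np.abs(w) > 0.1).sum()}")
```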

NeurIPS Conference 2024 Conference Paper

The Impact of Initialization on LoRA Finetuning Dynamics

  • Soufiane Hayou
  • Nikhil Ghosh
  • Bin Yu

In this paper, we study the role of initialization in Low Rank Adaptation (LoRA) as originally introduced in Hu et al. (2021). Essentially, to start from the pretrained model, one can either initialize $B$ to zero and $A$ to random, or vice-versa. In both cases, the product $BA$ is equal to zero at initialization, which ensures that finetuning starts from the pretrained model. These two initialization schemes are seemingly similar, and should in principle yield the same performance and share the same optimal learning rate. We demonstrate that this is an *incorrect intuition* and that the first scheme (of initializing $B$ to zero and $A$ to random) on average in our experiments yields better performance compared to the other scheme. Our theoretical analysis shows that the reason behind this might be that the first initialization allows the use of larger learning rates (without causing output instability) compared to the second initialization, resulting in more efficient learning under the first scheme. We validate our results with extensive experiments on LLMs.
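
A minimal sketch of the two initialization schemes discussed above (shapes and Gaussian scales are illustrative, not the paper's recommended settings):

```python
# Sketch of the two LoRA initialization schemes discussed above. In both cases
# BA = 0 at initialization, so finetuning starts exactly from the pretrained
# weights W0; shapes and the Gaussian scales are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, r = 64, 64, 8
W0 = rng.normal(size=(d_out, d_in))          # frozen pretrained weight

# Scheme 1: B zero, A random (the scheme the abstract reports as better on average)
A1 = rng.normal(size=(r, d_in)) / np.sqrt(d_in)
B1 = np.zeros((d_out, r))

# Scheme 2: A zero, B random
A2 = np.zeros((r, d_in))
B2 = rng.normal(size=(d_out, r)) / np.sqrt(r)

for name, A, B in [("scheme 1", A1, B1), ("scheme 2", A2, B2)]:
    W = W0 + B @ A                           # adapted weight used in the forward pass
    print(name, "max |BA| at init:", np.abs(B @ A).max(), "equals W0:", np.allclose(W, W0))
```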

NeurIPS Conference 2023 Conference Paper

Bridging Discrete and Backpropagation: Straight-Through and Beyond

  • Liyuan Liu
  • Chengyu Dong
  • Xiaodong Liu
  • Bin Yu
  • Jianfeng Gao

Backpropagation, the cornerstone of deep learning, is limited to computing gradients for continuous variables. This limitation poses challenges for problems involving discrete latent variables. To address this issue, we propose a novel approach to approximate the gradient of parameters involved in generating discrete latent variables. First, we examine the widely used Straight-Through (ST) heuristic and demonstrate that it works as a first-order approximation of the gradient. Guided by our findings, we propose ReinMax, which achieves second-order accuracy by integrating Heun’s method, a second-order numerical method for solving ODEs. ReinMax does not require the Hessian or other second-order derivatives, and thus incurs negligible computational overhead. Extensive experimental results on various tasks demonstrate the superiority of ReinMax over the state of the art.
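
For reference, a minimal PyTorch sketch of the Straight-Through heuristic analyzed above; ReinMax's second-order correction is not reproduced here, and the downstream loss is a placeholder:

```python
# Minimal PyTorch sketch of the Straight-Through (ST) heuristic: the forward pass
# uses a hard (discrete) sample, while the backward pass routes gradients through
# the softmax probabilities. ReinMax's second-order correction is not shown.
import torch
import torch.nn.functional as F

def straight_through_sample(logits):
    probs = F.softmax(logits, dim=-1)
    index = torch.multinomial(probs, num_samples=1)
    hard = F.one_hot(index.squeeze(-1), num_classes=logits.shape[-1]).float()
    # Forward value is `hard`; the gradient flows as if the output were `probs`.
    return hard + probs - probs.detach()

logits = torch.zeros(4, 3, requires_grad=True)
sample = straight_through_sample(logits)
loss = (sample * torch.arange(3.0)).sum()    # toy downstream loss
loss.backward()
print(sample.argmax(dim=-1), logits.grad.shape)
```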

JMLR Journal 2023 Journal Article

Revisiting minimum description length complexity in overparameterized models

  • Raaz Dwivedi
  • Chandan Singh
  • Bin Yu
  • Martin Wainwright

Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified for overparameterized settings when the number of parameters is more than the number of training samples. We revisit complexity measures based on Rissanen's principle of minimum description length (MDL) and define a novel MDL-based complexity (MDL-COMP) that remains valid for overparameterized models. MDL-COMP is defined via an optimality criterion over the encodings induced by a good Ridge estimator class. We provide an extensive theoretical characterization of MDL-COMP for linear models and kernel methods and show that it is not just a function of parameter count, but rather a function of the singular values of the design or the kernel matrix and the signal-to-noise ratio. For a linear model with $n$ observations, $d$ parameters, and i.i.d. Gaussian predictors, MDL-COMP scales linearly with $d$ when $d < n$. For kernel methods, we show that MDL-COMP informs minimax in-sample error, and can decrease as the dimensionality of the input increases. We also prove that MDL-COMP upper bounds the in-sample mean squared error (MSE). Via an array of simulations and real-data experiments, we show that a data-driven Prac-MDL-COMP informs hyper-parameter tuning for optimizing test MSE with ridge regression in limited data settings, sometimes improving upon cross-validation and (always) saving computational costs. Finally, our findings also suggest that the recently observed double descent phenomenon in overparameterized models might be a consequence of the choice of non-ideal estimators.

NeurIPS Conference 2021 Conference Paper

Adaptive wavelet distillation from neural networks through interpretations

  • Wooseok Ha
  • Chandan Singh
  • Francois Lanusse
  • Srigokul Upadhyayula
  • Bin Yu

Recent deep-learning models have achieved impressive prediction performance, but often sacrifice interpretability and computational efficiency. Interpretability is crucial in many disciplines, such as science and medicine, where models must be carefully vetted or where interpretation is the goal itself. Moreover, interpretable models are concise and often yield computational efficiency. Here, we propose adaptive wavelet distillation (AWD), a method which aims to distill information from a trained neural network into a wavelet transform. Specifically, AWD penalizes feature attributions of a neural network in the wavelet domain to learn an effective multi-resolution wavelet transform. The resulting model is highly predictive, concise, computationally efficient, and has properties (such as a multi-scale structure) which make it easy to interpret. In close collaboration with domain experts, we showcase how AWD addresses challenges in two real-world settings: cosmological parameter inference and molecular-partner prediction. In both cases, AWD yields a scientifically interpretable and concise model which gives predictive performance better than state-of-the-art neural networks. Moreover, AWD identifies predictive features that are scientifically meaningful in the context of respective domains. All code and models are released in a full-fledged package available on Github.

JMLR Journal 2020 Journal Article

Fast mixing of Metropolized Hamiltonian Monte Carlo: Benefits of multi-step gradients

  • Yuansi Chen
  • Raaz Dwivedi
  • Martin J. Wainwright
  • Bin Yu

Hamiltonian Monte Carlo (HMC) is a state-of-the-art Markov chain Monte Carlo sampling algorithm for drawing samples from smooth probability densities over continuous spaces. We study the variant most widely used in practice, Metropolized HMC with the Störmer-Verlet or leapfrog integrator, and make two primary contributions. First, we provide a non-asymptotic upper bound on the mixing time of Metropolized HMC with explicit choices of step-size and number of leapfrog steps. This bound gives a precise quantification of the faster convergence of Metropolized HMC relative to simpler MCMC algorithms such as the Metropolized random walk or the Metropolized Langevin algorithm. Second, we provide a general framework for sharpening mixing time bounds of Markov chains initialized at a substantial distance from the target distribution over continuous spaces. We apply this sharpening device to the Metropolized random walk and Langevin algorithms, thereby obtaining improved mixing time bounds from a non-warm initial distribution.
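
An illustrative sketch of Metropolized HMC with the leapfrog (Störmer-Verlet) integrator on a standard Gaussian target; the step size and number of leapfrog steps are arbitrary rather than the tuned choices analyzed in the paper:

```python
# Illustrative sketch of Metropolized HMC with the leapfrog integrator on a simple
# Gaussian target; step size and number of leapfrog steps are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d = 10
def neg_log_density(x): return 0.5 * x @ x          # standard Gaussian target
def grad_U(x):          return x

def hmc_step(x, step=0.2, n_leapfrog=10):
    p = rng.normal(size=d)
    x_new, p_new = x.copy(), p.copy()
    p_new -= 0.5 * step * grad_U(x_new)             # leapfrog: half momentum step
    for _ in range(n_leapfrog - 1):
        x_new += step * p_new                       # full position step
        p_new -= step * grad_U(x_new)               # full momentum step
    x_new += step * p_new
    p_new -= 0.5 * step * grad_U(x_new)             # final half momentum step
    # Metropolis accept-reject on the joint (position, momentum) energy
    log_accept = (neg_log_density(x) + 0.5 * p @ p) - (neg_log_density(x_new) + 0.5 * p_new @ p_new)
    return x_new if np.log(rng.random()) < log_accept else x

x = np.full(d, 3.0)                                 # start far from the mode
samples = []
for _ in range(2000):
    x = hmc_step(x)
    samples.append(x.copy())
print("sample mean norm (should be near 0):", np.linalg.norm(np.mean(samples[500:], axis=0)))
```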

JMLR Journal 2020 Journal Article

Unique Sharp Local Minimum in L1-minimization Complete Dictionary Learning

  • Yu Wang
  • Siqi Wu
  • Bin Yu

We study the problem of globally recovering a dictionary from a set of signals via $\ell_1$-minimization. We assume that the signals are generated as i.i.d. random linear combinations of the $K$ atoms from a complete reference dictionary $D^*\in \mathbb R^{K\times K}$, where the linear combination coefficients are from either a Bernoulli-type model or an exact sparse model. First, we obtain a necessary and sufficient norm condition for the reference dictionary $D^*$ to be a sharp local minimum of the expected $\ell_1$ objective function. Our result substantially extends that of Wu and Yu (2015) and allows the combination coefficient to be non-negative. Secondly, we obtain an explicit bound on the region within which the objective value of the reference dictionary is minimal. Thirdly, we show that the reference dictionary is the unique sharp local minimum, thus establishing the first known global property of $\ell_1$-minimization dictionary learning. Motivated by the theoretical results, we introduce a perturbation based test to determine whether a dictionary is a sharp local minimum of the objective function. In addition, we also propose a new dictionary learning algorithm based on Block Coordinate Descent, called DL-BCD, which is guaranteed to decrease the objective function monotonically. Simulation studies show that DL-BCD has competitive performance in terms of recovery rate compared to other state-of-the-art dictionary learning algorithms when the reference dictionary is generated from random Gaussian matrices.

NeurIPS Conference 2019 Conference Paper

A Debiased MDI Feature Importance Measure for Random Forests

  • Xiao Li
  • Yu Wang
  • Sumanta Basu
  • Karl Kumbier
  • Bin Yu

Tree ensembles such as Random Forests have achieved impressive empirical success across a wide variety of applications. To understand how these models make predictions, people routinely turn to feature importance measures calculated from tree ensembles. It has long been known that Mean Decrease Impurity (MDI), one of the most widely used measures of feature importance, incorrectly assigns high importance to noisy features, leading to systematic bias in feature selection. In this paper, we address the feature selection bias of MDI from both theoretical and methodological perspectives. Based on the original definition of MDI by Breiman et al. (1984) for a single tree, we derive a tight non-asymptotic bound on the expected bias of MDI importance of noisy features, showing that deep trees have higher (expected) feature selection bias than shallow ones. However, it is not clear how to reduce the bias of MDI using its existing analytical expression. We derive a new analytical expression for MDI, and based on this new expression, we are able to propose a debiased MDI feature importance measure using out-of-bag samples, called MDI-oob. For both the simulated data and a genomic ChIP dataset, MDI-oob achieves state-of-the-art performance in feature selection from Random Forests for both deep and shallow trees.
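
A small scikit-learn sketch of the bias being addressed (MDI-oob itself is not implemented here; the data-generating process and sizes are illustrative): a pure-noise, high-cardinality feature can receive nontrivial MDI importance.

```python
# Small sketch of the MDI bias discussed above: a pure-noise feature with many
# categories receives nontrivial MDI importance from a random forest. MDI-oob
# itself is not implemented here; all sizes are illustrative.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 2000
signal = rng.normal(size=n)                      # one informative feature
noise_cont = rng.normal(size=n)                  # uninformative continuous feature
noise_cat = rng.integers(0, 100, size=n)         # uninformative high-cardinality feature
X = np.column_stack([signal, noise_cont, noise_cat])
y = (signal + 0.5 * rng.normal(size=n) > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(["signal", "noise (continuous)", "noise (100 categories)"],
                     forest.feature_importances_):
    print(f"{name:>24s}: MDI = {imp:.3f}")
```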

JMLR Journal 2019 Journal Article

Log-concave sampling: Metropolis-Hastings algorithms are fast

  • Raaz Dwivedi
  • Yuansi Chen
  • Martin J. Wainwright
  • Bin Yu

We study the problem of sampling from a strongly log-concave density supported on $\mathbb{R}^d$, and prove a non-asymptotic upper bound on the mixing time of the Metropolis-adjusted Langevin algorithm (MALA). The method draws samples by simulating a Markov chain obtained from the discretization of an appropriate Langevin diffusion, combined with an accept-reject step. Relative to known guarantees for the unadjusted Langevin algorithm (ULA), our bounds show that the use of an accept-reject step in MALA leads to an exponentially improved dependence on the error-tolerance. Concretely, in order to obtain samples with TV error at most $\delta$ for a density with condition number $\kappa$, we show that MALA requires $\mathcal{O} (\kappa d \log(1/\delta) )$ steps from a warm start, as compared to the $\mathcal{O} (\kappa^2 d/\delta^2 )$ steps established in past work on ULA. We also demonstrate the gains of a modified version of MALA over ULA for weakly log-concave densities. Furthermore, we derive mixing time bounds for the Metropolized random walk (MRW), showing that its mixing time is a factor of $\mathcal{O}(\kappa)$ slower than that of MALA. We provide numerical examples that support our theoretical findings, and demonstrate the benefits of Metropolis-Hastings adjustment for Langevin-type sampling algorithms.
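
An illustrative MALA sketch, assuming a simple standard Gaussian target and an arbitrary step size rather than the theoretically prescribed choices: a Langevin proposal followed by a Metropolis-Hastings correction.

```python
# Illustrative MALA sketch for a log-concave target: a Langevin proposal followed
# by a Metropolis-Hastings correction. The step size is arbitrary, not the tuned
# choice from the paper's analysis.
import numpy as np

rng = np.random.default_rng(0)
d = 20
def U(x):      return 0.5 * x @ x        # negative log-density (standard Gaussian)
def grad_U(x): return x

def mala_step(x, step=0.1):
    noise = rng.normal(size=d)
    prop = x - step * grad_U(x) + np.sqrt(2 * step) * noise
    # log proposal densities q(a | b), up to a common constant
    def log_q(a, b):
        return -np.sum((a - b + step * grad_U(b)) ** 2) / (4 * step)
    log_accept = (U(x) - U(prop)) + (log_q(x, prop) - log_q(prop, x))
    return prop if np.log(rng.random()) < log_accept else x

x = np.full(d, 5.0)                      # start far from the mode
for _ in range(5000):
    x = mala_step(x)
print("final norm (roughly sqrt(d) for a standard Gaussian):", np.linalg.norm(x))
```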

JMLR Journal 2018 Journal Article

Fast MCMC Sampling Algorithms on Polytopes

  • Yuansi Chen
  • Raaz Dwivedi
  • Martin J. Wainwright
  • Bin Yu

We propose and analyze two new MCMC sampling algorithms, the Vaidya walk and the John walk, for generating samples from the uniform distribution over a polytope. Both random walks are sampling algorithms derived from interior point methods. The former is based on the volumetric-logarithmic barrier introduced by Vaidya, whereas the latter uses John's ellipsoids. We show that the Vaidya walk mixes in significantly fewer steps than the logarithmic-barrier based Dikin walk studied in past work. For a polytope in $\mathbb{R}^d$ defined by $n > d$ linear constraints, we show that the mixing time from a warm start is bounded as $\mathcal{O}(n^{0.5}d^{1.5})$, compared to the $\mathcal{O}(nd)$ mixing time bound for the Dikin walk. The cost of each step of the Vaidya walk is of the same order as the Dikin walk, and at most twice as large in terms of constant pre-factors. For the John walk, we prove an $\mathcal{O}(d^{2.5}\cdot\log^4(n/d))$ bound on its mixing time and conjecture that an improved variant of it could achieve a mixing time of $\mathcal{O}(d^{2}\cdot\text{poly-log}(n/d))$. Additionally, we propose variants of the Vaidya and John walks that mix in polynomial time from a deterministic starting point. The speed-up of the Vaidya walk over the Dikin walk is illustrated in numerical examples.

JMLR Journal 2018 Journal Article

Local Identifiability of $\ell_1$-minimization Dictionary Learning: a Sufficient and Almost Necessary Condition

  • Siqi Wu
  • Bin Yu

We study the theoretical properties of learning a dictionary from $N$ signals $\mathbf{x}_i\in \mathbb R^K$ for $i=1,\ldots,N$ via $\ell_1$-minimization. We assume that $\mathbf{x}_i$'s are i.i.d. random linear combinations of the $K$ columns from a complete (i.e., square and invertible) reference dictionary $\mathbf{D}_0 \in \mathbb R^{K\times K}$. Here, the random linear coefficients are generated from either the $s$-sparse Gaussian model or the Bernoulli-Gaussian model. First, for the population case, we establish a sufficient and almost necessary condition for the reference dictionary $\mathbf{D}_0$ to be locally identifiable, i.e., a strict local minimum of the expected $\ell_1$-norm objective function. Our condition covers both sparse and dense cases of the random linear coefficients and significantly improves the sufficient condition by Gribonval and Schnass (2010). In addition, we show that for a complete $\mu$-coherent reference dictionary, i.e., a dictionary with absolute pairwise column inner-product at most $\mu\in[0,1)$, local identifiability holds even when the random linear coefficient vector has up to $O(\mu^{-2})$ nonzero entries. Moreover, our local identifiability results also translate to the finite sample case with high probability provided that the number of signals $N$ scales as $O(K\log K)$.

JMLR Journal 2015 Journal Article

A Statistical Perspective on Algorithmic Leveraging

  • Ping Ma
  • Michael W. Mahoney
  • Bin Yu

One popular method for dealing with large-scale data sets is sampling. For example, by using the empirical statistical leverage scores as an importance sampling distribution, the method of algorithmic leveraging samples and rescales rows/columns of data matrices to reduce the data size before performing computations on the subproblem. This method has been successful in improving computational efficiency of algorithms for matrix problems such as least-squares approximation, least absolute deviations approximation, and low-rank matrix approximation. Existing work has focused on algorithmic issues such as worst-case running times and numerical issues associated with providing high-quality implementations, but none of it addresses statistical aspects of this method. In this paper, we provide a simple yet effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model with a fixed number of predictors. In particular, for several versions of leverage-based sampling, we derive results for the bias and variance, both conditional and unconditional on the observed data. We show that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other. This result is particularly striking, given the well-known result that, from the algorithmic perspective of worst-case analysis, leverage-based sampling provides uniformly superior worst-case algorithmic results, when compared with uniform sampling. Based on these theoretical results, we propose and analyze two new leveraging algorithms: one constructs a smaller least-squares problem with "shrinkage" leverage scores (SLEV), and the other solves a smaller and unweighted (or biased) least-squares problem (LEVUNW). A detailed empirical evaluation of existing leverage-based methods as well as these two new methods is carried out on both synthetic and real data sets. The empirical results indicate that our theory is a good predictor of practical performance of existing and new leverage-based algorithms and that the new algorithms achieve improved performance. For example, with the same computation reduction as in the original algorithmic leveraging approach, our proposed SLEV typically leads to improved biases and variances both unconditionally and conditionally (on the observed data), and our proposed LEVUNW typically yields improved unconditional biases and variances.
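
A minimal numpy sketch of basic leverage-based subsampling for least squares (the SLEV and LEVUNW variants proposed in the paper are not reproduced; sizes and the subsample fraction are illustrative):

```python
# Sketch of leverage-based subsampling for least squares: sample rows with
# probability proportional to their leverage scores, rescale, and solve the
# smaller problem. Sizes and the subsample size are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 10000, 10, 500                       # n rows, p predictors, r subsamples
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y = X @ beta + rng.normal(size=n)

# Leverage scores are the squared row norms of U from the thin SVD of X.
U, _, _ = np.linalg.svd(X, full_matrices=False)
lev = np.sum(U ** 2, axis=1)
probs = lev / lev.sum()

idx = rng.choice(n, size=r, replace=True, p=probs)
w = 1.0 / (r * probs[idx])                     # importance-sampling rescaling
Xw = X[idx] * np.sqrt(w)[:, None]
yw = y[idx] * np.sqrt(w)
beta_lev, *_ = np.linalg.lstsq(Xw, yw, rcond=None)

beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print("||beta_lev - beta_full|| =", np.linalg.norm(beta_lev - beta_full))
```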

JMLR Journal 2015 Journal Article

Counting and Exploring Sizes of Markov Equivalence Classes of Directed Acyclic Graphs

  • Yangbo He
  • Jinzhu Jia
  • Bin Yu

When learning a directed acyclic graph (DAG) model via observational data, one generally cannot identify the underlying DAG, but can potentially obtain a Markov equivalence class. The size (the number of DAGs) of a Markov equivalence class is crucial to infer causal effects or to learn the exact causal DAG via further interventions. Given a set of Markov equivalence classes, the distribution of their sizes is a key consideration in developing learning methods. However, counting the size of an equivalence class with many vertices is usually computationally infeasible, and the existing literature reports the size distributions only for equivalence classes with ten or fewer vertices. In this paper, we develop a method to compute the size of a Markov equivalence class. We first show that there are five types of Markov equivalence classes whose sizes can be formulated as five functions of the number of vertices respectively. Then we introduce a new concept of a rooted sub-class. The graph representations of rooted sub-classes of a Markov equivalence class are used to partition this class recursively until the sizes of all rooted sub-classes can be computed via the five functions. The proposed size counting is efficient for Markov equivalence classes of sparse DAGs with hundreds of vertices. Finally, we explore the size and edge distributions of Markov equivalence classes and find experimentally that, in general, (1) most Markov equivalence classes are half completed and their average sizes are small, and (2) the sizes of sparse classes grow approximately exponentially with the numbers of vertices.

JMLR Journal 2014 Journal Article

Early Stopping and Non-parametric Regression: An Optimal Data-dependent Stopping Rule

  • Garvesh Raskutti
  • Martin J. Wainwright
  • Bin Yu

Early stopping is a form of regularization based on choosing when to stop running an iterative algorithm. Focusing on non-parametric regression in a reproducing kernel Hilbert space, we analyze the early stopping strategy for a form of gradient descent applied to the least-squares loss function. We propose a data-dependent stopping rule that does not involve hold-out or cross-validation data, and we prove upper bounds on the squared error of the resulting function estimate, measured in either the $L^2(\mathbb{P})$ or $L^2(\mathbb{P}_n)$ norm. These upper bounds lead to minimax-optimal rates for various kernel classes, including Sobolev smoothness classes and other forms of reproducing kernel Hilbert spaces. We show through simulation that our stopping rule compares favorably to two other stopping rules, one based on hold-out data and the other based on Stein's unbiased risk estimate. We also establish a tight connection between our early stopping strategy and the solution path of a kernel ridge regression estimator.

JMLR Journal 2013 Journal Article

Supervised Feature Selection in Graphs with Path Coding Penalties and Network Flows

  • Julien Mairal
  • Bin Yu

We consider supervised learning problems where the features are embedded in a graph, such as gene expressions in a gene network. In this context, it is of much interest to automatically select a subgraph with few connected components; by exploiting prior knowledge, one can indeed improve the prediction performance or obtain results that are easier to interpret. Regularization or penalty functions for selecting features in graphs have recently been proposed, but they raise new algorithmic challenges. For example, they typically require solving a combinatorially hard selection problem among all connected subgraphs. In this paper, we propose computationally feasible strategies to select a sparse and well-connected subset of features sitting on a directed acyclic graph (DAG). We introduce structured sparsity penalties over paths on a DAG called “path coding” penalties. Unlike existing regularization functions that model long-range interactions between features in a graph, path coding penalties are tractable. The penalties and their proximal operators involve path selection problems, which we efficiently solve by leveraging network flow optimization. We experimentally show on synthetic, image, and genomic data that our approach is scalable and leads to more connected subgraphs than other regularization functions for graphs.

AAAI Conference 2010 Conference Paper

Multi-Task Sparse Discriminant Analysis (MtSDA) with Overlapping Categories

  • Yahong Han
  • Fei Wu
  • Jinzhu Jia
  • Yueting Zhuang
  • Bin Yu

Multi-task learning aims at combining information across tasks to boost prediction performance, especially when the number of training samples is small and the number of predictors is very large. In this paper, we first extend the Sparse Discriminant Analysis (SDA) of Clemmensen et al. We call this Multi-task Sparse Discriminant Analysis (MtSDA). MtSDA formulates multi-label prediction as a quadratic optimization problem, whereas SDA obtains single labels via a nearest class mean rule. Second, we propose a class of equicorrelation matrices to use in MtSDA, which includes the identity matrix. MtSDA with both matrices is compared with single-task learning (SVM and LDA+SVM) and multi-task learning (HSML). The comparisons are made on real data sets in terms of AUC and F-measure. The results show that MtSDA outperforms the other methods substantially almost all the time, and in some cases MtSDA with the equicorrelation matrix substantially outperforms MtSDA with the identity matrix.

NeurIPS Conference 2010 Conference Paper

Predicting Execution Time of Computer Programs Using Sparse Polynomial Regression

  • Ling Huang
  • Jinzhu Jia
  • Bin Yu
  • Byung-Gon Chun
  • Petros Maniatis
  • Mayur Naik

Predicting the execution time of computer programs is an important but challenging problem in the community of computer systems. Existing methods require experts to perform detailed analysis of program code in order to construct predictors or select important features. We recently developed a new system to automatically extract a large number of features from program execution on sample inputs, on which prediction models can be constructed without expert knowledge. In this paper we study the construction of predictive models for this problem. We propose the SPORE (Sparse POlynomial REgression) methodology to build accurate prediction models of program performance using feature data collected from program execution on sample inputs. Our two SPORE algorithms are able to build relationships between responses (e.g., the execution time of a computer program) and features, and select a few of the hundreds of retrieved features to construct an explicitly sparse and non-linear model to predict the response variable. The compact and explicitly polynomial form of the estimated model could reveal important insights into the computer program (e.g., features and their non-linear combinations that dominate the execution time), enabling a better understanding of the program’s behavior. Our evaluation on three widely used computer programs shows that SPORE methods can give accurate prediction with relative error less than 7% by using a moderate number of training data samples. In addition, we compare SPORE algorithms to state-of-the-art sparse regression algorithms, and show that SPORE methods, motivated by real applications, outperform the other methods in terms of both interpretability and prediction accuracy.
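
Not the SPORE algorithms themselves, but a short scikit-learn sketch of the general recipe the abstract describes, with synthetic stand-in data: expand features into polynomial terms and select a sparse subset with an $\ell_1$ penalty.

```python
# Not the SPORE algorithms themselves; just a sketch of the general recipe the
# abstract describes: expand program features into polynomial terms, then select
# a sparse subset with an l1 penalty. The data here is synthetic and illustrative.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p = 300, 20
X = rng.uniform(size=(n, p))                       # stand-in for extracted program features
runtime = 3.0 * X[:, 0] + 2.0 * X[:, 1] * X[:, 2] + 0.1 * rng.normal(size=n)

poly = PolynomialFeatures(degree=2, include_bias=False)
Z = poly.fit_transform(X)                          # all terms of degree at most 2
model = Lasso(alpha=0.01, max_iter=10000).fit(Z, runtime)

names = poly.get_feature_names_out([f"x{i}" for i in range(p)])
selected = [(nm, round(c, 2)) for nm, c in zip(names, model.coef_) if abs(c) > 1e-3]
print("selected terms:", selected)
```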

JMLR Journal 2010 Journal Article

Restricted Eigenvalue Properties for Correlated Gaussian Designs

  • Garvesh Raskutti
  • Martin J. Wainwright
  • Bin Yu

Methods based on $\ell_1$-relaxation, such as basis pursuit and the Lasso, are very popular for sparse regression in high dimensions. The conditions for success of these methods are now well-understood: (1) exact recovery in the noiseless setting is possible if and only if the design matrix $X$ satisfies the restricted nullspace property, and (2) the squared $\ell_2$-error of a Lasso estimate decays at the minimax optimal rate $k \log p / n$, where $k$ is the sparsity of the $p$-dimensional regression problem with additive Gaussian noise, whenever the design satisfies a restricted eigenvalue condition. The key issue is thus to determine when the design matrix $X$ satisfies these desirable properties. Thus far, there have been numerous results showing that the restricted isometry property, which implies both the restricted nullspace and eigenvalue conditions, is satisfied when all entries of $X$ are independent and identically distributed (i.i.d.), or the rows are unitary. This paper proves directly that the restricted nullspace and eigenvalue conditions hold with high probability for quite general classes of Gaussian matrices for which the predictors may be highly dependent, and hence restricted isometry conditions can be violated with high probability. In this way, our results extend the attractive theoretical guarantees on $\ell_1$-relaxations to a much broader class of problems than the case of completely independent or unitary designs.

NeurIPS Conference 2009 Conference Paper

A unified framework for high-dimensional analysis of $M$-estimators with decomposable regularizers

  • Sahand Negahban
  • Bin Yu
  • Martin Wainwright
  • Pradeep Ravikumar

The estimation of high-dimensional parametric models requires imposing some structure on the models, for instance that they be sparse, or that matrix-structured parameters have low rank. A general approach for such structured parametric model estimation is to use regularized M-estimation procedures, which regularize a loss function that measures goodness of fit of the parameters to the data with some regularization function that encourages the assumed structure. In this paper, we aim to provide a unified analysis of such regularized M-estimation procedures. In particular, we report the convergence rates of such estimators in any metric norm. Using just our main theorem, we are able to rederive some of the many existing results, but also obtain a wide range of novel convergence rate results. Our analysis also identifies key properties of loss and regularization functions, such as restricted strong convexity and decomposability, that ensure the corresponding regularized M-estimators have good convergence rates.

NeurIPS Conference 2009 Conference Paper

Lower bounds on minimax rates for nonparametric regression with additive sparsity and smoothness

  • Garvesh Raskutti
  • Bin Yu
  • Martin Wainwright

This paper uses information-theoretic techniques to determine minimax rates for estimating nonparametric sparse additive regression models under high-dimensional scaling. We assume an additive decomposition of the form $f^*(X_1, \ldots, X_p) = \sum_{j \in S} h_j(X_j)$, where each component function $h_j$ lies in some Hilbert space $\mathcal{H}$ and $S \subset \{1, \ldots, p\}$ is an unknown subset with cardinality $s = |S|$. Given $n$ i.i.d. observations of $f^*(X)$ corrupted with white Gaussian noise, where the covariate vectors $(X_1, X_2, X_3, \ldots, X_p)$ are drawn with i.i.d. components from some distribution $\mathbb{P}$, we determine tight lower bounds on the minimax rate for estimating the regression function with respect to squared $L^2(\mathbb{P})$ error. The main result shows that the minimax rate is $\max\big(\frac{s \log(p/s)}{n}, \epsilon_n^2(s)\big)$. The first term reflects the difficulty of performing subset selection and is independent of the Hilbert space $\mathcal{H}$; the second term, $\epsilon_n^2(s)$, is an $s$-dimensional estimation term, depending only on the low dimension $s$ but not the ambient dimension $p$, that captures the difficulty of estimating a sum of $s$ univariate functions in the Hilbert space $\mathcal{H}$. As a special case, if $\mathcal{H}$ corresponds to the $m$-th order Sobolev space of functions that are $m$-times differentiable, the $s$-dimensional estimation term takes the form $\epsilon_n^2(s) \asymp s \, n^{-2m/(2m+1)}$. The minimax rates are compared with rates achieved by an $\ell_1$-penalty-based approach, and it can be shown that a certain $\ell_1$-based approach achieves the minimax-optimal rate.

NeurIPS Conference 2008 Conference Paper

Model Selection in Gaussian Graphical Models: High-Dimensional Consistency of $\ell_1$-regularized MLE

  • Garvesh Raskutti
  • Bin Yu
  • Martin Wainwright
  • Pradeep Ravikumar

We consider the problem of estimating the graph structure associated with a Gaussian Markov random field (GMRF) from i.i.d. samples. We study the performance of the $\ell_1$-regularized maximum likelihood estimator in the high-dimensional setting, where the number of nodes in the graph $p$, the number of edges in the graph $s$, and the maximum node degree $d$ are allowed to grow as a function of the number of samples $n$. Our main result provides sufficient conditions on $(n, p, d)$ for the $\ell_1$-regularized MLE estimator to recover all the edges of the graph with high probability. Under some conditions on the model covariance, we show that model selection can be achieved for sample sizes $n = \Omega(d^2 \log p)$, with the error decaying as $\mathcal{O}(\exp(-c \log p))$ for some constant $c$. We illustrate our theoretical results via simulations and show good correspondence between the theoretical predictions and behavior in simulations.
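
The $\ell_1$-regularized Gaussian MLE analyzed here is commonly known as the graphical lasso; a small scikit-learn sketch on a synthetic one-edge graph (the penalty level and sample size are illustrative):

```python
# Small sketch of the l1-regularized Gaussian MLE (graphical lasso) whose model
# selection behavior is analyzed above; the true graph, sample size, and penalty
# level are illustrative.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p = 8
precision = np.eye(p)
precision[0, 1] = precision[1, 0] = 0.4           # a single edge in the true graph
cov = np.linalg.inv(precision)
X = rng.multivariate_normal(np.zeros(p), cov, size=2000)

est = GraphicalLasso(alpha=0.05).fit(X)
recovered = (np.abs(est.precision_) > 1e-3) & ~np.eye(p, dtype=bool)
print("recovered edges:", np.argwhere(np.triu(recovered)))
```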

NeurIPS Conference 2008 Conference Paper

Nonparametric sparse hierarchical models describe V1 fMRI responses to natural images

  • Vincent Vu
  • Bin Yu
  • Thomas Naselaris
  • Kendrick Kay
  • Jack Gallant
  • Pradeep Ravikumar

We propose a novel hierarchical, nonlinear model that predicts brain activity in area V1 evoked by natural images. In the study reported here brain activity was measured by means of functional magnetic resonance imaging (fMRI), a noninvasive technique that provides an indirect measure of neural activity pooled over a small volume (~ 2mm cube) of brain tissue. Our model, which we call the SpAM V1 model, is based on the reasonable assumption that fMRI measurements reflect the (possibly nonlinearly) pooled, rectified output of a large population of simple and complex cells in V1. It has a hierarchical filtering stage that consists of three layers: model simple cells, model complex cells, and a third layer in which the complex cells are linearly pooled (called “pooled-complex” cells). The pooling stage then obtains the measured fMRI signals as a sparse additive model (SpAM) in which a sparse nonparametric (nonlinear) combination of model complex cell and model pooled-complex cell outputs are summed. Our results show that the SpAM V1 model predicts fMRI responses evoked by natural images better than a benchmark model that only provides linear pooling of model complex cells. Furthermore, the spatial receptive fields, frequency tuning and orientation tuning curves of the SpAM V1 model estimated for each voxel appear to be consistent with the known properties of V1, and with previous analyses of this data set. A visualization procedure applied to the SpAM V1 model shows that most of the nonlinear pooling consists of simple compressive or saturating nonlinearities.

AAMAS Conference 2007 Conference Paper

An Incentive Mechanism for Message Relaying in Unstructured Peer-to-Peer Systems

  • Cuihong Li
  • Bin Yu
  • Katia Sycara

Distributed message relaying is an important function of a peer-to-peer system to discover service providers. Existing search protocols in unstructured peer-to-peer systems either create a huge burden on communications or cause long response times. Moreover, these systems are also vulnerable to the free riding problem. In this paper we present an incentive mechanism that not only mitigates the free riding problem, but also achieves good system efficiency in message relaying for peer discovery. In this mechanism, promised rewards are passed along the message propagation process. A peer is rewarded if a service provider is found via a relaying path that includes this peer. We provide some analytic insights into the symmetric Nash equilibrium strategies of this game, and an approximate approach to calculate this equilibrium. Experiments show that this incentive mechanism brings a system utility generally higher than breadth-first search and random walks, based on both the estimated utility from our approximate equilibrium and the utility generated from learning in the incentive mechanism.

AAMAS Conference 2007 Conference Paper

Managing the Pedigree and Quality of Information in Dynamic Information Sharing Environments

  • Bin Yu
  • Srikanth Kallurkar
  • Ganesh Vaidyanathan
  • Donald Steiner

The quality of information is crucial for decision making in many mission-critical applications such as battlefield operations and intelligence analysis. However, as the system becomes larger and more diverse, it is becoming increasingly difficult to assess the quality of information from various operators or data sources. In this paper we propose an agent-based approach to managing the quality of information, e.g., its trustworthiness, in network-centric information sharing environments, where software agents collaborate with each other to automatically represent and assess the trustworthiness of information from its pedigree within the framework of Dempster-Shafer theory.

JMLR Journal 2007 Journal Article

Stagewise Lasso

  • Peng Zhao
  • Bin Yu

Many statistical machine learning algorithms minimize either an empirical loss function as in AdaBoost, or a penalized empirical loss as in Lasso or SVM. A single regularization tuning parameter controls the trade-off between fidelity to the data and generalizability, or equivalently between bias and variance. When this tuning parameter changes, a regularization "path" of solutions to the minimization problem is generated, and the whole path is needed to select a tuning parameter to optimize the prediction or interpretation performance. Algorithms such as homotopy-Lasso or LARS-Lasso and Forward Stagewise Fitting (FSF) (aka e-Boosting) are of great interest because of their resulting sparse models for interpretation in addition to prediction. In this paper, we propose the BLasso algorithm that ties the FSF (e-Boosting) algorithm with the Lasso method that minimizes the $L_1$-penalized $L_2$ loss. BLasso is derived as a coordinate descent method with a fixed stepsize applied to the general Lasso loss function ($L_1$-penalized convex loss). It consists of both a forward step and a backward step. The forward step is similar to e-Boosting or FSF, but the backward step is new and revises the FSF (or e-Boosting) path to approximate the Lasso path. In the cases of a finite number of base learners and a bounded Hessian of the loss function, the BLasso path is shown to converge to the Lasso path when the stepsize goes to zero. For cases with a larger number of base learners than the sample size and when the true model is sparse, our simulations indicate that the BLasso model estimates are sparser than those from FSF with comparable or slightly better prediction performance, and that the discrete stepsize of BLasso and FSF has an additional regularization effect in terms of prediction and sparsity. Moreover, we introduce the Generalized BLasso algorithm to minimize a general convex loss penalized by a general convex function. Since the (Generalized) BLasso relies only on differences not derivatives, we conclude that it provides a class of simple and easy-to-implement algorithms for tracing the regularization or solution paths of penalized minimization problems.

JMLR Journal 2006 Journal Article

On Model Selection Consistency of Lasso

  • Peng Zhao
  • Bin Yu

Sparsity or parsimony of statistical models is crucial for their proper interpretations, as in sciences and social sciences. Model selection is a commonly used method to find such models, but usually involves a computationally heavy combinatorial search. Lasso (Tibshirani, 1996) is now being used as a computationally feasible alternative to model selection. Therefore it is important to study Lasso for model selection purposes. In this paper, we prove that a single condition, which we call the Irrepresentable Condition, is almost necessary and sufficient for Lasso to select the true model both in the classical fixed $p$ setting and in the large $p$ setting as the sample size $n$ gets large. Based on these results, sufficient conditions that are verifiable in practice are given to relate to previous works and help applications of Lasso for feature selection and sparse representation. This Irrepresentable Condition, which depends mainly on the covariance of the predictor variables, states that Lasso selects the true model consistently if and (almost) only if the predictors that are not in the true model are "irrepresentable" (in a sense to be clarified) by predictors that are in the true model. Furthermore, simulations are carried out to provide insights and understanding of this result.
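
A short numpy sketch of checking the Irrepresentable Condition for a given design covariance, true support, and sign pattern (both example covariances below are illustrative):

```python
# Sketch of checking the Irrepresentable Condition described above for a given
# design covariance, true support, and sign pattern; both example covariances
# are illustrative.
import numpy as np

def irrepresentable_value(C, support, signs):
    """Return ||C_{S^c,S} C_{S,S}^{-1} sign(beta_S)||_inf; the condition requires
    this quantity to be below 1."""
    S = np.asarray(support)
    Sc = np.setdiff1d(np.arange(C.shape[0]), S)
    return np.max(np.abs(C[np.ix_(Sc, S)] @ np.linalg.solve(C[np.ix_(S, S)], signs)))

# Design 1: independent predictors, so the condition holds trivially.
C_id = np.eye(3)
# Design 2: the irrelevant predictor x2 is highly correlated with both x0 and x1.
C_corr = np.array([[1.0, 0.0, 0.7],
                   [0.0, 1.0, 0.7],
                   [0.7, 0.7, 1.0]])

for name, C in [("independent", C_id), ("correlated", C_corr)]:
    val = irrepresentable_value(C, support=[0, 1], signs=np.ones(2))
    print(f"{name:>12s}: value = {val:.2f}  ({'condition holds' if val < 1 else 'condition fails'})")
```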

JMLR Journal 2006 Journal Article

Sparse Boosting

  • Peter Bühlmann
  • Bin Yu

We propose Sparse Boosting (the Sparse$L_2$Boost algorithm), a variant on boosting with the squared error loss. Sparse$L_2$Boost yields sparser solutions than the previously proposed $L_2$Boosting by minimizing some penalized $L_2$-loss functions, the FPE model selection criteria, through small-step gradient descent. Although boosting may give already relatively sparse solutions, for example corresponding to the soft-thresholding estimator in orthogonal linear models, there is sometimes a desire for more sparseness to increase prediction accuracy and ability for better variable selection: such goals can be achieved with Sparse$L_2$Boost. We prove an equivalence of Sparse$L_2$Boost to Breiman's nonnegative garrote estimator for orthogonal linear models and demonstrate the generic nature of Sparse$L_2$Boost for nonparametric interaction modeling. For an automatic selection of the tuning parameter in Sparse$L_2$Boost we propose to employ the gMDL model selection criterion which can also be used for early stopping of $L_2$Boosting. Consequently, we can select between Sparse$L_2$Boost and $L_2$Boosting by comparing their gMDL scores.