Arrow Research search

Author name cluster

Daniela Witten

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
1 author row

Possible papers

JMLR Journal 2024 Journal Article

Data Thinning for Convolution-Closed Distributions

  • Anna Neufeld
  • Ameer Dharamshi
  • Lucy L. Gao
  • Daniela Witten

We propose data thinning, an approach for splitting an observation into two or more independent parts that sum to the original observation, and that follow the same distribution as the original observation, up to a (known) scaling of a parameter. This very general proposal is applicable to any convolution-closed distribution, a class that includes the Gaussian, Poisson, negative binomial, gamma, and binomial distributions, among others. Data thinning has a number of applications to model selection, evaluation, and inference. For instance, cross-validation via data thinning provides an attractive alternative to the usual approach of cross-validation via sample splitting, especially in settings in which the latter is not applicable. In simulations and in an application to single-cell RNA-sequencing data, we show that data thinning can be used to validate the results of unsupervised learning approaches, such as k-means clustering and principal components analysis, for which traditional sample splitting is unattractive or unavailable. © JMLR 2024.
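As a rough illustration of the idea in this abstract, the sketch below thins Poisson counts into two parts. It relies on a standard fact rather than on the paper's own code: if X ~ Poisson(lam) and X1 | X ~ Binomial(X, eps), then X1 ~ Poisson(eps * lam) and X2 = X - X1 ~ Poisson((1 - eps) * lam), with X1 and X2 independent. The function name `poisson_thin` and the parameter `eps` are hypothetical names chosen for this example.

```python
import numpy as np


def poisson_thin(x, eps=0.5, rng=None):
    """Split Poisson counts x into two parts that sum to x.

    Assumption for this sketch: x holds draws from a Poisson distribution.
    Under that assumption the two parts are independent Poisson variables
    with means scaled by eps and (1 - eps).
    """
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x)
    # Each of the x counted events lands in part 1 with probability eps.
    x1 = rng.binomial(x, eps)
    # The remainder forms part 2, so x1 + x2 == x exactly.
    x2 = x - x1
    return x1, x2


# Simulated check: the parts sum to the original counts, and their
# empirical correlation should be close to zero.
rng = np.random.default_rng(0)
lam = 4.0
x = rng.poisson(lam, size=100_000)
x1, x2 = poisson_thin(x, eps=0.3, rng=rng)
corr = np.corrcoef(x1, x2)[0, 1]
```

One part could then be used to fit a model (for example, a clustering) and the other to evaluate it, which is the kind of use the abstract describes; other convolution-closed families would need their own thinning step.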

JMLR Journal 2023 Journal Article

Revisiting inference after prediction

  • Keshav Motwani
  • Daniela Witten

Recent work has focused on the very common practice of prediction-based inference: that is, (i) using a pre-trained machine learning model to predict an unobserved response variable, and then (ii) conducting inference on the association between that predicted response and some covariates. As pointed out by Wang et al. (2020), applying a standard inferential approach in (ii) does not accurately quantify the association between the unobserved (as opposed to the predicted) response and the covariates. In recent work, Wang et al. (2020) and Angelopoulos et al. (2023) propose corrections to step (ii) in order to enable valid inference on the association between the unobserved response and the covariates. Here, we show that the method proposed by Angelopoulos et al. (2023) successfully controls the type 1 error rate and provides confidence intervals with correct nominal coverage, regardless of the quality of the pre-trained machine learning model used to predict the unobserved response. However, the method proposed by Wang et al. (2020) provides valid inference only under very strong conditions that rarely hold in practice: for instance, if the machine learning model perfectly estimates the true regression function in the study population of interest. © JMLR 2023.

JMLR Journal 2016 Journal Article

Convex Regression with Interpretable Sharp Partitions

  • Ashley Petersen
  • Noah Simon
  • Daniela Witten

We consider the problem of predicting an outcome variable on the basis of a small number of covariates, using an interpretable yet non-additive model. We propose convex regression with interpretable sharp partitions (CRISP) for this task. CRISP partitions the covariate space into blocks in a data-adaptive way, and fits a mean model within each block. Unlike other partitioning methods, CRISP is fit using a non-greedy approach by solving a convex optimization problem, resulting in low-variance fits. We explore the properties of CRISP, and evaluate its performance in a simulation study and on a housing price data set. © JMLR 2016.

AAAI Conference 2015 Conference Paper

Pathway Graphical Lasso

  • Maxim Grechkin
  • Maryam Fazel
  • Daniela Witten
  • Su-In Lee

Graphical models provide a rich framework for summarizing the dependencies among variables. The graphical lasso approach attempts to learn the structure of a Gaussian graphical model (GGM) by maximizing the log likelihood of the data, subject to an l1 penalty on the elements of the inverse covariance matrix. Most algorithms for solving the graphical lasso problem do not scale to a very large number of variables. Furthermore, the learned network structure is hard to interpret. To overcome these challenges, we propose a novel GGM structure learning method that exploits the fact that for many real-world problems we have prior knowledge that certain edges are unlikely to be present. For example, in gene regulatory networks, a pair of genes that does not participate together in any of the cellular processes, typically referred to as pathways, is less likely to be connected. In computer vision applications in which each variable corresponds to a pixel, each variable is likely to be connected to the nearby variables. In this paper, we propose the pathway graphical lasso, which learns the structure of a GGM subject to pathway-based constraints. In order to solve this problem, we decompose the network into smaller parts, and use a message-passing algorithm in order to communicate among the subnetworks. Our algorithm has orders of magnitude improvement in run time compared to the state-of-the-art optimization methods for the graphical lasso problem that were modified to handle pathway-based constraints.
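For reference, the graphical lasso problem mentioned in this abstract is commonly written in the penalized form below. The symbols are assumptions introduced here, not notation from the abstract: $S$ is the sample covariance matrix, $\Theta$ the inverse covariance (precision) matrix, and $\lambda \ge 0$ a tuning parameter. Some formulations penalize only the off-diagonal entries of $\Theta$.

```latex
\hat{\Theta} \;=\; \operatorname*{arg\,max}_{\Theta \succ 0}
  \Bigl\{ \log\det\Theta \;-\; \operatorname{tr}(S\Theta)
          \;-\; \lambda \sum_{j,k} \lvert \Theta_{jk} \rvert \Bigr\}
```

A zero entry $\hat{\Theta}_{jk} = 0$ corresponds to a missing edge between variables $j$ and $k$ in the estimated Gaussian graphical model, so one plausible reading of the pathway-based constraints is that they fix such entries to zero in advance for pairs that share no pathway.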

JMLR Journal 2014 Journal Article

Learning Graphical Models With Hubs

  • Kean Ming Tan
  • Palma London
  • Karthik Mohan
  • Su-In Lee
  • Maryam Fazel
  • Daniela Witten

We consider the problem of learning a high-dimensional graphical model in which there are a few hub nodes that are densely-connected to many other nodes. Many authors have studied the use of an $\ell_1$ penalty in order to learn a sparse graph in the high-dimensional setting. However, the $\ell_1$ penalty implicitly assumes that each edge is equally likely and independent of all other edges. We propose a general framework to accommodate more realistic networks with hub nodes, using a convex formulation that involves a row-column overlap norm penalty. We apply this general framework to three widely-used probabilistic graphical models: the Gaussian graphical model, the covariance graph model, and the binary Ising model. An alternating direction method of multipliers algorithm is used to solve the corresponding convex optimization problems. On synthetic data, we demonstrate that our proposed framework outperforms competitors that do not explicitly model hub nodes. We illustrate our proposal on a webpage data set and a gene expression data set. © JMLR 2014.

JMLR Journal 2014 Journal Article

Node-Based Learning of Multiple Gaussian Graphical Models

  • Karthik Mohan
  • Palma London
  • Maryam Fazel
  • Daniela Witten
  • Su-In Lee

We consider the problem of estimating high-dimensional Gaussian graphical models corresponding to a single set of variables under several distinct conditions. This problem is motivated by the task of recovering transcriptional regulatory networks on the basis of gene expression data containing heterogeneous samples, such as different disease states, multiple species, or different developmental stages. We assume that most aspects of the conditional dependence networks are shared, but that there are some structured differences between them. Rather than assuming that similarities and differences between networks are driven by individual edges, we take a node-based approach, which in many cases provides a more intuitive interpretation of the network differences. We consider estimation under two distinct assumptions: (1) differences between the $K$ networks are due to individual nodes that are perturbed across conditions, or (2) similarities among the $K$ networks are due to the presence of common hub nodes that are shared across all $K$ networks. Using a row-column overlap norm penalty function, we formulate two convex optimization problems that correspond to these two assumptions. We solve these problems using an alternating direction method of multipliers algorithm, and we derive a set of necessary and sufficient conditions that allows us to decompose the problem into independent subproblems so that our algorithm can be scaled to high-dimensional settings. Our proposal is illustrated on synthetic data, a webpage data set, and a brain cancer gene expression data set. © JMLR 2014.

NeurIPS Conference 2012 Conference Paper

Structured Learning of Gaussian Graphical Models

  • Karthik Mohan
  • Mike Chung
  • Seungyeop Han
  • Daniela Witten
  • Su-In Lee
  • Maryam Fazel

We consider estimation of multiple high-dimensional Gaussian graphical models corresponding to a single set of nodes under several distinct conditions. We assume that most aspects of the networks are shared, but that there are some structured differences between them. Specifically, the network differences are generated from node perturbations: a few nodes are perturbed across networks, and most or all edges stemming from such nodes differ between networks. This corresponds to a simple model for the mechanism underlying many cancers, in which the gene regulatory network is disrupted due to the aberrant activity of a few specific genes. We propose to solve this problem using the structured joint graphical lasso, a convex optimization problem that is based upon the use of a novel symmetric overlap norm penalty, which we solve using an alternating directions method of multipliers algorithm. Our proposal is illustrated on synthetic data and on an application to brain cancer gene expression data.