Author name cluster

Daniel B. Neill

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers
2 author rows

Possible papers

TMLR Journal 2026 Journal Article

Auditing Predictive Models for Intersectional Biases

  • Kate Boxer
  • Edward McFowland III
  • Daniel B. Neill

Predictive models that satisfy group fairness criteria in aggregate for members of a protected class, but do not guarantee subgroup fairness, could produce biased predictions for individuals at the intersection of two or more protected classes. To address this risk, we propose Conditional Bias Scan (CBS), an auditing framework for detecting intersectional biases in the outputs of classification models that may lead to disparate impact. CBS aims to identify the subgroup with the most significant bias against the protected class, compared to the equivalent subgroup in the non-protected class. The framework can audit for predictive biases using common group fairness definitions that can be represented as conditional independence statements (separation and sufficiency) for both probabilistic and binarized predictions. We show through empirical evaluations that this methodology has substantially higher bias detection power compared to similar methods that audit for subgroup fairness. We then use this approach to detect statistically significant intersectional biases in the predictions of the COMPAS pre-trial risk assessment tool and a model trained on the German Credit data.

AAAI Conference 2023 Conference Paper

Detecting Anomalous Networks of Opioid Prescribers and Dispensers in Prescription Drug Data

  • Katie Rosman
  • Daniel B. Neill

The opioid overdose epidemic represents a serious public health crisis, with fatality rates rising considerably over the past several years. To help address the abuse of prescription opioids, state governments collect data on dispensed prescriptions, yet the use of these data is typically limited to manual searches. In this paper, we propose a novel graph-based framework for detecting anomalous opioid prescribing patterns in state Prescription Drug Monitoring Program (PDMP) data, which could aid governments in deterring opioid diversion and abuse. Specifically, we seek to identify connected networks of opioid prescribers and dispensers who engage in high-risk and possibly illicit activity. We develop and apply a novel extension of the Non-Parametric Heterogeneous Graph Scan (NPHGS) to two years of de-identified PDMP data from the state of Kansas, and find that NPHGS identifies subgraphs that are significantly more anomalous than those detected by other graph-based methods. NPHGS also reveals clusters of potentially illicit activity, which may strengthen state law enforcement and regulatory capabilities. Our paper is the first to demonstrate how prescription data can systematically identify anomalous opioid prescribers and dispensers, as well as illustrating the efficacy of a network-based approach. Additionally, our technical extensions to NPHGS offer both improved flexibility and graph density reduction, enabling the framework to be replicated across jurisdictions and extended to other problem domains.

JMLR Journal 2023 Journal Article

Exploiting Discovered Regression Discontinuities to Debias Conditioned-on-observable Estimators

  • Benjamin Jakubowski
  • Sriram Somanchi
  • Edward McFowland III
  • Daniel B. Neill

Regression discontinuity (RD) designs are widely used to estimate causal effects in the absence of a randomized experiment. However, standard approaches to RD analysis face two significant limitations. First, they require a priori knowledge of discontinuities in treatment. Second, they yield doubly-local treatment effect estimates, and fail to provide more general causal effect estimates away from the discontinuity. To address these limitations, we introduce a novel method for automatically detecting RDs at scale, integrating information from multiple discovered discontinuities with an observational estimator, and extrapolating away from discovered, local RDs. We demonstrate the performance of our method on two synthetic datasets, showing improved performance compared to direct use of an observational estimator, direct extrapolation of RD estimates, and existing methods for combining multiple causal effect estimates. Finally, we apply our novel method to estimate spatially heterogeneous treatment effects in the context of a recent economic development problem.

AAAI Conference 2023 Conference Paper

Provable Detection of Propagating Sampling Bias in Prediction Models

  • Pavan Ravishankar
  • Qingyu Mo
  • Edward McFowland III
  • Daniel B. Neill

With an increased focus on incorporating fairness in machine learning models, it becomes imperative not only to assess and mitigate bias at each stage of the machine learning pipeline but also to understand the downstream impacts of bias across stages. Here we consider a general, but realistic, scenario in which a predictive model is learned from (potentially biased) training data, and model predictions are assessed post-hoc for fairness by some auditing method. We provide a theoretical analysis of how a specific form of data bias, differential sampling bias, propagates from the data stage to the prediction stage. Unlike prior work, we evaluate the downstream impacts of data biases quantitatively rather than qualitatively and prove theoretical guarantees for detection. Under reasonable assumptions, we quantify how the amount of bias in the model predictions varies as a function of the amount of differential sampling bias in the data, and at what point this bias becomes provably detectable by the auditor. Through experiments on two criminal justice datasets (the well-known COMPAS dataset and historical data from NYPD's stop and frisk policy), we demonstrate that the theoretical results hold in practice even when our assumptions are relaxed.

AAAI Conference 2022 Conference Paper

Calibrated Nonparametric Scan Statistics for Anomalous Pattern Detection in Graphs

  • Chunpai Wang
  • Daniel B. Neill
  • Feng Chen

We propose a new approach, the calibrated nonparametric scan statistic (CNSS), for more accurate detection of anomalous patterns in large-scale, real-world graphs. Scan statistics identify connected subgraphs that are interesting or unexpected through maximization of a likelihood ratio statistic; in particular, nonparametric scan statistics (NPSSs) identify subgraphs with a higher than expected proportion of individually significant nodes. However, we show that recently proposed NPSS methods are miscalibrated, failing to account for the maximization of the statistic over the multiplicity of subgraphs. This results in both reduced detection power for subtle signals, and low precision of the detected subgraph even for stronger signals. Thus we develop a new statistical approach to recalibrate NPSSs, correctly adjusting for multiple hypothesis testing and taking the underlying graph structure into account. While the recalibration, based on randomization testing, is computationally expensive, we propose both an efficient (approximate) algorithm and new, closed-form lower bounds (on the expected maximum proportion of significant nodes for subgraphs of a given size, under the null hypothesis of no anomalous patterns). These advances, along with the integration of recent core-tree decomposition methods, enable CNSS to scale to large real-world graphs, with substantial improvement in the accuracy of detected subgraphs. Extensive experiments on both semi-synthetic and real-world datasets validate the effectiveness of our proposed methods in comparison with state-of-the-art counterparts.

AAAI Conference 2022 Conference Paper

SPATE-GAN: Improved Generative Modeling of Dynamic Spatio-Temporal Patterns with an Autoregressive Embedding Loss

  • Konstantin Klemmer
  • Tianlin Xu
  • Beatrice Acciaio
  • Daniel B. Neill

From ecology to atmospheric sciences, many academic disciplines deal with data characterized by intricate spatiotemporal complexities, the modeling of which often requires specialized approaches. Generative models of these data are of particular interest, as they enable a range of impactful downstream applications like simulation or creating synthetic training data. Recently, COT-GAN, a new GAN algorithm inspired by the theory of causal optimal transport (COT), was proposed in an attempt to improve generation of sequential data. However, the task of learning complex patterns over time and space requires additional knowledge of the specific data structures. In this study, we propose a novel loss objective combined with COT-GAN based on an autoregressive embedding to reinforce the learning of spatio-temporal dynamics. We devise SPATE (spatio-temporal association), a new metric measuring spatio-temporal autocorrelation. We compute SPATE for real and synthetic data samples and use it to compute an embedding loss that considers space-time interactions, nudging the GAN to learn outputs that are faithful to the observed dynamics. We test our new SPATE-GAN on a diverse set of spatio-temporal patterns: turbulent flows, log-Gaussian Cox processes and global weather data. We show that our novel embedding loss improves performance without any changes to the architecture of the GAN backbone, highlighting our model’s increased capacity for capturing autoregressive structures.

JMLR Journal 2019 Journal Article

Change Surfaces for Expressive Multidimensional Changepoints and Counterfactual Prediction

  • William Herlands
  • Daniel B. Neill
  • Hannes Nickisch
  • Andrew Gordon Wilson

Identifying changes in model parameters is fundamental in machine learning and statistics. However, standard changepoint models are limited in expressiveness, often addressing unidimensional problems and assuming instantaneous changes. We introduce change surfaces as a multidimensional and highly expressive generalization of changepoints. We provide a model-agnostic formalization of change surfaces, illustrating how they can provide variable, heterogeneous, and non-monotonic rates of change across multiple dimensions. Additionally, we show how change surfaces can be used for counterfactual prediction. As a concrete instantiation of the change surface framework, we develop Gaussian Process Change Surfaces (GPCS). We demonstrate counterfactual prediction with Bayesian posterior mean and credible sets, as well as massive scalability by introducing novel methods for additive non-separable kernels. Using two large spatio-temporal datasets we employ GPCS to discover and characterize complex changes that can provide scientific and policy relevant insights. Specifically, we analyze twentieth century measles incidence across the United States and discover previously unknown heterogeneous changes after the introduction of the measles vaccine. Additionally, we apply the model to requests for lead testing kits in New York City, discovering distinct spatial and demographic patterns.

IS Journal 2017 Journal Article

Graph Structure Learning from Unlabeled Data for Early Outbreak Detection

  • Sriram Somanchi
  • Daniel B. Neill

Processes such as disease propagation and information diffusion often spread over some latent network structure that must be learned from observation. Given a set of unlabeled training examples representing occurrences of an event type of interest (such as a disease outbreak), the authors aim to learn a graph structure that can be used to accurately detect future events of that type. They propose a novel framework for learning graph structure from unlabeled data by comparing the most anomalous subsets detected with and without the graph constraints. Their framework uses the mean normalized log-likelihood ratio score to measure the quality of a graph structure, and it efficiently searches for the highest-scoring graph structure. Using simulated disease outbreaks injected into real-world Emergency Department data from Allegheny County, the authors show that their method learns a structure similar to the true underlying graph, but enables faster and more accurate detection.

ICML Conference 2015 Conference Paper

Fast Kronecker Inference in Gaussian Processes with non-Gaussian Likelihoods

  • Seth R. Flaxman
  • Andrew Gordon Wilson
  • Daniel B. Neill
  • Hannes Nickisch
  • Alexander J. Smola

Gaussian processes (GPs) are a flexible class of methods with state-of-the-art performance on spatial statistics applications. However, GPs require O(n^3) computations and O(n^2) storage, and popular GP kernels are typically limited to smoothing and interpolation. To address these difficulties, Kronecker methods have been used to exploit structure in the GP covariance matrix for scalability, while allowing for expressive kernel learning (Wilson et al., 2014). However, fast Kronecker methods have been confined to Gaussian likelihoods. We propose new scalable Kronecker methods for Gaussian processes with non-Gaussian likelihoods, using a Laplace approximation which involves linear conjugate gradients for inference, and a lower bound on the GP marginal likelihood for kernel learning. Our approach has near linear scaling, requiring O(D n^((D+1)/D)) operations and O(D n^(2/D)) storage, for n training data-points on a dense D > 1 dimensional grid. Moreover, we introduce a log Gaussian Cox process, with highly expressive kernels, for modelling spatiotemporal count processes, and apply it to a point pattern (n = 233,088) of a decade of crime events in Chicago. Using our model, we discover spatially varying multiscale seasonal trends and produce highly accurate long-range local area forecasts.
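
The scaling quoted in this abstract rests on a standard identity: a matrix-vector product with a Kronecker-structured matrix can be computed one factor at a time, without ever forming the full matrix. The following is a minimal NumPy sketch of that identity only, not the authors' implementation; the function name `kron_matvec` and the example sizes are illustrative assumptions.

```python
import numpy as np

def kron_matvec(factors, v):
    """Compute kron(A_1, ..., A_D) @ v without building the Kronecker product.

    `factors` is a list of 2-D arrays (one per grid dimension); `v` is a 1-D
    vector whose length equals the product of the factors' column counts.
    This helper is hypothetical and only illustrates the generic identity.
    """
    sizes = [A.shape[1] for A in factors]
    x = np.asarray(v, dtype=float).reshape(sizes)  # view v as a D-way tensor
    for d, A in enumerate(factors):
        # Contract A's columns with axis d of x, then move the new axis back to d.
        x = np.moveaxis(np.tensordot(A, x, axes=([1], [d])), 0, d)
    return x.reshape(-1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A, B, C = rng.normal(size=(3, 3)), rng.normal(size=(4, 4)), rng.normal(size=(2, 2))
    v = rng.normal(size=3 * 4 * 2)
    dense = np.kron(np.kron(A, B), C) @ v  # reference: explicit Kronecker product
    print(np.allclose(kron_matvec([A, B, C], v), dense))
```

A product of this kind is what an iterative solver such as conjugate gradients needs at each step, which is why Kronecker structure pairs naturally with the conjugate-gradient inference the abstract mentions.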

TIST Journal 2015 Journal Article

Gaussian Processes for Independence Tests with Non-iid Data in Causal Inference

  • Seth R. Flaxman
  • Daniel B. Neill
  • Alexander J. Smola

In applied fields, practitioners hoping to apply causal structure learning or causal orientation algorithms face an important question: which independence test is appropriate for my data? In the case of real-valued iid data, linear dependencies, and Gaussian error terms, partial correlation is sufficient. But once any of these assumptions is modified, the situation becomes more complex. Kernel-based tests of independence have gained popularity to deal with nonlinear dependencies in recent years, but testing for conditional independence remains a challenging problem. We highlight the important issue of non-iid observations: when data are observed in space, time, or on a network, “nearby” observations are likely to be similar. This fact biases estimates of dependence between variables. Inspired by the success of Gaussian process regression for handling non-iid observations in a wide variety of areas and by the usefulness of the Hilbert-Schmidt Independence Criterion (HSIC), a kernel-based independence test, we propose a simple framework to address all of these issues: first, use Gaussian process regression to control for certain variables and to obtain residuals. Second, use HSIC to test for independence. We illustrate this on two classic datasets, one spatial, the other temporal, that are usually treated as iid. We show how properly accounting for spatial and temporal variation can lead to more reasonable causal graphs. We also show how highly structured data, like images and text, can be used in a causal inference framework using a novel structured input/output Gaussian process formulation. We demonstrate this idea on a dataset of translated sentences, trying to predict the source language.
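
The two-step recipe in this abstract (regress out a structured variable with a Gaussian process, then test the residuals with HSIC) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' code: the helper names `rbf_gram`, `gp_residuals`, and `hsic` are hypothetical, the kernel bandwidth and noise variance are fixed by hand rather than learned, and the HSIC value is the biased empirical estimate trace(KHLH)/n^2 with no significance test attached.

```python
import numpy as np

def rbf_gram(x, bandwidth=1.0):
    """Gaussian (RBF) Gram matrix for a 1-D or 2-D array of observations."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    sq_norms = np.sum(x**2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * x @ x.T
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * bandwidth**2))

def gp_residuals(z, y, bandwidth=1.0, noise=0.1):
    """Residuals of y after GP regression on z (posterior mean, fixed hyperparameters).

    `noise` is the assumed observation-noise variance; in practice the
    hyperparameters would be fit to the data rather than set by hand.
    """
    y = np.asarray(y, dtype=float)
    K = rbf_gram(z, bandwidth)
    alpha = np.linalg.solve(K + noise * np.eye(len(y)), y)
    return y - K @ alpha

def hsic(x, y, bandwidth=1.0):
    """Biased empirical HSIC estimate: trace(K H L H) / n^2."""
    n = len(x)
    K, L = rbf_gram(x, bandwidth), rbf_gram(y, bandwidth)
    H = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    return np.trace(K @ H @ L @ H) / n**2

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z = rng.uniform(-3, 3, size=200)        # shared spatial/temporal coordinate
    x = z + 0.1 * rng.standard_normal(200)  # both variables depend on z only
    y = z + 0.1 * rng.standard_normal(200)
    print("raw HSIC:     ", hsic(x, y))
    print("residual HSIC:", hsic(gp_residuals(z, x), gp_residuals(z, y)))
```

To turn the statistic into a test, one common choice is a permutation null: recompute `hsic` after shuffling one variable many times and compare the observed value against that distribution.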

JMLR Journal 2013 Journal Article

Fast Generalized Subset Scan for Anomalous Pattern Detection

  • Edward McFowland III
  • Skyler Speakman
  • Daniel B. Neill

We propose Fast Generalized Subset Scan (FGSS), a new method for detecting anomalous patterns in general categorical data sets. We frame the pattern detection problem as a search over subsets of data records and attributes, maximizing a nonparametric scan statistic over all such subsets. We prove that the nonparametric scan statistics possess a novel property that allows for efficient optimization over the exponentially many subsets of the data without an exhaustive search, enabling FGSS to scale to massive and high-dimensional data sets. We evaluate the performance of FGSS in three real-world application domains (customs monitoring, disease surveillance, and network intrusion detection), and demonstrate that FGSS can successfully detect and characterize relevant patterns in each domain. As compared to three other recently proposed detection algorithms, FGSS substantially decreased run time and improved detection power for massive multivariate data sets.
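
To give a feel for how a nonparametric scan statistic can be maximized without enumerating every subset, here is a simplified sketch. It makes several assumptions that go beyond the abstract: the Berk-Jones statistic is used as the scoring function, the input is a precomputed matrix of empirical p-values (one row per record, one column per attribute), only subsets of records are scanned with all attributes included, and the significance thresholds come from a small fixed grid. The names `berk_jones` and `scan_records` are hypothetical, and this is not the FGSS algorithm itself.

```python
import numpy as np

def berk_jones(n_alpha, n, alpha):
    """Berk-Jones score: n * KL(n_alpha / n || alpha), or 0 if the ratio is not above alpha."""
    ratio = n_alpha / n
    if ratio <= alpha:
        return 0.0
    kl = ratio * np.log(ratio / alpha)
    if ratio < 1.0:  # skip the second term when ratio == 1 (0 * log 0 is taken as 0)
        kl += (1.0 - ratio) * np.log((1.0 - ratio) / (1.0 - alpha))
    return float(n * kl)

def scan_records(pvalues, alphas=(0.01, 0.05, 0.1)):
    """Return (score, record_indices, alpha) for the highest-scoring subset of records.

    For each threshold alpha, records are sorted by their count of significant
    p-values, and only the top-k prefixes of that ordering are scored, so the
    search costs O(n log n) per alpha instead of 2^n subset evaluations.
    """
    pvalues = np.asarray(pvalues, dtype=float)
    n_records, n_attrs = pvalues.shape
    best = (0.0, [], None)
    for alpha in alphas:
        counts = (pvalues <= alpha).sum(axis=1)      # significant p-values per record
        order = np.argsort(-counts, kind="stable")   # highest-priority records first
        cumulative = np.cumsum(counts[order])
        for k in range(1, n_records + 1):
            score = berk_jones(cumulative[k - 1], k * n_attrs, alpha)
            if score > best[0]:
                best = (score, sorted(order[:k].tolist()), alpha)
    return best

if __name__ == "__main__":
    normal = np.tile([0.3, 0.5, 0.7, 0.9], (7, 1))  # unremarkable records
    anomalous = np.full((3, 4), 0.001)              # records with tiny p-values
    score, subset, alpha = scan_records(np.vstack([anomalous, normal]))
    print(subset, alpha, round(score, 2))
```

Restricting attention to sorted prefixes is the kind of shortcut the abstract's "efficient optimization" property makes possible; a fuller treatment would also search over attribute subsets.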

IS Journal 2012 Journal Article

Information Visualization for Chronic Disease Risk Assessment

  • Christopher A. Harle
  • Daniel B. Neill
  • Rema Padman

Here, the authors describe and evaluate a new information-visualization method and prototype software tool that support risk assessment for negative health outcomes. Their framework uses principal component analysis and linear discriminant analysis to plot high-dimensional patient data in 2D. It also incorporates interactive visualization techniques to aid the identification of high versus low risk patients, critical risk factors, and the estimated effect of hypothetical interventions on the likelihood of negative outcomes. The authors quantitatively evaluated the visualization method using a secondary dataset describing 588 people with diabetes and their estimated future risk of heart attack. Their results show that the method visually classifies high- and low-risk people with accuracy that's similar to other common statistical methods. The framework also provides an interactive, visualization-based tool for clinicians to explore the nuances of their patients' data and disease risk.

IS Journal 2012 Journal Article

New Directions in Artificial Intelligence for Public Health Surveillance

  • Daniel B. Neill

The next decade of disease surveillance research will require novel methods to effectively use massive quantities of complex, high-dimensional data. We summarize two recent approaches which deal with the increasing complexity and scale of health data, including the use of rich text data to detect emerging outbreaks with novel symptom patterns, and fast subset scan methods to efficiently identify the most relevant patterns in massive datasets.