Arrow Research

Author name cluster

Predrag Radivojac

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
1 author row

Possible papers (7)

TMLR · 2024 · Journal Article

Learning Tree-Structured Composition of Data Augmentation

  • Dongyue Li
  • Kailai Chen
  • Predrag Radivojac
  • Hongyang R. Zhang

Data augmentation is widely used in scenarios where one needs to train a neural network given little labeled data. A common practice of augmentation training is applying a composition of multiple transformations sequentially to the data. Existing augmentation methods such as RandAugment rely on domain expertise to select a list of transformations, while other methods such as AutoAugment formulate an optimization problem over a search space of size $k^d$, which is the number of sequences of length $d$, given a list of $k$ transformation functions. In this paper, we focus on designing efficient algorithms whose running time is provably much faster than the worst-case complexity of $O(k^d)$. We propose a new algorithm to search for a binary tree-structured composition of $k$ transformations, where each tree node corresponds to one transformation. The binary tree generalizes sequential augmentations, such as the one constructed by SimCLR. Using a top-down, recursive search procedure, our algorithm achieves a runtime complexity of $O(2^d k)$, which is much faster than $O(k^d)$ as $k$ increases above $2$. We apply the algorithm to tackle data distributions with heterogeneous subpopulations by searching for one tree in each subpopulation and then learning a weighted combination, leading to a forest of trees. We validate the proposed algorithms on numerous graph and image data sets, including a multi-label graph classification data set we collected. The data set exhibits significant variations in the sizes of graphs and their average degrees, making it ideal for studying data augmentation. We show that our approach can reduce the computation cost (measured by GPU hours) by 43% over existing augmentation search methods while improving performance by 4.3%. Extensive experiments on contrastive learning also validate the benefit of our approach. The tree structures can be used to interpret the relative importance of each transformation, such as identifying the important transformations on small vs. large graphs.
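
The tree policy can be pictured with a small sketch. Everything below (the `TreeNode` class, the toy transformations, and the rule of recursing into a uniformly random child) is an illustrative assumption, not the paper's implementation; it only shows how a binary tree of transformations generalizes a sequential chain.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TreeNode:
    # One transformation per node; children extend the composition.
    transform: Callable[[float], float]
    left: Optional["TreeNode"] = None
    right: Optional["TreeNode"] = None

def augment(node: TreeNode, x, rng: random.Random):
    # Apply this node's transformation, then recurse into a random child.
    # A root-to-leaf walk yields one composed augmentation; a tree with
    # only left children degenerates to the usual sequential policy.
    x = node.transform(x)
    children = [c for c in (node.left, node.right) if c is not None]
    return x if not children else augment(rng.choice(children), x, rng)

# Toy numeric transformations standing in for graph/image augmentations.
policy = TreeNode(lambda x: x + 1.0,
                  left=TreeNode(lambda x: x * 2.0),
                  right=TreeNode(lambda x: -x))
rng = random.Random(0)
print([augment(policy, 3.0, rng) for _ in range(4)])
```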

AAAI · 2023 · Conference Paper

Leveraging Structure for Improved Classification of Grouped Biased Data

  • Daniel Zeiberg
  • Shantanu Jain
  • Predrag Radivojac

We consider semi-supervised binary classification for applications in which data points are naturally grouped (e.g., survey responses grouped by state) and the labeled data is biased (e.g., survey respondents are not representative of the population). The groups overlap in the feature space and consequently the input-output patterns are related across the groups. To model the inherent structure in such data, we assume the partition-projected class-conditional invariance across groups, defined in terms of the group-agnostic feature space. We demonstrate that under this assumption, the group carries additional information about the class, over the group-agnostic features, with provably improved area under the ROC curve. Further assuming invariance of partition-projected class-conditional distributions across both labeled and unlabeled data, we derive a semi-supervised algorithm that explicitly leverages the structure to learn an optimal, group-aware, probability-calibrated classifier, despite the bias in the labeled data. Experiments on synthetic and real data demonstrate the efficacy of our algorithm over suitable baselines and ablative models, spanning standard supervised and semi-supervised learning approaches, with and without incorporating the group directly as a feature.
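
As a loose illustration of a group-aware, probability-calibrated classifier, the sketch below fits one shared classifier on the features and then recalibrates its score separately per group. The synthetic data, the logistic recalibration, and the two-group setup are all assumptions, not the paper's algorithm, which additionally handles the bias in the labeled sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def make_group(n, prior):
    # Two groups share the feature space but differ in class balance,
    # so the group carries information beyond the features themselves.
    y = (rng.random(n) < prior).astype(int)
    X = rng.normal(loc=1.5 * y[:, None], scale=1.0, size=(n, 2))
    return X, y

Xa, ya = make_group(500, prior=0.7)
Xb, yb = make_group(500, prior=0.2)

# One shared base classifier on the group-agnostic features ...
base = LogisticRegression().fit(np.vstack([Xa, Xb]), np.concatenate([ya, yb]))

# ... recalibrated separately within each group.
calibrators = {}
for g, (X, y) in {"a": (Xa, ya), "b": (Xb, yb)}.items():
    s = base.decision_function(X).reshape(-1, 1)
    calibrators[g] = LogisticRegression().fit(s, y)

def predict_proba(X, g):
    s = base.decision_function(X).reshape(-1, 1)
    return calibrators[g].predict_proba(s)[:, 1]

x = Xa[:3]
print(predict_proba(x, "a"), predict_proba(x, "b"))  # same x, different groups
```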

AAAI · 2020 · Conference Paper

Class Prior Estimation with Biased Positives and Unlabeled Examples

  • Shantanu Jain
  • Justin Delano
  • Himanshu Sharma
  • Predrag Radivojac

Positive-unlabeled learning is often studied under the assumption that the labeled positive sample is drawn randomly from the true distribution of positives. In many application domains, however, certain regions in the support of the positive class-conditional distribution are over-represented while others are under-represented in the positive sample. Although this introduces problems in all aspects of positive-unlabeled learning, we begin to address this challenge by focusing on the estimation of class priors, quantities central to the estimation of posterior probabilities and the recovery of true classification performance. We start by making a set of assumptions to model the sampling bias. We then extend the identifiability theory of class priors from the unbiased to the biased setting. Finally, we derive an algorithm for estimating the class priors that relies on clustering to decompose the original problem into subproblems of unbiased positive-unlabeled learning. Our empirical investigation suggests the feasibility of the correction strategy and demonstrates overall good performance.
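
The clustering-based decomposition can be sketched as follows. The inner estimator here is an Elkan-and-Noto-style stand-in and KMeans is an arbitrary clustering choice; both are assumptions, not the paper's components.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def stand_in_prior(pos, unl):
    # Elkan & Noto-style estimate used here only as a placeholder:
    # fit P(labeled | x), let c = E[P(labeled | x) | x positive], and
    # estimate the prior in the unlabeled set as E[P(labeled | x_u)] / c.
    X = np.vstack([pos, unl])
    s = np.concatenate([np.ones(len(pos)), np.zeros(len(unl))])
    g = LogisticRegression().fit(X, s)
    c = g.predict_proba(pos)[:, 1].mean()
    return float(np.clip(g.predict_proba(unl)[:, 1].mean() / c, 0.0, 1.0))

def clustered_prior(pos, unl, k=3, seed=0):
    # Decompose into per-cluster subproblems, assumed closer to the
    # unbiased setting, then combine estimates weighted by cluster mass.
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(np.vstack([pos, unl]))
    pos_c, unl_c = km.predict(pos), km.predict(unl)
    estimates, weights = [], []
    for c in range(k):
        p, u = pos[pos_c == c], unl[unl_c == c]
        if len(p) and len(u):
            estimates.append(stand_in_prior(p, u))
            weights.append(len(u))
    return float(np.average(estimates, weights=weights))

rng = np.random.default_rng(0)
pos = rng.normal(2.0, 1.0, size=(200, 2))
unl = np.vstack([rng.normal(2.0, 1.0, size=(150, 2)),    # hidden positives
                 rng.normal(-2.0, 1.0, size=(350, 2))])  # negatives
print(clustered_prior(pos, unl))  # true proportion is 0.3
```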

AAAI · 2020 · Conference Paper

Fast Nonparametric Estimation of Class Proportions in the Positive-Unlabeled Classification Setting

  • Daniel Zeiberg
  • Shantanu Jain
  • Predrag Radivojac

Estimating class proportions has emerged as an important direction in positive-unlabeled learning. Well-estimated class priors are key to accurate approximation of posterior distributions and are necessary for the recovery of true classification performance. While significant progress has been made in the past decade, there remains a need for accurate strategies that scale to big data. Motivated by this need, we propose an intuitive and fast nonparametric algorithm to estimate class proportions. Unlike any of the previous methods, our algorithm uses a sampling strategy to repeatedly (1) draw an example from the set of positives, (2) record the minimum distance to any of the unlabeled examples, and (3) remove the nearest unlabeled example. We show that the point of sharp increase in the recorded distances corresponds to the desired proportion of positives in the unlabeled set and train a deep neural network to identify that point. Our distance-based algorithm is evaluated on forty datasets and compared to all currently available methods. We provide evidence that this new approach results in the most accurate performance and can be readily used on large datasets.
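
The sampling loop is spelled out in the abstract, so it can be sketched directly. Only the elbow detector differs: the paper trains a deep network to locate the sharp increase, while the stand-in below (`elbow_fraction`, an assumption) just takes the largest jump in the curve.

```python
import numpy as np

def distance_curve(positives, unlabeled, seed=0):
    # Repeatedly (1) draw a random positive, (2) record its distance to
    # the nearest remaining unlabeled point, (3) remove that point.
    # The curve rises sharply once the positives hidden in the
    # unlabeled set are exhausted.
    rng = np.random.default_rng(seed)
    remaining = unlabeled.copy()
    curve = []
    for _ in range(len(unlabeled)):
        p = positives[rng.integers(len(positives))]
        d = np.linalg.norm(remaining - p, axis=1)
        j = int(np.argmin(d))
        curve.append(float(d[j]))
        remaining = np.delete(remaining, j, axis=0)
    return np.array(curve)

def elbow_fraction(curve):
    # Crude stand-in for the paper's learned detector: locate the
    # largest single jump and report its position as a fraction.
    return (int(np.argmax(np.diff(curve))) + 1) / len(curve)

rng = np.random.default_rng(1)
pos = rng.normal(0.0, 1.0, size=(300, 5))
unl = np.vstack([rng.normal(0.0, 1.0, size=(200, 5)),   # hidden positives
                 rng.normal(4.0, 1.0, size=(300, 5))])  # negatives
print(elbow_fraction(distance_curve(pos, unl)))  # true proportion is 0.4
```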

IJCAI · 2018 · Conference Paper

On Whom Should I Perform this Lab Test Next? An Active Feature Elicitation Approach

  • Sriraam Natarajan
  • Srijita Das
  • Nandini Ramanan
  • Gautam Kunapuli
  • Predrag Radivojac

We consider the problem of active feature elicitation in which, given a few examples with all the features (say, the full EHR) and a few examples with only some of the features (say, demographics), the goal is to identify the set of examples on whom more information (say, the lab tests) needs to be collected. The motivating observation is that some features may be more expensive, personal, or cumbersome to collect. We propose an active learning approach that identifies examples dissimilar to the ones with the full set of data and acquires the complete set of features for these examples. Our extensive evaluation on three real clinical tasks demonstrates the effectiveness of this approach.
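
A minimal sketch of the selection idea, assuming Euclidean distance on the shared features and a nearest-neighbor notion of dissimilarity (the paper's exact criterion may differ):

```python
import numpy as np

def select_for_elicitation(full_shared, partial_shared, budget):
    # Distance of each partially observed example (shared features only,
    # e.g. demographics) to its nearest fully observed example; the most
    # dissimilar ones are the examples whose remaining features
    # (e.g. lab tests) get acquired next.
    d = np.linalg.norm(
        partial_shared[:, None, :] - full_shared[None, :, :], axis=2
    ).min(axis=1)
    return np.argsort(-d)[:budget]

rng = np.random.default_rng(0)
full = rng.normal(0.0, 1.0, size=(50, 4))      # complete records
partial = rng.normal(0.5, 1.5, size=(200, 4))  # shared features only
print(select_for_elicitation(full, partial, budget=10))
```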

AAAI · 2017 · Conference Paper

Recovering True Classifier Performance in Positive-Unlabeled Learning

  • Shantanu Jain
  • Martha White
  • Predrag Radivojac

A common approach in positive-unlabeled learning is to train a classification model between labeled and unlabeled data. This strategy is in fact known to give an optimal classifier under mild conditions; however, it results in biased empirical estimates of the classifier performance. In this work, we show that the typically used performance measures such as the receiver operating characteristic curve, or the precision-recall curve obtained on such data can be corrected with the knowledge of class priors, i.e., the proportions of the positive and negative examples in the unlabeled data. We extend the results to a noisy setting where some of the examples labeled positive are in fact negative and show that the correction also requires the knowledge of the proportion of noisy examples in the labeled positives. Using state-of-the-art algorithms to estimate the positive class prior and the proportion of noise, we experimentally evaluate two correction approaches and demonstrate their efficacy on real-life data.
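
In the noiseless case the correction follows from mixture algebra: if the unlabeled sample is the mixture $\alpha f_+ + (1-\alpha) f_-$, a false positive rate measured against it blends the true rates of both classes, and knowing the prior $\alpha$ inverts the blend. A sketch under the assumption of clean labeled positives (the noisy case in the paper additionally needs the noise proportion):

```python
import numpy as np

def corrected_roc(scores_pos, scores_unl, alpha):
    # ROC computed with the unlabeled sample standing in for negatives.
    # For each threshold, the rate measured on the unlabeled data is
    #   fpr_unl = alpha * tpr + (1 - alpha) * fpr,
    # which is inverted to recover the true false positive rate.
    thresholds = np.unique(np.concatenate([scores_pos, scores_unl]))[::-1]
    tpr = np.array([(scores_pos >= t).mean() for t in thresholds])
    fpr_unl = np.array([(scores_unl >= t).mean() for t in thresholds])
    fpr = np.clip((fpr_unl - alpha * tpr) / (1.0 - alpha), 0.0, 1.0)
    return fpr, tpr
```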

NeurIPS · 2016 · Conference Paper

Estimating the class prior and posterior from noisy positives and unlabeled data

  • Shantanu Jain
  • Martha White
  • Predrag Radivojac

We develop a classification algorithm for estimating posterior distributions from positive-unlabeled data that is robust to noise in the positive labels and effective for high-dimensional data. In recent years, several algorithms have been proposed to learn from positive-unlabeled data; however, many of these contributions remain theoretical, performing poorly on real high-dimensional data that is typically contaminated with noise. We build on this previous work to develop two practical classification algorithms that explicitly model the noise in the positive labels and utilize univariate transforms built on discriminative classifiers. We prove that these univariate transforms preserve the class prior, enabling estimation in the univariate space and avoiding kernel density estimation for high-dimensional data. The theoretical development and parametric and nonparametric algorithms proposed here constitute an important step towards widespread use of robust classification algorithms for positive-unlabeled data.
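
The univariate transform itself is easy to sketch: train a discriminative classifier between labeled positives and unlabeled data and keep only its score. The histogram-ratio prior estimate below is a crude stand-in (an assumption, not one of the paper's parametric or nonparametric estimators).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def univariate_transform(pos, unl):
    # Score of a discriminative classifier trained to separate labeled
    # positives from unlabeled data; per the abstract, such transforms
    # preserve the class prior, so estimation can proceed in 1-D.
    X = np.vstack([pos, unl])
    s = np.concatenate([np.ones(len(pos)), np.zeros(len(unl))])
    clf = LogisticRegression().fit(X, s)
    return clf.predict_proba(pos)[:, 1], clf.predict_proba(unl)[:, 1]

def histogram_prior(t_pos, t_unl, bins=20):
    # Crude 1-D mixture-proportion bound: with f_u = a*f_p + (1-a)*f_n
    # and f_n >= 0, we have a <= f_u / f_p everywhere, so take the
    # minimum of the binned density ratio (noisy for small samples).
    edges = np.linspace(0.0, 1.0, bins + 1)
    hp, _ = np.histogram(t_pos, bins=edges, density=True)
    hu, _ = np.histogram(t_unl, bins=edges, density=True)
    mask = hp > 0
    return float(np.min(hu[mask] / hp[mask]))
```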