Arrow Research search

Author name cluster

Aron Culotta

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

14 papers
2 author rows

Possible papers (14)

AAAI Conference 2026 Conference Paper

The Illusion of Fairness: Auditing Fairness Interventions in Algorithmic Hiring with Audit Studies

  • Disa Sariola
  • Patrick Button
  • Aron Culotta
  • Nicholas Mattei

Classifiers trained on historical data are deployed in the real world to automate decisions from hiring to loan issuance. Judging the fairness and efficiency of these systems, and their human counterparts, is a complex and important topic studied across both computational and social sciences. One common way to address bias in classifiers is to resample the training data to offset distributional disparities. In the hiring domain, where results may vary by a protected class, many interventions from the literature equalize the hiring rate within the training set to alleviate bias. While simple and seemingly effective, these methods have typically only been evaluated using data obtained through convenience samples, e.g., data from a real-world hiring process, introducing selection and label bias. In the social and health sciences, audit studies, in which fictitious "testers" (resumes) are sent to subjects (job openings) in a randomized controlled trial, provide high-quality data that support rigorous estimates of discrimination by controlling for confounding factors. We investigate how data from audit studies can be used to improve our ability to both train and evaluate automated hiring algorithms. Specifically, we use data from a large audit study of age discrimination in hiring to test common resampling methods from the fair machine learning literature. We find that audit data of real-world hiring reveals cases where equalizing base rates across classes appears to achieve parity using traditional measures, but in fact exhibits an absolute disparity of roughly 10% when measured appropriately. We also show that corrections based on individual treatment effect estimation methods combined with audit study data can overcome these issues, underscoring the need for rigorous data collection in fairness research.
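
To make the audited intervention class concrete, here is a minimal resampling sketch (all variable names hypothetical; the paper evaluates interventions of this kind, not this exact code) that oversamples positive examples within each protected group until per-group hiring rates match the highest group's rate:

import numpy as np

def equalize_base_rates(X, y, group, rng=None):
    """Oversample positives per group so hiring rates match (assumes the
    target rate is strictly below 1). X, y, group are parallel NumPy arrays."""
    if rng is None:
        rng = np.random.default_rng(0)
    target = max(y[group == g].mean() for g in np.unique(group))
    keep = list(range(len(y)))
    for g in np.unique(group):
        pos = np.where((group == g) & (y == 1))[0]
        neg = np.where((group == g) & (y == 0))[0]
        # positives needed so pos / (pos + neg) equals the target rate
        need = int(target * len(neg) / (1 - target)) - len(pos)
        if need > 0 and len(pos) > 0:
            keep.extend(rng.choice(pos, size=need, replace=True))
    keep = np.array(keep)
    return X[keep], y[keep], group[keep]

Equalizing base rates this way is precisely the step that, per the audit data, can look fair under traditional measures while leaving a real disparity.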

AAAI Conference 2021 Conference Paper

Robustness to Spurious Correlations in Text Classification via Automatically Generated Counterfactuals

  • Zhao Wang
  • Aron Culotta

Spurious correlations threaten the validity of statistical classifiers. While model accuracy may appear high when the test data is from the same distribution as the training data, it can quickly degrade when the test distribution changes. For example, it has been shown that classifiers perform poorly when humans make minor modifications to change the label of an example. One solution to increase model reliability and generalizability is to identify causal associations between features and classes. In this paper, we propose to train a robust text classifier by augmenting the training data with automatically generated counterfactual data. We first identify likely causal features using a statistical matching approach. Next, we generate counterfactual samples for the original training data by substituting causal features with their antonyms and then assigning opposite labels to the counterfactual samples. Finally, we combine the original data and counterfactual data to train a robust classifier. Experiments on two classification tasks show that a traditional classifier trained on the original data does very poorly on human-generated counterfactual samples (e.g., 10%-37% drop in accuracy). However, the classifier trained on the combined data is more robust and performs well on both the original test data and the counterfactual test data (e.g., 12%-25% increase in accuracy compared with the traditional classifier). Detailed analysis shows that the robust classifier makes meaningful and trustworthy predictions by emphasizing causal features and de-emphasizing non-causal features.
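
A minimal sketch of the augmentation step, assuming a precomputed set of likely-causal words and a toy antonym lookup (both hypothetical here; the paper identifies causal features via a statistical matching approach):

# Toy antonym table; the paper derives substitutions more carefully.
ANTONYMS = {"good": "bad", "bad": "good", "love": "hate", "hate": "love",
            "great": "terrible", "terrible": "great"}

def counterfactual(text, label, causal_words):
    """Swap likely-causal words for antonyms and flip the label."""
    tokens, changed = text.split(), False
    for i, tok in enumerate(tokens):
        if tok.lower() in causal_words and tok.lower() in ANTONYMS:
            tokens[i] = ANTONYMS[tok.lower()]
            changed = True
    if not changed:
        return None  # no causal feature present; skip this example
    return " ".join(tokens), 1 - label

# Usage: train on the original pair plus its generated counterfactual.
train = [("the food was good", 1)]
cf = counterfactual(*train[0], causal_words={"good"})
if cf is not None:
    train.append(cf)  # ("the food was bad", 0)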

AAAI Conference 2019 Conference Paper

When Do Words Matter? Understanding the Impact of Lexical Choice on Audience Perception Using Individual Treatment Effect Estimation

  • Zhao Wang
  • Aron Culotta

Studies across many disciplines have shown that lexical choice can affect audience perception. For example, how users describe themselves in a social media profile can affect their perceived socio-economic status. However, we lack general methods for estimating the causal effect of lexical choice on the perception of a specific sentence. While randomized controlled trials may provide good estimates, they do not scale to the potentially millions of comparisons necessary to consider all lexical choices. Instead, in this paper, we offer two classes of methods to estimate the effect on perception of changing one word to another in a given sentence. The first class of algorithms builds upon quasi-experimental designs to estimate individual treatment effects from observational data. The second class treats treatment effect estimation as a classification problem. We conduct experiments with three data sources (Yelp, Twitter, and Airbnb), finding that the algorithmic estimates align well with those produced by randomized controlled trials. Additionally, we find that it is possible to transfer treatment effect classifiers across domains and still maintain high accuracy.
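
As an illustration of the matching flavor of these estimators, the following toy sketch (illustrative only; not the paper's algorithms) estimates the effect of replacing one word with another by pairing each "treated" sentence with its most similar "control" sentence and averaging the outcome differences:

import numpy as np

def effect_by_matching(sentences, outcomes, w_treat, w_ctrl):
    """Estimate the perception effect of w_treat vs. w_ctrl by matching
    sentences on their remaining words (Jaccard similarity)."""
    treat = [(set(s.split()) - {w_treat}, y)
             for s, y in zip(sentences, outcomes) if w_treat in s.split()]
    ctrl = [(set(s.split()) - {w_ctrl}, y)
            for s, y in zip(sentences, outcomes) if w_ctrl in s.split()]
    diffs = []
    for words_t, y_t in treat:
        sims = [len(words_t & w) / max(1, len(words_t | w)) for w, _ in ctrl]
        if sims:  # compare to the closest matching control sentence
            diffs.append(y_t - ctrl[int(np.argmax(sims))][1])
    return float(np.mean(diffs)) if diffs else 0.0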

JAIR Journal 2018 Journal Article

Robust Text Classification under Confounding Shift

  • Virgile Landeiro
  • Aron Culotta

As statistical classifiers become integrated into real-world applications, it is important to consider not only their accuracy but also their robustness to changes in the data distribution. Although identifying and controlling for confounding variables Z (correlated with both the input X of a classifier and its output Y) has been assiduously studied in empirical social science, it is often neglected in text classification. One explanation is that, if we assume the impact of confounding variables does not change between the time we fit a model and the time we use it, then prediction accuracy should only be slightly affected. We show in this paper that this assumption often does not hold, and that when the influence of a confounding variable changes from training time to prediction time (i.e., under confounding shift), classifier accuracy can degrade rapidly. We use Pearl's back-door adjustment as a predictive framework to develop a model robust to confounding shift, under the condition that Z is observed at training time. Our approach does not draw any causal conclusions, but through experiments on 6 datasets we show that it outperforms baselines 1) in controlled cases where confounding shift is manually injected between fitting time and prediction time; 2) in natural experiments where confounding shift appears either abruptly or gradually; and 3) in cases with one or more confounders. Finally, we discuss several issues we encountered during this research, such as the effect of noise in the observation of Z and the importance of controlling only for confounding variables.
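
A minimal sketch of back-door adjustment used as a predictive device, assuming a binary confounder z observed at training time (an sklearn-based illustration, not the paper's implementation):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_backdoor(X, y, z):
    """Fit P(y | x, z) by appending the observed confounder to the
    features, and record the training marginal P(z = 1)."""
    model = LogisticRegression(max_iter=1000)
    model.fit(np.hstack([X, z.reshape(-1, 1)]), y)
    return model, z.mean()

def predict_backdoor(model, p_z1, X):
    """P(y | x) = P(y | x, z=0) P(z=0) + P(y | x, z=1) P(z=1), i.e., sum
    the confounder out under its training-set marginal at test time."""
    X0 = np.hstack([X, np.zeros((len(X), 1))])
    X1 = np.hstack([X, np.ones((len(X), 1))])
    return (model.predict_proba(X0)[:, 1] * (1 - p_z1)
            + model.predict_proba(X1)[:, 1] * p_z1)

Because test-time predictions use the training marginal of z rather than whatever correlation holds at test time, the classifier's reliance on the confounder is held fixed even when its influence shifts.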

IJCAI Conference 2016 Conference Paper

Cold-Start Recommendations for Audio News Stories Using Matrix Factorization

  • Ehsan Mohammady Ardehaly
  • Aron Culotta
  • Vivek Sundararaman
  • Alwar Narayanan

We investigate a suite of recommendation algorithms for audio news listening applications. This domain presents several challenges that distinguish it from more commonly studied applications such as movie recommendations: (1) we do not receive explicit rating feedback, instead only observing when a user skips a story; (2) new stories arrive continuously, increasing the importance of making recommendations for items with few observations (the cold-start problem); (3) story attributes have high dimensionality, making it challenging to identify similar stories. To address the first challenge, we formulate the problem as predicting the percentage of a story a user will listen to; to address the remaining challenges, we propose several matrix factorization algorithms that cluster users, n-grams, and stories simultaneously, while optimizing prediction accuracy. We empirically evaluate our approach on a dataset of 50K users, 26K stories, and 975K interactions collected over a five-month period. We find that while simple models work well for stories with many observations, our proposed approach performs best for stories with few ratings, which is critical for the real-world deployment of such an application.
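
A toy latent-factor sketch in the spirit of the approach (not the paper's exact model, which additionally clusters users, n-grams, and stories): predict the fraction of a story a user listens to as a dot product of user and story factors, fit by SGD on observed triples:

import numpy as np

def fit_mf(interactions, n_users, n_stories, k=16, lr=0.05, reg=0.01, epochs=20):
    """interactions: iterable of (user_id, story_id, pct_listened) with
    pct_listened in [0, 1]; the prediction for (u, s) is U[u] @ V[s]."""
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(n_users, k))
    V = rng.normal(scale=0.1, size=(n_stories, k))
    for _ in range(epochs):
        for u, s, pct in interactions:
            err = pct - U[u] @ V[s]
            grad_u = err * V[s] - reg * U[u]  # gradients use the
            grad_v = err * U[u] - reg * V[s]  # pre-update factors
            U[u] += lr * grad_u
            V[s] += lr * grad_v
    return U, V

For cold-start stories with few observations, the paper's models also cluster n-grams alongside stories, which lets a brand-new story derive a sensible representation from its text.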

IJCAI Conference 2016 Conference Paper

Domain Adaptation for Learning from Label Proportions Using Self-Training

  • Ehsan Mohammady Ardehaly
  • Aron Culotta

Learning from Label Proportions (LLP) is a machine learning problem in which the training data consist of bags of instances, and only the class label distribution for each bag is known. In some domains label proportions are readily available; for example, by grouping social media users by location, one can use census statistics to build a classifier for user demographics. However, label proportions are unavailable in many domains, such as product review sites. The goal of this paper is to determine whether an LLP classifier fit in one domain can be modified to classify instances from another domain. To do so, we propose a domain adaptation algorithm that uses an LLP model fit on the source domain to generate label proportions for the target domain. A new LLP model is then fit on the target domain, and this self-training process is repeated to adapt the model from source to target. Our experiments on five diverse tasks indicate an 11% average absolute improvement in accuracy as compared to using LLP without domain adaptation. In contrast to existing domain adaptation algorithms, our approach requires only label proportions in the source domain, and the results suggest that the approach is effective even when the target domain is substantially different from the source domain.
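
Schematically, the self-training loop reads as follows, assuming a hypothetical LLP learner fit_llp(bags, proportions) that returns a classifier exposing predict_proba:

def adapt_llp(fit_llp, source_bags, source_props, target_bags, rounds=5):
    """Adapt an LLP model from source to target by repeatedly estimating
    target-bag label proportions and refitting on the target domain."""
    model = fit_llp(source_bags, source_props)
    for _ in range(rounds):
        # use the current model to estimate each target bag's label proportion
        est_props = [model.predict_proba(bag)[:, 1].mean()
                     for bag in target_bags]
        model = fit_llp(target_bags, est_props)  # refit on the target domain
    return model

Note that only label proportions from the source domain are ever required; the target domain contributes unlabeled bags whose proportions are bootstrapped by the model itself.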

JAIR Journal 2016 Journal Article

Predicting Twitter User Demographics using Distant Supervision from Website Traffic Data

  • Aron Culotta
  • Nirmal Kumar Ravi
  • Jennifer Cutler

Understanding the demographics of users of online social networks has important applications for health, marketing, and public messaging. Whereas most prior approaches rely on a supervised learning approach, in which individual users are labeled with demographics for training, we instead create a distantly labeled dataset by collecting audience measurement data for 1,500 websites (e.g., 50% of visitors to gizmodo.com are estimated to have a bachelor's degree). We then fit a regression model to predict these demographics from information about the followers of each website on Twitter. Using patterns derived both from textual content and the social network of each user, our final model produces an average held-out correlation of .77 across seven different variables (age, gender, education, ethnicity, income, parental status, and political preference). We then apply this model to classify individual Twitter users by ethnicity, gender, and political preference, finding performance that is surprisingly competitive with a fully supervised approach.
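
In outline, the distant-supervision setup (shared with the AAAI 2015 paper below) reduces to a regression over website-level follower features; a minimal sketch with placeholder data and hypothetical dimensions (the real features summarize each website's Twitter followers):

import numpy as np
from sklearn.linear_model import Ridge

n_sites, n_features = 1500, 5000
X = np.random.rand(n_sites, n_features)  # follower features per website
y = np.random.rand(n_sites)              # audience statistic per website,
                                         # e.g., fraction with a BA degree

model = Ridge(alpha=1.0).fit(X, y)

# An individual user represented in the same feature space can then be
# scored directly, yielding a demographic estimate for that user.
user = np.random.rand(1, n_features)
print(model.predict(user))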

AAAI Conference 2016 Conference Paper

Robust Text Classification in the Presence of Confounding Bias

  • Virgile Landeiro
  • Aron Culotta

As text classifiers become increasingly used in real-time applications, it is critical to consider not only their accuracy but also their robustness to changes in the data distribution. In this paper, we consider the case where there is a confounding variable Z that influences both the text features X and the class variable Y. For example, a classifier trained to predict the health status of a user based on their online communications may be confounded by socioeconomic variables. When the influence of Z changes from training to testing data, we find that classifier accuracy can degrade rapidly. Our approach, based on Pearl’s back-door adjustment, estimates the underlying effect of a text variable on the class variable while controlling for the confounding variable. Although our goal is prediction, not causal inference, we find that such adjustments are essential to building text classifiers that are robust to confounding variables. On three diverse text classification tasks, we find that covariate adjustment results in higher accuracy than competing baselines over a range of confounding relationships (e.g., in one setting, accuracy improves from 60% to 81%).
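
Formally, the adjustment (used here and in the journal version listed above) replaces the naive conditional with a sum over the confounder, weighted by its training-set marginal; for a discrete confounder Z:

P(y \mid x) \;=\; \sum_{z} P(y \mid x, z)\, P(z)

Here P(y | x, z) is fit on training data where Z is observed, so predictions no longer depend on the particular Z-Y association present in the training sample.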

AAAI Conference 2015 Conference Paper

Predicting the Demographics of Twitter Users from Website Traffic Data

  • Aron Culotta
  • Nirmal Kumar
  • Jennifer Cutler

Understanding the demographics of users of online social networks has important applications for health, marketing, and public messaging. In this paper, we predict the demographics of Twitter users based on whom they follow. Whereas most prior approaches rely on a supervised learning approach, in which individual users are labeled with demographics, we instead create a distantly labeled dataset by collecting audience measurement data for 1,500 websites (e.g., 50% of visitors to gizmodo.com are estimated to have a bachelor’s degree). We then fit a regression model to predict these demographics using information about the followers of each website on Twitter. The resulting average held-out correlation is .77 across six different variables (gender, age, ethnicity, education, income, and child status). We additionally validate the model on a smaller set of Twitter users labeled individually for ethnicity and gender, finding performance that is surprisingly competitive with a fully supervised approach.

AAAI Conference 2015 Conference Paper

Using Matched Samples to Estimate the Effects of Exercise on Mental Health via Twitter

  • Virgile Landeiro Dos Reis
  • Aron Culotta

Recent work has demonstrated the value of social media monitoring for health surveillance (e.g., tracking influenza or depression rates). It is an open question whether such data can be used to make causal inferences (e.g., determining which activities lead to increased depression rates). Even in traditional, restricted domains, estimating causal effects from observational data is highly susceptible to confounding bias. In this work, we estimate the effect of exercise on mental health from Twitter, relying on statistical matching methods to reduce confounding bias. We train a text classifier to estimate the volume of a user’s tweets expressing anxiety, depression, or anger, then compare two groups: those who exercise regularly (identified by their use of physical activity trackers like Nike+), and a matched control group. We find that those who exercise regularly have significantly fewer tweets expressing depression or anxiety; there is no significant difference in rates of tweets expressing anger. We additionally perform a sensitivity analysis to investigate how the many experimental design choices in such a study impact the final conclusions, including the quality of the classifier and the construction of the control group.
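
A toy nearest-neighbor matching sketch (hypothetical covariates such as posting volume, location, and account age; not the paper's exact matching procedure):

import numpy as np

def matched_difference(X_treat, y_treat, X_ctrl, y_ctrl):
    """Pair each exerciser with the most similar non-exerciser on the
    covariates, then average the outcome differences (e.g., difference in
    rates of depression-related tweets) over the matched pairs."""
    diffs = []
    for x, y in zip(X_treat, y_treat):
        j = int(np.argmin(np.linalg.norm(X_ctrl - x, axis=1)))
        diffs.append(y - y_ctrl[j])
    return float(np.mean(diffs))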

AAAI Conference 2014 Conference Paper

Anytime Active Learning

  • Maria Ramirez-Loaiza
  • Aron Culotta
  • Mustafa Bilgic

A common bottleneck in deploying supervised learning systems is collecting human-annotated examples. In many domains, annotators form an opinion about the label of an example incrementally; e.g., each additional word read from a document or each additional minute spent inspecting a video helps inform the annotation. In this paper, we investigate whether we can train learning systems more efficiently by requesting an annotation before inspection is fully complete (e.g., after reading only 25 words of a document). While doing so may reduce the overall annotation time, it also introduces the risk that the annotator might not be able to provide a label if interrupted too early. We propose an anytime active learning approach that optimizes the annotation time and response rate simultaneously. We conduct user studies on two document classification datasets and develop simulated annotators that mimic the users. Our simulated experiments show that anytime active learning outperforms several baselines on these two datasets. For example, with an annotation budget of one hour, training a classifier by annotating the first 25 words of each document reduces classification error by 17% over annotating the first 100 words of each document.
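
Schematically, the anytime selection step scores (document, truncation length) pairs rather than documents alone; every helper below is a hypothetical stand-in for a quantity the approach would model:

def choose_query(pool, lengths, uncertainty, p_answer, cost):
    """Pick the (document, k) pair with the best expected value per unit
    of annotation time: uncertainty(doc, k) scores how informative a label
    for the first k words would be, p_answer(k) estimates the chance the
    annotator can answer after only k words, and cost(k) is the expected
    annotation time."""
    best, best_score = None, float("-inf")
    for doc in pool:
        for k in lengths:  # e.g., [10, 25, 50, 100] words
            score = uncertainty(doc, k) * p_answer(k) / cost(k)
            if score > best_score:
                best, best_score = (doc, k), score
    return best  # show this document truncated to its first k words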

AIJ Journal 2006 Journal Article

Corrective feedback and persistent learning for information extraction

  • Aron Culotta
  • Trausti Kristjansson
  • Andrew McCallum
  • Paul Viola

To successfully embed statistical machine learning models in real-world applications, two post-deployment capabilities must be provided: (1) the ability to solicit user corrections and (2) the ability to update the model from these corrections. We refer to the former capability as corrective feedback and the latter as persistent learning. While these capabilities have a natural implementation for simple classification tasks such as spam filtering, we argue that a more careful design is required for structured classification tasks. One example of a structured classification task is information extraction, in which raw text is analyzed to automatically populate a database. In this work, we augment a probabilistic information extraction system with corrective feedback and persistent learning components to assist the user in building, correcting, and updating the extraction model. We describe methods of guiding the user to incorrect predictions, suggesting the most informative fields to correct, and incorporating corrections into the inference algorithm. We also present an active learning framework that minimizes not only how many examples a user must label, but also how difficult each example is to label. We empirically validate each of the technical components in simulation and quantify the user effort saved. We conclude that more efficient corrective feedback mechanisms lead to more effective persistent learning.
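
One concrete way to realize "incorporating corrections into the inference algorithm" is constrained Viterbi decoding, where user-corrected positions are clamped and decoding re-runs around them; a minimal sketch (illustrative, not the paper's exact system):

import numpy as np

def viterbi_with_corrections(emissions, transitions, corrections):
    """emissions: (T, L) per-token label scores; transitions: (L, L);
    corrections: {token_index: asserted_label}. Clamping a position lets
    the correction propagate to neighbors via the transition scores."""
    T, L = emissions.shape
    NEG = -1e9
    scores = emissions.copy()
    for t, lab in corrections.items():
        scores[t, :] = NEG    # forbid all labels at a corrected position...
        scores[t, lab] = 0.0  # ...except the one the user asserted
    dp = np.empty((T, L))
    back = np.zeros((T, L), dtype=int)
    dp[0] = scores[0]
    for t in range(1, T):
        for j in range(L):
            prev = dp[t - 1] + transitions[:, j]
            back[t, j] = int(np.argmax(prev))
            dp[t, j] = prev[back[t, j]] + scores[t, j]
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]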

AAAI Conference 2005 Conference Paper

Reducing Labeling Effort for Structured Prediction Tasks

  • Aron Culotta

A common obstacle preventing the rapid deployment of supervised machine learning algorithms is the lack of labeled training data. This is particularly expensive to obtain for structured prediction tasks, where each training instance may have multiple, interacting labels, all of which must be correctly annotated for the instance to be of use to the learner. Traditional active learning addresses this problem by optimizing the order in which the examples are labeled to increase learning efficiency. However, this approach does not consider the difficulty of labeling each example, which can vary widely in structured prediction tasks. For example, the labeling predicted by a partially trained system may be easier to correct for some instances than for others. We propose a new active learning paradigm which reduces not only how many instances the annotator must label, but also how difficult each instance is to annotate. The system also leverages information from partially correct predictions to efficiently solicit annotations from the user. We validate this active learning framework in an interactive information extraction system, reducing the total number of annotation actions by 22%.