Arrow Research search

Author name cluster

Mark Steyvers

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers

17

ICML Conference 2025 Conference Paper

Bayesian Inference for Correlated Human Experts and Classifiers

  • Markelle Kelly
  • Alex Boyd
  • Samuel Showalter
  • Mark Steyvers
  • Padhraic Smyth

Applications of machine learning often involve making predictions based on both model outputs and the opinions of human experts. In this context, we investigate the problem of querying experts for class label predictions, using as few human queries as possible, and leveraging the class probability estimates of pre-trained classifiers. We develop a general Bayesian framework for this problem, modeling expert correlation via a joint latent representation, enabling simulation-based inference about the utility of additional expert queries, as well as inference of posterior distributions over unobserved expert labels. We apply our approach to two real-world medical classification problems, as well as to CIFAR-10H and ImageNet-16H, demonstrating substantial reductions relative to baselines in the cost of querying human experts while maintaining high prediction accuracy.

NeurIPS Conference 2023 Conference Paper

Large Language Models Are Latent Variable Models: Explaining and Finding Good Demonstrations for In-Context Learning

  • Xinyi Wang
  • Wanrong Zhu
  • Michael Saxon
  • Mark Steyvers
  • William Yang Wang

In recent years, pre-trained large language models (LLMs) have demonstrated remarkable efficiency in achieving an inference-time few-shot learning capability known as in-context learning. However, existing literature has highlighted the sensitivity of this capability to the selection of few-shot demonstrations. Current understandings of the underlying mechanisms by which this capability arises from regular language model pretraining objectives remain disconnected from the real-world LLMs. This study aims to examine the in-context learning phenomenon through a Bayesian lens, viewing real-world LLMs as latent variable models. On this premise, we propose an algorithm to select optimal demonstrations from a set of annotated data with a small LM, and then directly generalize the selected demonstrations to larger LMs. We demonstrate significant improvement over baselines, averaged over eight GPT models on eight real-world text classification datasets. We also demonstrate the real-world usefulness of our algorithm on GSM8K, a math word problem dataset. Our empirical findings support our hypothesis that LLMs implicitly infer a latent variable containing task information.

AAAI Conference 2021 Conference Paper

Active Bayesian Assessment of Black-Box Classifiers

  • Disi Ji
  • Robert L. Logan
  • Padhraic Smyth
  • Mark Steyvers

Recent advances in machine learning have led to increased deployment of black-box classifiers across a wide variety of applications. In many such situations there is a critical need to both reliably assess the performance of these pre-trained models and to perform this assessment in a label-efficient manner (given that labels may be scarce and costly to collect). In this paper, we introduce an active Bayesian approach for assessment of classifier performance to satisfy the desiderata of both reliability and label-efficiency. We begin by developing inference strategies to quantify uncertainty for common assessment metrics such as accuracy, misclassification cost, and calibration error. We then propose a general framework for active Bayesian assessment using inferred uncertainty to guide efficient selection of instances for labeling, enabling better performance assessment with fewer labels. We demonstrate significant gains from our proposed active Bayesian approach via a series of systematic empirical experiments assessing the performance of modern neural classifiers (e. g. , ResNet and BERT) on several standard image and text classification datasets.

NeurIPS Conference 2021 Conference Paper

Combining Human Predictions with Model Probabilities via Confusion Matrices and Calibration

  • Gavin Kerrigan
  • Padhraic Smyth
  • Mark Steyvers

An increasingly common use case for machine learning models is augmenting the abilities of human decision makers. For classification tasks where neither the human nor model are perfectly accurate, a key step in obtaining high performance is combining their individual predictions in a manner that leverages their relative strengths. In this work, we develop a set of algorithms that combine the probabilistic output of a model with the class-level output of a human. We show theoretically that the accuracy of our combination model is driven not only by the individual human and model accuracies, but also by the model's confidence. Empirical results on image classification with CIFAR-10 and a subset of ImageNet demonstrate that such human-model combinations consistently have higher accuracies than the model or human alone, and that the parameters of the combination method can be estimated effectively with as few as ten labeled datapoints.

NeurIPS Conference 2020 Conference Paper

Can I Trust My Fairness Metric? Assessing Fairness with Unlabeled Data and Bayesian Inference

  • Disi Ji
  • Padhraic Smyth
  • Mark Steyvers

Group fairness is measured via parity of quantitative metrics across different protected demographic groups. In this paper, we investigate the problem of reliably assessing group fairness metrics when labeled examples are few but unlabeled examples are plentiful. We propose a general Bayesian framework that can augment labeled data with unlabeled data to produce more accurate and lower-variance estimates compared to methods based on labeled data alone. Our approach estimates calibrated scores (for unlabeled examples) of each group using a hierarchical latent variable model conditioned on labeled examples. This in turn allows for inference of posterior distributions for an array of group fairness metrics with a notion of uncertainty. We demonstrate that our approach leads to significant and consistent reductions in estimation error across multiple well-known fairness datasets, sensitive attributes, and predictive models. The results clearly show the benefits of using both unlabeled data and Bayesian inference in assessing whether a prediction model is fair or not.

IJCAI Conference 2019 Conference Paper

SAGE: A Hybrid Geopolitical Event Forecasting System

  • Fred Morstatter
  • Aram Galstyan
  • Gleb Satyukov
  • Daniel Benjamin
  • Andres Abeliuk
  • Mehrnoosh Mirtaheri
  • KSM Tozammel Hossain
  • Pedro Szekely

Forecasting of geopolitical events is a notoriously difficult task, with experts failing to significantly outperform a random baseline across many types of forecasting events. One successful way to increase the performance of forecasting tasks is to turn to crowdsourcing: leveraging many forecasts from non-expert users. Simultaneously, advances in machine learning have led to models that can produce reasonable, although not perfect, forecasts for many tasks. Recent efforts have shown that forecasts can be further improved by ``hybridizing'' human forecasters: pairing them with the machine models in an effort to combine the unique advantages of both. In this demonstration, we present Synergistic Anticipation of Geopolitical Events (SAGE), a platform for human/computer interaction that facilitates human reasoning with machine models.

JBHI Journal 2017 Journal Article

Content Coding of Psychotherapy Transcripts Using Labeled Topic Models

  • Garren Gaut
  • Mark Steyvers
  • Zac E. Imel
  • David C. Atkins
  • Padhraic Smyth

Psychotherapy represents a broad class of medical interventions received by millions of patients each year. Unlike most medical treatments, its primary mechanisms are linguistic; i. e. , the treatment relies directly on a conversation between a patient and provider. However, the evaluation of patient-provider conversation suffers from critical shortcomings, including intensive labor requirements, coder error, nonstandardized coding systems, and inability to scale up to larger data sets. To overcome these shortcomings, psychotherapy analysis needs a reliable and scalable method for summarizing the content of treatment encounters. We used a publicly available psychotherapy corpus from Alexander Street press comprising a large collection of transcripts of patient-provider conversations to compare coding performance for two machine learning methods. We used the labeled latent Dirichlet allocation (L-LDA) model to learn associations between text and codes, to predict codes in psychotherapy sessions, and to localize specific passages of within-session text representative of a session code. We compared the L-LDA model to a baseline lasso regression model using predictive accuracy and model generalizability (measured by calculating the area under the curve (AUC) from the receiver operating characteristic curve). The L-LDA model outperforms the lasso logistic regression model at predicting session-level codes with average AUC scores of 0. 79, and 0. 70, respectively. For fine-grained level coding, L-LDA and logistic regression are able to identify specific talk-turns representative of symptom codes. However, model performance for talk-turn identification is not yet as reliable as human coders. We conclude that the L-LDA model has the potential to be an objective, scalable method for accurate automated coding of psychotherapy sessions that perform better than comparable discriminative methods at session-level coding and can also predict fine-grained codes.

YNIMG Journal 2016 Journal Article

Why more is better: Simultaneous modeling of EEG, fMRI, and behavioral data

  • Brandon M. Turner
  • Christian A. Rodriguez
  • Tony M. Norcia
  • Samuel M. McClure
  • Mark Steyvers

The need to test a growing number of theories in cognitive science has led to increased interest in inferential methods that integrate multiple data modalities. In this manuscript, we show how a method for integrating three data modalities within a single framework provides (1) more detailed descriptions of cognitive processes and (2) more accurate predictions of unobserved data than less integrative methods. Specifically, we show how combining either EEG and fMRI with a behavioral model can perform substantially better than a behavioral-data-only model in both generative and predictive modeling analyses. We then show how a trivariate model – a model including EEG, fMRI, and behavioral data – outperforms bivariate models in both generative and predictive modeling analyses. Together, these results suggest that within an appropriate modeling framework, more data can be used to better constrain cognitive theory, and to generate more accurate predictions for behavioral and neural data.

YNIMG Journal 2013 Journal Article

A Bayesian framework for simultaneously modeling neural and behavioral data

  • Brandon M. Turner
  • Birte U. Forstmann
  • Eric-Jan Wagenmakers
  • Scott D. Brown
  • Per B. Sederberg
  • Mark Steyvers

Scientists who study cognition infer underlying processes either by observing behavior (e. g. , response times, percentage correct) or by observing neural activity (e. g. , the BOLD response). These two types of observations have traditionally supported two separate lines of study. The first is led by cognitive modelers, who rely on behavior alone to support their computational theories. The second is led by cognitive neuroimagers, who rely on statistical models to link patterns of neural activity to experimental manipulations, often without any attempt to make a direct connection to an explicit computational theory. Here we present a flexible Bayesian framework for combining neural and cognitive models. Joining neuroimaging and computational modeling in a single hierarchical framework allows the neural data to influence the parameters of the cognitive model and allows behavioral data, even in the absence of neural data, to constrain the neural model. Critically, our Bayesian approach can reveal interactions between behavioral and neural parameters, and hence between neural activity and cognitive mechanisms. We demonstrate the utility of our approach with applications to simulated fMRI data with a recognition model and to diffusion-weighted imaging data with a response time model of perceptual choice.

NeurIPS Conference 2013 Conference Paper

Scoring Workers in Crowdsourcing: How Many Control Questions are Enough?

  • Qiang Liu
  • Alexander Ihler
  • Mark Steyvers

We study the problem of estimating continuous quantities, such as prices, probabilities, and point spreads, using a crowdsourcing approach. A challenging aspect of combining the crowd's answers is that workers' reliabilities and biases are usually unknown and highly diverse. Control items with known answers can be used to evaluate workers' performance, and hence improve the combined results on the target items with unknown answers. This raises the problem of how many control items to use when the total number of items each workers can answer is limited: more control items evaluates the workers better, but leaves fewer resources for the target items that are of direct interest, and vice versa. We give theoretical results for this problem under different scenarios, and provide a simple rule of thumb for crowdsourcing practitioners. As a byproduct, we also provide theoretical analysis of the accuracy of different consensus methods.

NeurIPS Conference 2010 Conference Paper

Learning concept graphs from text with stick-breaking priors

  • America Chambers
  • Padhraic Smyth
  • Mark Steyvers

We present a generative probabilistic model for learning general graph structures, which we term concept graphs, from text. Concept graphs provide a visual summary of the thematic content of a collection of documents-a task that is difficult to accomplish using only keyword search. The proposed model can learn different types of concept graph structures and is capable of utilizing partial prior knowledge about graph structure as well as labeled documents. We describe a generative model that is based on a stick-breaking process for graphs, and a Markov Chain Monte Carlo inference procedure. Experiments on simulated data show that the model can recover known graph structure when learning in both unsupervised and semi-supervised modes. We also show that the proposed model is competitive in terms of empirical log likelihood with existing structure-based topic models (such as hPAM and hLDA) on real-world text data sets. Finally, we illustrate the application of the model to the problem of updating Wikipedia category graphs.

NeurIPS Conference 2009 Conference Paper

The Wisdom of Crowds in the Recollection of Order Information

  • Mark Steyvers
  • Brent Miller
  • Pernille Hemmer
  • Michael Lee

When individuals independently recollect events or retrieve facts from memory, how can we aggregate these retrieved memories to reconstruct the actual set of events or facts? In this research, we report the performance of individuals in a series of general knowledge tasks, where the goal is to reconstruct from memory the order of historic events, or the order of items along some physical dimension. We introduce two Bayesian models for aggregating order information based on a Thurstonian approach and Mallows model. Both models assume that each individuals reconstruction is based on either a random permutation of the unobserved ground truth, or by a pure guessing strategy. We apply MCMC to make inferences about the underlying truth and the strategies employed by individuals. The models demonstrate a wisdom of crowds" effect, where the aggregated orderings are closer to the true ordering than the orderings of the best individual. "

NeurIPS Conference 2006 Conference Paper

Modeling General and Specific Aspects of Documents with a Probabilistic Topic Model

  • Chaitanya Chemudugunta
  • Padhraic Smyth
  • Mark Steyvers

Techniques such as probabilistic topic models and latent-semantic indexing have been shown to be broadly useful at automatically extracting the topical or seman- tic content of documents, or more generally for dimension-reduction of sparse count data. These types of models and algorithms can be viewed as generating an abstraction from the words in a document to a lower-dimensional latent variable representation that captures what the document is generally about beyond the spe- cific words it contains. In this paper we propose a new probabilistic model that tempers this approach by representing each document as a combination of (a) a background distribution over common words, (b) a mixture distribution over gen- eral topics, and (c) a distribution over words that are treated as being specific to that document. We illustrate how this model can be used for information retrieval by matching documents both at a general topic level and at a specific word level, providing an advantage over techniques that only match documents at a general level (such as topic models or latent-sematic indexing) or that only match docu- ments at the specific word level (such as TF-IDF). 1 Introduction and Motivation Reducing high-dimensional data vectors to robust and interpretable lower-dimensional representa- tions has a long and successful history in data analysis, including recent innovations such as latent semantic indexing (LSI) (Deerwester et al, 1994) and latent Dirichlet allocation (LDA) (Blei, Ng, and Jordan, 2003). These types of techniques have found broad application in modeling of sparse high-dimensional count data such as the “bag of words” representations for documents or transaction data for Web and retail applications. Approaches such as LSI and LDA have both been shown to be useful for “object matching” in their respective latent spaces. In information retrieval for example, both a query and a set of documents can be represented in the LSI or topic latent spaces, and the documents can be ranked in terms of how well they match the query based on distance or similarity in the latent space. The mapping to latent space represents a generalization or abstraction away from the sparse set of observed words, to a “higher-level” semantic representation in the latent space. These abstractions in principle lead to better generalization on new data compared to inferences carried out directly in the original sparse high-dimensional space. The capability of these models to provide improved generalization has been demonstrated empirically in a number of studies (e. g. , Deerwester et al 1994; Hofmann 1999; Canny 2004; Buntine et al, 2005). However, while this type of generalization is broadly useful in terms of inference and prediction, there are situations where one can over-generalize. Consider trying to match the following query to a historical archive of news articles: election + campaign + Camejo. The query is intended to find documents that are about US presidential campaigns and also about Peter Camejo (who ran as vice-presidential candidate alongside independent Ralph Nader in 2004). LSI and topic models are likely to highly rank articles that are related to presidential elections (even if they don’t necessarily contain the words election or campaign). However, a potential problem is that the documents that are highly ranked by LSI or topic models need not include any mention of the name Camejo. The reason is that the combination of words in this query is likely to activate one or more latent variables related to the concept of presidential campaigns. However, once this generalization is made the model has “lost” the information about the specific word Camejo and it will only show up in highly ranked documents if this word happens to frequently occur in these topics (unlikely in this case given that this candidate received relatively little media coverage compared to the coverage given to the candidates from the two main parties). But from the viewpoint of the original query, our preference would be to get documents that are about the general topic of US presidential elections with the specific constraint that they mention Peter Camejo. techniques, such as the widely-used term-frequency inverse-document- Word-based retrieval frequency (TF-IDF) method, have the opposite problem in general. They tend to be overly specific in terms of matching words in the query to documents. In general of course one would like to have a balance between generality and specificity. One ad hoc approach is to combine scores from a general method such as LSI with those from a more specific method such as TF-IDF in some manner, and indeed this technique has been proposed in information retrieval (Vogt and Cottrell, 1999). Similarly, in the ad hoc LDA approach (Wei and Croft, 2006), the LDA model is linearly combined with document-specific word distributions to capture both general as well as specific information in documents. However, neither method is entirely satisfactory since it is not clear how to trade-off generality and specificity in a principled way. The contribution of this paper is a new graphical model based on latent topics that handles the trade- off between generality and specificity in a fully probabilistic and automated manner. The model, which we call the special words with background (SWB) model, is an extension of the LDA model. The new model allows words in documents to be modeled as either originating from general topics, or from document-specific “special” word distributions, or from a corpus-wide background distribu- tion. The idea is that words in a document such as election and campaign are likely to come from a general topic on presidential elections, whereas a name such as Camejo is much more likely to be treated as “non-topical” and specific to that document. Words in queries are automatically inter- preted (in a probabilistic manner) as either being topical or special, in the context of each document, allowing for a data-driven document-specific trade-off between the benefits of topic-based abstrac- tion and specific word matching. Daum´e and Marcu (2006) independently proposed a probabilistic model using similar concepts for handling different training and test distributions in classification problems. Although we have focused primarily on documents in information retrieval in the discussion above, the model we propose can in principle be used on any large sparse matrix of count data. For example, transaction data sets where rows are individuals and columns correspond to items purchased or Web sites visited are ideally suited to this approach. The latent topics can capture broad patterns of population behavior and the “special word distributions” can capture the idiosyncracies of specific individuals. Section 2 reviews the basic principles of the LDA model and introduces the new SWB model. Sec- tion 3 illustrates how the model works in practice using examples from New York Times news articles. In Section 4 we describe a number of experiments with 4 different document sets, includ- ing perplexity experiments and information retrieval experiments, illustrating the trade-offs between generalization and specificity for different models. Section 5 contains a brief discussion and con- cluding comments. 2 A Topic Model for Special Words Figure 1(a) shows the graphical model for what we will refer to as the “standard topic model” or LDA. There are D documents and document d has Nd words. α and β are fixed parameters of symmetric Dirichlet priors for the D document-topic multinomials represented by θ and the T topic- word multinomials represented by φ. In the generative model, for each document d, the Nd words

NeurIPS Conference 2005 Conference Paper

Prediction and Change Detection

  • Mark Steyvers
  • Scott Brown

We measure the ability of human observers to predict the next datum in a sequence that is generated by a simple statistical process undergoing change at random points in time. Accurate performance in this task requires the identification of changepoints. We assess individual differences between observers both empirically, and using two kinds of models: a Bayesian approach for change detection and a family of cognitively plausible fast and frugal models. Some individuals detect too many changes and hence perform sub-optimally due to excess variability. Other individuals do not detect enough changes, and perform sub-optimally because they fail to notice short-term temporal trends. 1 I n t r o d u c t i o n Decision-making often requires a rapid response to change. For example, stock analysts need to quickly detect changes in the market in order to adjust investment strategies. Coaches need to track changes in a player’s performance in order to adjust strategy. When tracking changes, there are costs involved when either more or less changes are observed than actually occurred. For example, when using an overly conservative change detection criterion, a stock analyst might miss important short-term trends and interpret them as random fluctuations instead. On the other hand, a change may also be detected too readily. For example, in basketball, a player who makes a series of consecutive baskets is often identified as a “hot hand” player whose underlying ability is perceived to have suddenly increased [1, 2]. This might lead to sub-optimal passing strategies, based on random fluctuations. We are interested in explaining individual differences in a sequential prediction task. Observers are shown stimuli generated from a simple statistical process with the task of predicting the next datum in the sequence. The latent parameters of the statistical process change discretely at random points in time. Performance in this task depends on the accurate detection of those changepoints, as well as inference about future outcomes based on the outcomes that followed the most recent inferred changepoint. There is much prior research in statistics on the problem of identifying changepoints [3, 4, 5]. In this paper, we adopt a Bayesian approach to the changepoint identification problem and develop a simple inference procedure to predict the next datum in a sequence. The Bayesian model serves as an ideal observer model and is useful to characterize the ways in which individuals deviate from optimality. The plan of the paper is as follows. We first introduce the sequential prediction task and discuss a Bayesian analysis of this prediction problem. We then discuss the results from a few individuals in this prediction task and show how the Bayesian approach can capture individual differences with a single “twitchiness” parameter that describes how readily changes are perceived in random sequences. We will show that some individuals are too twitchy: their performance is too variable because they base their predictions on too little of the recent data. Other individuals are not twitchy enough, and they fail to capture fast changes in the data. We also show how behavior can be explained with a set of fast and frugal models [6]. These are cognitively realistic models that operate under plausible computational constraints. 2 A p r e d i c t i o n t a s k w i t h m u l t i p l e c h a n g e p o i n t s In the prediction task, stimuli are presented sequentially and the task is to predict the next stimulus in the sequence. After t trials, the observer has been presented with stimuli y1, y2, …, yt and the task is to make a prediction about yt+1. After the prediction is made, the actual outcome yt+1 is revealed and the next trial proceeds to the prediction of yt+2. This procedure starts with y1 and is repeated for T trials. The observations yt are D-dimensional vectors with elements sampled from binomial distributions. The parameters of those distributions change discretely at random points in time such that the mean increases or decreases after a change point. This generates a sequence of observation vectors, y1, y2, …, yT, where each yt = {yt, 1 … yt, D}. Each of the yt, d is sampled from a binomial distribution Bin(θt, d, K), so 0 ≤ yt, d ≤ K. The parameter vector θt ={θt, 1 … θt, D} changes depending on the locations of the changepoints. At each time step, x is a binary indicator for the occurrence of a t changepoint occurring at time t+1. The parameter α determines the probability of a change occurring in the sequence. The generative model is specified by the following algorithm: For d=1. .D sample θ1, d from a Uniform(0, 1) distribution

NeurIPS Conference 2004 Conference Paper

Integrating Topics and Syntax

  • Thomas Griffiths
  • Mark Steyvers
  • David Blei
  • Joshua Tenenbaum

Statistical approaches to language learning typically focus on either short-range syntactic dependencies or long-range semantic dependencies between words. We present a generative model that uses both kinds of dependencies, and can be used to simultaneously find syntactic classes and semantic topics despite having no representation of syntax or seman- tics beyond statistical dependency. This model is competitive on tasks like part-of-speech tagging and document classification with models that exclusively use short- and long-range dependencies respectively.

NeurIPS Conference 2002 Conference Paper

Prediction and Semantic Association

  • Thomas Griffiths
  • Mark Steyvers

We explore the consequences of viewing semantic association as the result of attempting to predict the concepts likely to arise in a particular context. We argue that the success of existing accounts of semantic representation comes as a result of indirectly addressing this problem, and show that a closer correspondence to human data can be obtained by taking a probabilistic approach that explicitly models the generative structure of language.