Author name cluster

Tom Mitchell

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

27 papers

1 author row

AAAI Conference 2021 Conference Paper

Conversational Neuro-Symbolic Commonsense Reasoning

Forough Arabshahi
Jennifer Lee
Mikayla Gawarecki
Kathryn Mazaitis
Amos Azaria
Tom Mitchell

In order for conversational AI systems to hold more natural and broad-ranging conversations, they will require much more commonsense, including the ability to identify unstated presumptions of their conversational partners. For example, in the command “If it snows at night then wake me up early because I don’t want to be late for work” the speaker relies on commonsense reasoning of the listener to infer the implicit presumption that they wish to be woken only if it snows enough to cause traffic slowdowns. We consider here the problem of understanding such imprecisely stated natural language commands given in the form of if-(state), then-(action), because- (goal) statements. More precisely, we consider the problem of identifying the unstated presumptions of the speaker that allow the requested action to achieve the desired goal from the given state (perhaps elaborated by making the implicit presumptions explicit). We release a benchmark data set for this task, collected from humans and annotated with commonsense presumptions. We present a neuro-symbolic theorem prover that extracts multi-hop reasoning chains, and apply it to this problem. Furthermore, to accommodate the reality that current AI commonsense systems lack full coverage, we also present an interactive conversational framework built on our neurosymbolic system, that conversationally evokes commonsense knowledge from humans to complete its reasoning chains.

AAAI Conference 2021 Conference Paper

We Don’t Speak the Same Language: Interpreting Polarization through Machine Translation

Ashiqur R. KhudaBukhsh
Rupak Sarkar
Mark S. Kamlet
Tom Mitchell

Polarization among US political parties, media and elites is a widely studied topic. Prominent lines of prior research across multiple disciplines have observed and analyzed growing polarization in social media. In this paper, we present a new methodology that offers a fresh perspective on interpreting polarization through the lens of machine translation. With a novel proposition that two sub-communities are speaking in two different languages, we demonstrate that modern machine translation methods can provide a simple yet powerful and interpretable framework to understand the differences between two (or more) large-scale social media discussion data sets at the granularity of words. Via a substantial corpus of 86. 6 million comments by 6. 5 million users on over 200, 000 news videos hosted by YouTube channels of four prominent US news networks, we demonstrate that simple word-level and phrase-level translation pairs can reveal deep insights into the current political divide – what is black lives matter to one can be all lives matter to the other.

AAAI Conference 2020 Conference Paper

Contextual Parameter Generation for Knowledge Graph Link Prediction

George Stoica
Otilia Stretcu
Emmanouil Antonios Platanios
Tom Mitchell
Barnabás Póczos

We consider the task of knowledge graph link prediction. Given a question consisting of a source entity and a relation (e. g. , Shakespeare and BornIn), the objective is to predict the most likely answer entity (e. g. , England). Recent approaches tackle this problem by learning entity and relation embeddings. However, they often constrain the relationship between these embeddings to be additive (i. e. , the embeddings are concatenated and then processed by a sequence of linear functions and element-wise non-linearities). We show that this type of interaction signiﬁcantly limits representational power. For example, such models cannot handle cases where a different projection of the source entity is used for each relation. We propose to use contextual parameter generation to address this limitation. More speciﬁcally, we treat relations as the context in which source entities are processed to produce predictions, by using relation embeddings to generate the parameters of a model operating over source entity embeddings. This allows models to represent more complex interactions between entities and relations. We apply our method on two existing link prediction methods, including the current state-of-the-art, resulting in signiﬁcant performance gains and establishing a new state-of-the-art for this task. These gains are achieved while also reducing convergence time by up to 28 times.

NeurIPS Conference 2019 Conference Paper

Game Design for Eliciting Distinguishable Behavior

Fan Yang
Liu Leqi
Yifan Wu
Zachary Lipton
Pradeep Ravikumar
Tom Mitchell
William Cohen

The ability to inferring latent psychological traits from human behavior is key to developing personalized human-interacting machine learning systems. Approaches to infer such traits range from surveys to manually-constructed experiments and games. However, these traditional games are limited because they are typically designed based on heuristics. In this paper, we formulate the task of designing behavior diagnostic games that elicit distinguishable behavior as a mutual information maximization problem, which can be solved by optimizing a variational lower bound. Our framework is instantiated by using prospect theory to model varying player traits, and Markov Decision Processes to parameterize the games. We validate our approach empirically, showing that our designed games can successfully distinguish among players with different traits, outperforming manually-designed ones by a large margin.

NeurIPS Conference 2019 Conference Paper

Learning Data Manipulation for Augmentation and Weighting

Zhiting Hu
Bowen Tan
Russ Salakhutdinov
Tom Mitchell
Eric Xing

Manipulating data, such as weighting data examples or augmenting with new instances, has been increasingly used to improve model training. Previous work has studied various rule- or learning-based approaches designed for specific types of data manipulation. In this work, we propose a new method that supports learning different manipulation schemes with the same gradient-based algorithm. Our approach builds upon a recent connection of supervised learning and reinforcement learning (RL), and adapts an off-the-shelf reward learning algorithm from RL for joint data manipulation learning and model training. Different parameterization of the ``data reward'' function instantiates different manipulation schemes. We showcase data augmentation that learns a text transformation network, and data weighting that dynamically adapts the data sample importance. Experiments show the resulting algorithms significantly improve the image and text classification performance in low data regime and class-imbalance problems.

NeurIPS Conference 2018 Conference Paper

Learning Pipelines with Limited Data and Domain Knowledge: A Study in Parsing Physics Problems

Mrinmaya Sachan
Kumar Avinava Dubey
Tom Mitchell
Dan Roth
Eric Xing

As machine learning becomes more widely used in practice, we need new methods to build complex intelligent systems that integrate learning with existing software, and with domain knowledge encoded as rules. As a case study, we present such a system that learns to parse Newtonian physics problems in textbooks. This system, Nuts&Bolts, learns a pipeline process that incorporates existing code, pre-learned machine learning models, and human engineered rules. It jointly trains the entire pipeline to prevent propagation of errors, using a combination of labelled and unlabelled data. Our approach achieves a good performance on the parsing task, outperforming the simple pipeline and its variants. Finally, we also show how Nuts&Bolts can be used to achieve improvements on a relation extraction task and on the end task of answering Newtonian physics problems.

NeurIPS Conference 2017 Conference Paper

Estimating Accuracy from Unlabeled Data: A Probabilistic Logic Approach

Emmanouil Platanios
Hoifung Poon
Tom Mitchell
Eric Horvitz

We propose an efficient method to estimate the accuracy of classifiers using only unlabeled data. We consider a setting with multiple classification problems where the target classes may be tied together through logical constraints. For example, a set of classes may be mutually exclusive, meaning that a data instance can belong to at most one of them. The proposed method is based on the intuition that: (i) when classifiers agree, they are more likely to be correct, and (ii) when the classifiers make a prediction that violates the constraints, at least one classifier must be making an error. Experiments on four real-world data sets produce accuracy estimates within a few percent of the true accuracy, using solely unlabeled data. Our models also outperform existing state-of-the-art solutions in both estimating accuracies, and combining multiple classifier outputs. The results emphasize the utility of logical constraints in estimating accuracy, thus validating our intuition.

IJCAI Conference 2017 Conference Paper

Parsing Natural Language Conversations using Contextual Cues

Shashank Srivastava
Amos Azaria
Tom Mitchell

In this work, we focus on semantic parsing of natural language conversations. Most existing methods for semantic parsing are based on understanding the semantics of a single sentence at a time. However, understanding conversations also requires an understanding of conversational context and discourse structure across sentences. We formulate semantic parsing of conversations as a structured prediction task, incorporating structural features that model the `flow of discourse' across sequences of utterances. We create a dataset for semantic parsing of conversations, consisting of 113 real-life sequences of interactions of human users with an automated email assistant. The data contains 4759 natural language statements paired with annotated logical forms. Our approach yields significant gains in performance over traditional semantic parsing.

AAAI Conference 2016 Conference Paper

Inferring Interpersonal Relations in Narrative Summaries

Shashank Srivastava
Snigdha Chaturvedi
Tom Mitchell

Characterizing relationships between people is fundamental for the understanding of narratives. In this work, we address the problem of inferring the polarity of relationships between people in narrative summaries. We formulate the problem as a joint structured prediction for each narrative, and present a general model that combines evidence from linguistic and semantic features, as well as features based on the structure of the social community in the text. We additionally provide a clustering-based approach that can exploit regularities in narrative types. e. g. , learn an afﬁnity for love-triangles in romantic stories. On a dataset of movie summaries from Wikipedia, our structured models provide more than a 30% error-reduction over a competitive baseline that considers pairs of characters in isolation.

AAAI Conference 2016 Conference Paper

Instructable Intelligent Personal Agent

Amos Azaria
Jayant Krishnamurthy
Tom Mitchell

Unlike traditional machine learning methods, humans often learn from natural language instruction. As users become increasingly accustomed to interacting with mobile devices using speech, their interest in instructing these devices in natural language is likely to grow. We introduce our Learning by Instruction Agent (LIA), an intelligent personal agent that users can teach to perform new action sequences to achieve new commands, using solely natural language interaction. LIA uses a CCG semantic parser to ground the semantics of each command in terms of primitive executable procedures deﬁning sensors and effectors of the agent. Given a natural language command that LIA does not understand, it prompts the user to explain how to achieve the command through a sequence of steps, also speciﬁed in natural language. A novel lexicon induction algorithm enables LIA to generalize across taught commands, e. g. , having been taught how to “forward an email to Alice, ” LIA can correctly interpret the command “forward this email to Bob. ” A user study involving email tasks demonstrates that users voluntarily teach LIA new commands, and that these taught commands signiﬁcantly reduce task completion time. These results demonstrate the potential of natural language instruction as a signiﬁcant, under-explored paradigm for machine learning.

IJCAI Conference 2015 Conference Paper

AskWorld: Budget-Sensitive Query Evaluation for Knowledge-on-Demand

Mehdi Samadi
Partha Talukdar
Manuela Veloso
Tom Mitchell

Recently, several Web-scale knowledge harvesting systems have been built, each of which is competent at extracting information from certain types of data (e. g. , unstructured text, structured tables on the web, etc.). In order to determine the response to a new query posed to such systems (e. g. , is sugar a healthy food?), it is useful to integrate opinions from multiple systems. If a response is desired within a specific time budget (e. g. , in less than 2 seconds), then maybe only a subset of these resources can be queried. In this paper, we address the problem of knowledge integration for on-demand time-budgeted query answering. We propose a new method, AskWorld, which learns a policy that chooses which queries to send to which resources, by accommodating varying budget constraints that are available only at query (test) time. Through extensive experiments on real world datasets, we demonstrate AskWorld’s capability in selecting most informative resources to query within test-time constraints, resulting in improved performance compared to competitive baselines.

AAAI Conference 2015 Conference Paper

Never-Ending Learning

Tom Mitchell
William Cohen
Estevam Hruschka
Partha Talukdar
Justin Betteridge
Andrew Carlson
Bhavana Dalvi Mishra
Matthew Gardner

Whereas people learn many different types of knowledge from diverse experiences over many years, most current machine learning systems acquire just a single function or data model from just a single data set. We propose a never-ending learning paradigm for machine learning, to better reﬂect the more ambitious and encompassing type of learning performed by humans. As a case study, we describe the Never- Ending Language Learner (NELL), which achieves some of the desired properties of a never-ending learner, and we discuss lessons learned. NELL has been learning to read the web 24 hours/day since January 2010, and so far has acquired a knowledge base with over 80 million conﬁdence-weighted beliefs (e. g. , servedWith(tea, biscuits)), while learning continually to improve its reading competence over time. NELL has also learned to reason over its knowledge base to infer new beliefs from old ones, and is now beginning to extend its ontology by synthesizing new relational predicates. NELL can be tracked online at http: //rtw. ml. cmu. edu, and followed on Twitter at @CMUNELL.

YNIMG Journal 2014 Journal Article

Multivariate analysis of correlation between electrophysiological and hemodynamic responses during cognitive processing

Jan Kujala
Gustavo Sudre
Johanna Vartiainen
Mia Liljeström
Tom Mitchell
Riitta Salmelin

Animal and human studies have frequently shown that in primary sensory and motor regions the BOLD signal correlates positively with high-frequency and negatively with low-frequency neuronal activity. However, recent evidence suggests that this relationship may also vary across cortical areas. Detailed knowledge of the possible spectral diversity between electrophysiological and hemodynamic responses across the human cortex would be essential for neural-level interpretation of fMRI data and for informative multimodal combination of electromagnetic and hemodynamic imaging data, especially in cognitive tasks. We applied multivariate partial least squares correlation analysis to MEG–fMRI data recorded in a reading paradigm to determine the correlation patterns between the data types, at once, across the cortex. Our results revealed heterogeneous patterns of high-frequency correlation between MEG and fMRI responses, with marked dissociation between lower and higher order cortical regions. The low-frequency range showed substantial variance, with negative and positive correlations manifesting at different frequencies across cortical regions. These findings demonstrate the complexity of the neurophysiological counterparts of hemodynamic fluctuations in cognitive processing.

YNIMG Journal 2012 Journal Article

Tracking neural coding of perceptual and semantic features of concrete nouns

Gustavo Sudre
Dean Pomerleau
Mark Palatucci
Leila Wehbe
Alona Fyshe
Riitta Salmelin
Tom Mitchell

We present a methodological approach employing magnetoencephalography (MEG) and machine learning techniques to investigate the flow of perceptual and semantic information decodable from neural activity in the half second during which the brain comprehends the meaning of a concrete noun. Important information about the cortical location of neural activity related to the representation of nouns in the human brain has been revealed by past studies using fMRI. However, the temporal sequence of processing from sensory input to concept comprehension remains unclear, in part because of the poor time resolution provided by fMRI. In this study, subjects answered 20 questions (e. g. is it alive?) about the properties of 60 different nouns prompted by simultaneous presentation of a pictured item and its written name. Our results show that the neural activity observed with MEG encodes a variety of perceptual and semantic features of stimuli at different times relative to stimulus onset, and in different cortical locations. By decoding these features, our MEG-based classifier was able to reliably distinguish between two different concrete nouns that it had never seen before. The results demonstrate that there are clear differences between the time course of the magnitude of MEG activity and that of decodable semantic information. Perceptual features were decoded from MEG activity earlier in time than semantic features, and features related to animacy, size, and manipulability were decoded consistently across subjects. We also observed that regions commonly associated with semantic processing in the fMRI literature may not show high decoding results in MEG. We believe that this type of approach and the accompanying machine learning methods can form the basis for further modeling of the flow of neural information during language processing and a variety of other cognitive processes.

YNIMG Journal 2011 Journal Article

Quantitative modeling of the neural representation of objects: How semantic feature norms can account for fMRI activation

Kai-min Kevin Chang
Tom Mitchell
Marcel Adam Just

Recent multivariate analyses of fMRI activation have shown that discriminative classifiers such as Support Vector Machines (SVM) are capable of decoding fMRI-sensed neural states associated with the visual presentation of categories of various objects. However, the lack of a generative model of neural activity limits the generality of these discriminative classifiers for understanding the underlying neural representation. In this study, we propose a generative classifier that models the hidden factors that underpin the neural representation of objects, using a multivariate multiple linear regression model. The results indicate that object features derived from an independent behavioral feature norming study can explain a significant portion of the systematic variance in the neural activity observed in an object-contemplation task. Furthermore, the resulting regression model is useful for classifying a previously unseen neural activation vector, indicating that the distributed pattern of neural activities encodes sufficient signal to discriminate differences among stimuli. More importantly, there appears to be a double dissociation between the two classifier approaches and within- versus between-participants generalization. Whereas an SVM-based discriminative classifier achieves the best classification accuracy in within-participants analysis, the generative classifier outperforms an SVM-based model which does not utilize such intermediate representations in between-participants analysis. This pattern of results suggests the SVM-based classifier may be picking up some idiosyncratic patterns that do not generalize well across participants and that good generalization across participants may require broad, large-scale patterns that are used in our set of intermediate semantic features. Finally, this intermediate representation allows us to extrapolate the model of the neural activity to previously unseen words, which cannot be done with a discriminative classifier.

AAAI Conference 2010 Conference Paper

Toward an Architecture for Never-Ending Language Learning

Andrew Carlson
Justin Betteridge
Bryan Kisiel
Burr Settles
Estevam Hruschka
Tom Mitchell

We consider here the problem of building a never-ending language learner; that is, an intelligent computer agent that runs forever and that each day must (1) extract, or read, information from the web to populate a growing structured knowledge base, and (2) learn to perform this task better than on the previous day. In particular, we propose an approach and a set of design principles for such an agent, describe a partial implementation of such a system that has already learned to extract a knowledge base containing over 242, 000 beliefs with an estimated precision of 74% after running for 67 days, and discuss lessons learned from this preliminary attempt to build a never-ending learning agent.

YNIMG Journal 2009 Journal Article

Machine learning classifiers and fMRI: A tutorial overview

Francisco Pereira
Tom Mitchell
Matthew Botvinick

Interpreting brain image experiments requires analysis of complex, multivariate data. In recent years, one analysis approach that has grown in popularity is the use of machine learning algorithms to train classifiers to decode stimuli, mental states, behaviours and other variables of interest from fMRI data and thereby show the data contain information about them. In this tutorial overview we review some of the key choices faced in using this approach as well as how to derive statistically significant results, illustrating each point from a case study. Furthermore, we show how, in addition to answering the question of ‘is there information about a variable of interest’ (pattern discrimination), classifiers can be used to tackle other classes of question, namely ‘where is the information’ (pattern localization) and ‘how is that information encoded’ (pattern characterization).

NeurIPS Conference 2009 Conference Paper

Zero-shot Learning with Semantic Output Codes

Mark Palatucci
Dean Pomerleau
Geoffrey Hinton
Tom Mitchell

We consider the problem of zero-shot learning, where the goal is to learn a classifier $f: X \rightarrow Y$ that must predict novel values of $Y$ that were omitted from the training set. To achieve this, we define the notion of a semantic output code classifier (SOC) which utilizes a knowledge base of semantic properties of $Y$ to extrapolate to novel classes. We provide a formalism for this type of classifier and study its theoretical properties in a PAC framework, showing conditions under which the classifier can accurately predict novel classes. As a case study, we build a SOC classifier for a neural decoding task and show that it can often predict words that people are thinking about from functional magnetic resonance images (fMRI) of their neural activity, even without training examples for those words.

NeurIPS Conference 2004 Conference Paper

Detecting Significant Multidimensional Spatial Clusters

Daniel Neill
Andrew Moore
Francisco Pereira
Tom Mitchell

Assume a uniform, multidimensional grid of bivariate data, where each cell of the grid has a count ci and a baseline bi. Our goal is to find spatial regions (d-dimensional rectangles) where the ci are significantly higher than expected given bi. We focus on two applications: detection of clusters of disease cases from epidemiological data (emergency depart- ment visits, over-the-counter drug sales), and discovery of regions of in- creased brain activity corresponding to given cognitive tasks (from fMRI data). Each of these problems can be solved using a spatial scan statistic (Kulldorff, 1997), where we compute the maximum of a likelihood ratio statistic over all spatial regions, and find the significance of this region by randomization. However, computing the scan statistic for all spatial regions is generally computationally infeasible, so we introduce a novel fast spatial scan algorithm, generalizing the 2D scan algorithm of (Neill and Moore, 2004) to arbitrary dimensions. Our new multidimensional multiresolution algorithm allows us to find spatial clusters up to 1400x faster than the naive spatial scan, without any loss of accuracy. 1 Introduction One of the core goals of modern statistical inference and data mining is to discover patterns and relationships in data. In many applications, however, it is important not only to discover patterns, but to distinguish those patterns that are significant from those that are likely to have occurred by chance. This is particularly important in epidemiological applications, where a rise in the number of disease cases in a region may or may not be indicative of an emerging epidemic. In order to decide whether further investigation is necessary, epidemiologists must know not only the location of a possible outbreak, but also some measure of the likelihood that an outbreak is occurring in that region. Similarly, when investigating brain imaging data, we want to not only find regions of increased activity, but determine whether these increases are significant or due to chance fluctuations. More generally, we are interested in spatial data mining problems where the goal is detec- tion of overdensities: spatial regions with high counts relative to some underlying baseline. In the epidemiological datasets, the count is some quantity (e. g. number of disease cases, or units of cough medication sold) in a given area, where the baseline is the expected value of that quantity based on historical data. In the brain imaging datasets, our count is the total fMRI activation in a given set of voxels under the experimental condition, while our baseline is the total activation in that set of voxels under the null or control condition. We consider the case in which data has been aggregated to a uniform, d-dimensional grid. For the fMRI data, we have three spatial dimensions; for the epidemiological data, we have two spatial dimensions but also use several other quantities (time, patients' age and gender) as "pseudo-spatial" dimensions; this is discussed in more detail below. In the general case, let G be a d-dimensional grid of cells, with size N1 N2. .. Nd. Each cell si G (where i is a d-dimensional vector) is associated with a count ci and a baseline bi. Our goal is to search over all d-dimensional rectangular regions S G, and find regions where the total count C(S) = S ci is higher than expected, given the baseline B(S) = S bi. In addition to discovering these high-density regions, we must also perform statistical testing to determine whether these regions are significant. As is necessary in the scan statistics framework, we focus on finding the single, most significant region; the method can be iterated (removing each significant cluster once it is found) to find multiple significant regions. 1. 1 Likelihood ratio statistics Our basic model assumes that counts ci are generated by an inhomogeneous Poisson pro- cess with mean qbi, where q (the underlying ratio of count to baseline) may vary spatially. We wish to detect hyper-rectangular regions S such that q is significantly higher inside S than outside S. To do so, for a given region S, we assume that q = qin uniformly for cells si S, and q = qout uniformly for cells si G-S. We then test the null hypothesis H0(S): qin (1+)qout against the alternative hypothesis H1(S): qin > (1+)qout. If = 0, this is equivalent to the classical spatial scan statistic [1-2]: we are testing for regions where qin is greater than qout. However, in many real-world applications (including the epidemiological and fMRI datasets discussed later) we expect some fluctuation in the underlying baseline; thus, we do not want to detect all deviations from baseline, but only those where the amount of deviation is greater than some threshold. For example, a 10% increase in disease cases in some region may not be interesting to epidemiologists, even if the underlying population is large enough to conclude that this is a "real" (statistically significant) increase in q. By increasing, we can focus the scan statistic on regions with larger ratios of count to base- line. For example, we can use the scan statistic with = 0. 25 to test for regions where qin is more than 25% higher than qout. Following Kulldorff [1], our spatial scan statistic is the maximum, over all regions S, of the ratio of the likelihoods under the alternative and null hypotheses. Taking logs for convenience, we have: sup q s D i (S) = log in>(1+)qout S P(ci Po(qinbi))siG-S P(ci Po(qoutbi)) sup qin(1+)qout siS P(ci Po(qinbi))siG-S P(ci Po(qoutbi)) C(S) C C = ( tot tot sgn) C(S) log + (C -C(S) ( tot 1 + )B(S) -C(S))log Btot -B(S) -CtotlogBtot+B(S) where C(S) and B(S) are the count and baseline of the region S under consideration, Ctot and Btot are the total count and baseline of the entire grid G, and sgn = +1 if C(S) > (1 + B(S) )Ctot-C(S) and -1 otherwise. Then the scan statistic D B, max is equal to the maximum D(S) tot -B(S) over all spatial regions (d-dimensional rectangles) under consideration. We note that our statistical and computational methods are not limited to the Poisson model given here; any model of null and alternative hypotheses such that the resulting statistic D(S) satisfies the conditions given in [4] can be used for the fast spatial scan. 1. 2 Randomization testing Once we have found the highest scoring region S = arg maxS D(S) of grid G, we must still determine the statistical significance of this region. Since the exact distribution of the test statistic Dmax is only known in special cases, in general we must find the region's p-value by randomization. To do so, we run a large number R of random replications, where a replica has the same underlying baselines bi as G, but counts are randomly drawn from the null hypothesis H0(S). More precisely, we pick ci Po(qbi), where q = qin = (1+) Ctot Btot +B(S) for si S, and q = qout = Ctot for s B i tot +B(S) G - S. The number of replicas G with Dmax(G ) Dmax(G), divided by the total number of replications R, gives us the p-value for our most significant region S. If this p-value is less than (where is the false positive rate, typically chosen to be 0. 05 or 0. 1), we can conclude that the discovered region is statistically significant at level. 1. 3 The naive spatial scan The simplest method of finding Dmax is to compute D(S) for all rectangular regions of sizes k1 k2. .. kd, where 1 kj Nj. Since there are a total of d (N j=1 j - kj + 1) regions of each size, there are a total of O(d N2) j=1 regions to examine. We can compute D(S) j for any region S in constant time, by first finding the count C(S) and baseline B(S), then computing D. 1 This allows us to compute Dmax of a grid G in O(d N2) j=1 time. However, j significance testing by randomization also requires us to find Dmax for each replica G, and compare this to Dmax(G); thus the total complexity is multiplied by the number of replications R. When the size of the grid is large, as is the case for the epidemiological and fMRI datasets we are considering, this naive approach is computationally infeasible. Instead, we apply our "overlap-multiresolution partitioning" algorithm [3-4], generalizing this method from two-dimensional to d-dimensional datasets. This reduces the complexity to O(d N j=1 j log N j ) in cases where the most significant region S has a sufficiently high ra- tio of count to baseline, and (as we show in Section 3) typically results in tens to thousands of times speedup over the naive approach. We note that this fast spatial scan algorithm is exact (always finds the correct value of Dmax and the corresponding region S); the speedup results from the observation that we do not need to search a given set of regions if we can prove that none of them have score > Dmax. Thus we use a top-down, branch-and-bound approach: we maintain the current maximum score of the regions we have searched so far, calculate upper bounds on the scores of subregions contained in a given region, and prune regions whose upper bounds are less than the current value of Dmax. When searching a replica grid, we care only whether Dmax of the replica grid is greater than Dmax(G). Thus we can use Dmax of the original grid for pruning on the replicas, and can stop searching a replica if we find a region with score > Dmax(G). 2 Overlap-multiresolution partitioning As in [4], we use a multiresolution search method which relies on an overlap-kd tree data structure. The overlap-kd tree, like kd-trees [5] and quadtrees [6], is a hierarchical, space- partitioning data structure. The root node of the tree represents the entire space under consideration (i. e. the entire grid G), and each other node represents a subregion of the grid. Each non-leaf node of a d-dimensional overlap-kd tree has 2d children, an "upper" and a "lower" child in each dimension. For example, in three dimensions, a node has six children: upper and lower children in the x, y, and z dimensions. The overlap-kd tree is different from the standard kd-tree and quadtree in that adjacent regions overlap: rather than splitting the region in half along each dimension, instead each child contains more than half the area of the parent region. For example, a 64 64 64 grid will have six children: two of size 48 6464, two of size 644864, and two of size 646448. 1An old trick makes it possible to compute the count and baseline of any rectangular region in time constant in N: we first form a d-dimensional array of the cumulative counts, then compute each region's count by adding/subtracting at most 2d cumulative counts. Note that because of the exponential dependence on d, these techniques suffer from the "curse of dimensionality": neither the naive spatial scan, nor the fast spatial scan discussed below, are appropriate for very high dimensional datasets. In general, let region S have size k1 k2. .. kd. Then the two children of S in dimension j (for j = 1. .. d) have size k1. .. kj-1 fjkj kj+1. .. kd, where 1 S Figure 1: Overlap-multires partitioning of region S (for d = 2). Any subregion of S either a) is contained in some S S_1 S_2 S_3 i, S_4 S_C i = 1. .. 4, or b) contains SC. Now we can search all subregions of S by recursively searching S1. .. S2d, then searching all of the regions contained in S which contain the center SC. There may be a large number of such "outer regions, " but since we know that each such region contains the center, we can place very tight bounds on the score of these regions, often allowing us to prune most or all of them. Thus the basic outline of our search procedure (ignoring pruning, for the moment) is: overlap-search(S) { call base-case-search(S) define child regions S 1. .S 2d, center S C as above call overlap-search(S i) for i=1. .2d for all S' such that S' is contained in S and contains S_C, call base-case-search(S') } The fractions fi are selected based on the current sizes ki of the region being searched: if ki = 2m, then fi = 3, and if k. For simplicity, we assume that 4 i = 3 2m, then fi = 23 all Ni are powers of two, and thus all region sizes ki will fall into one of these two cases. Repeating this partitioning recursively, we obtain the overlap-kd tree structure. For d = 2, the first two levels of the overlap-kd tree are shown in Figure 2. Figure 2: The first two levels of the two- dimensional overlap-kd tree. Each node represents a gridded region (denoted by a thick rectangle) of the entire dataset (thin square and dots). The overlap-kd tree has several useful properties, which we present here without proof. First, for every rectangular region S G, either S is a gridded region (contained in the overlap-kd tree), or there exists a unique gridded region S such that S is an outer region of S (i. e. S is contained in S, and contains the center region of S ). This means that, if overlap-search is called exactly once for each gridded region2, and no pruning is done, then base-case-search will be called exactly once for every rectangular region S G. In practice, we will prune many regions, so base-case-search will be called at most once for every rect- angular region, and every region will be either searched or pruned. The second nice prop- erty of our overlap-kd tree is that the total number of gridded regions is O(d N j=1 j log N j ). This implies that, if we are able to prune (almost) all outer regions, we can find Dmax of the grid in O(d N N2) j=1 j log N j ) time rather than O(dj=1. In fact, we may not even need to j search all gridded regions, so in many cases the search will be even faster. 2As in [4], we use "lazy expansion" to ensure that gridded regions are not multiply searched. 2. 1 Score bounds and pruning We now consider which regions can be pruned (discarded without searching) during our multiresolution search procedure. First, given some region S, we must calculate an upper bound on the scores D(S ) for regions S S. More precisely, we are interested in two upper bounds: a bound on the score of all subregions S S, and a bound on the score of the outer subregions of S (those regions contained in S and containing its center SC). If the first bound is less than or equal to Dmax, we can prune region S completely; we do not need to search any (gridded or outer) subregion of S. If only the second bound is less than or equal to Dmax, we do not need to search the outer subregions of S, but we must recursively call overlap-search on the gridded children of S. If both bounds are greater than Dmax, we must both recursively call overlap-search and search the outer regions. Score bounds are calculated based on various pieces of information about the subregions of S, including: upper and lower bounds bmax, bmin on the baseline of subregions S; an upper bound dmax on the ratio C of S; an upper bound d of S B inc on the ratio C B -SC; and a lower bound dmin on the ratio C of S B -S. We also know the count C and baseline B of region S, and the count ccenter and baseline bcenter of region SC. Let cin and bin be the count and baseline of S. To find an upper bound on D(S ), we must calculate the values of cin and bin which maximize D subject to the given constraints: cin-ccenter bin-bcenter dinc, cin bin dmax, C-cin B-bin dmin, and bmin bin bmax. The solution to this maximization problem is derived in [4], and (since scores are based only on count and baseline rather than the size and shape of the region) it applies directly to the multidimensional case. The bounds on baselines and ratios C are first calculated using global values (as a fast, "first-pass" pruning technique). B For the remaining, unpruned regions, we calculate tighter bounds using the quartering method of [4], and use these to prune more regions. 2. 2 Related work Our work builds most directly on the results of Kulldorff [1], who presents the two- dimensional spatial scan framework and the classical ( = 0) likelihood ratio statistic. It also extends [4], in which we present the two-dimensional fast spatial scan. Our major extensions in the present work are twofold: the d-dimensional fast spatial scan, and the generalized likelihood ratio statistics D. A variety of other cluster detection techniques exist in the literature on epidemiology [1-3, 7-8], brain imaging [9-11], and machine learn- ing [12-15]. The machine learning literature focuses on heuristic or approximate cluster- finding techniques, which typically cannot deal with spatially varying baselines, and more importantly, give no information about the statistical significance of the clusters found. Our technique is exact (in that it calculates the maximum of the likelihood ratio statistic over all hyper-rectangular spatial regions), and uses a powerful statistical test to determine significance. Nevertheless, other methods in the literature have some advantages over the present approach, such as applicability to high-dimensional data and fewer assumptions on the underlying model. The fMRI literature generally tests significance on a per-voxel basis (after applying some method of spatial smoothing); clusters must then be inferred by grouping individually significant voxels, and (with the exception of [10]) no per-cluster false positive rate is guaranteed. The epidemiological literature focuses on detecting signif- icant circular, two-dimensional clusters, and thus cannot deal with multidimensional data or elongated regions. Detection of elongated regions is extremely important in both epi- demiology (because of the need to detect windborne or waterborne pathogens) and brain imaging (because of the "folded sheet" structure of the brain); the present work, as well as [4], allow detection of such clusters.

NeurIPS Conference 2003 Conference Paper

Training fMRI Classifiers to Detect Cognitive States across Multiple Human Subjects

Xuerui Wang
Rebecca Hutchinson
Tom Mitchell

We consider learning to classify cognitive states of human subjects, based on their brain activity observed via functional Magnetic Resonance Imaging (fMRI). This problem is important because such classiﬁers con- stitute “virtual sensors” of hidden cognitive states, which may be useful in cognitive science research and clinical applications. In recent work, Mitchell, et al. [6, 7, 9] have demonstrated the feasibility of training such classiﬁers for individual human subjects (e. g. , to distinguish whether the subject is reading an ambiguous or unambiguous sentence, or whether they are reading a noun or a verb). Here we extend that line of research, exploring how to train classiﬁers that can be applied across multiple hu- man subjects, including subjects who were not involved in training the classiﬁer. We describe the design of several machine learning approaches to training multiple-subject classiﬁers, and report experimental results demonstrating the success of these methods in learning cross-subject classiﬁers for two different fMRI data sets.

AIJ Journal 2001 Journal Article

Author's response to reviews of Machine Learning

Tom Mitchell

AIJ Journal 2000 Journal Article

Learning to construct knowledge bases from the World Wide Web

Mark Craven
Dan DiPasquo
Dayne Freitag
Andrew McCallum
Tom Mitchell
Kamal Nigam
Seán Slattery

The World Wide Web is a vast source of information accessible to computers, but understandable only to humans. The goal of the research described here is to automatically create a computer understandable knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Web information, and promote new uses of the Web to support knowledge-based inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs. The first is an ontology that defines the classes (e. g. , company, person, employee, product) and relations (e. g. , employed_by, produced_by) of interest when creating the knowledge base. The second is a set of training data consisting of labeled regions of hypertext that represent instances of these classes and relations. Given these inputs, the system learns to extract information from other pages and hyperlinks on the Web. This article describes our general approach, several machine learning algorithms for this task, and promising initial results with a prototype system that has created a knowledge base describing university people, courses, and research projects.

AAAI Conference 1998 Conference Paper

Learning to Extract Symbolic Knowledge from the World Wide Web

Mark Craven
Dayne Freitag
Tom Mitchell

The World Wide Webis a vast source of information accessible to computers, but understandable only to humans. Thegoal of the research described here is to automatically create a computer understandable world wide knowledge base whose content mirrors that of the World Wide Web. Such a knowledge base would enable much more effective retrieval of Webinformation, and promote newuses of the Webto support knowledgebased inference and problem solving. Our approach is to develop a trainable information extraction system that takes two inputs: an ontology defining the classes and relations of interest, and a set of training data consisting of labeled reg-ions of hypertext representing instances of these classes and relations. Giventhese inputs, the system learns to extract information from other pages and hyperlinks on the Web. This paper describes our general approach, several machinelearning algorithms for this task, and promising initial results with a prototype system.

AIIM Journal 1997 Journal Article

An evaluation of machine-learning methods for predicting pneumonia mortality

Gregory F. Cooper
Constantin F. Aliferis
Richard Ambrosino
John Aronis
Bruce G. Buchanan
Richard Caruana
Michael J. Fine
Clark Glymour

This paper describes the application of eight statistical and machine-learning methods to derive computer models for predicting mortality of hospital patients with pneumonia from their findings at initial presentation. The eight models were each constructed based on 9847 patient cases and they were each evaluated on 4352 additional cases. The primary evaluation metric was the error in predicted survival as a function of the fraction of patients predicted to survive. This metric is useful in assessing a model's potential to assist a clinician in deciding whether to treat a given patient in the hospital or at home. We examined the error rates of the models when predicting that a given fraction of patients will survive. We examined survival fractions between 0. 1 and 0. 6. Over this range, each model's predictive error rate was within 1% of the error rate of every other model. When predicting that approximately 30% of the patients will survive, all the models have an error rate of less than 1. 5%. The models are distinguished more by the number of variables and parameters that they contain than by their error rates; these differences suggest which models may be the most amenable to future implementation as paper-based guidelines.

NeurIPS Conference 1995 Conference Paper

Using the Future to "Sort Out" the Present: Rankprop and Multitask Learning for Medical Risk Evaluation

Rich Caruana
Shumeet Baluja
Tom Mitchell

A patient visits the doctor; the doctor reviews the patient's history, asks questions, makes basic measurements (blood pressure, .. . ), and prescribes tests or treatment. The prescribed course of action is based on an assessment of patient risk-patients at higher risk are given more and faster attention. It is also sequential- it is too expensive to immediately order all tests which might later be of value. This paper presents two methods that together improve the accuracy of backprop nets on a pneumonia risk assessment problem by 10-50%. Rankprop improves on backpropagation with sum of squares error in ranking patients by risk. Multitask learning takes advantage of future lab tests available in the training set, but not available in practice when predictions must be made. Both methods are broadly applicable.

AAAI Conference 1992 Conference Paper

A Personal Learning Apprentice

Lisa Dent
Tom Mitchell

Personalized knowledge-based systems have not yet become widespread, despite their potential for valuable assistance in many daily tasks. This is due, in part, to the high cost of developing and maintaining customized knowledge bases. The construction of personal assistants as learning apprentices -- interactive assistants that learn continually from their users -- is one approach which could dramatically reduce the cost of knowledge-based advisors. We present one such personal learning apprentice, called CAP, which assists in managing a meeting calendar. CAP has been used since June 1991 by a secretary in our work place to manage a faculty member’s meeting calendar, and is the first instance of a fielded learning apprentice in routine use. This paper describes the organization of CAP, its performance in initial field tests, and more general lessons learned from this effort about learning apprentice systems.

NeurIPS Conference 1992 Conference Paper

Explanation-Based Neural Network Learning for Robot Control

Tom Mitchell
Sebastian Thrun

How can artificial neural nets generalize better from fewer examples? In order to generalize successfully, neural network learning methods typically require large training data sets. We introduce a neural network learning method that generalizes rationally from many fewer data points, relying instead on prior knowledge encoded in previously learned neural networks. For example, in robot control learning tasks reported here, previously learned networks that model the effects of robot actions are used to guide subsequent learning of robot control functions. For each observed training example of the target function (e. g. the robot control policy), the learner explains the observed example in terms of its prior knowledge, then analyzes this explanation to infer additional information about the shape, or slope, of the target function. This shape knowledge is used to bias generalization when learning the target function. Results are presented applying this approach to a simulated robot task based on reinforcement learning.