Arrow Research search

Author name cluster

Charles Elkan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers (20)

EWRL Workshop 2011 Conference Paper

Reinforcement Learning with a Bilinear Q Function

  • Charles Elkan

Many reinforcement learning methods are based on a function Q(s, a) whose value is the discounted total reward expected after performing the action a in the state s. This paper explores the implications of representing the Q function as Q(s, a) = sᵀWa, where W is a matrix that is learned. In this representation, both s and a are real-valued vectors that may have high dimension. We show that action selection can be done using standard linear programming, and that W can be learned using standard linear regression in the algorithm known as fitted Q iteration. Experimentally, the resulting method learns to solve the mountain car task in a sample-efficient way. The same method is also applicable to an inventory management task where the state space and the action space are continuous and high-dimensional.
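
A minimal sketch of the two computations the abstract describes: greedy action selection as a linear program, and learning W by repeated linear regression inside fitted Q iteration. The box constraints on actions, the dataset format, and all dimensions are assumptions for illustration, not details from the paper.

```python
# Sketch of fitted Q iteration with a bilinear Q function,
# Q(s, a) = s^T W a, as summarized in the abstract above.
import numpy as np
from scipy.optimize import linprog

def greedy_action(W, s, a_lo, a_hi):
    """Pick argmax_a s^T W a over box-constrained actions via an LP.

    The objective is linear in a with coefficients c = W^T s, so a
    standard LP solver suffices (linprog minimizes, hence the sign flip).
    The box bounds on actions are an assumption of this sketch.
    """
    c = W.T @ s
    res = linprog(-c, bounds=list(zip(a_lo, a_hi)))
    return res.x

def fitted_q_iteration(transitions, d_s, d_a, a_lo, a_hi,
                       gamma=0.99, n_iters=50):
    """Learn W from (s, a, r, s') tuples by repeated linear regression.

    Because Q is linear in the entries of W, each regression fits
    vec(W) against the outer-product features vec(s a^T).
    """
    W = np.zeros((d_s, d_a))
    X = np.stack([np.outer(s, a).ravel() for s, a, _, _ in transitions])
    for _ in range(n_iters):
        # Regression target: immediate reward plus discounted value of
        # the greedy action in the successor state under the current W.
        y = np.array([
            r + gamma * (s2 @ W @ greedy_action(W, s2, a_lo, a_hi))
            for _, _, r, s2 in transitions
        ])
        w_vec, *_ = np.linalg.lstsq(X, y, rcond=None)
        W = w_vec.reshape(d_s, d_a)
    return W
```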

JMLR Journal 2010 Journal Article

Quadratic Programming Feature Selection

  • Irene Rodriguez-Lujan
  • Ramon Huerta
  • Charles Elkan
  • Carlos Santa Cruz

Identifying a subset of features that preserves classification accuracy is a problem of growing importance, because of the increasing size and dimensionality of real-world data sets. We propose a new feature selection method, named Quadratic Programming Feature Selection (QPFS), that reduces the task to a quadratic optimization problem. In order to limit the computational complexity of solving the optimization problem, QPFS uses the Nyström method for approximate matrix diagonalization. QPFS is thus capable of dealing with very large data sets, for which the use of other methods is computationally expensive. In experiments with small and medium data sets, the QPFS method leads to classification accuracy similar to that of other successful techniques. For large data sets, QPFS is superior in terms of computational efficiency.
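
The reduction the abstract describes can be sketched directly as a small quadratic program balancing feature relevance against feature redundancy. Using absolute correlations for the redundancy matrix Q and the relevance vector f, the fixed balance weight alpha, and the SLSQP solver are all assumptions of this sketch; the Nyström approximation the paper uses for very large Q is omitted.

```python
# Sketch of quadratic-programming feature selection: weight features
# by solving  min_x (1 - alpha) * 0.5 * x^T Q x - alpha * f^T x
# subject to x >= 0 and sum(x) = 1, then rank features by weight.
import numpy as np
from scipy.optimize import minimize

def qpfs_weights(X, y, alpha=0.5):
    """Return one weight per feature; larger means more useful."""
    n, d = X.shape
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    Q = np.abs(np.corrcoef(Xc, rowvar=False))  # feature-feature redundancy
    f = np.abs(Xc.T @ yc) / n                  # feature-target relevance

    def objective(x):
        return (1 - alpha) * 0.5 * x @ Q @ x - alpha * f @ x

    x0 = np.full(d, 1.0 / d)
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(0, None)] * d,
                   constraints=[{"type": "eq",
                                 "fun": lambda x: x.sum() - 1}])
    return res.x  # sort descending to pick a feature subset
```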

JMLR Journal 2004 Journal Article

Sources of Success for Boosted Wrapper Induction

  • David Kauchak
  • Joseph Smarr
  • Charles Elkan

In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI) by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We investigate systematically how each algorithmic component of BWI, in particular boosting, contributes to its success. We show that the benefit of boosting arises from the ability to reweight examples to learn specific rules (resulting in high precision) combined with the ability to continue learning rules after all positive examples have been covered (resulting in high recall). As a quantitative indicator of the regularity of an extraction task, we propose a new measure that we call the SWI ratio. We show that this measure is a good predictor of IE success and a useful tool for analyzing IE tasks. Based on these results, we analyze the strengths and limitations of BWI. Specifically, we explain limitations in the information made available, and in the representations used. We also investigate the consequences of the fact that confidence values returned during extraction are not true probabilities. Next, we investigate the benefits of including grammatical and semantic information for natural text documents, as well as parse tree and attribute-value information for XML and HTML documents. We show experimentally that incorporating even limited grammatical information can increase the regularity of natural text extraction tasks, resulting in improved performance. We conclude with proposals for enriching the representational power of BWI and other IE methods to exploit these and other types of regularities.
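
A compact illustration of the boosting mechanism the abstract credits: reweighting misclassified examples so that later rounds learn rules for the positives earlier rounds missed. Plain AdaBoost with threshold stumps stands in here for BWI's boundary-detecting wrapper rules; that substitution, and everything else below, is an assumption for illustration only.

```python
# Minimal AdaBoost sketch showing the reweighting effect described
# above: each round upweights the examples the previous rule got
# wrong, so new rules cover previously uncovered positives.
import numpy as np

def adaboost(X, y, n_rounds=20):
    """y in {-1, +1}. Returns a list of ((feature, thr, sign), alpha)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        best = None
        # Weak learner: pick the weighted-error-minimizing stump
        # "predict +1 iff feature j >= t" (or its sign-flipped twin).
        for j in range(d):
            for t in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] >= t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        err, j, t, sign = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = sign * np.where(X[:, j] >= t, 1, -1)
        # The key step: upweight misclassified examples so the next
        # round concentrates on what this rule missed.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append(((j, t, sign), alpha))
    return ensemble

def predict(ensemble, X):
    """Weighted vote of the learned rules."""
    scores = sum(alpha * sign * np.where(X[:, j] >= t, 1, -1)
                 for (j, t, sign), alpha in ensemble)
    return np.sign(scores)
```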

NeurIPS Conference 2003 Conference Paper

Learning the k in k-means

  • Greg Hamerly
  • Charles Elkan

When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize the model's complexity strongly enough.

1 Introduction and related work

Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what the best value for k is. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated.

Center-based clustering algorithms (in particular k-means and Gaussian expectation-maximization) usually assume that each cluster adheres to a unimodal distribution, such as a Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center.

In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity.

[Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster.]

Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]:

BIC(C | X) = L(X | C) − (p/2) log n

where L(X | C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data.

With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree ("dendrogram") of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition.

2 The Gaussian-means (G-means) algorithm

The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k.

G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear to be Gaussian, then we want to use multiple centers to model the data properly.

Algorithm 1 G-means(X, α)
1: Let C be the initial set of centers (usually C ← {x̄}).
2: C ← kmeans(C, X).
3: Let {x_i | class(x_i) = j} be the set of datapoints assigned to center c_j.
4: Use a statistical test to detect whether each {x_i | class(x_i) = j} follows a Gaussian distribution (at confidence level α).
5: If the data look Gaussian, keep c_j. Otherwise replace c_j with two centers.
6: Repeat from step 2 until no more centers are added.

The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means.

The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below is valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.

2.1 Testing clusters for Gaussian fit

To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The two alternative hypotheses are

  • H0: The data around the center are sampled from a Gaussian.
  • H1: The data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster.

The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values x_i that have been converted to mean 0 and variance 1, let x_(i) be the ith ordered value. Let z_i = F(x_(i)), where F is the N(0, 1) cumulative distribution function. Then the statistic is

A²(Z) = −(1/n) Σ_{i=1}^{n} (2i − 1) [log z_i + log(1 − z_{n+1−i})] − n
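
A hedged sketch of one G-means round as described above: refine the centers with k-means, test each cluster for normality along a one-dimensional projection with the Anderson-Darling test, and split the centers whose clusters fail. Projecting onto the cluster's top principal component (rather than onto the direction between two trial child centers) and the child-offset formula are simplifying choices here, and the sklearn k-means driver and 5% significance level are assumptions.

```python
# One G-means round: k-means refinement, then a per-cluster
# Anderson-Darling normality test with splitting on rejection.
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def split_if_not_gaussian(points, center):
    """Return [center] if the cluster passes the A-D test, else two
    candidate child centers. Assumes the cluster has several points."""
    centered = points - points.mean(axis=0)
    _, svals, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[0]                                  # principal direction
    lam = svals[0] ** 2 / len(points)          # its eigenvalue
    proj = centered @ v
    z = (proj - proj.mean()) / proj.std()      # mean 0, variance 1
    result = anderson(z, dist="norm")          # scipy corrects for n
    if result.statistic <= result.critical_values[2]:  # 5% level
        return [center]                        # looks Gaussian: keep it
    m = v * np.sqrt(2 * lam / np.pi)           # heuristic child offset
    return [center + m, center - m]

def g_means_round(X, centers):
    """One splitting round preceded by a k-means refinement pass."""
    centers = np.asarray(centers, dtype=float)
    km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
    new_centers = []
    for j, c in enumerate(km.cluster_centers_):
        new_centers.extend(split_if_not_gaussian(X[km.labels_ == j], c))
    return new_centers  # call repeatedly until the list stops growing
```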

IJCAI Conference 1993 Conference Paper

Estimating the Accuracy of Learned Concepts

  • Timothy L. Bailey
  • Charles Elkan

This paper investigates alternative estimators of the accuracy of concepts learned from examples. In particular, the cross-validation and 632 bootstrap estimators are studied, using synthetic training data and the FOIL learning algorithm. Our experimental results contradict previous papers in statistics, which advocate the 632 bootstrap method as superior to cross-validation. Nevertheless, our results also suggest that conclusions based on cross-validation in previous machine learning papers are unreliable. Specifically, our observations are that (i) the true error of the concept learned by FOIL from independently drawn sets of examples of the same concept varies widely, (ii) the estimate of true error provided by cross-validation has high variability but is approximately unbiased, and (iii) the 632 bootstrap estimator has lower variability than cross-validation, but is systematically biased.
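
Both estimators the abstract compares are easy to state in code: k-fold cross-validation, and the 632 bootstrap, which blends resubstitution error with out-of-bootstrap error as err_632 = 0.368 · err_resub + 0.632 · err_oob. The sklearn classifier interface and the round counts below are assumptions of this sketch; the paper itself applies the estimators to the FOIL rule learner.

```python
# Sketch of the two accuracy estimators studied in the paper.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def cv_error(model, X, y, k=10):
    """k-fold cross-validation estimate of the error rate."""
    return 1.0 - cross_val_score(clone(model), X, y, cv=k).mean()

def bootstrap_632_error(model, X, y, n_boot=100, seed=None):
    """The .632 bootstrap estimate of the error rate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    resub = clone(model).fit(X, y)
    err_resub = np.mean(resub.predict(X) != y)  # resubstitution error
    oob_errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # points left out
        if len(oob) == 0:
            continue
        m = clone(model).fit(X[idx], y[idx])
        oob_errs.append(np.mean(m.predict(X[oob]) != y[oob]))
    return 0.368 * err_resub + 0.632 * np.mean(oob_errs)
```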

AAAI Conference 1993 Conference Paper

The Paradoxical Success of Fuzzy Logic

  • Charles Elkan

This paper investigates the question of which aspects of fuzzy logic are essential to its practical usefulness. We show that as a formal system, a standard version of fuzzy logic collapses mathematically to two-valued logic, while empirically, fuzzy logic is not adequate for reasoning about uncertain evidence in expert systems. Nevertheless, applications of fuzzy logic in heuristic control have been highly successful. We argue that the inconsistencies of fuzzy logic have not been harmful in practice because current fuzzy controllers are far simpler than other knowledge-based systems. In the future, the technical limitations of fuzzy logic can be expected to become important in practice, and work on fuzzy controllers will also encounter several problems of scale already known for other knowledge-based systems.
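
The "collapses mathematically" claim rests on asking fuzzy truth values to respect a classical equivalence, ¬(A ∧ ¬B) ≡ B ∨ (¬A ∧ ¬B), under the standard operators min, max, and 1 − x. The quick numeric check below simply exhibits truth-value pairs where the two sides differ, which is why enforcing the equivalence as an axiom degenerates the system; treating this as the operative equivalence follows the paper's theorem as best recalled here.

```python
# Numeric check: under min/max/complement fuzzy semantics, the
# classical equivalence  not(A and not B) == B or (not A and not B)
# fails for generic truth values.
import itertools

f_and = min
f_or = max
f_not = lambda x: 1.0 - x

def lhs(a, b):  # not(A and not B)
    return f_not(f_and(a, f_not(b)))

def rhs(a, b):  # B or (not A and not B)
    return f_or(b, f_and(f_not(a), f_not(b)))

violations = [
    (a, b) for a, b in
    itertools.product([i / 10 for i in range(11)], repeat=2)
    if abs(lhs(a, b) - rhs(a, b)) > 1e-9
]
print(len(violations), "violating pairs; e.g. a=0.2, b=0.3 gives",
      lhs(0.2, 0.3), "vs", rhs(0.2, 0.3))  # 0.8 vs 0.7, up to float noise
```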

MFCS Conference 1989 Conference Paper

Logical Characterizations of Nonmonotonic TMSs

  • Charles Elkan

Nonmonotonic truth maintenance systems (TMSs) have been widely used without a clear appreciation of their capabilities and limitations. This paper characterizes the logical inferences performed by a nonmonotonic TMS in terms of two formalisms of independent interest, namely logic programming with the stable set semantics and autoepistemic logic. The paper also shows that implementing a nonmonotonic TMS is an NP-complete problem.
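
The stable set (stable model) semantics invoked above has a compact operational definition: a set of atoms M is stable iff it equals the least model of the program's Gelfond-Lifschitz reduct with respect to M. The brute-force enumerator below is a sketch of that definition on propositional programs; its exponential search over candidate sets is consistent with the NP-completeness result the abstract states, though the toy code itself is not the paper's construction.

```python
# Brute-force stable model enumeration. A rule is a triple
# (head, positive_body, negative_body) of an atom and two atom sets.
from itertools import chain, combinations

def minimal_model(definite_rules):
    """Forward-chain a negation-free program to its least model."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in definite_rules:
            if pos <= model and head not in model:
                model.add(head)
                changed = True
    return model

def stable_models(rules):
    atoms = ({h for h, _, _ in rules}
             | set().union(*(p | n for _, p, n in rules)))
    for m in chain.from_iterable(combinations(sorted(atoms), r)
                                 for r in range(len(atoms) + 1)):
        m = set(m)
        # Gelfond-Lifschitz reduct: drop rules whose negative body
        # intersects M, then delete the remaining negative literals.
        reduct = [(h, p) for h, p, n in rules if not (n & m)]
        if minimal_model(reduct) == m:
            yield m

# Example:  p :- not q.   q :- not p.   has two stable models, {p} and {q}.
rules = [("p", set(), {"q"}), ("q", set(), {"p"})]
print(list(stable_models(rules)))  # [{'p'}, {'q'}]
```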