Arrow Research search

Author name cluster

Charles Elkan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers (20)

EWRL Workshop 2011 Conference Paper

Reinforcement Learning with a Bilinear Q Function

  • Charles Elkan

Many reinforcement learning methods are based on a function Q(s, a) whose value is the discounted total reward expected after performing the action a in the state s. This paper explores the implications of representing the Q function as Q(s, a) = sᵀWa, where W is a matrix that is learned. In this representation, both s and a are real-valued vectors that may have high dimension. We show that action selection can be done using standard linear programming, and that W can be learned using standard linear regression in the algorithm known as fitted Q iteration. Experimentally, the resulting method learns to solve the mountain car task in a sample-efficient way. The same method is also applicable to an inventory management task where the state space and the action space are continuous and high-dimensional.
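
A minimal sketch of the two computations the abstract describes: greedy action selection as a linear program, and learning W by repeated linear regression inside fitted Q iteration. The box constraints on actions, the dataset format, and all dimensions are assumptions for illustration, not details from the paper.

```python
# Sketch of fitted Q iteration with a bilinear Q function,
# Q(s, a) = s^T W a, as summarized in the abstract above.
import numpy as np
from scipy.optimize import linprog

def greedy_action(W, s, a_lo, a_hi):
    """Pick argmax_a s^T W a over box-constrained actions via an LP.

    The objective is linear in a with coefficients c = W^T s, so a
    standard LP solver suffices (linprog minimizes, hence the sign flip).
    The box bounds on actions are an assumption of this sketch.
    """
    c = W.T @ s
    res = linprog(-c, bounds=list(zip(a_lo, a_hi)))
    return res.x

def fitted_q_iteration(transitions, d_s, d_a, a_lo, a_hi,
                       gamma=0.99, n_iters=50):
    """Learn W from (s, a, r, s') tuples by repeated linear regression.

    Because Q is linear in the entries of W, each regression fits
    vec(W) against the outer-product features vec(s a^T).
    """
    W = np.zeros((d_s, d_a))
    X = np.stack([np.outer(s, a).ravel() for s, a, _, _ in transitions])
    for _ in range(n_iters):
        # Regression target: immediate reward plus discounted value of
        # the greedy action in the successor state under the current W.
        y = np.array([
            r + gamma * (s2 @ W @ greedy_action(W, s2, a_lo, a_hi))
            for _, _, r, s2 in transitions
        ])
        w_vec, *_ = np.linalg.lstsq(X, y, rcond=None)
        W = w_vec.reshape(d_s, d_a)
    return W
```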

JMLR Journal 2010 Journal Article

Quadratic Programming Feature Selection

  • Irene Rodriguez-Lujan
  • Ramon Huerta
  • Charles Elkan
  • Carlos Santa Cruz

Identifying a subset of features that preserves classification accuracy is a problem of growing importance, because of the increasing size and dimensionality of real-world data sets. We propose a new feature selection method, named Quadratic Programming Feature Selection (QPFS), that reduces the task to a quadratic optimization problem. In order to limit the computational complexity of solving the optimization problem, QPFS uses the Nyström method for approximate matrix diagonalization. QPFS is thus capable of dealing with very large data sets, for which the use of other methods is computationally expensive. In experiments with small and medium data sets, the QPFS method leads to classification accuracy similar to that of other successful techniques. For large data sets, QPFS is superior in terms of computational efficiency.
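
The reduction the abstract describes can be sketched directly as a small quadratic program balancing feature relevance against feature redundancy. Using absolute correlations for the redundancy matrix Q and the relevance vector f, the fixed balance weight alpha, and the SLSQP solver are all assumptions of this sketch; the Nyström approximation the paper uses for very large Q is omitted.

```python
# Sketch of quadratic-programming feature selection: weight features
# by solving  min_x (1 - alpha) * 0.5 * x^T Q x - alpha * f^T x
# subject to x >= 0 and sum(x) = 1, then rank features by weight.
import numpy as np
from scipy.optimize import minimize

def qpfs_weights(X, y, alpha=0.5):
    """Return one weight per feature; larger means more useful."""
    n, d = X.shape
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = (y - y.mean()) / y.std()
    Q = np.abs(np.corrcoef(Xc, rowvar=False))  # feature-feature redundancy
    f = np.abs(Xc.T @ yc) / n                  # feature-target relevance

    def objective(x):
        return (1 - alpha) * 0.5 * x @ Q @ x - alpha * f @ x

    x0 = np.full(d, 1.0 / d)
    res = minimize(objective, x0, method="SLSQP",
                   bounds=[(0, None)] * d,
                   constraints=[{"type": "eq",
                                 "fun": lambda x: x.sum() - 1}])
    return res.x  # sort descending to pick a feature subset
```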

JMLR Journal 2004 Journal Article

Sources of Success for Boosted Wrapper Induction

  • David Kauchak
  • Joseph Smarr
  • Charles Elkan

In this paper, we examine an important recent rule-based information extraction (IE) technique named Boosted Wrapper Induction (BWI) by conducting experiments on a wider variety of tasks than previously studied, including tasks using several collections of natural text documents. We investigate systematically how each algorithmic component of BWI, in particular boosting, contributes to its success. We show that the benefit of boosting arises from the ability to reweight examples to learn specific rules (resulting in high precision) combined with the ability to continue learning rules after all positive examples have been covered (resulting in high recall). As a quantitative indicator of the regularity of an extraction task, we propose a new measure that we call the SWI ratio. We show that this measure is a good predictor of IE success and a useful tool for analyzing IE tasks. Based on these results, we analyze the strengths and limitations of BWI. Specifically, we explain limitations in the information made available, and in the representations used. We also investigate the consequences of the fact that confidence values returned during extraction are not true probabilities. Next, we investigate the benefits of including grammatical and semantic information for natural text documents, as well as parse tree and attribute-value information for XML and HTML documents. We show experimentally that incorporating even limited grammatical information can increase the regularity of natural text extraction tasks, resulting in improved performance. We conclude with proposals for enriching the representational power of BWI and other IE methods to exploit these and other types of regularities.
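
A compact illustration of the boosting mechanism the abstract credits: reweighting misclassified examples so that later rounds learn rules for the positives earlier rounds missed. Plain AdaBoost with threshold stumps stands in here for BWI's boundary-detecting wrapper rules; that substitution, and everything else below, is an assumption for illustration only.

```python
# Minimal AdaBoost sketch showing the reweighting effect described
# above: each round upweights the examples the previous rule got
# wrong, so new rules cover previously uncovered positives.
import numpy as np

def adaboost(X, y, n_rounds=20):
    """y in {-1, +1}. Returns a list of ((feature, thr, sign), alpha)."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)
    ensemble = []
    for _ in range(n_rounds):
        best = None
        # Weak learner: pick the weighted-error-minimizing stump
        # "predict +1 iff feature j >= t" (or its sign-flipped twin).
        for j in range(d):
            for t in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = sign * np.where(X[:, j] >= t, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, t, sign)
        err, j, t, sign = best
        err = np.clip(err, 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        pred = sign * np.where(X[:, j] >= t, 1, -1)
        # The key step: upweight misclassified examples so the next
        # round concentrates on what this rule missed.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        ensemble.append(((j, t, sign), alpha))
    return ensemble

def predict(ensemble, X):
    """Weighted vote of the learned rules."""
    scores = sum(alpha * sign * np.where(X[:, j] >= t, 1, -1)
                 for (j, t, sign), alpha in ensemble)
    return np.sign(scores)
```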

NeurIPS Conference 2003 Conference Paper

Learning the k in k-means

  • Greg Hamerly
  • Charles Elkan

When clustering a dataset, the right number k of clusters to use is often not obvious, and choosing k automatically is a hard algorithmic problem. In this paper we present an improved algorithm for learning k while clustering. The G-means algorithm is based on a statistical test for the hypothesis that a subset of data follows a Gaussian distribution. G-means runs k-means with increasing k in a hierarchical fashion until the test accepts the hypothesis that the data assigned to each k-means center are Gaussian. Two key advantages are that the hypothesis test does not limit the covariance of the data and does not compute a full covariance matrix. Additionally, G-means only requires one intuitive parameter, the standard statistical significance level α. We present results from experiments showing that the algorithm works well, and better than a recent method based on the BIC penalty for model complexity. In these experiments, we show that the BIC is ineffective as a scoring function, since it does not penalize the model's complexity strongly enough.

1 Introduction and related work

Clustering algorithms are useful tools for data mining, compression, probability density estimation, and many other important tasks. However, most clustering algorithms require the user to specify the number of clusters (called k), and it is not always clear what the best value for k is. Figure 1 shows examples where k has been improperly chosen. Choosing k is often an ad hoc decision based on prior knowledge, assumptions, and practical experience. Choosing k is made more difficult when the data has many dimensions, even when clusters are well-separated.

Center-based clustering algorithms (in particular k-means and Gaussian expectation-maximization) usually assume that each cluster adheres to a unimodal distribution, such as a Gaussian. With these methods, only one center should be used to model each subset of data that follows a unimodal distribution. If multiple centers are used to describe data drawn from one mode, the centers are a needlessly complex description of the data, and in fact the multiple centers capture the truth about the subset less well than one center.

In this paper we present a simple algorithm called G-means that discovers an appropriate k using a statistical test for deciding whether to split a k-means center into two centers. We describe examples and present experimental results that show that the new algorithm is successful. This technique is useful and applicable for many clustering algorithms other than k-means, but here we consider only the k-means algorithm for simplicity.

[Figure 1: Two clusterings where k was improperly chosen. Dark crosses are k-means centers. On the left, there are too few centers; five should be used. On the right, too many centers are used; one center is sufficient for representing the data. In general, one center should be used to represent one Gaussian cluster.]

Several algorithms have been proposed previously to determine k automatically. Like our method, most previous methods are wrappers around k-means or some other clustering algorithm for fixed k. Wrapper methods use splitting and/or merging rules for centers to increase or decrease k as the algorithm proceeds. Pelleg and Moore [14] proposed a regularization framework for learning k, which they call X-means. The algorithm searches over many values of k and scores each clustering model using the so-called Bayesian Information Criterion [10]:

BIC(C | X) = L(X | C) − (p/2) log n

where L(X | C) is the log-likelihood of the dataset X according to model C, p = k(d + 1) is the number of parameters in the model C with dimensionality d and k cluster centers, and n is the number of points in the dataset. X-means chooses the model with the best BIC score on the data. Aside from the BIC, other scoring functions are also available. Bischof et al. [1] use a minimum description length (MDL) framework, where the description length is a measure of how well the data are fit by the model. Their algorithm starts with a large value for k and removes centers (reduces k) whenever that choice reduces the description length. Between steps of reducing k, they use the k-means algorithm to optimize the model fit to the data.

With hierarchical clustering algorithms, other methods may be employed to determine the best number of clusters. One is to build a merging tree ("dendrogram") of the data based on a cluster distance metric, and search for areas of the tree that are stable with respect to inter- and intra-cluster distances [9, Section 5.1]. This method of estimating k is best applied with domain-specific knowledge and human intuition.

2 The Gaussian-means (G-means) algorithm

The G-means algorithm starts with a small number of k-means centers, and grows the number of centers. Each iteration of the algorithm splits into two those centers whose data appear not to come from a Gaussian distribution. Between each round of splitting, we run k-means on the entire dataset and all the centers to refine the current solution. We can initialize with just k = 1, or we can choose some larger value of k if we have some prior knowledge about the range of k.

G-means repeatedly makes decisions based on a statistical test for the data assigned to each center. If the data currently assigned to a k-means center appear to be Gaussian, then we want to represent that data with only one center. However, if the same data do not appear to be Gaussian, then we want to use multiple centers to model the data properly.

Algorithm 1 G-means(X, α)
1: Let C be the initial set of centers (usually C ← {x̄}).
2: C ← kmeans(C, X).
3: Let {x_i | class(x_i) = j} be the set of datapoints assigned to center c_j.
4: Use a statistical test to detect whether each {x_i | class(x_i) = j} follows a Gaussian distribution (at confidence level α).
5: If the data look Gaussian, keep c_j. Otherwise replace c_j with two centers.
6: Repeat from step 2 until no more centers are added.

The algorithm will run k-means multiple times (up to k times when finding k centers), so the time complexity is at most O(k) times that of k-means.

The k-means algorithm implicitly assumes that the datapoints in each cluster are spherically distributed around the center. Less restrictively, the Gaussian expectation-maximization algorithm assumes that the datapoints in each cluster have a multidimensional Gaussian distribution with a covariance matrix that may or may not be fixed, or shared. The Gaussian distribution test that we present below is valid for either covariance matrix assumption. The test also accounts for the number of datapoints n tested by incorporating n in the calculation of the critical value of the test (see Equation 2). This prevents the G-means algorithm from making bad decisions about clusters with few datapoints.

2.1 Testing clusters for Gaussian fit

To specify the G-means algorithm fully we need a test to detect whether the data assigned to a center are sampled from a Gaussian. The two alternative hypotheses are

  • H0: The data around the center are sampled from a Gaussian.
  • H1: The data around the center are not sampled from a Gaussian.

If we accept the null hypothesis H0, then we believe that the one center is sufficient to model its data, and we should not split the cluster into two sub-clusters. If we reject H0 and accept H1, then we want to split the cluster.

The test we use is based on the Anderson-Darling statistic. This one-dimensional test has been shown empirically to be the most powerful normality test that is based on the empirical cumulative distribution function (ECDF). Given a list of values x_i that have been converted to mean 0 and variance 1, let x_(i) be the ith ordered value. Let z_i = F(x_(i)), where F is the N(0, 1) cumulative distribution function. Then the statistic is

A²(Z) = −(1/n) Σ_{i=1}^{n} (2i − 1) [log z_i + log(1 − z_{n+1−i})] − n
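
A hedged sketch of one G-means round as described above: refine the centers with k-means, test each cluster for normality along a one-dimensional projection with the Anderson-Darling test, and split the centers whose clusters fail. Projecting onto the cluster's top principal component (rather than onto the direction between two trial child centers) and the child-offset formula are simplifying choices here, and the sklearn k-means driver and 5% significance level are assumptions.

```python
# One G-means round: k-means refinement, then a per-cluster
# Anderson-Darling normality test with splitting on rejection.
import numpy as np
from scipy.stats import anderson
from sklearn.cluster import KMeans

def split_if_not_gaussian(points, center):
    """Return [center] if the cluster passes the A-D test, else two
    candidate child centers. Assumes the cluster has several points."""
    centered = points - points.mean(axis=0)
    _, svals, vt = np.linalg.svd(centered, full_matrices=False)
    v = vt[0]                                  # principal direction
    lam = svals[0] ** 2 / len(points)          # its eigenvalue
    proj = centered @ v
    z = (proj - proj.mean()) / proj.std()      # mean 0, variance 1
    result = anderson(z, dist="norm")          # scipy corrects for n
    if result.statistic <= result.critical_values[2]:  # 5% level
        return [center]                        # looks Gaussian: keep it
    m = v * np.sqrt(2 * lam / np.pi)           # heuristic child offset
    return [center + m, center - m]

def g_means_round(X, centers):
    """One splitting round preceded by a k-means refinement pass."""
    centers = np.asarray(centers, dtype=float)
    km = KMeans(n_clusters=len(centers), init=centers, n_init=1).fit(X)
    new_centers = []
    for j, c in enumerate(km.cluster_centers_):
        new_centers.extend(split_if_not_gaussian(X[km.labels_ == j], c))
    return new_centers  # call repeatedly until the list stops growing
```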

IJCAI Conference 1993 Conference Paper

Estimating the Accuracy of Learned Concepts

  • Timothy L. Bailey
  • Charles Elkan

This paper investigates alternative estimators of the accuracy of concepts learned from examples. In particular, the cross-validation and 632 bootstrap estimators are studied, using synthetic training data and the FOIL learning algorithm. Our experimental results contradict previous papers in statistics, which advocate the 632 bootstrap method as superior to cross-validation. Nevertheless, our results also suggest that conclusions based on cross-validation in previous machine learning papers are unreliable. Specifically, our observations are that (i) the true error of the concept learned by FOIL from independently drawn sets of examples of the same concept varies widely, (ii) the estimate of true error provided by cross-validation has high variability but is approximately unbiased, and (iii) the 632 bootstrap estimator has lower variability than cross-validation, but is systematically biased.
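
Both estimators the abstract compares are easy to state in code: k-fold cross-validation, and the 632 bootstrap, which blends resubstitution error with out-of-bootstrap error as err_632 = 0.368 · err_resub + 0.632 · err_oob. The sklearn classifier interface and the round counts below are assumptions of this sketch; the paper itself applies the estimators to the FOIL rule learner.

```python
# Sketch of the two accuracy estimators studied in the paper.
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import cross_val_score

def cv_error(model, X, y, k=10):
    """k-fold cross-validation estimate of the error rate."""
    return 1.0 - cross_val_score(clone(model), X, y, cv=k).mean()

def bootstrap_632_error(model, X, y, n_boot=100, seed=None):
    """The .632 bootstrap estimate of the error rate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    resub = clone(model).fit(X, y)
    err_resub = np.mean(resub.predict(X) != y)  # resubstitution error
    oob_errs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)        # sample with replacement
        oob = np.setdiff1d(np.arange(n), idx)   # points left out
        if len(oob) == 0:
            continue
        m = clone(model).fit(X[idx], y[idx])
        oob_errs.append(np.mean(m.predict(X[oob]) != y[oob]))
    return 0.368 * err_resub + 0.632 * np.mean(oob_errs)
```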

AAAI Conference 1993 Conference Paper

The Paradoxical Success of Fuzzy Logic

  • Charles Elkan

This paper investigates the question of which aspects of fuzzy logic are essential to its practical usefulness. We show that as a formal system, a standard version of fuzzy logic collapses mathematically to two-valued logic, while empirically, fuzzy logic is not adequate for reasoning about uncertain evidence in expert systems. Nevertheless, applications of fuzzy logic in heuristic control have been highly successful. We argue that the inconsistencies of fuzzy logic have not been harmful in practice because current fuzzy controllers are far simpler than other knowledge-based systems. In the future, the technical limitations of fuzzy logic can be expected to become important in practice, and work on fuzzy controllers will also encounter several problems of scale already known for other knowledge-based systems.
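
The "collapses mathematically" claim rests on asking fuzzy truth values to respect a classical equivalence, ¬(A ∧ ¬B) ≡ B ∨ (¬A ∧ ¬B), under the standard operators min, max, and 1 − x. The quick numeric check below simply exhibits truth-value pairs where the two sides differ, which is why enforcing the equivalence as an axiom degenerates the system; treating this as the operative equivalence follows the paper's theorem as best recalled here.

```python
# Numeric check: under min/max/complement fuzzy semantics, the
# classical equivalence  not(A and not B) == B or (not A and not B)
# fails for generic truth values.
import itertools

f_and = min
f_or = max
f_not = lambda x: 1.0 - x

def lhs(a, b):  # not(A and not B)
    return f_not(f_and(a, f_not(b)))

def rhs(a, b):  # B or (not A and not B)
    return f_or(b, f_and(f_not(a), f_not(b)))

violations = [
    (a, b) for a, b in
    itertools.product([i / 10 for i in range(11)], repeat=2)
    if abs(lhs(a, b) - rhs(a, b)) > 1e-9
]
print(len(violations), "violating pairs; e.g. a=0.2, b=0.3 gives",
      lhs(0.2, 0.3), "vs", rhs(0.2, 0.3))  # 0.8 vs 0.7, up to float noise
```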

MFCS Conference 1989 Conference Paper

Logical Characterizations of Nonmonotonic TMSs

  • Charles Elkan

Nonmonotonic truth maintenance systems (TMSs) have been widely used without a clear appreciation of their capabilities and limitations. This paper characterizes the logical inferences performed by a nonmonotonic TMS in terms of two formalisms of independent interest, namely logic programming with the stable set semantics and autoepistemic logic. The paper also shows that implementing a nonmonotonic TMS is an NP-complete problem.
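
The stable set (stable model) semantics invoked above has a compact operational definition: a set of atoms M is stable iff it equals the least model of the program's Gelfond-Lifschitz reduct with respect to M. The brute-force enumerator below is a sketch of that definition on propositional programs; its exponential search over candidate sets is consistent with the NP-completeness result the abstract states, though the toy code itself is not the paper's construction.

```python
# Brute-force stable model enumeration. A rule is a triple
# (head, positive_body, negative_body) of an atom and two atom sets.
from itertools import chain, combinations

def minimal_model(definite_rules):
    """Forward-chain a negation-free program to its least model."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, pos in definite_rules:
            if pos <= model and head not in model:
                model.add(head)
                changed = True
    return model

def stable_models(rules):
    atoms = ({h for h, _, _ in rules}
             | set().union(*(p | n for _, p, n in rules)))
    for m in chain.from_iterable(combinations(sorted(atoms), r)
                                 for r in range(len(atoms) + 1)):
        m = set(m)
        # Gelfond-Lifschitz reduct: drop rules whose negative body
        # intersects M, then delete the remaining negative literals.
        reduct = [(h, p) for h, p, n in rules if not (n & m)]
        if minimal_model(reduct) == m:
            yield m

# Example:  p :- not q.   q :- not p.   has two stable models, {p} and {q}.
rules = [("p", set(), {"q"}), ("q", set(), {"p"})]
print(list(stable_models(rules)))  # [{'p'}, {'q'}]
```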