Author name cluster

David McAllester

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

18 papers

2 author rows

ICML Conference 2025 Conference Paper

PENCIL: Long Thoughts with Short Memory

Chenxiao Yang
Nathan Srebro
David McAllester
Zhiyuan Li 0005

While state-of-the-art LLMs have demonstrated great promise of using long Chains-of-Thought (CoT) to boost reasoning, scaling it up to more challenging problems is fundamentally limited by suboptimal memory usage — intermediate computations accumulate indefinitely in context even no longer needed for future thoughts. We introduce PENCIL, which incorporates a novel reduction mechanism into the autoregressive generation process that recursively clean up intermediate thoughts based on patterns learned from training. By alternately generating and erasing, PENCIL can think deeper to solve harder problems using shorter context and less computes. Empirically, for example, we demonstrate PENCIL with a small 25M-parameter transformer and 2048 context length solves Einstein’s puzzle — a task that challenges much larger models like GPT-4. Theoretically, we prove PENCIL can perform universal efficient computation by simulating any Turing machines with optimal time and space complexity, and thus can solve arbitrary computable tasks that are otherwise intractable for vanilla CoT.

NeurIPS Conference 2017 Conference Paper

Exploring Generalization in Deep Learning

Behnam Neyshabur
Srinadh Bhojanapalli
David McAllester
Nati Srebro

With a goal of understanding what drives generalization in deep networks, we consider several recently suggested explanations, including norm-based control, sharpness and robustness. We study how these measures can ensure generalization, highlighting the importance of scale normalization, and making a connection between sharpness and PAC-Bayes theory. We then investigate how well the measures explain different observed phenomena.

NeurIPS Conference 2014 Conference Paper

Discriminative Metric Learning by Neighborhood Gerrymandering

Shubhendu Trivedi
David McAllester
Greg Shakhnarovich

We formulate the problem of metric learning for k nearest neighbor classification as a large margin structured prediction problem, with a latent variable representing the choice of neighbors and the task loss directly corresponding to classification error. We describe an efficient algorithm for exact loss augmented inference, and a fast gradient descent algorithm for learning in this model. The objective drives the metric to establish neighborhood boundaries that benefit the true class labels for the training points. Our approach, reminiscent of gerrymandering (redrawing of political boundaries to provide advantage to certain parties), is more direct in its handling of optimizing classification accuracy than those previously proposed. In experiments on a variety of data sets our method is shown to achieve excellent results compared to current state of the art in metric learning.

NeurIPS Conference 2011 Conference Paper

Generalization Bounds and Consistency for Latent Structural Probit and Ramp Loss

Joseph Keshet
David McAllester

We consider latent structural versions of probit loss and ramp loss. We show that these surrogate loss functions are consistent in the strong sense that for any feature map (finite or infinite dimensional) they yield predictors approaching the infimum task loss achievable by any linear predictor over the given features. We also give finite sample generalization bounds (convergence rates) for these loss functions. These bounds suggest that probit loss converges more rapidly. However, ramp loss is more easily optimized and may ultimately be more practical.

NeurIPS Conference 2011 Conference Paper

Object Detection with Grammar Models

Ross Girshick
Pedro Felzenszwalb
David McAllester

Compositional models provide an elegant formalism for representing the visual appearance of highly variable objects. While such models are appealing from a theoretical point of view, it has been difficult to demonstrate that they lead to performance advantages on challenging datasets. Here we develop a grammar model for person detection and show that it outperforms previous high-performance systems on the PASCAL benchmark. Our model represents people using a hierarchy of deformable parts, variable structure and an explicit model of occlusion for partially visible objects. To train the model, we introduce a new discriminative framework for learning structured prediction models from weakly-labeled data.

NeurIPS Conference 2010 Conference Paper

Direct Loss Minimization for Structured Prediction

Tamir Hazan
Joseph Keshet
David McAllester

In discriminative machine learning one is interested in training a system to optimize a certain desired measure of performance, or loss. In binary classification one typically tries to minimizes the error rate. But in structured prediction each task often has its own measure of performance such as the BLEU score in machine translation or the intersection-over-union score in PASCAL segmentation. The most common approaches to structured prediction, structural SVMs and CRFs, do not minimize the task loss: the former minimizes a surrogate loss with no guarantees for task loss and the latter minimizes log loss independent of task loss. The main contribution of this paper is a theorem stating that a certain perceptron-like learning rule, involving features vectors derived from loss-adjusted inference, directly corresponds to the gradient of task loss. We give empirical results on phonetic alignment of a standard test set from the TIMIT corpus, which surpasses all previously reported results on this problem.

JMLR Journal 2004 Journal Article

Computable Shell Decomposition Bounds

John Langford
David McAllester

Haussler, Kearns, Seung and Tishby introduced the notion of a shell decomposition of the union bound as a means of understanding certain empirical phenomena in learning curves such as phase transitions. Here we use a variant of their ideas to derive an upper bound on the generalization error of a hypothesis computable from its training error and the histogram of training errors for the hypotheses in the class. In most cases this new bound is significantly tighter than traditional bounds computed from the training error and the cardinality of the class. Our results can also be viewed as providing a rigorous foundation for a model selection algorithm proposed by Scheffer and Joachims. [abs] [ pdf ][ ps.gz ][ ps ]

NeurIPS Conference 2004 Conference Paper

Exponentiated Gradient Algorithms for Large-margin Structured Classification

Peter Bartlett
Michael Collins
Ben Taskar
David McAllester

We consider the problem of structured classiﬁcation, where the task is to predict a label y from an input x, and y has meaningful internal struc- ture. Our framework includes supervised training of Markov random ﬁelds and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem deﬁned in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efﬁcient—even in cases where the number of labels y is exponential in size—provided that certain expecta- tions under Gibbs distributions can be calculated efﬁciently. The method for structured labels relies on a more general result, speciﬁcally the ap- plication of exponentiated gradient updates [7, 8] to quadratic programs.

JMLR Journal 2003 Journal Article

Concentration Inequalities for the Missing Mass and for Histogram Rule Error

David McAllester
Luis Ortiz

This paper gives distribution-free concentration inequalities for the missing mass and the error rate of histogram rules. Negative association methods can be used to reduce these concentration problems to concentration questions about independent sums. Although the sums are independent, they are highly heterogeneous. Such highly heterogeneous independent sums cannot be analyzed using standard concentration inequalities such as Hoeffding's inequality, the Angluin-Valiant bound, Bernstein's inequality, Bennett's inequality, or McDiarmid's theorem. The concentration inequality for histogram rule error is motivated by the desire to construct a new class of bounds on the generalization error of decision trees. [abs] [ pdf ][ ps.gz ][ ps ]

NeurIPS Conference 2002 Conference Paper

Concentration Inequalities for the Missing Mass and for Histogram Rule Error

Luis Ortiz
David McAllester

This paper gives distribution-free concentration inequalities for the miss- ing mass and the error rate of histogram rules. Negative association meth- ods can be used to reduce these concentration problems to concentration questions about independent sums. Although the sums are independent, they are highly heterogeneous. Such highly heterogeneous independent sums cannot be analyzed using standard concentration inequalities such as Hoeffding’s inequality, the Angluin-Valiant bound, Bernstein’s in- equality, Bennett’s inequality, or McDiarmid’s theorem.

NeurIPS Conference 2001 Conference Paper

PAC Generalization Bounds for Co-training

Sanjoy Dasgupta
Michael Littman
David McAllester

The rule-based bootstrapping introduced by Yarowsky, and its co- training variant by Blum and Mitchell, have met with considerable em- pirical success. Earlier work on the theory of co-training has been only loosely related to empirically useful co-training algorithms. Here we give a new PAC-style bound on generalization error which justiﬁes both the use of conﬁdences — partial rules and partial labeling of the unlabeled data — and the use of an agreement-based objective function as sug- gested by Collins and Singer. Our bounds apply to the multiclass case, i. e. , where instances are to be assigned one of

NeurIPS Conference 1999 Conference Paper

Boosting with Multi-Way Branching in Decision Trees

Yishay Mansour
David McAllester

It is known that decision tree learning can be viewed as a form of boosting. However, existing boosting theorems for decision tree learning allow only binary-branching trees and the generalization to multi-branching trees is not immediate. Practical decision tree al(cid: 173) gorithms, such as CART and C4. 5, implement a trade-off between the number of branches and the improvement in tree quality as measured by an index function. Here we give a boosting justifica(cid: 173) tion for a particular quantitative trade-off curve. Our main theorem states, in essence, that if we require an improvement proportional to the log of the number of branches then top-down greedy con(cid: 173) struction of decision trees remains an effective boosting algorithm.

NeurIPS Conference 1999 Conference Paper

Policy Gradient Methods for Reinforcement Learning with Function Approximation

Richard Sutton
David McAllester
Satinder Singh
Yishay Mansour

Function approximation is essential to reinforcement learning, but the standard approach of approximating a value function and deter(cid: 173) mining a policy from it has so far proven theoretically intractable. In this paper we explore an alternative approach in which the policy is explicitly represented by its own function approximator, indepen(cid: 173) dent of the value function, and is updated according to the gradient of expected reward with respect to the policy parameters. Williams's REINFORCE method and actor-critic methods are examples of this approach. Our main new result is to show that the gradient can be written in a form suitable for estimation from experience aided by an approximate action-value or advantage function. Using this result, we prove for the first time that a version of policy iteration with arbitrary differentiable function approximation is convergent to a locally optimal policy. Large applications of reinforcement learning (RL) require the use of generalizing func(cid: 173) tion approximators such neural networks, decision-trees, or instance-based methods. The dominant approach for the last decade has been the value-function approach, in which all function approximation effort goes into estimating a value function, with the action-selection policy represented implicitly as the "greedy" policy with respect to the estimated values (e. g. , as the policy that selects in each state the action with highest estimated value). The value-function approach has worked well in many appli(cid: 173) cations, but has several limitations. First, it is oriented toward finding deterministic policies, whereas the optimal policy is often stochastic, selecting different actions with specific probabilities (e. g. , see Singh, Jaakkola, and Jordan, 1994). Second, an arbi(cid: 173) trarily small change in the estimated value of an action can cause it to be, or not be, selected. Such discontinuous changes have been identified as a key obstacle to estab(cid: 173) lishing convergence assurances for algorithms following the value-function approach (Bertsekas and Tsitsiklis, 1996). For example, Q-Iearning, Sarsa, and dynamic pro(cid: 173) gramming methods have all been shown unable to converge to any policy for simple MDPs and simple function approximators (Gordon, 1995, 1996; Baird, 1995; Tsit(cid: 173) siklis and van Roy, 1996; Bertsekas and Tsitsiklis, 1996). This can occur even if the best approximation is found at each step before changing the policy, and whether the notion of "best" is in the mean-squared-error sense or the slightly different senses of residual-gradient, temporal-difference, and dynamic-programming methods. In this paper we explore an alternative approach to function approximation in RL.

AAAI Conference 1997 Conference Paper

Evidence for Invariants in Local Search

David McAllester

It is well known that the performance of a stochastic local search procedure depends upon the setting of its noise parameter, and that the optimal setting varies with the problem distribution. It is therefore desirable to develop general priniciples for tuning the procedures. We present two statistical measures of the local search process that allow one to quickly find the optimal noise settings. These properties are independent of the fine details of the local search strategies, and appear to be relatively independent of the structure of the problem domains. We applied these principles to the problem of evaluating new search heuristics, and discovered two promising new strategies.

IJCAI Conference 1997 Conference Paper

Ten Challenges in Propositional Reasoning and Search

Bart Selman
Henry Kautz
David McAllester

AAAI Conference 1991 Conference Paper

Observations on Cognitive Judgments

David McAllester

It is obvious to anyone familiar with the rules of the game of chess that a king on an empty board can reach every square. It is true, but not obvious, that a knight can reach every square. Why is the first fact obvious but the second fact not? This paper presents an analytic theory of a class of obviousness judgments of this type. Whether or not the specifics of this analysis are correct, it seems that the study of obviousness judgments can be used to construct integrated theories of linguistics, knowledge representation, and inference.

AAAI Conference 1991 Conference Paper

Systematic Nonlinear Planning

David McAllester

This paper presents a simple, sound, complete, and systematic algorithm for domain independent STRIPS planning. Simplicity is achieved by starting with a ground procedure and then applying a general, and independently verifiable, lifting transformation. Previous planners have been designed directly aa lifted procedures. Our ground procedure is a ground version of Tate’ s NONLIN procedure. In Tate’ s procedure one is not required to determine whether a prerequisite of a step in an unfinished plan is guaranteed to hold in all linearizations. This allows Tate’ s procedure to avoid the use of Chapman’ s modal truth criterion. Systematicity is the property that the same plan, or partial plan, is never examined more than once. Systematicity is achieved through a simple modification of Tate’ s procedure.

AAAI Conference 1990 Conference Paper

Truth Maintenance

David McAllester

General purpose truth maintenance systems have received considerable attention in the past few years. This paper discusses the functionality of truth maintenance systems and compares various existing algorithms. Applications and directions for future research are also discussed.