Arrow Research search

Author name cluster

Michael Collins

Papers that may be associated with this exact author name in Arrow. This page groups case-insensitive exact-name matches; it is not a full author-disambiguation profile.

11 papers
2 author rows

Possible papers

11

ICML Conference 2024 Conference Paper

Controlled Decoding from Language Models

  • Sidharth Mudgal
  • Jong Lee
  • Harish Ganapathy
  • YaGuang Li
  • Tao Wang
  • Yanping Huang
  • Zhifeng Chen
  • Heng-Tze Cheng

KL-regularized reinforcement learning (RL) is a popular alignment framework to control the language model responses towards high reward outcomes. We pose a tokenwise RL objective and propose a modular solver for it, called controlled decoding (CD). CD exerts control through a separate prefix scorer module, which is trained to learn a value function for the reward. The prefix scorer is used at inference time to control the generation from a frozen base model, provably sampling from a solution to the RL objective. We empirically demonstrate that CD is effective as a control mechanism on popular benchmarks. We also show that prefix scorers for multiple rewards may be combined at inference time, effectively solving a multi-objective RL problem with no additional training. We show that the benefits of applying CD transfer to an unseen base model with no further tuning as well. Finally, we show that CD can be applied in a blockwise decoding fashion at inference-time, essentially bridging the gap between the popular best-of-$K$ strategy and tokenwise control through reinforcement learning. This makes CD a promising approach for alignment of language models.
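
As a rough illustration of the tokenwise control described above, here is a minimal sketch (hypothetical function and variable names, not the paper's code): the prefix scorer's value estimates, scaled by a control strength, are added to the frozen base model's next-token log-probabilities before renormalizing and sampling.

```python
import numpy as np

def cd_next_token_probs(base_logprobs, value_estimates, strength=1.0):
    """Tokenwise controlled-decoding sketch (hypothetical interface).

    base_logprobs:   log p(token | prefix) from the frozen base model, shape (V,)
    value_estimates: prefix-scorer value for prefix + token, one per candidate token, shape (V,)
    strength:        control strength trading reward against staying close to the base model
    """
    # Adjusted score: base log-probability plus scaled value estimate.
    scores = base_logprobs + strength * value_estimates
    # Renormalize to a proper distribution over the vocabulary.
    scores -= scores.max()
    probs = np.exp(scores)
    return probs / probs.sum()

# Toy usage with a 5-token vocabulary.
rng = np.random.default_rng(0)
base = np.log(rng.dirichlet(np.ones(5)))
values = rng.normal(size=5)
print(cd_next_token_probs(base, values, strength=2.0))
```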

JMLR Journal 2019 Journal Article

Kernel Approximation Methods for Speech Recognition

  • Avner May
  • Alireza Bagheri Garakani
  • Zhiyun Lu
  • Dong Guo
  • Kuan Liu
  • Aurélien Bellet
  • Linxi Fan
  • Michael Collins

We study the performance of kernel methods on the acoustic modeling task for automatic speech recognition, and compare their performance to deep neural networks (DNNs). To scale the kernel methods to large data sets, we use the random Fourier feature method of Rahimi and Recht (2007). We propose two novel techniques for improving the performance of kernel acoustic models. First, we propose a simple but effective feature selection method which reduces the number of random features required to attain a fixed level of performance. Second, we present a number of metrics which correlate strongly with speech recognition performance when computed on the heldout set; we attain improved performance by using these metrics to decide when to stop training. Additionally, we show that the linear bottleneck method of Sainath et al. (2013a) improves the performance of our kernel models significantly, in addition to speeding up training and making the models more compact. Leveraging these three methods, the kernel methods attain token error rates between $0.5\%$ better and $0.1\%$ worse than fully-connected DNNs across four speech recognition data sets, including the TIMIT and Broadcast News benchmark tasks.
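
The random Fourier feature method of Rahimi and Recht referenced above is standard; a minimal NumPy sketch (illustrative, not the paper's acoustic-modeling pipeline) that approximates an RBF kernel looks like this:

```python
import numpy as np

def random_fourier_features(X, num_features, gamma, seed=0):
    """Approximate the RBF kernel k(x, z) = exp(-gamma * ||x - z||^2)
    with random Fourier features (Rahimi and Recht, 2007).

    X: data matrix of shape (n, d); returns features of shape (n, num_features).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Frequencies are drawn from the spectral density of the RBF kernel.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# Inner products of the random features approximate kernel values.
X = np.random.default_rng(1).normal(size=(4, 3))
Z = random_fourier_features(X, num_features=2000, gamma=0.5)
approx = Z @ Z.T
exact = np.exp(-0.5 * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
print(np.abs(approx - exact).max())  # small approximation error
```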

AAAI Conference 2015 Conference Paper

A Family of Latent Variable Convex Relaxations for IBM Model 2

  • Andrei Simion
  • Michael Collins
  • Cliff Stein

Recently, a new convex formulation of IBM Model 2 was introduced. In this paper we develop the theory further and introduce a class of convex relaxations for latent variable models which include IBM Model 2. When applied to IBM Model 2, our relaxation class subsumes the previous relaxation as a special case. As proof of concept, we study a new relaxation of IBM Model 2 which is simpler than the previous algorithm: the new relaxation relies on the use of nothing more than a multinomial EM algorithm, does not require the tuning of a learning rate, and has some favorable comparisons to IBM Model 2 in terms of F-Measure. The ideas presented could be applied to a wide range of NLP and machine learning problems.
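
For background (standard IBM Model 2 notation, not quoted from this paper), the objective being relaxed is the corpus log-likelihood below; the product of translation parameters t and distortion parameters q inside the log-sum is what makes it non-convex:

```latex
% IBM Model 2 log-likelihood for a sentence pair (e, f) with |e| = l and |f| = m,
% summed over the corpus; i = 0 denotes the NULL word.
\sum_{(e, f)} \sum_{j=1}^{m} \log \sum_{i=0}^{l} q(i \mid j, l, m)\, t(f_j \mid e_i)
```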

JMLR Journal 2014 Journal Article

Spectral Learning of Latent-Variable PCFGs: Algorithms and Sample Complexity

  • Shay B. Cohen
  • Karl Stratos
  • Michael Collins
  • Dean P. Foster
  • Lyle Ungar

We introduce a spectral learning algorithm for latent-variable PCFGs (Matsuzaki et al., 2005; Petrov et al., 2006). Under a separability (singular value) condition, we prove that the method provides statistically consistent parameter estimates. Our result rests on three theorems: the first gives a tensor form of the inside-outside algorithm for PCFGs; the second shows that the required tensors can be estimated directly from training examples where hidden-variable values are missing; the third gives a PAC-style convergence bound for the estimation method.

NeurIPS Conference 2012 Conference Paper

Tensor Decomposition for Fast Parsing with Latent-Variable PCFGs

  • Michael Collins
  • Shay Cohen

We describe an approach to speed-up inference with latent variable PCFGs, which have been shown to be highly effective for natural language parsing. Our approach is based on a tensor formulation recently introduced for spectral estimation of latent-variable PCFGs coupled with a tensor decomposition algorithm well-known in the multilinear algebra literature. We also describe an error bound for this approximation, which bounds the difference between the probabilities calculated by the algorithm and the true probabilities that the approximated model gives. Empirical evaluation on real-world natural language parsing data demonstrates a significant speed-up at minimal cost for parsing performance.
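
A toy NumPy sketch of the underlying idea (illustrative, not the paper's algorithm): once a rule tensor is given in rank-r decomposed form, an inside-style contraction against two child vectors can be computed factor by factor, avoiding the dense contraction over all m^3 entries.

```python
import numpy as np

m, r = 16, 4
rng = np.random.default_rng(0)

# A rank-r tensor standing in for a latent-variable PCFG rule tensor T[a, b, c].
U, V, W = rng.normal(size=(3, m, r))
T = np.einsum('ar,br,cr->abc', U, V, W)

# Inside-style contraction against child vectors y and z, done densely: O(m^3).
y, z = rng.normal(size=(2, m))
dense = np.einsum('abc,b,c->a', T, y, z)

# The same contraction via the decomposition: O(m * r) after contracting each factor.
fast = U @ ((V.T @ y) * (W.T @ z))

print(np.allclose(dense, fast))  # True
```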

NeurIPS Conference 2009 Conference Paper

Learning Label Embeddings for Nearest-Neighbor Multi-class Classification with an Application to Speech Recognition

  • Natasha Singh-Miller
  • Michael Collins

We consider the problem of using nearest neighbor methods to provide a conditional probability estimate, P(y|a), when the number of labels y is large and the labels share some underlying structure. We propose a method for learning error-correcting output codes (ECOCs) to model the similarity between labels within a nearest neighbor framework. The learned ECOCs and nearest neighbor information are used to provide conditional probability estimates. We apply these estimates to the problem of acoustic modeling for speech recognition. We demonstrate an absolute reduction in word error rate (WER) of 0.9% (a 2.5% relative reduction in WER) on a lecture recognition task over a state-of-the-art baseline GMM model.
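
A schematic sketch of this kind of estimator (hypothetical interface, not the paper's exact construction): the retrieved nearest neighbors vote for labels, and votes are smoothed by similarity between learned label codes standing in for the ECOCs.

```python
import numpy as np

def knn_label_probs(neighbor_labels, label_codes, num_labels, tau=1.0):
    """Estimate P(y | x) from nearest-neighbor labels, smoothing by distance
    between learned label codes so that similar labels share probability mass."""
    scores = np.zeros(num_labels)
    for y_n in neighbor_labels:
        # Every label receives credit in proportion to how close its code is
        # to the code of the neighbor's label.
        dist = np.linalg.norm(label_codes - label_codes[y_n], axis=1)
        scores += np.exp(-dist / tau)
    return scores / scores.sum()

# Toy usage: 10 labels with random 6-dimensional codes, 3 retrieved neighbors.
rng = np.random.default_rng(0)
codes = rng.normal(size=(10, 6))
print(knn_label_probs([3, 3, 7], codes, num_labels=10))
```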

JMLR Journal 2008 Journal Article

Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks

  • Michael Collins
  • Amir Globerson
  • Terry Koo
  • Xavier Carreras
  • Peter L. Bartlett

Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods.
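
The core EG step is a multiplicative update followed by renormalization, which keeps the dual variables on the probability simplex; a minimal sketch on a toy objective (not the structured dual from the paper):

```python
import numpy as np

def eg_update(alpha, grad, eta):
    """One exponentiated-gradient step on the probability simplex.

    alpha: current point on the simplex
    grad:  gradient of the objective at alpha
    eta:   learning rate
    """
    new = alpha * np.exp(-eta * grad)
    return new / new.sum()

# Toy usage: minimize f(alpha) = 0.5 * ||alpha - target||^2 over the simplex.
target = np.array([0.7, 0.2, 0.1])
alpha = np.full(3, 1.0 / 3.0)
for _ in range(200):
    alpha = eg_update(alpha, alpha - target, eta=0.5)
print(alpha)  # approaches the target distribution
```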

NeurIPS Conference 2004 Conference Paper

Conditional Random Fields for Object Recognition

  • Ariadna Quattoni
  • Michael Collins
  • Trevor Darrell

We present a discriminative part-based approach for the recognition of object classes from unsegmented cluttered scenes. Objects are modeled as flexible constellations of parts conditioned on local observations found by an interest operator. For each object class the probability of a given assignment of parts to local features is modeled by a Conditional Random Field (CRF). We propose an extension of the CRF framework that incorporates hidden variables and combines class conditional CRFs into a unified framework for part-based object recognition. The parameters of the CRF are estimated in a maximum likelihood framework and recognition proceeds by finding the most likely class under our model. The main advantage of the proposed CRF framework is that it allows us to relax the assumption of conditional independence of the observed data (i.e., local features) often used in generative approaches, an assumption that might be too restrictive for a considerable number of object classes.
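
In the standard hidden-CRF form (notation assumed here, not quoted from the paper), the hidden part assignments h are summed out when scoring a class y:

```latex
% Conditional probability of class y given image features x, with hidden part
% assignments h marginalized out; \Phi collects the features and \theta the parameters.
P(y \mid x; \theta)
  = \sum_{h} P(y, h \mid x; \theta)
  = \frac{\sum_{h} \exp\bigl(\theta \cdot \Phi(y, h, x)\bigr)}
         {\sum_{y'} \sum_{h} \exp\bigl(\theta \cdot \Phi(y', h, x)\bigr)}
```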

NeurIPS Conference 2004 Conference Paper

Exponentiated Gradient Algorithms for Large-margin Structured Classification

  • Peter Bartlett
  • Michael Collins
  • Ben Taskar
  • David McAllester

We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient, even in cases where the number of labels y is exponential in size, provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs.

NeurIPS Conference 2001 Conference Paper

A Generalization of Principal Components Analysis to the Exponential Family

  • Michael Collins
  • S. Dasgupta
  • Robert Schapire

Principal component analysis (PCA) is a commonly applied technique for dimensionality reduction. PCA implicitly minimizes a squared loss function, which may be inappropriate for data that is not real-valued, such as binary-valued data. This paper draws on ideas from the exponential family, generalized linear models, and Bregman distances, to give a generalization of PCA to loss functions that we argue are better suited to other data types. We describe algorithms for minimizing the loss functions, and give examples on simulated data.
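
A minimal sketch of the idea for binary data (illustrative; the paper's algorithms differ): fit low-rank natural parameters by gradient descent on the Bernoulli log loss, the Bregman-divergence objective induced by the logistic link.

```python
import numpy as np

def bernoulli_loss(X, Theta):
    """Average Bernoulli log loss of binary data X under natural parameters Theta."""
    P = 1.0 / (1.0 + np.exp(-Theta))
    return -np.mean(X * np.log(P + 1e-12) + (1.0 - X) * np.log(1.0 - P + 1e-12))

def logistic_pca(X, rank, steps=2000, lr=1.0, seed=0):
    """Exponential-family PCA sketch for binary data: learn Theta = A @ B.T
    by gradient descent on the Bernoulli log loss."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    A = 0.1 * rng.normal(size=(n, rank))
    B = 0.1 * rng.normal(size=(d, rank))
    for _ in range(steps):
        P = 1.0 / (1.0 + np.exp(-(A @ B.T)))
        G = P - X  # derivative of the log loss with respect to Theta, entrywise
        A, B = A - lr * (G @ B) / d, B - lr * (G.T @ A) / n
    return A, B

# Toy usage on binary data with planted low-rank structure.
rng = np.random.default_rng(1)
X = (rng.normal(size=(50, 2)) @ rng.normal(size=(2, 20)) > 0).astype(float)
A, B = logistic_pca(X, rank=2)
print(bernoulli_loss(X, np.zeros_like(X)), bernoulli_loss(X, A @ B.T))  # loss drops
```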

NeurIPS Conference 2001 Conference Paper

Convolution Kernels for Natural Language

  • Michael Collins
  • Nigel Duffy

We describe the application of kernel methods to Natural Language Processing (NLP) problems. In many NLP tasks the objects being modeled are strings, trees, graphs or other discrete structures which require some mechanism to convert them into feature vectors. We describe kernels for various natural language structures, allowing rich, high dimensional representations of these structures. We show how a kernel over trees can be applied to parsing using the voted perceptron algorithm, and we give experimental results on the ATIS corpus of parse trees.
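
A small sketch in the spirit of this tree kernel (simplified; the classes and toy trees are illustrative): count the parse fragments shared by two trees with a recursive dynamic program, downweighting larger fragments by a decay factor.

```python
class Node:
    """A parse-tree node: a label and a list of children (leaves have none)."""
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)
    def production(self):
        return (self.label, tuple(c.label for c in self.children))

def internal_nodes(t):
    """All nodes that expand via a production, i.e., nodes with children."""
    if not t.children:
        return []
    out = [t]
    for c in t.children:
        out.extend(internal_nodes(c))
    return out

def tree_kernel(t1, t2, lam=0.5):
    """Count subtrees shared by t1 and t2, downweighting each production by lam."""
    memo = {}

    def common(a, b):
        key = (id(a), id(b))
        if key not in memo:
            if a.production() != b.production():
                memo[key] = 0.0
            elif all(not c.children for c in a.children):  # pre-terminal node
                memo[key] = lam
            else:
                val = lam
                for ca, cb in zip(a.children, b.children):
                    val *= 1.0 + common(ca, cb)
                memo[key] = val
        return memo[key]

    return sum(common(a, b) for a in internal_nodes(t1) for b in internal_nodes(t2))

# The two fragments below share (D the), (NP D N), and (NP (D the) N): kernel = 3 at lam = 1.
np1 = Node('NP', [Node('D', [Node('the')]), Node('N', [Node('dog')])])
np2 = Node('NP', [Node('D', [Node('the')]), Node('N', [Node('cat')])])
print(tree_kernel(np1, np2, lam=1.0))
```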