Author name cluster

Ben Taskar

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

32 papers

2 author rows

NeurIPS Conference 2014 Conference Paper

Expectation-Maximization for Learning Determinantal Point Processes

Jennifer Gillenwater
Alex Kulesza
Emily Fox
Ben Taskar

A determinantal point process (DPP) is a probabilistic model of set diversity compactly parameterized by a positive semi-definite kernel matrix. To fit a DPP to a given task, we would like to learn the entries of its kernel matrix by maximizing the log-likelihood of the available data. However, log-likelihood is non-convex in the entries of the kernel matrix, and this learning problem is conjectured to be NP-hard. Thus, previous work has instead focused on more restricted convex learning settings: learning only a single weight for each row of the kernel matrix, or learning weights for a linear combination of DPPs with fixed kernel matrices. In this work we propose a novel algorithm for learning the full kernel matrix. By changing the kernel parameterization from matrix entries to eigenvalues and eigenvectors, and then lower-bounding the likelihood in the manner of expectation-maximization algorithms, we obtain an effective optimization procedure. We test our method on a real-world product recommendation task, and achieve relative gains of up to 16.5% in test log-likelihood compared to the naive approach of maximizing likelihood by projected gradient ascent on the entries of the kernel matrix.
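The log-likelihood mentioned in this abstract can be illustrated with a minimal sketch. It assumes the common L-ensemble form of a DPP, where a subset Y has probability det(L_Y) / det(L + I); the paper itself may use a different parameterization. The function names and the example kernel are hypothetical, chosen only for illustration, and this is not the EM procedure the authors propose.

```python
import math
from itertools import combinations


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0  # the determinant of an empty matrix is 1
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def dpp_probability(kernel, subset):
    """P(Y) = det(L_Y) / det(L + I), assuming an L-ensemble DPP."""
    n = len(kernel)
    sub = [[kernel[i][j] for j in subset] for i in subset]
    l_plus_i = [
        [kernel[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
        for i in range(n)
    ]
    return det(sub) / det(l_plus_i)


def log_likelihood(kernel, observed_subsets):
    """Sum of log-probabilities of the observed subsets."""
    return sum(math.log(dpp_probability(kernel, s)) for s in observed_subsets)


# Hypothetical 3-item kernel: items 0 and 1 are similar, item 2 is unrelated.
example_kernel = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Probabilities over all subsets of the ground set sum to one.
total_probability = sum(
    dpp_probability(example_kernel, list(s))
    for k in range(4)
    for s in combinations(range(3), k)
)
```

With this kernel the dissimilar pair {0, 2} receives a higher probability than the similar pair {0, 1}, which is the diversity preference the abstract describes.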

ICML Conference 2014 Conference Paper

Learning the Parameters of Determinantal Point Process Kernels

Raja Hafiz Affandi
Emily B. Fox
Ryan P. Adams
Ben Taskar

Determinantal point processes (DPPs) are well-suited for modeling repulsion and have proven useful in applications where diversity is desired. While DPPs have many appealing properties, learning the parameters of a DPP is difficult, as the likelihood is non-convex and is infeasible to compute in many scenarios. Here we propose Bayesian methods for learning the DPP kernel parameters. These methods are applicable in large-scale discrete and continuous DPP settings, even when the likelihood can only be bounded. We demonstrate the utility of our DPP learning methods in studying the progression of diabetic neuropathy based on the spatial distribution of nerve fibers, and in studying human perception of diversity in images.

NeurIPS Conference 2013 Conference Paper

Approximate Inference in Continuous Determinantal Processes

Raja Hafiz Affandi
Emily Fox
Ben Taskar

Determinantal point processes (DPPs) are random point processes well-suited for modeling repulsion. In machine learning, the focus of DPP-based models has been on diverse subset selection from a discrete and finite base set. This discrete setting admits an efficient algorithm for sampling based on the eigendecomposition of the defining kernel matrix. Recently, there has been growing interest in using DPPs defined on continuous spaces. While the discrete-DPP sampler extends formally to the continuous case, computationally, the steps required cannot be directly extended except in a few restricted cases. In this paper, we present efficient approximate DPP sampling schemes based on Nystrom and random Fourier feature approximations that apply to a wide range of kernel functions. We demonstrate the utility of continuous DPPs in repulsive mixture modeling applications and synthesizing human poses spanning activity spaces.

ICML Conference 2013 Conference Paper

Collective Stability in Structured Prediction: Generalization from One Example

Ben London
Bert Huang
Ben Taskar
Lise Getoor

Structured predictors enable joint inference over multiple interdependent output variables. These models are often trained on a small number of examples with large internal structure. Existing distribution-free generalization bounds do not guarantee generalization in this setting, though this contradicts a large body of empirical evidence from computer vision, natural language processing, social networks and other fields. In this paper, we identify a set of natural conditions – weak dependence, hypothesis complexity and a new measure, collective stability – that are sufficient for generalization from even a single example, without imposing an explicit generative model of the data. We then demonstrate that the complexity and stability conditions are satisfied by a broad class of models, including marginal inference in templated graphical models. We thus obtain uniform convergence rates that can decrease significantly faster than previous bounds, particularly when each structured example is sufficiently large and the number of training examples is constant, even one.

NeurIPS Conference 2013 Conference Paper

Learning Adaptive Value of Information for Structured Prediction

David Weiss
Ben Taskar

Discriminative methods for learning structured models have enabled wide-spread use of very rich feature representations. However, the computational cost of feature extraction is prohibitive for large-scale or time-sensitive applications, often dominating the cost of inference in the models. Significant efforts have been devoted to sparsity-based model selection to decrease this cost. Such feature selection methods control computation statically and miss the opportunity to fine-tune feature extraction to each input at run-time. We address the key challenge of learning to control fine-grained feature extraction adaptively, exploiting non-homogeneity of the data. We propose an architecture that uses a rich feedback loop between extraction and prediction. The run-time control policy is learned using efficient value-function approximation, which adaptively determines the value of information of features at the level of individual variables for each input. We demonstrate significant speedups over state-of-the-art methods on two challenging datasets. For articulated pose estimation in video, we achieve a more accurate state-of-the-art model that is simultaneously 4$\times$ faster while using only a small fraction of possible features, with similar results on an OCR task.

ICML Conference 2013 Conference Paper

The Pairwise Piecewise-Linear Embedding for Efficient Non-Linear Classification

Ofir Pele
Ben Taskar
Amir Globerson
Michael Werman

Linear classifiers are much faster to learn and test than non-linear ones. On the other hand, non-linear kernels offer improved performance, albeit at the increased cost of training kernel classifiers. To use non-linear mappings with efficient linear learning algorithms, explicit embeddings that approximate popular kernels have recently been proposed. However, the embedding process itself is often costly and the results are usually less accurate than kernel methods. In this work we propose a non-linear feature map that is both very efficient and highly expressive. The method is based on discretization and interpolation of individual feature values and feature pairs. The discretization allows us to model different regions of the feature space separately, while the interpolation preserves the original continuous values. Using this embedding is strictly more general than a linear model and as efficient as the second-order polynomial explicit feature map. An extensive empirical evaluation shows that our method consistently and significantly outperforms other methods, including a wide range of kernels. This is in contrast to other proposed embeddings that were faster than kernel methods, but with lower accuracy.

NeurIPS Conference 2012 Conference Paper

Near-Optimal MAP Inference for Determinantal Point Processes

Jennifer Gillenwater
Alex Kulesza
Ben Taskar

Determinantal point processes (DPPs) have recently been proposed as computationally efficient probabilistic models of diverse sets for a variety of applications, including document summarization, image search, and pose estimation. Many DPP inference operations, including normalization and sampling, are tractable; however, finding the most likely configuration (MAP), which is often required in practice for decoding, is NP-hard, so we must resort to approximate inference. Because DPP probabilities are log-submodular, greedy algorithms have been used in the past with some empirical success; however, these methods only give approximation guarantees in the special case of DPPs with monotone kernels. In this paper we propose a new algorithm for approximating the MAP problem based on continuous techniques for submodular function maximization. Our method involves a novel continuous relaxation of the log-probability function, which, in contrast to the multilinear extension used for general submodular functions, can be evaluated and differentiated exactly and efficiently. We obtain a practical algorithm with a 1/4-approximation guarantee for a general class of non-monotone DPPs. Our algorithm also extends to MAP inference under complex polytope constraints, making it possible to combine DPPs with Markov random fields, weighted matchings, and other models. We demonstrate that our approach outperforms greedy methods on both synthetic and real-world data.

ICML Conference 2011 Conference Paper

k-DPPs: Fixed-Size Determinantal Point Processes

Alex Kulesza
Ben Taskar

UAI Conference 2011 Conference Paper

Learning Determinantal Point Processes

Alex Kulesza
Ben Taskar

JMLR Journal 2011 Journal Article

Learning from Partial Labels

Timothee Cour
Ben Sapp
Ben Taskar

We address the problem of partially-labeled multiclass classification, where instead of a single label per instance, the algorithm is given a candidate set of labels, only one of which is correct. Our setting is motivated by a common scenario in many image and video collections, where only partial access to labels is available. The goal is to learn a classifier that can disambiguate the partially-labeled training instances, and generalize to unseen data. We define an intuitive property of the data distribution that sharply characterizes the ability to learn in this setting and show that effective learning is possible even when all the data is only partially labeled. Exploiting this property of the data, we propose a convex learning formulation based on minimization of a loss function appropriate for the partial label setting. We analyze the conditions under which our loss function is asymptotically consistent, as well as its generalization and transductive performance. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies; in particular, we annotated and experimented on a very large video data set and achieve 6% error for character naming on 16 episodes of the TV series Lost.

JMLR Journal 2011 Journal Article

Posterior Sparsity in Unsupervised Dependency Parsing

Jennifer Gillenwater
Kuzman Ganchev
João Graça
Fernando Pereira
Ben Taskar

A strong inductive bias is essential in unsupervised grammar induction. In this paper, we explore a particular sparsity bias in dependency grammars that encourages a small number of unique dependency types. We use part-of-speech (POS) tags to group dependencies by parent-child types and investigate sparsity-inducing penalties on the posterior distributions of parent-child POS tag pairs in the posterior regularization (PR) framework of Graça et al. (2007). In experiments with 12 different languages, we achieve significant gains in directed attachment accuracy over the standard expectation maximization (EM) baseline, with an average accuracy improvement of 6.5%, outperforming EM by at least 1% for 9 out of 12 languages. Furthermore, the new method outperforms models based on standard Bayesian sparsity-inducing parameter priors with an average improvement of 5% and positive gains of at least 1% for 9 out of 12 languages. On English text in particular, we show that our approach improves performance over other state-of-the-art techniques.

JMLR Journal 2010 Journal Article

Posterior Regularization for Structured Latent Variable Models

Kuzman Ganchev
João Graça
Jennifer Gillenwater
Ben Taskar

We present posterior regularization, a probabilistic framework for structured, weakly supervised learning. Our framework efficiently incorporates indirect supervision via constraints on posterior distributions of probabilistic models with latent variables. Posterior regularization separates model complexity from the complexity of structural constraints it is desired to satisfy. By directly imposing decomposable regularization on the posterior moments of latent variables during learning, we retain the computational efficiency of the unconstrained model while ensuring desired constraints hold in expectation. We present an efficient algorithm for learning with posterior regularization and illustrate its versatility on a diverse set of structural constraints such as bijectivity, symmetry and group sparsity in several large scale experiments, including multi-view learning, cross-lingual dependency grammar induction, unsupervised part-of-speech induction, and bitext word alignment.

NeurIPS Conference 2010 Conference Paper

Semi-Supervised Learning with Adversarially Missing Label Information

Umar Syed
Ben Taskar

We address the problem of semi-supervised learning in an adversarial setting. Instead of assuming that labels are missing at random, we analyze a less favorable scenario where the label information can be missing partially and arbitrarily, which is motivated by several practical examples. We present nearly matching upper and lower generalization bounds for learning in this setting under reasonable assumptions about available label information. Motivated by the analysis, we formulate a convex optimization problem for parameter estimation, derive an efficient algorithm, and analyze its convergence. We provide experimental results on several standard data sets showing the robustness of our algorithm to the pattern of missing label information, outperforming several strong baselines.

NeurIPS Conference 2010 Conference Paper

Sidestepping Intractable Inference with Structured Ensemble Cascades

David Weiss
Benjamin Sapp
Ben Taskar

For many structured prediction problems, complex models often require adopting approximate inference techniques such as variational methods or sampling, which generally provide no satisfactory accuracy guarantees. In this work, we propose sidestepping intractable inference altogether by learning ensembles of tractable sub-models as part of a structured prediction cascade. We focus in particular on problems with high-treewidth and large state-spaces, which occur in many computer vision tasks. Unlike other variational methods, our ensembles do not enforce agreement between sub-models, but filter the space of possible outputs by simply adding and thresholding the max-marginals of each constituent model. Our framework jointly estimates parameters for all models in the ensemble for each level of the cascade by minimizing a novel, convex loss function, yet requires only a linear increase in computation over learning or inference in a single tractable sub-model. We provide a generalization bound on the filtering loss of the ensemble as a theoretical justification of our approach, and we evaluate our method on both synthetic data and the task of estimating articulated human pose from challenging videos. We find that our approach significantly outperforms loopy belief propagation on the synthetic data and a state-of-the-art model on the pose estimation/tracking problem.

NeurIPS Conference 2010 Conference Paper

Structured Determinantal Point Processes

Alex Kulesza
Ben Taskar

We present a novel probabilistic model for distributions over sets of structures -- for example, sets of sequences, trees, or graphs. The critical characteristic of our model is a preference for diversity: sets containing dissimilar structures are more likely. Our model is a marriage of structured probabilistic models, like Markov random fields and context free grammars, with determinantal point processes, which arise in quantum physics as models of particles with repulsive interactions. We extend the determinantal point process model to handle an exponentially-sized set of particles (structures) via a natural factorization of the model into parts. We show how this factorization leads to tractable algorithms for exact inference, including computing marginals, computing conditional probabilities, and sampling. Our algorithms exploit a novel polynomially-sized dual representation of determinantal point processes, and use message passing over a special semiring to compute relevant quantities. We illustrate the advantages of the model on tracking and articulated pose estimation problems.

NeurIPS Conference 2009 Conference Paper

Posterior vs Parameter Sparsity in Latent Variable Models

Kuzman Ganchev
Ben Taskar
Fernando Pereira
João Graça

In this paper we explore the problem of biasing unsupervised models to favor sparsity. We extend the posterior regularization framework [8] to encourage the model to achieve posterior sparsity on the unlabeled training data. We apply this new method to learn first-order HMMs for unsupervised part-of-speech (POS) tagging, and show that HMMs learned this way consistently and significantly outperform both EM-trained HMMs and HMMs with a sparsity-inducing Dirichlet prior trained by variational EM. We evaluate these HMMs on three languages (English, Bulgarian and Portuguese) under four conditions. We find that our method always improves performance with respect to both baselines, while variational Bayes actually degrades performance in most cases. We increase accuracy with respect to EM by 2.5%-8.7% absolute and we see improvements even in a semisupervised condition where a limited dictionary is provided.

UAI Conference 2008 Conference Paper

Multi-View Learning over Structured and Non-Identical Outputs

Kuzman Ganchev
João Graça
John Blitzer
Ben Taskar

In many machine learning problems, labeled training data is limited but unlabeled data is ample. Some of these problems have instances that can be factored into multiple views, each of which is nearly sufficient in determining the correct labels. In this paper we present a new algorithm for probabilistic multi-view learning which uses the idea of stochastic agreement between views as regularization. Our algorithm works on structured and unstructured problems and easily generalizes to partial agreement scenarios. For the full agreement case, our algorithm minimizes the Bhattacharyya distance between the models of each view, and performs better than CoBoosting and two-view Perceptron on several flat and structured classification problems.
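For readers unfamiliar with the Bhattacharyya distance named in this abstract, the following minimal sketch computes it for two discrete distributions over the same label set, using the standard definition: the negative log of the sum of sqrt(p_i * q_i). The function name and the example distributions are hypothetical; how the paper applies the distance to structured models is not shown here.

```python
import math


def bhattacharyya_distance(p, q):
    """Bhattacharyya distance between two discrete distributions.

    p and q are sequences of probabilities over the same outcomes.
    """
    coefficient = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    if coefficient == 0.0:
        # Disjoint supports: the distance is unbounded.
        return math.inf
    return -math.log(coefficient)


# Hypothetical label distributions predicted by two views.
view_a = [0.9, 0.1]
view_b_close = [0.8, 0.2]
view_b_far = [0.1, 0.9]

close_distance = bhattacharyya_distance(view_a, view_b_close)
far_distance = bhattacharyya_distance(view_a, view_b_far)
```

The distance is zero when the two views agree exactly and grows as their predictions diverge, which is why it can serve as an agreement penalty.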

UAI Conference 2008 Conference Paper

Multi-View Learning over Structured and Non-Identical Outputs

Kuzman Ganchev
João Graça
John Blitzer
Ben Taskar

ICRA Conference 2008 Conference Paper

Online, self-supervised terrain classification via discriminatively trained submodular Markov random fields

Paul Vernaza
Ben Taskar
Daniel D. Lee

The authors present a novel approach to the task of autonomous terrain classification based on structured prediction. We consider the problem of learning a classifier that will accurately segment an image into “obstacle” and “ground” patches based on supervised input. Previous approaches to this problem have focused mostly on local appearance; typically, a classifier is trained and evaluated on a pixel-by-pixel basis, making an implicit assumption of independence in local pixel neighborhoods. We relax this assumption by modeling correlations between pixels in the submodular MRF framework. We show how both the learning and inference tasks can be simply and efficiently implemented: exact inference via an efficient max-flow computation, and learning via an averaged-subgradient method. Unlike most comparable MRF-based approaches, our method is suitable for implementation on a robot in real-time. Experimental results are shown that demonstrate a marked increase in classification accuracy over standard methods in addition to real-time performance.

ICML Conference 2007 Conference Paper

A permutation-augmented sampler for DP mixture models

Percy Liang
Michael I. Jordan
Ben Taskar

NeurIPS Conference 2007 Conference Paper

Expectation Maximization and Posterior Constraints

Kuzman Ganchev
Ben Taskar
João Graça

The expectation maximization (EM) algorithm is a widely used maximum likelihood estimation procedure for statistical models when the values of some of the variables in the model are not observed. Very often, however, our aim is primarily to find a model that assigns values to the latent variables that have intended meaning for our data and maximizing expected likelihood only sometimes accomplishes this. Unfortunately, it is typically difficult to add even simple a-priori information about latent variables in graphical models without making the models overly complex or intractable. In this paper, we present an efficient, principled way to inject rich constraints on the posteriors of latent variables into the EM algorithm. Our method can be used to learn tractable graphical models that satisfy additional, otherwise intractable constraints. Focusing on clustering and the alignment problem for statistical machine translation, we show that simple, intuitive posterior constraints can greatly improve the performance over standard baselines and be competitive with more complex, intractable models.

UAI Conference 2007 Conference Paper

Mixture-of-Parents Maximum Entropy Markov Models

David S. Rosenberg
Dan Klein
Ben Taskar

We present the mixture-of-parents maximum entropy Markov model (MoP-MEMM), a class of directed graphical models extending MEMMs. The MoP-MEMM allows tractable incorporation of long-range dependencies between nodes by restricting the conditional distribution of each node to be a mixture of distributions given the parents. We show how to efficiently compute the exact marginal posterior node distributions, regardless of the range of the dependencies. This enables us to model non-sequential correlations present within text documents, as well as between interconnected documents, such as hyperlinked web pages. We apply the MoP-MEMM to a named entity recognition task and a web page classification task. In each, our model shows significant improvement over the basic MEMM, and is competitive with other long-range sequence models that use approximate inference.

JMLR Journal 2006 Journal Article

Structured Prediction, Dual Extragradient and Bregman Projections

Ben Taskar
Simon Lacoste-Julien
Michael I. Jordan

We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memory-efficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.

ICML Conference 2005 Conference Paper

Learning structured prediction models: a large margin approach

Ben Taskar
Vassil Chatalbashev
Daphne Koller
Carlos Guestrin

We consider large margin estimation in a broad range of prediction models where inference involves solving combinatorial optimization problems, for example, weighted graph-cuts or matchings. Our goal is to learn parameters such that inference using the model reproduces correct answers on the training data. Our method relies on the expressive power of convex optimization problems to compactly capture inference or solution optimality in structured prediction models. Directly embedding this structure within the learning formulation produces concise convex problems for efficient estimation of very complex and diverse models. We describe experimental results on a matching task, disulfide connectivity prediction, showing significant improvements over state-of-the-art methods.

NeurIPS Conference 2005 Conference Paper

Structured Prediction via the Extragradient Method

Ben Taskar
Simon Lacoste-Julien
Michael Jordan

We present a simple and scalable algorithm for large-margin estimation of structured models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem and apply the extragradient method, yielding an algorithm with linear convergence using simple gradient and projection calculations. The projection step can be solved using combinatorial algorithms for min-cost quadratic flow. This makes the approach an efficient alternative to formulations based on reductions to a quadratic program (QP). We present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm.
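The extragradient idea this abstract relies on can be sketched on a toy problem. The example below is an assumption-laden illustration, not the paper's algorithm: it uses the unconstrained bilinear saddle-point problem min over x, max over y of x * y, so the projection step the abstract mentions reduces to the identity and is omitted. All function names are hypothetical.

```python
import math


def operator(x, y):
    """Saddle-point operator of f(x, y) = x * y: (df/dx, -df/dy)."""
    return y, -x


def gradient_step(x, y, eta):
    """Plain simultaneous descent in x and ascent in y."""
    gx, gy = operator(x, y)
    return x - eta * gx, y - eta * gy


def extragradient_step(x, y, eta):
    """Take a look-ahead step, then update using the gradient found there."""
    look_x, look_y = gradient_step(x, y, eta)
    gx, gy = operator(look_x, look_y)
    return x - eta * gx, y - eta * gy


def distance_to_saddle(step, iterations=1000, eta=0.1):
    """Run the given step rule from (1, 1); the saddle point is (0, 0)."""
    x, y = 1.0, 1.0
    for _ in range(iterations):
        x, y = step(x, y, eta)
    return math.hypot(x, y)


plain_result = distance_to_saddle(gradient_step)
extragradient_result = distance_to_saddle(extragradient_step)
```

On this toy problem the plain gradient iterates spiral away from the saddle point, while the extragradient iterates contract toward it, which is the behavior that motivates the look-ahead step.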

NeurIPS Conference 2004 Conference Paper

Exponentiated Gradient Algorithms for Large-margin Structured Classification

Peter Bartlett
Michael Collins
Ben Taskar
David McAllester

We consider the problem of structured classification, where the task is to predict a label y from an input x, and y has meaningful internal structure. Our framework includes supervised training of Markov random fields and weighted context-free grammars as special cases. We describe an algorithm that solves the large-margin optimization problem defined in [12], using an exponential-family (Gibbs distribution) representation of structured objects. The algorithm is efficient, even in cases where the number of labels y is exponential in size, provided that certain expectations under Gibbs distributions can be calculated efficiently. The method for structured labels relies on a more general result, specifically the application of exponentiated gradient updates [7, 8] to quadratic programs.

ICML Conference 2004 Conference Paper

Learning associative Markov networks

Ben Taskar
Vassil Chatalbashev
Daphne Koller

ICML Conference 2003 Conference Paper

Learning on the Test Data: Leveraging Unseen Features

Ben Taskar
Ming Fai Wong
Daphne Koller

NeurIPS Conference 2003 Conference Paper

Link Prediction in Relational Data

Ben Taskar
Ming-fai Wong
Pieter Abbeel
Daphne Koller

Many real-world domains are relational in nature, consisting of a set of objects related to each other in complex ways. This paper focuses on predicting the existence and the type of links between entities in such domains. We apply the relational Markov network framework of Taskar et al. to define a joint probabilistic model over the entire link graph: entity attributes and links. The application of the RMN algorithm to this task requires the definition of probabilistic patterns over subgraph structures. We apply this method to two new relational datasets, one involving university webpages, and the other a social network. We show that the collective classification approach of RMNs, and the introduction of subgraph patterns over link labels, provide significant improvements in accuracy over flat classification, which attempts to predict each link in isolation.

NeurIPS Conference 2003 Conference Paper

Max-Margin Markov Networks

Ben Taskar
Carlos Guestrin
Daphne Koller

In typical classification tasks, we seek a function which assigns a label to a single object. Kernel-based approaches, such as support vector machines (SVMs), which maximize the margin of confidence of the classifier, are the method of choice for many such tasks. Their popularity stems both from the ability to use high-dimensional feature spaces, and from their strong theoretical guarantees. However, many real-world tasks involve sequential, spatial, or structured data, where multiple labels must be assigned. Existing kernel-based methods ignore structure in the problem, assigning labels independently to each object, losing much useful information. Conversely, probabilistic graphical models, such as Markov networks, can represent correlations between labels, by exploiting problem structure, but cannot handle high-dimensional feature spaces, and lack strong theoretical generalization guarantees. In this paper, we present a new framework that combines the advantages of both approaches: Maximum margin Markov (M3) networks incorporate both kernels, which efficiently deal with high-dimensional features, and the ability to capture correlations in structured data. We present an efficient algorithm for learning M3 networks based on a compact quadratic program formulation. We provide a new theoretical bound for generalization in structured domains. Experiments on the task of handwritten character recognition and collective hypertext classification demonstrate very significant gains over previous approaches.

UAI Conference 2002 Conference Paper

Discriminative Probabilistic Models for Relational Data

Ben Taskar
Pieter Abbeel
Daphne Koller

In many supervised learning tasks, the entities to be labeled are related to each other in complex ways and their labels are not independent. For example, in hypertext classification, the labels of linked pages are highly correlated. A standard approach is to classify each entity independently, ignoring the correlations between them. Recently, Probabilistic Relational Models, a relational version of Bayesian networks, were used to define a joint probabilistic model for a collection of related entities. In this paper, we present an alternative framework that builds on (conditional) Markov networks and addresses two limitations of the previous approach. First, undirected models do not impose the acyclicity constraint that hinders representation of many important relational dependencies in directed models. Second, undirected models are well suited for discriminative training, where we optimize the conditional likelihood of the labels given the features, which generally improves classification accuracy. We show how to train these models effectively, and how to use approximate probabilistic inference over the learned model for collective classification of multiple related entities. We provide experimental results on a webpage classification task, showing that accuracy can be significantly improved by modeling relational dependencies.

ICML Conference 2001 Conference Paper

Learning Probabilistic Models of Relational Structure

Lise Getoor
Nir Friedman
Daphne Koller
Ben Taskar