Arrow Research search

Author name cluster

John Blitzer

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers

10

UAI Conference 2012 Conference Paper

Latent Structured Ranking

  • Jason Weston
  • John Blitzer

Many latent (factorized) models have been proposed for recommendation tasks like collaborative filtering and for ranking tasks like document or image retrieval and annotation. Common to all those methods is that during inference the items are scored independently by their similarity to the query in the latent embedding space. The structure of the ranked list (i.e. considering the set of items returned as a whole) is not taken into account. This can be a problem because the set of top predictions can be either too diverse (contain results that contradict each other) or are not diverse enough. In this paper we introduce a method for learning latent structured rankings that improves over existing methods by providing the right blend of predictions at the top of the ranked list. Particular emphasis is put on making this method scalable. Empirical results on large scale image annotation and music recommendation tasks show improvements over existing approaches.

NeurIPS Conference 2011 Conference Paper

Co-Training for Domain Adaptation

  • Minmin Chen
  • Kilian Weinberger
  • John Blitzer

Domain adaptation algorithms seek to generalize a model trained in a source domain to a new target domain. In many practical cases, the source and target distributions can differ substantially, and in some cases crucial target features may not have support in the source domain. In this paper we introduce an algorithm that bridges the gap between source and target domains by slowly adding both the target features and instances in which the current algorithm is the most confident. Our algorithm is a variant of co-training, and we name it CODA (Co-training for domain adaptation). Unlike the original co-training work, we do not assume a particular feature split. Instead, for each iteration of co-training, we add target features and formulate a single optimization problem which simultaneously learns a target predictor, a split of the feature space into views, and a shared subset of source and target features to include in the predictor. CODA significantly out-performs the state-of-the-art on the 12-domain benchmark data set of Blitzer et al. . Indeed, over a wide range (65 of 84 comparisons) of target supervision, ranging from no labeled target data to a relatively large number of target labels, CODA achieves the best performance.

UAI Conference 2008 Conference Paper

Multi-View Learning over Structured and Non-Identical Outputs

  • Kuzman Ganchev
  • João Graça
  • John Blitzer
  • Ben Taskar

In many machine learning problems, labeled training data is limited but unlabeled data is ample. Some of these problems have instances that can be factored into multiple views, each of which is nearly sufficent in determining the correct labels. In this paper we present a new algorithm for probabilistic multi-view learning which uses the idea of stochastic agreement between views as regularization. Our algorithm works on structured and unstructured problems and easily generalizes to partial agreement scenarios. For the full agreement case, our algorithm minimizes the Bhattacharyya distance between the models of each view, and performs better than CoBoosting and two-view Perceptron on several flat and structured classification problems.

NeurIPS Conference 2008 Conference Paper

Regularized Learning with Networks of Features

  • Ted Sandler
  • John Blitzer
  • Partha Talukdar
  • Lyle Ungar

For many supervised learning problems, we possess prior knowledge about which features yield similar information about the target variable. In predicting the topic of a document, we might know that two words are synonyms, or when performing image recognition, we know which pixels are adjacent. Such synonymous or neighboring features are near-duplicates and should therefore be expected to have similar weights in a good model. Here we present a framework for regularized learning in settings where one has prior knowledge about which features are expected to have similar and dissimilar weights. This prior knowledge is encoded as a graph whose vertices represent features and whose edges represent similarities and dissimilarities between them. During learning, each feature's weight is penalized by the amount it differs from the average weight of its neighbors. For text classification, regularization using graphs of word co-occurrences outperforms manifold learning and compares favorably to other recently proposed semi-supervised learning methods. For sentiment analysis, feature graphs constructed from declarative human knowledge, as well as from auxiliary task learning, significantly improve prediction accuracy.

NeurIPS Conference 2007 Conference Paper

Learning Bounds for Domain Adaptation

  • John Blitzer
  • Koby Crammer
  • Alex Kulesza
  • Fernando Pereira
  • Jennifer Wortman

Empirical risk minimization offers well-known learning guarantees when training and test data come from the same domain. In the real world, though, we often wish to adapt a classifier from a source domain with a large amount of training data to different target domain with very little training data. In this work we give uniform convergence bounds for algorithms that minimize a convex combination of source and target empirical risk. The bounds explicitly model the inherent trade-off between training on a large but inaccurate source data set and a small but accurate target training set. Our theory also gives results when we have multiple source domains, each of which may have a different number of instances, and we exhibit cases in which minimizing a non-uniform combination of source risks can achieve much lower target error than standard empirical risk minimization.

NeurIPS Conference 2006 Conference Paper

Analysis of Representations for Domain Adaptation

  • Shai Ben-David
  • John Blitzer
  • Koby Crammer
  • Fernando Pereira

Discriminative learning methods for classification perform well when training and test data are drawn from the same distribution. In many situations, though, we have labeled training data for a source domain, and we wish to learn a classifier which performs well on a target domain with a different distribution. Under what conditions can we adapt a classifier trained on the source domain for use in the target domain? Intuitively, a good feature representation is a crucial factor in the success of domain adaptation. We formalize this intuition theoretically with a generalization bound for domain adaption. Our theory illustrates the tradeoffs inherent in designing a representation for domain adaptation and gives a new justification for a recently proposed model. It also points toward a promising new model for domain adaptation: one which explicitly minimizes the difference between the source and target domains, while at the same time maximizing the margin of the training set.

NeurIPS Conference 2005 Conference Paper

Distance Metric Learning for Large Margin Nearest Neighbor Classification

  • Kilian Weinberger
  • John Blitzer
  • Lawrence Saul

We show how to learn a Mahanalobis distance metric for k -nearest neighbor (kNN) classification by semidefinite programming. The metric is trained with the goal that the k -nearest neighbors always belong to the same class while examples from different classes are separated by a large margin. On seven data sets of varying size and difficulty, we find that metrics trained in this way lead to significant improvements in kNN classification--for example, achieving a test error rate of 1. 3% on the MNIST handwritten digits. As in support vector machines (SVMs), the learning problem reduces to a convex optimization based on the hinge loss. Unlike learning in SVMs, however, our framework requires no modification or extension for problems in multiway (as opposed to binary) classification.

NeurIPS Conference 2004 Conference Paper

Hierarchical Distributed Representations for Statistical Language Modeling

  • John Blitzer
  • Fernando Pereira
  • Kilian Weinberger
  • Lawrence Saul

Statistical language models estimate the probability of a word occurring in a given context. The most common language models rely on a discrete enumeration of predictive contexts (e. g. , n-grams) and consequently fail to capture and exploit statistical regularities across these contexts. In this paper, we show how to learn hierarchical, distributed representations of word contexts that maximize the predictive value of a statistical language model. The representations are initialized by unsupervised algorithms for linear and nonlinear dimensionality reduction [14], then fed as input into a hierarchical mixture of experts, where each expert is a multinomial dis- tribution over predicted words [12]. While the distributed representations in our model are inspired by the neural probabilistic language model of Bengio et al. [2, 3], our particular architecture enables us to work with significantly larger vocabularies and training corpora. For example, on a large-scale bigram modeling task involving a sixty thousand word vocab- ulary and a training corpus of three million sentences, we demonstrate consistent improvement over class-based bigram models [10, 13]. We also discuss extensions of our approach to longer multiword contexts. 1 Introduction Statistical language models are essential components of natural language systems for human-computer interaction. They play a central role in automatic speech recognition [11], machine translation [5], statistical parsing [8], and information retrieval [15]. These mod- els estimate the probability that a word will occur in a given context, where in general a context specifies a relationship to one or more words that have already been observed. The simplest, most studied case is that of n-gram language modeling, where each word is predicted from the preceding n-1 words. The main problem in building these models is that the vast majority of word combinations occur very infrequently, making it difficult to estimate accurate probabilities of words in most contexts. Researchers in statistical language modeling have developed a variety of smoothing tech- niques to alleviate this problem of data sparseness. Most smoothing methods are based on simple back-off formulas or interpolation schemes that discount the probability of observed events and assign the "leftover" probability mass to events unseen in training [7]. Unfortu- nately, these methods do not typically represent or take advantage of statistical regularities among contexts. One expects the probabilities of rare or unseen events in one context to be related to their probabilities in statistically similar contexts. Thus, it should be possible to estimate more accurate probabilities by exploiting these regularities. Several approaches have been suggested for sharing statistical information across contexts. The aggregate Markov model (AMM) of Saul and Pereira [13] (also discussed by Hofmann and Puzicha [10] as a special case of the aspect model) factors the conditional probability table of a word given its context by a latent variable representing context "classes". How- ever, this latent variable approach is difficult to generalize to multiword contexts, as the size of the conditional probability table for class given context grows exponentially with the context length. The neural probabilistic language model (NPLM) of Bengio et al. [2, 3] achieved signifi- cant improvements over state-of-the-art smoothed n-gram models [6]. The NPLM encodes contexts as low-dimensional continuous vectors. These are fed to a multilayer neural net- work that outputs a probability distribution over words. The low-dimensional vectors and the parameters of the network are trained simultaneously to minimize the perplexity of the language model. This model has no difficulty encoding multiword contexts, but its training and application are very costly because of the need to compute a separate normalization for the conditional probabilities associated to each context. In this paper, we introduce and evaluate a statistical language model that combines the advantages of the AMM and NPLM. Like the NPLM, it can be used for multiword con- texts, and like the AMM it avoids per-context normalization. In our model, contexts are represented as low-dimensional real vectors initialized by unsupervised algorithms for di- mensionality reduction [14]. The probabilities of words given contexts are represented by a hierarchical mixture of experts (HME) [12], where each expert is a multinomial distri- bution over predicted words. This tree-structured mixture model allows a rich dependency on context without expensive per-context normalization. Proper initialization of the dis- tributed representations is crucial; in particular, we find that initializations from the results of linear and nonlinear dimensionality reduction algorithms lead to better models (with significantly lower test perplexities) than random initialization. In practice our model is several orders of magnitude faster to train and apply than the NPLM, enabling us to work with larger vocabularies and training corpora. We present re- sults on a large-scale bigram modeling task, showing that our model also leads to significant improvements over comparable AMMs. 2 Distributed representations of words Natural language has complex, multidimensional semantics. As a trivial example, consider the following four sentences: The vase broke. The vase contains water. The window broke. The window contains water. The bottom right sentence is syntactically valid but semantically meaningless. As shown by the table, a two-bit distributed representation of the words "vase" and "window" suffices to express that a vase is both a container and breakable, while a window is breakable but can- not be a container. More generally, we expect low dimensional continuous representations of words to be even more effective at capturing semantic regularities. Distributed representations of words can be derived in several ways. In a given corpus of text, for example, consider the matrix of bigram counts whose element Cij records the number of times that word wj follows word wi. Further, let pij = Cij/ C k ik denote the conditional frequencies derived from these counts, and let pi denote the V -dimensional frequency vector with elements pij, where V is the vocabulary size. Note that the vectors pi themselves provide a distributed representation of the words wi in the corpus. For large vocabularies and training corpora, however, this is an extremely unwieldy representation, tantamount to storing the full matrix of bigram counts. Thus, it is natural to seek a lower dimensional representation that captures the same information. To this end, we need to map each vector pi to some d-dimensional vector xi, with d V. We consider two methods in dimensionality reduction for this problem. The results from these methods are then used to initialize the HME architecture in the next section. 2. 1 Linear dimensionality reduction The simplest form of dimensionality reduction is principal component analysis (PCA). PCA computes a linear projection of the frequency vectors pi into the low dimensional subspace that maximizes their variance. The variance-maximizing subspace of dimension- ality d is spanned by the top d eigenvectors of the frequency vector covariance matrix. The eigenvalues of the covariance matrix measure the variance captured by each axis of the subspace. The effect of PCA can also be understood as a translation and rotation of the frequency vectors pi, followed by a truncation that preserves only their first d elements. 2. 2 Nonlinear dimensionality reduction Intuitively, we would like to map the vectors pi into a low dimensional space where se- mantically similar words remain close together and semantically dissimilar words are far apart. Can we find a nonlinear mapping that does this better than PCA? Weinberger et al. recently proposed a new solution to this problem based on semidefinite programming [14]. Let xi denote the image of pi under this mapping. The mapping is discovered by first learning the V V matrix of Euclidean squared distances [1] given by Dij = |xi - xj|2. This is done by balancing two competing goals: (i) to co-locate semantically similar words, and (ii) to separate semantically dissimilar words. The first goal is achieved by fixing the distances between words with similar frequency vectors to their original values. In particu- lar, if pj and pk lie within some small neighborhood of each other, then the corresponding element Djk in the distance matrix is fixed to the value |pj - pk|2. The second goal is achieved by maximizing the sum of pairwise squared distances ijDij. Thus, we push the words in the vocabulary as far apart as possible subject to the constraint that the distances between semantically similar words do not change. The only freedom in this optimization is the criterion for judging that two words are se- mantically similar. In practice, we adopt a simple criterion such as k-nearest neighbors in the space of frequency vectors pi and choose k as small as possible so that the resulting neighborhood graph is connected [14]. The optimization is performed over the space of Euclidean squared distance matrices [1]. Necessary and sufficient conditions for the matrix D to be interpretable as a Euclidean squared distance matrix are that D is symmetric and that the Gram matrix1 derived from G = - 1 HDHT is semipositive definite, where H = I - 1 11T. The optimization can 2 V thus be formulated as the semidefinite programming problem: Maximize ijDij subject to: (i) DT = D, (ii) - 1 HDH 0, and 2 (iii) Dij = |pi - pj|2 for all neighboring vectors pi and pj. 1Assuming without loss of generality that the vectors xi are centered on the origin, the dot prod- ucts Gij = xi xj are related to the pairwise squared distances Dij = |xi - xj |2 as stated above.