
Author name cluster

David Newman

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers · 1 author row

Possible papers (5)

TIST · 2012 · Journal Article

TopicNets: Visual Analysis of Large Text Corpora with Topic Modeling

  • Brynjar Gretarsson
  • John O’Donovan
  • Svetlin Bostandjiev
  • Tobias Höllerer
  • Arthur Asuncion
  • David Newman
  • Padhraic Smyth

We present TopicNets, a Web-based system for visual and interactive analysis of large sets of documents using statistical topic models. A range of visualization types and control mechanisms to support knowledge discovery are presented. These include corpus- and document-specific views, iterative topic modeling, search, and visual filtering. Drill-down functionality is provided to allow analysts to visualize individual document sections and their relations within the global topic space. Analysts can search across a dataset through a set of expansion techniques on selected document and topic nodes. Furthermore, analysts can select relevant subsets of documents and perform real-time topic modeling on these subsets to interactively visualize topics at various levels of granularity, allowing for a better understanding of the documents. A discussion of the design and implementation choices for each visual analysis technique is presented. This is followed by a discussion of three diverse use cases in which TopicNets enables fast discovery of information that is otherwise hard to find. These include a corpus of 50,000 successful NSF grant proposals, 10,000 publications from a large research center, and single documents including a grant proposal and a PhD thesis.
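
The document-topic graph at the core of these views is straightforward to prototype. Below is a minimal sketch, assuming numpy and networkx; the document-topic matrix `theta`, the 0.2 edge threshold, and all names are illustrative stand-ins, not part of the actual TopicNets system.

```python
# Minimal sketch of a document-topic graph of the kind TopicNets renders:
# documents and topics become nodes, and an edge is drawn whenever a
# document assigns enough probability mass to a topic. The `theta`
# matrix here is random stand-in data, not output from the real system.
import numpy as np
import networkx as nx

rng = np.random.default_rng(0)
n_docs, n_topics = 20, 5

# Stand-in document-topic proportions (each row sums to 1).
theta = rng.dirichlet(np.ones(n_topics), size=n_docs)

G = nx.Graph()
G.add_nodes_from((f"doc{d}" for d in range(n_docs)), kind="document")
G.add_nodes_from((f"topic{k}" for k in range(n_topics)), kind="topic")

# Connect a document to a topic when the topic carries enough weight;
# the 0.2 threshold is an arbitrary choice for this sketch.
for d in range(n_docs):
    for k in range(n_topics):
        if theta[d, k] > 0.2:
            G.add_edge(f"doc{d}", f"topic{k}", weight=float(theta[d, k]))

print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```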

NeurIPS · 2011 · Conference Paper

Improving Topic Coherence with Regularized Topic Models

  • David Newman
  • Edwin Bonilla
  • Wray Buntine

Topic models have the potential to improve search and browsing by extracting useful semantic themes from web pages and other text documents. When learned topics are coherent and interpretable, they can be valuable for faceted browsing, results set diversity analysis, and document retrieval. However, when dealing with small collections or noisy text (e.g., web search result snippets or blog posts), learned topics can be less coherent, less interpretable, and less useful. To overcome this, we propose two methods to regularize the learning of topic models. Our regularizers work by creating a structured prior over words that reflects broad patterns in the external data. Using thirteen datasets we show that both regularizers improve topic coherence and interpretability while learning a faithful representation of the collection of interest. Overall, this work makes topic models more useful across a broader range of text data.
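
The structured-prior idea can be made concrete with a small sketch. Here an external word-word similarity matrix smooths each topic's word distribution; the random matrix `C`, the mixing weight `nu`, and the convolution step are hypothetical stand-ins, not the paper's exact quadratic or convolved-Dirichlet regularizers.

```python
# Hedged sketch of the structured-prior idea: an external word-word
# similarity matrix C (here random and symmetric) smooths each topic's
# word distribution so that related words share probability mass.
import numpy as np

rng = np.random.default_rng(1)
n_topics, vocab = 4, 50

# Stand-in topic-word counts from some inference procedure.
counts = rng.poisson(lam=2.0, size=(n_topics, vocab)).astype(float)

# Stand-in external similarity: symmetric, rows normalized to sum to 1.
C = rng.random((vocab, vocab))
C = (C + C.T) / 2.0
C /= C.sum(axis=1, keepdims=True)

nu = 0.5  # how strongly the external structure pulls on the topics
phi = counts / counts.sum(axis=1, keepdims=True)

# Convolve each topic with the similarity structure, then renormalize.
phi_reg = (1 - nu) * phi + nu * phi @ C
phi_reg /= phi_reg.sum(axis=1, keepdims=True)
```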

JMLR · 2009 · Journal Article

Distributed Algorithms for Topic Models

  • David Newman
  • Arthur Asuncion
  • Padhraic Smyth
  • Max Welling

We describe distributed algorithms for two widely-used topic models, namely the Latent Dirichlet Allocation (LDA) model, and the Hierarchical Dirichlet Process (HDP) model. In our distributed algorithms the data is partitioned across separate processors and inference is done in a parallel, distributed fashion. We propose two distributed algorithms for LDA. The first algorithm is a straightforward mapping of LDA to a distributed processor setting. In this algorithm processors concurrently perform Gibbs sampling over local data followed by a global update of topic counts. The algorithm is simple to implement and can be viewed as an approximation to Gibbs-sampled LDA. The second version is a model that uses a hierarchical Bayesian extension of LDA to directly account for distributed data. This model has a theoretical guarantee of convergence but is more complex to implement than the first algorithm. Our distributed algorithm for HDP takes the straightforward mapping approach, and merges newly-created topics either by matching or by topic-id. Using five real-world text corpora we show that distributed learning works well in practice. For both LDA and HDP, we show that the converged test-data log probability for distributed learning is indistinguishable from that obtained with single-processor learning. Our extensive experimental results include learning topic models for two multi-million document collections using a 1024-processor parallel computer.
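
The first algorithm's sample-then-synchronize loop can be sketched in a few lines. In the simulation below the local Gibbs sweep is a stand-in stub that merely shuffles count mass between topics; only the global reconciliation step, folding each processor's delta into the old counts, follows the scheme described in the abstract.

```python
# Sketch of the global synchronization in approximate distributed LDA:
# each processor Gibbs-samples its shard against a private copy of the
# topic-word count matrix, then the copies are reconciled by summing
# each processor's *change* onto the previous global counts:
#   N_new = N_old + sum_p (N_p - N_old)
# Everything here is simulated; `local_sweep` stands in for a real
# collapsed Gibbs pass over a processor's shard of documents.
import numpy as np

rng = np.random.default_rng(2)
n_topics, vocab, n_procs = 4, 30, 8

N = rng.integers(0, 5, size=(n_topics, vocab)).astype(float)  # global counts

def local_sweep(N_local, rng):
    """Stand-in for one Gibbs sweep over a shard: moves a little
    count mass between topics, keeping per-word totals fixed."""
    for _ in range(10):
        w = rng.integers(vocab)
        src, dst = rng.integers(n_topics, size=2)
        if N_local[src, w] >= 1:
            N_local[src, w] -= 1
            N_local[dst, w] += 1
    return N_local

for sync_round in range(5):
    local_counts = [local_sweep(N.copy(), rng) for _ in range(n_procs)]
    # Global update: fold every processor's delta into the old counts.
    N = N + sum(Np - N for Np in local_counts)

print("total count mass preserved:", N.sum())
```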

NeurIPS · 2007 · Conference Paper

Distributed Inference for Latent Dirichlet Allocation

  • David Newman
  • Padhraic Smyth
  • Max Welling
  • Arthur Asuncion

We investigate the problem of learning a widely-used latent-variable model – the Latent Dirichlet Allocation (LDA) or “topic” model – using distributed computation, where each of P processors only sees 1/P of the total data set. We propose two distributed inference schemes that are motivated from different perspectives. The first scheme uses local Gibbs sampling on each processor with periodic updates—it is simple to implement and can be viewed as an approximation to a single processor implementation of Gibbs sampling. The second scheme relies on a hierarchical Bayesian extension of the standard LDA model to directly account for the fact that data are distributed across P processors—it has a theoretical guarantee of convergence but is more complex to implement than the approximate method. Using five real-world text corpora we show that distributed learning works very well for LDA models, i.e., perplexity and precision-recall scores for distributed learning are indistinguishable from those obtained with single-processor learning. Our extensive experimental results include large-scale distributed computation on 1000 virtual processors; and speedup experiments of learning topics in a 100-million word corpus using 16 processors.
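
For reference, the per-token collapsed Gibbs update that each processor runs locally in the first scheme looks roughly like the sketch below. The toy counts and names are illustrative, and the formula is the standard collapsed-Gibbs conditional for LDA rather than anything specific to the distributed setting.

```python
# Sketch of the per-token collapsed Gibbs update for LDA. For token w
# in document d, topic k is resampled with probability proportional to
#   (n_kw + beta) * (n_dk + alpha) / (n_k + V*beta),
# where the counts exclude the token being resampled. The toy
# statistics below are random stand-ins, not real corpus counts.
import numpy as np

rng = np.random.default_rng(3)
K, V, alpha, beta = 3, 10, 0.1, 0.01

# Stand-in sufficient statistics for one document d and the corpus.
n_kw = rng.integers(1, 20, size=(K, V)).astype(float)  # topic-word counts
n_dk = rng.integers(1, 5, size=K).astype(float)        # doc-topic counts
n_k = n_kw.sum(axis=1)                                 # topic totals

def resample_token(w, old_k):
    """Remove the token from the counts, draw a new topic, add it back."""
    n_kw[old_k, w] -= 1; n_dk[old_k] -= 1; n_k[old_k] -= 1
    p = (n_kw[:, w] + beta) * (n_dk + alpha) / (n_k + V * beta)
    new_k = rng.choice(K, p=p / p.sum())
    n_kw[new_k, w] += 1; n_dk[new_k] += 1; n_k[new_k] += 1
    return new_k

print(resample_token(w=4, old_k=0))
```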

NeurIPS · 2006 · Conference Paper

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

  • Yee Teh
  • David Newman
  • Max Welling

Latent Dirichlet allocation (LDA) is a Bayesian network that has recently gained much popularity in applications ranging from document modeling to computer vision. Due to the large scale nature of these applications, current inference procedures like variational Bayes and Gibbs sampling have been found lacking. In this paper we propose the collapsed variational Bayesian inference algorithm for LDA, and show that it is computationally efficient, easy to implement and significantly more accurate than standard variational Bayesian inference for LDA.
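
A rough feel for collapsed variational inference can be had from the zeroth-order simplification (CVB0, introduced in later work) sketched below. It drops the second-order correction terms that this paper's algorithm uses, so it is an approximation of the method under stated assumptions, not the paper's exact update; the toy corpus and names are illustrative.

```python
# Sketch of the zeroth-order collapsed variational update (CVB0): each
# token keeps a soft assignment gamma over topics, updated against
# expected counts rather than sampled ones. This is a batch-style sweep
# for brevity; the incremental version updates counts per token.
import numpy as np

rng = np.random.default_rng(4)
K, V, alpha, beta = 3, 8, 0.1, 0.01
docs = [rng.integers(V, size=12) for _ in range(4)]  # toy word-id docs

# One soft assignment vector per token, initialized at random.
gamma = [rng.dirichlet(np.ones(K), size=len(d)) for d in docs]

for _ in range(20):
    # Expected counts under the current variational distribution,
    # with the Dirichlet priors folded in as smoothing.
    N_kw = np.full((K, V), beta)
    N_dk = [np.full(K, alpha) for _ in docs]
    for d, (doc, g) in enumerate(zip(docs, gamma)):
        for i, w in enumerate(doc):
            N_kw[:, w] += g[i]
            N_dk[d] += g[i]
    N_k = N_kw.sum(axis=1)
    # CVB0 update: exclude the token's own mass, then renormalize.
    for d, (doc, g) in enumerate(zip(docs, gamma)):
        for i, w in enumerate(doc):
            p = (N_kw[:, w] - g[i]) * (N_dk[d] - g[i]) / (N_k - g[i])
            g[i] = p / p.sum()
```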