Arrow Research search

Author name cluster

Lawrence Carin

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

150 papers
2 author rows

Possible papers (150)

NeurIPS Conference 2025 Conference Paper

Coupling Generative Modeling and an Autoencoder with the Causal Bridge

  • Ruolin Meng
  • Ming-Yu Chung
  • Dhanajit Brahma
  • Ricardo Henao
  • Lawrence Carin

We consider inferring the causal effect of a treatment (intervention) on an outcome of interest in situations where there is potentially an unobserved confounder influencing both the treatment and the outcome. This is achievable by assuming access to two separate sets of control (proxy) measurements associated with treatment and outcomes, which are used to estimate treatment effects through a function termed the causal bridge (CB). We present a new theoretical perspective, associated assumptions for when estimating treatment effects with the CB is feasible, and a bound on the average error of the treatment effect when the CB assumptions are violated. From this new perspective, we then demonstrate how coupling the CB with an autoencoder architecture allows for the sharing of statistical strength between observed quantities (proxies, treatment, and outcomes), thus improving the quality of the CB estimates. Experiments on synthetic and real-world data demonstrate the effectiveness of the proposed approach relative to state-of-the-art methodology for causal inference with proxy measurements.

NeurIPS Conference 2025 Conference Paper

From Softmax to Score: Transformers Can Effectively Implement In-Context Denoising Steps

  • Paul Rosu
  • Lawrence Carin
  • Xiang Cheng

Transformers have emerged as powerful meta-learners, with growing evidence that they implement learning algorithms within their forward pass. We study this phenomenon in the context of denoising, presenting a unified framework that shows Transformers can implement (a) manifold denoising via Laplacian flows, (b) score-based denoising from diffusion models, and (c) a generalized form of anisotropic diffusion denoising. Our theory establishes exact equivalence between Transformer attention updates and these algorithms. Empirically, we validate these findings on image denoising tasks, showing that even simple Transformers can perform robust denoising both with and without context. These results illustrate the Transformer’s flexibility as a denoising meta-learner. Code available at https://github.com/paulrosu11/Transformers_are_Diffusion_Denoisers.
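
As a rough illustration of the manifold-denoising view described above, the following sketch applies a kernel self-attention averaging step to noisy 2-D points; the Gaussian bandwidth, step size, and iteration count are illustrative choices, not the paper's construction.

```python
import numpy as np

def attention_denoise_step(x, bandwidth=0.5, step=1.0):
    """One kernel self-attention step viewed as a Laplacian/diffusion denoising update.

    x: (n, d) array of noisy points. The row-wise softmax of pairwise similarities
    plays the role of an attention matrix; averaging neighbors pulls points back
    toward the underlying manifold.
    """
    sq_dists = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)   # (n, n) pairwise squared distances
    logits = -sq_dists / (2 * bandwidth ** 2)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)                     # softmax over rows
    return (1 - step) * x + step * (attn @ x)                   # move toward the attention average

# toy example: noisy samples on a circle
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 200)
clean = np.c_[np.cos(theta), np.sin(theta)]
noisy = clean + 0.1 * rng.normal(size=clean.shape)
denoised = noisy
for _ in range(5):
    denoised = attention_denoise_step(denoised, bandwidth=0.3, step=0.5)
print(np.mean((noisy - clean) ** 2), np.mean((denoised - clean) ** 2))  # error should shrink
```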

ICLR Conference 2025 Conference Paper

Graph Transformers Dream of Electric Flow

  • Xiang Cheng
  • Lawrence Carin
  • Suvrit Sra

We show theoretically and empirically that the linear Transformer, when applied to graph data, can implement algorithms that solve canonical problems such as electric flow and eigenvector decomposition. The Transformer has access to information on the input graph only via the graph's incidence matrix. We present explicit weight configurations for implementing each algorithm, and we bound the constructed Transformers' errors by the errors of the underlying algorithms. Our theoretical findings are corroborated by experiments on synthetic data. Additionally, on a real-world molecular regression task, we observe that the linear Transformer is capable of learning a more effective positional encoding than the default one based on Laplacian eigenvectors. Our work is an initial step towards elucidating the inner-workings of the Transformer for graph data. Code is available at https://github.com/chengxiang/LinearGraphTransformer

ICML Conference 2025 Conference Paper

On Understanding Attention-Based In-Context Learning for Categorical Data

  • Aaron T. Wang
  • William Convertino
  • Xiang Cheng
  • Ricardo Henao
  • Lawrence Carin

In-context learning based on attention models is examined for data with categorical outcomes, with inference in such models viewed from the perspective of functional gradient descent (GD). We develop a network composed of attention blocks, with each block employing a self-attention layer followed by a cross-attention layer, with associated skip connections. This model can exactly perform multi-step functional GD inference for in-context inference with categorical observations. We perform a theoretical analysis of this setup, generalizing many prior assumptions in this line of work, including the class of attention mechanisms for which it is appropriate. We demonstrate the framework empirically on synthetic data, image classification and language generation.

UAI Conference 2022 Conference Paper

Capturing actionable dynamics with structured latent ordinary differential equations

  • Paidamoyo Chapfuwa
  • Sherri Rose
  • Lawrence Carin
  • Edward Meeds
  • Ricardo Henao

End-to-end learning of dynamical systems with black-box models, such as neural ordinary differential equations (ODEs), provides a flexible framework for learning dynamics from data without prescribing a mathematical model for the dynamics. Unfortunately, this flexibility comes at the cost of understanding the dynamical system, for which ODEs are used ubiquitously. Further, experimental data are collected under various conditions (inputs), such as treatments, or grouped in some way, such as part of sub-populations. Understanding the effects of these system inputs on system outputs is crucial to have any meaningful model of a dynamical system. To that end, we propose a structured latent ODE model that explicitly captures system input variations within its latent representation. Building on a static latent variable specification, our model learns (independent) stochastic factors of variation for each input to the system, thus separating the effects of the system inputs in the latent space. This approach provides actionable modeling through the controlled generation of time-series data for novel input combinations (or perturbations). Additionally, we propose a flexible approach for quantifying uncertainties, leveraging a quantile regression formulation. Results on challenging biological datasets show consistent improvements over competitive baselines in the controlled generation of observational data and inference of biologically meaningful system inputs.

AIIM Journal 2022 Journal Article

Explainable multiple abnormality classification of chest CT volumes

  • Rachel Lea Draelos
  • Lawrence Carin

Understanding model predictions is critical in healthcare, to facilitate rapid verification of model correctness and to guard against use of models that exploit confounding variables. We introduce the challenging new task of explainable multiple abnormality classification in volumetric medical images, in which a model must indicate the regions used to predict each abnormality. To solve this task, we propose a multiple instance learning convolutional neural network, AxialNet, that allows identification of top slices for each abnormality. Next we incorporate HiResCAM, an attention mechanism, to identify sub-slice regions. We prove that for AxialNet, HiResCAM explanations are guaranteed to reflect the locations the model used, unlike Grad-CAM which sometimes highlights irrelevant locations. Armed with a model that produces faithful explanations, we then aim to improve the model’s learning through a novel mask loss that leverages HiResCAM and 3D allowed regions to encourage the model to predict abnormalities based only on the organs in which those abnormalities appear. The 3D allowed regions are obtained automatically through a new approach, PARTITION, that combines location information extracted from radiology reports with organ segmentation maps obtained through morphological image processing. Overall, we propose the first model for explainable multi-abnormality prediction in volumetric medical images, and then use the mask loss to achieve a 33% improvement in organ localization of multiple abnormalities in the RAD-ChestCT dataset of 36,316 scans, representing the state of the art. This work advances the clinical applicability of multiple abnormality modeling in chest CT volumes.

ICLR Conference 2022 Conference Paper

Gradient Importance Learning for Incomplete Observations

  • Qitong Gao
  • Dong Wang 0037
  • Joshua David Amason
  • Siyang Yuan
  • Chenyang Tao
  • Ricardo Henao
  • Majda Hadziahmetovic
  • Lawrence Carin

Though recent works have developed methods that can generate estimates (or imputations) of the missing entries in a dataset to facilitate downstream analysis, most depend on assumptions that may not align with real-world applications and could suffer from poor performance in subsequent tasks such as classification. This is particularly true if the data have large missingness rates or a small sample size. More importantly, the imputation error could be propagated into the prediction step that follows, which may constrain the capabilities of the prediction model. In this work, we introduce the gradient importance learning (GIL) method to train multilayer perceptrons (MLPs) and long short-term memories (LSTMs) to directly perform inference from inputs containing missing values without imputation. Specifically, we employ reinforcement learning (RL) to adjust the gradients used to train these models via back-propagation. This allows the model to exploit the underlying information behind missingness patterns. We test the approach on real-world time-series (i.e., MIMIC-III), tabular data obtained from an eye clinic, and a standard dataset (i.e., MNIST), where our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.

NeurIPS Conference 2021 Conference Paper

CAM-GAN: Continual Adaptation Modules for Generative Adversarial Networks

  • Sakshi Varshney
  • Vinay Kumar Verma
  • P. K. Srijith
  • Lawrence Carin
  • Piyush Rai

We present a continual learning approach for generative adversarial networks (GANs), by designing and leveraging parameter-efficient feature map transformations. Our approach is based on learning a set of global and task-specific parameters. The global parameters are fixed across tasks whereas the task-specific parameters act as local adapters for each task, and help in efficiently obtaining task-specific feature maps. Moreover, we propose an element-wise addition of residual bias in the transformed feature space, which further helps stabilize GAN training in such settings. Our approach also leverages task similarities based on the Fisher information matrix. Leveraging this knowledge from previous tasks significantly improves the model performance. In addition, the similarity measure also helps reduce the parameter growth in continual adaptation and helps to learn a compact model. In contrast to the recent approaches for continually-learned GANs, the proposed approach provides a memory-efficient way to perform effective continual data generation. Through extensive experiments on challenging and diverse datasets, we show that the feature-map-transformation approach outperforms state-of-the-art methods for continually-learned GANs, with substantially fewer parameters. The proposed method generates high-quality samples that can also improve the generative-replay-based continual learning for discriminative tasks.

ICLR Conference 2021 Conference Paper

FairFil: Contrastive Neural Debiasing Method for Pretrained Text Encoders

  • Pengyu Cheng
  • Weituo Hao
  • Siyang Yuan
  • Shijing Si
  • Lawrence Carin

Pretrained text encoders, such as BERT, have been applied increasingly in various natural language processing (NLP) tasks, and have recently demonstrated significant performance gains. However, recent studies have demonstrated the existence of social bias in these pretrained NLP models. Although prior works have made progress on word-level debiasing, improving sentence-level fairness of pretrained encoders remains underexplored. In this paper, we propose the first neural debiasing method for a pretrained sentence encoder, which transforms the pretrained encoder outputs into debiased representations via a fair filter (FairFil) network. To learn the FairFil, we introduce a contrastive learning framework that not only minimizes the correlation between filtered embeddings and bias words but also preserves rich semantic information of the original sentences. On real-world datasets, our FairFil effectively reduces the bias degree of pretrained text encoders, while continuously showing desirable performance on downstream tasks. Moreover, our post hoc method does not require any retraining of the text encoders, further enlarging FairFil's application space.
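
A hypothetical sketch of the post hoc filtering idea: a small filter network on top of frozen sentence embeddings, trained with an InfoNCE term over original/bias-word-swapped sentence pairs plus a simple penalty against bias-word embeddings. The cosine-style bias penalty stands in for the paper's information-theoretic debiasing term, and the names and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FairFilter(nn.Module):
    """Hypothetical post hoc fair filter: maps frozen encoder outputs to debiased embeddings."""
    def __init__(self, dim=768, hidden=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, z):
        return self.net(z)

def fairfil_loss(filt, z, z_aug, z_bias, temperature=0.1, lam=1.0):
    """Contrastive term preserves semantics (original vs. bias-word-swapped sentence pairs);
    the second term penalizes similarity between filtered embeddings and bias-word embeddings.
    Shapes: z, z_aug (B, D) sentence embeddings; z_bias (B, D) bias-word embeddings."""
    f, f_aug = F.normalize(filt(z), dim=-1), F.normalize(filt(z_aug), dim=-1)
    logits = f @ f_aug.t() / temperature                    # (B, B) similarity matrix
    labels = torch.arange(z.size(0), device=z.device)
    nce = F.cross_entropy(logits, labels)                   # InfoNCE: match each sentence to its pair
    bias = (f * F.normalize(z_bias, dim=-1)).sum(-1).pow(2).mean()
    return nce + lam * bias
```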

AAAI Conference 2021 Conference Paper

GO Hessian for Expectation-Based Objectives

  • Yulai Cong
  • Miaoyun Zhao
  • Jianqiao Li
  • Junya Chen
  • Lawrence Carin

An unbiased low-variance gradient estimator, termed GO gradient, was proposed recently for expectation-based objectives $\mathbb{E}_{q_\gamma(y)}[f(y)]$, where the random variable (RV) $y$ may be drawn from a stochastic computation graph (SCG) with continuous (non-reparameterizable) internal nodes and continuous/discrete leaves. Based on the GO gradient, we present for $\mathbb{E}_{q_\gamma(y)}[f(y)]$ an unbiased low-variance Hessian estimator, named GO Hessian, which contains the deterministic Hessian as a special case. Considering practical implementation, we reveal that the GO Hessian in expectation obeys the chain rule and is therefore easy to use with auto-differentiation and Hessian-vector products, enabling cheap exploitation of curvature information over deep SCGs. As representative examples, we present the GO Hessian for non-reparameterizable gamma and negative binomial RVs/nodes. Leveraging the GO Hessian, we develop a new second-order method for $\mathbb{E}_{q_\gamma(y)}[f(y)]$, with challenging experiments conducted to verify its effectiveness and efficiency.

ICLR Conference 2021 Conference Paper

Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning

  • Siyang Yuan
  • Pengyu Cheng
  • Ruiyi Zhang 0002
  • Weituo Hao
  • Zhe Gan
  • Lawrence Carin

Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and pre-known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. In this paper we propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed method first encodes speaker-related style and voice content of each input voice into separate low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On real-world datasets, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness.

AAAI Conference 2021 Conference Paper

Learning Graphons via Structured Gromov-Wasserstein Barycenters

  • Hongteng Xu
  • Dixin Luo
  • Lawrence Carin
  • Hongyuan Zha

We propose a novel and principled method to learn a nonparametric graph model called graphon, which is defined in an infinite-dimensional space and represents arbitrary-size graphs. Based on the weak regularity lemma from the theory of graphons, we leverage a step function to approximate a graphon. We show that the cut distance of graphons can be relaxed to the Gromov-Wasserstein distance of their step functions. Accordingly, given a set of graphs generated by an underlying graphon, we learn the corresponding step function as the Gromov-Wasserstein barycenter of the given graphs. Furthermore, we develop several enhancements and extensions of the basic algorithm, e.g., the smoothed Gromov-Wasserstein barycenter for guaranteeing the continuity of the learned graphons and the mixed Gromov-Wasserstein barycenters for learning multiple structured graphons. The proposed approach overcomes drawbacks of prior state-of-the-art methods, and outperforms them on both synthetic and real-world data. The code is available at https://github.com/HongtengXu/SGWB-Graphon.

ICLR Conference 2021 Conference Paper

MixKD: Towards Efficient Distillation of Large-scale Language Models

  • Kevin J. Liang
  • Weituo Hao
  • Dinghan Shen
  • Yufan Zhou 0001
  • Weizhu Chen
  • Changyou Chen
  • Lawrence Carin

Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on the linear interpolation of example pairs as well. We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.
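
Below is a minimal sketch of the mixup-augmented distillation step, assuming the teacher and student both consume input embeddings directly; the embedding-level interpolation, hyperparameters, and function names are illustrative rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def mixkd_loss(student, teacher, emb_a, emb_b, labels_a, labels_b, alpha=0.4, T=2.0):
    """One MixKD-style training step on a pair of example batches (a sketch).

    emb_a, emb_b: (B, L, D) input embeddings of two batches. The student is trained to
    match the teacher's soft predictions on the linear interpolation of the pair,
    in addition to the usual supervised loss on the original examples.
    """
    lam = torch.distributions.Beta(alpha, alpha).sample()
    emb_mix = lam * emb_a + (1 - lam) * emb_b                  # mixup in embedding space
    with torch.no_grad():
        t_logits = teacher(emb_mix)                            # teacher behavior on mixed inputs
    s_logits = student(emb_mix)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1), reduction="batchmean") * T * T
    ce = lam * F.cross_entropy(student(emb_a), labels_a) + \
         (1 - lam) * F.cross_entropy(student(emb_b), labels_b)
    return ce + kd
```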

NeurIPS Conference 2021 Conference Paper

Supercharging Imbalanced Data Learning With Energy-based Contrastive Representation Transfer

  • Junya Chen
  • Zidi Xiu
  • Benjamin Goldstein
  • Ricardo Henao
  • Lawrence Carin
  • Chenyang Tao

Dealing with severe class imbalance poses a major challenge for many real-world applications, especially when the accurate classification and generalization of minority classes are of primary interest. In computer vision and NLP, learning from datasets with long-tail behavior is a recurring theme, especially for naturally occurring labels. Existing solutions mostly appeal to sampling or weighting adjustments to alleviate the extreme imbalance, or impose inductive bias to prioritize generalizable associations. Here we take a novel perspective to promote sample efficiency and model generalization based on the invariance principles of causality. Our contribution posits a meta-distributional scenario, where the causal generating mechanism for label-conditional features is invariant across different labels. Such causal assumption enables efficient knowledge transfer from the dominant classes to their under-represented counterparts, even if their feature distributions show apparent disparities. This allows us to leverage a causal data augmentation procedure to enlarge the representation of minority classes. Our development is orthogonal to the existing imbalanced data learning techniques thus can be seamlessly integrated. The proposed approach is validated on an extensive set of synthetic and real-world tasks against state-of-the-art solutions.

NeurIPS Conference 2020 Conference Paper

AutoSync: Learning to Synchronize for Data-Parallel Distributed Deep Learning

  • Hao Zhang
  • Yuan Li
  • Zhijie Deng
  • Xiaodan Liang
  • Lawrence Carin
  • Eric Xing

Synchronization is a key step in data-parallel distributed machine learning (ML). Different synchronization systems and strategies perform differently, and to achieve optimal parallel training throughput requires synchronization strategies that adapt to model structures and cluster configurations. Existing synchronization systems often only consider a single or a few synchronization aspects, and the burden of deciding the right synchronization strategy is then placed on the ML practitioners, who may lack the required expertise. In this paper, we develop a model- and resource-dependent representation for synchronization, which unifies multiple synchronization aspects ranging from architecture, message partitioning, placement scheme, to communication topology. Based on this representation, we build an end-to-end pipeline, AutoSync, to automatically optimize synchronization strategies given model structures and resource specifications, lowering the bar for data-parallel distributed ML. By learning from low-shot data collected in only 200 trial runs, AutoSync can discover synchronization strategies up to 1.6x better than manually optimized ones. We develop transfer-learning mechanisms to further reduce the auto-optimization cost -- the simulators can transfer among similar model architectures, among similar cluster configurations, or both. We also present a dataset that contains over 10000 synchronization strategies and run-time pairs on a diverse set of models and cluster specifications.

AAAI Conference 2020 Conference Paper

Bridging Maximum Likelihood and Adversarial Learning via α-Divergence

  • Miaoyun Zhao
  • Yulai Cong
  • Shuyang Dai
  • Lawrence Carin

Maximum likelihood (ML) and adversarial learning are two popular approaches for training generative models, and from many perspectives these techniques are complementary. ML learning encourages the capture of all data modes, and it is typically characterized by stable training. However, ML learning tends to distribute probability mass diffusely over the data space, e.g., yielding blurry synthetic images. Adversarial learning is well known to synthesize highly realistic natural images, despite practical challenges like mode dropping and delicate training. We propose an α-Bridge to unify the advantages of ML and adversarial learning, enabling the smooth transfer from one to the other via the α-divergence. We reveal that generalizations of the α-Bridge are closely related to approaches developed recently to regularize adversarial learning, providing insights into that prior work, and further understanding of why the α-Bridge performs well in practice.
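
For reference, one standard parameterization of the α-divergence (Amari's form) shows how it interpolates between the two KL directions, and hence, loosely, between ML-style and adversarial-style objectives; the precise form and normalization used by the α-Bridge may differ.

$$
D_\alpha(p\,\|\,q) \;=\; \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(x)^{\alpha}\, q(x)^{1-\alpha}\, dx\right),
\qquad
\lim_{\alpha \to 1} D_\alpha(p\,\|\,q) = \mathrm{KL}(p\,\|\,q),
\quad
\lim_{\alpha \to 0} D_\alpha(p\,\|\,q) = \mathrm{KL}(q\,\|\,p).
$$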

NeurIPS Conference 2020 Conference Paper

Calibrating CNNs for Lifelong Learning

  • Pravendra Singh
  • Vinay Kumar Verma
  • Pratik Mazumder
  • Lawrence Carin
  • Piyush Rai

We present an approach for lifelong/continual learning of convolutional neural networks (CNNs) that does not suffer from the problem of catastrophic forgetting when moving from one task to the other. We show that the activation maps generated by the CNN trained on the old task can be calibrated using very few calibration parameters, to become relevant to the new task. Based on this, we calibrate the activation maps produced by each network layer using spatial and channel-wise calibration modules and train only these calibration parameters for each new task in order to perform lifelong learning. Our calibration modules introduce significantly less computation and fewer parameters than approaches that dynamically expand the network. Our approach is immune to catastrophic forgetting since we store the task-adaptive calibration parameters, which contain all the task-specific knowledge and are exclusive to each task. Further, our approach does not require storing data samples from the old tasks, which is done by many replay-based methods. We perform extensive experiments on multiple benchmark datasets (SVHN, CIFAR, ImageNet, and MS-Celeb), all of which show substantial improvements over state-of-the-art methods (e.g., a 29% absolute increase in accuracy on CIFAR-100 with 10 classes at a time). On large-scale datasets, our approach yields 23.8% and 9.7% absolute increases in accuracy on the ImageNet-100 and MS-Celeb-10K datasets, respectively, by employing very few (0.51% and 0.35% of model parameters) task-adaptive calibration parameters.
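
The calibration idea can be sketched roughly as a tiny per-task module applied to a frozen layer's activation maps, combining a channel-wise (squeeze-and-excite style) recalibration with a depthwise spatial convolution; the exact module design, reduction ratio, and residual form here are assumptions.

```python
import torch
import torch.nn as nn

class CalibrationModule(nn.Module):
    """Rough sketch of per-task calibration of a frozen layer's activation maps.
    Only these small modules are trained for each new task; the backbone stays frozen."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        self.channel = nn.Sequential(                       # channel-wise calibration weights
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())
        self.spatial = nn.Conv2d(channels, channels, 3, padding=1, groups=channels)  # depthwise spatial map

    def forward(self, x):                                   # x: (B, C, H, W) activations of a frozen layer
        x = x * self.channel(x)                             # channel-wise recalibration
        return x + self.spatial(x)                          # residual spatial recalibration
```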

ICML Conference 2020 Conference Paper

CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information

  • Pengyu Cheng
  • Weituo Hao
  • Shuyang Dai
  • Jiachang Liu 0001
  • Zhe Gan
  • Lawrence Carin

Mutual information (MI) minimization has gained considerable interest in various machine learning tasks. However, estimating and minimizing MI in high-dimensional spaces remains a challenging problem, especially when only samples, rather than distribution forms, are accessible. Previous works mainly focus on MI lower bound approximation, which is not applicable to MI minimization problems. In this paper, we propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information. We provide a theoretical analysis of the properties of CLUB and its variational approximation. Based on this upper bound, we introduce an MI minimization training scheme and further accelerate it with a negative sampling strategy. Simulation studies on Gaussian distributions show the reliable estimation ability of CLUB. Real-world MI minimization experiments, including domain adaptation and information bottleneck, demonstrate the effectiveness of the proposed method. The code is at https://github.com/Linear95/CLUB.
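
A compact sketch of the sample-based CLUB estimator with a Gaussian variational approximation q(y|x) is shown below; it is a simplified reading of the method (the shuffled-pair negative corresponds to the negative-sampling variant), and architecture and hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class CLUB(nn.Module):
    """Sketch of the CLUB mutual-information upper bound with Gaussian variational q(y|x).
    Estimate: E_p(x,y)[log q(y|x)] - E_p(x)p(y)[log q(y|x)], the latter via shuffled pairs."""
    def __init__(self, x_dim, y_dim, hidden=64):
        super().__init__()
        self.mu = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

    def log_q(self, x, y):                                  # Gaussian log-density (up to a constant)
        mu, logvar = self.mu(x), self.logvar(x)
        return (-(y - mu) ** 2 / logvar.exp() - logvar).sum(-1) / 2

    def mi_upper_bound(self, x, y):
        positive = self.log_q(x, y)                         # paired (joint) samples
        negative = self.log_q(x, y[torch.randperm(y.size(0))])  # shuffled (product-of-marginals) samples
        return (positive - negative).mean()

    def learning_loss(self, x, y):
        return -self.log_q(x, y).mean()                     # fit q(y|x) by maximum likelihood
```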

AAAI Conference 2020 Conference Paper

Complementary Auxiliary Classifiers for Label-Conditional Text Generation

  • Yuan Li
  • Chunyuan Li
  • Yizhe Zhang
  • Xiujun Li
  • Guoqing Zheng
  • Lawrence Carin
  • Jianfeng Gao

Learning to generate text with a given label is a challenging task because natural language sentences are highly variable and ambiguous. It renders difficulties in trade-off between sentence quality and label fidelity. In this paper, we present CARA to alleviate the issue, where two auxiliary classifiers work simultaneously to ensure that (1) the encoder learns disentangled features and (2) the generator produces label-related sentences. Two practical techniques are further proposed to improve the performance, including annealing the learning signal from the auxiliary classifier, and enhancing the encoder with pre-trained language models. To establish a comprehensive benchmark fostering future research, we consider a suite of four datasets, and systematically reproduce three representative methods. CARA shows consistent improvement over the previous methods on the task of label-conditional text generation, and achieves state-of-the-art on the task of attribute transfer.

AAAI Conference 2020 Conference Paper

Dynamic Embedding on Textual Networks via a Gaussian Process

  • Pengyu Cheng
  • Yitong Li
  • Xinyuan Zhang
  • Liqun Chen
  • David Carlson
  • Lawrence Carin

Textual network embedding aims to learn low-dimensional representations of text-annotated nodes in a graph. Prior work in this area has typically focused on fixed graph structures; however, real-world networks are often dynamic. We address this challenge with a novel end-to-end node-embedding model, called Dynamic Embedding for Textual Networks with a Gaussian Process (DetGP). After training, DetGP can be applied efficiently to dynamic graphs without re-training or backpropagation. The learned representation of each node is a combination of textual and structural embeddings. Because the structure is allowed to be dynamic, our method uses the Gaussian process to take advantage of its non-parametric properties. To use both local and global graph structures, diffusion is used to model multiple hops between neighbors. The relative importance of global versus local structure for the embeddings is learned automatically. With the nonparametric nature of the Gaussian process, updating the embeddings for a changed graph structure requires only a forward pass through the learned model. Considering link prediction and node classification, experiments demonstrate the empirical effectiveness of our method compared to baseline approaches. We further show that DetGP can be straightforwardly and efficiently applied to dynamic textual networks.

NeurIPS Conference 2020 Conference Paper

GAN Memory with No Forgetting

  • Yulai Cong
  • Miaoyun Zhao
  • Jianqiao Li
  • Sijia Wang
  • Lawrence Carin

As a fundamental issue in lifelong learning, catastrophic forgetting is directly caused by inaccessible historical data; accordingly, if the data (information) were memorized perfectly, no forgetting should be expected. Motivated by that, we propose a GAN memory for lifelong learning, which is capable of remembering a stream of datasets via generative processes, with no forgetting. Our GAN memory is based on recognizing that one can modulate the "style" of a GAN model to form perceptually-distant targeted generation. Accordingly, we propose to do sequential style modulations atop a well-behaved base GAN model, to form sequential targeted generative models, while simultaneously benefiting from the transferred base knowledge. The GAN memory -- that is motivated by lifelong learning -- is therefore itself manifested by a form of lifelong learning, via forward transfer and modulation of information from prior tasks. Experiments demonstrate the superiority of our method over existing approaches and its effectiveness in alleviating catastrophic forgetting for lifelong classification problems. Code is available at https://github.com/MiaoyunZhao/GANmemory_LifelongLearning.
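
A minimal sketch of the per-task style-modulation idea: only a scale and shift on a frozen base-generator feature map are learned for each new dataset, so earlier tasks cannot be overwritten. The normalization-then-affine form and parameter names are assumptions, not the exact modulation used in the paper.

```python
import torch
import torch.nn as nn

class StyleModulation(nn.Module):
    """Per-task "style" modulation on top of a frozen base-GAN layer: only the
    task-specific scale/shift parameters below are trained for each new dataset."""
    def __init__(self, channels):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))   # task-specific scale
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))   # task-specific shift

    def forward(self, h):                       # h: frozen generator feature map (B, C, H, W)
        mean = h.mean(dim=(2, 3), keepdim=True)
        std = h.std(dim=(2, 3), keepdim=True) + 1e-6
        return self.gamma * (h - mean) / std + self.beta
```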

ICML Conference 2020 Conference Paper

Graph Optimal Transport for Cross-Domain Alignment

  • Liqun Chen 0001
  • Zhe Gan
  • Yu Cheng 0001
  • Linjie Li
  • Lawrence Carin
  • Jingjing Liu 0001

Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing methods mainly focus on designing advanced attention mechanisms to simulate soft alignment, where no training signals are provided to explicitly encourage alignment. Plus, the learned attention matrices are often dense and difficult to interpret. We propose Graph Optimal Transport (GOT), a principled framework that builds upon recent advances in Optimal Transport (OT). In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities as a dynamically-constructed graph. Two types of OT distances are considered: (i) Wasserstein distance (WD) for node (entity) matching; and (ii) Gromov-Wasserstein distance (GWD) for edge (structure) matching. Both WD and GWD can be incorporated into existing neural network models, effectively acting as a drop-in regularizer. The inferred transport plan also yields sparse and self-normalized alignment, enhancing the interpretability of the learned model. Experiments show consistent outperformance of GOT over baselines across a wide range of tasks, including image-text retrieval, visual question answering, image captioning, machine translation, and text summarization.
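
The node-matching (Wasserstein) term can be sketched with a few lines of entropic Sinkhorn between the two sets of entity embeddings; the resulting transport plan is the soft alignment that acts as a regularizer. This sketch omits the Gromov-Wasserstein edge term, and the entropic regularization strength and iteration count are illustrative.

```python
import torch

def sinkhorn_wd(x, y, eps=0.05, iters=50):
    """Entropic Wasserstein distance between two sets of entity embeddings (node-matching sketch).

    x: (n, d), y: (m, d); uniform marginals are assumed."""
    cost = torch.cdist(x, y, p=2) ** 2                     # (n, m) squared-Euclidean cost
    cost = cost / cost.max()                               # normalize for numerical stability
    a = torch.full((x.size(0),), 1.0 / x.size(0))
    b = torch.full((y.size(0),), 1.0 / y.size(0))
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(iters):                                 # Sinkhorn fixed-point iterations
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]                     # transport plan, usable as a soft alignment
    return (plan * cost).sum(), plan

wd, plan = sinkhorn_wd(torch.randn(5, 16), torch.randn(7, 16))  # toy usage
```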

AAAI Conference 2020 Conference Paper

Graph-Driven Generative Models for Heterogeneous Multi-Task Learning

  • Wenlin Wang
  • Hongteng Xu
  • Zhe Gan
  • Bai Li
  • Guoyin Wang
  • Liqun Chen
  • Qian Yang
  • Wenqi Wang

We propose a novel graph-driven generative model that unifies multiple heterogeneous learning tasks into the same framework. The proposed model is based on the fact that heterogeneous learning tasks, which correspond to different generative processes, often rely on data with a shared graph structure. Accordingly, our model combines a graph convolutional network (GCN) with multiple variational autoencoders, thus embedding the nodes of the graph (i.e., samples for the tasks) in a uniform manner, while specializing their organization and usage to different tasks. With a focus on healthcare applications (tasks), including clinical topic modeling, procedure recommendation and admission-type prediction, we demonstrate that our method successfully leverages information across different tasks, boosting performance in all tasks and outperforming existing state-of-the-art approaches.

ICML Conference 2020 Conference Paper

Learning Autoencoders with Relational Regularization

  • Hongteng Xu
  • Dixin Luo
  • Ricardo Henao
  • Svati Shah
  • Lawrence Carin

We propose a new algorithmic framework for learning autoencoders of data distributions. In this framework, we minimize the discrepancy between the model distribution and the target one, with relational regularization on a learnable latent prior. This regularization penalizes the fused Gromov-Wasserstein (FGW) distance between the latent prior and its corresponding posterior, which allows us to learn a structured prior distribution associated with the generative model in a flexible way. Moreover, it helps us co-train multiple autoencoders even if they have heterogeneous architectures and incomparable latent spaces. We implement the framework with two scalable algorithms, making it applicable to both probabilistic and deterministic autoencoders. Our relational regularized autoencoder (RAE) outperforms existing methods, e.g., the variational autoencoder, the Wasserstein autoencoder, and their variants, on generating images. Additionally, our relational co-training strategy of autoencoders achieves encouraging results in both synthesis and real-world multi-view learning tasks.

ICML Conference 2020 Conference Paper

On Leveraging Pretrained GANs for Generation with Limited Data

  • Miaoyun Zhao
  • Yulai Cong
  • Lawrence Carin

Recent work has shown generative adversarial networks (GANs) can generate highly realistic images, that are often indistinguishable (by humans) from real images. Most images so generated are not contained in the training dataset, suggesting potential for augmenting training sets with GAN-generated data. While this scenario is of particular relevance when there are limited data available, there is still the issue of training the GAN itself based on that limited data. To facilitate this, we leverage existing GAN models pretrained on large-scale datasets (like ImageNet) to introduce additional knowledge (which may not exist within the limited data), following the concept of transfer learning. Demonstrated by natural-image generation, we reveal that low-level filters (those close to observations) of both the generator and discriminator of pretrained GANs can be transferred to facilitate generation in a perceptually-distinct target domain with limited training data. To further adapt the transferred filters to the target domain, we propose adaptive filter modulation (AdaFM). An extensive set of experiments is presented to demonstrate the effectiveness of the proposed techniques on generation with limited data.
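
Adaptive filter modulation can be sketched roughly as below: the transferred low-level filters of a pretrained convolution stay frozen, and only a per-(output, input)-channel scale and shift are learned on the limited target data. The exact parameter granularity and initialization are assumptions.

```python
import torch
import torch.nn as nn

class AdaFMConv2d(nn.Module):
    """Sketch of adaptive filter modulation: frozen pretrained filters W are modulated
    as gamma * W + beta, so only a small number of parameters are trained."""
    def __init__(self, pretrained_conv: nn.Conv2d):
        super().__init__()
        self.conv = pretrained_conv
        for p in self.conv.parameters():
            p.requires_grad_(False)                         # keep transferred low-level filters fixed
        out_c, in_c = self.conv.weight.shape[:2]
        self.gamma = nn.Parameter(torch.ones(out_c, in_c, 1, 1))
        self.beta = nn.Parameter(torch.zeros(out_c, in_c, 1, 1))

    def forward(self, x):                                   # assumes groups=1 and default dilation
        w = self.gamma * self.conv.weight + self.beta       # modulated filters
        return nn.functional.conv2d(x, w, self.conv.bias,
                                    self.conv.stride, self.conv.padding)
```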

NeurIPS Conference 2020 Conference Paper

Perturbing Across the Feature Hierarchy to Improve Standard and Strict Blackbox Attack Transferability

  • Nathan Inkawhich
  • Kevin Liang
  • Binghui Wang
  • Matthew Inkawhich
  • Lawrence Carin
  • Yiran Chen

We consider the blackbox transfer-based targeted adversarial attack threat model in the realm of deep neural network (DNN) image classifiers. Rather than focusing on crossing decision boundaries at the output layer of the source model, our method perturbs representations throughout the extracted feature hierarchy to resemble other classes. We design a flexible attack framework that allows for multi-layer perturbations and demonstrates state-of-the-art targeted transfer performance between ImageNet DNNs. We also show the superiority of our feature space methods under a relaxation of the common assumption that the source and target models are trained on the same dataset and label space, in some instances achieving a $10\times$ increase in targeted success rate relative to other blackbox transfer methods. Finally, we analyze why the proposed methods outperform existing attack strategies and show an extension of the method in the case when limited queries to the blackbox model are allowed.
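
Roughly, the feature-space targeted objective can be sketched as follows: perturb the input so that intermediate features from the white-box source model move toward those of a target-class image, under an L_inf budget. Here `feature_extractor` is an assumed helper returning a dict of layer activations, and the single squared-error objective simplifies the paper's multi-layer formulation.

```python
import torch

def feature_space_targeted_attack(x, x_target, feature_extractor, layers,
                                  eps=8 / 255, steps=10, lr=2 / 255):
    """Sketch of a multi-layer feature-space targeted transfer attack (assumed names/budget)."""
    delta = torch.zeros_like(x, requires_grad=True)
    with torch.no_grad():
        tgt = {k: v.detach() for k, v in feature_extractor(x_target).items()}
    for _ in range(steps):
        feats = feature_extractor(x + delta)
        loss = sum((feats[k] - tgt[k]).pow(2).mean() for k in layers)
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()                 # move features toward the target class
            delta.clamp_(-eps, eps)                         # stay inside the perturbation budget
            delta.grad.zero_()
    return (x + delta).clamp(0, 1).detach()
```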

ICLR Conference 2020 Conference Paper

RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering

  • Sam Lobel
  • Chunyuan Li
  • Jianfeng Gao 0001
  • Lawrence Carin

We investigate new methods for training collaborative filtering models based on actor-critic reinforcement learning, to more directly maximize ranking-based objective functions. Specifically, we train a critic network to approximate ranking-based metrics, and then update the actor network to directly optimize against the learned metrics. In contrast to traditional learning-to-rank methods that require re-running the optimization procedure for new lists, our critic-based method amortizes the scoring process with a neural network, and can directly provide the (approximate) ranking scores for new lists. We demonstrate the actor-critic's ability to significantly improve the performance of a variety of prediction models, and achieve better or comparable performance to a variety of strong baselines on three large-scale datasets.

NeurIPS Conference 2020 Conference Paper

Reconsidering Generative Objectives For Counterfactual Reasoning

  • Danni Lu
  • Chenyang Tao
  • Junya Chen
  • Fan Li
  • Feng Guo
  • Lawrence Carin

There has been recent interest in exploring generative goals for counterfactual reasoning, such as individualized treatment effect (ITE) estimation. However, existing solutions often fail to address issues that are unique to causal inference, such as covariate balancing and (infeasible) counterfactual validation. As a step towards more flexible, scalable and accurate ITE estimation, we present a novel generative Bayesian estimation framework that integrates representation learning, adversarial matching and causal estimation. By appealing to the Robinson decomposition, we derive a reformulated variational bound that explicitly targets the causal effect estimation rather than specific predictive goals. Our procedure acknowledges the uncertainties in representation and solves a Fenchel mini-max game to resolve the representation imbalance for better counterfactual generalization, justified by new theory. Further, the latent variable formulation employed enables robustness to unobservable latent confounders, extending the scope of its applicability. The utility of the proposed solution is demonstrated via an extensive set of tests against competing solutions, under various simulation setups and on real-world datasets, with encouraging results reported.

AAAI Conference 2020 Conference Paper

Sequence Generation with Optimal-Transport-Enhanced Reinforcement Learning

  • Liqun Chen
  • Ke Bai
  • Chenyang Tao
  • Yizhe Zhang
  • Guoyin Wang
  • Wenlin Wang
  • Ricardo Henao
  • Lawrence Carin

Reinforcement learning (RL) has been widely used to aid training in language generation. This is achieved by enhancing standard maximum likelihood objectives with user-specified reward functions that encourage global semantic consistency. We propose a principled approach to address the difficulties associated with RL-based solutions, namely, high-variance gradients, uninformative rewards and brittle training. By leveraging the optimal transport distance, we introduce a regularizer that significantly alleviates the above issues. Our formulation emphasizes the preservation of semantic features, enabling end-to-end training instead of ad-hoc fine-tuning, and when combined with RL, it controls the exploration space for more efficient model updates. To validate the effectiveness of the proposed solution, we perform a comprehensive evaluation covering a wide variety of NLP tasks: machine translation, abstractive text summarization and image captioning, with consistent improvements over competing solutions.

ICLR Conference 2020 Conference Paper

Transferable Perturbations of Deep Feature Distributions

  • Nathan Inkawhich
  • Kevin J. Liang
  • Lawrence Carin
  • Yiran Chen 0001

Almost all current adversarial attacks of CNN classifiers rely on information derived from the output layer of the network. This work presents a new adversarial attack based on the modeling and exploitation of class-wise and layer-wise deep feature distributions. We achieve state-of-the-art targeted blackbox transfer-based attack results for undefended ImageNet models. Further, we place a priority on explainability and interpretability of the attacking process. Our methodology affords an analysis of how adversarial attacks change the intermediate feature distributions of CNNs, as well as a measure of layer-wise and class-wise feature distributional separability/entanglement. We also conceptualize a transition from task/data-specific to model-specific features within a CNN architecture that directly impacts the transferability of adversarial examples.

NeurIPS Conference 2019 Conference Paper

Certified Adversarial Robustness with Additive Noise

  • Bai Li
  • Changyou Chen
  • Wenlin Wang
  • Lawrence Carin

The existence of adversarial data examples has drawn significant attention in the deep-learning community; such data are seemingly minimally perturbed relative to the original data, but lead to very different outputs from a deep-learning algorithm. Although a significant body of work on developing defense models has been developed, most such models are heuristic and are often vulnerable to adaptive attacks. Defensive methods that provide theoretical robustness guarantees have been studied intensively, yet most fail to obtain non-trivial robustness when a large-scale model and data are present. To address these limitations, we introduce a framework that is scalable and provides certified bounds on the norm of the input manipulation for constructing adversarial examples. We establish a connection between robustness against adversarial perturbation and additive random noise, and propose a training strategy that can significantly improve the certified bounds. Our evaluation on MNIST, CIFAR-10 and ImageNet suggests that our method is scalable to complicated models and large data sets, while providing competitive robustness to state-of-the-art provable defense methods.
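
The prediction side of the additive-noise (randomized smoothing) connection can be sketched in a few lines: classify many Gaussian-noised copies of the input and take the majority vote; the certified radius is then a function of sigma and the estimated class-probability margin. The statistical certification procedure itself is omitted, and sigma, n, and batch size are illustrative.

```python
import torch

def smoothed_predict(model, x, sigma=0.25, n=1000, num_classes=10):
    """Majority-vote prediction of the Gaussian-smoothed classifier (sketch).

    x: a single image tensor of shape (C, H, W); model maps a batch to logits."""
    counts = torch.zeros(num_classes)
    with torch.no_grad():
        for _ in range(n // 100):
            batch = x.unsqueeze(0).repeat(100, 1, 1, 1)
            noisy = batch + sigma * torch.randn_like(batch)         # additive Gaussian noise
            preds = model(noisy).argmax(dim=1)
            counts += torch.bincount(preds, minlength=num_classes).float()
    return counts.argmax().item()                                    # majority class
```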

AAAI Conference 2019 Conference Paper

Communication-Efficient Stochastic Gradient MCMC for Neural Networks

  • Chunyuan Li
  • Changyou Chen
  • Yunchen Pu
  • Ricardo Henao
  • Lawrence Carin

Learning probability distributions on the weights of neural networks has recently proven beneficial in many applications. Bayesian methods such as Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) offer an elegant framework to reason about model uncertainty in neural networks. However, these advantages usually come with a high computational cost. We propose accelerating SG-MCMC under the master-worker framework: workers asynchronously and in parallel share responsibility for gradient computations, while the master collects the final samples. To reduce communication overhead, two protocols (downpour and elastic) are developed to allow periodic interaction between the master and workers. We provide a theoretical analysis on the finite-time estimation consistency of posterior expectations, and establish connections to sample thinning. Our experiments on various neural networks demonstrate that the proposed algorithms can greatly reduce training time while achieving comparable (or better) test accuracy/log-likelihood levels, relative to traditional SG-MCMC. When applied to reinforcement learning, it naturally provides exploration for asynchronous policy optimization, with encouraging performance improvement.

ICML Conference 2019 Conference Paper

Gromov-Wasserstein Learning for Graph Matching and Node Embedding

  • Hongteng Xu
  • Dixin Luo
  • Hongyuan Zha
  • Lawrence Carin

A novel Gromov-Wasserstein learning framework is proposed to jointly match (align) graphs and learn embedding vectors for the associated graph nodes. Using Gromov-Wasserstein discrepancy, we measure the dissimilarity between two graphs and find their correspondence, according to the learned optimal transport. The node embeddings associated with the two graphs are learned under the guidance of the optimal transport, the distance of which not only reflects the topological structure of each graph but also yields the correspondence across the graphs. These two learning steps are mutually-beneficial, and are unified here by minimizing the Gromov-Wasserstein discrepancy with structural regularizers. This framework leads to an optimization problem that is solved by a proximal point method. We apply the proposed method to matching problems in real-world networks, and demonstrate its superior performance compared to alternative approaches.
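
A tiny end-to-end illustration of the matching objective is given below, using the POT library's generic Gromov-Wasserstein solver rather than the authors' proximal point method (the tooling choice is an assumption, not their code).

```python
import numpy as np
import ot  # POT (Python Optimal Transport); assumed to be installed

# Two small graphs represented only by intra-graph distance (or adjacency-derived) matrices.
rng = np.random.default_rng(0)
C1 = rng.random((8, 8))
C1 = (C1 + C1.T) / 2
np.fill_diagonal(C1, 0)
perm = rng.permutation(8)
C2 = C1[np.ix_(perm, perm)]                      # the same graph with permuted nodes

p = ot.unif(8)                                   # uniform node distributions
q = ot.unif(8)
T = ot.gromov.gromov_wasserstein(C1, C2, p, q, 'square_loss')  # optimal transport coupling
print(T.argmax(axis=1))                          # matched node in graph 2 for each node of graph 1
print(np.argsort(perm))                          # ground-truth correspondence (ideally equal)
```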

NeurIPS Conference 2019 Conference Paper

Improving Textual Network Learning with Variational Homophilic Embeddings

  • Wenlin Wang
  • Chenyang Tao
  • Zhe Gan
  • Guoyin Wang
  • Liqun Chen
  • Xinyuan Zhang
  • Ruiyi Zhang
  • Qian Yang

The performance of many network learning applications crucially hinges on the success of network embedding algorithms, which aim to encode rich network information into low-dimensional vertex-based vector representations. This paper considers a novel variational formulation of network embeddings, with special focus on textual networks. Different from most existing methods that optimize a discriminative objective, we introduce Variational Homophilic Embedding (VHE), a fully generative model that learns network embeddings by modeling the semantic (textual) information with a variational autoencoder, while accounting for the structural (topology) information through a novel homophilic prior design. Homophilic vertex embeddings encourage similar embedding vectors for related (connected) vertices. The VHE encourages better generalization for downstream tasks, robustness to incomplete observations, and the ability to generalize to unseen vertices. Extensive experiments on real-world networks, for multiple tasks, demonstrate that the proposed method achieves consistently superior performance relative to competing state-of-the-art approaches.

NeurIPS Conference 2019 Conference Paper

Kernel-Based Approaches for Sequence Modeling: Connections to Neural Methods

  • Kevin Liang
  • Guoyin Wang
  • Yitong Li
  • Ricardo Henao
  • Lawrence Carin

We investigate time-dependent data analysis from the perspective of recurrent kernel machines, from which models with hidden units and gated memory cells arise naturally. By considering dynamic gating of the memory cell, a model closely related to the long short-term memory (LSTM) recurrent neural network is derived. Extending this setup to $n$-gram filters, the convolutional neural network (CNN), Gated CNN, and recurrent additive network (RAN) are also recovered as special cases. Our analysis provides a new perspective on the LSTM, while also extending it to $n$-gram convolutional filters. Experiments are performed on natural language processing tasks and on analysis of local field potentials (neuroscience). We demonstrate that the variants we derive from kernels perform on par or even better than traditional neural methods. For the neuroscience application, the new models demonstrate significant improvements relative to the prior state of the art.

NeurIPS Conference 2019 Conference Paper

On Fenchel Mini-Max Learning

  • Chenyang Tao
  • Liqun Chen
  • Shuyang Dai
  • Junya Chen
  • Ke Bai
  • Dong Wang
  • Jianfeng Feng
  • Wenlian Lu

Inference, estimation, sampling and likelihood evaluation are four primary goals of probabilistic modeling. Practical considerations often force modeling approaches to make compromises between these objectives. We present a novel probabilistic learning framework, called Fenchel Mini-Max Learning (FML), that accommodates all four desiderata in a flexible and scalable manner. Our derivation is rooted in classical maximum likelihood estimation, and it overcomes a longstanding challenge that prevents unbiased estimation of unnormalized statistical models. By reformulating MLE as a mini-max game, FML enjoys an unbiased training objective that (i) does not explicitly involve the intractable normalizing constant and (ii) is directly amenable to stochastic gradient descent optimization. To demonstrate the utility of the proposed approach, we consider learning unnormalized statistical models, nonparametric density estimation and training generative models, with encouraging empirical results presented.

NeurIPS Conference 2019 Conference Paper

Ouroboros: On Accelerating Training of Transformer-Based Language Models

  • Qian Yang
  • Zhouyuan Huo
  • Wenlin Wang
  • Lawrence Carin

Language models are essential for natural language processing (NLP) tasks, such as machine translation and text summarization. Remarkable performance has been demonstrated recently across many NLP domains via a Transformer-based language model with over a billion parameters, verifying the benefits of model size. Model parallelism is required if a model is too large to fit in a single computing device. Current methods for model parallelism either suffer from backward locking in backpropagation or are not applicable to language models. We propose the first model-parallel algorithm that speeds the training of Transformer-based language models. We also prove that our proposed algorithm is guaranteed to converge to critical points for non-convex problems. Extensive experiments on Transformer and Transformer-XL language models demonstrate that the proposed algorithm obtains a much faster speedup beyond data parallelism, with comparable or better accuracy. Code to reproduce experiments is to be found at https://github.com/LaraQianYang/Ouroboros.

ICML Conference 2019 Conference Paper

Revisiting the Softmax Bellman Operator: New Benefits and New Perspective

  • Zhao Song 0001
  • Ronald Parr
  • Lawrence Carin

The impact of softmax on the value function itself in reinforcement learning (RL) is often viewed as problematic because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the Bellman operator. Surprisingly, despite these concerns, and independent of its effect on exploration, the softmax Bellman operator, when combined with Deep Q-learning, leads to Q-functions with superior policies in practice, even outperforming its double Q-learning counterpart. To better understand how and why this occurs, we revisit theoretical properties of the softmax Bellman operator, and prove that (i) it converges to the standard Bellman operator exponentially fast in the inverse temperature parameter, and (ii) the distance of its Q function from the optimal one can be bounded. These alone do not explain its superior performance, so we also show that the softmax operator can reduce the overestimation error, which may give some insight into why a sub-optimal operator leads to better performance in the presence of value function approximation. A comparison among different Bellman operators is then presented, showing the trade-offs when selecting them.
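
A toy sketch of the operator comparison: the softmax backup replaces the hard max over next-state actions with a softmax-weighted average, and approaches the standard Bellman backup as the inverse temperature grows. The tabular setting and shapes below are purely illustrative.

```python
import numpy as np

def softmax_bellman_backup(Q, rewards, P, gamma=0.99, tau=5.0):
    """One softmax Bellman backup: the next-state value is the softmax-weighted average
    of Q rather than the max. As tau (inverse temperature) grows, this converges to the
    standard Bellman operator.

    Q: (S, A) values, rewards: (S, A), P: (S, A, S) transition probabilities."""
    w = np.exp(tau * Q - tau * Q.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                       # softmax over actions per state
    v_soft = (w * Q).sum(axis=1)                            # (S,) softmax-weighted state values
    return rewards + gamma * P @ v_soft                     # expected discounted next value

# toy comparison against the hard-max backup
S, A = 4, 3
rng = np.random.default_rng(1)
Q = rng.normal(size=(S, A))
R = rng.random((S, A))
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
hard = R + 0.99 * P @ Q.max(axis=1)
print(np.abs(softmax_bellman_backup(Q, R, P, tau=50.0) - hard).max())  # small for large tau
```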

NeurIPS Conference 2019 Conference Paper

Scalable Gromov-Wasserstein Learning for Graph Partitioning and Matching

  • Hongteng Xu
  • Dixin Luo
  • Lawrence Carin

We propose a scalable Gromov-Wasserstein learning (S-GWL) method and establish a novel and theoretically-supported paradigm for large-scale graph analysis. The proposed method is based on the fact that Gromov-Wasserstein discrepancy is a pseudometric on graphs. Given two graphs, the optimal transport associated with their Gromov-Wasserstein discrepancy provides the correspondence between their nodes and achieves graph matching. When one of the graphs is a predefined graph with isolated but self-connected nodes (i.e., a disconnected graph), the optimal transport indicates the clustering structure of the other graph and achieves graph partitioning. Further, we extend our method to multi-graph partitioning and matching by learning a Gromov-Wasserstein barycenter graph for multiple observed graphs. Our method combines a recursive $K$-partition mechanism with a warm-start proximal gradient algorithm, whose time complexity is $\mathcal{O}(K(E+V)\log_K V)$ for graphs with $V$ nodes and $E$ edges. To our knowledge, our method is the first attempt to make Gromov-Wasserstein discrepancy applicable to large-scale graph analysis and unify graph partitioning and matching into the same framework. It outperforms state-of-the-art graph partitioning and matching methods, achieving a trade-off between accuracy and efficiency.

ICML Conference 2019 Conference Paper

Stochastic Blockmodels meet Graph Neural Networks

  • Nikhil Mehta 0002
  • Lawrence Carin
  • Piyush Rai

Stochastic blockmodels (SBM) and their variants, e.g., mixed-membership and overlapping stochastic blockmodels, are latent variable based generative models for graphs. They have proven to be successful for various tasks, such as discovering the community structure and link prediction on graph-structured data. Recently, graph neural networks, e.g., graph convolutional networks, have also emerged as a promising approach to learn powerful representations (embeddings) for the nodes in the graph, by exploiting graph properties such as locality and invariance. In this work, we unify these two directions by developing a sparse variational autoencoder for graphs, that retains the interpretability of SBMs, while also enjoying the excellent predictive performance of graph neural nets. Moreover, our framework is accompanied by a fast recognition model that enables fast inference of the node embeddings (which are of independent interest for inference in SBM and its variants). Although we develop this framework for a particular type of SBM, namely the overlapping stochastic blockmodel, the proposed framework can be adapted readily for other types of SBMs. Experimental results on several benchmarks demonstrate encouraging results on link prediction while learning an interpretable latent structure that can be used for community discovery.

ICML Conference 2019 Conference Paper

Variational Annealing of GANs: A Langevin Perspective

  • Chenyang Tao
  • Shuyang Dai
  • Liqun Chen 0001
  • Ke Bai 0001
  • Junya Chen
  • Chang Liu 0030
  • Ruiyi Zhang 0002
  • Georgiy V. Bobashev

The generative adversarial network (GAN) has received considerable attention recently as a model for data synthesis, without an explicit specification of a likelihood function. There has been commensurate interest in leveraging likelihood estimates to improve GAN training. To enrich the understanding of this fast-growing yet almost exclusively heuristic-driven subject, we elucidate the theoretical roots of some of the empirical attempts to stabilize and improve GAN training with the introduction of likelihoods. We highlight new insights from variational theory of diffusion processes to derive a likelihood-based regularizing scheme for GAN training, and present a novel approach to train GANs with an unnormalized distribution instead of empirical samples. To substantiate our claims, we provide experimental evidence on how our theoretically-inspired new algorithms improve upon current practice.

AAAI Conference 2018 Conference Paper

Adaptive Feature Abstraction for Translating Video to Text

  • Yunchen Pu
  • Martin Min
  • Zhe Gan
  • Lawrence Carin

Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features. However, the variable context-dependent semantics in the video may make it more appropriate to adaptively select features from the multiple CNN layers. We propose a new approach to generating adaptive spatiotemporal representations of videos for the captioning task. A novel attention mechanism is developed, which adaptively and sequentially focuses on different layers of CNN features (levels of feature “abstraction”), as well as local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizing the results and how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.

NeurIPS Conference 2018 Conference Paper

Adversarial Text Generation via Feature-Mover's Distance

  • Liqun Chen
  • Shuyang Dai
  • Chenyang Tao
  • Haichao Zhang
  • Zhe Gan
  • Dinghan Shen
  • Yizhe Zhang
  • Guoyin Wang

Generative adversarial networks (GANs) have achieved significant success in generating real-valued data. However, the discrete nature of text hinders the application of GAN to text-generation tasks. Instead of using the standard GAN objective, we propose to improve text-generation GAN via a novel approach inspired by optimal transport. Specifically, we consider matching the latent feature distributions of real and synthetic sentences using a novel metric, termed the feature-mover's distance (FMD). This formulation leads to a highly discriminative critic and easy-to-optimize objective, overcoming the mode-collapsing and brittle-training problems in existing methods. Extensive experiments are conducted on a variety of tasks to evaluate the proposed model empirically, including unconditional text generation, style transfer from non-parallel text, and unsupervised cipher cracking. The proposed model yields superior performance, demonstrating wide applicability and effectiveness.

ICML Conference 2018 Conference Paper

Adversarial Time-to-Event Modeling

  • Paidamoyo Chapfuwa
  • Chenyang Tao
  • Chunyuan Li
  • Courtney Page
  • Benjamin Goldstein 0001
  • Lawrence Carin
  • Ricardo Henao

Modern health data science applications leverage abundant molecular and electronic health data, providing opportunities for machine learning to build statistical models to support clinical practice. Time-to-event analysis, also called survival analysis, stands as one of the most representative examples of such statistical models. We present a deep-network-based approach that leverages adversarial learning to address a key challenge in modern time-to-event modeling: nonparametric estimation of event-time distributions. We also introduce a principled cost function to exploit information from censored events (events that occur subsequent to the observation window). Unlike most time-to-event models, we focus on the estimation of time-to-event distributions, rather than time ordering. We validate our model on both benchmark and real datasets, demonstrating that the proposed formulation yields significant performance gains relative to a parametric alternative, which we also propose.

ICML Conference 2018 Conference Paper

Chi-square Generative Adversarial Network

  • Chenyang Tao
  • Liqun Chen 0001
  • Ricardo Henao
  • Jianfeng Feng
  • Lawrence Carin

To assess the difference between real and synthetic data, Generative Adversarial Networks (GANs) are trained using a distribution discrepancy measure. Three widely employed measures are information-theoretic divergences, integral probability metrics, and Hilbert space discrepancy metrics. We elucidate the theoretical connections between these three popular GAN training criteria and propose a novel procedure, called $\chi^2$ (Chi-square) GAN, that is conceptually simple, stable at training and resistant to mode collapse. Our procedure naturally generalizes to address the problem of simultaneous matching of multiple distributions. Further, we propose a resampling strategy that significantly improves sample quality, by repurposing the trained critic function via an importance weighting mechanism. Experiments show that the proposed procedure improves stability and convergence, and yields state-of-the-art results on a wide range of generative modeling tasks.

ICML Conference 2018 Conference Paper

Continuous-Time Flows for Efficient Inference and Density Estimation

  • Changyou Chen
  • Chunyuan Li
  • Liquan Chen
  • Wenlin Wang
  • Yunchen Pu
  • Lawrence Carin

Two fundamental problems in unsupervised learning are efficient inference for latent-variable models and robust density estimation based on large amounts of unlabeled data. Algorithms for the two tasks, such as normalizing flows and generative adversarial networks (GANs), are often developed independently. In this paper, we propose the concept of continuous-time flows (CTFs), a family of diffusion-based methods that are able to asymptotically approach a target distribution. Distinct from normalizing flows and GANs, CTFs can be adopted to achieve the above two goals in one framework, with theoretical guarantees. Our framework includes distilling knowledge from a CTF for efficient inference, and learning an explicit energy-based distribution with CTFs for density estimation. Both tasks rely on a new technique for distribution matching within amortized learning. Experiments on various tasks demonstrate promising performance of the proposed CTF framework, compared to related techniques.
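
As a minimal illustration of the diffusion-based ingredient (a single Euler-discretized Langevin step toward a target distribution, not the paper's full CTF framework), assuming a generic target with a known score function:

    import numpy as np

    def langevin_step(x, grad_log_p, eps, rng):
        """One Euler step of the Langevin diffusion dX = 0.5 * grad log p(X) dt + dW."""
        return x + 0.5 * eps * grad_log_p(x) + np.sqrt(eps) * rng.standard_normal(x.shape)

    # toy target: standard Gaussian, so grad log p(x) = -x
    rng = np.random.default_rng(0)
    x = np.array([5.0])
    for _ in range(2000):
        x = langevin_step(x, lambda v: -v, eps=1e-2, rng=rng)
    # after many steps, x is approximately a draw from N(0, 1)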

AAAI Conference 2018 Conference Paper

Deconvolutional Latent-Variable Model for Text Sequence Matching

  • Dinghan Shen
  • Yizhe Zhang
  • Ricardo Henao
  • Qinliang Su
  • Lawrence Carin

A latent-variable model is introduced for text matching, inferring sentence representations by jointly optimizing generative and discriminative objectives. To alleviate typical optimization challenges in latent-variable models for text, we employ deconvolutional networks as the sequence decoder (generator), providing learned latent codes with more semantic information and better generalization. Our model, trained in an unsupervised manner, yields stronger empirical predictive performance than a decoder based on Long Short-Term Memory (LSTM), with fewer parameters and considerably faster training. Further, we apply it to text sequence-matching problems. The proposed model significantly outperforms several strong sentence-encoding baselines, especially in the semi-supervised setting.

NeurIPS Conference 2018 Conference Paper

Diffusion Maps for Textual Network Embedding

  • Xinyuan Zhang
  • Yitong Li
  • Dinghan Shen
  • Lawrence Carin

Textual network embedding leverages rich text information associated with the network to learn low-dimensional vectorial representations of vertices. Rather than using typical natural language processing (NLP) approaches, recent research exploits the relationship of texts on the same edge to graphically embed text. However, these models neglect to measure the complete level of connectivity between any two texts in the graph. We present diffusion maps for textual network embedding (DMTE), integrating global structural information of the graph to capture the semantic relatedness between texts, with a diffusion-convolution operation applied on the text inputs. In addition, a new objective function is designed to efficiently preserve the high-order proximity using the graph diffusion. Experimental results show that the proposed approach outperforms state-of-the-art methods on the vertex-classification and link-prediction tasks.
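
As a rough sketch of the graph-diffusion ingredient (a discounted sum of random-walk transition powers that captures multi-hop connectivity, not the paper's exact operator), assuming an unweighted adjacency matrix:

    import numpy as np

    def graph_diffusion(A, theta=0.5, K=4):
        """Weighted sum of random-walk transition powers, capturing multi-hop proximity."""
        deg = A.sum(axis=1, keepdims=True)
        P = A / np.maximum(deg, 1e-12)        # row-stochastic transition matrix
        M = np.zeros_like(P)
        Pk = np.eye(A.shape[0])
        for k in range(K + 1):
            M += (theta ** k) * Pk
            Pk = Pk @ P
        return M

    # toy 4-node path graph
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)
    M = graph_diffusion(A)  # M[i, j] grows with the discounted number of short walks i -> j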

NeurIPS Conference 2018 Conference Paper

Distilled Wasserstein Learning for Word Embedding and Topic Modeling

  • Hongteng Xu
  • Wenlin Wang
  • Wei Liu
  • Lawrence Carin

We propose a novel Wasserstein method with a distillation mechanism, yielding joint learning of word embeddings and topics. The proposed method is based on the fact that the Euclidean distance between word embeddings may be employed as the underlying distance in the Wasserstein topic model. The word distributions of topics, their optimal transport to the word distributions of documents, and the embeddings of words are learned in a unified framework. When learning the topic model, we leverage a distilled ground-distance matrix to update the topic distributions and smoothly calculate the corresponding optimal transports. Such a strategy provides the updating of word embeddings with robust guidance, improving algorithm convergence. As an application, we focus on patient admission records, in which the proposed method embeds the codes of diseases and procedures and learns the topics of admissions, obtaining superior performance on clinically-meaningful disease network construction, mortality prediction as a function of admission codes, and procedure recommendation.
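
A minimal sketch of the underlying building block, entropic-regularized optimal transport between two word distributions with a Euclidean ground distance computed from word embeddings; this is not the paper's distillation scheme, and all names below are illustrative:

    import numpy as np

    def sinkhorn(mu, nu, C, reg=0.1, n_iter=200):
        """Entropic-regularized optimal transport (Sinkhorn iterations).

        mu, nu: histograms (e.g., word distributions of a topic and a document)
        C:      ground-cost matrix, e.g. Euclidean distances between word embeddings
        """
        K = np.exp(-C / reg)
        u = np.ones_like(mu)
        for _ in range(n_iter):
            v = nu / (K.T @ u)
            u = mu / (K @ v)
        return u[:, None] * K * v[None, :]   # transport plan

    # toy usage: 4-word vocabulary with 2-d embeddings
    E = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    C = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)  # embedding ground distance
    topic = np.array([0.4, 0.4, 0.1, 0.1])
    doc = np.array([0.1, 0.2, 0.3, 0.4])
    T = sinkhorn(topic, doc, C)
    transport_cost = float(np.sum(T * C))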

ICML Conference 2018 Conference Paper

JointGAN: Multi-Domain Joint Distribution Learning with Generative Adversarial Nets

  • Yunchen Pu
  • Shuyang Dai
  • Zhe Gan
  • Weiyao Wang 0002
  • Guoyin Wang 0002
  • Yizhe Zhang 0002
  • Ricardo Henao
  • Lawrence Carin

A new generative adversarial network is developed for joint distribution matching. Distinct from most existing approaches, which only learn conditional distributions, the proposed model aims to learn a joint distribution of multiple random variables (domains). This is achieved by learning to sample from conditional distributions between the domains, while simultaneously learning to sample from the marginals of each individual domain. The proposed framework consists of multiple generators and a single softmax-based critic, all jointly trained via adversarial learning. From a simple noise source, the proposed framework allows synthesis of draws from the marginals, conditional draws given observations from a subset of random variables, or complete draws from the full joint distribution. Most examples considered are for joint analysis of two domains, with examples for three domains also presented.

ICML Conference 2018 Conference Paper

Learning Registered Point Processes from Idiosyncratic Observations

  • Hongteng Xu
  • Lawrence Carin
  • Hongyuan Zha

A parametric point process model is developed, with modeling based on the assumption that sequential observations often share latent phenomena, while also possessing idiosyncratic effects. An alternating optimization method is proposed to learn a “registered” point process that accounts for shared structure, as well as “warping” functions that characterize idiosyncratic aspects of each observed sequence. Under reasonable constraints, in each iteration we update the sample-specific warping functions by solving a set of constrained nonlinear programming problems in parallel, and update the model by maximum likelihood estimation. The justifiability, complexity and robustness of the proposed method are investigated in detail, and the influence of sequence stitching on the learning results is examined empirically. Experiments on both synthetic and real-world data demonstrate that the method yields explainable point process models, achieving encouraging results compared to state-of-the-art methods.

IJCAI Conference 2018 Conference Paper

Online Continuous-Time Tensor Factorization Based on Pairwise Interactive Point Processes

  • Hongteng Xu
  • Dixin Luo
  • Lawrence Carin

A continuous-time tensor factorization method is developed for event sequences containing multiple "modalities." Each data element is a point in a tensor, whose dimensions are associated with the discrete alphabet of the modalities. Each tensor data element has an associated time of occurrence and a feature vector. We model such data based on pairwise interactive point processes, and the proposed framework connects pairwise tensor factorization with a feature-embedded point process. The model accounts for interactions within each modality, interactions across different modalities, and continuous-time dynamics of the interactions. Model learning is formulated as a convex optimization problem, based on online alternating direction method of multipliers. Compared to existing state-of-the-art methods, our approach captures the latent structure of the tensor and its evolution over time, obtaining superior results on real-world datasets.

ICML Conference 2018 Conference Paper

Policy Optimization as Wasserstein Gradient Flows

  • Ruiyi Zhang 0002
  • Changyou Chen
  • Chunyuan Li
  • Lawrence Carin

Policy optimization is a core component of reinforcement learning (RL), and most existing RL methods directly optimize parameters of a policy based on maximizing the expected total reward, or its surrogate. Though often achieving encouraging empirical success, its correspondence to policy-distribution optimization has been unclear mathematically. We place policy optimization into the space of probability measures, and interpret it as Wasserstein gradient flows. On the probability-measure space, under specified circumstances, policy optimization becomes convex in terms of distribution optimization. To make optimization feasible, we develop efficient algorithms by numerically solving the corresponding discrete gradient flows. Our technique is applicable to several RL settings, and is related to many state-of-the-art policy-optimization algorithms. Specifically, we define gradient flows on both the parameter-distribution space and policy-distribution space, leading to what we term indirect-policy and direct-policy learning frameworks, respectively. Extensive experiments verify the effectiveness of our framework, often obtaining better performance compared to related algorithms.

ICML Conference 2018 Conference Paper

Variational Inference and Model Selection with Generalized Evidence Bounds

  • Liqun Chen 0001
  • Chenyang Tao
  • Ruiyi Zhang 0002
  • Ricardo Henao
  • Lawrence Carin

Recent advances on the scalability and flexibility of variational inference have made it successful at unravelling hidden patterns in complex data. In this work we propose a new variational bound formulation, yielding an estimator that extends beyond the conventional variational bound. It naturally subsumes the importance-weighted and Renyi bounds as special cases, and it is provably sharper than these counterparts. We also present an improved estimator for variational learning, and advocate a novel high signal-to-variance ratio update rule for the variational parameters. We discuss model-selection issues associated with existing evidence-lower-bound-based variational inference procedures, and show how to leverage the flexibility of our new formulation to address them. Empirical evidence is provided to validate our claims.

AAAI Conference 2018 Conference Paper

Video Generation From Text

  • Yitong Li
  • Martin Min
  • Dinghan Shen
  • David Carlson
  • Lawrence Carin

Generating videos from text has proven to be a significant challenge for existing generative models. We tackle this problem by training a conditional generative model to extract both static and dynamic information from text. This is manifested in a hybrid framework, employing a Variational Autoencoder (VAE) and a Generative Adversarial Network (GAN). The static features, called “gist,” are used to sketch text-conditioned background color and object layout structure. Dynamic features are considered by transforming input text into an image filter. To obtain a large amount of data for training the deep-learning model, we develop a method to automatically create a matched text-video corpus from publicly available online videos. Experimental results show that the proposed framework generates plausible and diverse short-duration smooth videos, while accurately reflecting the input text information. It significantly outperforms baseline models that directly adapt text-to-image generation procedures to produce videos. Performance is evaluated both visually and by adapting the inception score used to evaluate image generation in GANs.

AAAI Conference 2018 Conference Paper

Zero-Shot Learning via Class-Conditioned Deep Generative Models

  • Wenlin Wang
  • Yunchen Pu
  • Vinay Verma
  • Kai Fan
  • Yizhe Zhang
  • Changyou Chen
  • Piyush Rai
  • Lawrence Carin

We present a deep generative model for Zero-Shot Learning (ZSL). Unlike most existing methods for this problem, that represent each class as a point (via a semantic embedding), we represent each seen/unseen class using a class-specific latent-space distribution, conditioned on class attributes. We use these latent-space distributions as a prior for a supervised variational autoencoder (VAE), which also facilitates learning highly discriminative feature representations for the inputs. The entire framework is learned end-to-end using only the seen-class training data. At test time, the label for an unseen-class test input is the class that maximizes the VAE lower bound. We further extend the model to a (i) semi-supervised/transductive setting by leveraging unlabeled unseen-class data via an unsupervised learning module, and (ii) few-shot learning where we also have a small number of labeled inputs from the unseen classes. We compare our model with several state-of-the-art methods through a comprehensive set of experiments on a variety of benchmark data sets.

NeurIPS Conference 2017 Conference Paper

A Probabilistic Framework for Nonlinearities in Stochastic Neural Networks

  • Qinliang Su
  • Xuejun Liao
  • Lawrence Carin

We present a probabilistic framework for nonlinearities, based on doubly truncated Gaussian distributions. By setting the truncation points appropriately, we are able to generate various types of nonlinearities within a unified framework, including sigmoid, tanh and ReLU, the most commonly used nonlinearities in neural networks. The framework readily integrates into existing stochastic neural networks (with hidden units characterized as random variables), allowing one for the first time to learn the nonlinearities alongside model weights in these networks. Extensive experiments demonstrate the performance improvements brought about by the proposed framework when integrated with the restricted Boltzmann machine (RBM), temporal RBM and the truncated Gaussian graphical model (TGGM).
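
A small illustration of the core idea, assuming SciPy: the mean of a doubly truncated Gaussian behaves like a smoothed ReLU when truncated to [0, inf) and like a saturating, sigmoid-like unit when truncated to [0, 1]. The paper treats hidden units as random variables rather than just their means; this sketch only visualizes the effect of the truncation points.

    import numpy as np
    from scipy.stats import norm

    def truncated_gaussian_mean(mu, sigma, a, b):
        """E[X] for X ~ N(mu, sigma^2) truncated to the interval [a, b]."""
        alpha = (a - mu) / sigma
        beta = (b - mu) / sigma
        Z = norm.cdf(beta) - norm.cdf(alpha)
        return mu + sigma * (norm.pdf(alpha) - norm.pdf(beta)) / Z

    x = np.linspace(-4.0, 4.0, 9)
    relu_like = truncated_gaussian_mean(x, 1.0, 0.0, np.inf)   # smoothed ReLU behavior
    sigmoid_like = truncated_gaussian_mean(x, 1.0, 0.0, 1.0)   # saturates in [0, 1]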

ICML Conference 2017 Conference Paper

Adversarial Feature Matching for Text Generation

  • Yizhe Zhang 0002
  • Zhe Gan
  • Kai Fan 0002
  • Zhi Chen 0009
  • Ricardo Henao
  • Dinghan Shen
  • Lawrence Carin

The Generative Adversarial Network (GAN) has achieved great success in generating realistic (real-valued) synthetic data. However, convergence issues and difficulties dealing with discrete data hinder the applicability of GAN to text. We propose a framework for generating realistic text via adversarial training. We employ a long short-term memory network as generator, and a convolutional network as discriminator. Instead of using the standard objective of GAN, we propose matching the high-dimensional latent feature distributions of real and synthetic sentences, via a kernelized discrepancy metric. This eases adversarial training by alleviating the mode-collapsing problem. Our experiments show superior performance in quantitative evaluation, and demonstrate that our model can generate realistic-looking sentences.
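
A minimal sketch of kernelized feature matching between real and synthetic feature batches (a generic MMD estimator, not the paper's exact discrepancy or architecture); names are illustrative:

    import numpy as np

    def gaussian_kernel(X, Y, bandwidth=1.0):
        d2 = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
        return np.exp(-d2 / (2.0 * bandwidth ** 2))

    def mmd2(X, Y, bandwidth=1.0):
        """Squared maximum mean discrepancy between two feature batches (biased estimator)."""
        return (gaussian_kernel(X, X, bandwidth).mean()
                - 2.0 * gaussian_kernel(X, Y, bandwidth).mean()
                + gaussian_kernel(Y, Y, bandwidth).mean())

    # toy usage: latent features of real vs. generated sentences
    rng = np.random.default_rng(0)
    feat_real = rng.normal(0.0, 1.0, size=(64, 16))
    feat_fake = rng.normal(0.5, 1.0, size=(64, 16))
    loss = mmd2(feat_real, feat_fake)  # a generator would be trained to shrink this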

NeurIPS Conference 2017 Conference Paper

Adversarial Symmetric Variational Autoencoder

  • Yuchen Pu
  • Weiyao Wang
  • Ricardo Henao
  • Liqun Chen
  • Zhe Gan
  • Chunyuan Li
  • Lawrence Carin

A new form of variational autoencoder (VAE) is developed, in which the joint distribution of data and codes is considered in two (symmetric) forms: (i) from observed data fed through the encoder to yield codes, and (ii) from latent codes drawn from a simple prior and propagated through the decoder to manifest data. Lower bounds are learned for marginal log-likelihood fits to the observed data and latent codes. When learning with the variational bound, one seeks to minimize the symmetric Kullback-Leibler divergence of joint density functions from (i) and (ii), while simultaneously seeking to maximize the two marginal log-likelihoods. To facilitate learning, a new form of adversarial training is developed. An extensive set of experiments is performed, in which we demonstrate state-of-the-art data reconstruction and generation on several image benchmark datasets.

NeurIPS Conference 2017 Conference Paper

ALICE: Towards Understanding Adversarial Learning for Joint Distribution Matching

  • Chunyuan Li
  • Hao Liu
  • Changyou Chen
  • Yuchen Pu
  • Liqun Chen
  • Ricardo Henao
  • Lawrence Carin

We investigate the non-identifiability issues associated with bidirectional adversarial training for joint distribution matching. Within a framework of conditional entropy, we propose both adversarial and non-adversarial approaches to learn desirable matched joint distributions for unsupervised and supervised tasks. We unify a broad family of adversarial models as joint distribution matching problems. Our approach stabilizes learning of unsupervised bidirectional adversarial learning methods. Further, we introduce an extension for semi-supervised learning tasks. Theoretical results are validated in synthetic data and real-world applications.

NeurIPS Conference 2017 Conference Paper

An inner-loop free solution to inverse problems using deep neural networks

  • Kai Fan
  • Qi Wei
  • Lawrence Carin
  • Katherine Heller

We propose a new method that uses deep learning techniques to accelerate the popular alternating direction method of multipliers (ADMM) solution for inverse problems. The ADMM updates consist of a proximity operator, a least squares regression that includes a big matrix inversion, and an explicit solution for updating the dual variables. Typically, inner loops are required to solve the first two sub-minimization problems due to the intractability of the prior and the matrix inversion. To avoid such drawbacks or limitations, we propose an inner-loop free update rule with two pre-trained deep convolutional architectures. More specifically, we learn a conditional denoising auto-encoder which imposes an implicit data-dependent prior/regularization on ground-truth in the first sub-minimization problem. This design follows an empirical Bayesian strategy, leading to so-called amortized inference. For matrix inversion in the second sub-problem, we learn a convolutional neural network to approximate the matrix inversion, i.e., the inverse mapping is learned by feeding the input through the learned forward network. Note that training this neural network does not require ground-truth or measurements, i.e., it is data-independent. Extensive experiments on both synthetic data and real datasets demonstrate the efficiency and accuracy of the proposed method compared with the conventional ADMM solution using inner loops for solving inverse problems.
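
A hypothetical skeleton of the inner-loop-free idea: each ADMM sub-problem becomes a single forward pass through a pre-trained network. The two network functions below are placeholders standing in for the trained models described in the abstract, not the paper's implementation.

    import numpy as np

    def learned_denoiser(v):
        # placeholder: a trained conditional denoising auto-encoder would go here
        return v

    def learned_inverse(y, z, u, rho):
        # placeholder: a network trained to approximate the least-squares/matrix-inversion solve
        return z - u

    def admm_inner_loop_free(y, n_iter=10, rho=1.0):
        """ADMM skeleton in which both sub-problems are single forward passes (no inner loops)."""
        z = np.zeros_like(y)
        u = np.zeros_like(y)
        for _ in range(n_iter):
            x = learned_inverse(y, z, u, rho)  # replaces the matrix-inversion step
            z = learned_denoiser(x + u)        # replaces the proximal step for the prior
            u = u + x - z                      # dual update stays in closed form
        return z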

NeurIPS Conference 2017 Conference Paper

Cross-Spectral Factor Analysis

  • Neil Gallagher
  • Kyle Ulrich
  • Austin Talbot
  • Kafui Dzirasa
  • Lawrence Carin
  • David Carlson

In neuropsychiatric disorders such as schizophrenia or depression, there is often a disruption in the way that regions of the brain synchronize with one another. To facilitate understanding of network-level synchronization between brain regions, we introduce a novel model of multisite low-frequency neural recordings, such as local field potentials (LFPs) and electroencephalograms (EEGs). The proposed model, named Cross-Spectral Factor Analysis (CSFA), breaks the observed signal into factors defined by unique spatio-spectral properties. These properties are granted to the factors via a Gaussian process formulation in a multiple kernel learning framework. In this way, the LFP signals can be mapped to a lower dimensional space in a way that retains information of relevance to neuroscientists. Critically, the factors are interpretable. The proposed approach empirically allows similar performance in classifying mouse genotype and behavioral context when compared to commonly used approaches that lack the interpretability of CSFA. We also introduce a semi-supervised approach, termed discriminative CSFA (dCSFA). CSFA and dCSFA provide useful tools for understanding neural dynamics, particularly by aiding in the design of causal follow-up experiments.

NeurIPS Conference 2017 Conference Paper

Deconvolutional Paragraph Representation Learning

  • Yizhe Zhang
  • Dinghan Shen
  • Guoyin Wang
  • Zhe Gan
  • Ricardo Henao
  • Lawrence Carin

Learning latent representations from long text sequences is an important first step in many natural language processing applications. Recurrent Neural Networks (RNNs) have become a cornerstone for this challenging task. However, the quality of sentences during RNN-based decoding (reconstruction) decreases with the length of the text. We propose a sequence-to-sequence, purely convolutional and deconvolutional autoencoding framework that is free of the above issue, while also being computationally efficient. The proposed method is simple, easy to implement and can be leveraged as a building block for many applications. We show empirically that compared to RNNs, our framework is better at reconstructing and correcting long paragraphs. Quantitative evaluation on semi-supervised text classification and summarization tasks demonstrates the potential for better utilization of long unlabeled text data.

ICML Conference 2017 Conference Paper

Deep Generative Models for Relational Data with Side Information

  • Changwei Hu
  • Piyush Rai
  • Lawrence Carin

We present a probabilistic framework for overlapping community discovery and link prediction for relational data, given as a graph. The proposed framework has: (1) a deep architecture which enables us to infer multiple layers of latent features/communities for each node, providing superior link prediction performance on more complex networks and better interpretability of the latent features; and (2) a regression model which allows directly conditioning the node latent features on the side information available in the form of node attributes. Our framework handles both (1) and (2) via a clean, unified model, which enjoys full local conjugacy via data augmentation, and facilitates efficient inference via closed form Gibbs sampling. Moreover, inference cost scales in the number of edges, which is attractive for massive but sparse networks. Our framework is also easily extendable to model weighted networks with count-valued edges. We compare with various state-of-the-art methods and report results, both quantitative and qualitative, on several benchmark data sets.

NeurIPS Conference 2017 Conference Paper

Scalable Model Selection for Belief Networks

  • Zhao Song
  • Yusuke Muraoka
  • Ryohei Fujimaki
  • Lawrence Carin

We propose a scalable algorithm for model selection in sigmoid belief networks (SBNs), based on the factorized asymptotic Bayesian (FAB) framework. We derive the corresponding generalized factorized information criterion (gFIC) for the SBN, which is proven to be statistically consistent with the marginal log-likelihood. To capture the dependencies within hidden variables in SBNs, a recognition network is employed to model the variational distribution. The resulting algorithm, which we call FABIA, can simultaneously execute both model selection and inference by maximizing the lower bound of gFIC. On both synthetic and real data, our experiments suggest that FABIA, when compared to state-of-the-art algorithms for learning SBNs, $(i)$ produces a more concise model, thus enabling faster testing; $(ii)$ improves predictive performance; $(iii)$ accelerates convergence; and $(iv)$ prevents overfitting.

ICML Conference 2017 Conference Paper

Stochastic Gradient Monomial Gamma Sampler

  • Yizhe Zhang 0002
  • Changyou Chen
  • Zhe Gan
  • Ricardo Henao
  • Lawrence Carin

Scaling Markov Chain Monte Carlo (MCMC) to estimate posterior distributions from large datasets has been made possible as a result of advances in stochastic gradient techniques. Despite their success, existing methods can mix inefficiently when sampling from multimodal distributions with a limited number of Monte Carlo samples, as evidenced by slow convergence and insufficient exploration of the posterior. We propose a generalized framework to improve the sampling efficiency of stochastic gradient MCMC, by leveraging a generalized kinetics that delivers superior stationary mixing, especially in multimodal distributions, and propose several techniques to overcome the practical issues. We show that the proposed approach is better at exploring a complicated multimodal posterior distribution, and demonstrate improvements over other stochastic gradient MCMC methods on various applications.

NeurIPS Conference 2017 Conference Paper

Targeting EEG/LFP Synchrony with Neural Nets

  • Yitong Li
  • Michael Murias
  • Samantha Major
  • Geraldine Dawson
  • Kafui Dzirasa
  • Lawrence Carin
  • David Carlson

We consider the analysis of Electroencephalography (EEG) and Local Field Potential (LFP) datasets, which are “big” in terms of the size of recorded data but rarely have sufficient labels required to train complex models (e.g., conventional deep learning methods). Furthermore, in many scientific applications, the goal is to be able to understand the underlying features related to the classification, which prohibits the blind application of deep networks. This motivates the development of a new model based on parameterized convolutional filters guided by previous neuroscience research; the filters learn relevant frequency bands while targeting synchrony, which are frequency-specific power and phase correlations between electrodes. This results in a highly expressive convolutional neural network with only a few hundred parameters, applicable to smaller datasets. The proposed approach is demonstrated to yield competitive (often state-of-the-art) predictive performance during our empirical tests while yielding interpretable features. Furthermore, a Gaussian process adapter is developed to combine analysis over distinct electrode layouts, allowing the joint processing of multiple datasets to address overfitting and improve generalizability. Finally, it is demonstrated that the proposed framework effectively tracks neural dynamics in children in a clinical trial on Autism Spectrum Disorder.

NeurIPS Conference 2017 Conference Paper

Triangle Generative Adversarial Networks

  • Zhe Gan
  • Liqun Chen
  • Weiyao Wang
  • Yuchen Pu
  • Yizhe Zhang
  • Hao Liu
  • Chunyuan Li
  • Lawrence Carin

A Triangle Generative Adversarial Network ($\Delta$-GAN) is developed for semi-supervised cross-domain joint distribution matching, where the training data consists of samples from each domain, and supervision of domain correspondence is provided by only a few paired samples. $\Delta$-GAN consists of four neural networks, two generators and two discriminators. The generators are designed to learn the two-way conditional distributions between the two domains, while the discriminators implicitly define a ternary discriminative function, which is trained to distinguish real data pairs and two kinds of fake data pairs. The generators and discriminators are trained together using adversarial learning. Under mild assumptions, in theory the joint distributions characterized by the two generators concentrate to the data distribution. In experiments, three different kinds of domain pairs are considered, image-label, image-image and image-attribute pairs. Experiments on semi-supervised image classification, image-to-image translation and attribute-based image generation demonstrate the superiority of the proposed approach.

AAAI Conference 2017 Conference Paper

Unsupervised Learning with Truncated Gaussian Graphical Models

  • Qinliang Su
  • Xuejun Liao
  • Chunyuan Li
  • Zhe Gan
  • Lawrence Carin

Gaussian graphical models (GGMs) are widely used for statistical modeling, because of ease of inference and the ubiquitous use of the normal distribution in practical approximations. However, they are also known for their limited modeling abilities, due to the Gaussian assumption. In this paper, we introduce a novel variant of GGMs, which relaxes the Gaussian restriction and yet admits efficient inference. Specifically, we impose a bipartite structure on the GGM and govern the hidden variables by truncated normal distributions. The nonlinearity of the model is revealed by its connection to rectified linear unit (ReLU) neural networks. Meanwhile, thanks to the bipartite structure and appealing properties of truncated normals, we are able to train the models efficiently using contrastive divergence. We consider three output constructs, accounting for real-valued, binary and count data. We further extend the model to deep constructions and show that deep models can be used for unsupervised pre-training of rectifier neural networks. Extensive experimental results are provided to validate the proposed models and demonstrate their superiority over competing models.

NeurIPS Conference 2017 Conference Paper

VAE Learning via Stein Variational Gradient Descent

  • Yuchen Pu
  • Zhe Gan
  • Ricardo Henao
  • Chunyuan Li
  • Shaobo Han
  • Lawrence Carin

A new method for learning variational autoencoders (VAEs) is developed, based on Stein variational gradient descent. A key advantage of this approach is that one need not make parametric assumptions about the form of the encoder distribution. Performance is further enhanced by integrating the proposed encoder with importance sampling. Excellent performance is demonstrated across multiple unsupervised and semi-supervised problems, including semi-supervised analysis of the ImageNet data, demonstrating the scalability of the model to large datasets.
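
As a sketch of the generic Stein variational gradient descent update that this work builds on (applied here to plain particles rather than to an encoder), assuming an RBF kernel; all names are illustrative:

    import numpy as np

    def svgd_step(X, grad_log_p, eps=0.1, h=1.0):
        """One Stein variational gradient descent update on a particle set X of shape (n, d)."""
        n = X.shape[0]
        diff = X[:, None, :] - X[None, :, :]                   # x_j - x_i along axes (j, i, d)
        K = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * h**2))   # K[j, i] = k(x_j, x_i)
        grads = grad_log_p(X)                                  # rows are grad log p(x_j)
        phi = (K.T @ grads - np.sum(K[:, :, None] * diff / h**2, axis=0)) / n
        return X + eps * phi

    # toy target: standard 2-d Gaussian, grad log p(x) = -x
    rng = np.random.default_rng(0)
    X = rng.normal(3.0, 1.0, size=(100, 2))
    for _ in range(500):
        X = svgd_step(X, lambda Z: -Z)
    # the particles now roughly approximate N(0, I)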

IJCAI Conference 2016 Conference Paper

Bayesian Dictionary Learning with Gaussian Processes and Sigmoid Belief Networks

  • Yizhe Zhang
  • Ricardo Henao
  • Chunyuan Li
  • Lawrence Carin

In dictionary learning for analysis of images, spatial correlation from extracted patches can be leveraged to improve characterization power. We propose a Bayesian framework for dictionary learning, with spatial location dependencies captured by imposing multiplicative Gaussian process (GP) priors on the latent units representing binary activations. Data augmentation and Kronecker methods allow for efficient Markov chain Monte Carlo sampling. We further extend the model with Sigmoid Belief Networks (SBNs), linking the GPs to the top-layer latent binary units of the SBN, capturing inter-dictionary dependencies while also yielding computational savings. Applications to image denoising, inpainting and depth-information restoration demonstrate that the proposed model outperforms other leading Bayesian dictionary learning approaches.

JMLR Journal 2016 Journal Article

Electronic Health Record Analysis via Deep Poisson Factor Models

  • Ricardo Henao
  • James T. Lu
  • Joseph E. Lucas
  • Jeffrey Ferranti
  • Lawrence Carin

Electronic Health Record (EHR) phenotyping utilizes patient data captured through normal medical practice, to identify features that may represent computational medical phenotypes. These features may be used to identify at-risk patients and improve prediction of patient morbidity and mortality. We present a novel deep multi-modality architecture for EHR analysis (applicable to joint analysis of multiple forms of EHR data), based on Poisson Factor Analysis (PFA) modules. Each modality, composed of observed counts, is represented as a Poisson distribution, parameterized in terms of hidden binary units. Information from different modalities is shared via a deep hierarchy of common hidden units. Activation of these binary units occurs with probability characterized as Bernoulli-Poisson link functions, instead of more traditional logistic link functions. In addition, we demonstrate that PFA modules can be adapted to discriminative modalities. To compute model parameters, we derive efficient Markov Chain Monte Carlo (MCMC) inference that scales efficiently, with significant computational gains when compared to related models based on logistic link functions. To explore the utility of these models, we apply them to a subset of patients from the Duke-Durham patient cohort. We identified a cohort of over 16,000 patients with Type 2 Diabetes Mellitus (T2DM) based on diagnosis codes and laboratory tests out of our patient population of over 240,000. Examining the common hidden units uniting the PFA modules, we identify patient features that represent medical concepts. Experiments indicate that our learned features are better able to predict mortality and morbidity than clinical features identified previously in a large-scale clinical trial.

ICML Conference 2016 Conference Paper

Factored Temporal Sigmoid Belief Networks for Sequence Learning

  • Jiaming Song
  • Zhe Gan
  • Lawrence Carin

Deep conditional generative models are developed to simultaneously learn the temporal dependencies of multiple sequences. The model is designed by introducing a three-way weight tensor to capture the multiplicative interactions between side information and sequences. The proposed model builds on the Temporal Sigmoid Belief Network (TSBN), a sequential stack of Sigmoid Belief Networks (SBNs). The transition matrices are further factored to reduce the number of parameters and improve generalization. When side information is not available, a general framework for semi-supervised learning based on the proposed model is constituted, allowing robust sequence classification. Experimental results show that the proposed approach achieves state-of-the-art predictive and classification performance on sequential data, and has the capacity to synthesize sequences, with controlled style transitioning and blending.

AAAI Conference 2016 Conference Paper

High-Order Stochastic Gradient Thermostats for Bayesian Learning of Deep Models

  • Chunyuan Li
  • Changyou Chen
  • Kai Fan
  • Lawrence Carin

Learning in deep models using Bayesian methods has generated significant attention recently. This is largely because of the feasibility of modern Bayesian methods to yield scalable learning and inference, while maintaining a measure of uncertainty in the model parameters. Stochastic gradient MCMC algorithms (SG-MCMC) are a family of diffusion-based sampling methods for large-scale Bayesian learning. In SG-MCMC, multivariate stochastic gradient thermostats (mSGNHT) augment each parameter of interest, with a momentum and a thermostat variable to maintain stationary distributions as target posterior distributions. As the number of variables in a continuous-time diffusion increases, its numerical approximation error becomes a practical bottleneck, so better use of a numerical integrator is desirable. To this end, we propose use of an efficient symmetric splitting integrator in mSGNHT, instead of the traditional Euler integrator. We demonstrate that the proposed scheme is more accurate, robust, and converges faster. These properties are demonstrated to be desirable in Bayesian deep learning. Extensive experiments on two canonical models and their deep extensions demonstrate that the proposed scheme improves general Bayesian posterior sampling, particularly for deep models.

AAAI Conference 2016 Conference Paper

Learning a Hybrid Architecture for Sequence Regression and Annotation

  • Yizhe Zhang
  • Ricardo Henao
  • Lawrence Carin
  • Jianling Zhong
  • Alexander Hartemink

When learning a hidden Markov model (HMM), sequential observations can often be complemented by real-valued summary response variables generated from the path of hidden states. Such settings arise in numerous domains, including many applications in biology, like motif discovery and genome annotation. In this paper, we present a flexible framework for jointly modeling both latent sequence features and the functional mapping that relates the summary response variables to the hidden state sequence. The algorithm is compatible with a rich set of mapping functions. Results show that the availability of additional continuous response variables can simultaneously improve the annotation of the sequential observations and yield good prediction performance in both synthetic data and real-world datasets.

NeurIPS Conference 2016 Conference Paper

Linear Feature Encoding for Reinforcement Learning

  • Zhao Song
  • Ronald Parr
  • Xuejun Liao
  • Lawrence Carin

Feature construction is of vital importance in reinforcement learning, as the quality of a value function or policy is largely determined by the corresponding features. The recent successes of deep reinforcement learning (RL) only increase the importance of understanding feature construction. Typical deep RL approaches use a linear output layer, which means that deep RL can be interpreted as a feature construction/encoding network followed by linear value function approximation. This paper develops and evaluates a theory of linear feature encoding. We extend theoretical results on feature quality for linear value function approximation from the uncontrolled case to the controlled case. We then develop a supervised linear feature encoding method that is motivated by insights from linear value function approximation theory, as well as empirical successes from deep RL. The resulting encoder is a surprisingly effective method for linear value function approximation using raw images as inputs.

ICML Conference 2016 Conference Paper

Nonlinear Statistical Learning with Truncated Gaussian Graphical Models

  • Qinliang Su
  • Xuejun Liao
  • Changyou Chen
  • Lawrence Carin

We introduce the truncated Gaussian graphical model (TGGM) as a novel framework for designing statistical models for nonlinear learning. A TGGM is a Gaussian graphical model (GGM) with a subset of variables truncated to be nonnegative. The truncated variables are assumed latent and integrated out to induce a marginal model. We show that the variables in the marginal model are non-Gaussian distributed and their expected relations are nonlinear. We use expectation-maximization to break the inference of the nonlinear model into a sequence of TGGM inference problems, each of which is efficiently solved by using the properties and numerical methods of multivariate Gaussian distributions. We use the TGGM to design models for nonlinear regression and classification, with the performances of these models demonstrated on extensive benchmark datasets and compared to state-of-the-art competing results.

AAAI Conference 2016 Conference Paper

Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks

  • Chunyuan Li
  • Changyou Chen
  • David Carlson
  • Lawrence Carin

Effective training of deep neural networks suffers from two main issues. The first is that the parameter spaces of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is typically addressed by early stopping. However, recent work has demonstrated that Bayesian model averaging mitigates this problem. The posterior can be sampled by using Stochastic Gradient Langevin Dynamics (SGLD). However, the rapidly changing curvature renders default SGLD methods inefficient. Here, we propose combining adaptive preconditioners with SGLD. In support of this idea, we give theoretical properties on asymptotic convergence and predictive risk. We also provide empirical results for Logistic Regression, Feedforward Neural Nets, and Convolutional Neural Nets, demonstrating that our preconditioned SGLD method gives state-of-the-art performance on these models.
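
A minimal sketch of a preconditioned SGLD update with an RMSprop-style diagonal preconditioner, in the spirit of the method described; the small correction term involving the derivative of the preconditioner is omitted, and all names are illustrative:

    import numpy as np

    def psgld_step(theta, grad_log_post, V, eps=1e-3, alpha=0.99, lam=1e-5, rng=None):
        """One preconditioned SGLD step.

        grad_log_post: stochastic gradient of the log posterior at theta
        V:             running average of squared gradients (returned as updated state)
        """
        rng = rng or np.random.default_rng()
        V = alpha * V + (1 - alpha) * grad_log_post ** 2
        G = 1.0 / (lam + np.sqrt(V))                       # diagonal preconditioner
        noise = rng.standard_normal(theta.shape) * np.sqrt(eps * G)
        theta = theta + 0.5 * eps * G * grad_log_post + noise
        return theta, V

    # toy usage: sampling a 2-d Gaussian posterior, grad log p(theta) = -theta
    rng = np.random.default_rng(0)
    theta, V = np.array([3.0, -2.0]), np.zeros(2)
    for _ in range(5000):
        theta, V = psgld_step(theta, -theta, V, rng=rng)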

NeurIPS Conference 2016 Conference Paper

Stochastic Gradient MCMC with Stale Gradients

  • Changyou Chen
  • Nan Ding
  • Chunyuan Li
  • Yizhe Zhang
  • Lawrence Carin

Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale Bayesian learning, with well-developed theoretical convergence properties. In such applications of SG-MCMC, it is becoming increasingly popular to employ distributed systems, where stochastic gradients are computed based on some outdated parameters, yielding what are termed stale gradients. While stale gradients could be directly used in SG-MCMC, their impact on convergence properties has not been well studied. In this paper we develop theory to show that while the bias and MSE of an SG-MCMC algorithm depend on the staleness of stochastic gradients, its estimation variance (relative to the expected estimate, based on a prescribed number of samples) is independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale gradients are computed asynchronously by a set of workers, our theory indicates a linear speedup on the decrease of estimation variance w.r.t. the number of workers. Experiments on synthetic data and deep neural networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC with stale gradients.

NeurIPS Conference 2016 Conference Paper

Towards Unifying Hamiltonian Monte Carlo and Slice Sampling

  • Yizhe Zhang
  • Xiangyu Wang
  • Changyou Chen
  • Ricardo Henao
  • Kai Fan
  • Lawrence Carin

We unify slice sampling and Hamiltonian Monte Carlo (HMC) sampling, demonstrating their connection via the Hamilton-Jacobi equation from Hamiltonian mechanics. This insight enables extension of HMC and slice sampling to a broader family of samplers, called Monomial Gamma Samplers (MGS). We provide a theoretical analysis of the mixing performance of such samplers, proving that in the limit of a single parameter, the MGS draws decorrelated samples from the desired target distribution. We further show that as this parameter tends toward this limit, performance gains are achieved at a cost of increasing numerical difficulty and some practical convergence issues. Our theoretical results are validated with synthetic data and real-world applications.

NeurIPS Conference 2016 Conference Paper

Variational Autoencoder for Deep Learning of Images, Labels and Captions

  • Yunchen Pu
  • Zhe Gan
  • Ricardo Henao
  • Xin Yuan
  • Chunyuan Li
  • Andrew Stevens
  • Lawrence Carin

A novel variational autoencoder is developed to model images, as well as associated labels or captions. The Deep Generative Deconvolutional Network (DGDN) is used as a decoder of the latent image features, and a deep Convolutional Neural Network (CNN) is used as an image encoder; the CNN is used to approximate a distribution for the latent DGDN features/code. The latent code is also linked to generative models for labels (Bayesian support vector machine) or captions (recurrent neural network). When predicting a label/caption for a new image at test time, averaging is performed across the distribution of latent codes; this is computationally efficient as a consequence of the learned CNN-based encoder. Since the framework is capable of modeling the image in the presence/absence of associated labels/captions, a new semi-supervised setting is manifested for CNN learning with images; the framework even allows unsupervised CNN learning, based on images alone.

ICML Conference 2015 Conference Paper

A Multitask Point Process Predictive Model

  • Wenzhao Lian
  • Ricardo Henao
  • Vinayak A. Rao
  • Joseph E. Lucas
  • Lawrence Carin

Point process data are commonly observed in fields like healthcare and social science. Designing predictive models for such event streams is an under-explored problem, due to often scarce training data. In this work we propose a multitask point process model, leveraging information from all tasks via a hierarchical Gaussian process (GP). Nonparametric learning functions implemented by a GP, which map from past events to future rates, allow analysis of flexible arrival patterns. To facilitate efficient inference, we propose a sparse construction for this hierarchical model, and derive a variational Bayes method for learning and inference. Experimental results are shown on both synthetic data and an application on real electronic health records.

AAAI Conference 2015 Conference Paper

Cross-Modal Similarity Learning via Pairs, Preferences, and Active Supervision

  • Yi Zhen
  • Piyush Rai
  • Hongyuan Zha
  • Lawrence Carin

We present a probabilistic framework for learning pairwise similarities between objects belonging to different modalities, such as drugs and proteins, or text and images. Our framework is based on learning a binary code based representation for objects in each modality, and has the following key properties: (i) it can leverage both pairwise as well as easy-to-obtain relative preference based cross-modal constraints, (ii) the probabilistic framework naturally allows querying for the most useful/informative constraints, facilitating an active learning setting (existing methods for cross-modal similarity learning do not have such a mechanism), and (iii) the binary code length is learned from the data. We demonstrate the effectiveness of the proposed approach on two problems that require computing pairwise similarities between cross-modal object pairs: cross-modal link prediction in bipartite graphs, and hashing based cross-modal similarity search.

NeurIPS Conference 2015 Conference Paper

Deep Poisson Factor Modeling

  • Ricardo Henao
  • Zhe Gan
  • James Lu
  • Lawrence Carin

We propose a new deep architecture for topic modeling, based on Poisson Factor Analysis (PFA) modules. The model is composed of a Poisson distribution to model observed vectors of counts, as well as a deep hierarchy of hidden binary units. Rather than using logistic functions to characterize the probability that a latent binary unit is on, we employ a Bernoulli-Poisson link, which allows PFA modules to be used repeatedly in the deep architecture. We also describe an approach to build discriminative topic models, by adapting PFA modules. We derive efficient inference via MCMC and stochastic variational methods, that scale with the number of non-zeros in the data and binary units, yielding significant efficiency, relative to models based on logistic links. Experiments on several corpora demonstrate the advantages of our model when compared to related deep models.

NeurIPS Conference 2015 Conference Paper

Deep Temporal Sigmoid Belief Networks for Sequence Modeling

  • Zhe Gan
  • Chunyuan Li
  • Ricardo Henao
  • David Carlson
  • Lawrence Carin

Deep dynamic generative models are developed to learn sequential dependencies in time-series data. The multi-layered model is designed by constructing a hierarchy of temporal sigmoid belief networks (TSBNs), defined as a sequential stack of sigmoid belief networks (SBNs). Each SBN has a contextual hidden state, inherited from the previous SBNs in the sequence, and is used to regulate its hidden bias. Scalable learning and inference algorithms are derived by introducing a recognition model that yields fast sampling from the variational posterior. This recognition model is trained jointly with the generative model, by maximizing its variational lower bound on the log-likelihood. Experimental results on bouncing balls, polyphonic music, motion capture, and text streams show that the proposed approach achieves state-of-the-art predictive performance, and has the capacity to synthesize various sequences.

NeurIPS Conference 2015 Conference Paper

GP Kernels for Cross-Spectrum Analysis

  • Kyle Ulrich
  • David Carlson
  • Kafui Dzirasa
  • Lawrence Carin

Multi-output Gaussian processes provide a convenient framework for multi-task problems. An illustrative and motivating example of a multi-task problem is multi-region electrophysiological time-series data, where experimentalists are interested in both power and phase coherence between channels. Recently, Wilson and Adams (2013) proposed the spectral mixture (SM) kernel to model the spectral density of a single task in a Gaussian process framework. In this paper, we develop a novel covariance kernel for multiple outputs, called the cross-spectral mixture (CSM) kernel. This new, flexible kernel represents both the power and phase relationship between multiple observation channels. We demonstrate the expressive capabilities of the CSM kernel through implementation of a Bayesian hidden Markov model, where the emission distribution is a multi-output Gaussian process with a CSM covariance kernel. Results are presented for measured multi-region electrophysiological data.
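
As a rough sketch of the single-output building block referenced here (the spectral mixture kernel of Wilson and Adams, not the cross-spectral extension), with illustrative frequencies:

    import numpy as np

    def spectral_mixture_kernel(tau, weights, means, variances):
        """k(tau) = sum_q w_q * exp(-2 pi^2 tau^2 v_q) * cos(2 pi tau mu_q)."""
        tau = np.asarray(tau, dtype=float)[..., None]
        w = np.asarray(weights, dtype=float)
        mu = np.asarray(means, dtype=float)
        v = np.asarray(variances, dtype=float)
        return np.sum(w * np.exp(-2.0 * np.pi**2 * tau**2 * v)
                      * np.cos(2.0 * np.pi * tau * mu), axis=-1)

    # two illustrative spectral components, e.g. power near 6 Hz and near 40 Hz
    tau = np.linspace(0.0, 1.0, 200)  # time lags in seconds
    k = spectral_mixture_kernel(tau, weights=[1.0, 0.5], means=[6.0, 40.0], variances=[1.0, 4.0])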

AAAI Conference 2015 Conference Paper

Integrating Features and Similarities: Flexible Models for Heterogeneous Multiview Data

  • Wenzhao Lian
  • Piyush Rai
  • Esther Salazar
  • Lawrence Carin

We present a probabilistic framework for learning with heterogeneous multiview data where some views are given as ordinal, binary, or real-valued feature matrices, and some views as similarity matrices. Our framework has the following distinguishing aspects: (i) a unified latent factor model for integrating information from diverse feature (ordinal, binary, real) and similarity based views, and predicting the missing data in each view, leveraging view correlations; (ii) seamless adaptation to binary/multiclass classification where data consists of multiple feature and/or similarity-based views; and (iii) an efficient, variational inference algorithm which is especially flexible in modeling the views with ordinal-valued data (by learning the cutpoints for the ordinal data), and extends naturally to streaming data settings. Our framework subsumes methods such as multiview learning and multiple kernel learning as special cases. We demonstrate the effectiveness of our framework on several real-world and benchmark datasets.

NeurIPS Conference 2015 Conference Paper

Large-Scale Bayesian Multi-Label Learning via Topic-Based Label Embeddings

  • Piyush Rai
  • Changwei Hu
  • Ricardo Henao
  • Lawrence Carin

We present a scalable Bayesian multi-label learning model based on learning low-dimensional label embeddings. Our model assumes that each label vector is generated as a weighted combination of a set of topics (each topic being a distribution over labels), where the combination weights (i.e., the embeddings) for each label vector are conditioned on the observed feature vector. This construction, coupled with a Bernoulli-Poisson link function for each label of the binary label vector, leads to a model with a computational cost that scales in the number of positive labels in the label matrix. This makes the model particularly appealing for real-world multi-label learning problems where the label matrix is usually very massive but highly sparse. Using a data-augmentation strategy leads to full local conjugacy in our model, facilitating simple and very efficient Gibbs sampling, as well as an Expectation Maximization algorithm for inference. Also, predicting the label vector at test time does not require doing an inference for the label embeddings and can be done in closed form. We report results on several benchmark data sets, comparing our model with various state-of-the-art methods.
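
A tiny illustration of the Bernoulli-Poisson link, assuming NumPy: each binary label is the indicator of a latent Poisson count, so a latent count only needs to be imputed for positive labels, which is what makes computation scale with the number of positives.

    import numpy as np

    def bernoulli_poisson_prob(lam):
        """P(y = 1) under the Bernoulli-Poisson link: y = 1(n >= 1), n ~ Poisson(lam)."""
        return 1.0 - np.exp(-lam)

    # sampling view of the same link
    rng = np.random.default_rng(0)
    lam = 0.3
    n = rng.poisson(lam)       # latent count, only needs imputation when y = 1
    y = int(n >= 1)
    p = bernoulli_poisson_prob(lam)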

AAAI Conference 2015 Conference Paper

Leveraging Features and Networks for Probabilistic Tensor Decomposition

  • Piyush Rai
  • Yingjian Wang
  • Lawrence Carin

We present a probabilistic model for tensor decomposition where one or more tensor modes may have side information about the mode entities in form of their features and/or their adjacency network. We consider a Bayesian approach based on the Canonical PARAFAC (CP) decomposition and enrich this single-layer decomposition approach with a two-layer decomposition. The second layer fits a factor model for each layer-one factor matrix and models the factor matrix via the mode entities’ features and/or the network between the mode entities. The second-layer decomposition of each factor matrix also learns a binary latent representation for the entities of that mode, which can be useful in its own right. Our model can handle both continuous as well as binary tensor observations. Another appealing aspect of our model is the simplicity of the model inference, with easy-to-sample Gibbs updates. We demonstrate the results of our model on several benchmark datasets, consisting of both real and binary tensors.

ICML Conference 2015 Conference Paper

Non-Gaussian Discriminative Factor Models via the Max-Margin Rank-Likelihood

  • Xin Yuan 0002
  • Ricardo Henao
  • Ephraim Tsalik
  • Raymond Langley
  • Lawrence Carin

We consider the problem of discriminative factor analysis for data that are in general non-Gaussian. A Bayesian model based on the ranks of the data is proposed. We first introduce a max-margin version of the rank-likelihood. A discriminative factor model is then developed, integrating the new max-margin rank-likelihood and (linear) Bayesian support vector machines, which are also built on the max-margin principle. The discriminative factor model is further extended to the nonlinear case through mixtures of local linear classifiers, via Dirichlet processes. Fully local conjugacy of the model yields efficient inference with both Markov Chain Monte Carlo and variational Bayes approaches. Extensive experiments on benchmark and real data demonstrate superior performance of the proposed model and its potential for applications in computational biology.

NeurIPS Conference 2015 Conference Paper

On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators

  • Changyou Chen
  • Nan Ding
  • Lawrence Carin

Recent advances in Bayesian learning with large-scale data have witnessed the emergence of stochastic gradient MCMC algorithms (SG-MCMC), such as stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian MCMC (SGHMC), and the stochastic gradient thermostat. While finite-time convergence properties of the SGLD with a 1st-order Euler integrator have recently been studied, corresponding theory for general SG-MCMCs has not been explored. In this paper we consider general SG-MCMCs with high-order integrators, and develop theory to analyze finite-time convergence properties and their asymptotic invariant measures. Our theoretical results show faster convergence rates and more accurate invariant measures for SG-MCMCs with higher-order integrators. For example, with the proposed efficient 2nd-order symmetric splitting integrator, the mean square error (MSE) of the posterior average for the SGHMC achieves an optimal convergence rate of $L^{-4/5}$ at $L$ iterations, compared to $L^{-2/3}$ for the SGHMC and SGLD with 1st-order Euler integrators. Furthermore, convergence results of decreasing-step-size SG-MCMCs are also developed, with the same convergence rates as their fixed-step-size counterparts for a specific decreasing sequence. Experiments on both synthetic and real datasets verify our theory, and show advantages of the proposed method in two large-scale real applications.
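
For orientation only, the snippet below implements the basic SGLD update with a 1st-order Euler integrator on a toy Gaussian posterior; it is a baseline sketch under hypothetical settings, not the higher-order symmetric-splitting integrators analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(1.0, 1.0, size=10_000)     # toy data; posterior over the mean theta

def stoch_grad_log_post(theta, batch, n_total):
    """Stochastic gradient of log p(theta | data) with a N(0, 1) prior on theta."""
    return -theta + n_total * np.mean(batch - theta)

theta, eps, samples = 0.0, 1e-4, []
for _ in range(5_000):
    batch = rng.choice(data, size=100, replace=False)
    g = stoch_grad_log_post(theta, batch, len(data))
    theta += 0.5 * eps * g + np.sqrt(eps) * rng.normal()   # SGLD (1st-order Euler) step
    samples.append(theta)

print("posterior-mean estimate:", np.mean(samples[1_000:]))
```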

NeurIPS Conference 2015 Conference Paper

Preconditioned Spectral Descent for Deep Learning

  • David Carlson
  • Edo Collins
  • Ya-Ping Hsieh
  • Lawrence Carin
  • Volkan Cevher

Deep learning presents notorious computational challenges. These challenges include, but are not limited to, the non-convexity of learning objectives and estimating the quantities needed for optimization algorithms, such as gradients. While we do not address the non-convexity, we present an optimization solution that exploits the thus-far unused "geometry" in the objective function in order to best make use of the estimated gradients. Previous work attempted similar goals with preconditioned methods in the Euclidean space, such as L-BFGS, RMSprop, and Adagrad. In stark contrast, our approach combines a non-Euclidean gradient method with preconditioning. We provide evidence that this combination more accurately captures the geometry of the objective function compared to prior work. We theoretically formalize our arguments and derive novel preconditioned non-Euclidean algorithms. The results are promising in both computational time and quality when applied to Restricted Boltzmann Machines, Feedforward Neural Nets, and Convolutional Neural Nets.

ICML Conference 2015 Conference Paper

Scalable Deep Poisson Factor Analysis for Topic Modeling

  • Zhe Gan
  • Changyou Chen
  • Ricardo Henao
  • David E. Carlson
  • Lawrence Carin

A new framework for topic modeling is developed, based on deep graphical models, where interactions between topics are inferred through deep latent binary hierarchies. The proposed multi-layer model employs a deep sigmoid belief network or restricted Boltzmann machine, the bottom binary layer of which selects topics for use in a Poisson factor analysis model. Under this setting, topics live on the bottom layer of the model, while the deep specification serves as a flexible prior for revealing topic structure. Scalable inference algorithms are derived by applying a Bayesian conditional density filtering algorithm, in addition to extending recently proposed work on stochastic gradient thermostats. Experimental results on several corpora show that the proposed approach readily handles very large collections of text documents, infers structured topic representations, and obtains superior test perplexities when compared with related models.

IJCAI Conference 2015 Conference Paper

Scalable Probabilistic Tensor Factorization for Binary and Count Data

  • Piyush Rai
  • Changwei Hu
  • Matthew Harding
  • Lawrence Carin

Tensor factorization methods provide a useful way to extract latent factors from complex multirelational data, and also to predict missing data. Developing tensor factorization methods for massive tensors, especially when the data are binary- or count-valued (which is true of most real-world tensors), however, remains a challenge. We develop a scalable probabilistic tensor factorization framework that enables us to perform efficient factorization of massive binary and count tensor data. The framework is based on (i) the Pólya-Gamma augmentation strategy, which makes the model fully locally conjugate and allows closed-form parameter updates when data are binary- or count-valued; and (ii) an efficient online Expectation Maximization algorithm, which allows processing data in small mini-batches and facilitates handling massive tensor data. Moreover, various types of constraints on the factor matrices (e.g., sparsity, non-negativity) can be incorporated under the proposed framework, providing good interpretability, which can be useful for qualitative analyses of the results. We apply the proposed framework to analyze several binary- and count-valued real-world data sets.
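
The hypothetical NumPy sketch below shows only the modeling setup being described: a rank-$R$ CP decomposition passed through a logistic link for binary tensor entries. The Pólya-Gamma augmentation and online EM updates that make inference conjugate and scalable in the paper are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

I, J, K, R = 30, 20, 10, 5                 # hypothetical tensor dimensions and CP rank
U = rng.normal(0.0, 0.3, (I, R))
V = rng.normal(0.0, 0.3, (J, R))
W = rng.normal(0.0, 0.3, (K, R))

psi = np.einsum('ir,jr,kr->ijk', U, V, W)  # CP reconstruction of the natural parameter
prob = 1.0 / (1.0 + np.exp(-psi))          # logistic link: P(y_ijk = 1)
y = rng.binomial(1, prob)                  # simulated binary tensor

# In the paper, Polya-Gamma augmentation makes this logistic likelihood locally
# conjugate, so the factor matrices admit closed-form (online EM) updates.
```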

IJCAI Conference 2015 Conference Paper

Stick-Breaking Policy Learning in Dec-POMDPs

  • Miao Liu
  • Christopher Amato
  • Xuejun Liao
  • Lawrence Carin
  • Jonathan P. How

Expectation maximization (EM) has recently been shown to be an efficient algorithm for learning finite-state controllers (FSCs) in large decentralized POMDPs (Dec-POMDPs). However, current methods use fixed-size FSCs and often converge to maxima that are far from the optimal value. This paper represents the local policy of each agent using variable-sized FSCs that are constructed using a stick-breaking prior, leading to a new framework called decentralized stick-breaking policy representation (Dec-SBPR). This approach learns the controller parameters with a variational Bayesian algorithm without having to assume that the Dec-POMDP model is available. The performance of Dec-SBPR is demonstrated on several benchmark problems, showing that the algorithm scales to large problems while outperforming other state-of-the-art methods.

UAI Conference 2015 Conference Paper

Zero-Truncated Poisson Tensor Factorization for Massive Binary Tensors

  • Changwei Hu
  • Piyush Rai
  • Lawrence Carin

We present a scalable Bayesian model for low-rank factorization of massive tensors with binary observations. The proposed model has the following key properties: (1) in contrast to models based on the logistic or probit likelihood, using a zero-truncated Poisson likelihood for binary data allows our model to scale up in the number of ones in the tensor, which is especially appealing for massive but sparse binary tensors; (2) side-information in the form of binary pairwise relationships (e.g., an adjacency network) between objects in any tensor mode can also be leveraged, which can be especially useful in “cold-start” settings; and (3) the model admits simple Bayesian inference via batch as well as online MCMC; the latter allows scaling up even for dense binary data (i.e., when the number of ones in the tensor/network is also massive). In addition, non-negative factor matrices in our model provide easy interpretability, and the tensor rank can be inferred from the data. We evaluate our model on several large-scale real-world binary tensors, achieving excellent computational scalability, and also demonstrate its usefulness in leveraging side-information provided in the form of mode-network(s).

NeurIPS Conference 2014 Conference Paper

Analysis of Brain States from Multi-Region LFP Time-Series

  • Kyle Ulrich
  • David Carlson
  • Wenzhao Lian
  • Jana Borg
  • Kafui Dzirasa
  • Lawrence Carin

The local field potential (LFP) is a source of information about the broad patterns of brain activity, and the frequencies present in these time-series measurements are often highly correlated between regions. It is believed that these regions may jointly constitute a ``brain state,'' relating to cognition and behavior. An infinite hidden Markov model (iHMM) is proposed to model the evolution of brain states, based on electrophysiological LFP data measured at multiple brain regions. A brain state influences the spectral content of each region in the measured LFP. A new state-dependent tensor factorization is employed across brain regions, and the spectral properties of the LFPs are characterized in terms of Gaussian processes (GPs). The LFPs are modeled as a mixture of GPs, with state- and region-dependent mixture weights, and with the spectral content of the data encoded in GP spectral mixture covariance kernels. The model is able to estimate the number of brain states and the number of mixture components in the mixture of GPs. A new variational Bayesian split-merge algorithm is employed for inference. The model infers state changes as a function of external covariates in two novel electrophysiological datasets, using LFP data recorded simultaneously from multiple brain regions in mice; the results are validated and interpreted by subject-matter experts.

NeurIPS Conference 2014 Conference Paper

Bayesian Nonlinear Support Vector Machines and Discriminative Factor Modeling

  • Ricardo Henao
  • Xin Yuan
  • Lawrence Carin

A new Bayesian formulation is developed for nonlinear support vector machines (SVMs), based on a Gaussian process and with the SVM hinge loss expressed as a scaled mixture of normals. We then integrate the Bayesian SVM into a factor model, in which feature learning and nonlinear classifier design are performed jointly; almost all previous work on such discriminative feature learning has assumed a linear classifier. Inference is performed with expectation conditional maximization (ECM) and Markov Chain Monte Carlo (MCMC). An extensive set of experiments demonstrates the utility of using a nonlinear Bayesian SVM within discriminative feature learning and factor modeling, from the standpoints of accuracy and interpretability.
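
For intuition about "the SVM hinge loss expressed as a scaled mixture of normals", the snippet below numerically checks the standard identity $e^{-2\max(1-u,0)} = \int_0^\infty (2\pi\lambda)^{-1/2}\exp\!\big(-(1+\lambda-u)^2/(2\lambda)\big)\,d\lambda$ used in data-augmentation treatments of Bayesian SVMs; it is an illustration, not code from the paper.

```python
import numpy as np
from scipy.integrate import quad

def hinge_pseudolik(u):
    """exp(-2 * max(1 - u, 0)): the hinge loss as a pseudo-likelihood of the margin u."""
    return np.exp(-2.0 * max(1.0 - u, 0.0))

def normal_scale_mixture(u):
    """The same quantity written as an integral over a latent scale lambda."""
    integrand = lambda lam: np.exp(-(1.0 + lam - u) ** 2 / (2.0 * lam)) / np.sqrt(2.0 * np.pi * lam)
    value, _ = quad(integrand, 0.0, np.inf)
    return value

for u in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"u = {u:4.1f}  hinge: {hinge_pseudolik(u):.6f}  mixture: {normal_scale_mixture(u):.6f}")
```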

NeurIPS Conference 2014 Conference Paper

Compressive Sensing of Signals from a GMM with Sparse Precision Matrices

  • Jianbo Yang
  • Xuejun Liao
  • Minhua Chen
  • Lawrence Carin

This paper is concerned with compressive sensing of signals drawn from a Gaussian mixture model (GMM) with sparse precision matrices. Previous work has shown: (i) a signal drawn from a given GMM can be perfectly reconstructed from r noise-free measurements if the (dominant) rank of each covariance matrix is less than r; (ii) a sparse Gaussian graphical model can be efficiently estimated from fully-observed training signals using graphical lasso. This paper addresses a problem more challenging than both (i) and (ii), by assuming that the GMM is unknown and each signal is only partially observed through incomplete linear measurements. Under these challenging assumptions, we develop a hierarchical Bayesian method to simultaneously estimate the GMM and recover the signals using solely the incomplete measurements and a Bayesian shrinkage prior that promotes sparsity of the Gaussian precision matrices. In addition, we provide theoretical performance bounds to relate the reconstruction error to the number of signals for which measurements are available, the sparsity level of precision matrices, and the “incompleteness” of measurements. The proposed method is demonstrated extensively on compressive sensing of imagery and video, and the results with simulated and hardware-acquired real measurements show significant performance improvement over state-of-the-art methods.

NeurIPS Conference 2014 Conference Paper

Dynamic Rank Factor Model for Text Streams

  • Shaobo Han
  • Lin Du
  • Esther Salazar
  • Lawrence Carin

We propose a semi-parametric and dynamic rank factor model for topic modeling, capable of (1) discovering topic prevalence over time, and (2) learning contemporary multi-scale dependence structures, providing topic and word correlations as a byproduct. The high-dimensional and time-evolving ordinal/rank observations (such as word counts), after an arbitrary monotone transformation, are well accommodated through an underlying dynamic sparse factor model. The framework naturally admits heavy-tailed innovations, capable of inferring abrupt temporal jumps in the importance of topics. Posterior inference is performed through straightforward Gibbs sampling, based on the forward-filtering backward-sampling algorithm. Moreover, an efficient data subsampling scheme is leveraged to speed up inference on massive datasets. The modeling framework is illustrated on two real datasets: the US State of the Union Address and the JSTOR collection from Science.

ICML Conference 2014 Conference Paper

Modeling Correlated Arrival Events with Latent Semi-Markov Processes

  • Wenzhao Lian
  • Vinayak A. Rao
  • Brian Eriksson
  • Lawrence Carin

The analysis and characterization of correlated point process data has wide applications, ranging from biomedical research to network analysis. In this work, we model such data as generated by a latent collection of continuous-time binary semi-Markov processes, corresponding to external events appearing and disappearing. A continuous-time modeling framework is more appropriate for multichannel point process data than a binning approach requiring time discretization, and we show connections between our model and recent ideas from the discrete-time literature. We describe an efficient MCMC algorithm for posterior inference, and apply our ideas to both synthetic data and a real-world biometrics application.

ICML Conference 2014 Conference Paper

Nonlinear Information-Theoretic Compressive Measurement Design

  • Liming Wang 0004
  • Abolfazl Razi
  • Miguel Rodrigues 0001
  • A. Robert Calderbank
  • Lawrence Carin

We investigate design of general nonlinear functions for mapping high-dimensional data into a lower-dimensional (compressive) space. The nonlinear measurements are assumed contaminated by additive Gaussian noise. Depending on the application, we are either interested in recovering the high-dimensional data from the nonlinear compressive measurements, or performing classification directly based on these measurements. The latter case corresponds to classification based on nonlinearly constituted and noisy features. The nonlinear measurement functions are designed based on constrained mutual-information optimization. New analytic results are developed for the gradient of mutual information in this setting, for arbitrary input-signal statistics. We make connections to kernel-based methods, such as the support vector machine. Encouraging results are presented on multiple datasets, for both signal recovery and classification. The nonlinear approach is shown to be particularly valuable in high-noise scenarios.

NeurIPS Conference 2014 Conference Paper

On the relations of LFPs & Neural Spike Trains

  • David Carlson
  • Jana Schaich Borg
  • Kafui Dzirasa
  • Lawrence Carin

One of the goals of neuroscience is to identify neural networks that correlate with important behaviors, environments, or genotypes. This work proposes a strategy for identifying neural networks characterized by time- and frequency-dependent connectivity patterns, using convolutional dictionary learning that links spike-train data to local field potentials (LFPs) across multiple areas of the brain. Analytical contributions are: (i) modeling dynamic relationships between LFPs and spikes; (ii) describing the relationships between spikes and LFPs, by analyzing the ability to predict LFP data from one region based on spiking information from across the brain; and (iii) development of a clustering methodology that allows inference of similarities in neurons from multiple regions. Results are based on data sets in which spike and LFP data are recorded simultaneously from up to 16 brain regions in a mouse.

ICML Conference 2014 Conference Paper

Scalable Bayesian Low-Rank Decomposition of Incomplete Multiway Tensors

  • Piyush Rai
  • Yingjian Wang 0004
  • Shengbo Guo
  • Gary Chen
  • David B. Dunson
  • Lawrence Carin

We present a scalable Bayesian framework for low-rank decomposition of multiway tensor data with missing observations. The key issue of pre-specifying the rank of the decomposition is sidestepped in a principled manner using a multiplicative gamma process prior. Both continuous and binary data can be analyzed under the framework, in a coherent way using fully conjugate Bayesian analysis. In particular, the analysis in the non-conjugate binary case is facilitated via the use of the Pólya-Gamma sampling strategy which elicits closed-form Gibbs sampling updates. The resulting samplers are efficient and enable us to apply our framework to large-scale problems, with time-complexity that is linear in the number of observed entries in the tensor. This is especially attractive in analyzing very large but sparsely observed tensors with very few known entries. Moreover, our method admits easy extension to the supervised setting where entities in one or more tensor modes have labels. Our method outperforms several state-of-the-art tensor decomposition methods on various synthetic and benchmark real-world datasets.

NeurIPS Conference 2013 Conference Paper

Designed Measurements for Vector Count Data

  • Liming Wang
  • David Carlson
  • Miguel Rodrigues
  • David Wilcox
  • Robert Calderbank
  • Lawrence Carin

We consider design of linear projection measurements for a vector Poisson signal model. The projections are performed on the vector Poisson rate, $X\in\mathbb{R}_+^n$, and the observed data are a vector of counts, $Y\in\mathbb{Z}_+^m$. The projection matrix is designed by maximizing mutual information between $Y$ and $X$, $I(Y; X)$. When there is a latent class label $C\in\{1, \dots, L\}$ associated with $X$, we consider the mutual information with respect to $Y$ and $C$, $I(Y; C)$. New analytic expressions for the gradient of $I(Y; X)$ and $I(Y; C)$ are presented, with the gradient taken with respect to the measurement matrix. Connections are made to the more widely studied Gaussian measurement model. Example results are presented for compressive topic modeling of a document corpus (word counting), and hyperspectral compressive sensing for chemical classification (photon counting).

NeurIPS Conference 2013 Conference Paper

Dynamic Clustering via Asymptotics of the Dependent Dirichlet Process Mixture

  • Trevor Campbell
  • Miao Liu
  • Brian Kulis
  • Jonathan How
  • Lawrence Carin

This paper presents a novel algorithm, based upon the dependent Dirichlet process mixture model (DDPMM), for clustering batch-sequential data containing an unknown number of evolving clusters. The algorithm is derived via a low-variance asymptotic analysis of the Gibbs sampling algorithm for the DDPMM, and provides a hard clustering with convergence guarantees similar to those of the k-means algorithm. Empirical results from a synthetic test with moving Gaussian clusters and a test with real ADS-B aircraft trajectory data demonstrate that the algorithm requires orders of magnitude less computational time than contemporary probabilistic and hard clustering algorithms, while providing higher accuracy on the examined datasets.
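
As a rough sketch of what low-variance asymptotics of a Dirichlet process mixture yields in the static case, the code below runs a DP-means-style hard clustering: k-means-like updates, except that a point farther than a threshold from every current center spawns a new cluster. The paper's algorithm extends this idea to batch-sequential data with evolving clusters, which this sketch does not cover; the toy data and the penalty `lam` are hypothetical.

```python
import numpy as np

def dp_means(X, lam, n_iter=25):
    """Hard clustering with a cluster-creation penalty lam (squared-distance units)."""
    centers = [X[0].copy()]
    assign = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        for i, x in enumerate(X):                        # assignment step
            d2 = np.array([np.sum((x - c) ** 2) for c in centers])
            j = int(d2.argmin())
            if d2[j] > lam:                              # too far from every center: new cluster
                centers.append(x.copy())
                j = len(centers) - 1
            assign[i] = j
        centers = [X[assign == j].mean(axis=0) if np.any(assign == j) else centers[j]
                   for j in range(len(centers))]         # update step
    return np.array(centers), assign

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(50, 2)) for m in (-2.0, 0.0, 2.0)])
centers, labels = dp_means(X, lam=1.0)
print("clusters found:", len(centers))
```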

ICML Conference 2013 Conference Paper

Exploring the Mind: Integrating Questionnaires and fMRI

  • Esther Salazar
  • Ryan Bogdan
  • Adam Gorka
  • Ahmad Hariri
  • Lawrence Carin

A new model is developed for joint analysis of ordered, categorical, real and count data. The ordered and categorical data are answers to questionnaires, the (word) count data correspond to the text questions from the questionnaires, and the real data correspond to fMRI responses for each subject. The Bayesian model employs the von Mises distribution in a novel manner to infer sparse graphical models jointly across people, questions, fMRI stimuli and brain region, with this integrated within a new matrix factorization based on latent binary features. The model is compared with simpler alternatives on two real datasets. We also demonstrate the ability to predict the response of the brain to visual stimuli (as measured by fMRI), based on knowledge of how the associated person answered classical questionnaires.

NeurIPS Conference 2013 Conference Paper

Integrated Non-Factorized Variational Inference

  • Shaobo Han
  • Xuejun Liao
  • Lawrence Carin

We present a non-factorized variational method for full posterior inference in Bayesian hierarchical models, with the goal of capturing the posterior variable dependencies via efficient and possibly parallel computation. Our approach unifies the integrated nested Laplace approximation (INLA) under the variational framework. The proposed method is applicable in more challenging scenarios than typically assumed by INLA, such as Bayesian Lasso, which is characterized by the non-differentiability of the $\ell_{1}$ norm arising from independent Laplace priors. We derive an upper bound for the Kullback-Leibler divergence, which yields a fast closed-form solution via decoupled optimization. Our method is a reliable analytic alternative to Markov chain Monte Carlo (MCMC), and it results in a tighter evidence lower bound than that of mean-field variational Bayes (VB) method.

IJCAI Conference 2013 Conference Paper

Online Expectation Maximization for Reinforcement Learning in POMDPs

  • Miao Liu
  • Xuejun Liao
  • Lawrence Carin

We present online nested expectation maximization for model-free reinforcement learning in a POMDP. The algorithm evaluates the policy only in the current learning episode, discarding the episode after the evaluation and memorizing the sufficient statistic, from which the policy is computed in closed form. As a result, the online algorithm has a time complexity of $O(n)$ and a memory complexity of $O(1)$, compared to $O(n^2)$ and $O(n)$ for the corresponding batch-mode algorithm, where $n$ is the number of learning episodes. The online algorithm, which has provable convergence, is demonstrated on five benchmark POMDP problems.

NeurIPS Conference 2013 Conference Paper

Real-Time Inference for a Gamma Process Model of Neural Spiking

  • David Carlson
  • Vinayak Rao
  • Joshua Vogelstein
  • Lawrence Carin

With simultaneous measurements from ever-increasing populations of neurons, there is a growing need for sophisticated tools to recover signals from individual neurons. In electrophysiology experiments, this classically proceeds in a two-step process: (i) threshold the waveforms to detect putative spikes and (ii) cluster the waveforms into single units (neurons). We extend previous Bayesian nonparametric models of neural spiking to jointly detect and cluster neurons using a Gamma process model. Importantly, we develop an online approximate inference scheme enabling real-time analysis, with performance exceeding the previous state-of-the-art. Via exploratory data analysis—using data with partial ground truth as well as two novel data sets—we find several features of our model collectively contribute to our improved performance, including: (i) accounting for colored noise, (ii) detecting overlapping spikes, (iii) tracking waveform dynamics, and (iv) using multiple channels. We hope to enable novel experiments simultaneously measuring many thousands of neurons and possibly adapting stimuli dynamically to probe ever deeper into the mysteries of the brain.

NeurIPS Conference 2012 Conference Paper

Augment-and-Conquer Negative Binomial Processes

  • Mingyuan Zhou
  • Lawrence Carin

By developing data augmentation methods unique to the negative binomial (NB) distribution, we unite seemingly disjoint count and mixture models under the NB process framework. We develop fundamental properties of the models and derive efficient Gibbs sampling inference. We show that the gamma-NB process can be reduced to the hierarchical Dirichlet process with normalization, highlighting its unique theoretical, structural and computational advantages. A variety of NB processes with distinct sharing mechanisms are constructed and applied to topic modeling, with connections to existing algorithms, showing the importance of inferring both the NB dispersion and probability parameters.

NeurIPS Conference 2012 Conference Paper

Joint Modeling of a Matrix with Associated Text via Latent Binary Features

  • Xianxing Zhang
  • Lawrence Carin

A new methodology is developed for joint analysis of a matrix and accompanying documents, with the documents associated with the matrix rows/columns. The documents are modeled with a focused topic model, inferring latent binary features (topics) for each document. A new matrix decomposition is developed, with latent binary features associated with the rows/columns, and with imposition of a low-rank constraint. The matrix decomposition and topic model are coupled by sharing the latent binary feature vectors associated with each. The model is applied to roll-call data, with the associated documents defined by the legislation. State-of-the-art results are manifested for prediction of votes on a new piece of legislation, based only on the observed text legislation. The coupling of the text and legislation is also demonstrated to yield insight into the properties of the matrix decomposition for roll-call data.

UAI Conference 2012 Conference Paper

Nested Dictionary Learning for Hierarchical Organization of Imagery and Text

  • Lingbo Li 0002
  • XianXing Zhang
  • Mingyuan Zhou
  • Lawrence Carin

A tree-based dictionary learning model is developed for joint analysis of imagery and associated text. The dictionary learning may be applied directly to the imagery from patches, or to general feature vectors extracted from patches or superpixels (using any existing method for image feature extraction). Each image is associated with a path through the tree (from root to a leaf), and each of the multiple patches in a given image is associated with one node in that path. Nodes near the tree root are shared between multiple paths, representing image characteristics that are common among different types of images. Moving toward the leaves, nodes become specialized, representing details in image classes. If available, words (text) are also jointly modeled, with a path-dependent probability over words. The tree structure is inferred via a nested Dirichlet process, and a retrospective stick-breaking sampler is used to infer the tree depth and width.

NeurIPS Conference 2011 Conference Paper

Hierarchical Topic Modeling for Analysis of Time-Evolving Personal Choices

  • Xianxing Zhang
  • Lawrence Carin
  • David Dunson

The nested Chinese restaurant process is extended to design a nonparametric topic-model tree for representation of human choices. Each tree branch corresponds to a type of person, and each node (topic) has a corresponding probability vector over items that may be selected. The observed data are assumed to have associated temporal covariates (corresponding to the time at which choices are made), and we wish to impose that with increasing time it is more probable that topics deeper in the tree are utilized. This structure is imposed by developing a new “change point” stick-breaking model that is coupled with a Poisson and product-of-gammas construction. To share topics across the tree nodes, topic distributions are drawn from a Dirichlet process. As a demonstration of this concept, we analyze real data on course selections of undergraduate students at Duke University, with the goal of uncovering and concisely representing structure in the curriculum and in the characteristics of the student body.

JMLR Journal 2011 Journal Article

Logistic Stick-Breaking Process

  • Lu Ren
  • Lan Du
  • Lawrence Carin
  • David Dunson

A logistic stick-breaking process (LSBP) is proposed for non-parametric clustering of general spatially- or temporally-dependent data, imposing the belief that proximate data are more likely to be clustered together. The sticks in the LSBP are realized via multiple logistic regression functions, with shrinkage priors employed to favor contiguous and spatially localized segments. The LSBP is also extended for the simultaneous processing of multiple data sets, yielding a hierarchical logistic stick-breaking process (H-LSBP). The model parameters (atoms) within the H-LSBP are shared across the multiple learning tasks. Efficient variational Bayesian inference is derived, and comparisons are made to related techniques in the literature. Experimental analysis is performed for audio waveforms and images, and it is demonstrated that for segmentation applications the LSBP yields generally homogeneous segments with sharp boundaries.

NeurIPS Conference 2011 Conference Paper

On the Analysis of Multi-Channel Neural Spike Data

  • Bo Chen
  • David Carlson
  • Lawrence Carin

Nonparametric Bayesian methods are developed for analysis of multi-channel spike-train data, with the feature learning and spike sorting performed jointly. The feature learning and sorting are performed simultaneously across all channels. Dictionary learning is implemented via the beta-Bernoulli process, with spike sorting performed via the dynamic hierarchical Dirichlet process (dHDP), with these two models coupled. The dHDP is augmented to eliminate refractory-period violations; it allows the “appearance” and “disappearance” of neurons over time, and it models smooth variation in the spike statistics.

NeurIPS Conference 2011 Conference Paper

The Kernel Beta Process

  • Lu Ren
  • Yingjian Wang
  • Lawrence Carin
  • David Dunson

A new Lévy process prior is proposed for an uncountable collection of covariate-dependent feature-learning measures; the model is called the kernel beta process (KBP). Available covariates are handled efficiently via the kernel construction, with covariates assumed observed with each data sample (“customer”), and latent covariates learned for each feature (“dish”). Each customer selects dishes from an infinite buffet, in a manner analogous to the beta process, with the added constraint that a customer first decides probabilistically whether to “consider” a dish, based on the distance in covariate space between the customer and dish. If a customer does consider a particular dish, that dish is then selected probabilistically as in the beta process. The beta process is recovered as a limiting case of the KBP. An efficient Gibbs sampler is developed for computations, and state-of-the-art results are presented for image processing and music analysis tasks.

JMLR Journal 2010 Journal Article

Classification with Incomplete Data Using Dirichlet Process Priors

  • Chunping Wang
  • Xuejun Liao
  • Lawrence Carin
  • David B. Dunson

A non-parametric hierarchical Bayesian framework is developed for designing a classifier, based on a mixture of simple (linear) classifiers. Each simple classifier is termed a local "expert", and the number of experts and their construction are manifested via a Dirichlet process formulation. The simple form of the "experts" allows analytical handling of incomplete data. The model is extended to allow simultaneous design of classifiers on multiple data sets, termed multi-task learning, with this also performed non-parametrically via the Dirichlet process. Fast inference is performed using variational Bayesian (VB) analysis, and example results are presented for several data sets. We also perform inference via Gibbs sampling, to which we compare the VB results.

NeurIPS Conference 2010 Conference Paper

Joint Analysis of Time-Evolving Binary Matrices and Associated Documents

  • Eric Wang
  • Dehong Liu
  • Jorge Silva
  • Lawrence Carin
  • David Dunson

We consider problems for which one has incomplete binary matrices that evolve with time (e.g., the votes of legislators on particular legislation, with each year characterized by a different such matrix). An objective of such analysis is to infer structure and inter-relationships underlying the matrices, here defined by latent features associated with each axis of the matrix. In addition, it is assumed that documents are available for the entities associated with at least one of the matrix axes. By jointly analyzing the matrices and documents, one may be used to inform the other within the analysis, and the model offers the opportunity to predict matrix values (e.g., votes) based only on an associated document (e.g., legislation). The research presented here merges two areas of machine-learning that have previously been investigated separately: incomplete-matrix analysis and topic modeling. The analysis is performed from a Bayesian perspective, with efficient inference constituted via Gibbs sampling. The framework is demonstrated by considering all voting data and available documents (legislation) during the 220-year lifetime of the United States Senate and House of Representatives.

NeurIPS Conference 2009 Conference Paper

A Bayesian Model for Simultaneous Image Clustering, Annotation and Object Segmentation

  • Lan Du
  • Lu Ren
  • Lawrence Carin
  • David Dunson

A non-parametric Bayesian model is proposed for processing multiple images. The analysis employs image features and, when present, the words associated with accompanying annotations. The model clusters the images into classes, and each image is segmented into a set of objects, also allowing the opportunity to assign a word to each object (localized labeling). Each object is assumed to be represented as a heterogeneous mix of components, with this realized via mixture models linking image features to object types. The number of image classes, number of object types, and the characteristics of the object-feature mixture models are inferred non-parametrically. To constitute spatially contiguous objects, a new logistic stick-breaking process is developed. Inference is performed efficiently via variational Bayesian analysis, with example results presented on two image databases.

NeurIPS Conference 2009 Conference Paper

Learning to Explore and Exploit in POMDPs

  • Chenghui Cai
  • Xuejun Liao
  • Lawrence Carin

A fundamental objective in reinforcement learning is the maintenance of a proper balance between exploration and exploitation. This problem becomes more challenging when the agent can only partially observe the states of its environment. In this paper we propose a dual-policy method for jointly learning the agent behavior and the balance between exploration and exploitation, in partially observable environments. The method subsumes traditional exploration, in which the agent takes actions to gather information about the environment, and active learning, in which the agent queries an oracle for optimal actions (with an associated cost for employing the oracle). The form of the employed exploration is dictated by the specific problem. Theoretical guarantees are provided concerning the optimality of the balancing of exploration and exploitation. The effectiveness of the method is demonstrated by experimental results on benchmark problems.

JMLR Journal 2009 Journal Article

Multi-task Reinforcement Learning in Partially Observable Stochastic Environments

  • Hui Li
  • Xuejun Liao
  • Lawrence Carin

We consider the problem of multi-task reinforcement learning (MTRL) in multiple partially observable stochastic environments. We introduce the regionalized policy representation (RPR) to characterize the agent's behavior in each environment. The RPR is a parametric model of the conditional distribution over current actions given the history of past actions and observations; the agent's choice of actions is directly based on this conditional distribution, without an intervening model to characterize the environment itself. We propose off-policy batch algorithms to learn the parameters of the RPRs, using episodic data collected when following a behavior policy, and show their linkage to policy iteration. We employ the Dirichlet process as a nonparametric prior over the RPRs across multiple environments. The intrinsic clustering property of the Dirichlet process imposes sharing of episodes among similar environments, which effectively reduces the number of episodes required for learning a good policy in each environment, when data sharing is appropriate. The number of distinct RPRs and the associated clusters (the sharing patterns) are automatically discovered by exploiting the episodic data as well as the nonparametric nature of the Dirichlet process. We demonstrate the effectiveness of the proposed RPR as well as the RPR-based MTRL framework on various problems, including grid-world navigation and multi-aspect target classification. The experimental results show that the RPR is a competitive reinforcement learning algorithm in partially observable domains, and the MTRL consistently achieves better performance than single-task reinforcement learning.

NeurIPS Conference 2009 Conference Paper

Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations

  • Mingyuan Zhou
  • Haojun Chen
  • Lu Ren
  • Guillermo Sapiro
  • Lawrence Carin
  • John Paisley

Non-parametric Bayesian techniques are considered for learning dictionaries for sparse image representations, with applications in denoising, inpainting and compressive sensing (CS). The beta process is employed as a prior for learning the dictionary, and this non-parametric method naturally infers an appropriate dictionary size. The Dirichlet process and a probit stick-breaking process are also considered to exploit structure within an image. The proposed method can learn a sparse dictionary in situ; training images may be exploited if available, but they are not required. Further, the noise variance need not be known, and can be non-stationary. Another virtue of the proposed method is that sequential inference can be readily employed, thereby allowing scaling to large images. Several example results are presented, using both Gibbs and variational Bayesian inference, with comparisons to other state-of-the-art approaches.

ICML Conference 2009 Conference Paper

Nonparametric factor analysis with beta process priors

  • John Paisley
  • Lawrence Carin

We propose a nonparametric extension to the factor analysis problem using a beta process prior. This beta process factor analysis (BP-FA) model allows for a dataset to be decomposed into a linear combination of a sparse set of factors, providing information on the underlying structure of the observations. As with the Dirichlet process, the beta process is a fully Bayesian conjugate prior, which allows for analytical posterior calculation and straightforward inference. We derive a variational Bayes inference algorithm and demonstrate the model on the MNIST digits and HGDP-CEPH cell line panel datasets.
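
A minimal generative sketch of a finite ($K$-truncated) approximation to the BP-FA construction described above, with hypothetical dimensions; the variational Bayes inference algorithm is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D, K = 200, 16, 30                               # observations, dimension, truncation level
a, b = 1.0, 1.0                                     # beta process parameters

pi = rng.beta(a / K, b * (K - 1) / K, size=K)       # finite approximation to the beta process
Phi = rng.normal(0.0, 1.0, size=(K, D))             # factor loadings
Z = rng.binomial(1, pi, size=(N, K))                # sparse binary factor usage per observation
W = rng.normal(0.0, 1.0, size=(N, K))               # factor weights
X = (Z * W) @ Phi + rng.normal(0.0, 0.1, (N, D))    # data: sparse linear combination of factors

print("mean number of active factors per observation:", Z.sum(axis=1).mean())
```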

ICML Conference 2007 Conference Paper

Bayesian compressive sensing and projection optimization

  • Shihao Ji 0001
  • Lawrence Carin

This paper introduces a new problem for which machine-learning tools may make an impact. The problem considered is termed "compressive sensing", in which a real signal of dimension N is measured accurately based on K << N real measurements. This is achieved under the assumption that the underlying signal has a sparse representation in some basis (e.g., wavelets). In this paper we demonstrate how techniques developed in machine learning, specifically sparse Bayesian regression and active learning, may be leveraged to this new problem. We also point out future research directions in compressive sensing of interest to the machine-learning community.
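
A minimal sketch of the compressive-sensing setup being described: a length-N sparse signal observed through K << N noisy random projections, recovered here with scikit-learn's Lasso as a stand-in solver. The paper instead develops sparse Bayesian regression and adaptive projection design; the sizes and regularization weight below are hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

N, K, n_nonzero = 512, 100, 10                        # signal length, measurements, sparsity
x = np.zeros(N)
x[rng.choice(N, n_nonzero, replace=False)] = rng.normal(0.0, 1.0, n_nonzero)

Phi = rng.normal(0.0, 1.0 / np.sqrt(K), size=(K, N))  # random measurement matrix
y = Phi @ x + 0.01 * rng.normal(size=K)               # noisy compressive measurements

# Stand-in sparse recovery; the paper uses sparse Bayesian regression instead,
# which additionally provides posterior uncertainty used for adaptive design.
x_hat = Lasso(alpha=1e-3, max_iter=50_000).fit(Phi, y).coef_
print("relative reconstruction error:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```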

JMLR Journal 2007 Journal Article

Multi-Task Learning for Classification with Dirichlet Process Priors

  • Ya Xue
  • Xuejun Liao
  • Lawrence Carin
  • Balaji Krishnapuram

Consider the problem of learning logistic-regression models for multiple classification tasks, where the training data set for each task is not drawn from the same statistical distribution. In such a multi-task learning (MTL) scenario, it is necessary to identify groups of similar tasks that should be learned jointly. Relying on a Dirichlet process (DP) based statistical model to learn the extent of similarity between classification tasks, we develop computationally efficient algorithms for two different forms of the MTL problem. First, we consider a symmetric multi-task learning (SMTL) situation in which classifiers for multiple tasks are learned jointly using a variational Bayesian (VB) algorithm. Second, we consider an asymmetric multi-task learning (AMTL) formulation in which the posterior density function from the SMTL model parameters (from previous tasks) is used as a prior for a new task: this approach has the significant advantage of not requiring storage and use of all previous data from prior tasks. The AMTL formulation is solved with a simple Markov Chain Monte Carlo (MCMC) construction. Experimental results on two real-life MTL problems indicate that the proposed algorithms: (a) automatically identify subgroups of related tasks whose training data appear to be drawn from similar distributions; and (b) are more accurate than simpler approaches such as single-task learning, pooling of data across all tasks, and simplified approximations to DP.

NeurIPS Conference 2007 Conference Paper

Semi-Supervised Multitask Learning

  • Qiuhua Liu
  • Xuejun Liao
  • Lawrence Carin

A semi-supervised multitask learning (MTL) framework is presented, in which M parameterized semi-supervised classifiers, each associated with one of M partially labeled data manifolds, are learned jointly under the constraint of a soft-sharing prior imposed over the parameters of the classifiers. The unlabeled data are utilized by basing classifier learning on neighborhoods, induced by a Markov random walk over a graph representation of each manifold. Experimental results on real data sets demonstrate that semi-supervised MTL yields significant improvements in generalization performance over either semi-supervised single-task learning (STL) or supervised MTL.

ICML Conference 2005 Conference Paper

Incomplete-data classification using logistic regression

  • David Williams
  • Xuejun Liao
  • Ya Xue
  • Lawrence Carin

A logistic regression classification algorithm is developed for problems in which the feature vectors may be missing data (features). Single or multiple imputation for the missing data is avoided by performing analytic integration with an estimated conditional density function (conditioned on the non-missing data). Conditional density functions are estimated using a Gaussian mixture model (GMM), with parameter estimation performed using both expectation maximization (EM) and Variational Bayesian EM (VB-EM). Using widely available real data, we demonstrate the general advantage of the VB-EM GMM estimation for handling incomplete data, vis-à-vis the EM algorithm. Moreover, it is demonstrated that the approach proposed here is generally superior to standard imputation procedures.

NeurIPS Conference 2005 Conference Paper

Radial Basis Function Network for Multi-task Learning

  • Xuejun Liao
  • Lawrence Carin

We extend radial basis function (RBF) networks to the scenario in which multiple correlated tasks are learned simultaneously, and present the corresponding learning algorithms. We develop the algorithms for learning the network structure, in either a supervised or unsupervised manner. Training data may also be actively selected to improve the network’s generalization to test data. Experimental results based on real data demonstrate the advantage of the proposed algorithms and support our conclusions.

NeurIPS Conference 2004 Conference Paper

On Semi-Supervised Classification

  • Balaji Krishnapuram
  • David Williams
  • Ya Xue
  • Lawrence Carin
  • Mário Figueiredo
  • Alexander Hartemink

A graph-based prior is proposed for parametric semi-supervised classi- fication. The prior utilizes both labelled and unlabelled data; it also in- tegrates features from multiple views of a given sample (e. g. , multiple sensors), thus implementing a Bayesian form of co-training. An EM algorithm for training the classifier automatically adjusts the tradeoff be- tween the contributions of: (a) the labelled data; (b) the unlabelled data; and (c) the co-training information. Active label query selection is per- formed using a mutual information based criterion that explicitly uses the unlabelled data and the co-training information. Encouraging results are presented on public benchmarks and on measured data from single and multiple sensors. 1 Introduction In many pattern classification problems, the acquisition of labelled training data is costly and/or time consuming, whereas unlabelled samples can be obtained easily. Semi- supervised algorithms that learn from both labelled and unlabelled samples have been the focus of much research in the last few years; a comprehensive review up to 2001 can be found in [13], while more recent references include [1, 2, 6, 7, 1618]. Most recent semi-supervised learning algorithms work by formulating the assumption that "nearby" points, and points in the same structure (e. g. , cluster), should have similar labels [6, 7, 16]. This can be seen as a form of regularization, pushing the class boundaries toward regions of low data density. This regularization is often implemented by associating the vertices of a graph to all the (labelled and unlabelled) samples, and then formulating the problem on the vertices of the graph [6, 1618]. While current graph-based algorithms are inherently transductive -- i. e. , they cannot be used directly to classify samples not present when training -- our classifier is paramet- ric and the learned classifier can be used directly on new samples. Furthermore, our al- gorithm is trained discriminatively by maximizing a concave objective function; thus we avoid thorny local maxima issues that plague many earlier methods. Unlike existing methods, our algorithm automatically learns the relative importance of the labelled and unlabelled data. When multiple views of the same sample are provided (e. g. features from different sensors), we develop a new Bayesian form of co-training [4]. In addition, we also show how to exploit the unlabelled data and the redundant views of the sample (from co-training) in order to improve active label query selection [15]. The paper is organized as follows. Sec. 2 briefly reviews multinomial logistic regression. Sec. 3 describes the priors for semi-supervised learning and co-training. The EM algorithm derived to learn the classifiers is presented in Sec. 4. Active label selection is discussed in Sec. 5. Experimental results are shown in Sec. 6, followed by conclusions in Sec. 7. 2 Multinomial Logistic Regression In an m-class supervised learning problem, one is given a labelled training set DL = {(x d 1, y ), .. ., (x )} 1 L, yL, where xi R is a feature vector and yi the corresponding (1) (m) class label. In "1-of-m" encoding, y = [y, .. ., y ] i is a binary vector, such that i i (c) (j) y = 1 and y = 0, for j = c, indicates that sample i belongs to class c. In multinomial i i logistic regression [5], the posterior class probabilities are modelled as log P (y(c) = 1|x) = xT w(c) - log m exp(xT w(k)), for c = 1, .. ., m, (1) k=1 where w(c) d R is the class-c weight vector. 
Notice that since m P (y(c)= 1|x) = 1, c=1 one of the weight vectors is redundant; we arbitrarily choose to set w(m) = 0, and consider the (d (m-1))-dimensional vector w = [(w(1))T, .. ., (w(m-1))T ]T. Estimation of w may be achieved by maximizing the log-likelihood (with Y {y, .. ., y } 1 L ) [5] (c) (w) log P (Y|w) = L m y xT w(c) - log m exp(xT w(j)). (2) i=1 c=1 i i j=1 i In the presence of a prior p(w), we seek a maximum a posteriori (MAP) estimate, w = arg max { (w) + log p(w)} w. Actually, if the training data is separable, (w) is unbounded, and a prior is crucial. Although we focus on linear classifiers, we may see the d-dimensional feature vectors x as having resulted from some deterministic, maybe nonlinear, transformation of an input raw feature vector r; e. g. , in a kernel classifier, xi = [1, K(ri, r1), .. ., K(ri, rL)] (d = L + 1). 3 Graph-Based Data-Dependent Priors 3. 1 Graph Laplacians and Regularization for Semi-Supervised Learning Consider a scalar function f = [f1, .. ., f|V |]T, defined on the set V = {1, 2, .. ., |V |} of vertices of an undirected graph (V, E). Each edge of the graph, joining vertices i and j, is given a weight kij = kji 0, and we collect all the weights in a |V | |V | matrix K. A natural way to measure how much f varies across the graph is by the quantity kij(fi - fj)2 = 2 f T f, (3) i j where = diag{ k k j 1j, .. ., j |V |j } - K is the so-called graph Laplacian [2]. Notice that kij 0 (for all i, j) guarantees that is positive semi-definite and also that has (at least) one null eigenvalue (1T1 = 0, where 1 has all elements equal to one). In semi-supervised learning, in addition to DL, we are given U unlabelled samples DU = {xL+1, .. ., xL+U }. To use (3) for semi-supervised learning, the usual choice is to assign one vertex of the graph to each sample in X = [x1, .. ., xL+U ]T (thus |V | = L + U ), and to let kij represent some (non-negative) measure of "similarity" between xi and xj. A Gaussian random field (GRF) is defined on the vertices of V (with inverse variance ) p(f ) exp{- f T f /2}, in which configurations that vary more (according to (3)) are less probable. Most graph- based approaches estimate the values of f, given the labels, using p(f ) (or some modifica- tion thereof) as a prior. Accordingly, they work in a strictly transductive manner. 3. 2 Non-Transductive Semi-Supervised Learning We first consider two-class problems (m = 2, thus w d R ). In contrast to previous uses of graph-based priors, we define f as the real function f (defined over the entire observation space) evaluated at the graph nodes. Specifically, f is defined as a linear function of x, and at the graph node i, fi f (xi) = wT xi. Then, f = [f1, .. ., f|V |]T = Xw, and p(f ) induces a Gaussian prior on w, with precision matrix A = XT X, p(w) exp{-(/2) wT XT Xw} = exp{-(/2) wT Aw}. (4) Notice that since is singular, A may also be singular, and the corresponding prior may therefore be improper. This is no problem for MAP estimation of w because (as is well known) the normalization factor of the prior plays no role in this estimate. If we include extra regularization, by adding a non-negative diagonal matrix to A, the prior becomes p(w) exp -(1/2) wT (0A + ) w, (5) where we may choose = diag{1, .. ., d}, = 1I, or even = 0. For m > 2, we define (m-1) identical independent priors, one for each w(c), c = 1, .. ., m. The joint prior on w = [(w(1))T, .. 
., (w(m-1))T ]T is then m-1 1 (c) 1 p(w|) exp{- (w(c))T A + (c) w(c)} = exp{- wT ()w}, (6) 2 0 2 c=1 (c) (c) (c) where is a vector containing all the parameters, (c) = diag{, .. ., }, and i 1 d (1) (m-1) () = diag{, .. ., } A + block-diag{(1), .. ., (m-1)}. (7) 0 0 Finally, since all the 's are inverses of variances, the conjugate priors are Gamma [3]: (c) (c) (c) (c) p( | | | | 0 0, 0) = Ga(0 0, 0), and p(i 1, 1) = Ga(i 1, 1), for c = 1, .. ., m - 1 and i = 1, .. ., d. Usually, 0, 0, 1, and 1 are given small values indicating diffuse priors. In the zero limit, we obtain scale-invariant (improper) Jeffreys hyper-priors. Summarizing, our model for semi-supervised learning includes the log-likelihood (2), a prior (6), and Gamma hyper-priors. In Section 4, we present a simple and computationally efficient expectation-maximization (EM) algorithm for obtaining the MAP estimate of w. 3. 3 Exploiting Features from Multiple Sensors: The Co-Training Prior In some applications several sensors are available, each providing a different set of features. For simplicity, we assume two sensors s {1, 2}, but everything discussed here is easily (s) extended to any number of sensors. Denote the features from sensor s, for sample i, as x, i and Ss as the set of sample indices for which we have features from sensor s (S1 S2 = {1, .. ., L + U }). Let O = S1 S2 be the indices for which both sensors are available, and OU = O {L + 1, .. ., L + U } the unlabelled subset of O. By using the samples in S1 and S2 as two independent training sets, we may obtain two sep- arate classifiers (denoted w1 and w2). However, we can coordinate the information from both sensors by using an idea known as co-training [4]: on the OU samples, classifiers w1 and w2 should agree as much as possible. Notice that, in a logistic regression framework, the disagreement between the two classifiers on the OU samples can be measured by (1) (2) [(w1)T x - (w2)T x ]2 = T C, (8) iOU i i where = [(w1)T (w2)T ]T and C = [(x1)T (-x2)T ]T [(x1)T (-x2)T ]. This iOU i i i i suggests the "co-training prior" (where co is an inverse variance): p(w1, w2) = p() exp -(co/2) TC. (9) This Gaussian prior can be combined with two smoothness Gaussian priors on w1 and w2 (obtained as described in Section 3. 2); this leads to a prior which is still Gaussian, p(w1, w2) = p() exp -(1/2) T coC + block-diag{1, 2}, (10) where 1 and 2 are the two graph-based precision matrices (see (7)) for w1 and w2. We can again adopt a Gamma hyper-prior for co. Under this prior, and with a logistic regression likelihood as above, estimates of w1 and w2 can easily be found using minor modifications to the EM algorithm described in Section 4. Computationally, this is only slightly more expensive than separately training the two classifiers. 4 Learning Via EM To find the MAP estimate w, we use the EM algorithm, with as missing data, which is equivalent to integrating out from the full posterior before maximization [8]. For simplicity, we will only describe the single sensor case (no co-training). E-step: We compute the expected value of the complete log-posterior, given Y and the current parameter estimate w: Q(w|w) E[log p(w, |Y)|w]. Since log p(w, |Y) = log p(Y|w) - (1/2)wT ()w + K, (11) (where K collects all terms independent of w) is linear w. r. t. all the parameters (see (6) and (7)), we just have to plug their conditional expectations into (11): Q(w|w) = log p(Y|w) - (1/2)wT E[()|w] w = (w) - (1/2)wT (w) w. 
(12) We consider several different choices for the structure of the matrix. The necessary expectations have well-known closed forms, due to the use of conjugate Gamma hyper- (c) priors [3]. For example, if the are m - 1 free non-negative parameters, we have 0 (c) (c) E[ |w] = (2 0 0 0 + d) [2 0 + (w(c))T Aw(c)]-1. (c) for c = 1, .. ., m - 1. For = 0 0, we still have a simple closed-form expres- (c) sion for E[0|w], and the same is true for the parameters, for i > 0. Finally, i (w) E[()|w] results from replacing the 's in (7) by the corresponding conditional expectations. M-step: Given matrix (w), the M-step reduces to a logistic regression problem with a quadratic regularizer, i. e. , maximizing (12). To this end, we adopt the bound optimization approach (see details in [5, 11]). Let B be a positive definite matrix such that -B bounds below (in the matrix sense) the Hessian of (w), which is negative definite, and g(w) is the gradient of (w). Then, we have the following lower bound on Q(w|w): Q(w|w) l(w) + (w - w)T g(w) - [(w - w)T B(w - w) + wT (w)w]/2. - The maximizer of this lower bound, wnew = (B + (w)) 1 (Bw + g(w)), is guaranteed to increase the Q-function, Q(wnew|w) Q(w|w), and we thus obtain a monotonic gen- eralized EM algorithm [5, 11]. This (maybe costly) matrix inversion can be avoided by a sequential approach where we only maximize w. r. t. one element of w at a time, preserving the monotonicity of the procedure. The sequential algorithm visits one particular element of w, say wu, and updates its estimate by maximizing the bound derived above, while keeping all other variables fixed at their previous values. This leads to - wnew = w ] [(B + (w)) 1, u u + [gu(w) - ((w)w) (13) u uu] and wnew = w v v, for v = u. The total time required by a full sweep for all u = 1, .. ., d is O(md(L + d)); this may be much better than the O((dm)3) of the matrix inversion. 5 Active Label Selection If we are allowed to obtain the label for one of the unlabelled samples, the following ques- tion arises: which sample, if labelled, would provide the most information? Consider the MAP estimate w provided by EM. Our approach uses a Laplace approxima- tion of the posterior p(w|Y) N (w|w, H-1), where H is the posterior precision matrix, i. e. , the Hessian of minus the log-posterior H = 2(- log p(w|Y)). This approximation is known to be accurate for logistic regression under a Gaussian prior [14]. By treating (w) (the expectation of ()) as deterministic, we obtain an evidence-type approximation [14] H = 2[- log(p(Y|w)p(w|(w)))] = (w) + L (diag{p } - p pT ) x, i=1 i i i ixT i where pi is the (m - 1)-dimensional vector computed from (1), the c-th element of which indicates the probability that sample xi belongs to class c. Now let x DU be an unlabelled sample and y its label. Assume that the MAP esti- mate w remains unchanged after including y. In Sec. 7 we will discuss the merits and shortcomings of this assumption, which is only strictly valid when L. Accepting it implies that after labeling x, and regardless of y, the posterior precision changes to H = H + (diag{p} - ppT ) xxT. (14) Since the entropy of a Gaussian with precision H is (-1/2) log |H| (up to an additive constant), the mutual information (MI) between y and w (i. e. , the expected decrease in entropy of w when y is observed) is I(w; y) = (1/2) log {|H |/|H|}. Our criterion is then: the best sample to label is the one that maximizes I(w; y). 
5 Active Label Selection

If we are allowed to obtain the label for one of the unlabelled samples, the following question arises: which sample, if labelled, would provide the most information? Consider the MAP estimate ŵ provided by EM. Our approach uses a Laplace approximation of the posterior, p(w|Y) ≈ N(w|ŵ, H^{-1}), where H is the posterior precision matrix, i.e., the Hessian of the negative log-posterior, H = ∇²(-log p(w|Y)). This approximation is known to be accurate for logistic regression under a Gaussian prior [14]. By treating Ψ̄(ŵ) (the expectation of Ψ(λ)) as deterministic, we obtain an evidence-type approximation [14]

  H = ∇²[-log(p(Y|ŵ) p(ŵ|Ψ̄(ŵ)))] = Ψ̄(ŵ) + ∑_{i=1}^{L} (diag{p_i} - p_i p_i^T) ⊗ x_i x_i^T,

where p_i is the (m-1)-dimensional vector computed from (1), whose c-th element is the probability that sample x_i belongs to class c.

Now let x ∈ D_U be an unlabelled sample and y its label. Assume that the MAP estimate ŵ remains unchanged after including y. In Sec. 7 we will discuss the merits and shortcomings of this assumption, which is strictly valid only when L → ∞. Accepting it implies that after labeling x, and regardless of y, the posterior precision changes to

  H' = H + (diag{p} - p p^T) ⊗ x x^T.   (14)

Since the entropy of a Gaussian with precision H is (-1/2) log|H| (up to an additive constant), the mutual information (MI) between y and w (i.e., the expected decrease in the entropy of w when y is observed) is I(w; y) = (1/2) log{|H'|/|H|}. Our criterion is then: the best sample to label is the one that maximizes I(w; y).

Further insight into I(w; y) can be obtained in the binary case (where p is a scalar); here, the matrix identity |H + p(1-p) x x^T| = |H| (1 + p(1-p) x^T H^{-1} x) yields

  I(w; y) = (1/2) log(1 + p(1-p) x^T H^{-1} x).   (15)

This MI is larger when p ≈ 0.5, i.e., for samples with uncertain classifications. On the other hand, with p fixed, I(w; y) grows with x^T H^{-1} x, i.e., it is large for samples with high variance in the corresponding class probability estimate. Summarizing, (15) favors samples with uncertain class labels and high uncertainty in the class probability estimate.
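In the binary case, the selection rule implied by (15) amounts to scoring each unlabelled candidate and taking the maximizer; a minimal sketch (the function name is ours, and an explicit inverse is used only for clarity, a Cholesky solve being preferable at scale) is:

```python
import numpy as np

def select_query(H, X_unlab, w_hat):
    """Pick the unlabelled sample maximizing the MI criterion (15), binary case.

    H       : (d, d) posterior precision from the Laplace approximation.
    X_unlab : (n_U, d) candidate unlabelled samples.
    w_hat   : (d,) MAP estimate.
    Returns the index of the best candidate and all MI scores.
    """
    H_inv = np.linalg.inv(H)
    p = 1.0 / (1.0 + np.exp(-X_unlab @ w_hat))                  # predicted probabilities
    quad = np.einsum('ij,jk,ik->i', X_unlab, H_inv, X_unlab)    # x^T H^{-1} x per sample
    mi = 0.5 * np.log1p(p * (1.0 - p) * quad)                   # I(w; y) from (15)
    return int(np.argmax(mi)), mi
```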
6 Experimental Results

We begin by presenting two-dimensional synthetic examples that visually illustrate our semi-supervised classifier. Fig. 1 shows the utility of using unlabelled data to improve the decision boundary of linear and non-linear (kernel) classifiers (see the figure caption for details).

Figure 1: Synthetic two-dimensional examples. (a) Comparison of the supervised linear logistic classifier (boundary shown as a dashed line), learned only from the labelled data (shown in color), with the proposed semi-supervised classifier (boundary shown as a solid line), which also uses the unlabelled samples (shown as dots). (b) An RBF kernel classifier obtained by our algorithm, using two labelled samples (shaded circles) and many unlabelled samples.

Next we show results with linear classifiers on three UCI benchmark datasets. Results with nonlinear kernels are similar and are therefore omitted to save space. We compare our method against state-of-the-art semi-supervised classifiers: the GRF method of [18], the SGT method of [10], and the transductive SVM (TSVM) of [9]. For reference, we also present results for a standard SVM. To avoid unduly helping our method, we always use a k = 5 nearest-neighbor graph, although our algorithm is not very sensitive to k; to avoid disadvantaging the other methods, which do depend on such parameters, we use their best settings. Since these adjustments cannot be made in practice, the difference between our algorithm and the others is under-represented. Each point on the plots in Fig. 2(a)-(c) is an average over 20 trials: we randomly select 20 labelled sets, which are used by every method, and all remaining samples are treated as unlabelled by the semi-supervised algorithms.

Figure 2: (a)-(c) Accuracy (on UCI datasets) of the proposed method, the supervised SVM, and the other semi-supervised classifiers mentioned in the text; a subset of samples is labelled and the others are treated as unlabelled. (d) A separate holdout set is used to evaluate the accuracy of our method versus the amount of labelled and unlabelled data.

Figs. 2(a)-(c) are transductive, in the sense that the unlabelled and test data are the same. Our logistic GRF is non-transductive: after being trained, it may be applied to classify new data without re-training. In Fig. 2(d) we present non-transductive results for the Ionosphere data: training used labelled and unlabelled data, and testing was performed on 200 new, unseen samples. The results suggest that semi-supervised classifiers are most relevant when the labelled set is small relative to the unlabelled set (as is often the case).

Our final set of results addresses co-training (Sec. 3.3) and active learning (Sec. 5), applied to airborne sensing data for the detection of surface and subsurface land mines. Two sensors were used: (1) a 70-band hyper-spectral electro-optic (EOIR) sensor; (2) an X-band synthetic aperture radar (SAR). A simple (energy) "prescreener" detected potential targets; for each of these, two feature vectors were extracted, of sizes 420 and 9, for the EOIR and SAR sensors, respectively. 123 samples have features from the EOIR sensor alone, 398 from the SAR sensor alone, and 316 from both. This data will be made available upon request.

Figure 3: (a) Land mine detection ROC curves for classifiers designed using only hyper-spectral (EOIR) features, only SAR features, and both. (b) Number of land mines detected during the active querying process (dotted lines), for active training and random selection (for the latter, the bars reflect one standard deviation about the mean). ROC curves (solid) are for the learned classifier applied to the remaining samples.

We first consider supervised and semi-supervised classification. For the purely supervised case, a sparseness prior is used (as in [14]); in both cases a linear classifier is employed. For the data for which only one sensor is available, 20% is labelled (selected randomly); for the data for which both sensors are available, 80% is labelled (again selected randomly). The results in Fig. 3(a) show that, in general, the semi-supervised classifiers outperform the corresponding supervised ones, and the classifier learned from both sensors is markedly superior to the classifiers learned from either sensor alone.

In a second illustration, we use the active-learning algorithm (Sec. 5) to acquire only the 100 most informative labels. For comparison, we also show average results over 100 independent realizations of random label query selection (error bars indicate one standard deviation). The results in Fig. 3(b) are plotted in two stages: first, mines and clutter are selected during the labeling process (dashed curves); then, the 100 labelled examples are used to build the final semi-supervised classifier, whose ROC curve is obtained using the remaining unlabelled data (solid curves). Interestingly, the active-learning algorithm finds almost half of the mines while querying for labels. Due to physical limitations of the sensors, the rate at which mines are detected drops precipitously after approximately 90 mines are detected, i.e., the remaining mines are poorly matched to the sensor physics.
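For completeness, the k = 5 nearest-neighbor graph mentioned above feeds the smoothness matrix A used by the prior; its exact weighting is defined in Section 3.2 and is not reproduced here. A purely illustrative sketch of one common construction (unit-weight symmetric k-NN edges with a combinatorial Laplacian) is:

```python
import numpy as np

def knn_laplacian(X, k=5):
    """Illustrative symmetrized k-nearest-neighbor graph Laplacian.

    This is only one common construction consistent with a k = 5
    neighborhood graph; the paper's own weighting (Sec. 3.2) may differ.
    """
    n = X.shape[0]
    # pairwise squared Euclidean distances
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(sq, np.inf)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.argsort(sq[i])[:k]] = 1.0   # connect i to its k nearest neighbors
    W = np.maximum(W, W.T)                  # symmetrize the adjacency
    return np.diag(W.sum(1)) - W            # combinatorial Laplacian D - W
```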