Arrow Research search

Author name cluster

Changyou Chen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

56 papers
2 author rows

Possible papers

56

AAAI Conference 2026 Conference Paper

ALPHA: Action-Based Learning for Pluralistic Human Alignment in Large Language Models

  • Aanisha Bhattacharyya
  • Susmit Agrawal
  • Yaman Kumar Singla
  • Tarun Ram Menta
  • Nikitha Sr
  • Rajiv Ratn Shah
  • Changyou Chen
  • Balaji Krishnamurthy

Large language models are widely used, but aligning them with societal values remains challenging. Current approaches often rely on human annotations, which are hard to scale, or synthetic data produced by models that may themselves be misaligned, making it difficult to capture genuine public opinion. This limits scalability and introduces demographic biases that reduce the representativeness and fairness of model behavior. We introduce a novel approach to pluralistic alignment through behavioral learning, grounded in the psychological principle that actions (behavior) have strong consistency with opinions. Specifically, we present ALPHA50M, a dataset of over 50 million samples derived from 1.5 million real-world advertisements, incorporating rich behavioral signals inferred from demographic engagement patterns. Models trained on this data achieve state-of-the-art zero-shot performance on diverse alignment benchmarks spanning cultural reasoning, political views, and social values. We also propose two new benchmarks: OpinionQA-XL, which covers surveys across 100+ societal topics, and GSS, which evaluates temporal opinion shift modeling over decades. Our results demonstrate that learning from behavioral signals, derived from observed human actions, enables models to align with diverse demographic opinions, capture underlying social and cultural norms, and generalize to new topics and surveys beyond training data. This behavioral learning paradigm offers a scalable and demographically broad alternative to existing alignment techniques.

ICLR Conference 2025 Conference Paper

Measuring And Improving Engagement of Text-to-Image Generation Models

  • Varun Khurana
  • Yaman Kumar Singla
  • Jayakumar Subramanian
  • Changyou Chen
  • Rajiv Ratn Shah
  • Zhiqiang Xu
  • Balaji Krishnamurthy

Recent advances in text-to-image generation have achieved impressive aesthetic quality, making these models usable for both personal and commercial purposes. However, in the fields of marketing and advertising, images are often created to be more engaging, as reflected in user behaviors such as increasing clicks, likes, and purchases, in addition to being aesthetically pleasing. To this end, we introduce the challenge of optimizing the image generation process for improved viewer engagement. In order to study image engagement and utility in real-world marketing scenarios, we collect *EngagingImageNet*, the first large-scale dataset of images, along with associated user engagement metrics. Further, we find that existing image evaluation metrics like aesthetics, CLIPScore, PickScore, ImageReward, *etc.* are unable to capture viewer engagement. To address the lack of reliable metrics for assessing image utility, we use the *EngagingImageNet* dataset to train *EngageNet*, an engagement-aware Vision Language Model (VLM) that predicts viewer engagement of images by leveraging contextual information about the tweet content, enterprise details, and posting time. We then explore methods to enhance the engagement of text-to-image models, making initial strides in this direction. These include conditioning image generation on improved prompts, supervised fine-tuning of stable diffusion on high-performing images, and reinforcement learning to align stable diffusion with *EngageNet*-based reward signals, all of which lead to the generation of images with higher viewer engagement. Finally, we propose the *Engagement Arena*, to benchmark text-to-image models based on their ability to generate engaging images, using *EngageNet* as the evaluator, thereby encouraging the research community to measure further advances in the engagement of text-to-image modeling. These contributions provide a new pathway for advancing utility-driven image generation, with significant implications for the commercial application of image generation. We have released our code and dataset on [behavior-in-the-wild.github.io/image-engagement](https://behavior-in-the-wild.github.io/image-engagement).

NeurIPS Conference 2025 Conference Paper

SPRO: Improving Image Generation via Self-Play

  • Ritika Jha
  • Aanisha Bhattacharyya
  • Yaman Singla
  • Rajiv Ratn Shah
  • Changyou Chen
  • Balaji Krishnamurthy

Recent advances in diffusion models have dramatically improved image fidelity and diversity. However, aligning these models with nuanced human preferences -such as aesthetics, engagement, and subjective appeal remains a key challenge due to the scarcity of large-scale human annotations. Collecting such data is both expensive and limited in diversity. To address this, we leverage the reasoning capabilities of vision-language models (VLMs) and propose Self-Play Reward Optimization (SPRO), a scalable, annotation-free training framework based on multimodal self-play. SPRO learns to jointly align prompt and image generation with human preferences by iteratively generating, evaluating, and learning to refine outputs using synthetic reward signals such as aesthetics and human engagement. This self-improving feedback loop eliminates the need for external supervision. SPRO comprises three stages: (1) SPRO-Prompt, which trains a Guider-VLM via self-play to generate diverse, high-reward prompts targeting objectives such as PickScore (user preference), LAION-Aesthetics, and EngageNet (engagement); (2) SPRO-Image, which fine-tunes the diffusion model on high-reward images derived from these prompts; and (3) SPRO-Multimodal (SPRO-MM), which integrates both components for full end-to-end alignment. Without relying on human-labeled data, SPRO achieves an average 30\% improvement across preference objectives. Moreover, its generated prompts generalize across both open- and closed-source diffusion models. Through iterative self-play, SPRO discovers prompting strategies rarely authored by humans such as emphasizing visual harmony for aesthetics or leveraging shadow-based cues for engagement. SPRO offers a scalable path toward aligning generative models with complex subjective human values.

ICLR Conference 2025 Conference Paper

SV-RAG: LoRA-Contextualizing Adaptation of MLLMs for Long Document Understanding

  • Jian Chen 0043
  • Ruiyi Zhang 0002
  • Yufan Zhou 0001
  • Tong Yu 0001
  • Franck Dernoncourt
  • Jiuxiang Gu
  • Ryan A. Rossi
  • Changyou Chen

Multimodal large language models (MLLMs) have recently shown great progress in text-rich image understanding, yet they still struggle with complex, multi-page visually-rich documents. Traditional methods using document parsers for retrieval-augmented generation suffer from performance and efficiency limitations, while directly presenting all pages to MLLMs leads to inefficiencies, especially with lengthy ones. In this work, we present a novel framework named **S**elf-**V**isual **R**etrieval-**A**ugmented **G**eneration (SV-RAG), which can broaden horizons of *any* MLLM to support long-document understanding. We demonstrate that **MLLMs themselves can be an effective multimodal retriever** to fetch relevant pages and then answer user questions based on these pages. SV-RAG is implemented with two specific MLLM adapters, one for evidence page retrieval and the other for question answering. Empirical results show state-of-the-art performance on public benchmarks, demonstrating the effectiveness of SV-RAG.

ICLR Conference 2025 Conference Paper

Teaching Human Behavior Improves Content Understanding Abilities Of VLMs

  • Somesh Kumar Singh
  • Harini S. I
  • Yaman Kumar Singla
  • Changyou Chen
  • Rajiv Ratn Shah
  • Veeky Baths
  • Balaji Krishnamurthy

Communication is defined as "*Who* says *what* to *whom* with *what* effect." A message from a communicator generates downstream receiver effects, also known as behavior. Receiver behavior, being a downstream effect of the message, carries rich signals about it. Even after carrying signals about the message, the behavior signal is often ignored while training vision-language models. We show that training VLMs on receiver behavior can actually help improve their content-understanding abilities. We demonstrate that training VLMs to predict receiver behaviors, such as likes, comments, and replay graphs, which are available at scale, enhances the VLM's performance across a broad range of downstream content understanding tasks. We show this performance increase over 6 types of behavior, 46 different tasks covering image, video, text, and audio over 26 benchmark datasets across both zero-shot and fine-tuning settings, outperforming many supervised baselines on diverse tasks ranging from emotion recognition to captioning by up to 150%. We note that since receiver behavior, such as likes, comments, and replay graphs, is collected by default on the internet and does not need any human annotations to be useful, the performance improvement we get after training on this data is essentially free lunch. We also release **BLIFT**, our **Behaviour-LLaVA IFT** dataset comprising 730k images and videos with their receiver behavior collected from multiple platforms on which we train our models to achieve this. The dataset and code are available at [behavior-in-the-wild.github.io/behavior-llava](https://behavior-in-the-wild.github.io/behavior-llava).

TMLR Journal 2025 Journal Article

Understanding Fine-tuning in Approximate Unlearning: A Theoretical Perspective

  • Meng Ding
  • Rohan Sharma
  • Changyou Chen
  • Jinhui Xu
  • Kaiyi Ji

Machine Unlearning has emerged as a significant area of research, focusing on `removing' specific subsets of data from a trained model. Fine-tuning (FT) methods have become one of the fundamental approaches for approximating unlearning, as they effectively retain model performance. However, it is consistently observed that naive FT methods struggle to forget the targeted data. In this paper, we present the first theoretical analysis of FT methods for machine unlearning within a linear regression framework, providing a deeper exploration of this phenomenon. Our analysis reveals that while FT models can achieve zero remaining loss, they fail to forget the forgetting data, as the pretrained model retains its influence and the fine-tuning process does not adequately mitigate it. To address this, we propose a novel Retention-Based Masking (RBM) strategy that constructs a weight saliency map based on the remaining dataset, unlike existing methods that focus on the forgetting dataset. Our theoretical analysis demonstrates that RBM not only significantly improves unlearning accuracy (UA) but also ensures higher retaining accuracy (RA) by preserving overlapping features shared between the forgetting and remaining datasets. Experiments on synthetic and real-world datasets validate our theoretical insights, showing that RBM outperforms existing masking approaches in balancing UA, RA, and disparity metrics.

NeurIPS Conference 2024 Conference Paper

A probability contrastive learning framework for 3D molecular representation learning

  • Jiayu Qin
  • Jian Chen
  • Rohan Sharma
  • Jingchen Sun
  • Changyou Chen

Contrastive Learning (CL) plays a crucial role in molecular representation learning, enabling unsupervised learning from large scale unlabeled molecule datasets. It has inspired various applications in molecular property prediction and drug design. However, existing molecular representation learning methods often introduce potential false positive and false negative pairs through conventional graph augmentations like node masking and subgraph removal. The issue can lead to suboptimal performance when applying standard contrastive learning techniques to molecular datasets. To address the issue of false positive and negative pairs in molecular representation learning, we propose a novel probability-based contrastive learning (CL) framework. Unlike conventional methods, our approach introduces a learnable weight distribution via Bayesian modeling to automatically identify and mitigate false positive and negative pairs. This method is particularly effective because it dynamically adjusts to the data, improving the accuracy of the learned representations. Our model is learned by a stochastic expectation-maximization process, which optimizes the model by iteratively refining the probability estimates of sample weights and updating the model parameters. Experimental results indicate that our method outperforms existing approaches in 13 out of 15 molecular property prediction benchmarks in MoleculeNet dataset and 8 out of 12 benchmarks in the QM9 benchmark, achieving new state-of-the-art results on average.

ICLR Conference 2024 Conference Paper

AUC-CL: A Batchsize-Robust Framework for Self-Supervised Contrastive Representation Learning

  • Rohan Sharma
  • Kaiyi Ji
  • Zhiqiang Xu
  • Changyou Chen

Self-supervised learning through contrastive representations is an emergent and promising avenue, aiming at alleviating the availability of labeled data. Recent research in the field also demonstrates its viability for several downstream tasks, henceforth leading to works that implement the contrastive principle through innovative loss functions and methods. However, despite achieving impressive progress, most methods depend on prohibitively large batch sizes and compute requirements for good performance. In this work, we propose the $\textbf{AUC}$-$\textbf{C}$ontrastive $\textbf{L}$earning, a new approach to contrastive learning that demonstrates robust and competitive performance in compute-limited regimes. We propose to incorporate the contrastive objective within the AUC-maximization framework, by noting that the AUC metric is maximized upon enhancing the probability of the network's binary prediction difference between positive and negative samples which inspires adequate embedding space arrangements in representation learning. Unlike standard contrastive methods, when performing stochastic optimization, our method maintains unbiased stochastic gradients and thus is more robust to batchsizes as opposed to standard stochastic optimization problems. Remarkably, our method with a batch size of 256, outperforms several state-of-the-art methods that may need much larger batch sizes (e.g., 4096), on ImageNet and other standard datasets. Experiments on transfer learning, few-shot learning, and other downstream tasks also demonstrate the viability of our method.

ICLR Conference 2024 Conference Paper

Diffusion Models for Multi-Task Generative Modeling

  • Changyou Chen
  • Han Ding 0004
  • Bunyamin Sisman
  • Yi Xu 0011
  • Ouye Xie
  • Benjamin Z. Yao
  • Son Dinh Tran
  • Belinda Zeng

Diffusion-based generative modeling has been achieving state-of-the-art results on various generation tasks. Most diffusion models, however, are limited to a single-generation modeling. Can we generalize diffusion models with the ability of multi-modal generative training for more generalizable modeling? In this paper, we propose a principled way to define a diffusion model by constructing a unified multi-modal diffusion model in a common {\em diffusion space}. We define the forward diffusion process to be driven by an information aggregation from multiple types of task-data, {\it e.g.}, images for a generation task and labels for a classification task. In the reverse process, we enforce information sharing by parameterizing a shared backbone denoising network with additional modality-specific decoder heads. Such a structure can simultaneously learn to generate different types of multi-modal data with a multi-task loss, which is derived from a new multi-modal variational lower bound that generalizes the standard diffusion model. We propose several multi-modal generation settings to verify our framework, including image transition, masked-image training, joint image-label and joint image-representation generative modeling. Extensive experimental results on ImageNet indicate the effectiveness of our framework for various multi-modal generative modeling, which we believe is an important research direction worthy of more future explorations.

ICLR Conference 2024 Conference Paper

Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior

  • Ashmit Khandelwal
  • Aditya Agrawal
  • Aanisha Bhattacharyya
  • Yaman Kumar Singla
  • Somesh Singh 0003
  • Uttaran Bhattacharya
  • Ishita Dasgupta 0002
  • Stefano Petrangeli

Shannon and Weaver's seminal information theory divides communication into three levels: technical, semantic, and effectiveness. While the technical level deals with the accurate reconstruction of transmitted symbols, the semantic and effectiveness levels deal with the inferred meaning and its effect on the receiver. Large Language Models (LLMs), with their wide generalizability, make some progress towards the second level. However, LLMs and other communication models are not conventionally designed for predicting and optimizing communication for desired receiver behaviors and intents. As a result, the effectiveness level remains largely untouched by modern communication systems. In this paper, we introduce the receivers' "behavior tokens," such as shares, likes, clicks, purchases, and retweets, in the LLM's training corpora to optimize content for the receivers and predict their behaviors. Other than showing similar performance to LLMs on content understanding tasks, our trained models show generalization capabilities on the behavior dimension for behavior simulation, content simulation, behavior understanding, and behavior domain adaptation. We show results on all these capabilities using a wide range of tasks on three corpora. We call these models Large Content and Behavior Models (LCBMs). Further, to spur more research on LCBMs, we release our new Content Behavior Corpus (CBC), a repository containing communicator, message, and corresponding receiver behavior (https://behavior-in-the-wild.github.io/LCBM).

ICLR Conference 2024 Conference Paper

Towards Aligned Layout Generation via Diffusion Model with Aesthetic Constraints

  • Jian Chen 0043
  • Ruiyi Zhang 0002
  • Yufan Zhou 0001
  • Changyou Chen

Controllable layout generation refers to the process of creating a plausible visual arrangement of elements within a graphic design (*e.g.*, document and web designs) with constraints representing design intentions. Although recent diffusion-based models have achieved state-of-the-art FID scores, they tend to exhibit more pronounced misalignment compared to earlier transformer-based models. In this work, we propose the **LA**yout **C**onstraint diffusion mod**E**l (LACE), a unified model to handle a broad range of layout generation tasks, such as arranging elements with specified attributes and refining or completing a coarse layout design. The model is based on continuous diffusion models. Compared with existing methods that use discrete diffusion models, continuous state-space design can enable the incorporation of continuous aesthetic constraint functions in training more naturally. For conditional generation, we propose injecting layout conditions in the form of masks or gradient guidance during inference. Empirical results show that LACE produces high-quality layouts and outperforms existing state-of-the-art baselines. We will release our source code and model checkpoints.

AAAI Conference 2023 Conference Paper

AUC Maximization for Low-Resource Named Entity Recognition

  • Ngoc Dang Nguyen
  • Wei Tan
  • Lan Du
  • Wray Buntine
  • Richard Beare
  • Changyou Chen

Current work in named entity recognition (NER) uses either cross entropy (CE) or conditional random fields (CRF) as the objective/loss functions to optimize the underlying NER model. Both of these traditional objective functions for the NER problem generally produce adequate performance when the data distribution is balanced and there are sufficient annotated training examples. But since NER is inherently an imbalanced tagging problem, the model performance under the low-resource settings could suffer using these standard objective functions. Based on recent advances in area under the ROC curve (AUC) maximization, we propose to optimize the NER model by maximizing the AUC score. We give evidence that by simply combining two binary-classifiers that maximize the AUC score, significant performance improvement over traditional loss functions is achieved under low-resource NER settings. We also conduct extensive experiments to demonstrate the advantages of our method under the low-resource and highly-imbalanced data distribution settings. To the best of our knowledge, this is the first work that brings AUC maximization to the NER setting. Furthermore, we show that our method is agnostic to different types of NER embeddings, models and domains. The code of this work is available at https://github.com/dngu0061/NER-AUC-2T.

ICML Conference 2023 Conference Paper

Fed-CBS: A Heterogeneity-Aware Client Sampling Mechanism for Federated Learning via Class-Imbalance Reduction

  • Jianyi Zhang
  • Ang Li 0005
  • Minxue Tang
  • Jingwei Sun 0002
  • Xiang Chen 0010
  • Fan Zhang 0069
  • Changyou Chen
  • Yiran Chen 0001

Due to the often limited communication bandwidth of edge devices, most existing federated learning (FL) methods randomly select only a subset of devices to participate in training at each communication round. Compared with engaging all the available clients, such a random-selection mechanism could lead to significant performance degradation on non-IID (independent and identically distributed) data. In this paper, we present our key observation that the essential reason resulting in such performance degradation is the class-imbalance of the grouped data from randomly selected clients. Based on this observation, we design an efficient heterogeneity-aware client sampling mechanism, namely, Federated Class-balanced Sampling (Fed-CBS), which can effectively reduce class-imbalance of the grouped dataset from the intentionally selected clients. We first propose a measure of class-imbalance which can be derived in a privacy-preserving way. Based on this measure, we design a computation-efficient client sampling strategy such that the actively selected clients will generate a more class-balanced grouped dataset with theoretical guarantees. Experimental results show that Fed-CBS outperforms the status quo approaches in terms of test accuracy and the rate of convergence while achieving comparable or even better performance than the ideal setting where all the available clients participate in the FL training.

NeurIPS Conference 2023 Conference Paper

Label-Retrieval-Augmented Diffusion Models for Learning from Noisy Labels

  • Jian Chen
  • Ruiyi Zhang
  • Tong Yu
  • Rohan Sharma
  • Zhiqiang Xu
  • Tong Sun
  • Changyou Chen

Learning from noisy labels is an important and long-standing problem in machine learning for real applications. One of the main research lines focuses on learning a label corrector to purify potential noisy labels. However, these methods typically rely on strict assumptions and are limited to certain types of label noise. In this paper, we reformulate the label-noise problem from a generative-model perspective, i. e. , labels are generated by gradually refining an initial random guess. This new perspective immediately enables existing powerful diffusion models to seamlessly learn the stochastic generative process. Once the generative uncertainty is modeled, we can perform classification inference using maximum likelihood estimation of labels. To mitigate the impact of noisy labels, we propose the L abel- R etrieval- A ugmented (LRA) diffusion model, which leverages neighbor consistency to effectively construct pseudo-clean labels for diffusion training. Our model is flexible and general, allowing easy incorporation of different types of conditional information, e. g. , use of pre-trained models, to further boost model performance. Extensive experiments are conducted for evaluation. Our model achieves new state-of-the-art (SOTA) results on all the standard real-world benchmark datasets. Remarkably, by incorporating conditional information from the powerful CLIP model, our method can boost the current SOTA accuracy by 10-20 absolute points in many cases. Code is available: https: //anonymous. 4open. science/r/LRA-diffusion-5F2F

ICML Conference 2023 Conference Paper

Learning Unnormalized Statistical Models via Compositional Optimization

  • Wei Jiang 0029
  • Jiayu Qin 0001
  • Lingyu Wu
  • Changyou Chen
  • Tianbao Yang
  • Lijun Zhang 0005

Learning unnormalized statistical models (e. g. , energy-based models) is computationally challenging due to the complexity of handling the partition function. To eschew this complexity, noise-contrastive estimation (NCE) has been proposed by formulating the objective as the logistic loss of the real data and the artificial noise. However, as found in previous works, NCE may perform poorly in many tasks due to its flat loss landscape and slow convergence. In this paper, we study a direct approach for optimizing the negative log-likelihood of unnormalized models from the perspective of compositional optimization. To tackle the partition function, a noise distribution is introduced such that the log partition function can be written as a compositional function whose inner function can be estimated with stochastic samples. Hence, the objective can be optimized by stochastic compositional optimization algorithms. Despite being a simple method, we demonstrate that it is more favorable than NCE by (1) establishing a fast convergence rate and quantifying its dependence on the noise distribution through the variance of stochastic estimators; (2) developing better results for one-dimensional Gaussian mean estimation by showing our objective has a much favorable loss landscape and hence our method enjoys faster convergence; (3) demonstrating better performance on multiple applications, including density estimation, out-of-distribution detection, and real image generation.

AAAI Conference 2023 Conference Paper

Persuasion Strategies in Advertisements

  • Yaman Kumar
  • Rajat Jha
  • Arunim Gupta
  • Milan Aggarwal
  • Aditya Garg
  • Tushar Malyan
  • Ayush Bhardwaj
  • Rajiv Ratn Shah

Modeling what makes an advertisement persuasive, i.e., eliciting the desired response from consumer, is critical to the study of propaganda, social psychology, and marketing. Despite its importance, computational modeling of persuasion in computer vision is still in its infancy, primarily due to the lack of benchmark datasets that can provide persuasion-strategy labels associated with ads. Motivated by persuasion literature in social psychology and marketing, we introduce an extensive vocabulary of persuasion strategies and build the first ad image corpus annotated with persuasion strategies. We then formulate the task of persuasion strategy prediction with multi-modal learning, where we design a multi-task attention fusion model that can leverage other ad-understanding tasks to predict persuasion strategies. The dataset also provides image segmentation masks, which labels persuasion strategies in the corresponding ad images on the test split. We publicly release our code and dataset at https://midas-research.github.io/persuasion-advertisements/.

AAAI Conference 2022 Conference Paper

MINIMAL: Mining Models for Universal Adversarial Triggers

  • Yaman Kumar Singla
  • Swapnil Parekh
  • Somesh Singh
  • Changyou Chen
  • Balaji Krishnamurthy
  • Rajiv Ratn Shah

It is well known that natural language models are vulnerable to adversarial attacks, which are mostly input-specific in nature. Recently, it has been shown that there also exist input-agnostic attacks in NLP models, special text sequences called universal adversarial triggers. However, existing methods to craft universal triggers are data intensive. They require large amounts of data samples to generate adversarial triggers, which are typically inaccessible by attackers. For instance, previous works take 3000 data samples per class for the SNLI dataset to generate adversarial triggers. In this paper, we present a novel data-free approach, MINIMAL, to mine input-agnostic adversarial triggers from models. Using the triggers produced with our data-free algorithm, we reduce the accuracy of Stanford Sentiment Treebank’s positive class from 93. 6% to 9. 6%. Similarly, for the Stanford Natural Language Inference (SNLI), our single-word trigger reduces the accuracy of the entailment class from 90. 95% to less than 0. 6%. Despite being completely data-free, we get equivalent accuracy drops as data-dependent methods.

AAAI Conference 2022 Conference Paper

TiGAN: Text-Based Interactive Image Generation and Manipulation

  • Yufan Zhou
  • Ruiyi Zhang
  • Jiuxiang Gu
  • Chris Tensmeyer
  • Tong Yu
  • Changyou Chen
  • Jinhui Xu
  • Tong Sun

Using natural-language feedback to guide image generation and manipulation can greatly lower the required efforts and skills. This topic has received increased attention in recent years through refinement of Generative Adversarial Networks (GANs); however, most existing works are limited to singleround interaction, which is not reflective of real world interactive image editing workflows. Furthermore, previous works dealing with multi-round scenarios are limited to predefined feedback sequences, which is also impractical. In this paper, we propose a novel framework for Text-based interactive image generation and manipulation (TiGAN) that responds to users’ natural-language feedback. TiGAN utilizes the powerful pre-trained CLIP model to understand users’ naturallanguage feedback and exploits contrastive learning for a better text-to-image mapping. To maintain the image consistency during interactions, TiGAN generates intermediate feature vectors aligned with the feedback and selectively feeds these vectors to our proposed generative model. Empirical results on several datasets show that TiGAN improves both interaction efficiency and image quality while better avoids undesirable image manipulation during interactions.

NeurIPS Conference 2022 Conference Paper

Why do We Need Large Batchsizes in Contrastive Learning? A Gradient-Bias Perspective

  • Changyou Chen
  • Jianyi Zhang
  • Yi Xu
  • Liqun Chen
  • Jiali Duan
  • Yiran Chen
  • Son Tran
  • Belinda Zeng

Contrastive learning (CL) has been the de facto technique for self-supervised representation learning (SSL), with impressive empirical success such as multi-modal representation learning. However, traditional CL loss only considers negative samples from a minibatch, which could cause biased gradients due to the non-decomposibility of the loss. For the first time, we consider optimizing a more generalized contrastive loss, where each data sample is associated with an infinite number of negative samples. We show that directly using minibatch stochastic optimization could lead to gradient bias. To remedy this, we propose an efficient Bayesian data augmentation technique to augment the contrastive loss into a decomposable one, where standard stochastic optimization can be directly applied without gradient bias. Specifically, our augmented loss defines a joint distribution over the model parameters and the augmented parameters, which can be conveniently optimized by a proposed stochastic expectation-maximization algorithm. Our framework is more general and is related to several popular SSL algorithms. We verify our framework on both small scale models and several large foundation models, including SSL of ImageNet and SSL for vision-language representation learning. Experiment results indicate the existence of gradient bias in all cases, and demonstrate the effectiveness of the proposed method on improving previous state of the arts. Remarkably, our method can outperform the strong MoCo-v3 under the same hyper-parameter setting with only around half of the minibatch size; and also obtains strong results in the recent public benchmark ELEVATER for few-shot image classification.

ICLR Conference 2021 Conference Paper

Meta-Learning with Neural Tangent Kernels

  • Yufan Zhou 0001
  • Zhenyi Wang 0001
  • Jiayi Xian
  • Changyou Chen
  • Jinhui Xu 0001

Model Agnostic Meta-Learning (MAML) has emerged as a standard framework for meta-learning, where a meta-model is learned with the ability of fast adapting to new tasks. However, as a double-looped optimization problem, MAML needs to differentiate through the whole inner-loop optimization path for every outer-loop training step, which may lead to both computational inefficiency and sub-optimal solutions. In this paper, we generalize MAML to allow meta-learning to be defined in function spaces, and propose the first meta-learning paradigm in the Reproducing Kernel Hilbert Space (RKHS) induced by the meta-model's Neural Tangent Kernel (NTK). Within this paradigm, we introduce two meta-learning algorithms in the RKHS, which no longer need a sub-optimal iterative inner-loop adaptation as in the MAML framework. We achieve this goal by 1) replacing the adaptation with a fast-adaptive regularizer in the RKHS; and 2) solving the adaptation analytically based on the NTK theory. Extensive experimental studies demonstrate advantages of our paradigm in both efficiency and quality of solutions compared to related meta-learning algorithms. Another interesting feature of our proposed methods is that they are demonstrated to be more robust to adversarial attacks and out-of-distribution adaptation than popular baselines, as demonstrated in our experiments.

ICLR Conference 2021 Conference Paper

MixKD: Towards Efficient Distillation of Large-scale Language Models

  • Kevin J. Liang
  • Weituo Hao
  • Dinghan Shen
  • Yufan Zhou 0001
  • Weizhu Chen
  • Changyou Chen
  • Lawrence Carin

Large-scale language models have recently demonstrated impressive empirical performance. Nevertheless, the improved results are attained at the price of bigger models, more power consumption, and slower inference, which hinder their applicability to low-resource (both memory and computation) platforms. Knowledge distillation (KD) has been demonstrated as an effective framework for compressing such big models. However, large-scale neural network systems are prone to memorize training instances, and thus tend to make inconsistent predictions when the data distribution is altered slightly. Moreover, the student model has few opportunities to request useful information from the teacher model when there is limited task-specific data available. To address these issues, we propose MixKD, a data-agnostic distillation framework that leverages mixup, a simple yet efficient data augmentation approach, to endow the resulting model with stronger generalization ability. Concretely, in addition to the original training examples, the student model is encouraged to mimic the teacher's behavior on the linear interpolation of example pairs as well. We prove from a theoretical perspective that under reasonable conditions MixKD gives rise to a smaller gap between the generalization error and the empirical error. To verify its effectiveness, we conduct experiments on the GLUE benchmark, where MixKD consistently leads to significant gains over the standard KD training, and outperforms several competitive baselines. Experiments under a limited-data setting and ablation studies further demonstrate the advantages of the proposed approach.

IJCAI Conference 2021 Conference Paper

Unsupervised Hashing with Contrastive Information Bottleneck

  • Zexuan Qiu
  • Qinliang Su
  • Zijing Ou
  • Jianxing Yu
  • Changyou Chen

Many unsupervised hashing methods are implicitly established on the idea of reconstructing the input data, which basically encourages the hashing codes to retain as much information of original data as possible. However, this requirement may force the models spending lots of their effort on reconstructing the unuseful background information, while ignoring to preserve the discriminative semantic information that is more important for the hashing task. To tackle this problem, inspired by the recent success of contrastive learning in learning continuous representations, we propose to adapt this framework to learn binary hashing codes. Specifically, we first propose to modify the objective function to meet the specific requirement of hashing and then introduce a probabilistic binary representation layer into the model to facilitate end-to-end training of the entire model. We further prove the strong connection between the proposed contrastive-learning-based hashing method and the mutual information, and show that the proposed model can be considered under the broader framework of the information bottleneck (IB). Under this perspective, a more general hashing model is naturally obtained. Extensive experimental results on three benchmark image datasets demonstrate that the proposed hashing method significantly outperforms existing baselines.

ICLR Conference 2020 Conference Paper

Bayesian Meta Sampling for Fast Uncertainty Adaptation

  • Zhenyi Wang 0001
  • Yang Zhao
  • Ping Yu
  • Ruiyi Zhang 0002
  • Changyou Chen

Meta learning has been making impressive progress for fast model adaptation. However, limited work has been done on learning fast uncertainty adaption for Bayesian modeling. In this paper, we propose to achieve the goal by placing meta learning on the space of probability measures, inducing the concept of meta sampling for fast uncertainty adaption. Specifically, we propose a Bayesian meta sampling framework consisting of two main components: a meta sampler and a sample adapter. The meta sampler is constructed by adopting a neural-inverse-autoregressive-flow (NIAF) structure, a variant of the recently proposed neural autoregressive flows, to efficiently generate meta samples to be adapted. The sample adapter moves meta samples to task-specific samples, based on a newly proposed and general Bayesian sampling technique, called optimal-transport Bayesian sampling. The combination of the two components allows a simple learning procedure for the meta sampler to be developed, which can be efficiently optimized via standard back-propagation. Extensive experimental results demonstrate the efficiency and effectiveness of the proposed framework, obtaining better sample quality and faster uncertainty adaption compared to related methods.

NeurIPS Conference 2020 Conference Paper

Bayesian Multi-type Mean Field Multi-agent Imitation Learning

  • Fan Yang
  • Alina Vereshchaka
  • Changyou Chen
  • Wen Dong

Multi-agent Imitation learning (MAIL) refers to the problem that agents learn to perform a task interactively in a multi-agent system through observing and mimicking expert demonstrations, without any knowledge of a reward function from the environment. MAIL has received a lot of attention due to promising results achieved on synthesized tasks, with the potential to be applied to complex real-world multi-agent tasks. Key challenges for MAIL include sample efficiency and scalability. In this paper, we proposed Bayesian multi-type mean field multi-agent imitation learning (BM3IL). Our method improves sample efficiency through establishing a Bayesian formulation for MAIL, and enhances scalability through introducing a new multi-type mean field approximation. We demonstrate the performance of our algorithm through benchmarking with three state-of-the-art multi-agent imitation learning algorithms on several tasks, including solving a multi-agent traffic optimization problem in a real-world transportation network. Experimental results indicate that our algorithm significantly outperforms all other algorithms in all scenarios.

ICLR Conference 2020 Conference Paper

Cyclical Stochastic Gradient MCMC for Bayesian Deep Learning

  • Ruqi Zhang
  • Chunyuan Li
  • Jianyi Zhang
  • Changyou Chen
  • Andrew Gordon Wilson

The posteriors over neural network weights are high dimensional and multimodal. Each mode typically characterizes a meaningfully different representation of the data. We develop Cyclical Stochastic Gradient MCMC (SG-MCMC) to automatically explore such distributions. In particular, we propose a cyclical stepsize schedule, where larger steps discover new modes, and smaller steps characterize each mode. We prove non-asymptotic convergence theory of our proposed algorithm. Moreover, we provide extensive experimental results, including ImageNet, to demonstrate the effectiveness of cyclical SG-MCMC in learning complex multimodal distributions, especially for fully Bayesian inference with modern deep neural networks.

ICML Conference 2020 Conference Paper

Feature Quantization Improves GAN Training

  • Yang Zhao
  • Chunyuan Li
  • Ping Yu
  • Jianfeng Gao 0001
  • Changyou Chen

The instability in GANs’ training has been a long-standing problem despite remarkable research efforts. We identify that instability issues stem from difficulties of performing feature matching with mini-batch statistics, due to a fragile balance between the fixed target distribution and the progressively generated distribution. In this work, we propose feature quantizatoin (FQ) for the discriminator, to embed both true and fake data samples into a shared discrete space. The quantized values of FQ are constructed as an evolving dictionary, which is consistent with feature statistics of the recent distribution history. Hence, FQ implicitly enables robust feature matching in a compact space. Our method can be easily plugged into existing GAN models, with little computational overhead in training. Extensive experimental results show that the proposed FQ-GAN can improve the FID scores of baseline methods by a large margin on a variety of tasks, including three representative GAN models on 10 benchmarks, achieving new state-of-the-art performance.

AAAI Conference 2020 Conference Paper

Learning Diverse Stochastic Human-Action Generators by Learning Smooth Latent Transitions

  • Zhenyi Wang
  • Ping Yu
  • Yang Zhao
  • Ruiyi Zhang
  • Yufan Zhou
  • Junsong Yuan
  • Changyou Chen

Human-motion generation is a long-standing challenging task due to the requirement of accurately modeling complex and diverse dynamic patterns. Most existing methods adopt sequence models such as RNN to directly model transitions in the original action space. Due to high dimensionality and potential noise, such modeling of action transitions is particularly challenging. In this paper, we focus on skeleton-based action generation and propose to model smooth and diverse transitions on a latent space of action sequences with much lower dimensionality. Conditioned on a latent sequence, actions are generated by a frame-wise decoder shared by all latent action-poses. Specifically, an implicit RNN is defined to model smooth latent sequences, whose randomness (diversity) is controlled by noise from the input. Different from standard action-prediction methods, our model can generate action sequences from pure noise without any conditional action poses. Remarkably, it can also generate unseen actions from mixed classes during training. Our model is learned with a bi-directional generative-adversarial-net framework, which can not only generate diverse action sequences of a particular class or mix classes, but also learns to classify action sequences within the same model. Experimental results show the superiority of our method in both diverse action-sequence generation and classification, relative to existing methods.

NeurIPS Conference 2020 Conference Paper

Learning Manifold Implicitly via Explicit Heat-Kernel Learning

  • Yufan Zhou
  • Changyou Chen
  • Jinhui Xu

Manifold learning is a fundamental problem in machine learning with numerous applications. Most of the existing methods directly learn the low-dimensional embedding of the data in some high-dimensional space, and usually lack the flexibility of being directly applicable to down-stream applications. In this paper, we propose the concept of implicit manifold learning, where manifold information is implicitly obtained by learning the associated heat kernel. A heat kernel is the solution of the corresponding heat equation, which describes how ``heat'' transfers on the manifold, thus containing ample geometric information of the manifold. We provide both practical algorithm and theoretical analysis of our framework. The learned heat kernel can be applied to various kernel-based machine learning models, including deep generative models (DGM) for data generation and Stein Variational Gradient Descent for Bayesian inference. Extensive experiments show that our framework can achieve the state-of-the-art results compared to existing methods for the two tasks.

ICML Conference 2020 Conference Paper

Variance Reduction in Stochastic Particle-Optimization Sampling

  • Jianyi Zhang
  • Yang Zhao
  • Changyou Chen

Stochastic particle-optimization sampling (SPOS) is a recently-developed scalable Bayesian sampling framework unifying stochastic gradient MCMC (SG-MCMC) and Stein variational gradient descent (SVGD) algorithms based on Wasserstein gradient flows. With a rigorous non-asymptotic convergence theory developed, SPOS can avoid the particle-collapsing pitfall of SVGD. However, the variance-reduction effect in SPOS has not been clear. In this paper, we address this gap by presenting several variance-reduction techniques for SPOS. Specifically, we propose three variants of variance-reduced SPOS, called SAGA particle-optimization sampling (SAGA-POS), SVRG particle-optimization sampling (SVRG-POS) and a variant of SVRG-POS which avoids full gradient computations, denoted as SVRG-POS$^+$. Importantly, we provide non-asymptotic convergence guarantees for these algorithms in terms of the 2-Wasserstein metric and analyze their complexities. The results show our algorithms yield better convergence rates than existing variance-reduced variants of stochastic Langevin dynamics, though more space is required to store the particles in training. Our theory aligns well with experimental results on both synthetic and real datasets.

AAAI Conference 2020 Conference Paper

Variational Adversarial Kernel Learned Imitation Learning

  • Fan Yang
  • Alina Vereshchaka
  • Yufan Zhou
  • Changyou Chen
  • Wen Dong

Imitation learning refers to the problem where an agent learns to perform a task through observing and mimicking expert demonstrations, without knowledge of the cost function. Stateof-the-art imitation learning algorithms reduce imitation learning to distribution-matching problems by minimizing some distance measures. However, the distance measure may not always provide informative signals for a policy update. To this end, we propose the variational adversarial kernel learned imitation learning (VAKLIL), which measures the distance using the maximum mean discrepancy with variational kernel learning. Our method optimizes over a large cost-function space and is sample efficient and robust to overfitting. We demonstrate the performance of our algorithm through benchmarking with four state-of-the-art imitation learning algorithms over five high-dimensional control tasks, and a complex transportation control task. Experimental results indicate that our algorithm significantly outperforms related algorithms in all scenarios.

IJCAI Conference 2019 Conference Paper

Bayesian Uncertainty Matching for Unsupervised Domain Adaptation

  • Jun Wen
  • Nenggan Zheng
  • Junsong Yuan
  • Zhefeng Gong
  • Changyou Chen

Domain adaptation is an important technique to alleviate performance degradation caused by domain shift, e. g. , when training and test data come from different domains. Most existing deep adaptation methods focus on reducing domain shift by matching marginal feature distributions through deep transformations on the input features, due to the unavailability of target domain labels. We show that domain shift may still exist via label distribution shift at the classifier, thus deteriorating model performances. To alleviate this issue, we propose an approximate joint distribution matching scheme by exploiting prediction uncertainty. Specifically, we use a Bayesian neural network to quantify prediction uncertainty of a classifier. By imposing distribution matching on both features and labels (via uncertainty), label distribution mismatching in source and target data is effectively alleviated, encouraging the classifier to produce consistent predictions across domains. We also propose a few techniques to improve our method by adaptively reweighting domain adaptation loss to achieve nontrivial distribution matching and stable training. Comparisons with state of the art unsupervised domain adaptation methods on three popular benchmark datasets demonstrate the superiority of our approach, especially on the effectiveness of alleviating negative transfer.

NeurIPS Conference 2019 Conference Paper

Certified Adversarial Robustness with Additive Noise

  • Bai Li
  • Changyou Chen
  • Wenlin Wang
  • Lawrence Carin

The existence of adversarial data examples has drawn significant attention in the deep-learning community; such data are seemingly minimally perturbed relative to the original data, but lead to very different outputs from a deep-learning algorithm. Although a significant body of work on developing defense models has been developed, most such models are heuristic and are often vulnerable to adaptive attacks. Defensive methods that provide theoretical robustness guarantees have been studied intensively, yet most fail to obtain non-trivial robustness when a large-scale model and data are present. To address these limitations, we introduce a framework that is scalable and provides certified bounds on the norm of the input manipulation for constructing adversarial examples. We establish a connection between robustness against adversarial perturbation and additive random noise, and propose a training strategy that can significantly improve the certified bounds. Our evaluation on MNIST, CIFAR-10 and ImageNet suggests that our method is scalable to complicated models and large data sets, while providing competitive robustness to state-of-the-art provable defense methods.

AAAI Conference 2019 Conference Paper

Communication-Efficient Stochastic Gradient MCMC for Neural Networks

  • Chunyuan Li
  • Changyou Chen
  • Yunchen Pu
  • Ricardo Henao
  • Lawrence Carin

Learning probability distributions on the weights of neural networks has recently proven beneficial in many applications. Bayesian methods such as Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) offer an elegant framework to reason about model uncertainty in neural networks. However, these advantages usually come with a high computational cost. We propose accelerating SG-MCMC under the masterworker framework: workers asynchronously and in parallel share responsibility for gradient computations, while the master collects the final samples. To reduce communication overhead, two protocols (downpour and elastic) are developed to allow periodic interaction between the master and workers. We provide a theoretical analysis on the finite-time estimation consistency of posterior expectations, and establish connections to sample thinning. Our experiments on various neural networks demonstrate that the proposed algorithms can greatly reduce training time while achieving comparable (or better) test accuracy/log-likelihood levels, relative to traditional SG-MCMC. When applied to reinforcement learning, it naturally provides exploration for asynchronous policy optimization, with encouraging performance improvement.

IJCAI Conference 2019 Conference Paper

Deep Metric Learning: The Generalization Analysis and an Adaptive Algorithm

  • Mengdi Huai
  • Hongfei Xue
  • Chenglin Miao
  • Liuyi Yao
  • Lu Su
  • Changyou Chen
  • Aidong Zhang

As an effective way to learn a distance metric between pairs of samples, deep metric learning (DML) has drawn significant attention in recent years. The key idea of DML is to learn a set of hierarchical nonlinear mappings using deep neural networks, and then project the data samples into a new feature space for comparing or matching. Although DML has achieved practical success in many applications, there is no existing work that theoretically analyzes the generalization error bound for DML, which can measure how good a learned DML model is able to perform on unseen data. In this paper, we try to fill up this research gap and derive the generalization error bound for DML. Additionally, based on the derived generalization bound, we propose a novel DML method (called ADroDML), which can adaptively learn the retention rates for the DML models with dropout in a theoretically justified way. Compared with existing DML works that require predefined retention rates, ADroDML can learn the retention rates in an optimal way and achieve better performance. We also conduct experiments on real-world datasets to verify the findings derived from the generalization error bound and demonstrate the effectiveness of the proposed adaptive DML method.

ICML Conference 2019 Conference Paper

Differentially Private Empirical Risk Minimization with Non-convex Loss Functions

  • Di Wang 0015
  • Changyou Chen
  • Jinhui Xu 0001

We study the problem of Empirical Risk Minimization (ERM) with (smooth) non-convex loss functions under the differential-privacy (DP) model. Existing approaches for this problem mainly adopt gradient norms to measure the error, which in general cannot guarantee the quality of the solution. To address this issue, we first study the expected excess empirical (or population) risk, which was primarily used as the utility to measure the quality for convex loss functions. Specifically, we show that the excess empirical (or population) risk can be upper bounded by $\tilde{O}(\frac{d\log (1/\delta)}{\log n\epsilon^2})$ in the $(\epsilon, \delta)$-DP settings, where $n$ is the data size and $d$ is the dimensionality of the space. The $\frac{1}{\log n}$ term in the empirical risk bound can be further improved to $\frac{1}{n^{\Omega(1)}}$ (when $d$ is a constant) by a highly non-trivial analysis on the time-average error. To obtain more efficient solutions, we also consider the connection between achieving differential privacy and finding approximate local minimum. Particularly, we show that when the size $n$ is large enough, there are $(\epsilon, \delta)$-DP algorithms which can find an approximate local minimum of the empirical risk with high probability in both the constrained and non-constrained settings. These results indicate that one can escape saddle points privately.

AAAI Conference 2019 Conference Paper

Distributionally Adversarial Attack

  • Tianhang Zheng
  • Changyou Chen
  • Kui Ren

Recent work on adversarial attack has shown that Projected Gradient Descent (PGD) Adversary is a universal first-order adversary, and the classifier adversarially trained by PGD is robust against a wide range of first-order attacks. It is worth noting that the original objective of an attack/defense model relies on a data distribution p(x), typically in the form of risk maximization/minimization, e. g. , max/min Ep(x)L(x) with p(x) some unknown data distribution and L(·) a loss function. However, since PGD generates attack samples independently for each data sample based on L(·), the procedure does not necessarily lead to good generalization in terms of risk optimization. In this paper, we achieve the goal by proposing distributionally adversarial attack (DAA), a framework to solve an optimal adversarial-data distribution, a perturbed distribution that satisfies the L∞ constraint but deviates from the original data distribution to increase the generalization risk maximally. Algorithmically, DAA performs optimization on the space of potential data distributions, which introduces direct dependency between all data points when generating adversarial samples. DAA is evaluated by attacking stateof-the-art defense models, including the adversarially-trained models provided by MIT MadryLab. Notably, DAA ranks the first place on MadryLab’s white-box leaderboards, reducing the accuracy of their secret MNIST model to 88. 56% (with l∞ perturbations of = 0. 3) and the accuracy of their secret CIFAR model to 44. 71% (with l∞ perturbations of = 8. 0). Code for the experiments is released on https: //github. com/ tianzheng4/Distributionally-Adversarial-Attack.

AAAI Conference 2019 Conference Paper

Self-Adversarially Learned Bayesian Sampling

  • Yang Zhao
  • Jianyi Zhang
  • Changyou Chen

Scalable Bayesian sampling is playing an important role in modern machine learning, especially in the fast-developed unsupervised-(deep)-learning models. While tremendous progresses have been achieved via scalable Bayesian sampling such as stochastic gradient MCMC (SG-MCMC) and Stein variational gradient descent (SVGD), the generated samples are typically highly correlated. Moreover, their sample-generation processes are often criticized to be inefficient. In this paper, we propose a novel self-adversarial learning framework that automatically learns a conditional generator to mimic the behavior of a Markov kernel (transition kernel). High-quality samples can be efficiently generated by direct forward passes though a learned generator. Most importantly, the learning process adopts a self-learning paradigm, requiring no information on existing Markov kernels, e. g. , knowledge of how to draw samples from them. Specifically, our framework learns to use current samples, either from the generator or pre-provided training data, to update the generator such that the generated samples progressively approach a target distribution, thus it is called self-learning. Experiments on both synthetic and real datasets verify advantages of our framework, outperforming related methods in terms of both sampling efficiency and sample quality.

NeurIPS Conference 2019 Conference Paper

Text-Based Interactive Recommendation via Constraint-Augmented Reinforcement Learning

  • Ruiyi Zhang
  • Tong Yu
  • Yilin Shen
  • Hongxia Jin
  • Changyou Chen

Text-based interactive recommendation provides richer user preferences and has demonstrated advantages over traditional interactive recommender systems. However, recommendations can easily violate preferences of users from their past natural-language feedback, since the recommender needs to explore new items for further improvement. To alleviate this issue, we propose a novel constraint-augmented reinforcement learning (RL) framework to efficiently incorporate user preferences over time. Specifically, we leverage a discriminator to detect recommendations violating user historical preference, which is incorporated into the standard RL objective of maximizing expected cumulative future rewards. Our proposed framework is general and is further extended to the task of constrained text generation. Empirical results show that the proposed method yields consistent improvement relative to standard RL methods.

UAI Conference 2018 Conference Paper

A Unified Particle-Optimization Framework for Scalable Bayesian Sampling

  • Changyou Chen
  • Ruiyi Zhang 0002
  • Wenlin Wang
  • Bai Li 0001
  • Liqun Chen 0001

There has been recent interest in developing scalable Bayesian sampling methods such as stochastic gradient MCMC (SG-MCMC) and Stein variational gradient descent (SVGD) for big-data analysis. A standard SG-MCMC algorithm simulates samples from a discrete-time Markov chain to approximate a target distribution, thus samples could be highly correlated, an undesired property for SG-MCMC. In contrary, SVGD directly optimizes a set of particles to approximate a target distribution, and thus is able to obtain good approximations with relatively much fewer samples. In this paper, we propose a principle particle-optimization framework based on Wasserstein gradient flows to unify SG-MCMC and SVGD, and to allow new algorithms to be developed. Our framework interprets SG-MCMC as particle optimization on the space of probability measures, revealing a strong connection between SG-MCMC and SVGD. The key component of our framework is several particle-approximate techniques to efficiently solve the original partial differential equations on the space of probability measures. Extensive experiments on both synthetic data and deep neural networks demonstrate the effectiveness and efficiency of our framework for scalable Bayesian sampling.

ICML Conference 2018 Conference Paper

Continuous-Time Flows for Efficient Inference and Density Estimation

  • Changyou Chen
  • Chunyuan Li
  • Liquan Chen
  • Wenlin Wang
  • Yunchen Pu
  • Lawrence Carin

Two fundamental problems in unsupervised learning are efficient inference for latent-variable models and robust density estimation based on large amounts of unlabeled data. Algorithms for the two tasks, such as normalizing flows and generative adversarial networks (GANs), are often developed independently. In this paper, we propose the concept of continuous-time flows (CTFs), a family of diffusion-based methods that are able to asymptotically approach a target distribution. Distinct from normalizing flows and GANs, CTFs can be adopted to achieve the above two goals in one framework, with theoretical guarantees. Our framework includes distilling knowledge from a CTF for efficient inference, and learning an explicit energy-based distribution with CTFs for density estimation. Both tasks rely on a new technique for distribution matching within amortized learning. Experiments on various tasks demonstrate promising performance of the proposed CTF framework, compared to related techniques.

IROS Conference 2018 Conference Paper

Deep Q-Learning for Dry Stacking Irregular Objects

  • Yifang Liu
  • Seyed Mahdi Shamsi
  • Le Fang 0002
  • Changyou Chen
  • Nils Napp

We propose a reinforcement learning approach for automatically building dry stacked (i. e. no mortar) structures with irregular objects. Stacking irregular objects is a challenging problem since each assembly action can be drawn from a continuous space of poses for an object, and several local geometric and physical considerations strongly affect the stability. To tackle this challenge, we concentrate on a simplified 2D version of the problem. We present a reinforcement learning algorithm based on deep Q-learning, where the learned Q-function, which maps state-action pairs into expected long-term rewards, is represented by a deep neural network. As the action space is continuous the Q-network is trained by sampling a finite number of actions that consider both geometric and physical constraints to approximate the target Q-values, Experiments show that the proposed method outperforms previous heuristics-based planning, leading to super construction with objects containing a significant amount of variations. We validate the generated stacking plans by executing them using a robot arm and manufactured, irregular objects.

ICML Conference 2018 Conference Paper

Policy Optimization as Wasserstein Gradient Flows

  • Ruiyi Zhang 0002
  • Changyou Chen
  • Chunyuan Li
  • Lawrence Carin

Policy optimization is a core component of reinforcement learning (RL), and most existing RL methods directly optimize parameters of a policy based on maximizing the expected total reward, or its surrogate. Though often achieving encouraging empirical success, its correspondence to policy-distribution optimization has been unclear mathematically. We place policy optimization into the space of probability measures, and interpret it as Wasserstein gradient flows. On the probability-measure space, under specified circumstances, policy optimization becomes convex in terms of distribution optimization. To make optimization feasible, we develop efficient algorithms by numerically solving the corresponding discrete gradient flows. Our technique is applicable to several RL settings, and is related to many state-of-the-art policy-optimization algorithms. Specifically, we define gradient flows on both the parameter-distribution space and policy-distribution space, leading to what we term indirect-policy and direct-policy learning frameworks, respectively. Extensive experiments verify the effectiveness of our framework, often obtaining better performance compared to related algorithms.

AAAI Conference 2018 Conference Paper

Zero-Shot Learning via Class-Conditioned Deep Generative Models

  • Wenlin Wang
  • Yunchen Pu
  • Vinay Verma
  • Kai Fan
  • Yizhe Zhang
  • Changyou Chen
  • Piyush Rai
  • Lawrence Carin

We present a deep generative model for Zero-Shot Learning (ZSL). Unlike most existing methods for this problem, that represent each class as a point (via a semantic embedding), we represent each seen/unseen class using a classspecific latent-space distribution, conditioned on class attributes. We use these latent-space distributions as a prior for a supervised variational autoencoder (VAE), which also facilitates learning highly discriminative feature representations for the inputs. The entire framework is learned end-to-end using only the seen-class training data. At test time, the label for an unseen-class test input is the class that maximizes the VAE lower bound. We further extend the model to a (i) semi-supervised/transductive setting by leveraging unlabeled unseen-class data via an unsupervised learning module, and (ii) few-shot learning where we also have a small number of labeled inputs from the unseen classes. We compare our model with several state-of-the-art methods through a comprehensive set of experiments on a variety of benchmark data sets.

NeurIPS Conference 2017 Conference Paper

ALICE: Towards Understanding Adversarial Learning for Joint Distribution Matching

  • Chunyuan Li
  • Hao Liu
  • Changyou Chen
  • Yuchen Pu
  • Liqun Chen
  • Ricardo Henao
  • Lawrence Carin

We investigate the non-identifiability issues associated with bidirectional adversarial training for joint distribution matching. Within a framework of conditional entropy, we propose both adversarial and non-adversarial approaches to learn desirable matched joint distributions for unsupervised and supervised tasks. We unify a broad family of adversarial models as joint distribution matching problems. Our approach stabilizes learning of unsupervised bidirectional adversarial learning methods. Further, we introduce an extension for semi-supervised learning tasks. Theoretical results are validated in synthetic data and real-world applications.

ICML Conference 2017 Conference Paper

Stochastic Gradient Monomial Gamma Sampler

  • Yizhe Zhang 0002
  • Changyou Chen
  • Zhe Gan
  • Ricardo Henao
  • Lawrence Carin

Scaling Markov Chain Monte Carlo (MCMC) to estimate posterior distributions from large datasets has been made possible as a result of advances in stochastic gradient techniques. Despite their success, mixing performance of existing methods when sampling from multimodal distributions can be less efficient with insufficient Monte Carlo samples; this is evidenced by slow convergence and insufficient exploration of posterior distributions. We propose a generalized framework to improve the sampling efficiency of stochastic gradient MCMC, by leveraging a generalized kinetics that delivers superior stationary mixing, especially in multimodal distributions, and propose several techniques to overcome the practical issues. We show that the proposed approach is better at exploring a complicated multimodal posterior distribution, and demonstrate improvements over other stochastic gradient MCMC methods on various applications.

AAAI Conference 2016 Conference Paper

High-Order Stochastic Gradient Thermostats for Bayesian Learning of Deep Models

  • Chunyuan Li
  • Changyou Chen
  • Kai Fan
  • Lawrence Carin

Learning in deep models using Bayesian methods has generated significant attention recently. This is largely because of the feasibility of modern Bayesian methods to yield scalable learning and inference, while maintaining a measure of uncertainty in the model parameters. Stochastic gradient MCMC algorithms (SG-MCMC) are a family of diffusion-based sampling methods for large-scale Bayesian learning. In SG-MCMC, multivariate stochastic gradient thermostats (mSGNHT) augment each parameter of interest, with a momentum and a thermostat variable to maintain stationary distributions as target posterior distributions. As the number of variables in a continuous-time diffusion increases, its numerical approximation error becomes a practical bottleneck, so better use of a numerical integrator is desirable. To this end, we propose use of an efficient symmetric splitting integrator in mSGNHT, instead of the traditional Euler integrator. We demonstrate that the proposed scheme is more accurate, robust, and converges faster. These properties are demonstrated to be desirable in Bayesian deep learning. Extensive experiments on two canonical models and their deep extensions demonstrate that the proposed scheme improves general Bayesian posterior sampling, particularly for deep models.

ICML Conference 2016 Conference Paper

Nonlinear Statistical Learning with Truncated Gaussian Graphical Models

  • Qinliang Su
  • Xuejun Liao
  • Changyou Chen
  • Lawrence Carin

We introduce the truncated Gaussian graphical model (TGGM) as a novel framework for designing statistical models for nonlinear learning. A TGGM is a Gaussian graphical model (GGM) with a subset of variables truncated to be nonnegative. The truncated variables are assumed latent and integrated out to induce a marginal model. We show that the variables in the marginal model are non-Gaussian distributed and their expected relations are nonlinear. We use expectation-maximization to break the inference of the nonlinear model into a sequence of TGGM inference problems, each of which is efficiently solved by using the properties and numerical methods of multivariate Gaussian distributions. We use the TGGM to design models for nonlinear regression and classification, with the performances of these models demonstrated on extensive benchmark datasets and compared to state-of-the-art competing results.

AAAI Conference 2016 Conference Paper

Preconditioned Stochastic Gradient Langevin Dynamics for Deep Neural Networks

  • Chunyuan Li
  • Changyou Chen
  • David Carlson
  • Lawrence Carin

Effective training of deep neural networks suffers from two main issues. The first is that the parameter spaces of these models exhibit pathological curvature. Recent methods address this problem by using adaptive preconditioning for Stochastic Gradient Descent (SGD). These methods improve convergence by adapting to the local geometry of parameter space. A second issue is overfitting, which is typically addressed by early stopping. However, recent work has demonstrated that Bayesian model averaging mitigates this problem. The posterior can be sampled by using Stochastic Gradient Langevin Dynamics (SGLD). However, the rapidly changing curvature renders default SGLD methods inef- ficient. Here, we propose combining adaptive preconditioners with SGLD. In support of this idea, we give theoretical properties on asymptotic convergence and predictive risk. We also provide empirical results for Logistic Regression, Feedforward Neural Nets, and Convolutional Neural Nets, demonstrating that our preconditioned SGLD method gives state-of-the-art performance on these models.

NeurIPS Conference 2016 Conference Paper

Stochastic Gradient MCMC with Stale Gradients

  • Changyou Chen
  • Nan Ding
  • Chunyuan Li
  • Yizhe Zhang
  • Lawrence Carin

Stochastic gradient MCMC (SG-MCMC) has played an important role in large-scale Bayesian learning, with well-developed theoretical convergence properties. In such applications of SG-MCMC, it is becoming increasingly popular to employ distributed systems, where stochastic gradients are computed based on some outdated parameters, yielding what are termed stale gradients. While stale gradients could be directly used in SG-MCMC, their impact on convergence properties has not been well studied. In this paper we develop theory to show that while the bias and MSE of an SG-MCMC algorithm depend on the staleness of stochastic gradients, its estimation variance (relative to the expected estimate, based on a prescribed number of samples) is independent of it. In a simple Bayesian distributed system with SG-MCMC, where stale gradients are computed asynchronously by a set of workers, our theory indicates a linear speedup on the decrease of estimation variance w. r. t. the number of workers. Experiments on synthetic data and deep neural networks validate our theory, demonstrating the effectiveness and scalability of SG-MCMC with stale gradients.

NeurIPS Conference 2016 Conference Paper

Towards Unifying Hamiltonian Monte Carlo and Slice Sampling

  • Yizhe Zhang
  • Xiangyu Wang
  • Changyou Chen
  • Ricardo Henao
  • Kai Fan
  • Lawrence Carin

We unify slice sampling and Hamiltonian Monte Carlo (HMC) sampling, demonstrating their connection via the Hamiltonian-Jacobi equation from Hamiltonian mechanics. This insight enables extension of HMC and slice sampling to a broader family of samplers, called Monomial Gamma Samplers (MGS). We provide a theoretical analysis of the mixing performance of such samplers, proving that in the limit of a single parameter, the MGS draws decorrelated samples from the desired target distribution. We further show that as this parameter tends toward this limit, performance gains are achieved at a cost of increasing numerical difficulty and some practical convergence issues. Our theoretical results are validated with synthetic data and real-world applications.

NeurIPS Conference 2015 Conference Paper

On the Convergence of Stochastic Gradient MCMC Algorithms with High-Order Integrators

  • Changyou Chen
  • Nan Ding
  • Lawrence Carin

Recent advances in Bayesian learning with large-scale data have witnessed emergence of stochastic gradient MCMC algorithms (SG-MCMC), such as stochastic gradient Langevin dynamics (SGLD), stochastic gradient Hamiltonian MCMC (SGHMC), and the stochastic gradient thermostat. While finite-time convergence properties of the SGLD with a 1st-order Euler integrator have recently been studied, corresponding theory for general SG-MCMCs has not been explored. In this paper we consider general SG-MCMCs with high-order integrators, and develop theory to analyze finite-time convergence properties and their asymptotic invariant measures. Our theoretical results show faster convergence rates and more accurate invariant measures for SG-MCMCs with higher-order integrators. For example, with the proposed efficient 2nd-order symmetric splitting integrator, the mean square error (MSE) of the posterior average for the SGHMC achieves an optimal convergence rate of $L^{-4/5}$ at $L$ iterations, compared to $L^{-2/3}$ for the SGHMC and SGLD with 1st-order Euler integrators. Furthermore, convergence results of decreasing-step-size SG-MCMCs are also developed, with the same convergence rates as their fixed-step-size counterparts for a specific decreasing sequence. Experiments on both synthetic and real datasets verify our theory, and show advantages of the proposed method in two large-scale real applications.

ICML Conference 2015 Conference Paper

Scalable Deep Poisson Factor Analysis for Topic Modeling

  • Zhe Gan
  • Changyou Chen
  • Ricardo Henao
  • David E. Carlson
  • Lawrence Carin

A new framework for topic modeling is developed, based on deep graphical models, where interactions between topics are inferred through deep latent binary hierarchies. The proposed multi-layer model employs a deep sigmoid belief network or restricted Boltzmann machine, the bottom binary layer of which selects topics for use in a Poisson factor analysis model. Under this setting, topics live on the bottom layer of the model, while the deep specification serves as a flexible prior for revealing topic structure. Scalable inference algorithms are derived by applying Bayesian conditional density filtering algorithm, in addition to extending recently proposed work on stochastic gradient thermostats. Experimental results on several corpora show that the proposed approach readily handles very large collections of text documents, infers structured topic representations, and obtains superior test perplexities when compared with related models.

NeurIPS Conference 2014 Conference Paper

Bayesian Sampling Using Stochastic Gradient Thermostats

  • Nan Ding
  • Youhan Fang
  • Ryan Babbush
  • Changyou Chen
  • Robert Skeel
  • Hartmut Neven

Dynamics-based sampling methods, such as Hybrid Monte Carlo (HMC) and Langevin dynamics (LD), are commonly used to sample target distributions. Recently, such approaches have been combined with stochastic gradient techniques to increase sampling efficiency when dealing with large datasets. An outstanding problem with this approach is that the stochastic gradient introduces an unknown amount of noise which can prevent proper sampling after discretization. To remedy this problem, we show that one can leverage a small number of additional variables in order to stabilize momentum fluctuations induced by the unknown noise. Our method is inspired by the idea of a thermostat in statistical physics and is justified by a general theory.

NeurIPS Conference 2014 Conference Paper

Robust Bayesian Max-Margin Clustering

  • Changyou Chen
  • Jun Zhu
  • Xinhua Zhang

We present max-margin Bayesian clustering (BMC), a general and robust framework that incorporates the max-margin criterion into Bayesian clustering models, as well as two concrete models of BMC to demonstrate its flexibility and effectiveness in dealing with different clustering tasks. The Dirichlet process max-margin Gaussian mixture is a nonparametric Bayesian clustering model that relaxes the underlying Gaussian assumption of Dirichlet process Gaussian mixtures by incorporating max-margin posterior constraints, and is able to infer the number of clusters from data. We further extend the ideas to present max-margin clustering topic model, which can learn the latent topic representation of each document while at the same time cluster documents in the max-margin fashion. Extensive experiments are performed on a number of real datasets, and the results indicate superior clustering performance of our methods compared to related baselines.

ICML Conference 2013 Conference Paper

Dependent Normalized Random Measures

  • Changyou Chen
  • Vinayak A. Rao
  • Wray L. Buntine
  • Yee Whye Teh

In this paper we propose two constructions of dependent normalized random measures, a class of nonparametric priors over dependent probability measures. Our constructions, which we call mixed normalized random measures (MNRM) and thinned normalized random measures (TNRM), involve (respectively) weighting and thinning parts of a shared underlying Poisson process before combining them together. We show that both MNRM and TNRM are marginally normalized random measures, resulting in well understood theoretical properties. We develop marginal and slice samplers for both models, the latter necessary for inference in TNRM. In time-varying topic modelling experiments, both models exhibit superior performance over related dependent models such as the hierarchical Dirichlet process and the spatial normalized Gamma process.