Arrow Research search

Author name cluster

Zhe Gan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

50 papers
2 author rows

Possible papers

50

ICML Conference 2025 Conference Paper

Contrastive Localized Language-Image Pre-Training

  • Hong-You Chen
  • Zhengfeng Lai
  • Haotian Zhang 0005
  • Xinze Wang
  • Marcin Eichner
  • Keen You
  • Meng Cao
  • Bowen Zhang 0002

CLIP has been a celebrated method for training vision encoders to generate image/text representations facilitating various applications. Recently, it has been widely adopted as the vision backbone of multimodal large language models (MLLMs). The success of CLIP relies on aligning web-crawled noisy text annotations at image levels. However, such criteria may be insufficient for downstream tasks in need of fine-grained vision representations, especially when region-level understanding is demanding for MLLMs. We improve the localization capability of CLIP with several advances. Our proposed pre-training method, Contrastive Localized Language-Image Pre-training (CLOC), complements CLIP with region-text contrastive loss and modules. We formulate a new concept, promptable embeddings, of which the encoder produces image embeddings easy to transform into region representations given spatial hints. To support large-scale pre-training, we design a visually-enriched and spatially-localized captioning framework to effectively generate region-text labels. By scaling up to billions of annotated images, CLOC enables high-quality regional embeddings for recognition and retrieval tasks, and can be a drop-in replacement of CLIP to enhance MLLMs, especially on referring and grounding tasks.
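Illustrative note (not from the paper): the region-text contrastive objective described above can be pictured as a symmetric InfoNCE loss over matched pairs of pooled region embeddings and region-caption embeddings, mirroring CLIP's image-level loss at region granularity. The PyTorch sketch below is a minimal assumed formulation; the way promptable embeddings are pooled into region embeddings, and all names, are hypothetical.

```python
import torch
import torch.nn.functional as F

def region_text_contrastive_loss(region_embs: torch.Tensor,
                                 text_embs: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over matched (region, region-caption) pairs.

    region_embs: (N, D) region embeddings pooled from image features given spatial hints.
    text_embs:   (N, D) embeddings of the corresponding region captions.
    Row i of each tensor is assumed to be a matched pair; the other rows in the
    batch act as negatives, as in CLIP's image-level contrastive loss.
    """
    region_embs = F.normalize(region_embs, dim=-1)
    text_embs = F.normalize(text_embs, dim=-1)
    logits = region_embs @ text_embs.t() / temperature            # (N, N) cosine similarities
    targets = torch.arange(region_embs.size(0), device=logits.device)
    loss_r2t = F.cross_entropy(logits, targets)                   # region -> text direction
    loss_t2r = F.cross_entropy(logits.t(), targets)               # text -> region direction
    return 0.5 * (loss_r2t + loss_t2r)
```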

ICLR Conference 2025 Conference Paper

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

  • Zhangheng Li
  • Keen You
  • Haotian Zhang 0005
  • Di Feng
  • Harsh Agrawal
  • Xiujun Li
  • Mohana Prasad Sathya Moorthy
  • Jeffrey Nichols 0001

Building a generalist model for user interface (UI) understanding is challenging due to various foundational issues, such as platform diversity, resolution variation, and data limitation. In this paper, we introduce Ferret-UI 2, a multimodal large language model (MLLM) designed for universal UI understanding across a wide range of platforms, including iPhone, Android, iPad, Webpage, and AppleTV. Building on the foundation of Ferret-UI, Ferret-UI 2 introduces three key innovations: support for multiple platform types, high-resolution perception through adaptive scaling, and advanced task training data generation powered by GPT-4o with set-of-mark visual prompting. These advancements enable Ferret-UI 2 to perform complex, user-centered interactions, making it highly versatile and adaptable for the expanding diversity of platform ecosystems. Extensive empirical experiments on referring, grounding, user-centric advanced tasks (comprising 9 subtasks $\times$ 5 platforms), GUIDE next-action prediction dataset, and GUI-World multi-platform benchmark demonstrate that Ferret-UI 2 significantly outperforms Ferret-UI, and also shows strong cross-platform transfer capabilities.

ICLR Conference 2025 Conference Paper

MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

  • Yusu Qian
  • Hanrong Ye
  • Jean-Philippe Fauconnier
  • Peter Grasch
  • Yinfei Yang
  • Zhe Gan

Effective evaluation of Multimodal Large Language Models (MLLMs) is essential for understanding their capabilities and limitations. In this paper, we introduce MIA-Bench, a benchmark designed to assess MLLMs’ ability to strictly adhere to complex instructions. Our benchmark comprises a diverse set of 400 image-prompt pairs, each crafted to challenge the models’ compliance with layered instructions in generating accurate and contextually appropriate responses. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning and direct preference optimization to enhance the models’ ability to strictly follow instructions without compromising performance on other tasks. We hope this benchmark not only serves as a tool for measuring MLLM adherence to instructions, but also guides future developments in MLLM training methods.

ICLR Conference 2025 Conference Paper

MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning

  • Haotian Zhang 0005
  • Mingfei Gao
  • Zhe Gan
  • Philipp Dufter
  • Nina Wenzel
  • Forrest Huang
  • Dhruti Shah
  • Xianzhi Du

We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video understanding, and MM1.5-UI, tailored for mobile UI understanding. Through extensive empirical studies and ablations, we provide detailed insights into the training processes and decisions that inform our final designs, offering valuable guidance for future research in MLLM development.

ICLR Conference 2025 Conference Paper

MM-Ego: Towards Building Egocentric Multimodal LLMs for Video QA

  • Hanrong Ye
  • Haotian Zhang 0005
  • Erik A. Daxberger
  • Lin Chen 0010
  • Zongyu Lin
  • Yanghao Li
  • Bowen Zhang 0002
  • Haoxuan You

This research aims to comprehensively explore building a multimodal foundation model for egocentric video understanding. To achieve this goal, we work on three fronts. First, as there is a lack of QA data for egocentric video understanding, we automatically generate 7M high-quality QA samples for egocentric videos ranging from 30 seconds to one hour long in Ego4D based on human-annotated data. This is one of the largest egocentric QA datasets. Second, we contribute a challenging egocentric QA benchmark with 629 videos and 7,026 questions to evaluate the models' ability in recognizing and memorizing visual details across videos of varying lengths. We introduce a new de-biasing evaluation method to help mitigate the unavoidable language bias present in the models being evaluated. Third, we propose a specialized multimodal architecture featuring a novel ``Memory Pointer Prompting'' mechanism. This design includes a global glimpse step to gain an overarching understanding of the entire video and identify key visual information, followed by a fallback step that utilizes the key visual information to generate responses. This enables the model to more effectively comprehend extended video content. With the data, benchmark, and model, we build MM-Ego, an egocentric multimodal LLM that shows powerful performance on egocentric video understanding.

ICLR Conference 2025 Conference Paper

Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

  • Zhengfeng Lai
  • Vasileios Saveris
  • Chen Chen 0005
  • Hong-You Chen
  • Haotian Zhang 0005
  • Bowen Zhang 0002
  • Wenze Hu
  • Juan Lao Tebar

Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining short synthetic captions (SSC) and descriptive synthetic captions (DSC) as case studies, we systematically explore their effects and interactions with AltTexts across models such as CLIP, multimodal LLMs, and diffusion models. Our findings reveal that a hybrid approach that keeps both synthetic captions and AltTexts can outperform the use of synthetic captions alone, improving both alignment and performance, with each model demonstrating preferences for particular caption formats. This comprehensive analysis provides valuable insights into optimizing captioning strategies, thereby advancing the pre-training of multimodal foundation models.

ICLR Conference 2024 Conference Paper

Compressing LLMs: The Truth is Rarely Pure and Never Simple

  • Ajay Kumar Jaiswal
  • Zhe Gan
  • Xianzhi Du
  • Bowen Zhang
  • Zhangyang Wang
  • Yinfei Yang

Despite their remarkable achievements, modern Large Language Models (LLMs) encounter exorbitant computational and memory footprints. Recently, several works have shown significant success in *training-free* and *data-free* compression (pruning and quantization) of LLMs achieving 50-60\% sparsity and reducing the bit-width down to 3 or 4 bits per weight, with negligible perplexity degradation over the uncompressed baseline. As recent research efforts are focused on developing increasingly sophisticated compression methods, our work takes a step back, and re-evaluates the effectiveness of existing SoTA compression methods, which rely on a fairly simple and widely questioned metric, perplexity (even for dense LLMs). We introduce **K**nowledge-**I**ntensive **C**ompressed LLM Benchmar**K** **(LLM-KICK)**, a collection of carefully-curated tasks to re-define the evaluation protocol for compressed LLMs, which have significant alignment with their dense counterparts, and perplexity fails to capture subtle changes in their true capabilities. LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods: all pruning methods suffer significant performance degradation, sometimes at trivial sparsity ratios (*e.g.*, 25-30\%), and fail for N:M sparsity on knowledge-intensive tasks; current quantization methods are more successful than pruning; yet, pruned LLMs even at $\geq 50$\% sparsity are robust in-context retrieval and summarization systems; among others. LLM-KICK is designed to holistically assess compressed LLMs' ability for language understanding, reasoning, generation, in-context retrieval, in-context summarization, *etc.* We hope our study can foster the development of better LLM compression methods. The reproduced codes are available at https://github.com/VITA-Group/llm-kick.

ICLR Conference 2024 Conference Paper

Ferret: Refer and Ground Anything Anywhere at Any Granularity

  • Haoxuan You
  • Haotian Zhang 0005
  • Zhe Gan
  • Xianzhi Du
  • Bowen Zhang 0002
  • Zirui Wang
  • Liangliang Cao
  • Shih-Fu Chang

We introduce Ferret, a new Multimodal Large Language Model (MLLM) capable of understanding spatial referring of any shape or granularity within an image and accurately grounding open-vocabulary descriptions. To unify referring and grounding in the LLM paradigm, Ferret employs a novel and powerful hybrid region representation that integrates discrete coordinates and continuous features jointly to represent a region in the image. To extract the continuous features of versatile regions, we propose a spatial-aware visual sampler, adept at handling varying sparsity across different shapes. Consequently, Ferret can accept diverse region inputs, such as points, bounding boxes, and free-form shapes. To bolster the desired capability of Ferret, we curate GRIT, a comprehensive refer-and-ground instruction tuning dataset including 1.1M samples that contain rich hierarchical spatial knowledge, with an additional 130K hard negative data to promote model robustness. The resulting model not only achieves superior performance in classical referring and grounding tasks, but also greatly outperforms existing MLLMs in region-based and localization-demanded multimodal chatting. Our evaluations also reveal a significantly improved capability of describing image details and a remarkable alleviation in object hallucination.

ICLR Conference 2024 Conference Paper

Guiding Instruction-based Image Editing via Multimodal Large Language Models

  • Tsu-Jui Fu
  • Wenze Hu
  • Xianzhi Du
  • William Yang Wang
  • Yinfei Yang
  • Zhe Gan

Instruction-based image editing improves the controllability and flexibility of image manipulation via natural commands without elaborate descriptions or regional masks. However, human instructions are sometimes too brief for current methods to capture and follow. Multimodal large language models (MLLMs) show promising capabilities in cross-modal understanding and visual-aware response generation via LMs. We investigate how MLLMs facilitate edit instructions and present MLLM-Guided Image Editing (MGIE). MGIE learns to derive expressive instructions and provides explicit guidance. The editing model jointly captures this visual imagination and performs manipulation through end-to-end training. We evaluate various aspects of Photoshop-style modification, global photo optimization, and local editing. Extensive experimental results demonstrate that expressive instructions are crucial to instruction-based image editing, and our MGIE can lead to a notable improvement in automatic metrics and human evaluation while maintaining competitive inference efficiency.

ICLR Conference 2023 Conference Paper

Prompting GPT-3 To Be Reliable

  • Chenglei Si
  • Zhe Gan
  • Zhengyuan Yang
  • Shuohang Wang
  • Jianfeng Wang
  • Jordan L. Boyd-Graber
  • Lijuan Wang

Large language models (LLMs) show impressive abilities via few-shot prompting. Commercialized APIs such as OpenAI GPT-3 further increase their use in real-world language applications. However, the crucial problem of how to improve the reliability of GPT-3 is still under-explored. While reliability is a broad and vaguely defined term, we decompose reliability into four main facets that correspond to the existing framework of ML safety and are well-recognized to be important: generalizability, social biases, calibration, and factuality. Our core contribution is to establish simple and effective prompts that improve GPT-3’s reliability as it: 1) generalizes out-of-distribution, 2) balances demographic distribution and uses natural language instructions to reduce social biases, 3) calibrates output probabilities, and 4) updates the LLM’s factual knowledge and reasoning chains. With appropriate prompts, GPT-3 is more reliable than smaller-scale supervised models on all these facets. We release all processed datasets, evaluation scripts, and model predictions. Our systematic empirical study not only sheds new insights on the reliability of prompting LLMs, but more importantly, our prompting strategies can help practitioners more reliably use LLMs like GPT-3.

TMLR Journal 2022 Journal Article

Adversarial Feature Augmentation and Normalization for Visual Recognition

  • Tianlong Chen
  • Yu Cheng
  • Zhe Gan
  • Jianfeng Wang
  • Lijuan Wang
  • Jingjing Liu
  • Zhangyang Wang

Recent advances in computer vision take advantage of adversarial data augmentation to improve the generalization of classification models. Here, we present an effective and efficient alternative that advocates adversarial augmentation on intermediate feature embeddings, instead of relying on computationally-expensive pixel-level perturbations. We propose $\textbf{A}$dversarial $\textbf{F}$eature $\textbf{A}$ugmentation and $\textbf{N}$ormalization (A-FAN), which ($i$) first augments visual recognition models with adversarial features that integrate flexible scales of perturbation strengths, ($ii$) then extracts adversarial feature statistics from batch normalization, and re-injects them into clean features through feature normalization. We validate the proposed approach across diverse visual recognition tasks with representative backbone networks, including ResNets and EfficientNets for classification, Faster-RCNN for detection, and Deeplab V3+ for segmentation. Extensive experiments show that A-FAN yields consistent generalization improvement over strong baselines across various datasets for classification, detection, and segmentation tasks, such as CIFAR-10, CIFAR-100, ImageNet, Pascal VOC2007, Pascal VOC2012, COCO2017, and Cityscapes. Comprehensive ablation studies and detailed analyses also demonstrate that adding perturbations to specific modules and layers of classification/detection/segmentation backbones yields optimal performance. Codes and pre-trained models are available at: https://github.com/VITA-Group/CV_A-FAN.

AAAI Conference 2022 Conference Paper

An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA

  • Zhengyuan Yang
  • Zhe Gan
  • Jianfeng Wang
  • Xiaowei Hu
  • Yumao Lu
  • Zicheng Liu
  • Lijuan Wang

Knowledge-based visual question answering (VQA) involves answering questions that require external knowledge not present in the image. Existing methods first retrieve knowledge from external resources, then reason over the selected knowledge, the input image, and question for answer prediction. However, this two-step approach could lead to mismatches that potentially limit the VQA performance. For example, the retrieved knowledge might be noisy and irrelevant to the question, and the re-embedded knowledge features during reasoning might deviate from their original meanings in the knowledge base (KB). To address this challenge, we propose PICa, a simple yet effective method that Prompts GPT-3 via the use of Image Captions, for knowledge-based VQA. Inspired by GPT-3’s power in knowledge retrieval and question answering, instead of using structured KBs as in previous work, we treat GPT-3 as an implicit and unstructured KB that can jointly acquire and process relevant knowledge. Specifically, we first convert the image into captions (or tags) that GPT-3 can understand, then adapt GPT-3 to solve the VQA task in a few-shot manner by just providing a few in-context VQA examples. We further boost performance by carefully investigating: (i) what text formats best describe the image content, and (ii) how in-context examples can be better selected and used. PICa unlocks the first use of GPT-3 for multimodal tasks. By using only 16 examples, PICa surpasses the supervised state of the art by an absolute +8.6 points on the OK-VQA dataset. We also benchmark PICa on VQAv2, where PICa also shows a decent few-shot performance.
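As a rough illustration of the prompting scheme described above (an image converted to a caption, prepended with a few in-context VQA examples), the helper below assembles such a prompt. The header sentence and field labels are hypothetical, not the paper's exact template.

```python
def build_pica_style_prompt(examples, test_caption, test_question):
    """Assemble a few-shot VQA prompt from (caption, question, answer) triples.

    `examples` is a list of dicts with keys 'caption', 'question', and 'answer'.
    The wording of the header and the field labels are illustrative assumptions.
    """
    lines = ["Please answer the question according to the context."]
    for ex in examples:
        lines += [f"Context: {ex['caption']}",
                  f"Question: {ex['question']}",
                  f"Answer: {ex['answer']}",
                  ""]
    lines += [f"Context: {test_caption}",
              f"Question: {test_question}",
              "Answer:"]
    return "\n".join(lines)
```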

NeurIPS Conference 2022 Conference Paper

Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone

  • Zi-Yi Dou
  • Aishwarya Kamath
  • Zhe Gan
  • Pengchuan Zhang
  • Jianfeng Wang
  • Linjie Li
  • Zicheng Liu
  • Ce Liu

Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https://github.com/microsoft/FIBER.

AAAI Conference 2022 Conference Paper

Efficient Robust Training via Backward Smoothing

  • Jinghui Chen
  • Yu Cheng
  • Zhe Gan
  • Quanquan Gu
  • Jingjing Liu

Adversarial training is so far the most effective strategy in defending against adversarial examples. However, it suffers from high computational costs due to the iterative adversarial attacks in each training step. Recent studies show that it is possible to achieve fast Adversarial Training by performing a single-step attack with random initialization. However, such an approach still lags behind state-of-the-art adversarial training algorithms on both stability and model robustness. In this work, we develop a new understanding towards Fast Adversarial Training, by viewing random initialization as performing randomized smoothing for better optimization of the inner maximization problem. Following this new perspective, we also propose a new initialization strategy, backward smoothing, to further improve the stability and model robustness over single-step robust training methods. Experiments on multiple benchmarks demonstrate that our method achieves similar model robustness as the original TRADES method while using much less training time (∼3x improvement with the same training schedule).

TMLR Journal 2022 Journal Article

GIT: A Generative Image-to-text Transformer for Vision and Language

  • Jianfeng Wang
  • Zhengyuan Yang
  • Xiaowei Hu
  • Linjie Li
  • Kevin Lin
  • Zhe Gan
  • Zicheng Liu
  • Ce Liu

In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify vision-language tasks such as image/video captioning and question answering. While generative models provide a consistent network architecture between pre-training and fine-tuning, existing work typically contains complex structures (uni/multi-modal encoder/decoder) and depends on external modules such as object detectors/taggers and optical character recognition (OCR). In GIT, we simplify the architecture as one image encoder and one text decoder under a single language modeling task. We also scale up the pre-training data and the model size to boost the model performance. Without bells and whistles, our GIT establishes new state of the arts on numerous challenging benchmarks with a large margin. For instance, our model surpasses the human performance for the first time on TextCaps (138.2 vs. 125.5 in CIDEr). Furthermore, we present a new scheme of generation-based image classification and scene text recognition, achieving decent performance on standard benchmarks.

NeurIPS Conference 2022 Conference Paper

K-LITE: Learning Transferable Visual Models with External Knowledge

  • Sheng Shen
  • Chunyuan Li
  • Xiaowei Hu
  • Yujia Xie
  • Jianwei Yang
  • Pengchuan Zhang
  • Zhe Gan
  • Lijuan Wang

The new generation of state-of-the-art computer vision systems are trained from natural language supervision, ranging from simple object category names to descriptive captions. This form of supervision ensures high generality and usability of the learned visual models, based on the broad concept coverage achieved through large-scale data collection process. Alternatively, we argue that learning with external knowledge about images is a promising way which leverages a much more structured source of supervision and offers sample efficiency. In this paper, we propose K-LITE (Knowledge-augmented Language-Image Training and Evaluation), a simple strategy to leverage external knowledge for building transferable visual systems: In training, it enriches entities in natural language with WordNet and Wiktionary knowledge, leading to an efficient and scalable approach to learning image representations that uses knowledge about the visual concepts; In evaluation, the natural language is also augmented with external knowledge and then used to reference learned visual concepts (or describe new ones) to enable zero-shot and few-shot transfer of the pre-trained models. We study the performance of K-LITE on two important computer vision problems, image classification and object detection, benchmarking on 20 and 13 different existing datasets, respectively. The proposed knowledge-augmented models show significant improvement in transfer learning performance over existing methods. Our code is released at https://github.com/microsoft/klite.

NeurIPS Conference 2022 Conference Paper

NUWA-Infinity: Autoregressive over Autoregressive Generation for Infinite Visual Synthesis

  • Jian Liang
  • Chenfei Wu
  • Xiaowei Hu
  • Zhe Gan
  • Jianfeng Wang
  • Lijuan Wang
  • Zicheng Liu
  • Yuejian Fang

Infinite visual synthesis aims to generate high-resolution images, long-duration videos, and even visual generation of infinite size. Some recent work tried to solve this task by first dividing data into processable patches and then training the models on them without considering the dependencies between patches. However, since they fail to model global dependencies between patches, the quality and consistency of the generation can be limited. To address this issue, we propose NUWA-Infinity, a patch-level \emph{``render-and-optimize''} strategy for infinite visual synthesis. Given a large image or a long video, NUWA-Infinity first splits it into non-overlapping patches and uses the ordered patch chain as a complete training instance; a rendering model then autoregressively predicts each patch based on its contexts. Once a patch is predicted, it is optimized immediately and its hidden states are saved as contexts for the next \emph{``render-and-optimize''} process. This brings two advantages: ($i$) The autoregressive rendering process with information transfer between contexts provides an implicit global probabilistic distribution modeling; ($ii$) The timely optimization process alleviates the optimization stress of the model and helps convergence. Based on the above designs, NUWA-Infinity shows a strong synthesis ability on high-resolution images and long-duration videos. The homepage link is \url{https://nuwa-infinity.microsoft.com}.

AAAI Conference 2022 Conference Paper

Playing Lottery Tickets with Vision and Language

  • Zhe Gan
  • Yen-Chun Chen
  • Linjie Li
  • Tianlong Chen
  • Yu Cheng
  • Shuohang Wang
  • Jingjing Liu
  • Lijuan Wang

Large-scale pre-training has recently revolutionized vision-and-language (VL) research. Models such as LXMERT and UNITER have significantly lifted the state of the art over a wide range of VL tasks. However, the large number of parameters in such models hinders their application in practice. In parallel, work on the lottery ticket hypothesis (LTH) has shown that deep neural networks contain small matching subnetworks that can achieve on par or even better performance than the dense networks when trained in isolation. In this work, we perform the first empirical study to assess whether such trainable subnetworks also exist in pre-trained VL models. We use UNITER as the main testbed (also test on LXMERT and ViLT), and consolidate 7 representative VL tasks for experiments, including visual question answering, visual commonsense reasoning, visual entailment, referring expression comprehension, image-text retrieval, GQA, and NLVR2. Through comprehensive analysis, we summarize our main findings as follows. (i) It is difficult to find subnetworks that strictly match the performance of the full model. However, we can find “relaxed” winning tickets at 50%-70% sparsity that maintain 99% of the full accuracy. (ii) Subnetworks found by task-specific pruning transfer reasonably well to the other tasks, while those found on the pre-training tasks at 60%/70% sparsity transfer universally, matching 98%/96% of the full accuracy on average over all the tasks. (iii) Besides UNITER, other models such as LXMERT and ViLT can also play lottery tickets. However, the highest sparsity we can achieve for ViLT is far lower than LXMERT and UNITER (30% vs. 70%). (iv) LTH also remains relevant when using other training methods (e.g., adversarial training).
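For readers unfamiliar with the lottery-ticket procedure the abstract builds on, the following is a generic sketch of iterative magnitude pruning (IMP) with weight rewinding in PyTorch. It is not the paper's UNITER-specific pipeline; the training callback, pruning fraction, and restriction to Linear layers are illustrative assumptions.

```python
import copy

import torch.nn as nn
import torch.nn.utils.prune as prune

def iterative_magnitude_pruning(model: nn.Module, train_fn, rounds: int = 5,
                                prune_frac: float = 0.2) -> nn.Module:
    """Generic IMP loop: train, prune the smallest surviving weights, rewind, repeat.

    `train_fn(model)` is assumed to train the model in place. This is a schematic
    version of the standard lottery-ticket recipe, not the VL-specific setup
    (UNITER backbone, task mixtures, sparsity schedules) studied in the paper.
    """
    initial_state = copy.deepcopy(model.state_dict())  # weights to rewind to after each round
    for _ in range(rounds):
        train_fn(model)
        for name, module in model.named_modules():
            if isinstance(module, nn.Linear):
                # Prune prune_frac of the currently unpruned weights by magnitude.
                prune.l1_unstructured(module, name="weight", amount=prune_frac)
                # Rewind the surviving weights to their initial values, keeping the new mask.
                key = f"{name}.weight" if name else "weight"
                module.weight_orig.data.copy_(initial_state[key])
    return model
```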

NeurIPS Conference 2021 Conference Paper

Adversarial GLUE: A Multi-Task Benchmark for Robustness Evaluation of Language Models

  • Boxin Wang
  • Chejian Xu
  • Shuohang Wang
  • Zhe Gan
  • Yu Cheng
  • Jianfeng Gao
  • Ahmed Awadallah
  • Bo Li

Large-scale pre-trained language models have achieved tremendous success across a wide range of natural language understanding (NLU) tasks, even surpassing human performance. However, recent studies reveal that the robustness of these models can be challenged by carefully crafted textual adversarial examples. While several individual datasets have been proposed to evaluate model robustness, a principled and comprehensive benchmark is still missing. In this paper, we present Adversarial GLUE (AdvGLUE), a new multi-task benchmark to quantitatively and thoroughly explore and evaluate the vulnerabilities of modern large-scale language models under various types of adversarial attacks. In particular, we systematically apply 14 textual adversarial attack methods to GLUE tasks to construct AdvGLUE, which is further validated by humans for reliable annotations. Our findings are summarized as follows. (i) Most existing adversarial attack algorithms are prone to generating invalid or ambiguous adversarial examples, with around 90% of them either changing the original semantic meanings or misleading human annotators as well. Therefore, we perform a careful filtering process to curate a high-quality benchmark. (ii) All the language models and robust training methods we tested perform poorly on AdvGLUE, with scores lagging far behind the benign accuracy. We hope our work will motivate the development of new adversarial attacks that are more stealthy and semantic-preserving, as well as new robust language models against sophisticated adversarial attacks. AdvGLUE is available at https://adversarialglue.github.io.

NeurIPS Conference 2021 Conference Paper

Chasing Sparsity in Vision Transformers: An End-to-End Exploration

  • Tianlong Chen
  • Yu Cheng
  • Zhe Gan
  • Lu Yuan
  • Lei Zhang
  • Zhangyang Wang

Vision transformers (ViTs) have recently received explosive popularity, but their enormous model sizes and training costs remain daunting. Conventional post-training pruning often incurs higher training budgets. In contrast, this paper aims to trim down both the training memory overhead and the inference complexity, without sacrificing the achievable accuracy. We carry out the first-of-its-kind comprehensive exploration, on taking a unified approach of integrating sparsity in ViTs ``from end to end''. Specifically, instead of training full ViTs, we dynamically extract and train sparse subnetworks, while sticking to a fixed small parameter budget. Our approach jointly optimizes model parameters and explores connectivity throughout training, ending up with one sparse network as the final output. The approach is seamlessly extended from unstructured to structured sparsity, the latter by considering to guide the prune-and-grow of self-attention heads inside ViTs. We further co-explore data and architecture sparsity for additional efficiency gains by plugging in a novel learnable token selector to adaptively determine the currently most vital patches. Extensive results on ImageNet with diverse ViT backbones validate the effectiveness of our proposals which obtain significantly reduced computational cost and almost unimpaired generalization. Perhaps most surprisingly, we find that the proposed sparse (co-)training can sometimes \textit{improve the ViT accuracy} rather than compromising it, making sparsity a tantalizing ``free lunch''. For example, our sparsified DeiT-Small at ($5\%$, $50\%$) sparsity for (data, architecture), improves $\mathbf{0.28\%}$ top-1 accuracy, and meanwhile enjoys $\mathbf{49.32\%}$ FLOPs and $\mathbf{4.40\%}$ running time savings. Our codes are available at https://github.com/VITA-Group/SViTE.

NeurIPS Conference 2021 Conference Paper

Data-Efficient GAN Training Beyond (Just) Augmentations: A Lottery Ticket Perspective

  • Tianlong Chen
  • Yu Cheng
  • Zhe Gan
  • Jingjing Liu
  • Zhangyang Wang

Training generative adversarial networks (GANs) with limited real image data generally results in deteriorated performance and collapsed models. To conquer this challenge, we are inspired by the latest observation, that one can discover independently trainable and highly sparse subnetworks (a.k.a. lottery tickets) from GANs. Treating this as an inductive prior, we suggest a brand-new angle towards data-efficient GAN training: by first identifying the lottery ticket from the original GAN using the small training set of real images; and then focusing on training that sparse subnetwork by re-using the same set. We find our coordinated framework to offer orthogonal gains to existing real image data augmentation methods, and we additionally present a new feature-level augmentation that can be applied together with them. Comprehensive experiments endorse the effectiveness of our proposed framework, across various GAN architectures (SNGAN, BigGAN, and StyleGAN-V2) and diverse datasets (CIFAR-10, CIFAR-100, Tiny-ImageNet, ImageNet, and multiple few-shot generation datasets). Codes are available at: https://github.com/VITA-Group/Ultra-Data-Efficient-GAN-Training.

AAAI Conference 2021 Conference Paper

FILTER: An Enhanced Fusion Method for Cross-lingual Language Understanding

  • Yuwei Fang
  • Shuohang Wang
  • Zhe Gan
  • Siqi Sun
  • Jingjing Liu

Large-scale cross-lingual language models (LM), such as mBERT, Unicoder and XLM, have achieved great success in cross-lingual representation learning. However, when applied to zero-shot cross-lingual transfer tasks, most existing methods use only single-language input for LM finetuning, without leveraging the intrinsic cross-lingual alignment between different languages that proves essential for multilingual tasks. In this paper, we propose FILTER, an enhanced fusion method that takes cross-lingual data as input for XLM finetuning. Specifically, FILTER first encodes text input in the source language and its translation in the target language independently in the shallow layers, then performs cross-language fusion to extract multilingual knowledge in the intermediate layers, and finally performs further language-specific encoding. During inference, the model makes predictions based on the text input in the target language and its translation in the source language. For simple tasks such as classification, translated text in the target language shares the same label as the source language. However, this shared label becomes less accurate or even unavailable for more complex tasks such as question answering, NER and POS tagging. To tackle this issue, we further propose an additional KL-divergence self-teaching loss for model training, based on auto-generated soft pseudo-labels for translated text in the target language. Extensive experiments demonstrate that FILTER achieves new state of the art on two challenging multilingual multi-task benchmarks, XTREME and XGLUE.
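The KL-divergence self-teaching loss mentioned above can be read as a soft-label distillation term: the model's own predictions on one view of the input serve as pseudo-labels for its predictions on the translated view. The snippet below is an assumed formulation of such a term, not the paper's exact loss; the teacher/student pairing and the temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def self_teaching_kl_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 1.0) -> torch.Tensor:
    """KL divergence between soft pseudo-labels and predictions on translated text.

    `teacher_logits` are the model's predictions used as soft pseudo-labels
    (detached so no gradient flows through them); `student_logits` are the
    predictions on the translated target-language input.
    """
    teacher_probs = F.softmax(teacher_logits.detach() / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```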

ICLR Conference 2021 Conference Paper

Improving Zero-Shot Voice Style Transfer via Disentangled Representation Learning

  • Siyang Yuan
  • Pengyu Cheng
  • Ruiyi Zhang 0002
  • Weituo Hao
  • Zhe Gan
  • Lawrence Carin

Voice style transfer, also called voice conversion, seeks to modify one speaker's voice to generate speech as if it came from another (target) speaker. Previous works have made progress on voice conversion with parallel training data and pre-known speakers. However, zero-shot voice style transfer, which learns from non-parallel data and generates voices for previously unseen speakers, remains a challenging problem. In this paper we propose a novel zero-shot voice transfer method via disentangled representation learning. The proposed method first encodes speaker-related style and voice content of each input voice into separate low-dimensional embedding spaces, and then transfers to a new voice by combining the source content embedding and target style embedding through a decoder. With information-theoretic guidance, the style and content embedding spaces are representative and (ideally) independent of each other. On real-world datasets, our method outperforms other baselines and obtains state-of-the-art results in terms of transfer accuracy and voice naturalness.

ICLR Conference 2021 Conference Paper

InfoBERT: Improving Robustness of Language Models from An Information Theoretic Perspective

  • Boxin Wang
  • Shuohang Wang
  • Yu Cheng 0001
  • Zhe Gan
  • Ruoxi Jia 0001
  • Bo Li 0026
  • Jingjing Liu 0001

Large-scale language models such as BERT have achieved state-of-the-art performance across a wide range of NLP tasks. Recent studies, however, show that such BERT-based models are vulnerable facing the threats of textual adversarial attacks. We aim to address this problem from an information-theoretic perspective, and propose InfoBERT, a novel learning framework for robust fine-tuning of pre-trained language models. InfoBERT contains two mutual-information-based regularizers for model training: (i) an Information Bottleneck regularizer, which suppresses noisy mutual information between the input and the feature representation; and (ii) a Robust Feature regularizer, which increases the mutual information between local robust features and global features. We provide a principled way to theoretically analyze and improve the robustness of representation learning for language models in both standard and adversarial training. Extensive experiments demonstrate that InfoBERT achieves state-of-the-art robust accuracy over several adversarial datasets on Natural Language Inference (NLI) and Question Answering (QA) tasks. Our code is available at https://github.com/AI-secure/InfoBERT.

NeurIPS Conference 2021 Conference Paper

The Elastic Lottery Ticket Hypothesis

  • Xiaohan Chen
  • Yu Cheng
  • Shuohang Wang
  • Zhe Gan
  • Jingjing Liu
  • Zhangyang Wang

Lottery Ticket Hypothesis (LTH) raises keen attention to identifying sparse trainable subnetworks, or winning tickets, which can be trained in isolation to achieve similar or even better performance compared to the full models. Despite many efforts being made, the most effective method to identify such winning tickets is still Iterative Magnitude-based Pruning (IMP), which is computationally expensive and has to be run thoroughly for every different network. A natural question that comes in is: can we “transform” the winning ticket found in one network to another with a different architecture, yielding a winning ticket for the latter at the beginning, without re-doing the expensive IMP? Answering this question is not only practically relevant for efficient “once-for-all” winning ticket finding, but also theoretically appealing for uncovering inherently scalable sparse patterns in networks. We conduct extensive experiments on CIFAR-10 and ImageNet, and propose a variety of strategies to tweak the winning tickets found from different networks of the same model family (e.g., ResNets). Based on these results, we articulate the Elastic Lottery Ticket Hypothesis (E-LTH): by mindfully replicating (or dropping) and re-ordering layers for one network, its corresponding winning ticket could be stretched (or squeezed) into a subnetwork for another deeper (or shallower) network from the same family, whose performance is nearly as competitive as the latter’s winning ticket directly found by IMP. We have also extensively compared E-LTH with pruning-at-initialization and dynamic sparse training methods, as well as discussed the generalizability of E-LTH to different model families, layer types, and across datasets. Code is available at https://github.com/VITA-Group/ElasticLTH.

NeurIPS Conference 2021 Conference Paper

VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation

  • Linjie Li
  • Jie Lei
  • Zhe Gan
  • Licheng Yu
  • Yen-Chun Chen
  • Rohit Pillai
  • Yu Cheng
  • Luowei Zhou

Most existing video-and-language (VidL) research focuses on a single dataset, or multiple datasets of a single task. In reality, a truly useful VidL system is expected to be easily generalizable to diverse tasks, domains, and datasets. To facilitate the evaluation of such systems, we introduce Video-And-Language Understanding Evaluation (VALUE) benchmark, an assemblage of 11 VidL datasets over 3 popular tasks: (i) text-to-video retrieval; (ii) video question answering; and (iii) video captioning. VALUE benchmark aims to cover a broad range of video genres, video lengths, data volumes, and task difficulty levels. Rather than focusing on single-channel videos with visual information only, VALUE promotes models that leverage information from both video frames and their associated subtitles, as well as models that share knowledge across multiple tasks. We evaluate various baseline methods with and without large-scale VidL pre-training, and systematically investigate the impact of video input channels, fusion methods, and different video representations. We also study the transferability between tasks, and conduct multi-task learning under different settings. The significant gap between our best model and human performance calls for future study for advanced VidL models. VALUE is available at https://value-benchmark.github.io/.

ICML Conference 2020 Conference Paper

CLUB: A Contrastive Log-ratio Upper Bound of Mutual Information

  • Pengyu Cheng
  • Weituo Hao
  • Shuyang Dai
  • Jiachang Liu 0001
  • Zhe Gan
  • Lawrence Carin

Mutual information (MI) minimization has gained considerable interest in various machine learning tasks. However, estimating and minimizing MI in high-dimensional spaces remains a challenging problem, especially when only samples, rather than distribution forms, are accessible. Previous works mainly focus on MI lower bound approximation, which is not applicable to MI minimization problems. In this paper, we propose a novel Contrastive Log-ratio Upper Bound (CLUB) of mutual information. We provide a theoretical analysis of the properties of CLUB and its variational approximation. Based on this upper bound, we introduce an MI minimization training scheme and further accelerate it with a negative sampling strategy. Simulation studies on Gaussian distributions show the reliable estimation ability of CLUB. Real-world MI minimization experiments, including domain adaptation and information bottleneck, demonstrate the effectiveness of the proposed method. The code is at https://github.com/Linear95/CLUB.
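For context, a commonly quoted sample-based form of the CLUB estimator is $\hat{I}_{\text{CLUB}} = \frac{1}{N}\sum_{i}\big[\log q_\theta(y_i \mid x_i) - \log q_\theta(y_{k_i} \mid x_i)\big]$, where $q_\theta$ is a variational approximation of $p(y \mid x)$ and $y_{k_i}$ is a randomly paired (negative) sample. The PyTorch sketch below assumes a Gaussian $q_\theta$; the network architecture and sizes are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn

class GaussianCLUB(nn.Module):
    """Sample-based CLUB mutual-information upper bound with a Gaussian q(y|x).

    A small network predicts the mean and log-variance of y given x; the bound is
    the mean log-density gap between matched pairs (x_i, y_i) and shuffled pairs.
    """
    def __init__(self, x_dim: int, y_dim: int, hidden: int = 128):
        super().__init__()
        self.mu_net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))
        self.logvar_net = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU(), nn.Linear(hidden, y_dim))

    def log_q(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # Gaussian log-density of y under q(y|x), dropping the constant term.
        mu, logvar = self.mu_net(x), self.logvar_net(x)
        return (-0.5 * (y - mu) ** 2 / logvar.exp() - 0.5 * logvar).sum(dim=-1)

    def mi_upper_bound(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        positive = self.log_q(x, y)                               # matched pairs (x_i, y_i)
        negative = self.log_q(x, y[torch.randperm(y.size(0))])    # shuffled pairs (x_i, y_j)
        return (positive - negative).mean()
```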

ICLR Conference 2020 Conference Paper

FreeLB: Enhanced Adversarial Training for Natural Language Understanding

  • Chen Zhu 0001
  • Yu Cheng 0001
  • Zhe Gan
  • Siqi Sun
  • Tom Goldstein
  • Jingjing Liu 0001

Adversarial training, which minimizes the maximal risk for label-preserving input perturbations, has proved to be effective for improving the generalization of language models. In this work, we propose a novel adversarial training algorithm, FreeLB, that promotes higher invariance in the embedding space, by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples. To validate the effectiveness of the proposed approach, we apply it to Transformer-based models for natural language understanding and commonsense reasoning tasks. Experiments on the GLUE benchmark show that when applied only to the finetuning stage, it is able to improve the overall test scores of BERT-base model from 78.3 to 79.4, and RoBERTa-large model from 88.5 to 88.8. In addition, the proposed approach achieves state-of-the-art single-model test accuracies of 85.44% and 67.75% on ARC-Easy and ARC-Challenge. Experiments on CommonsenseQA benchmark further demonstrate that FreeLB can be generalized and boost the performance of RoBERTa-large model on other tasks as well.
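A schematic view of the training loop sketched in the abstract (several gradient-ascent steps on an embedding-space perturbation while parameter gradients accumulate, followed by a single descent step) is given below in PyTorch. The `model(inputs_embeds=...)` interface, step sizes, initialization scale, and norm projection are assumptions for illustration, not the paper's exact algorithm or hyperparameters.

```python
import torch

def freelb_style_step(model, embeds, labels, loss_fn, optimizer,
                      ascent_steps: int = 3, adv_lr: float = 1e-1,
                      adv_init: float = 1e-2, max_norm: float = 1.0):
    """One schematic FreeLB-style update.

    `embeds` is assumed to be a detached (batch, seq, dim) tensor of word embeddings,
    and `model(inputs_embeds=...)` is assumed to return logits compatible with `loss_fn`.
    """
    delta = torch.zeros_like(embeds).uniform_(-adv_init, adv_init).requires_grad_(True)
    optimizer.zero_grad()
    for _ in range(ascent_steps):
        logits = model(inputs_embeds=embeds + delta)
        loss = loss_fn(logits, labels) / ascent_steps    # average the loss over ascent steps
        loss.backward()                                  # accumulates parameter gradients ("free" steps)
        grad = delta.grad.detach()
        # Ascend on the perturbation, then project each example back into an L2 ball.
        delta = (delta + adv_lr * grad / (grad.norm() + 1e-12)).detach()
        delta = delta.renorm(p=2, dim=0, maxnorm=max_norm).requires_grad_(True)
    optimizer.step()   # one descent step on the accumulated gradients
```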

ICML Conference 2020 Conference Paper

Graph Optimal Transport for Cross-Domain Alignment

  • Liqun Chen 0001
  • Zhe Gan
  • Yu Cheng 0001
  • Linjie Li
  • Lawrence Carin
  • Jingjing Liu 0001

Cross-domain alignment between two sets of entities (e.g., objects in an image, words in a sentence) is fundamental to both computer vision and natural language processing. Existing methods mainly focus on designing advanced attention mechanisms to simulate soft alignment, where no training signals are provided to explicitly encourage alignment. Plus, the learned attention matrices are often dense and difficult to interpret. We propose Graph Optimal Transport (GOT), a principled framework that builds upon recent advances in Optimal Transport (OT). In GOT, cross-domain alignment is formulated as a graph matching problem, by representing entities as a dynamically-constructed graph. Two types of OT distances are considered: (i) Wasserstein distance (WD) for node (entity) matching; and (ii) Gromov-Wasserstein distance (GWD) for edge (structure) matching. Both WD and GWD can be incorporated into existing neural network models, effectively acting as a drop-in regularizer. The inferred transport plan also yields sparse and self-normalized alignment, enhancing the interpretability of the learned model. Experiments show consistent outperformance of GOT over baselines across a wide range of tasks, including image-text retrieval, visual question answering, image captioning, machine translation, and text summarization.

AAAI Conference 2020 Conference Paper

Graph-Driven Generative Models for Heterogeneous Multi-Task Learning

  • Wenlin Wang
  • Hongteng Xu
  • Zhe Gan
  • Bai Li
  • Guoyin Wang
  • Liqun Chen
  • Qian Yang
  • Wenqi Wang

We propose a novel graph-driven generative model, that unifies multiple heterogeneous learning tasks into the same framework. The proposed model is based on the fact that heterogeneous learning tasks, which correspond to different generative processes, often rely on data with a shared graph structure. Accordingly, our model combines a graph convolutional network (GCN) with multiple variational autoencoders, thus embedding the nodes of the graph (i.e., samples for the tasks) in a uniform manner, while specializing their organization and usage to different tasks. With a focus on healthcare applications (tasks), including clinical topic modeling, procedure recommendation and admission-type prediction, we demonstrate that our method successfully leverages information across different tasks, boosting performance in all tasks and outperforming existing state-of-the-art approaches.

NeurIPS Conference 2020 Conference Paper

Large-Scale Adversarial Training for Vision-and-Language Representation Learning

  • Zhe Gan
  • Yen-Chun Chen
  • Linjie Li
  • Chen Zhu
  • Yu Cheng
  • Jingjing Liu

We present VILLA, the first known effort on large-scale adversarial training for vision-and-language (V+L) representation learning. VILLA consists of two training stages: (i) task-agnostic adversarial pre-training; followed by (ii) task-specific adversarial finetuning. Instead of adding adversarial perturbations on image pixels and textual tokens, we propose to perform adversarial training in the embedding space of each modality. To enable large-scale training, we adopt the ``free'' adversarial training strategy, and combine it with KL-divergence-based regularization to promote higher invariance in the embedding space. We apply VILLA to current best-performing V+L models, and achieve new state of the art on a wide range of tasks, including Visual Question Answering, Visual Commonsense Reasoning, Image-Text Retrieval, Referring Expression Comprehension, Visual Entailment, and NLVR2.

AAAI Conference 2020 Conference Paper

What Makes A Good Story? Designing Composite Rewards for Visual Storytelling

  • Junjie Hu
  • Yu Cheng
  • Zhe Gan
  • Jingjing Liu
  • Jianfeng Gao
  • Graham Neubig

Previous storytelling approaches mostly focused on optimizing traditional metrics such as BLEU, ROUGE and CIDEr. In this paper, we re-examine this problem from a different angle, by looking deep into what defines a natural and topically coherent story. To this end, we propose three assessment criteria: relevance, coherence and expressiveness, which we observe through empirical analysis could constitute a “high-quality” story to the human eye. We further propose a reinforcement learning framework, ReCo-RL, with reward functions designed to capture the essence of these quality criteria. Experiments on the Visual Storytelling Dataset (VIST) with both automatic and human evaluation demonstrate that our ReCo-RL model achieves better performance than state-of-the-art baselines on both traditional metrics and the proposed new criteria.

AAAI Conference 2019 Conference Paper

Hierarchically Structured Reinforcement Learning for Topically Coherent Visual Story Generation

  • Qiuyuan Huang
  • Zhe Gan
  • Asli Celikyilmaz
  • Dapeng Wu
  • Jianfeng Wang
  • Xiaodong He

We propose a hierarchically structured reinforcement learning approach to address the challenges of planning for generating coherent multi-sentence stories for the visual storytelling task. Within our framework, the task of generating a story given a sequence of images is divided across a two-level hierarchical decoder. The high-level decoder constructs a plan by generating a semantic concept (i.e., topic) for each image in sequence. The low-level decoder generates a sentence for each image using a semantic compositional network, which effectively grounds the sentence generation conditioned on the topic. The two decoders are jointly trained end-to-end using reinforcement learning. We evaluate our model on the visual storytelling (VIST) dataset. Empirical results from both automatic and human evaluations demonstrate that the proposed hierarchically structured reinforced training achieves significantly better performance compared to a strong flat deep reinforcement learning baseline.

NeurIPS Conference 2019 Conference Paper

Improving Textual Network Learning with Variational Homophilic Embeddings

  • Wenlin Wang
  • Chenyang Tao
  • Zhe Gan
  • Guoyin Wang
  • Liqun Chen
  • Xinyuan Zhang
  • Ruiyi Zhang
  • Qian Yang

The performance of many network learning applications crucially hinges on the success of network embedding algorithms, which aim to encode rich network information into low-dimensional vertex-based vector representations. This paper considers a novel variational formulation of network embeddings, with special focus on textual networks. Different from most existing methods that optimize a discriminative objective, we introduce Variational Homophilic Embedding (VHE), a fully generative model that learns network embeddings by modeling the semantic (textual) information with a variational autoencoder, while accounting for the structural (topology) information through a novel homophilic prior design. Homophilic vertex embeddings encourage similar embedding vectors for related (connected) vertices. The VHE encourages better generalization for downstream tasks, robustness to incomplete observations, and the ability to generalize to unseen vertices. Extensive experiments on real-world networks, for multiple tasks, demonstrate that the proposed method achieves consistently superior performance relative to competing state-of-the-art approaches.

AAAI Conference 2018 Conference Paper

Adaptive Feature Abstraction for Translating Video to Text

  • Yunchen Pu
  • Martin Min
  • Zhe Gan
  • Lawrence Carin

Previous models for video captioning often use the output from a specific layer of a Convolutional Neural Network (CNN) as video features. However, the variable context-dependent semantics in the video may make it more appropriate to adaptively select features from the multiple CNN layers. We propose a new approach to generating adaptive spatiotemporal representations of videos for the captioning task. A novel attention mechanism is developed, which adaptively and sequentially focuses on different layers of CNN features (levels of feature “abstraction”), as well as local spatiotemporal regions of the feature maps at each layer. The proposed approach is evaluated on three benchmark datasets: YouTube2Text, M-VAD and MSR-VTT. Along with visualizing the results and how the model works, these experiments quantitatively demonstrate the effectiveness of the proposed adaptive spatiotemporal feature abstraction for translating videos to sentences with rich semantics.

NeurIPS Conference 2018 Conference Paper

Adversarial Text Generation via Feature-Mover's Distance

  • Liqun Chen
  • Shuyang Dai
  • Chenyang Tao
  • Haichao Zhang
  • Zhe Gan
  • Dinghan Shen
  • Yizhe Zhang
  • Guoyin Wang

Generative adversarial networks (GANs) have achieved significant success in generating real-valued data. However, the discrete nature of text hinders the application of GAN to text-generation tasks. Instead of using the standard GAN objective, we propose to improve text-generation GAN via a novel approach inspired by optimal transport. Specifically, we consider matching the latent feature distributions of real and synthetic sentences using a novel metric, termed the feature-mover's distance (FMD). This formulation leads to a highly discriminative critic and easy-to-optimize objective, overcoming the mode-collapsing and brittle-training problems in existing methods. Extensive experiments are conducted on a variety of tasks to evaluate the proposed model empirically, including unconditional text generation, style transfer from non-parallel text, and unsupervised cipher cracking. The proposed model yields superior performance, demonstrating wide applicability and effectiveness.

NeurIPS Conference 2018 Conference Paper

Generating Informative and Diverse Conversational Responses via Adversarial Information Maximization

  • Yizhe Zhang
  • Michel Galley
  • Jianfeng Gao
  • Zhe Gan
  • Xiujun Li
  • Chris Brockett
  • Bill Dolan

Responses generated by neural conversational models tend to lack informativeness and diversity. We present Adversarial Information Maximization (AIM), an adversarial learning framework that addresses these two related but distinct problems. To foster response diversity, we leverage adversarial training that allows distributional matching of synthetic and real responses. To improve informativeness, our framework explicitly optimizes a variational lower bound on pairwise mutual information between query and response. Empirical results from automatic and human evaluations demonstrate that our methods significantly boost informativeness and diversity.

ICML Conference 2018 Conference Paper

JointGAN: Multi-Domain Joint Distribution Learning with Generative Adversarial Nets

  • Yunchen Pu
  • Shuyang Dai
  • Zhe Gan
  • Weiyao Wang 0002
  • Guoyin Wang 0002
  • Yizhe Zhang 0002
  • Ricardo Henao
  • Lawrence Carin

A new generative adversarial network is developed for joint distribution matching. Distinct from most existing approaches, which only learn conditional distributions, the proposed model aims to learn a joint distribution of multiple random variables (domains). This is achieved by learning to sample from conditional distributions between the domains, while simultaneously learning to sample from the marginals of each individual domain. The proposed framework consists of multiple generators and a single softmax-based critic, all jointly trained via adversarial learning. From a simple noise source, the proposed framework allows synthesis of draws from the marginals, conditional draws given observations from a subset of random variables, or complete draws from the full joint distribution. Most examples considered are for joint analysis of two domains, with examples for three domains also presented.

ICML Conference 2017 Conference Paper

Adversarial Feature Matching for Text Generation

  • Yizhe Zhang 0002
  • Zhe Gan
  • Kai Fan 0002
  • Zhi Chen 0009
  • Ricardo Henao
  • Dinghan Shen
  • Lawrence Carin

The Generative Adversarial Network (GAN) has achieved great success in generating realistic (real-valued) synthetic data. However, convergence issues and difficulties dealing with discrete data hinder the applicability of GAN to text. We propose a framework for generating realistic text via adversarial training. We employ a long short-term memory network as generator, and a convolutional network as discriminator. Instead of using the standard objective of GAN, we propose matching the high-dimensional latent feature distributions of real and synthetic sentences, via a kernelized discrepancy metric. This eases adversarial training by alleviating the mode-collapsing problem. Our experiments show superior performance in quantitative evaluation, and demonstrate that our model can generate realistic-looking sentences.
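
As a generic example of such a kernelized discrepancy, the sketch below computes a (biased) squared maximum mean discrepancy between real and synthetic sentence feature matrices with an RBF kernel; it illustrates the feature-matching objective in spirit rather than reproducing the paper's exact metric.

```python
# Squared MMD between real and generated sentence feature matrices with an RBF
# kernel; a generic kernelized feature-matching loss for illustration only.
import numpy as np

def rbf_kernel(X, Y, sigma=1.0):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2(F_real, F_fake, sigma=1.0):
    Kxx = rbf_kernel(F_real, F_real, sigma)
    Kyy = rbf_kernel(F_fake, F_fake, sigma)
    Kxy = rbf_kernel(F_real, F_fake, sigma)
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()
```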

NeurIPS Conference 2017 Conference Paper

Adversarial Symmetric Variational Autoencoder

  • Yunchen Pu
  • Weiyao Wang
  • Ricardo Henao
  • Liqun Chen
  • Zhe Gan
  • Chunyuan Li
  • Lawrence Carin

A new form of variational autoencoder (VAE) is developed, in which the joint distribution of data and codes is considered in two (symmetric) forms: (i) from observed data fed through the encoder to yield codes, and (ii) from latent codes drawn from a simple prior and propagated through the decoder to manifest data. Lower bounds are learned for the marginal log-likelihoods of the observed data and of the latent codes. When learning with the variational bound, one seeks to minimize the symmetric Kullback-Leibler divergence between the joint density functions from (i) and (ii), while simultaneously seeking to maximize the two marginal log-likelihoods. To facilitate learning, a new form of adversarial training is developed. An extensive set of experiments is performed, in which we demonstrate state-of-the-art data reconstruction and generation on several benchmark image datasets.
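
Schematically, the two joints and the symmetric objective can be written as follows (notation ours; $q_\phi$ denotes the encoder and $p_\theta$ the decoder):

```latex
% Encoder-side and decoder-side joints over data x and codes z:
q(x, z) \;=\; q(x)\, q_\phi(z \mid x), \qquad p(x, z) \;=\; p(z)\, p_\theta(x \mid z),
% and learning minimizes their symmetric Kullback-Leibler divergence,
\mathrm{KL}\!\left(q(x, z)\,\|\,p(x, z)\right) \;+\; \mathrm{KL}\!\left(p(x, z)\,\|\,q(x, z)\right),
% while simultaneously maximizing the two marginal log-likelihoods via learned lower bounds.
```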

NeurIPS Conference 2017 Conference Paper

Deconvolutional Paragraph Representation Learning

  • Yizhe Zhang
  • Dinghan Shen
  • Guoyin Wang
  • Zhe Gan
  • Ricardo Henao
  • Lawrence Carin

Learning latent representations from long text sequences is an important first step in many natural language processing applications. Recurrent Neural Networks (RNNs) have become a cornerstone for this challenging task. However, the quality of sentences during RNN-based decoding (reconstruction) decreases with the length of the text. We propose a sequence-to-sequence, purely convolutional and deconvolutional autoencoding framework that is free of the above issue, while also being computationally efficient. The proposed method is simple, easy to implement and can be leveraged as a building block for many applications. We show empirically that, compared to RNNs, our framework is better at reconstructing and correcting long paragraphs. Quantitative evaluation on semi-supervised text classification and summarization tasks demonstrates the potential for better utilization of long unlabeled text data.
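
A minimal sketch of such a purely convolutional/deconvolutional text autoencoder is shown below (an assumed PyTorch configuration for illustration; layer sizes and strides are not taken from the paper):

```python
# Convolution/deconvolution text autoencoder sketch: embed tokens, compress with
# strided Conv1d layers, reconstruct with ConvTranspose1d layers. Assumes the
# sequence length is divisible by 4 (two stride-2 layers each way).
import torch
import torch.nn as nn

class ConvDeconvAE(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hid=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.Sequential(
            nn.Conv1d(emb_dim, hid, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv1d(hid, hid, kernel_size=5, stride=2, padding=2), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(hid, hid, kernel_size=5, stride=2,
                               padding=2, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(hid, emb_dim, kernel_size=5, stride=2,
                               padding=2, output_padding=1),
        )
        self.out = nn.Linear(emb_dim, vocab_size)

    def forward(self, tokens):                      # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)      # (batch, emb_dim, seq_len)
        z = self.encoder(x)                         # compressed latent feature map
        recon = self.decoder(z).transpose(1, 2)     # (batch, seq_len, emb_dim)
        return self.out(recon)                      # per-position vocabulary logits
```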

ICML Conference 2017 Conference Paper

Stochastic Gradient Monomial Gamma Sampler

  • Yizhe Zhang 0002
  • Changyou Chen
  • Zhe Gan
  • Ricardo Henao
  • Lawrence Carin

Scaling Markov Chain Monte Carlo (MCMC) to estimate posterior distributions from large datasets has been made possible by advances in stochastic gradient techniques. Despite their success, existing methods can mix inefficiently when sampling from multimodal distributions with a limited budget of Monte Carlo samples, as evidenced by slow convergence and insufficient exploration of the posterior. We propose a generalized framework that improves the sampling efficiency of stochastic gradient MCMC by leveraging generalized kinetics that deliver superior stationary mixing, especially for multimodal distributions, and propose several techniques to overcome the associated practical issues. We show that the proposed approach is better at exploring complicated multimodal posterior distributions, and demonstrate improvements over other stochastic gradient MCMC methods on various applications.
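
For context, the sketch below shows the plain stochastic gradient Langevin dynamics (SGLD) update that stochastic gradient MCMC methods build on; the paper's generalized kinetics for improved multimodal mixing are not reproduced here.

```python
# Plain stochastic gradient Langevin dynamics (SGLD) step: a noisy gradient
# ascent on the log posterior with injected Gaussian noise of matched scale.
# `grad_log_post` is an assumed function returning a minibatch estimate of
# the gradient of the log posterior at theta.
import numpy as np

def sgld_step(theta, grad_log_post, step_size, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    noise = rng.normal(scale=np.sqrt(step_size), size=theta.shape)
    return theta + 0.5 * step_size * grad_log_post(theta) + noise
```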

NeurIPS Conference 2017 Conference Paper

Triangle Generative Adversarial Networks

  • Zhe Gan
  • Liqun Chen
  • Weiyao Wang
  • Yunchen Pu
  • Yizhe Zhang
  • Hao Liu
  • Chunyuan Li
  • Lawrence Carin

A Triangle Generative Adversarial Network ($\Delta$-GAN) is developed for semi-supervised cross-domain joint distribution matching, where the training data consists of samples from each domain, and supervision of domain correspondence is provided by only a few paired samples. $\Delta$-GAN consists of four neural networks: two generators and two discriminators. The generators are designed to learn the two-way conditional distributions between the two domains, while the discriminators implicitly define a ternary discriminative function that is trained to distinguish real data pairs from two kinds of fake data pairs. The generators and discriminators are trained together using adversarial learning. Under mild assumptions, the joint distributions characterized by the two generators provably concentrate on the data distribution. In experiments, three different kinds of domain pairs are considered: image-label, image-image, and image-attribute pairs. Experiments on semi-supervised image classification, image-to-image translation, and attribute-based image generation demonstrate the superiority of the proposed approach.

AAAI Conference 2017 Conference Paper

Unsupervised Learning with Truncated Gaussian Graphical Models

  • Qinliang Su
  • Xuejun Liao
  • Chunyuan Li
  • Zhe Gan
  • Lawrence Carin

Gaussian graphical models (GGMs) are widely used for statistical modeling, because of ease of inference and the ubiquitous use of the normal distribution in practical approximations. However, they are also known for their limited modeling abilities, due to the Gaussian assumption. In this paper, we introduce a novel variant of GGMs, which relaxes the Gaussian restriction and yet admits efficient inference. Specifically, we impose a bipartite structure on the GGM and govern the hidden variables by truncated normal distributions. The nonlinearity of the model is revealed by its connection to rectified linear unit (ReLU) neural networks. Meanwhile, thanks to the bipartite structure and appealing properties of truncated normals, we are able to train the models efficiently using contrastive divergence. We consider three output constructs, accounting for real-valued, binary and count data. We further extend the model to deep constructions and show that deep models can be used for unsupervised pre-training of rectifier neural networks. Extensive experimental results are provided to validate the proposed models and demonstrate their superiority over competing models.
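
One standard way to see the ReLU connection mentioned above (our gloss, using a textbook property of truncated normals, not necessarily the paper's derivation): the mean of a hidden unit with Gaussian pre-activation mean $\mu$ and variance $\sigma^2$, truncated to be nonnegative, is

```latex
% Mean of X ~ N(\mu, \sigma^2) truncated to [0, \infty):
\mathbb{E}[X] \;=\; \mu + \sigma\,\frac{\phi(\mu/\sigma)}{\Phi(\mu/\sigma)}
\;\longrightarrow\; \max(0, \mu) \;=\; \mathrm{ReLU}(\mu) \quad \text{as } \sigma \to 0,
% where \phi and \Phi denote the standard normal pdf and cdf.
```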

NeurIPS Conference 2017 Conference Paper

VAE Learning via Stein Variational Gradient Descent

  • Yunchen Pu
  • Zhe Gan
  • Ricardo Henao
  • Chunyuan Li
  • Shaobo Han
  • Lawrence Carin

A new method for learning variational autoencoders (VAEs) is developed, based on Stein variational gradient descent. A key advantage of this approach is that one need not make parametric assumptions about the form of the encoder distribution. Performance is further enhanced by integrating the proposed encoder with importance sampling. Excellent performance is demonstrated across multiple unsupervised and semi-supervised problems, including semi-supervised analysis of the ImageNet data, demonstrating the scalability of the model to large datasets.
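
The underlying Stein variational gradient descent step moves a set of particles along a kernelized gradient direction; a minimal numpy sketch of the standard SVGD update (not the paper's VAE-specific integration) follows.

```python
# Standard SVGD particle update with an RBF kernel: each particle is pushed by
# a kernel-weighted average of log-density gradients (attraction toward high
# density) plus the gradient of the kernel (repulsion between particles).
import numpy as np

def svgd_step(particles, grad_log_p, step_size=0.1, sigma=1.0):
    # particles: (n, d); grad_log_p: function mapping (n, d) -> (n, d)
    diff = particles[:, None, :] - particles[None, :, :]      # (n, n, d) pairwise x_i - x_j
    K = np.exp(-(diff ** 2).sum(-1) / (2.0 * sigma ** 2))     # (n, n) RBF kernel matrix
    grads = grad_log_p(particles)                              # (n, d) score estimates
    attract = K @ grads                                        # kernel-smoothed gradients
    repulse = (K[:, :, None] * diff).sum(axis=1) / sigma ** 2  # gradient of the kernel
    phi = (attract + repulse) / len(particles)
    return particles + step_size * phi
```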

ICML Conference 2016 Conference Paper

Factored Temporal Sigmoid Belief Networks for Sequence Learning

  • Jiaming Song
  • Zhe Gan
  • Lawrence Carin

Deep conditional generative models are developed to simultaneously learn the temporal dependencies of multiple sequences. The model is designed by introducing a three-way weight tensor to capture the multiplicative interactions between side information and sequences. The proposed model builds on the Temporal Sigmoid Belief Network (TSBN), a sequential stack of Sigmoid Belief Networks (SBNs). The transition matrices are further factored to reduce the number of parameters and improve generalization. When side information is not available, a general framework for semi-supervised learning based on the proposed model is constituted, allowing robust sequence classification. Experimental results show that the proposed approach achieves state-of-the-art predictive and classification performance on sequential data, and has the capacity to synthesize sequences, with controlled style transitioning and blending.

NeurIPS Conference 2016 Conference Paper

Variational Autoencoder for Deep Learning of Images, Labels and Captions

  • Yunchen Pu
  • Zhe Gan
  • Ricardo Henao
  • Xin Yuan
  • Chunyuan Li
  • Andrew Stevens
  • Lawrence Carin

A novel variational autoencoder is developed to model images, as well as associated labels or captions. The Deep Generative Deconvolutional Network (DGDN) is used as a decoder of the latent image features, and a deep Convolutional Neural Network (CNN) is used as an image encoder; the CNN is used to approximate a distribution for the latent DGDN features/code. The latent code is also linked to generative models for labels (Bayesian support vector machine) or captions (recurrent neural network). When predicting a label/caption for a new image at test time, averaging is performed across the distribution of latent codes; this is computationally efficient as a consequence of the learned CNN-based encoder. Since the framework is capable of modeling the image in the presence/absence of associated labels/captions, a new semi-supervised setting is manifested for CNN learning with images; the framework even allows unsupervised CNN learning, based on images alone.

NeurIPS Conference 2015 Conference Paper

Deep Poisson Factor Modeling

  • Ricardo Henao
  • Zhe Gan
  • James Lu
  • Lawrence Carin

We propose a new deep architecture for topic modeling, based on Poisson Factor Analysis (PFA) modules. The model is composed of a Poisson distribution to model observed vectors of counts, as well as a deep hierarchy of hidden binary units. Rather than using logistic functions to characterize the probability that a latent binary unit is on, we employ a Bernoulli-Poisson link, which allows PFA modules to be used repeatedly in the deep architecture. We also describe an approach to building discriminative topic models by adapting PFA modules. We derive efficient inference via MCMC and stochastic variational methods that scale with the number of non-zeros in the data and binary units, yielding significant computational savings relative to models based on logistic links. Experiments on several corpora demonstrate the advantages of our model when compared to related deep models.
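
The Bernoulli-Poisson link referred to above thresholds a latent Poisson count to obtain a binary unit; in its usual form:

```latex
% Bernoulli-Poisson link: a binary unit b is the indicator that a latent
% Poisson count n is positive,
n \sim \mathrm{Poisson}(\lambda), \qquad b = \mathbf{1}(n \geq 1),
% so that  p(b = 1) = 1 - e^{-\lambda},
% keeping the binary units compatible with Poisson factor analysis modules.
```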

NeurIPS Conference 2015 Conference Paper

Deep Temporal Sigmoid Belief Networks for Sequence Modeling

  • Zhe Gan
  • Chunyuan Li
  • Ricardo Henao
  • David Carlson
  • Lawrence Carin

Deep dynamic generative models are developed to learn sequential dependencies in time-series data. The multi-layered model is designed by constructing a hierarchy of temporal sigmoid belief networks (TSBNs), defined as a sequential stack of sigmoid belief networks (SBNs). Each SBN has a contextual hidden state, inherited from the previous SBNs in the sequence, which is used to regulate its hidden bias. Scalable learning and inference algorithms are derived by introducing a recognition model that yields fast sampling from the variational posterior. This recognition model is trained jointly with the generative model, by maximizing its variational lower bound on the log-likelihood. Experimental results on bouncing balls, polyphonic music, motion capture, and text streams show that the proposed approach achieves state-of-the-art predictive performance, and has the capacity to synthesize various sequences.

ICML Conference 2015 Conference Paper

Scalable Deep Poisson Factor Analysis for Topic Modeling

  • Zhe Gan
  • Changyou Chen
  • Ricardo Henao
  • David E. Carlson
  • Lawrence Carin

A new framework for topic modeling is developed, based on deep graphical models, where interactions between topics are inferred through deep latent binary hierarchies. The proposed multi-layer model employs a deep sigmoid belief network or restricted Boltzmann machine, the bottom binary layer of which selects topics for use in a Poisson factor analysis model. Under this setting, topics live on the bottom layer of the model, while the deep specification serves as a flexible prior for revealing topic structure. Scalable inference algorithms are derived by applying a Bayesian conditional density filtering algorithm, and by extending recently proposed work on stochastic gradient thermostats. Experimental results on several corpora show that the proposed approach readily handles very large collections of text documents, infers structured topic representations, and obtains superior test perplexities when compared with related models.