Arrow Research search

Author name cluster

Jun Zhu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

143 papers
2 author rows

Possible papers

143

AAAI Conference 2026 Conference Paper

Benchmarking Trustworthiness in Multimodal LLMs for Video Understanding

  • Youze Wang
  • Zijun Chen
  • Ruoyu Chen
  • Shishen Gu
  • Wenbo Hu
  • Jiayang Liu
  • Yinpeng Dong
  • Hang Su

Recent advancements in multimodal large language models for video understanding (videoLLMs) have enhanced their capacity to process complex spatiotemporal data. However, challenges such as factual inaccuracies, harmful content, biases, hallucinations, and privacy risks compromise their reliability. This study introduces Trust-videoLLMs, the first comprehensive benchmark evaluating 23 state-of-the-art videoLLMs (5 commercial, 18 open-source) across five critical dimensions: truthfulness, robustness, safety, fairness, and privacy. Comprising 30 tasks with adapted, synthetic, and annotated videos, the framework assesses spatiotemporal risks, temporal consistency, and cross-modal impact. Results reveal significant limitations in dynamic scene comprehension, cross-modal perturbation resilience, and real-world risk mitigation. While open-source models occasionally outperform, proprietary models generally exhibit superior credibility, though scaling does not consistently improve performance. These findings underscore the need for enhanced training data diversity and robust multimodal alignment. Trust-videoLLMs provides a publicly available, extensible toolkit for standardized trustworthiness assessments, addressing the critical gap between accuracy-focused benchmarks and demands for robustness, safety, fairness, and privacy.

AAAI Conference 2026 Conference Paper

H-RDT: Human Manipulation Enhanced Bimanual Robotic Manipulation

  • Hongzhe Bi
  • Lingxuan Wu
  • Tianwei Lin
  • Hengkai Tan
  • Zhizhong Su
  • Hang Su
  • Jun Zhu

Imitation learning for robotic manipulation faces a fundamental challenge: the scarcity of large-scale, high-quality robot demonstration data. Recent robotic foundation models often pre-train on cross-embodiment robot datasets to increase data scale, but face significant limitations, as the diverse morphologies and action spaces of different robot embodiments make unified training challenging. In this paper, we present H-RDT (Human to Robotics Diffusion Transformer), a novel approach that leverages human manipulation data to enhance robot manipulation capabilities. Our key insight is that large-scale egocentric human manipulation videos with paired 3D hand pose annotations provide rich behavioral priors that capture natural manipulation strategies and can benefit robotic policy learning. We introduce a two-stage training paradigm: (1) pre-training on large-scale egocentric human manipulation data, and (2) cross-embodiment fine-tuning on robot-specific data with modular action encoders and decoders. Built on a diffusion transformer architecture with 2B parameters, H-RDT uses flow matching to model complex action distributions. The modular design of action encoder and decoder components enables effective knowledge transfer from the unified human embodiment to diverse robot platforms through efficient fine-tuning. Extensive evaluations encompassing both simulation and real-world experiments, single-task and multi-task scenarios, as well as few-shot learning and robustness assessments, demonstrate that H-RDT outperforms training from scratch and existing state-of-the-art methods, including π0 and RDT, achieving significant improvements of 13.9% and 40.5% over training from scratch in simulation and real-world experiments, respectively. The results validate our core hypothesis that human manipulation data can serve as a powerful foundation for learning bimanual robotic manipulation policies.

NeurIPS Conference 2025 Conference Paper

A Regularized Newton Method for Nonconvex Optimization with Global and Local Complexity Guarantees

  • Yuhao Zhou
  • Jintao Xu
  • Bingrui Li
  • Chenglong Bao
  • Chao Ding
  • Jun Zhu

Finding an $\epsilon$-stationary point of a nonconvex function with a Lipschitz continuous Hessian is a central problem in optimization. Regularized Newton methods are a classical tool and have been studied extensively, yet they still face a trade-off between global and local convergence. Whether a parameter-free algorithm of this type can simultaneously achieve optimal global complexity and quadratic local convergence remains an open question. To bridge this long-standing gap, we propose a new class of regularizers constructed from the current and previous gradients, and leverage the conjugate gradient approach with a negative curvature monitor to solve the regularized Newton equation. The proposed algorithm is adaptive, requiring no prior knowledge of the Hessian Lipschitz constant, and achieves a global complexity of $O(\epsilon^{-\frac{3}{2}})$ in terms of second-order oracle calls and $\tilde O(\epsilon^{-\frac{7}{4}})$ in terms of Hessian-vector products. When the iterates converge to a point where the Hessian is positive definite, the method exhibits quadratic local convergence. Preliminary numerical results, including training physics-informed neural networks, illustrate the competitiveness of our algorithm.
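The classical template behind this line of work can be sketched in a few lines. The snippet below is a generic regularized Newton step solved by plain conjugate gradient, applied to a hypothetical test function with the common heuristic choice $\sigma \propto \sqrt{\|g\|}$; it does not reproduce the paper's gradient-history regularizer or its negative-curvature monitor.

```python
import numpy as np

def regularized_newton_step(grad, hess, x, sigma):
    """One regularized Newton step: solve (H + sigma*I) d = -g with conjugate
    gradient, then move to x + d. A generic sketch of the classical template
    the abstract refines, not the paper's algorithm."""
    g, H = grad(x), hess(x)
    n = x.size
    A = H + sigma * np.eye(n)  # regularization shifts the Hessian spectrum
    d = np.zeros(n)
    r = -g.copy()              # residual of A d = -g at d = 0
    p = r.copy()
    for _ in range(5 * n):     # CG terminates in at most n exact-arithmetic steps
        rr = r @ r
        if np.sqrt(rr) < 1e-12:
            break
        Ap = A @ p
        alpha = rr / (p @ Ap)
        d += alpha * p
        r -= alpha * Ap
        p = r + ((r @ r) / rr) * p
    return x + d

# Minimize f(x) = x0^4 + x1^2; sigma ~ sqrt(||g||) is a common adaptive choice.
grad = lambda x: np.array([4 * x[0] ** 3, 2 * x[1]])
hess = lambda x: np.diag([12 * x[0] ** 2, 2.0])
x = np.array([1.0, 1.0])
for _ in range(30):
    x = regularized_newton_step(grad, hess, x, np.sqrt(np.linalg.norm(grad(x))))
```

Because the shift $\sigma$ shrinks with the gradient norm, the iteration behaves like a damped Newton method far from a stationary point and approaches the pure Newton step near one.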

NeurIPS Conference 2025 Conference Paper

Audio Super-Resolution with Latent Bridge Models

  • Chang Li
  • Zehua Chen
  • Liyuan Wang
  • Jun Zhu

Audio super-resolution (SR), i.e., upsampling the low-resolution (LR) waveform to the high-resolution (HR) version, has recently been explored with diffusion and bridge models, yet previous methods often suffer from sub-optimal upsampling quality due to their uninformative generation prior. Towards high-quality audio super-resolution, we present a new system with latent bridge models (LBMs), where we compress the audio waveform into a continuous latent space and design an LBM to enable a latent-to-latent generation process that naturally matches the LR-to-HR upsampling process, thereby fully exploiting the instructive prior information contained in the LR waveform. To further enhance the training results despite the limited availability of HR samples, we introduce frequency-aware LBMs, where the prior and target frequency are taken as model input, enabling LBMs to explicitly learn an any-to-any upsampling process at the training stage. Furthermore, we design cascaded LBMs and present two prior augmentation strategies, where we make the first attempt to unlock audio upsampling beyond 48 kHz and empower a seamless cascaded SR process, providing higher flexibility for audio post-production. Comprehensive experimental results evaluated on the VCTK, ESC-50, and Song-Describer benchmark datasets and two internal test sets demonstrate that we achieve state-of-the-art objective and perceptual quality for any-to-48 kHz SR across speech, audio, and music signals, as well as setting the first record for any-to-192 kHz audio SR. Demo at https://AudioLBM.github.io/.

AAAI Conference 2025 Conference Paper

DUSTED: Dual-Attention Enhanced Spatial Transcriptomics Denoiser

  • Jun Zhu
  • Yifu Li
  • Zhenchao Tang
  • Cheng Chang

Spatially Resolved Transcriptomics (SRT) has become an indispensable tool in various fields, including tumor microenvironment identification, neurobiology, and the study of complex tissue architecture. However, the accuracy of these insights is often compromised by noise in spatial transcriptomics data due to technical limitations. While recent advancements in denoising methods have shown some promise, they frequently fall short by neglecting spatial features, overlooking the variability in noise levels among genes, and relying heavily on external histological images for supplementary information. In our study, we propose DUSTED, a Dual-Attention Enhanced Spatial Transcriptomics Denoiser, designed to address these challenges. Built on a graph autoencoder framework, DUSTED utilizes gene channel attention and graph attention mechanisms to simultaneously consider spatial features and noise variability in gene expression data. Additionally, it integrates the negative binomial distribution with or without zero-inflation, ensuring a more accurate fit for gene expression distributions. Benchmark tests using simulated datasets demonstrate that DUSTED outperforms existing methods. Furthermore, in real-world applications with the HOCWTA and DLPFC datasets, DUSTED excels in enhancing the correlation between gene and protein expression, recovering spatial gene expression patterns, and improving clustering results. These improvements underscore its potential impact on advancing our understanding of tumor microenvironments, neural tissue organization, and other biologically significant areas.

IROS Conference 2025 Conference Paper

Navi2Gaze: Leveraging Foundation Models for Navigation and Target Gazing

  • Jun Zhu
  • Zihao Du
  • Haotian Xu
  • Fengbo Lan
  • Zilong Zheng
  • Bo Ma
  • Shengjie Wang
  • Tao Zhang

Task-aware navigation continues to be a challenging area of research, especially in scenarios involving open vocabulary. Previous studies primarily focus on finding suitable locations for task completion, often overlooking the importance of the robot’s pose. However, the robot’s orientation is crucial for successfully completing tasks because of how objects are arranged (e.g., to open a refrigerator door). Humans intuitively navigate to objects with the right orientation using semantics and common sense. For instance, when opening a refrigerator, we naturally stand in front of it rather than to the side. Recent advances suggest that Vision-Language Models (VLMs) can provide robots with similar common sense. Therefore, we develop a VLM-driven method called Navigation-to-Gaze (Navi2Gaze) for efficient navigation and object gazing based on task descriptions. This method uses the VLM to score and select the best pose from numerous candidates automatically. In evaluations on multiple photorealistic simulation benchmarks, Navi2Gaze significantly outperforms existing approaches by precisely determining the optimal orientation relative to target objects, resulting in a 68.8% reduction in Distance to Goal (DTG). Real-world video demonstrations can be found on the supplementary website.

ICLR Conference 2025 Conference Paper

PivotMesh: Generic 3D Mesh Generation via Pivot Vertices Guidance

  • Haohan Weng
  • Yikai Wang
  • Tong Zhang 0015
  • C. L. Philip Chen
  • Jun Zhu

Generating compact and sharply detailed 3D meshes poses a significant challenge for current 3D generative models. Different from extracting dense meshes from neural representations, some recent works try to model the native mesh distribution (i.e., a set of triangles), which produces more compact results akin to human-crafted meshes. However, due to the complexity and variety of mesh topology, most of these methods are typically limited to generating meshes with simple geometry. In this paper, we introduce a generic and scalable mesh generation framework, PivotMesh, which makes an initial attempt to extend native mesh generation to large-scale datasets. We employ a transformer-based autoencoder to encode meshes into discrete tokens and decode them hierarchically from the face level to the vertex level. Subsequently, to model the complex topology, our model first learns to generate pivot vertices as a coarse mesh representation and then generates the complete mesh tokens with the same auto-regressive Transformer. This reduces the difficulty compared with directly modeling the mesh distribution and further improves model controllability. PivotMesh demonstrates its versatility by effectively learning from both small datasets like ShapeNet and large-scale datasets like Objaverse and Objaverse-XL. Extensive experiments indicate that PivotMesh can generate compact and sharp 3D meshes across various categories, highlighting its great potential for native mesh modeling.

AAAI Conference 2025 Conference Paper

Pruning Large Language Models with Semi-Structural Adaptive Sparse Training

  • Weiyu Huang
  • Yuezhou Hu
  • Guohao Jian
  • Jun Zhu
  • Jianfei Chen

The remarkable success of Large Language Models (LLMs) relies heavily on their substantial scale, which poses significant challenges during model deployment in terms of latency and memory consumption. Recently, numerous studies have attempted to compress LLMs using one-shot pruning methods. However, these methods often suffer from considerable performance degradation on complex language understanding tasks, raising concerns about the feasibility of pruning in LLMs. To address this issue, we propose Adaptive Sparse Trainer (AST), a novel and efficient retraining framework tailored for semi-structured sparse models. AST enables models to learn optimal masks during the weight update process without incurring additional computational overhead. Furthermore, we demonstrate that incorporating knowledge distillation significantly improves retraining efficiency and enhances model performance under fixed computational constraints. Additionally, a supplementary set of well-initialized parameters is integrated to further augment the model's efficacy. AST achieves state-of-the-art performance with minimal training cost. When applied to the LLaMA2-7B model, AST reduces the perplexity and zero-shot accuracy gap between dense and 2:4 semi-structured sparse models to 0.6 and 1.16%, respectively, utilizing less than 0.4% of the pretraining tokens and GPU hours. Our work demonstrates the feasibility of deploying semi-structured sparse LLMs and offers a promising alternative for achieving highly compressed models when combined with existing quantization techniques.
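For readers unfamiliar with the 2:4 pattern targeted above: in every contiguous group of four weights, exactly two may be nonzero. A minimal magnitude-based mask (the usual one-shot baseline, not AST's learned masks) can be sketched as:

```python
import numpy as np

def mask_2_to_4(weights):
    """Build a 2:4 semi-structured sparsity mask: in every contiguous group of
    four weights, keep the two with largest magnitude and zero the rest.
    Illustrates the 2:4 pattern itself, not AST's mask learning."""
    w = np.asarray(weights, dtype=float)
    assert w.size % 4 == 0, "2:4 sparsity needs the weight count divisible by 4"
    groups = np.abs(w).reshape(-1, 4)
    # argsort ascending; the last two indices per group are the survivors
    keep = np.argsort(groups, axis=1)[:, 2:]
    mask = np.zeros_like(groups, dtype=bool)
    np.put_along_axis(mask, keep, True, axis=1)
    return mask.reshape(w.shape)

w = np.array([0.1, -0.9, 0.5, 0.05, -0.3, 0.2, 0.7, -0.6])
m = mask_2_to_4(w)
pruned = w * m  # exactly half the weights survive in each group of four
```

The appeal of this fixed pattern is that it maps directly onto sparse Tensor Core hardware, which is why retraining methods like AST target it rather than unstructured sparsity.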

IJCAI Conference 2025 Conference Paper

Riding the Wave: Multi-Scale Spatial-Temporal Graph Learning for Highway Traffic Flow Prediction Under Overload Scenarios

  • Xigang Sun
  • Jiahui Jin
  • Hancheng Wang
  • Xiangguo Sun
  • Xiaoliang Wang
  • Jun Zhu

Highway traffic flow prediction under overload scenarios (HIPO) is a critical problem in intelligent transportation systems, which aims to forecast future traffic patterns on highway segments during periods of exceptionally high demand. Despite its importance, this problem has rarely been explored in recent research due to the unique challenges posed by irregular flow patterns, complex traffic behaviors, and sparse contextual data. In this paper, we propose a Heterogeneous Spatial-Temporal graph network With Adaptive contrastiVE learning (HST-WAVE) to address the HIPO problem. Specifically, we first construct a heterogeneous traffic graph according to the physical highway structure. Then, we develop a multi-scale temporal weaving Transformer and a coupled heterogeneous graph attention network to capture the irregular traffic flow patterns and complex transition behaviors. Furthermore, we introduce an adaptive temporal enhancement contrastive learning strategy to bridge the gap between divergent temporal patterns and mitigate data sparsity. We conduct extensive experiments on two real-world highway network datasets (No. G56 and G60 in Hangzhou, China), showing that our model can effectively handle the HIPO problem and achieve state-of-the-art performance. The source code is available at https://github.com/luck-seu/HST-WAVE.

NeurIPS Conference 2025 Conference Paper

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

  • Jintao Zhang
  • Jia wei
  • Haoxu Wang
  • Pengle Zhang
  • Xiaoming Xu
  • Haofeng Huang
  • Kai Jiang
  • Jianfei Chen

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new $\texttt{FP4}$ Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves $\textbf{1038}$ $\texttt{TOPS}$ on $\texttt{RTX5090}$, which is a $\textbf{5}\times$ speedup over the fastest FlashAttention on $\texttt{RTX5090}$. Experiments show that our $\texttt{FP4}$ attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer low-bit attention to training tasks. Existing low-bit attention works like FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient $\texttt{8-bit}$ attention for both forward and backward propagation. Experiments indicate that $\texttt{8-bit}$ attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.

NeurIPS Conference 2025 Conference Paper

Scaling Diffusion Transformers Efficiently via $\mu$P

  • Chenyu Zheng
  • Xinyu Zhang
  • Rongzhen Wang
  • Wei Huang
  • Zhi Tian
  • Weilin Huang
  • Jun Zhu
  • Chongxuan Li

Diffusion Transformers have emerged as the foundation for vision generative models, but their scalability is limited by the high cost of hyperparameter (HP) tuning at large scales. Recently, Maximal Update Parametrization ($\mu$P) was proposed for vanilla Transformers, which enables stable HP transfer from small to large language models and dramatically reduces tuning costs. However, it remains unclear whether $\mu$P of vanilla Transformers extends to diffusion Transformers, which differ architecturally and objectively. In this work, we generalize $\mu$P to diffusion Transformers and validate its effectiveness through large-scale experiments. First, we rigorously prove that $\mu$P of mainstream diffusion Transformers, including DiT, U-ViT, PixArt-$\alpha$, and MMDiT, aligns with that of the vanilla Transformer, enabling the direct application of existing $\mu$P methodologies. Leveraging this result, we systematically demonstrate that DiT-$\mu$P enjoys robust HP transferability. Notably, DiT-XL-2-$\mu$P with transferred learning rate achieves 2.9$\times$ faster convergence than the original DiT-XL-2. Finally, we validate the effectiveness of $\mu$P on text-to-image generation by scaling PixArt-$\alpha$ from 0.04B to 0.61B and MMDiT from 0.18B to 18B. In both cases, models under $\mu$P outperform their respective baselines while requiring only a small tuning cost: 5.5% of one training run for PixArt-$\alpha$ and 3% of the compute consumed by human experts for MMDiT-18B. \textit{These results establish $\mu$P as a principled and efficient framework for scaling diffusion Transformers}.

IJCAI Conference 2025 Conference Paper

Self-Consistent Model-based Adaptation for Visual Reinforcement Learning

  • Xinning Zhou
  • Chengyang Ying
  • Yao Feng
  • Hang Su
  • Jun Zhu

Visual reinforcement learning agents typically face serious performance declines in real-world applications caused by visual distractions. Existing methods rely on fine-tuning the policy's representations with hand-crafted augmentations. In this work, we propose Self-Consistent Model-based Adaptation (SCMA), a novel method that fosters robust adaptation without modifying the policy. By transferring cluttered observations to clean ones with a denoising model, SCMA can mitigate distractions for various policies as a plug-and-play enhancement. To optimize the denoising model in an unsupervised manner, we derive an unsupervised distribution matching objective with a theoretical analysis of its optimality. We further present a practical algorithm to optimize the objective by estimating the distribution of clean observations with a pre-trained world model. Extensive experiments on multiple visual generalization benchmarks and real robot data demonstrate that SCMA effectively boosts performance across various distractions and exhibits better sample efficiency.

NeurIPS Conference 2025 Conference Paper

ShapeLLM-Omni: A Native Multimodal LLM for 3D Generation and Understanding

  • junliang ye
  • Zhengyi Wang
  • Ruowen Zhao
  • Shenghao Xie
  • Jun Zhu

Recently, the powerful text-to-image capabilities of GPT-4o have led to growing appreciation for native multimodal large language models. However, its multimodal capabilities remain confined to images and text. Yet beyond images, the ability to understand and generate 3D content is equally crucial. To address this gap, we propose ShapeLLM-Omni—a native 3D large language model capable of understanding and generating 3D assets and text in any sequence. First, we train a 3D vector-quantized variational autoencoder (VQVAE), which maps 3D objects into a discrete latent space to achieve efficient and accurate shape representation and reconstruction. Building upon the 3D-aware discrete tokens, we innovatively construct a large-scale continuous training dataset named 3D-Alpaca, encompassing generation, comprehension, and editing, thus providing rich resources for future research and training. Finally, we perform instruction-based fine-tuning of the Qwen-2.5-vl-7B-Instruct model on the 3D-Alpaca dataset, equipping it with native 3D understanding and generation capabilities. Our work represents an effective step toward extending multimodal large language models with fundamental 3D intelligence, paving the way for future advances in 3D-native AI.

ICRA Conference 2024 Conference Paper

A Large-area Tactile Sensor for Distributed Force Sensing Using Highly Sensitive Piezoresistive Sponge

  • Wendong Zheng
  • Kun Liu
  • Di Guo 0002
  • Wuqiang Yang
  • Jun Zhu
  • Huaping Liu 0001

Tactile sensing plays a critical role in enabling robots to interact safely with target objects in dynamic and unstructured environments. While various tactile sensors based on different sensing principles or different sensitive materials have been proposed, the development of flexible large-area tactile sensors for robots is still challenging. In this paper, a novel highly sensitive piezoresistive sponge based on multi-walled carbon nanotubes (MWCNTs) and polyurethane (PU) sponge is fabricated for pressure sensing. The sensing behavior of the piezoresistive sponge was experimentally evaluated, showing high sensitivity and fast response. Based on the piezoresistive sponge, a flexible large-area tactile sensor is designed for distributed force detection with electrical resistance tomography technology. The sensing performance of the sensor is validated by touch location, sensitivity analysis, real-time touch discrimination, and touch modality recognition. The experimental results indicate that the sensor performs well in detecting the position and force of contact in a large area. The sensor’s performance shows promise in embodied tactile sensing and human–robot interaction.

NeurIPS Conference 2024 Conference Paper

Aligning Diffusion Behaviors with Q-functions for Efficient Continuous Control

  • Huayu Chen
  • Kaiwen Zheng
  • Hang Su
  • Jun Zhu

Drawing upon recent advances in language model alignment, we formulate offline Reinforcement Learning as a two-stage optimization problem: First pretraining expressive generative policies on reward-free behavior datasets, then finetuning these policies to align with task-specific annotations like Q-values. This strategy allows us to leverage abundant and diverse behavior data to enhance generalization and enable rapid adaptation to downstream tasks using minimal annotations. In particular, we introduce Efficient Diffusion Alignment (EDA) for solving continuous control problems. EDA utilizes diffusion models for behavior modeling. However, unlike previous approaches, we represent diffusion policies as the derivative of a scalar neural network with respect to action inputs. This representation is critical because it enables direct density calculation for diffusion models, making them compatible with existing LLM alignment theories. During policy fine-tuning, we extend preference-based alignment methods like Direct Preference Optimization (DPO) to align diffusion behaviors with continuous Q-functions. Our evaluation on the D4RL benchmark shows that EDA exceeds all baseline methods in overall performance. Notably, EDA maintains about 95\% of performance and still outperforms several baselines given only 1\% of Q-labelled data during fine-tuning.

NeurIPS Conference 2024 Conference Paper

C-GAIL: Stabilizing Generative Adversarial Imitation Learning with Control Theory

  • Tianjiao Luo
  • Tim Pearce
  • Huayu Chen
  • Jianfei Chen
  • Jun Zhu

Generative Adversarial Imitation Learning (GAIL) provides a promising approach to training a generative policy to imitate a demonstrator. It uses on-policy Reinforcement Learning (RL) to optimize a reward signal derived from an adversarial discriminator. However, optimizing GAIL is difficult in practice, with the loss oscillating during training, slowing convergence. This optimization instability can prevent GAIL from finding a good policy, harming its final performance. In this paper, we study GAIL’s optimization from a control-theoretic perspective. We show that GAIL cannot converge to the desired equilibrium. In response, we analyze the training dynamics of GAIL in function space and design a novel controller that not only pushes GAIL to the desired equilibrium but also achieves asymptotic stability in a simplified “one-step” setting. Going from theory to practice, we propose Controlled-GAIL (C-GAIL), which adds a differentiable regularization term on the GAIL objective to stabilize training. Empirically, the C-GAIL regularizer improves the training of various existing GAIL methods, including the popular GAIL-DAC, by speeding up the convergence, reducing the range of oscillation, and matching the expert distribution more closely.

TMLR Journal 2024 Journal Article

Calibrating Deep Ensemble through Functional Variational Inference

  • Zhijie Deng
  • Feng Zhou
  • Jianfei Chen
  • Guoqiang Wu
  • Jun Zhu

Deep Ensemble (DE) is an effective and practical uncertainty quantification approach in deep learning. The uncertainty of DE is usually manifested by the functional inconsistency among the ensemble members, which, yet, originates from unmanageable randomness in the initialization and optimization of neural networks (NNs), and may easily collapse in specific cases. To tackle this issue, we advocate characterizing the functional inconsistency with the empirical covariance of the functions dictated by the ensemble members, and defining a Gaussian process (GP) with it. We perform functional variational inference to tune such a probabilistic model w.r.t. training data and specific prior beliefs. This way, we can explicitly manage the uncertainty of the ensemble of NNs. We further provide strategies to make the training efficient. The proposed approach achieves better uncertainty quantification than DE and its variants across diverse scenarios, while consuming only marginally added training cost compared to standard DE. The code is available at https://github.com/thudzj/DE-GP.

NeurIPS Conference 2024 Conference Paper

Consistency Diffusion Bridge Models

  • Guande He
  • Kaiwen Zheng
  • Jianfei Chen
  • Fan Bao
  • Jun Zhu

Diffusion models (DMs) have become the dominant paradigm of generative modeling in a variety of domains by learning stochastic processes from noise to data. Recently, diffusion denoising bridge models (DDBMs), a new formulation of generative modeling that builds stochastic processes between fixed data endpoints based on a reference diffusion process, have achieved empirical success across tasks with coupled data distributions, such as image-to-image translation. However, DDBM's sampling process typically requires hundreds of network evaluations to achieve decent performance, which may impede their practical deployment due to high computational demands. In this work, inspired by the recent advance of consistency models in DMs, we tackle this problem by learning the consistency function of the probability-flow ordinary differential equation (PF-ODE) of DDBMs, which directly predicts the solution at a starting step given any point on the ODE trajectory. Based on a dedicated general-form ODE solver, we propose two paradigms: consistency bridge distillation and consistency bridge training, which are flexible to apply to DDBMs with broad design choices. Experimental results show that our proposed method could sample $4\times$ to $50\times$ faster than the base DDBM and produce better visual quality given the same step in various tasks with pixel resolution ranging from $64 \times 64$ to $256 \times 256$, as well as supporting downstream tasks such as semantic interpolation in the data space.

AAAI Conference 2024 Conference Paper

DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

  • Wentse Chen
  • Shiyu Huang
  • Yuan Chiang
  • Tim Pearce
  • Wei-Wei Tu
  • Ting Chen
  • Jun Zhu

Most reinforcement learning algorithms seek a single optimal strategy that solves a given task. However, it can often be valuable to learn a diverse set of solutions, for instance, to make an agent's interaction with users more engaging, or to improve the robustness of a policy to an unexpected perturbation. We propose Diversity-Guided Policy Optimization (DGPO), an on-policy algorithm that discovers multiple strategies for solving a given task. Unlike prior work, it achieves this with a shared policy network trained over a single run. Specifically, we design an intrinsic reward based on an information-theoretic diversity objective. Our final objective alternates between enforcing a constraint on the diversity of the strategies and maximizing the extrinsic reward. We solve the constrained optimization problem by casting it as a probabilistic inference task and use policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently discovers diverse strategies in a wide variety of reinforcement learning tasks. Compared to baseline methods, DGPO achieves comparable rewards, while discovering more diverse strategies, and often with better sample efficiency.

NeurIPS Conference 2024 Conference Paper

Diffusion Models are Certifiably Robust Classifiers

  • Huanran Chen
  • Yinpeng Dong
  • Shitong Shao
  • Zhongkai Hao
  • Xiao Yang
  • Hang Su
  • Jun Zhu

Generative learning, recognized for its effective modeling of data distributions, offers inherent advantages in handling out-of-distribution instances, especially for enhancing robustness to adversarial attacks. Among these, diffusion classifiers, utilizing powerful diffusion models, have demonstrated superior empirical robustness. However, a comprehensive theoretical understanding of their robustness is still lacking, raising concerns about their vulnerability to stronger future attacks. In this study, we prove that diffusion classifiers possess $O(1)$ Lipschitzness, and establish their certified robustness, demonstrating their inherent resilience. To achieve non-constant Lipschitzness, thereby obtaining much tighter certified robustness, we generalize diffusion classifiers to classify Gaussian-corrupted data. This involves deriving the evidence lower bounds (ELBOs) for these distributions, approximating the likelihood using the ELBO, and calculating classification probabilities via Bayes' theorem. Experimental results show the superior certified robustness of these Noised Diffusion Classifiers (NDCs). Notably, we achieve over 80\% and 70\% certified robustness on CIFAR-10 under adversarial perturbations with \(\ell_2\) norms less than 0.25 and 0.5, respectively, using a single off-the-shelf diffusion model without any additional data.
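The Bayes step described above is simple once per-class ELBOs are in hand: with a uniform class prior, the ELBO approximations of $\log p(x \mid y)$ turn into class probabilities via a softmax. A minimal sketch with made-up numbers (not from the paper):

```python
import math

def class_probs_from_elbos(elbos):
    """Turn per-class log-likelihood lower bounds into classification
    probabilities. A diffusion classifier approximates log p(x | y) by the
    ELBO for each class y; with a uniform prior, Bayes' theorem reduces to
    a softmax over those ELBOs."""
    m = max(elbos)
    exps = [math.exp(e - m) for e in elbos]  # subtract max for numerical stability
    z = sum(exps)
    return [v / z for v in exps]

# Hypothetical ELBO estimates (in nats) for a 3-class problem:
probs = class_probs_from_elbos([-1052.3, -1050.1, -1061.8])
```

The max-subtraction is essential in practice, since raw ELBOs of image likelihoods are large negative numbers whose direct exponentials underflow.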

ICRA Conference 2024 Conference Paper

i-Octree: A Fast, Lightweight, and Dynamic Octree for Proximity Search

  • Jun Zhu
  • Hongyi Li
  • Zhepeng Wang 0002
  • Shengjie Wang 0002
  • Tao Zhang 0006

Establishing the correspondences between newly acquired points and historically accumulated data (i.e., the map) through nearest neighbor search is crucial in numerous robotic applications. However, static tree data structures are inadequate to handle large and dynamically growing maps in real-time. To address this issue, we present the i-Octree, a dynamic octree data structure that supports both fast nearest neighbor search and real-time dynamic updates, such as point insertion, deletion, and on-tree down-sampling. The i-Octree is built upon a leaf-based octree and has two key features: a local spatially continuous storing strategy that allows for fast access to points while minimizing memory usage, and local on-tree updates that significantly reduce computation time compared to existing static or dynamic tree structures. The experiments show that the i-Octree outperforms contemporary state-of-the-art approaches by achieving, on average, a 19% reduction in runtime on real-world open datasets.

NeurIPS Conference 2024 Conference Paper

Identifying and Solving Conditional Image Leakage in Image-to-Video Diffusion Model

  • Min Zhao
  • Hongzhou Zhu
  • Chendong Xiang
  • Kaiwen Zheng
  • Chongxuan Li
  • Jun Zhu

Diffusion models have made substantial progress in image-to-video generation. However, in this paper, we find that these models tend to generate videos with less motion than expected. We attribute this to an issue called conditional image leakage, where image-to-video diffusion models (I2V-DMs) tend to over-rely on the conditional image at large time steps. We address this challenge from both the inference and training aspects. First, we propose to start the generation process from an earlier time step to avoid the unreliable large time steps of I2V-DMs, as well as an initial noise distribution with optimal analytic expressions (Analytic-Init), obtained by minimizing the KL divergence between it and the actual marginal distribution to bridge the training-inference gap. Second, we design a time-dependent noise distribution (TimeNoise) for the conditional image during training, applying higher noise levels at larger time steps to disrupt it and reduce the model's dependency on it. We validate these general strategies on various I2V-DMs using our collected open-domain image benchmark and the UCF101 dataset. Extensive results show that our methods outperform baselines by producing higher motion scores with lower errors while maintaining image alignment and temporal consistency, thereby yielding superior overall performance and enabling more accurate motion control. The project page: \url{https://cond-image-leak.github.io/}.

NeurIPS Conference 2024 Conference Paper

MultiTrust: A Comprehensive Benchmark Towards Trustworthy Multimodal Large Language Models

  • Yichi Zhang
  • Yao Huang
  • Yitong Sun
  • Chang Liu
  • Zhe Zhao
  • Zhengwei Fang
  • Yifan Wang
  • Huanran Chen

Despite the superior capabilities of Multimodal Large Language Models (MLLMs) across diverse tasks, they still face significant trustworthiness challenges. Yet, current literature on the assessment of trustworthy MLLMs remains limited, lacking a holistic evaluation to offer thorough insights into future improvements. In this work, we establish MultiTrust, the first comprehensive and unified benchmark on the trustworthiness of MLLMs across five primary aspects: truthfulness, safety, robustness, fairness, and privacy. Our benchmark employs a rigorous evaluation strategy that addresses both multimodal risks and cross-modal impacts, encompassing 32 diverse tasks with self-curated datasets. Extensive experiments with 21 modern MLLMs reveal some previously unexplored trustworthiness issues and risks, highlighting the complexities introduced by the multimodality and underscoring the necessity for advanced methodologies to enhance their reliability. For instance, typical proprietary models still struggle with the perception of visually confusing images and are vulnerable to multimodal jailbreaking and adversarial attacks; MLLMs are more inclined to disclose privacy in text and reveal ideological and cultural biases even when paired with irrelevant images in inference, indicating that the multimodality amplifies the internal risks from base LLMs. Additionally, we release a scalable toolbox for standardized trustworthiness research, aiming to facilitate future advancements in this important field. Code and resources are publicly available at: https://multi-trust.github.io/.

NeurIPS Conference 2024 Conference Paper

Noise Contrastive Alignment of Language Models with Explicit Rewards

  • Huayu Chen
  • Guande He
  • Lifan Yuan
  • Ganqu Cui
  • Hang Su
  • Jun Zhu

User intentions are typically formalized as evaluation rewards to be maximized when fine-tuning language models (LMs). Existing alignment methods, such as Direct Preference Optimization (DPO), are mainly tailored for pairwise preference data where rewards are implicitly defined rather than explicitly given. In this paper, we introduce a general framework for LM alignment, leveraging Noise Contrastive Estimation (NCE) to bridge the gap in handling reward datasets explicitly annotated with scalar evaluations. Our framework comprises two parallel algorithms, NCA and InfoNCA, both enabling the direct extraction of an LM policy from reward data as well as preference data. Notably, we show that the DPO loss is a special case of our proposed InfoNCA objective under pairwise preference settings, thereby integrating and extending current alignment theories. By comparing NCA and InfoNCA, we demonstrate that the well-observed decreasing-likelihood trend of DPO/InfoNCA is caused by their focus on adjusting relative likelihood across different responses. In contrast, NCA optimizes the absolute likelihood for each response, thereby effectively preventing the chosen likelihood from decreasing. We evaluate our methods in both reward and preference settings with Mistral-8$\times$7B and 7B models. Experiments suggest that InfoNCA/NCA surpasses various preference baselines when reward datasets are available. We also find NCA significantly outperforms DPO in complex reasoning tasks like math and coding.
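The claimed reduction of InfoNCA to DPO in the pairwise case can be checked numerically with a toy sketch (my own minimal implementation under a one-hot-label assumption, not the authors' code; `log_ratios` stands for the beta-scaled policy/reference log-ratios):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def infonca_loss(log_ratios, labels):
    """Cross-entropy between reward-derived target labels and the
    softmax over the K responses' beta-scaled log-ratios
    beta * (log pi(y|x) - log pi_ref(y|x))."""
    m = max(log_ratios)
    log_z = m + math.log(sum(math.exp(l - m) for l in log_ratios))
    return -sum(t * (l - log_z) for t, l in zip(labels, log_ratios))

def dpo_loss(lr_chosen, lr_rejected):
    """Standard DPO loss for one preference pair."""
    return -math.log(sigmoid(lr_chosen - lr_rejected))

# K = 2 responses with a one-hot label on the chosen one:
# InfoNCA collapses exactly to DPO, as the abstract states.
l_info = infonca_loss([1.3, -0.4], [1.0, 0.0])
l_dpo = dpo_loss(1.3, -0.4)
```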

NeurIPS Conference 2024 Conference Paper

On Mesa-Optimization in Autoregressively Trained Transformers: Emergence and Capability

  • Chenyu Zheng
  • Wei Huang
  • Rongzhen Wang
  • Guoqiang Wu
  • Jun Zhu
  • Chongxuan Li

Autoregressively trained transformers have brought a profound revolution to the world, especially with their in-context learning (ICL) ability to address downstream tasks. Recently, several studies suggest that transformers learn a mesa-optimizer during autoregressive (AR) pretraining to implement ICL. Namely, the forward pass of the trained transformer is equivalent to optimizing an inner objective function in-context. However, whether the practical non-convex training dynamics will converge to the ideal mesa-optimizer is still unclear. Towards filling this gap, we investigate the non-convex dynamics of a one-layer linear causal self-attention model autoregressively trained by gradient flow, where the sequences are generated by an AR process $x_{t+1} = W x_t$. First, under a certain condition on the data distribution, we prove that an autoregressively trained transformer learns $W$ by implementing one step of gradient descent to minimize an ordinary least squares (OLS) problem in-context. It then applies the learned $\widehat{W}$ for next-token prediction, thereby verifying the mesa-optimization hypothesis. Next, under the same data conditions, we explore the capability limitations of the obtained mesa-optimizer. We show that a stronger assumption related to the moments of the data is the necessary and sufficient condition for the learned mesa-optimizer to recover the distribution. Besides, we conduct exploratory analyses beyond the first data condition and prove that, in general, the trained transformer will not perform vanilla gradient descent for the OLS problem. Finally, our simulation results verify the theoretical results.

JMLR Journal 2024 Journal Article

Optimal Learning Policies for Differential Privacy in Multi-armed Bandits

  • Siwei Wang
  • Jun Zhu

This paper studies the multi-armed bandit problem with a requirement of a differential privacy guarantee or a global differential privacy guarantee. We first prove that the lower bound for the extra regret to protect $(\epsilon,\delta)$-global differential privacy is $\Omega({N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T})$ ($N$ is the number of arms and $T$ is the time horizon), which is independent of $T$ for $\delta > 0$ and large enough $T$. Moreover, the lower bound for the extra regret to protect $(\epsilon,\delta)$-differential privacy can be no more than the above bound. This means that, unlike the case $\delta = 0$, it is possible to design algorithms that protect privacy and achieve the same asymptotic regret upper bound as non-private algorithms when $\delta > 0$. We then adapt the Follow the Perturbed Leader (FTPL) framework, and propose learning policies with both Gaussian and Beta perturbation distributions (DP-FTPL-Gauss and DP-FTPL-Beta) to protect $(\epsilon,\delta)$-differential privacy. The analysis shows that they achieve an $O({N\log T\over \Delta_{\min}} + N \min\{{1\over \delta^2}, {1\over \epsilon^2}\log{1\over \delta}\})$ regret upper bound, where $\Delta_{\min}$ is the minimum expected reward gap between the optimal arm and any other arm. We also design a unique perturbation distribution to protect $(\epsilon,\delta)$-differential privacy in the FTPL framework (DP-FTPL-New), which reduces the regret upper bound to $O({N\log T\over \Delta_{\min}} + {N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T})$. We further show that this perturbation distribution can also be used to protect $(\epsilon,\delta)$-global differential privacy, and design a corresponding algorithm, GDP-Elim-New.
We show that its regret upper bound is $O({\Delta_{\max} \over \Delta_{\min}}({N\log T\over \Delta_{\min}} + {N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T}))$. This shows that our $\Omega({N\over \epsilon }\log {(e^{\epsilon} -1)T + \delta T \over (e^{\epsilon}-1) + \delta T})$ regret lower bound is tight (e.g., when ${\Delta_{\max}\over \Delta_{\min}}$ is bounded).
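A minimal sketch of the generic FTPL selection rule the paper builds on, in its Gaussian-perturbation form; the noise scale here is an arbitrary illustration and is not calibrated to any privacy budget:

```python
import random

def ftpl_choose(sums, counts, sigma, rng):
    """Follow the Perturbed Leader: play the arm whose perturbed
    empirical mean is largest. In the paper's DP variants, the
    perturbation distribution (Gaussian, Beta, or a custom one)
    is calibrated to the privacy parameters; here it is generic."""
    best_arm, best_val = 0, float("-inf")
    for arm, (s, c) in enumerate(zip(sums, counts)):
        perturbed_mean = (s + rng.gauss(0.0, sigma)) / c
        if perturbed_mean > best_val:
            best_arm, best_val = arm, perturbed_mean
    return best_arm

rng = random.Random(0)
# Arm 1 has the higher empirical mean (0.8 vs 0.5); with modest
# noise, FTPL selects it in the vast majority of rounds.
picks = [ftpl_choose([5.0, 40.0], [10, 50], sigma=0.5, rng=rng)
         for _ in range(200)]
frac_best = picks.count(1) / len(picks)
```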

NeurIPS Conference 2024 Conference Paper

PEAC: Unsupervised Pre-training for Cross-Embodiment Reinforcement Learning

  • Chengyang Ying
  • Zhongkai Hao
  • Xinning Zhou
  • Xuezhou Xu
  • Hang Su
  • Xingxing Zhang
  • Jun Zhu

Designing generalizable agents capable of adapting to diverse embodiments has attracted significant attention in Reinforcement Learning (RL), which is critical for deploying RL agents in various real-world applications. Previous Cross-Embodiment RL approaches have focused on transferring knowledge across embodiments within specific tasks. These methods often result in knowledge tightly coupled with those tasks and fail to adequately capture the distinct characteristics of different embodiments. To address this limitation, we introduce the notion of Cross-Embodiment Unsupervised RL (CEURL), which leverages unsupervised learning to enable agents to acquire embodiment-aware and task-agnostic knowledge through online interactions within reward-free environments. We formulate CEURL as a novel Controlled Embodiment Markov Decision Process (CE-MDP) and systematically analyze CEURL's pre-training objectives under CE-MDP. Based on these analyses, we develop a novel algorithm, Pre-trained Embodiment-Aware Control (PEAC), for handling CEURL, incorporating an intrinsic reward function specifically designed for cross-embodiment pre-training. PEAC not only provides an intuitive optimization strategy for cross-embodiment pre-training but also integrates flexibly with existing unsupervised RL methods, facilitating cross-embodiment exploration and skill discovery. Extensive experiments in both simulated (e.g., DMC and Robosuite) and real-world environments (e.g., legged locomotion) demonstrate that PEAC significantly improves adaptation performance and cross-embodiment generalization, demonstrating its effectiveness in overcoming the unique challenges of CEURL. The project page and code are at https://yingchengyang.github.io/ceurl.

NeurIPS Conference 2024 Conference Paper

PINNacle: A Comprehensive Benchmark of Physics-Informed Neural Networks for Solving PDEs

  • Zhongkai Hao
  • Jiachen Yao
  • Chang Su
  • Hang Su
  • Ziao Wang
  • Fanzhi Lu
  • Zeyu Xia
  • Yichi Zhang

While significant progress has been made on Physics-Informed Neural Networks (PINNs), a comprehensive comparison of these methods across a wide range of Partial Differential Equations (PDEs) is still lacking. This study introduces PINNacle, a benchmarking tool designed to fill this gap. PINNacle provides a diverse dataset, comprising over 20 distinct PDEs from various domains, including heat conduction, fluid dynamics, biology, and electromagnetics. These PDEs encapsulate key challenges inherent to real-world problems, such as complex geometry, multi-scale phenomena, nonlinearity, and high dimensionality. PINNacle also offers a user-friendly toolbox, incorporating about 10 state-of-the-art PINN methods for systematic evaluation and comparison. We have conducted extensive experiments with these methods, offering insights into their strengths and weaknesses. In addition to providing a standardized means of assessing performance, PINNacle also offers an in-depth analysis to guide future research, particularly in areas such as domain decomposition methods and loss reweighting for handling multi-scale problems and complex geometry. To the best of our knowledge, it is the largest benchmark with a diverse and comprehensive evaluation that will undoubtedly foster further research in PINNs.

NeurIPS Conference 2024 Conference Paper

S-STE: Continuous Pruning Function for Efficient 2:4 Sparse Pre-training

  • Yuezhou Hu
  • Jun Zhu
  • Jianfei Chen

Training deep neural networks (DNNs) is costly. Fortunately, Nvidia Ampere and Hopper GPUs can execute matrix multiplications twice as fast as their dense equivalents by implementing 2:4 sparsity. However, previous STE-based 2:4 pre-training methods (e.g., STE with hard-thresholding, SR-STE) suffer from optimization difficulties because of their discontinuous pruning functions. In this study, we comprehensively analyze the bottleneck of traditional N:M sparse training and recognize three drawbacks of discontinuity: an incorrect descent direction, an inability to predict the amount of descent, and sparse mask oscillation. In light of these findings, we propose S-STE, a simple yet powerful 2:4 training method with two parts: continuously projecting weights to be 2:4 sparse, and rescaling the sparse weights with a per-tensor fixed scaling factor. In addition, we adopt minimum-variance unbiased estimation for the activation gradient and FP8 quantization for the whole process. Results show that our method surpasses previous 2:4 pre-training recipes and is comparable even with full-parameter models.
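For contrast with the continuous projection S-STE proposes, the traditional hard-thresholding 2:4 step plus a least-squares per-tensor rescale can be sketched as follows (a baseline illustration under my own naming, not the paper's method):

```python
def prune_2_4(weights):
    """Hard-threshold 2:4 pruning: in every group of 4 consecutive
    weights, keep the 2 largest in magnitude and zero the rest.
    This is the discontinuous step that S-STE replaces with a
    continuous projection."""
    pruned = list(weights)
    for g in range(0, len(weights), 4):
        group = list(range(g, min(g + 4, len(weights))))
        keep = set(sorted(group, key=lambda i: abs(weights[i]),
                          reverse=True)[:2])
        for i in group:
            if i not in keep:
                pruned[i] = 0.0
    return pruned

def rescale_factor(dense, sparse):
    """Per-tensor scale beta = <w, w_s> / <w_s, w_s>, the least-squares
    minimizer of ||w - beta * w_s||^2. For a pure mask beta is 1; it
    only matters once the projection distorts the kept magnitudes."""
    num = sum(a * b for a, b in zip(dense, sparse))
    den = sum(b * b for b in sparse)
    return num / den

w = [0.9, -0.1, 0.05, -1.2, 0.3, 0.31, -0.29, 0.02]
ws = prune_2_4(w)
beta = rescale_factor(w, ws)
```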

NeurIPS Conference 2024 Conference Paper

T2VSafetyBench: Evaluating the Safety of Text-to-Video Generative Models

  • Yibo Miao
  • Yifan Zhu
  • Lijia Yu
  • Jun Zhu
  • Xiao-Shan Gao
  • Yinpeng Dong

The recent development of Sora leads to a new era in text-to-video (T2V) generation. Along with this comes the rising concern about its safety risks. The generated videos may contain illegal or unethical content, and there is a lack of comprehensive quantitative understanding of their safety, posing a challenge to their reliability and practical deployment. Previous evaluations primarily focus on the quality of video generation. While some evaluations of text-to-image models have considered safety, they cover limited aspects and do not address the unique temporal risk inherent in video generation. To bridge this research gap, we introduce T2VSafetyBench, the first comprehensive benchmark for conducting safety-critical assessments of text-to-video models. We define 4 primary categories with 14 critical aspects of video generation safety and construct a malicious prompt dataset including real-world prompts, LLM-generated prompts, and jailbreak attack-based prompts. We then conduct a thorough safety evaluation on 9 recently released T2V models. Based on our evaluation results, we draw several important findings, including: 1) no single model excels in all aspects, with different models showing various strengths; 2) the correlation between GPT-4 assessments and manual reviews is generally high; 3) there is a trade-off between the usability and safety of text-to-video generative models. This indicates that as the field of video generation rapidly advances, safety risks are set to surge, highlighting the urgency of prioritizing video safety. We hope that T2VSafetyBench can provide insights for better understanding the safety of video generation in the era of generative AIs. Our code is publicly available at \url{https://github.com/yibo-miao/T2VSafetyBench}.

NeurIPS Conference 2024 Conference Paper

Vidu4D: Single Generated Video to High-Fidelity 4D Reconstruction with Dynamic Gaussian Surfels

  • Yikai Wang
  • Xinzhou Wang
  • Zilong Chen
  • Zhengyi Wang
  • Fuchun Sun
  • Jun Zhu

Video generative models are receiving particular attention given their ability to generate realistic and imaginative frames. Besides, these models are also observed to exhibit strong 3D consistency, significantly enhancing their potential to act as world simulators. In this work, we present Vidu4D, a novel reconstruction model that excels in accurately reconstructing 4D (i.e., sequential 3D) representations from single generated videos, addressing challenges associated with non-rigidity and frame distortion. This capability is pivotal for creating high-fidelity virtual contents that maintain both spatial and temporal coherence. At the core of Vidu4D is our proposed Dynamic Gaussian Surfels (DGS) technique. DGS optimizes time-varying warping functions to transform Gaussian surfels (surface elements) from a static state to a dynamically warped state. This transformation enables a precise depiction of motion and deformation over time. To preserve the structural integrity of surface-aligned Gaussian surfels, we design the warped-state geometric regularization based on continuous warping fields for estimating normals. Additionally, we learn refinements on rotation and scaling parameters of Gaussian surfels, which greatly alleviates texture flickering during the warping process and enhances the capture of fine-grained appearance details. Vidu4D also contains a novel initialization state that provides a proper start for the warping fields in DGS. Equipping Vidu4D with an existing video generative model, the overall framework demonstrates high-fidelity text-to-4D generation in both appearance and geometry.

JMLR Journal 2024 Journal Article

Virtual-Event-Based Posterior Sampling and Inference for Neyman-Scott Processes

  • Chengkuan Hong
  • Christian R. Shelton
  • Jun Zhu

Neyman-Scott processes (NSPs) are a class of Cox processes constructed by stacking layers of Poisson processes into a deep structure. While much research has been conducted on posterior sampling and inference for NSPs, most existing methods only work for shallow NSPs (i.e., NSPs with one layer of latent Poisson processes). In this paper, we present virtual-event-based posterior sampling and inference algorithms for NSPs. The algorithms work for both deep and shallow NSPs. Moreover, we show that deep NSPs can be viewed as branching processes or as a limiting case of probabilistic graphical models. We conduct a theoretical analysis of the convergence of our algorithms and provide the condition for the convergence to hold. In doing so, we also prove the convergence of virtual-event-based sampling inference algorithms for other point process models with missing information (Markov jump processes, piecewise-constant intensity models, and Hawkes processes). Like NSPs, the latent variables of these models with missing information are also point processes. Our experimental results demonstrate that prediction based on our sampling and inference algorithms for NSPs achieves good performance compared with state-of-the-art methods.

AAAI Conference 2023 Conference Paper

Certifiable Out-of-Distribution Generalization

  • Nanyang Ye
  • Lin Zhu
  • Jia Wang
  • Zhaoyu Zeng
  • Jiayao Shao
  • Chensheng Peng
  • Bikang Pan
  • Kaican Li

Machine learning methods suffer from test-time performance degeneration when faced with out-of-distribution (OoD) data whose distribution is not necessarily the same as training data distribution. Although a plethora of algorithms have been proposed to mitigate this issue, it has been demonstrated that achieving better performance than ERM simultaneously on different types of distributional shift datasets is challenging for existing approaches. Besides, it is unknown how and to what extent these methods work on any OoD datum without theoretical guarantees. In this paper, we propose a certifiable out-of-distribution generalization method that provides provable OoD generalization performance guarantees via a functional optimization framework leveraging random distributions and max-margin learning for each input datum. With this approach, the proposed algorithmic scheme can provide certified accuracy for each input datum's prediction on the semantic space and achieves better performance simultaneously on OoD datasets dominated by correlation shifts or diversity shifts. Our code is available at https://github.com/ZlatanWilliams/StochasticDisturbanceLearning.

AAMAS Conference 2023 Conference Paper

DGPO: Discovering Multiple Strategies with Diversity-Guided Policy Optimization

  • Wenze Chen
  • Shiyu Huang
  • Yuan Chiang
  • Ting Chen
  • Jun Zhu

Recent algorithms designed for reinforcement learning tasks focus on finding a single optimal solution. However, in many practical applications, it is important to develop reasonable agents with diverse strategies. In this paper, we propose Diversity-Guided Policy Optimization, an on-policy framework for discovering multiple strategies for the same task. Our algorithm uses diversity objectives to guide a latent-code-conditioned policy to learn a set of diverse strategies in a single training procedure. Experimental results show that our method efficiently finds diverse strategies in a wide variety of reinforcement learning tasks. We further show that, compared to other baselines, DGPO achieves similar performance while attaining a higher diversity score or better sample efficiency.

NeurIPS Conference 2023 Conference Paper

Diffusion Models and Semi-Supervised Learners Benefit Mutually with Few Labels

  • Zebin You
  • Yong Zhong
  • Fan Bao
  • Jiacheng Sun
  • Chongxuan Li
  • Jun Zhu

In an effort to further advance semi-supervised generative and classification tasks, we propose a simple yet effective training strategy called *dual pseudo training* (DPT), built upon strong semi-supervised learners and diffusion models. DPT operates in three stages: training a classifier on partially labeled data to predict pseudo-labels; training a conditional generative model using these pseudo-labels to generate pseudo images; and retraining the classifier with a mix of real and pseudo images. Empirically, DPT consistently achieves SOTA performance of semi-supervised generation and classification across various settings. In particular, with one or two labels per class, DPT achieves a Fréchet Inception Distance (FID) score of 3.08 or 2.52 on ImageNet $256\times256$. Besides, DPT outperforms competitive semi-supervised baselines substantially on ImageNet classification tasks, *achieving top-1 accuracies of 59.0 (+2.8), 69.5 (+3.0), and 74.4 (+2.0)* with one, two, or five labels per class, respectively. Notably, our results demonstrate that diffusion can generate realistic images with only a few labels (e.g., $<0.1$%) and generative augmentation remains viable for semi-supervised classification. Our code is available at *https://github.com/ML-GSAI/DPT*.

NeurIPS Conference 2023 Conference Paper

DPM-Solver-v3: Improved Diffusion ODE Solver with Empirical Model Statistics

  • Kaiwen Zheng
  • Cheng Lu
  • Jianfei Chen
  • Jun Zhu

Diffusion probabilistic models (DPMs) have exhibited excellent performance for high-fidelity image generation while suffering from inefficient sampling. Recent works accelerate the sampling procedure by proposing fast ODE solvers that leverage the specific ODE form of DPMs. However, they highly rely on specific parameterization during inference (such as noise/data prediction), which might not be the optimal choice. In this work, we propose a novel formulation towards the optimal parameterization during sampling that minimizes the first-order discretization error of the ODE solution. Based on such formulation, we propose \textit{DPM-Solver-v3}, a new fast ODE solver for DPMs by introducing several coefficients efficiently computed on the pretrained model, which we call \textit{empirical model statistics}. We further incorporate multistep methods and a predictor-corrector framework, and propose some techniques for improving sample quality at small numbers of function evaluations (NFE) or large guidance scales. Experiments show that DPM-Solver-v3 achieves consistently better or comparable performance in both unconditional and conditional sampling with both pixel-space and latent-space DPMs, especially in 5$\sim$10 NFEs. We achieve FIDs of 12.21 (5 NFE), 2.51 (10 NFE) on unconditional CIFAR10, and MSE of 0.55 (5 NFE, 7.5 guidance scale) on Stable Diffusion, bringing a speed-up of 15\%$\sim$30\% compared to previous state-of-the-art training-free methods. Code is available at \url{https://github.com/thu-ml/DPM-Solver-v3}.

AAAI Conference 2023 Conference Paper

DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding

  • Shilong Liu
  • Shijia Huang
  • Feng Li
  • Hao Zhang
  • Yaoyuan Liang
  • Hang Su
  • Jun Zhu
  • Lei Zhang

In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from text and locate objects from image simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features from image and text for object prediction and phrase mask prediction. Each pair of dual queries is designed to have shared positional parts but different content parts. Such a design effectively alleviates the difficulty of modality alignment between image and text (in contrast to a single query design) and empowers the Transformer decoder to leverage phrase mask-guided attention to improve the performance. To evaluate the performance of PEG, we also propose a new metric CMAP (cross-modal average precision), analogous to the AP metric in object detection. The new metric overcomes the ambiguity of Recall@1 in many-box-to-one-phrase cases in phrase grounding. As a result, our PEG pre-trained DQ-DETR establishes new state-of-the-art results on all visual grounding benchmarks with a ResNet-101 backbone. For example, it achieves 91.04% and 83.51% in terms of recall rate on RefCOCO testA and testB with a ResNet-101 backbone.

NeurIPS Conference 2023 Conference Paper

Hierarchical Decomposition of Prompt-Based Continual Learning: Rethinking Obscured Sub-optimality

  • Liyuan Wang
  • Jingyi Xie
  • Xingxing Zhang
  • Mingyi Huang
  • Hang Su
  • Jun Zhu

Prompt-based continual learning is an emerging direction in leveraging pre-trained knowledge for downstream continual learning, and has almost reached the performance pinnacle under supervised pre-training. However, our empirical research reveals that the current strategies fall short of their full potential under the more realistic self-supervised pre-training, which is essential for handling vast quantities of unlabeled data in practice. This is largely due to the difficulty of task-specific knowledge being incorporated into instructed representations via prompt parameters and predicted by uninstructed representations at test time. To overcome the exposed sub-optimality, we conduct a theoretical analysis of the continual learning objective in the context of pre-training, and decompose it into hierarchical components: within-task prediction, task-identity inference, and task-adaptive prediction. Following these empirical and theoretical insights, we propose Hierarchical Decomposition (HiDe-)Prompt, an innovative approach that explicitly optimizes the hierarchical components with an ensemble of task-specific prompts and statistics of both uninstructed and instructed representations, further with the coordination of a contrastive regularization strategy. Our extensive experiments demonstrate the superior performance of HiDe-Prompt and its robustness to pre-training paradigms in continual learning (e.g., up to 15.01% and 9.61% lead on Split CIFAR-100 and Split ImageNet-R, respectively).

NeurIPS Conference 2023 Conference Paper

Learning Sample Difficulty from Pre-trained Models for Reliable Prediction

  • Peng Cui
  • Dan Zhang
  • Zhijie Deng
  • Yinpeng Dong
  • Jun Zhu

Large-scale pre-trained models have achieved remarkable success in many applications, but how to leverage them to improve the prediction reliability of downstream models remains largely under-explored. Moreover, modern neural networks have been found to be poorly calibrated and to make overconfident predictions regardless of inherent sample difficulty and data uncertainty. To address this issue, we propose to utilize large-scale pre-trained models to guide downstream model training with sample-difficulty-aware entropy regularization. Pre-trained models that have been exposed to large-scale datasets and do not overfit the downstream training classes enable us to measure each training sample's difficulty via feature-space Gaussian modeling and relative Mahalanobis distance computation. Importantly, by adaptively penalizing overconfident predictions based on sample difficulty, we simultaneously improve accuracy and uncertainty calibration across challenging benchmarks (e.g., +0.55% ACC and −3.7% ECE on ImageNet1k using ResNet34), consistently surpassing competitive baselines for reliable prediction. The improved uncertainty estimate further improves selective classification (abstaining from erroneous predictions) and out-of-distribution detection.
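The feature-space difficulty score the abstract outlines (Mahalanobis distance to a class Gaussian, relative to a background Gaussian) can be sketched on toy 2-D features; diagonal covariance is used for brevity and the function names are hypothetical:

```python
def fit_gaussian(samples):
    """Per-dimension mean and variance (diagonal covariance),
    with a small floor on the variance for stability."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[j] for s in samples) / n for j in range(d)]
    var = [sum((s[j] - mean[j]) ** 2 for s in samples) / n + 1e-6
           for j in range(d)]
    return mean, var

def mahalanobis(x, mean, var):
    return sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean, var))

def relative_mahalanobis(x, class_stats, background_stats):
    """RMD = distance to the class Gaussian minus distance to a
    class-agnostic background Gaussian; lower means the sample is
    more typical ('easier') for that class."""
    return (mahalanobis(x, *class_stats)
            - mahalanobis(x, *background_stats))

# One tight class cluster plus a broad background fit over all data.
cls = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 1.0]]
bg = cls + [[-1.0, -1.0], [-0.9, -1.1], [-1.1, -0.9], [-1.0, -1.0]]
cls_stats, bg_stats = fit_gaussian(cls), fit_gaussian(bg)
easy = relative_mahalanobis([1.0, 1.0], cls_stats, bg_stats)
hard = relative_mahalanobis([0.0, 0.0], cls_stats, bg_stats)
```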

NeurIPS Conference 2023 Conference Paper

Memory Efficient Optimizers with 4-bit States

  • Bingrui Li
  • Jianfei Chen
  • Jun Zhu

Optimizer states are a major source of memory consumption when training neural networks, limiting the largest trainable model within a given memory budget. Compressing the optimizer states from 32-bit floating point to lower bitwidths is a promising way to reduce the training memory footprint, while the current lowest achievable bitwidth is 8 bits. In this work, we push the optimizer state bitwidth down to 4 bits through a detailed empirical analysis of first and second moments. Specifically, we find that moments have complicated outlier patterns that current block-wise quantization cannot accurately approximate. We use a smaller block size and propose to utilize both row-wise and column-wise information for better quantization. We further identify a zero-point problem in quantizing the second moment, and solve it with a linear quantizer that excludes the zero point. Our 4-bit optimizers are evaluated on a wide variety of benchmarks including natural language understanding, machine translation, image classification, and instruction tuning. On all tasks our optimizers achieve comparable accuracy to their full-precision counterparts, while enjoying better memory efficiency.
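The block-wise absmax baseline that the paper refines can be sketched as follows (a generic 4-bit illustration, not the authors' row/column-aware quantizer or their zero-point-free quantizer):

```python
def quantize_block(values, bits=4):
    """Block-wise absmax quantization to signed integer codes.
    Returns (codes, scale); dequantize as code * scale."""
    levels = 2 ** (bits - 1) - 1          # 7 levels for 4-bit signed
    scale = max(abs(v) for v in values) / levels or 1.0
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize_block(codes, scale):
    return [c * scale for c in codes]

# Adam's second moments cluster near zero; a quantizer whose grid
# includes 0 collapses small-but-nonzero values to exactly 0, which
# blows up the 1/sqrt(v) update. That is the zero-point problem the
# paper solves by excluding the zero point from the linear quantizer.
block = [0.5, -0.25, 0.1, -0.03]
codes, scale = quantize_block(block)
recon = dequantize_block(codes, scale)
max_err = max(abs(a - b) for a, b in zip(block, recon))
```

The reconstruction error of absmax quantization is bounded by half a quantization step per element, which the assertion below checks.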

IJCAI Conference 2023 Conference Paper

On the Reuse Bias in Off-Policy Reinforcement Learning

  • Chengyang Ying
  • Zhongkai Hao
  • Xinning Zhou
  • Hang Su
  • Dong Yan
  • Jun Zhu

Importance sampling (IS) is a popular technique in off-policy evaluation, which re-weights the return of trajectories in the replay buffer to boost sample efficiency. However, training with IS can be unstable and previous attempts to address this issue mainly focus on analyzing the variance of IS. In this paper, we reveal that the instability is also related to a new notion of Reuse Bias of IS --- the bias in off-policy evaluation caused by the reuse of the replay buffer for evaluation and optimization. We theoretically show that the off-policy evaluation and optimization of the current policy with the data from the replay buffer result in an overestimation of the objective, which may cause an erroneous gradient update and degenerate the performance. We further provide a high-probability upper bound of the Reuse Bias and show that controlling one term of the upper bound can control the Reuse Bias by introducing the concept of stability for off-policy algorithms. Based on these analyses, we present a novel yet simple Bias-Regularized Importance Sampling (BIRIS) framework along with practical algorithms, which can alleviate the negative impact of the Reuse Bias, and show that our BIRIS can significantly reduce the Reuse Bias empirically. Moreover, extensive experimental results show that our BIRIS-based methods can significantly improve the sample efficiency on a series of continuous control tasks in MuJoCo.
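The ordinary importance-sampling estimator that this paper builds on can be sketched as follows. The Gaussian behavior/target pair is a toy example; the Reuse Bias itself arises when the same buffer of samples is used both to evaluate and to optimize the policy, which this unbiased one-shot estimate does not do:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_estimate(xs, logp_target, logp_behavior, f):
    # Importance-sampling estimate of E_target[f(x)] from samples drawn
    # under the behavior distribution: re-weight each sample by the
    # density ratio p_target / p_behavior.
    w = np.exp(logp_target(xs) - logp_behavior(xs))
    return float(np.mean(w * f(xs)))

# Toy check: behavior N(0, 1), target N(0.5, 1); true E_target[x] = 0.5.
xs = rng.normal(0.0, 1.0, size=200_000)
logp_b = lambda x: -0.5 * x ** 2           # log densities up to the same
logp_t = lambda x: -0.5 * (x - 0.5) ** 2   # additive constant
est = is_estimate(xs, logp_t, logp_b, lambda x: x)
```

Reusing one fixed `xs` to both estimate this objective and pick the policy that maximizes it is what produces the overestimation the paper analyzes.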

NeurIPS Conference 2023 Conference Paper

Overcoming Recency Bias of Normalization Statistics in Continual Learning: Balance and Adaptation

  • Yilin Lyu
  • Liyuan Wang
  • Xingxing Zhang
  • Zicheng Sun
  • Hang Su
  • Jun Zhu
  • Liping Jing

Continual learning entails learning a sequence of tasks and balancing their knowledge appropriately. With limited access to old training samples, much of the current work in deep neural networks has focused on overcoming catastrophic forgetting of old tasks in gradient-based optimization. However, the normalization layers provide an exception, as they are updated interdependently by the gradient and statistics of currently observed training samples, which require specialized strategies to mitigate recency bias. In this work, we focus on the most popular Batch Normalization (BN) and provide an in-depth theoretical analysis of its sub-optimality in continual learning. Our analysis demonstrates the dilemma between balance and adaptation of BN statistics for incremental tasks, which potentially affects training stability and generalization. Targeting these particular challenges, we propose Adaptive Balance of BN (AdaB$^2$N), which incorporates appropriately a Bayesian-based strategy to adapt task-wise contributions and a modified momentum to balance BN statistics, corresponding to the training and testing stages. By implementing BN in a continual learning fashion, our approach achieves significant performance gains across a wide range of benchmarks, particularly for the challenging yet realistic online scenarios (e.g., up to 7.68\%, 6.86\% and 4.26\% on Split CIFAR-10, Split CIFAR-100 and Split Mini-ImageNet, respectively). Our code is available at https://github.com/lvyilin/AdaB2N.
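For reference, this is the standard BN running-statistics update whose recency bias the paper analyzes; AdaB$^2$N's Bayesian task weighting and modified momentum replace the fixed `momentum` below, which is an illustrative simplification:

```python
import numpy as np

def update_bn_stats(running_mean, running_var, batch, momentum=0.1):
    # Standard BN exponential-moving-average update.  Because recent
    # batches dominate, the statistics drift toward the newest task --
    # the recency bias targeted by the paper.
    m, v = batch.mean(axis=0), batch.var(axis=0)
    return ((1 - momentum) * running_mean + momentum * m,
            (1 - momentum) * running_var + momentum * v)
```

With `momentum=0.1`, a single batch moves the running mean only 10% of the way toward the batch mean, but a long run of same-task batches eventually overwrites statistics of earlier tasks.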

NeurIPS Conference 2023 Conference Paper

ProlificDreamer: High-Fidelity and Diverse Text-to-3D Generation with Variational Score Distillation

  • Zhengyi Wang
  • Cheng Lu
  • Yikai Wang
  • Fan Bao
  • Chongxuan Li
  • Hang Su
  • Jun Zhu

Score distillation sampling (SDS) has shown great promise in text-to-3D generation by distilling pretrained large-scale text-to-image diffusion models, but suffers from over-saturation, over-smoothing, and low-diversity problems. In this work, we propose to model the 3D parameter as a random variable instead of a constant as in SDS and present *variational score distillation* (VSD), a principled particle-based variational framework to explain and address the aforementioned issues in text-to-3D generation. We show that SDS is a special case of VSD and leads to poor samples with both small and large CFG weights. In comparison, VSD works well with various CFG weights as ancestral sampling from diffusion models and simultaneously improves the diversity and sample quality with a common CFG weight (i. e. , 7. 5). We further present various improvements in the design space for text-to-3D such as distillation time schedule and density initialization, which are orthogonal to the distillation algorithm yet not well explored. Our overall approach, dubbed *ProlificDreamer*, can generate high rendering resolution (i. e. , 512$\times$512) and high-fidelity NeRF with rich structure and complex effects (e. g. , smoke and drops). Further, initialized from NeRF, meshes fine-tuned by VSD are meticulously detailed and photo-realistic.

NeurIPS Conference 2023 Conference Paper

Towards Accelerated Model Training via Bayesian Data Selection

  • Zhijie Deng
  • Peng Cui
  • Jun Zhu

Mislabeled, duplicated, or biased data in real-world scenarios can lead to prolonged training and even hinder model convergence. Traditional solutions prioritizing easy or hard samples lack the flexibility to handle such a variety simultaneously. Recent work has proposed a more reasonable data selection principle by examining the data's impact on the model's generalization loss. However, its practical adoption relies on less principled approximations and additional holdout data. This work solves these problems by leveraging a lightweight Bayesian treatment and incorporating off-the-shelf zero-shot predictors built on large-scale pre-trained models. The resulting algorithm is efficient and easy to implement. We perform extensive empirical studies on challenging benchmarks with considerable data noise and imbalance in the online batch selection scenario, and observe superior training efficiency over competitive baselines. Notably, on the challenging WebVision benchmark, our method can achieve similar predictive performance with significantly fewer training iterations than leading data selection methods.
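One way zero-shot-assisted online batch selection could look in code; the scoring rule below is a hypothetical proxy for illustration, not the paper's Bayesian criterion:

```python
import numpy as np

def select_batch(model_losses, zero_shot_losses, k):
    # Hypothetical scoring sketch: prefer samples the current model still
    # finds hard but an off-the-shelf zero-shot predictor finds easy --
    # a cheap proxy for "informative and probably not mislabeled".
    score = np.asarray(model_losses) - np.asarray(zero_shot_losses)
    return np.argsort(score)[::-1][:k]   # indices of the top-k samples
```

Samples that both models find hard (likely noise) and samples both find easy (already learned) score low; the selected batch concentrates training on the remainder.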

NeurIPS Conference 2023 Conference Paper

Training Transformers with 4-bit Integers

  • Haocheng Xi
  • ChangHao Li
  • Jianfei Chen
  • Jun Zhu

Quantizing the activation, weight, and gradient to 4-bit is promising to accelerate neural network training. However, existing 4-bit training methods require custom numerical formats which are not supported by contemporary hardware. In this work, we propose a training method for transformers with all matrix multiplications implemented with the INT4 arithmetic. Training with an ultra-low INT4 precision is challenging. To achieve this, we carefully analyze the specific structures of activation and gradients in transformers to propose dedicated quantizers for them. For forward propagation, we identify the challenge of outliers and propose a Hadamard quantizer to suppress the outliers. For backpropagation, we leverage the structural sparsity of gradients by proposing bit splitting and leverage score sampling techniques to quantize gradients accurately. Our algorithm achieves competitive accuracy on a wide range of tasks including natural language understanding, machine translation, and image classification. Unlike previous 4-bit training methods, our algorithm can be implemented on the current generation of GPUs. Our prototypical linear operator implementation is up to 2.2 times faster than the FP16 counterparts and speeds up the training by 17.8\% on average for sufficiently large models. Our code is available at https://github.com/xijiu9/Train_Transformers_with_INT4.
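The Hadamard-quantizer idea for forward propagation can be sketched as follows: rotating by an orthonormal Hadamard matrix spreads a single outlier across all coordinates before INT4 rounding. Function names and the single per-tensor scale are illustrative assumptions:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction of an n x n Hadamard matrix (n a power of 2).
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

def hadamard_int4(x):
    # Rotate by a scaled (orthonormal) Hadamard matrix to spread outliers
    # across all coordinates, then round to the 16 INT4 levels [-8, 7].
    n = x.shape[-1]
    H = hadamard(n) / np.sqrt(n)          # orthonormal, hence invertible
    y = x @ H
    scale = np.abs(y).max() / 7.0
    q = np.clip(np.round(y / scale), -8, 7).astype(np.int8)
    return q, scale, H

def dequantize_int4(q, scale, H):
    # H is orthogonal, so its inverse is its transpose.
    return (q * scale) @ H.T
```

Without the rotation, a single large entry would force a huge scale and crush every other coordinate to zero; after the rotation, the magnitudes are nearly uniform and round-tripping is accurate.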

NeurIPS Conference 2022 Conference Paper

A Unified Hard-Constraint Framework for Solving Geometrically Complex PDEs

  • Songming Liu
  • Hao Zhongkai
  • Chengyang Ying
  • Hang Su
  • Jun Zhu
  • Ze Cheng

We present a unified hard-constraint framework for solving geometrically complex PDEs with neural networks, where the most commonly used Dirichlet, Neumann, and Robin boundary conditions (BCs) are considered. Specifically, we first introduce the "extra fields'' from the mixed finite element method to reformulate the PDEs so as to equivalently transform the three types of BCs into linear forms. Based on the reformulation, we derive the general solutions of the BCs analytically, which are employed to construct an ansatz that automatically satisfies the BCs. With such a framework, we can train the neural networks without adding extra loss terms and thus efficiently handle geometrically complex PDEs, alleviating the unbalanced competition between the loss terms corresponding to the BCs and PDEs. We theoretically demonstrate that the "extra fields'' can stabilize the training process. Experimental results on real-world geometrically complex PDEs showcase the effectiveness of our method compared with state-of-the-art baselines.
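For the classic Dirichlet special case, a hard-constraint ansatz can be sketched as below; `g` and `ell` are illustrative choices, and the paper's "extra fields" reformulation that also covers Neumann and Robin BCs is not shown:

```python
import numpy as np

def hard_constrained_u(x, net, g, ell):
    # Ansatz u(x) = g(x) + ell(x) * net(x): because ell vanishes on the
    # boundary, u satisfies the Dirichlet condition u|_boundary = g by
    # construction, so no boundary loss term is needed during training.
    return g(x) + ell(x) * net(x)

# 1-D example on [0, 1] with u(0) = 0 and u(1) = 2:
g = lambda x: 2 * x            # any smooth function matching the BC values
ell = lambda x: x * (1 - x)    # vanishes exactly on the boundary {0, 1}
net = lambda x: np.sin(3 * x)  # stand-in for a trained neural network
```

Whatever the network outputs, the boundary values are exact, which removes the unbalanced competition between boundary and PDE loss terms that the abstract mentions.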

NeurIPS Conference 2022 Conference Paper

Accelerated Linearized Laplace Approximation for Bayesian Deep Learning

  • Zhijie Deng
  • Feng Zhou
  • Jun Zhu

Laplace approximation (LA) and its linearized variant (LLA) enable effortless adaptation of pretrained deep neural networks to Bayesian neural networks. The generalized Gauss-Newton (GGN) approximation is typically introduced to improve their tractability. However, LA and LLA are still confronted with non-trivial inefficiency issues and should rely on Kronecker-factored, diagonal, or even last-layer approximate GGN matrices in practical use. These approximations are likely to harm the fidelity of learning outcomes. To tackle this issue, inspired by the connections between LLA and neural tangent kernels (NTKs), we develop a Nyström approximation to NTKs to accelerate LLA. Our method benefits from the capability of popular deep learning libraries for forward mode automatic differentiation, and enjoys reassuring theoretical guarantees. Extensive studies reflect the merits of the proposed method in aspects of both scalability and performance. Our method can even scale up to architectures like vision transformers. We also offer valuable ablation studies to diagnose our method. Code is available at https://github.com/thudzj/ELLA.
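The Nyström idea, shown here with a generic RBF kernel rather than the paper's NTK machinery: approximate the full n×n kernel matrix from m landmark columns, at O(nm²) instead of O(n²) cost:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.5):
    # Standard RBF (Gaussian) kernel between two point sets.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom(X, m, gamma=0.5):
    # Nystrom approximation K ~= C W^+ C^T built from m landmark points
    # (here simply the first m rows of X).
    C = rbf_kernel(X, X[:m], gamma)   # n x m cross-kernel block
    W = C[:m]                         # m x m landmark block
    return C @ np.linalg.pinv(W) @ C.T
```

When m equals n the approximation is exact; for m « n it trades a controlled amount of accuracy for large savings, which is the same trade LLA acceleration exploits for the NTK.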

NeurIPS Conference 2022 Conference Paper

Censored Quantile Regression Neural Networks for Distribution-Free Survival Analysis

  • Tim Pearce
  • Jong-Hyeon Jeong
  • Yichen Jia
  • Jun Zhu

This paper considers doing quantile regression on censored data using neural networks (NNs). This adds to the survival analysis toolkit by allowing direct prediction of the target variable, along with a distribution-free characterisation of uncertainty, using a flexible function approximator. We begin by showing how an algorithm popular in linear models can be applied to NNs. However, the resulting procedure is inefficient, requiring sequential optimisation of an individual NN at each desired quantile. Our major contribution is a novel algorithm that simultaneously optimises a grid of quantiles output by a single NN. To offer theoretical insight into our algorithm, we show firstly that it can be interpreted as a form of expectation-maximisation, and secondly that it exhibits a desirable `self-correcting' property. Experimentally, the algorithm produces quantiles that are better calibrated than existing methods on 10 out of 12 real datasets.
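The grid-of-quantiles objective can be sketched with the standard pinball loss; the censoring handling that is the paper's main technical ingredient is omitted, and the names are illustrative:

```python
import numpy as np

def pinball_loss(y, preds, taus):
    # Simultaneous quantile ("pinball") loss: preds has one column per
    # quantile level tau.  Averaging over the grid trains a single
    # network to output all quantiles at once, rather than one network
    # per quantile.
    diff = y[:, None] - preds                       # shape (n, q)
    loss = np.maximum(taus * diff, (taus - 1) * diff)
    return float(loss.mean())
```

Perfect quantile predictions give zero loss; over- and under-prediction are penalized asymmetrically according to each tau, which is what makes the minimizer the tau-th conditional quantile.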

IJCAI Conference 2022 Conference Paper

Cluster Attack: Query-based Adversarial Attacks on Graph with Graph-Dependent Priors

  • Zhengyi Wang
  • Zhongkai Hao
  • Ziqiao Wang
  • Hang Su
  • Jun Zhu

While deep neural networks have achieved great success in graph analysis, recent work has shown that they are vulnerable to adversarial attacks. Compared with adversarial attacks on image classification, performing adversarial attacks on graphs is more challenging because of the discrete and non-differentiable nature of the adjacency matrix for a graph. In this work, we propose Cluster Attack --- a Graph Injection Attack (GIA) on node classification, which injects fake nodes into the original graph to degenerate the performance of graph neural networks (GNNs) on certain victim nodes while affecting the other nodes as little as possible. We demonstrate that a GIA problem can be equivalently formulated as a graph clustering problem; thus, the discrete optimization problem of the adjacency matrix can be solved in the context of graph clustering. In particular, we propose to measure the similarity between victim nodes by a metric of Adversarial Vulnerability, which is related to how the victim nodes will be affected by the injected fake node, and to cluster the victim nodes accordingly. Our attack is performed in a practical and unnoticeable query-based black-box manner with only a few nodes on the graphs that can be accessed. Theoretical analysis and extensive experiments demonstrate the effectiveness of our method by fooling the node classifiers with only a small number of queries.

NeurIPS Conference 2022 Conference Paper

Confidence-based Reliable Learning under Dual Noises

  • Peng Cui
  • Yang Yue
  • Zhijie Deng
  • Jun Zhu

Deep neural networks (DNNs) have achieved remarkable success in a variety of computer vision tasks, where massive labeled images are routinely required for model optimization. Yet, the data collected from the open world are unavoidably polluted by noise, which may significantly undermine the efficacy of the learned models. Various attempts have been made to reliably train DNNs under data noise, but they separately account for either the noise existing in the labels or that existing in the images. A naive combination of the two lines of work would suffer from the limitations of both sides, and miss the opportunities to handle the two kinds of noise in parallel. This work provides a first, unified framework for reliable learning under the joint (image, label)-noise. Technically, we develop a confidence-based sample filter to progressively filter out noisy data without the need of pre-specifying noise ratio. Then, we penalize the model uncertainty of the detected noisy data instead of letting the model continue over-fitting the misleading information in them. Experimental results on various challenging synthetic and real-world noisy datasets verify that the proposed method can outperform competing baselines in terms of classification performance.

NeurIPS Conference 2022 Conference Paper

DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps

  • Cheng Lu
  • Yuhao Zhou
  • Fan Bao
  • Jianfei Chen
  • Chongxuan Li
  • Jun Zhu

Diffusion probabilistic models (DPMs) are emerging powerful generative models. Despite their high-quality generation performance, DPMs still suffer from their slow sampling as they generally need hundreds or thousands of sequential function evaluations (steps) of large neural networks to draw a sample. Sampling from DPMs can be viewed alternatively as solving the corresponding diffusion ordinary differential equations (ODEs). In this work, we propose an exact formulation of the solution of diffusion ODEs. The formulation analytically computes the linear part of the solution, rather than leaving all terms to black-box ODE solvers as adopted in previous works. By applying change-of-variable, the solution can be equivalently simplified to an exponentially weighted integral of the neural network. Based on our formulation, we propose DPM-Solver, a fast dedicated high-order solver for diffusion ODEs with the convergence order guarantee. DPM-Solver is suitable for both discrete-time and continuous-time DPMs without any further training. Experimental results show that DPM-Solver can generate high-quality samples in only 10 to 20 function evaluations on various datasets. We achieve 4.70 FID in 10 function evaluations and 2.87 FID in 20 function evaluations on the CIFAR10 dataset, and a 4~16x speedup compared with previous state-of-the-art training-free samplers on various datasets.
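The exact formulation the abstract describes can be sketched as follows; the notation is reconstructed from the abstract and standard diffusion-ODE conventions, so details may differ from the paper:

```latex
% With noise schedule (\alpha_t, \sigma_t) and log-SNR
% \lambda_t := \log(\alpha_t / \sigma_t), the semi-linear diffusion ODE
% from time s to time t admits the exact solution
x_t \;=\; \frac{\alpha_t}{\alpha_s}\, x_s
  \;-\; \alpha_t \int_{\lambda_s}^{\lambda_t}
  e^{-\lambda}\, \hat{\epsilon}_\theta\bigl(\hat{x}_\lambda, \lambda\bigr)\,
  \mathrm{d}\lambda .
% The linear first term is computed analytically; only the exponentially
% weighted integral of the noise-prediction network is approximated,
% via Taylor expansions in \lambda for the higher-order solvers.
```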

JMLR Journal 2022 Journal Article

Efficient Inference for Dynamic Flexible Interactions of Neural Populations

  • Feng Zhou
  • Quyu Kong
  • Zhijie Deng
  • Jichao Kan
  • Yixuan Zhang
  • Cheng Feng
  • Jun Zhu

Hawkes process provides an effective statistical framework for analyzing the interactions of neural spiking activities. Although utilized in many real applications, the classic Hawkes process is incapable of modeling inhibitory interactions among neural populations. Instead, the nonlinear Hawkes process allows for modeling a more flexible influence pattern with excitatory or inhibitory interactions. This work proposes a flexible nonlinear Hawkes process variant based on sigmoid nonlinearity. To ease inference, three sets of auxiliary latent variables (Polya-Gamma variables, latent marked Poisson processes and sparsity variables) are augmented to make functional connection weights appear in a Gaussian form, which enables simple iterative algorithms with analytical updates. As a result, the efficient Gibbs sampler, expectation-maximization algorithm and mean-field approximation are derived to estimate the interactions among neural populations. Furthermore, to reconcile with time-varying neural systems, the proposed time-invariant model is extended to a dynamic version by introducing a Markov state process. Similarly, three analytical iterative inference algorithms: Gibbs sampler, EM algorithm and mean-field approximation are derived. We compare the accuracy and efficiency of these inference algorithms on synthetic data, and further experiment on real neural recordings to demonstrate that the developed models achieve superior performance over the state-of-the-art competitors.

NeurIPS Conference 2022 Conference Paper

EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations

  • Min Zhao
  • Fan Bao
  • Chongxuan Li
  • Jun Zhu

Score-based diffusion models (SBDMs) have achieved the SOTA FID results in unpaired image-to-image translation (I2I). However, we notice that existing methods totally ignore the training data in the source domain, leading to sub-optimal solutions for unpaired I2I. To this end, we propose energy-guided stochastic differential equations (EGSDE) that employs an energy function pretrained on both the source and target domains to guide the inference process of a pretrained SDE for realistic and faithful unpaired I2I. Building upon two feature extractors, we carefully design the energy function such that it encourages the transferred image to preserve the domain-independent features and discard domain-specific ones. Further, we provide an alternative explanation of the EGSDE as a product of experts, where each of the three experts (corresponding to the SDE and two feature extractors) solely contributes to faithfulness or realism. Empirically, we compare EGSDE to a large family of baselines on three widely-adopted unpaired I2I tasks under four metrics. EGSDE not only consistently outperforms existing SBDMs-based methods in almost all settings but also achieves the SOTA realism results without harming the faithful performance. Furthermore, EGSDE allows for flexible trade-offs between realism and faithfulness and we improve the realism results further (e.g., FID of 51.04 in Cat $\to$ Dog and FID of 50.43 in Wild $\to$ Dog on AFHQ) by tuning hyper-parameters. The code is available at https://github.com/ML-GSAI/EGSDE.

NeurIPS Conference 2022 Conference Paper

Fast Instrument Learning with Faster Rates

  • Ziyu Wang
  • Yuhao Zhou
  • Jun Zhu

We investigate nonlinear instrumental variable (IV) regression given high-dimensional instruments. We propose a simple algorithm which combines kernelized IV methods and an arbitrary, adaptive regression algorithm, accessed as a black box. Our algorithm enjoys faster-rate convergence and adapts to the dimensionality of informative latent features, while avoiding an expensive minimax optimization procedure, which has been necessary to establish similar guarantees. It further brings the benefit of flexible machine learning models to quasi-Bayesian uncertainty quantification, likelihood-based model selection, and model averaging. Simulation studies demonstrate the competitive performance of our method.

NeurIPS Conference 2022 Conference Paper

Isometric 3D Adversarial Examples in the Physical World

  • Yibo Miao
  • Yinpeng Dong
  • Jun Zhu
  • Xiao-Shan Gao

Recently, several attempts have demonstrated that 3D deep learning models are as vulnerable to adversarial example attacks as 2D models. However, these methods are still far from stealthy and suffer from severe performance degradation in the physical world. Although 3D data is highly structured, it is difficult to bound the perturbations with simple metrics in the Euclidean space. In this paper, we propose a novel $\epsilon$-isometric ($\epsilon$-ISO) attack method to generate natural and robust 3D adversarial examples in the physical world by considering the geometric properties of 3D objects and the invariance to physical transformations. For naturalness, we constrain the adversarial example and the original one to be $\epsilon$-isometric by adopting the Gaussian curvature as the surrogate metric under a theoretical analysis. For robustness under physical transformations, we propose a maxima over transformation (MaxOT) method to actively search for the most difficult transformations rather than random ones to make the generated adversarial example more robust in the physical world. Extensive experiments on typical point cloud recognition models validate that our approach can improve the attack success rate and naturalness of the generated 3D adversarial examples compared with state-of-the-art attack methods.

AAAI Conference 2022 Conference Paper

Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

  • Jialian Li
  • Tongzheng Ren
  • Dong Yan
  • Hang Su
  • Jun Zhu

In high-stake scenarios like medical treatment and autopiloting, it’s risky or even infeasible to collect online experimental data to train the agent. Simulation-based training can alleviate this issue, but may suffer from its inherent mismatches from the simulator and real environment. It is therefore imperative to utilize the simulator to learn a robust policy for the real-world deployment. In this work, we consider policy learning for Robust Markov Decision Processes (RMDP), where the agent tries to seek a robust policy with respect to unexpected perturbations on the environments. Specifically, we focus on the setting where the training environment can be characterized as a generative model and a constrained perturbation can be added to the model during testing. Our goal is to identify a near-optimal robust policy for the perturbed testing environment, which introduces additional technical difficulties as we need to simultaneously estimate the training environment uncertainty from samples and find the worst-case perturbation for testing. To solve this issue, we propose a generic method which formalizes the perturbation as an opponent to obtain a two-player zero-sum game, and further show that the Nash Equilibrium corresponds to the robust policy. We prove that, with a polynomial number of samples from the generative model, our algorithm can find a near-optimal robust policy with a high probability. Our method is able to deal with general perturbations under some mild assumptions and can also be extended to more complex problems like robust partially observable Markov decision processes, thanks to the game-theoretical formulation.

ICML Conference 2022 Conference Paper

Thompson Sampling for (Combinatorial) Pure Exploration

  • Siwei Wang 0002
  • Jun Zhu

Existing methods of combinatorial pure exploration mainly focus on the UCB approach. To make the algorithm efficient, they usually use the sum of upper confidence bounds within arm set $S$ to represent the upper confidence bound of $S$, which can be much larger than the tight upper confidence bound of $S$ and leads to a much higher complexity than necessary, since the empirical means of different arms in $S$ are independent. To deal with this challenge, we explore the idea of Thompson Sampling (TS) that uses independent random samples instead of the upper confidence bounds, and design the first TS-based algorithm TS-Explore for (combinatorial) pure exploration. In TS-Explore, the sum of independent random samples within arm set $S$ will not exceed the tight upper confidence bound of $S$ with high probability. Hence it solves the above challenge, and achieves a lower complexity upper bound than existing efficient UCB-based algorithms in general combinatorial pure exploration. As for pure exploration of classic multi-armed bandit, we show that TS-Explore achieves an asymptotically optimal complexity upper bound.

JMLR Journal 2022 Journal Article

Tianshou: A Highly Modularized Deep Reinforcement Learning Library

  • Jiayi Weng
  • Huayu Chen
  • Dong Yan
  • Kaichao You
  • Alexis Duburcq
  • Minghao Zhang
  • Yi Su
  • Hang Su

In this paper, we present Tianshou, a highly modularized Python library for deep reinforcement learning (DRL) that uses PyTorch as its backend. Tianshou intends to be research-friendly by providing a flexible and reliable infrastructure of DRL algorithms. It supports online and offline training with more than 20 classic algorithms through a unified interface. To facilitate related research and prove Tianshou's reliability, we have released Tianshou's benchmark of MuJoCo environments, covering eight classic algorithms with state-of-the-art performance. We open-sourced Tianshou at https://github.com/thu-ml/tianshou/.

IJCAI Conference 2022 Conference Paper

Towards Safe Reinforcement Learning via Constraining Conditional Value-at-Risk

  • Chengyang Ying
  • Xinning Zhou
  • Hang Su
  • Dong Yan
  • Ning Chen
  • Jun Zhu

Though deep reinforcement learning (DRL) has obtained substantial success, it may encounter catastrophic failures due to the intrinsic uncertainty of both transition and observation. Most of the existing methods for safe reinforcement learning can only handle transition disturbance or observation disturbance since these two kinds of disturbance affect different parts of the agent; besides, the popular worst-case return may lead to overly pessimistic policies. To address these issues, we first theoretically prove that the performance degradation under transition disturbance and observation disturbance depends on a novel metric of Value Function Range (VFR), which corresponds to the gap in the value function between the best state and the worst state. Based on the analysis, we adopt conditional value-at-risk (CVaR) as an assessment of risk and propose a novel reinforcement learning algorithm of CVaR-Proximal-Policy-Optimization (CPPO) which formalizes the risk-sensitive constrained optimization problem by keeping its CVaR under a given threshold. Experimental results show that CPPO achieves a higher cumulative reward and is more robust against both observation and transition disturbances on a series of continuous control tasks in MuJoCo.
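The empirical CVaR that CPPO keeps under a threshold can be computed as the mean of the worst alpha-fraction of returns; this sketch uses assumed names and the lower-tail convention:

```python
import numpy as np

def cvar(returns, alpha=0.1):
    # Empirical conditional value-at-risk: the mean of the worst
    # alpha-fraction of returns (lower tail).  Unlike the worst-case
    # return, it averages over the tail, so it is less pessimistic.
    r = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(r))))
    return float(r[:k].mean())
```

Constraining this quantity (rather than the single worst outcome) is what lets the method stay risk-sensitive without collapsing into overly pessimistic policies.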

NeurIPS Conference 2022 Conference Paper

ViewFool: Evaluating the Robustness of Visual Recognition to Adversarial Viewpoints

  • Yinpeng Dong
  • Shouwei Ruan
  • Hang Su
  • Caixin Kang
  • Xingxing Wei
  • Jun Zhu

Recent studies have demonstrated that visual recognition models lack robustness to distribution shift. However, current work mainly considers model robustness to 2D image transformations, leaving viewpoint changes in the 3D world less explored. In general, viewpoint changes are prevalent in various real-world applications (e.g., autonomous driving), making it imperative to evaluate viewpoint robustness. In this paper, we propose a novel method called ViewFool to find adversarial viewpoints that mislead visual recognition models. By encoding real-world objects as neural radiance fields (NeRF), ViewFool characterizes a distribution of diverse adversarial viewpoints under an entropic regularizer, which helps to handle the fluctuations of the real camera pose and mitigate the reality gap between the real objects and their neural representations. Experiments validate that the common image classifiers are extremely vulnerable to the generated adversarial viewpoints, which also exhibit high cross-model transferability. Based on ViewFool, we introduce ImageNet-V, a new out-of-distribution dataset for benchmarking viewpoint robustness of image classifiers. Evaluation results on 40 classifiers with diverse architectures, objective functions, and data augmentations reveal a significant drop in model performance when tested on ImageNet-V, which provides a possibility to leverage ViewFool as an effective data augmentation strategy to improve viewpoint robustness.

AAAI Conference 2021 Conference Paper

A Bayesian Approach for Subset Selection in Contextual Bandits

  • Jialian Li
  • Chao Du
  • Jun Zhu

Subset selection in Contextual Bandits (CB) is an important task in various applications such as advertisement recommendation. In CB, arms are attached with contexts and thus correlated in the context space. Proper exploration for subset selection in CB should carefully consider the contexts. Previous works mainly concentrate on the best one arm identification in linear bandit problems, where the expected rewards are linearly dependent on the contexts. However, these methods highly rely on linearity, and cannot be easily extended to more general cases. We propose a novel Bayesian approach for subset selection in general CB where the reward functions can be nonlinear. Our method provides a principled way to employ contextual information and efficiently explore the arms. For cases with relatively smooth posteriors, we give theoretical results that are comparable to previous works. For general cases, we provide a calculable approximate variant. Empirical results show the effectiveness of our method on both linear bandits and general CB.

NeurIPS Conference 2021 Conference Paper

Accumulative Poisoning Attacks on Real-time Data

  • Tianyu Pang
  • Xiao Yang
  • Yinpeng Dong
  • Hang Su
  • Jun Zhu

Collecting training data from untrusted sources exposes machine learning services to poisoning adversaries, who maliciously manipulate training data to degrade the model accuracy. When trained on offline datasets, poisoning adversaries have to inject the poisoned data in advance before training, and the order of feeding these poisoned batches into the model is stochastic. In contrast, practical systems are more usually trained/fine-tuned on sequentially captured real-time data, in which case poisoning adversaries could dynamically poison each data batch according to the current model state. In this paper, we focus on the real-time settings and propose a new attacking strategy, which affiliates an accumulative phase with poisoning attacks to secretly (i.e., without affecting accuracy) magnify the destructive effect of a (poisoned) trigger batch. By mimicking online learning and federated learning on MNIST and CIFAR-10, we show that model accuracy significantly drops by a single update step on the trigger batch after the accumulative phase. Our work validates that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects, with no need to explore complex techniques.

NeurIPS Conference 2021 Conference Paper

AFEC: Active Forgetting of Negative Transfer in Continual Learning

  • Liyuan Wang
  • Mingtian Zhang
  • Zhongfan Jia
  • Qian Li
  • Chenglong Bao
  • Kaisheng Ma
  • Jun Zhu
  • Yi Zhong

Continual learning aims to learn a sequence of tasks from dynamic data distributions. Without access to the old training samples, knowledge transfer from the old tasks to each new task is difficult to determine, which might be either positive or negative. If the old knowledge interferes with the learning of a new task, i.e., the forward knowledge transfer is negative, then precisely remembering the old tasks will further aggravate the interference, thus decreasing the performance of continual learning. By contrast, biological neural networks can actively forget the old knowledge that conflicts with the learning of a new experience, through regulating the learning-triggered synaptic expansion and synaptic convergence. Inspired by the biological active forgetting, we propose to actively forget the old knowledge that limits the learning of new tasks to benefit continual learning. Under the framework of Bayesian continual learning, we develop a novel approach named Active Forgetting with synaptic Expansion-Convergence (AFEC). Our method dynamically expands parameters to learn each new task and then selectively combines them, which is formally consistent with the underlying mechanism of biological active forgetting. We extensively evaluate AFEC on a variety of continual learning benchmarks, including CIFAR-10 regression tasks, visual classification tasks and Atari reinforcement tasks, where AFEC effectively improves the learning of new tasks and achieves state-of-the-art performance in a plug-and-play way.

IJCAI Conference 2021 Conference Paper

Combining Tree Search and Action Prediction for State-of-the-Art Performance in DouDiZhu

  • Yunsheng Zhang
  • Dong Yan
  • Bei Shi
  • Haobo Fu
  • Qiang Fu
  • Hang Su
  • Jun Zhu
  • Ning Chen

AlphaZero has achieved superhuman performance on various perfect-information games, such as chess, shogi and Go. However, directly applying AlphaZero to imperfect-information games (IIG) is infeasible, because traditional MCTS methods cannot handle the missing information of other players. Meanwhile, there have been several extensions of MCTS for IIGs, which implicitly or explicitly sample a state of other players. But, due to the inability to handle private and public information well, the performance of these methods is not satisfactory. In this paper, we extend AlphaZero to multiplayer IIGs by developing a new MCTS method, Action-Prediction MCTS (AP-MCTS). In contrast to traditional MCTS extensions for IIGs, AP-MCTS first builds the search tree based on public information, adopts the policy-value network to generalize between hidden states, and finally predicts other players' actions directly. This design bypasses the inefficiency of sampling and the difficulty of predicting the state of other players. We conduct extensive experiments on the popular 3-player poker game DouDiZhu to evaluate the performance of AP-MCTS within the AlphaZero framework. When playing against experienced human players, AP-MCTS achieved a 65.65% winning rate, almost twice the humans' winning rate. Compared with state-of-the-art DouDiZhu AIs, the Elo rating of AP-MCTS is 50 to 200 points higher. The ablation study shows that accurate action prediction is the key to AP-MCTS winning.

AAAI Conference 2021 Conference Paper

Improving Generative Moment Matching Networks with Distribution Partition

  • Yong Ren
  • Yucen Luo
  • Jun Zhu

Generative moment matching networks (GMMN) present a theoretically sound approach to learning deep generative models. However, such methods are typically limited by high sample complexity, making them impractical for generating complex data. In this paper, we present a new strategy to train GMMN with a low sample complexity while retaining the theoretical soundness. Our method introduces some auxiliary variables, whose values are provided by a pre-trained model such as an encoder network in practice. Conditioned on these variables, we partition the distribution into a set of conditional distributions, which can be effectively matched with a low sample complexity. We instantiate this strategy by presenting an amortized network called GMMN-DP with shared auxiliary variable information for the data generation task, as well as developing an efficient stochastic training algorithm. The experimental results show that GMMN-DP can generate complex samples on datasets such as CelebA and CIFAR-10, where the vanilla GMMN fails.
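The moment-matching objective behind GMMN-style training is the maximum mean discrepancy (MMD) between model samples and data. As a concrete illustration, here is a minimal NumPy sketch of the (biased) squared-MMD estimator with an RBF kernel; the function names and the fixed bandwidth are illustrative choices, not the paper's:

```python
import numpy as np

def rbf_kernel(X, Y, bandwidth=1.0):
    """Gaussian RBF kernel matrix k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2))."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-sq / (2 * bandwidth**2))

def mmd2_biased(X, Y, bandwidth=1.0):
    """Biased estimator of squared MMD between samples X ~ p and Y ~ q.

    Equals the squared distance between the empirical kernel mean
    embeddings of the two samples, hence it is always non-negative."""
    return (rbf_kernel(X, X, bandwidth).mean()
            - 2 * rbf_kernel(X, Y, bandwidth).mean()
            + rbf_kernel(Y, Y, bandwidth).mean())

rng = np.random.default_rng(0)
same = mmd2_biased(rng.normal(size=(500, 2)), rng.normal(size=(500, 2)))
diff = mmd2_biased(rng.normal(size=(500, 2)), rng.normal(3.0, 1.0, size=(500, 2)))
print(same < diff)  # matched distributions give a much smaller MMD estimate
```

The estimator is small when the two sample sets come from the same distribution and grows as they diverge, which is what a GMMN minimizes with respect to its generator parameters; in practice a mixture of bandwidths is often used rather than a single fixed one.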

AAAI Conference 2021 Conference Paper

Learning Task-Distribution Reward Shaping with Meta-Learning

  • Haosheng Zou
  • Tongzheng Ren
  • Dong Yan
  • Hang Su
  • Jun Zhu

Reward shaping is one of the most effective methods to tackle the crucial yet challenging problem of credit assignment and accelerate Reinforcement Learning. However, designing shaping functions usually requires rich expert knowledge and hand-engineering, and the difficulties are further exacerbated given multiple tasks to solve. In this paper, we consider reward shaping on a distribution of tasks that share state spaces but not necessarily action spaces. We provide insights into optimal reward shaping, and propose a novel meta-learning framework to automatically learn such reward shaping to apply to newly sampled tasks. Theoretical analysis and extensive experiments establish our method as the state of the art in learning task-distribution reward shaping, outperforming previous such works (Konidaris and Barto 2006; Snel and Whiteson 2014). We further show that our method outperforms learning intrinsic rewards (Yang et al. 2019; Zheng et al. 2020), outperforms Rainbow (Hessel et al. 2018) in complex pixel-based CoinRun games, and is also better than hand-designed reward shaping on grid mazes. While the goal of this paper is to learn reward shaping rather than to propose new general meta-learning algorithms such as PEARL (Rakelly et al. 2019) or MQL (Fakoor et al. 2020), our framework based on MAML (Finn, Abbeel, and Levine 2017) also outperforms PEARL/MQL, and could be combined with them for further improvement.

NeurIPS Conference 2021 Conference Paper

On the Convergence of Prior-Guided Zeroth-Order Optimization Algorithms

  • Shuyu Cheng
  • Guoqiang Wu
  • Jun Zhu

Zeroth-order (ZO) optimization is widely used to handle challenging tasks, such as query-based black-box adversarial attacks and reinforcement learning. Various attempts have been made to integrate prior information into the gradient estimation procedure based on finite differences, with promising empirical results. However, their convergence properties are not well understood. This paper makes an attempt to fill this gap by analyzing the convergence of prior-guided ZO algorithms under a greedy descent framework with various gradient estimators. We provide a convergence guarantee for the prior-guided random gradient-free (PRGF) algorithms. Moreover, to further accelerate over greedy descent methods, we present a new accelerated random search (ARS) algorithm that incorporates prior information, together with a convergence analysis. Finally, our theoretical results are confirmed by experiments on several numerical benchmarks as well as adversarial attacks.
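A random gradient-free estimator of the kind analyzed here can be sketched in a few lines. The version below averages forward finite differences along random directions and optionally mixes in a prior direction; the fixed 50/50 mixing weight is a simplifying assumption for illustration, not the optimally derived coefficient from the paper:

```python
import numpy as np

def rgf_grad(f, x, num_queries=100, sigma=1e-4, prior=None, rng=None):
    """Random gradient-free estimate of grad f(x) from finite differences.

    Each query direction u gives (f(x + sigma*u) - f(x)) / sigma as an
    estimate of the directional derivative u . grad f(x). If a prior
    direction is supplied (e.g. a surrogate model's gradient), it is
    blended into each sampled direction with a fixed 50/50 weight --
    a simplified stand-in for a prior-guided estimator.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    fx, g = f(x), np.zeros_like(x)
    for _ in range(num_queries):
        u = rng.normal(size=x.shape)
        if prior is not None:
            u = 0.5 * prior / np.linalg.norm(prior) + 0.5 * u / np.linalg.norm(u)
        u /= np.linalg.norm(u)
        g += (f(x + sigma * u) - fx) / sigma * u
    return g / num_queries

# Sanity check on a quadratic, whose true gradient is 2x.
x = np.array([1.0, -2.0, 3.0])
g_hat = rgf_grad(lambda z: np.sum(z**2), x)
cos = g_hat @ (2 * x) / (np.linalg.norm(g_hat) * np.linalg.norm(2 * x))
print(round(cos, 3))  # cosine similarity with the true gradient
```

The estimate is only aligned with the true gradient up to sampling noise (its expectation is the gradient scaled by 1/dimension), which is precisely why the convergence analysis of such estimators is non-trivial.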

NeurIPS Conference 2021 Conference Paper

Rethinking and Reweighting the Univariate Losses for Multi-Label Ranking: Consistency and Generalization

  • Guoqiang Wu
  • Chongxuan Li
  • Kun Xu
  • Jun Zhu

The (partial) ranking loss is a commonly used evaluation measure for multi-label classification, which is usually optimized with convex surrogates for computational efficiency. Prior theoretical efforts on multi-label ranking mainly focus on (Fisher) consistency analyses. However, there is a gap between existing theory and practice --- some inconsistent pairwise losses can lead to promising performance, while some consistent univariate losses usually have no clear superiority in practice. To take a step towards filling this gap, this paper presents a systematic study from two complementary perspectives: consistency and generalization error bounds of learning algorithms. We theoretically find two key factors of the distribution (or dataset) that affect the learning guarantees of algorithms: the instance-wise class imbalance and the label size $c$. Specifically, in an extremely imbalanced case, the algorithm with the consistent univariate loss has an error bound of $O(c)$, while the one with the inconsistent pairwise loss depends on $O(\sqrt{c})$ as shown in prior work. This may shed light on the superior performance of pairwise methods in practice, where real datasets are usually highly imbalanced. Moreover, we present an inconsistent reweighted univariate loss-based algorithm that enjoys an error bound of $O(\sqrt{c})$ for promising performance as well as the computational efficiency of univariate losses. Finally, experimental results confirm our theoretical findings.

NeurIPS Conference 2021 Conference Paper

Scalable Quasi-Bayesian Inference for Instrumental Variable Regression

  • Ziyu Wang
  • Yuhao Zhou
  • Tongzheng Ren
  • Jun Zhu

Recent years have witnessed an upsurge of interest in employing flexible machine learning models for instrumental variable (IV) regression, but the development of uncertainty quantification methodology is still lacking. In this work we present a scalable quasi-Bayesian procedure for IV regression, building upon the recently developed kernelized IV models. Contrary to Bayesian modeling for IV, our approach does not require additional assumptions on the data generating process, and leads to a scalable approximate inference algorithm with time cost comparable to the corresponding point estimation methods. Our algorithm can be further extended to work with neural network models. We analyze the theoretical properties of the proposed quasi-posterior, and demonstrate through empirical evaluation the competitive performance of our method.

NeurIPS Conference 2021 Conference Paper

Stability and Generalization of Bilevel Programming in Hyperparameter Optimization

  • Fan Bao
  • Guoqiang Wu
  • Chongxuan Li
  • Jun Zhu
  • Bo Zhang

The (gradient-based) bilevel programming framework is widely used in hyperparameter optimization and has achieved excellent performance empirically. Previous theoretical work mainly focuses on its optimization properties, while leaving the analysis on generalization largely open. This paper attempts to address the issue by presenting an expectation bound w.r.t. the validation set based on uniform stability. Our results can explain some mysterious behaviours of the bilevel programming in practice, for instance, overfitting to the validation set. We also present an expectation bound for the classical cross-validation algorithm. Our results suggest that gradient-based algorithms can be better than cross-validation under certain conditions from a theoretical perspective. Furthermore, we prove that regularization terms in both the outer and inner levels can relieve the overfitting problem in gradient-based algorithms. In experiments on feature learning and data reweighting for noisy labels, we corroborate our theoretical findings.

NeurIPS Conference 2020 Conference Paper

Adversarial Distributional Training for Robust Deep Learning

  • Yinpeng Dong
  • Zhijie Deng
  • Tianyu Pang
  • Jun Zhu
  • Hang Su

Adversarial training (AT) is among the most effective techniques to improve model robustness by augmenting training data with adversarial examples. However, most existing AT methods adopt a specific attack to craft adversarial examples, leading to unreliable robustness against other unseen attacks. Besides, a single attack algorithm could be insufficient to explore the space of perturbations. In this paper, we introduce adversarial distributional training (ADT), a novel framework for learning robust models. ADT is formulated as a minimax optimization problem, where the inner maximization aims to learn an adversarial distribution to characterize the potential adversarial examples around a natural one under an entropic regularizer, and the outer minimization aims to train robust models by minimizing the expected loss over the worst-case adversarial distributions. Through a theoretical analysis, we develop a general algorithm for solving ADT, and present three approaches for parameterizing the adversarial distributions, ranging from the typical Gaussian distributions to flexible implicit ones. Empirical results on several benchmarks validate the effectiveness of ADT compared with the state-of-the-art AT methods.

NeurIPS Conference 2020 Conference Paper

Bi-level Score Matching for Learning Energy-based Latent Variable Models

  • Fan Bao
  • Chongxuan Li
  • Kun Xu
  • Hang Su
  • Jun Zhu
  • Bo Zhang

Score matching (SM) provides a compelling approach to learn energy-based models (EBMs) by avoiding the calculation of the partition function. However, learning energy-based latent variable models (EBLVMs) remains largely open, except for some special cases. This paper presents a bi-level score matching (BiSM) method to learn EBLVMs with general structures by reformulating SM as a bi-level optimization problem. The higher level introduces a variational posterior of the latent variables and optimizes a modified SM objective, and the lower level optimizes the variational posterior to fit the true posterior. To solve BiSM efficiently, we develop a stochastic optimization algorithm with gradient unrolling. Theoretically, we analyze the consistency of BiSM and the convergence of the stochastic algorithm. Empirically, we show the promise of BiSM in Gaussian restricted Boltzmann machines and highly nonstructural EBLVMs parameterized by deep convolutional neural networks. BiSM is comparable to the widely adopted contrastive divergence and SM methods when they are applicable, and can learn complex EBLVMs with intractable posteriors to generate natural images.

NeurIPS Conference 2020 Conference Paper

Boosting Adversarial Training with Hypersphere Embedding

  • Tianyu Pang
  • Xiao Yang
  • Yinpeng Dong
  • Kun Xu
  • Jun Zhu
  • Hang Su

Adversarial training (AT) is one of the most effective defenses against adversarial attacks for deep learning models. In this work, we advocate incorporating the hypersphere embedding (HE) mechanism into the AT procedure by regularizing the features onto compact manifolds, which constitutes a lightweight yet effective module to blend in the strength of representation learning. Our extensive analyses reveal that AT and HE are well coupled to benefit the robustness of the adversarially trained models from several aspects. We validate the effectiveness and adaptability of HE by embedding it into the popular AT frameworks including PGD-AT, ALP, and TRADES, as well as the FreeAT and FastAT strategies. In the experiments, we evaluate our methods under a wide range of adversarial attacks on the CIFAR-10 and ImageNet datasets, which verifies that integrating HE can consistently enhance the model robustness for each AT framework with little extra computation.
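The hypersphere embedding mechanism regularizes features onto a compact manifold; its core operation can be illustrated as cosine logits, computed by normalizing both features and class weights before the inner product. This is a minimal sketch only, and the `scale` temperature is an assumed value rather than the paper's setting:

```python
import numpy as np

def hypersphere_logits(features, weights, scale=10.0):
    """Cosine logits: project features and class weights onto the unit
    hypersphere, then score by scaled cosine similarity. This decouples
    logit magnitude from feature norms, so the loss focuses on angles."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    return scale * f @ w.T

rng = np.random.default_rng(0)
logits = hypersphere_logits(rng.normal(size=(4, 8)), rng.normal(size=(3, 8)))
print(logits.shape)                     # (4, 3): one logit per (example, class)
print(np.all(np.abs(logits) <= 10.0))   # cosine similarity is bounded by 1
```

Because both factors are unit-length, every logit is bounded by the temperature, which is one reason such a module composes cleanly with existing AT frameworks like PGD-AT or TRADES.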

NeurIPS Conference 2020 Conference Paper

Calibrated Reliable Regression using Maximum Mean Discrepancy

  • Peng Cui
  • Wenbo Hu
  • Jun Zhu

Accurate quantification of uncertainty is crucial for real-world applications of machine learning. However, modern deep neural networks still produce unreliable predictive uncertainty, often yielding over-confident predictions. In this paper, we are concerned with getting well-calibrated predictions in regression tasks. We propose the calibrated regression method using the maximum mean discrepancy by minimizing the kernel embedding measure. Theoretically, the calibration error of our method asymptotically converges to zero when the sample size is large enough. Experiments on non-trivial real datasets show that our method can produce well-calibrated and sharp prediction intervals, which outperforms the related state-of-the-art methods.

NeurIPS Conference 2020 Conference Paper

Efficient Learning of Generative Models via Finite-Difference Score Matching

  • Tianyu Pang
  • Kun Xu
  • Chongxuan Li
  • Yang Song
  • Stefano Ermon
  • Jun Zhu

Several machine learning applications involve optimizing higher-order derivatives (e.g., gradients of gradients) during training, which can be expensive with respect to memory and computation even with automatic differentiation. As a typical example in generative modeling, score matching (SM) involves the optimization of the trace of a Hessian. To improve computational efficiency, we rewrite the SM objective and its variants in terms of directional derivatives, and present a generic strategy to efficiently approximate any-order directional derivative with finite differences (FD). Our approximation only involves function evaluations, which can be executed in parallel, and no gradient computations. Thus, it reduces the total computational cost while also improving numerical stability. We provide two instantiations by reformulating variants of SM objectives into the FD forms. Empirically, we demonstrate that our methods produce results comparable to the gradient-based counterparts while being much more computationally efficient.
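The key reformulation is that a directional derivative can be approximated purely from function evaluations. A minimal sketch of the first-order central-difference estimate (the step size `eps` is an arbitrary illustrative choice):

```python
import numpy as np

def directional_derivative_fd(f, x, v, eps=1e-4):
    """Central finite-difference estimate of the directional derivative
    v . grad f(x), using only two function evaluations (no autodiff)."""
    return (f(x + eps * v) - f(x - eps * v)) / (2 * eps)

x = np.array([1.0, 2.0])
v = np.array([0.0, 1.0])
# For f(x) = sum(x^2), the exact directional derivative is v . (2x) = 4.
approx = directional_derivative_fd(lambda z: np.sum(z**2), x, v)
print(round(approx, 6))  # -> 4.0
```

The two evaluations are independent, so they can run in parallel, and the same pattern extends to higher-order directional derivatives by nesting differences, which is the efficiency the abstract refers to.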

NeurIPS Conference 2020 Conference Paper

Further Analysis of Outlier Detection with Deep Generative Models

  • Ziyu Wang
  • Bin Dai
  • David Wipf
  • Jun Zhu

The recent, counter-intuitive discovery that deep generative models (DGMs) can frequently assign a higher likelihood to outliers has implications for both outlier detection applications as well as our overall understanding of generative modeling. In this work, we present a possible explanation for this phenomenon, starting from the observation that a model's typical set and high-density region may not coincide. From this vantage point we propose a novel outlier test, the empirical success of which suggests that the failure of existing likelihood-based outlier tests does not necessarily imply that the corresponding generative model is uncalibrated. We also conduct additional experiments to help disentangle the impact of low-level texture versus high-level semantics in differentiating outliers. In aggregate, these results suggest that modifications to the standard evaluation practices and benchmarks commonly applied in the literature are needed.

NeurIPS Conference 2020 Conference Paper

Multi-label classification: do Hamming loss and subset accuracy really conflict with each other?

  • Guoqiang Wu
  • Jun Zhu

Various evaluation measures have been developed for multi-label classification, including Hamming Loss (HL), Subset Accuracy (SA) and Ranking Loss (RL). However, there is a gap between empirical results and the existing theories: 1) an algorithm often empirically performs well on some measure(s) while poorly on others, while a formal theoretical analysis is lacking; and 2) in small label space cases, the algorithms optimizing HL often have comparable or even better performance on the SA measure than those optimizing SA directly, while existing theoretical results show that SA and HL are conflicting measures. This paper provides an attempt to fill this gap by analyzing the learning guarantees of the corresponding learning algorithms on both SA and HL measures. We show that when a learning algorithm optimizes HL with its surrogate loss, it enjoys an error bound for the HL measure independent of $c$ (the number of labels), while the bound for the SA measure depends on at most $O(c)$. On the other hand, when directly optimizing SA with its surrogate loss, it has learning guarantees that depend on $O(\sqrt{c})$ for both HL and SA measures. This explains the observation that when the label space is not large, optimizing HL with its surrogate loss can have promising performance for SA. We further show that our techniques are applicable to analyzing the learning guarantees of algorithms on other measures, such as RL. Finally, the theoretical analyses are supported by experimental results.
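For concreteness, the two measures compared in this analysis can be computed directly; a small NumPy sketch with a toy label matrix:

```python
import numpy as np

def hamming_loss(Y_true, Y_pred):
    """Fraction of individual label slots predicted incorrectly,
    averaged over all examples and labels."""
    return np.mean(Y_true != Y_pred)

def subset_accuracy(Y_true, Y_pred):
    """Fraction of examples whose entire label vector is predicted
    exactly right -- an all-or-nothing measure."""
    return np.mean(np.all(Y_true == Y_pred, axis=1))

Y_true = np.array([[1, 0, 1], [0, 1, 1]])
Y_pred = np.array([[1, 0, 1], [0, 1, 0]])
print(hamming_loss(Y_true, Y_pred))     # 1 wrong slot of 6 -> 0.1666...
print(subset_accuracy(Y_true, Y_pred))  # 1 of 2 rows exactly right -> 0.5
```

A single flipped label costs only $1/(nc)$ under HL but the whole example under SA, which is why the two measures were long believed to conflict, and why the dependence on $c$ in the bounds matters.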

AIJ Journal 2020 Journal Article

PopMNet: Generating structured pop music melodies using neural networks

  • Jian Wu
  • Xiaoguang Liu
  • Xiaolin Hu
  • Jun Zhu

Recently, many deep learning models have been proposed to generate symbolic melodies. However, generating pop music melodies with well organized structures remains to be challenging. In this paper, we present a melody structure-based model called PopMNet to generate structured pop music melodies. The melody structure is defined by pairwise relations, specifically, repetition and sequence, between all bars in a melody. PopMNet consists of a Convolutional Neural Network (CNN)-based Structure Generation Net (SGN) and a Recurrent Neural Network (RNN)-based Melody Generation Net (MGN). The former generates melody structures and the latter generates melodies conditioned on the structures and chord progressions. The proposed model is compared with four existing models AttentionRNN, LookbackRNN, MidiNet and Music Transformer. The results indicate that the melodies generated by our model contain much clearer structures compared to those generated by other models, as confirmed by human behavior experiments.

NeurIPS Conference 2020 Conference Paper

Understanding and Exploring the Network with Stochastic Architectures

  • Zhijie Deng
  • Yinpeng Dong
  • Shifeng Zhang
  • Jun Zhu

There is an emerging trend to train a network with stochastic architectures to enable various architectures to be plugged and played during inference. However, the existing investigation is highly entangled with neural architecture search (NAS), limiting its widespread use across scenarios. In this work, we decouple the training of a network with stochastic architectures (NSA) from NAS and provide a first systematic investigation of it as a stand-alone problem. We first uncover the characteristics of NSA in various aspects ranging from training stability, convergence, and predictive behaviour, to generalization capacity to unseen architectures. We identify various issues of the vanilla NSA, such as training/test disparity and function mode collapse, and further propose solutions to these issues with theoretical and empirical insights. We believe that these results could also serve as good heuristics for NAS. Given these understandings, we further apply the NSA with our improvements to diverse scenarios to fully exploit its promise of inference-time architecture stochasticity, including model ensemble, uncertainty estimation and semi-supervised learning. Remarkable performance (e.g., a 2.75% error rate and 0.0032 expected calibration error on CIFAR-10) validates the effectiveness of such a model, providing new perspectives on exploring the potential of networks with stochastic architectures, beyond NAS.

AAAI Conference 2019 Conference Paper

Combo-Action: Training Agent For FPS Game with Auxiliary Tasks

  • Shiyu Huang
  • Hang Su
  • Jun Zhu
  • Ting Chen

Deep reinforcement learning (DRL) has achieved performance surpassing humans on Atari games, using raw pixels and rewards to learn everything. However, first-person-shooter (FPS) games in 3D environments contain higher-level human concepts (enemy, weapon, spatial structure, etc.) and a large action space. In this paper, we explore a novel method which can plan on temporally-extended action sequences, which we refer to as Combo-Actions, to compress the action space. We further train a deep recurrent Q-learning network model as a high-level controller, called the supervisory network, to manage the Combo-Actions. Our method can be boosted with auxiliary tasks (enemy detection and depth prediction), which enable the agent to extract high-level concepts in FPS games. Extensive experiments show that our method is efficient in the training process and outperforms previous state-of-the-art approaches by a large margin. Ablation experiments also indicate that our method can boost the performance of the FPS agent in a reasonable way.

AAAI Conference 2019 Conference Paper

Composite Binary Decomposition Networks

  • You Qiaoben
  • Zheng Wang
  • Jianguo Li
  • Yinpeng Dong
  • Yu-Gang Jiang
  • Jun Zhu

Binary neural networks have great resource and computing efficiency, but suffer from a long training procedure and non-negligible accuracy drops compared to their full-precision counterparts. In this paper, we propose composite binary decomposition networks (CBDNet), which first compose the real-valued tensor of each layer from a limited number of binary tensors, and then decompose some conditioned binary tensors into two low-rank binary tensors, so that the number of parameters and operations is greatly reduced compared to the original ones. Experiments demonstrate the effectiveness of the proposed method, as CBDNet can approximate the image classification networks ResNet-18 using 5.25 bits, VGG-16 using 5.47 bits, and DenseNet-121 using 5.72 bits, the object detection network SSD300 using 4.38 bits, and the semantic segmentation network SegNet using 5.18 bits, all with minor accuracy drops.
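The composition step can be illustrated with a greedy scheme that repeatedly fits a scaled sign tensor to the residual. This is a sketch of the general idea only, under the assumption of a plain greedy fit; the actual CBDNet additionally decomposes selected binary tensors into low-rank binary factors:

```python
import numpy as np

def binary_compose(W, num_bases=5):
    """Greedily approximate a real-valued tensor as a sum of scaled
    binary ({-1, +1}) tensors: at each step take B = sign(residual),
    fit the least-squares scale alpha = mean|residual|, and recurse
    on the remaining residual."""
    residual, terms = W.copy(), []
    for _ in range(num_bases):
        B = np.sign(residual)
        B[B == 0] = 1                       # break ties toward +1
        alpha = np.mean(np.abs(residual))   # optimal scale for a sign basis
        terms.append((alpha, B))
        residual -= alpha * B
    approx = sum(a * B for a, B in terms)
    return approx, np.linalg.norm(W - approx) / np.linalg.norm(W)

rng = np.random.default_rng(0)
W = rng.normal(size=(16, 16))
_, rel_err = binary_compose(W)
print(rel_err < 0.5)  # a few binary bases capture most of the tensor's energy
```

Each binary basis plus its scalar costs roughly one bit per weight, which gives intuition for how a network can be represented in ~5 effective bits per weight with small error.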

AAAI Conference 2019 Conference Paper

Direct Training for Spiking Neural Networks: Faster, Larger, Better

  • YuJie Wu
  • Lei Deng
  • Guoqi Li
  • Jun Zhu
  • Yuan Xie
  • Luping Shi

Spiking neural networks (SNNs), which enable energy-efficient implementation on emerging neuromorphic hardware, are gaining more attention. Yet so far, SNNs have not shown competitive performance compared with artificial neural networks (ANNs), due to the lack of effective learning algorithms and efficient programming frameworks. We address this issue from two aspects: (1) We propose a neuron normalization technique to adjust the neural selectivity and develop a direct learning algorithm for deep SNNs. (2) By narrowing the rate coding window and converting the leaky integrate-and-fire (LIF) model into an explicitly iterative version, we present a PyTorch-based implementation method for training large-scale SNNs. In this way, we are able to train deep SNNs with a speedup of tens of times. As a result, we achieve significantly better accuracy than the reported works on neuromorphic datasets (N-MNIST and DVS-CIFAR10), and comparable accuracy to existing ANNs and pre-trained SNNs on non-spiking datasets (CIFAR10). To the best of our knowledge, this is the first work that demonstrates direct training of deep SNNs with high performance on CIFAR10, and the efficient implementation provides a new way to explore the potential of SNNs.

YNIMG Journal 2019 Journal Article

Dynamic Contrast Optical Coherence Tomography reveals laminar microvascular hemodynamics in the mouse neocortex in vivo

  • Conrad W. Merkle
  • Jun Zhu
  • Marcel T. Bernucci
  • Vivek J. Srinivasan

Studies of flow-metabolism coupling often presume that microvessel architecture is a surrogate for blood flow. To test this assumption, we introduce an in vivo Dynamic Contrast Optical Coherence Tomography (DyC-OCT) method to quantify layer-resolved microvascular blood flow and volume across the full depth of the mouse neocortex, where the angioarchitecture has been previously described. First, we cross-validate average DyC-OCT cortical flow against conventional Doppler OCT flow. Next, with laminar DyC-OCT, we discover that layer 4 consistently exhibits the highest microvascular blood flow, approximately two-fold higher than the outer cortical layers. While flow differences between layers are well-explained by microvascular volume and density, flow differences between subjects are better explained by transit time. Finally, from layer-resolved tracer enhancement, we also infer that microvascular hematocrit increases in deep cortical layers, consistent with predictions of plasma skimming. Altogether, our results show that while the cortical blood supply derives mainly from the pial surface, laminar hemodynamics ensure that the energetic needs of individual cortical layers are met. The laminar trends reported here provide data that links predictions based on the cortical angioarchitecture to cerebrovascular physiology in vivo.

NeurIPS Conference 2019 Conference Paper

Generative Well-intentioned Networks

  • Justin Cosentino
  • Jun Zhu

We propose Generative Well-intentioned Networks (GWINs), a novel framework for increasing the accuracy of certainty-based, closed-world classifiers. A conditional generative network recovers the distribution of observations that the classifier labels correctly with high certainty. We introduce a reject option to the classifier during inference, allowing the classifier to reject an observation instance rather than predict an uncertain label. These rejected observations are translated by the generative network to high-certainty representations, which are then relabeled by the classifier. This architecture allows for any certainty-based classifier or rejection function and is not limited to multilayer perceptrons. The capability of this framework is assessed using benchmark classification datasets and shows that GWINs significantly improve the accuracy of uncertain observations.

NeurIPS Conference 2019 Conference Paper

Improving Black-box Adversarial Attacks with a Transfer-based Prior

  • Shuyu Cheng
  • Yinpeng Dong
  • Tianyu Pang
  • Hang Su
  • Jun Zhu

We consider the black-box adversarial setting, where the adversary has to generate adversarial perturbations without access to the target models to compute gradients. Previous methods tried to approximate the gradient either by using the transfer gradient of a surrogate white-box model, or based on query feedback. However, these methods often suffer from low attack success rates or poor query efficiency, since it is non-trivial to estimate the gradient in a high-dimensional space with limited information. To address these problems, we propose a prior-guided random gradient-free (P-RGF) method to improve black-box adversarial attacks, which takes advantage of a transfer-based prior and the query information simultaneously. The transfer-based prior, given by the gradient of a surrogate model, is appropriately integrated into our algorithm via an optimal coefficient derived from a theoretical analysis. Extensive experiments demonstrate that our method requires far fewer queries to attack black-box models, with higher success rates than the alternative state-of-the-art methods.

AAMAS Conference 2019 Conference Paper

Learn a Robust Policy in Adversarial Games via Playing with an Expert Opponent

  • Jialian Li
  • Tongzheng Ren
  • Hang Su
  • Jun Zhu

Reinforcement learning methods such as AlphaZero have achieved super-human performance in adversarial games by training in a self-play manner. However, they generally require a large amount of computational resources to search for an (approximately) optimal policy in the joint state-action space involving both players and the environment. To accelerate the exploration process, we propose a new paradigm of “learning by playing” by considering scenarios where expert opponents are accessible. By observing the opponent's actions, the agent accelerates exploration by allocating more search resources to these actions. To alleviate the sparse reward issue when facing the expert opponent at the beginning, we propose a novel method called Ladder Opponent Modeling (LOM), which builds a ladder opponent to facilitate the learning process. The agent plays with both the expert and the ladder alternately, with its competence improving gradually. The ladder opponent is updated online and gradually generates auxiliary tasks, yielding a tractable improvement for the agent.

NeurIPS Conference 2019 Conference Paper

Multi-objects Generation with Amortized Structural Regularization

  • Taufik Xu
  • Chongxuan Li
  • Jun Zhu
  • Bo Zhang

Deep generative models (DGMs) have shown promise in image generation. However, most of the existing methods learn a model by simply optimizing a divergence between the marginal distributions of the model and the data, and often fail to capture rich structures, such as attributes of objects and their relationships, in an image. Human knowledge is a crucial element to the success of DGMs to infer these structures, especially in unsupervised learning. In this paper, we propose amortized structural regularization (ASR), which adopts posterior regularization (PR) to embed human knowledge into DGMs via a set of structural constraints. We derive a lower bound of the regularized log-likelihood in PR and adopt the amortized inference technique to jointly optimize the generative model and an auxiliary recognition model for inference efficiently. Empirical results show that ASR outperforms the DGM baselines in terms of inference performance and sample quality.

IJCAI Conference 2019 Conference Paper

Playing FPS Games With Environment-Aware Hierarchical Reinforcement Learning

  • Shihong Song
  • Jiayi Weng
  • Hang Su
  • Dong Yan
  • Haosheng Zou
  • Jun Zhu

Learning rational behaviors in first-person-shooter (FPS) games is a challenging task for Reinforcement Learning (RL), with the primary difficulties being a huge action space and insufficient exploration. To address this, we propose a hierarchical agent based on combined options with intrinsic rewards to drive exploration. Specifically, we present a hierarchical model that works in a manager-worker fashion over two levels of hierarchy. The high-level manager learns a policy over options, and the low-level workers, motivated by intrinsic reward, learn to execute the options. Performance is further improved with environmental signals appropriately harnessed. Extensive experiments demonstrate that our trained bot significantly outperforms the alternative RL-based models on FPS games requiring maze solving, combat skills, etc. Notably, we achieved first place in VDAIC 2018 Track 1.

AAAI Conference 2019 Conference Paper

Sparse Adversarial Perturbations for Videos

  • Xingxing Wei
  • Jun Zhu
  • Sha Yuan
  • Hang Su

Although adversarial samples of deep neural networks (DNNs) have been intensively studied on static images, their extensions to videos have rarely been explored. Compared with images, attacking a video requires considering not only spatial cues but also temporal cues. Moreover, to improve imperceptibility as well as reduce computation cost, perturbations should be added to as few frames as possible, i.e., adversarial perturbations should be temporally sparse. This further motivates the propagation of perturbations: perturbations added to the current frame can transfer to subsequent frames via their temporal interactions, so no (or few) extra perturbations are needed to misclassify those frames. To this end, we propose the first white-box video attack method, which utilizes an l2,1-norm-based optimization algorithm to compute sparse adversarial perturbations for videos. We choose action recognition as the targeted task, and networks with a CNN+RNN architecture as threat models to verify our method. Thanks to the propagation, we can compute perturbations on a shortened version of a video and then adapt them to the full-length video to fool DNNs. Experimental results on the UCF101 dataset demonstrate that even when only one frame in a video is perturbed, the fooling rate can still reach 59.7%.

AAAI Conference 2018 Conference Paper

Collaborative Filtering With User-Item Co-Autoregressive Models

  • Chao Du
  • Chongxuan Li
  • Yin Zheng
  • Jun Zhu
  • Bo Zhang

Deep neural networks have shown promise in collaborative filtering (CF). However, existing neural approaches are either user-based or item-based, which cannot leverage all the underlying information explicitly. We propose CF-UIcA, a neural co-autoregressive model for CF tasks, which exploits the structural correlation in the domains of both users and items. The co-autoregression allows extra desired properties to be incorporated for different tasks. Furthermore, we develop an efficient stochastic learning algorithm to handle large scale datasets. We evaluate CF-UIcA on two popular benchmarks: MovieLens 1M and Netflix, and achieve state-of-the-art performance in both rating prediction and top-N recommendation tasks, which demonstrates the effectiveness of CF-UIcA.

NeurIPS Conference 2018 Conference Paper

Graphical Generative Adversarial Networks

  • Chongxuan Li
  • Max Welling
  • Jun Zhu
  • Bo Zhang

We propose Graphical Generative Adversarial Networks (Graphical-GAN) to model structured data. Graphical-GAN conjoins the power of Bayesian networks on compactly representing the dependency structures among random variables and that of generative adversarial networks on learning expressive dependency functions. We introduce a structured recognition model to infer the posterior distribution of latent variables given observations. We generalize the Expectation Propagation (EP) algorithm to learn the generative model and recognition model jointly. Finally, we present two important instances of Graphical-GAN, i.e., Gaussian Mixture GAN (GMGAN) and State Space GAN (SSGAN), which can successfully learn the discrete and temporal structures on visual datasets, respectively.

IJCAI Conference 2018 Conference Paper

Learning to Write Stylized Chinese Characters by Reading a Handful of Examples

  • Danyang Sun
  • Tongzheng Ren
  • Chongxuan Li
  • Hang Su
  • Jun Zhu

Automatically writing stylized characters is an attractive yet challenging task, especially for Chinese characters with complex shapes and structures. Most current methods are restricted to generating stylized characters already present in the training set and require retraining the model to generate characters of new styles. In this paper, we develop a novel framework, the Style-Aware Variational Auto-Encoder (SA-VAE), which disentangles the content-relevant and style-relevant components of a Chinese character feature with a novel intercross pair-wise optimization method. As a result, our method can generate Chinese characters flexibly after reading only a few examples. Experiments demonstrate that our method has a powerful one-shot/few-shot generalization ability by inferring the style representation, which is the first attempt to learn to write new-style Chinese characters by observing only one or a few examples.

IJCAI Conference 2018 Conference Paper

Probabilistic Machine Learning: Models, Algorithms and a Programming Library

  • Jun Zhu

Probabilistic machine learning provides a suite of powerful tools for modeling uncertainty, performing probabilistic inference, and making predictions or decisions in uncertain environments. In this paper, we present an overview of our recent work on probabilistic machine learning, including the theory of regularized Bayesian inference, Bayesian deep learning, scalable inference algorithms, a probabilistic programming library named ZhuSuan, and applications in representation learning as well as learning from crowds.

AAAI Conference 2018 Conference Paper

Riemannian Stein Variational Gradient Descent for Bayesian Inference

  • Chang Liu
  • Jun Zhu

We develop Riemannian Stein Variational Gradient Descent (RSVGD), a Bayesian inference method that generalizes Stein Variational Gradient Descent (SVGD) to Riemannian manifolds. The benefits are twofold: (i) for inference tasks in Euclidean spaces, RSVGD has the advantage over SVGD of utilizing information geometry, and (ii) for inference tasks on Riemannian manifolds, RSVGD brings the unique advantages of SVGD to the Riemannian world. To appropriately transfer to Riemannian manifolds, we conceive novel and non-trivial techniques for RSVGD, which are required by the intrinsically different characteristics of general Riemannian manifolds from Euclidean spaces. We also discover Riemannian Stein's Identity and Riemannian Kernelized Stein Discrepancy. Experimental results show the advantages over SVGD of exploring distribution geometry, and the advantages of particle efficiency, iteration effectiveness, and approximation flexibility over other inference methods on Riemannian manifolds.

AAAI Conference 2018 Conference Paper

Selective Verification Strategy for Learning From Crowds

  • Tian Tian
  • Yichi Zhou
  • Jun Zhu

To deal with the low quality of web workers in crowdsourcing, many unsupervised label aggregation methods have been investigated, but most of them provide inconsistent performance. In this paper, we explore the learning-from-crowds with selective verification problem. In addition to the noisy responses from the crowds, it also collects the ground truths for a well-chosen subset of tasks as the reference, then aggregates the redundant responses based on the patterns provided by both the supervised and unsupervised signals. To improve labeling efficiency, we propose the EBM selecting strategy for choosing the verification subset, which is based on loss error minimization. Specifically, we first establish the expected loss error given the semi-supervised learning estimate, then find the subset that minimizes this selecting criterion. We perform extensive empirical comparisons on both synthetic and real-world datasets to show the benefits of this new learning setting as well as our proposal.

NeurIPS Conference 2018 Conference Paper

Semi-crowdsourced Clustering with Deep Generative Models

  • Yucen Luo
  • Tian Tian
  • Jiaxin Shi
  • Jun Zhu
  • Bo Zhang

We consider the semi-supervised clustering problem where crowdsourcing provides noisy information about the pairwise comparisons on a small subset of data, i.e., whether a sample pair is in the same cluster. We propose a new approach that includes a deep generative model (DGM) to characterize low-level features of the data, and a statistical relational model for noisy pairwise annotations on its subset. The two parts share the latent variables. To make the model automatically trade off between its complexity and fitting the data, we also develop its fully Bayesian variant. The challenge of inference is addressed by fast (natural-gradient) stochastic variational inference algorithms, where we effectively combine variational message passing for the relational part and amortized learning of the DGM under a unified framework. Empirical results on synthetic and real-world datasets show that our model outperforms previous crowdsourced clustering methods.

NeurIPS Conference 2018 Conference Paper

Stochastic Expectation Maximization with Variance Reduction

  • Jianfei Chen
  • Jun Zhu
  • Yee Whye Teh
  • Tong Zhang

Expectation-Maximization (EM) is a popular tool for learning latent variable models, but the vanilla batch EM does not scale to large data sets because the whole data set is needed at every E-step. Stochastic Expectation Maximization (sEM) reduces the cost of the E-step by stochastic approximation. However, sEM has a slower asymptotic convergence rate than batch EM, and requires a decreasing sequence of step sizes, which is difficult to tune. In this paper, we propose a variance-reduced stochastic EM (sEM-vr) algorithm inspired by variance-reduced stochastic gradient descent algorithms. We show that sEM-vr has the same exponential asymptotic convergence rate as batch EM. Moreover, sEM-vr only requires a constant step size to achieve this rate, which alleviates the burden of parameter tuning. We compare sEM-vr with batch EM, sEM and other algorithms on Gaussian mixture models and probabilistic latent semantic analysis, and sEM-vr converges significantly faster than these baselines.

NeurIPS Conference 2018 Conference Paper

Towards Robust Detection of Adversarial Examples

  • Tianyu Pang
  • Chao Du
  • Yinpeng Dong
  • Jun Zhu

Although the recent progress is substantial, deep learning methods can be vulnerable to the maliciously generated adversarial examples. In this paper, we present a novel training procedure and a thresholding test strategy, towards robust detection of adversarial examples. In training, we propose to minimize the reverse cross-entropy (RCE), which encourages a deep network to learn latent representations that better distinguish adversarial examples from normal ones. In testing, we propose to use a thresholding strategy as the detector to filter out adversarial examples for reliable predictions. Our method is simple to implement using standard algorithms, with little extra training cost compared to the common cross-entropy minimization. We apply our method to defend various attacking methods on the widely used MNIST and CIFAR-10 datasets, and achieve significant improvements on robust predictions under all the threat models in the adversarial setting.

AAAI Conference 2018 Conference Paper

Towards Training Probabilistic Topic Models on Neuromorphic Multi-Chip Systems

  • Zihao Xiao
  • Jianfei Chen
  • Jun Zhu

Probabilistic topic models are popular unsupervised learning methods, including probabilistic latent semantic indexing (pLSI) and latent Dirichlet allocation (LDA). By now, their training is implemented on general purpose computers (GPCs), which are flexible in programming but energy-consuming. Towards low-energy implementations, this paper investigates their training on an emerging hardware technology called neuromorphic multi-chip systems (NMSs). NMSs are very effective for a family of algorithms called spiking neural networks (SNNs). We present three SNNs to train topic models. The first SNN is a batch algorithm combining the conventional collapsed Gibbs sampling (CGS) algorithm and an inference SNN to train LDA. The other two SNNs are online algorithms targeting both energy- and storage-limited environments. The two online algorithms are equivalent to training LDA by using maximum a posteriori estimation and maximizing the semi-collapsed likelihood, respectively. They use novel, tailored ordinary differential equations for stochastic optimization. We simulate the new algorithms and show that they are comparable with the GPC algorithms, while being suitable for NMS implementation. We also propose an extension to train pLSI and a method to prune the network to obey the limited fan-in of some NMSs.

AAAI Conference 2018 Conference Paper

Understanding Human Behaviors in Crowds by Imitating the Decision-Making Process

  • Haosheng Zou
  • Hang Su
  • Shihong Song
  • Jun Zhu

Crowd behavior understanding is crucial yet challenging across a wide range of applications, since crowd behavior is inherently determined by a sequential decision-making process based on various factors, such as the pedestrians' own destinations, interaction with nearby pedestrians, and anticipation of upcoming events. In this paper, we propose a novel framework of Social-Aware Generative Adversarial Imitation Learning (SA-GAIL) to mimic the underlying decision-making process of pedestrians in crowds. Specifically, we infer the latent factors of the human decision-making process in an unsupervised manner by extending the Generative Adversarial Imitation Learning framework to anticipate future paths of pedestrians. Different factors of human decision making are disentangled with mutual information maximization, with the process modeled by collision avoidance regularization and Social-Aware LSTMs. Experimental results demonstrate the potential of our framework in disentangling the latent decision-making factors of pedestrians and its stronger ability to predict future trajectories.

IJCAI Conference 2017 Conference Paper

Distributed Accelerated Proximal Coordinate Gradient Methods

  • Yong Ren
  • Jun Zhu

We develop a general accelerated proximal coordinate descent algorithm in distributed settings (DisAPCG) for the optimization problem that minimizes the sum of two convex functions: the first part f is smooth with a gradient oracle, and the other one Ψ is separable with respect to blocks of coordinates and has a simple known structure (e.g., the L1 norm). Our algorithm achieves a new accelerated convergence rate in the case that f is strongly convex by making use of modern parallel structures, and includes the previous non-strongly-convex case as a special case. We further present efficient implementations to avoid full-dimensional operations in each step, significantly reducing the computation cost. Experiments on the regularized empirical risk minimization problem demonstrate the effectiveness of our algorithm and match our theoretical findings.

IJCAI Conference 2017 Conference Paper

Forecast the Plausible Paths in Crowd Scenes

  • Hang Su
  • Jun Zhu
  • Yinpeng Dong
  • Bo Zhang

Forecasting the future plausible paths of pedestrians in crowd scenes has wide applications, but it remains a challenging task due to the complexities and uncertainties of crowd motions. To address these issues, we propose to explore the inherent crowd dynamics via a social-aware recurrent Gaussian process model, which facilitates path prediction by taking advantage of the interplay between rich prior knowledge and motion uncertainties. Specifically, we derive a social-aware LSTM to explore the crowd dynamics, resulting in a hidden feature embedding the rich prior in massive data. Afterwards, we integrate the descriptor into deep Gaussian processes with motion uncertainties appropriately harnessed. Crowd motion forecasting is implemented by regressing relative motion against the current positions, yielding the predicted paths based on a functional object associated with a distribution. Extensive experiments on public datasets demonstrate that our method obtains state-of-the-art performance in both structured and unstructured scenes by exploring the complex and uncertain motion patterns, even when the occlusion is serious or the observed trajectories are noisy.

IJCAI Conference 2017 Conference Paper

Improving Learning-from-Crowds through Expert Validation

  • Mengchen Liu
  • Liu Jiang
  • Junlin Liu
  • Xiting Wang
  • Jun Zhu
  • Shixia Liu

Although several effective learning-from-crowds methods have been developed to infer correct labels from noisy crowdsourced labels, a method for post-processed expert validation is still needed. This paper introduces a semi-supervised learning algorithm that is capable of selecting the most informative instances and maximizing the influence of expert labels. Specifically, we have developed a complete uncertainty assessment to facilitate the selection of the most informative instances. The expert labels are then propagated to similar instances via regularized Bayesian inference. Experiments on both real-world and simulated datasets indicate that, given a specific accuracy goal (e.g., 95%), our method reduces expert effort by 39% to 60% compared with the state-of-the-art method.

AAAI Conference 2017 Conference Paper

Learning Attributes from the Crowdsourced Relative Labels

  • Tian Tian
  • Ning Chen
  • Jun Zhu

Finding semantic attributes to describe related concepts is typically a hard problem. The commonly used attributes in most fields are designed by domain experts, which is expensive and time-consuming. In this paper we propose an efficient method to learn human comprehensible attributes with crowdsourcing. We first design an analogical interface to collect relative labels from the crowds. Then we propose a hierarchical Bayesian model, as well as an efficient initialization strategy, to aggregate labels and extract concise attributes. Our experimental results demonstrate promise on discovering diverse and convincing attributes, which significantly improve the performance of the challenging zero-shot learning tasks.

JMLR Journal 2017 Journal Article

Online Bayesian Passive-Aggressive Learning

  • Tianlin Shi
  • Jun Zhu

We present online Bayesian Passive-Aggressive (BayesPA) learning, a generic online learning framework for hierarchical Bayesian models with max-margin posterior regularization. We show that BayesPA subsumes the standard online Passive-Aggressive (PA) learning and extends naturally to incorporate latent variables for both parametric and nonparametric Bayesian inference, therefore providing great flexibility for explorative analysis. As an important example, we apply BayesPA to topic modeling and derive efficient online learning algorithms for max-margin topic models. We further develop nonparametric BayesPA topic models to infer the unknown number of topics in an online manner. Experimental results on 20newsgroups and a large Wikipedia multi-label dataset (with 1.1 million training documents and 0.9 million unique terms in the vocabulary) show that our approaches significantly improve time efficiency while achieving comparable accuracy with the corresponding batch algorithms.

NeurIPS Conference 2017 Conference Paper

Population Matching Discrepancy and Applications in Deep Learning

  • Jianfei Chen
  • Chongxuan Li
  • Yizhong Ru
  • Jun Zhu

A differentiable estimation of the distance between two distributions based on samples is important for many deep learning tasks. One such estimation is maximum mean discrepancy (MMD). However, MMD suffers from its sensitive kernel bandwidth hyper-parameter, weak gradients, and large mini-batch size when used as a training objective. In this paper, we propose population matching discrepancy (PMD) for estimating the distribution distance based on samples, as well as an algorithm to learn the parameters of the distributions using PMD as an objective. PMD is defined as the minimum weight matching of sample populations from each distribution, and we prove that PMD is a strongly consistent estimator of the first Wasserstein metric. We apply PMD to two deep learning tasks, domain adaptation and generative modeling. Empirical results demonstrate that PMD overcomes the aforementioned drawbacks of MMD, and outperforms MMD on both tasks in terms of the performance as well as the convergence speed.

IJCAI Conference 2017 Conference Paper

Semi-supervised Max-margin Topic Model with Manifold Posterior Regularization

  • Wenbo Hu
  • Jun Zhu
  • Hang Su
  • Jingwei Zhuo
  • Bo Zhang

Supervised topic models leverage label information to learn discriminative latent topic representations. As collecting a fully labeled dataset is often time-consuming, semi-supervised learning is of high interest. In this paper, we present an effective semi-supervised max-margin topic model by naturally introducing manifold posterior regularization to a regularized Bayesian topic model, named LapMedLDA. The model jointly learns latent topics and a related classifier with only a small fraction of labeled documents. To perform the approximate inference, we derive an efficient stochastic gradient MCMC method. Unlike the previous semi-supervised topic models, our model adopts a tight coupling between the generative topic model and the discriminative classifier. Extensive experiments demonstrate that such tight coupling brings significant benefits in quantitative and qualitative performance.

NeurIPS Conference 2017 Conference Paper

Structured Generative Adversarial Networks

  • Zhijie Deng
  • Hao Zhang
  • Xiaodan Liang
  • Luona Yang
  • Shizhen Xu
  • Jun Zhu
  • Eric Xing

We study the problem of conditional generative modeling based on designated semantics or structures. Existing models that build conditional generators either require massive labeled instances as supervision or are unable to accurately control the semantics of generated samples. We propose structured generative adversarial networks (SGANs) for semi-supervised conditional generative modeling. SGAN assumes the data x is generated conditioned on two independent latent variables: y, which encodes the designated semantics, and z, which contains other factors of variation. To ensure disentangled semantics in y and z, SGAN builds two collaborative games in the hidden space to minimize the reconstruction error of y and z, respectively. Training SGAN also involves solving two adversarial games that have their equilibrium concentrating at the true joint data distributions p(x, z) and p(x, y), avoiding diffusely distributing probability mass over the data space, from which MLE-based methods may suffer. We assess SGAN by evaluating its trained networks and its performance on downstream tasks. We show that SGAN delivers a highly controllable generator and disentangled representations; it also establishes state-of-the-art results across multiple datasets when applied to semi-supervised image classification (1.27%, 5.73%, 17.26% error rates on MNIST, SVHN and CIFAR-10 using 50, 1000 and 4000 labels, respectively). Benefiting from the separate modeling of y and z, SGAN can generate images with high visual quality that strictly follow the designated semantics, and can be extended to a wide spectrum of applications, such as style transfer.

NeurIPS Conference 2017 Conference Paper

Triple Generative Adversarial Nets

  • Chongxuan Li
  • Taufik Xu
  • Jun Zhu
  • Bo Zhang

Generative Adversarial Nets (GANs) have shown promise in image generation and semi-supervised learning (SSL). However, existing GANs in SSL have two problems: (1) the generator and the discriminator (i.e., the classifier) may not be optimal at the same time; and (2) the generator cannot control the semantics of the generated samples. The problems essentially arise from the two-player formulation, where a single discriminator shares incompatible roles of identifying fake samples and predicting labels, and it only estimates the data without considering the labels. To address the problems, we present triple generative adversarial net (Triple-GAN), which consists of three players---a generator, a discriminator and a classifier. The generator and the classifier characterize the conditional distributions between images and labels, and the discriminator solely focuses on identifying fake image-label pairs. We design compatible utilities to ensure that the distributions characterized by the classifier and the generator both converge to the data distribution. Our results on various datasets demonstrate that Triple-GAN as a unified model can simultaneously (1) achieve the state-of-the-art classification results among deep generative models, and (2) disentangle the classes and styles of the input and transfer smoothly in the data space via interpolation in the latent space class-conditionally.

AAAI Conference 2016 Conference Paper

Bayesian Matrix Completion via Adaptive Relaxed Spectral Regularization

  • Yang Song
  • Jun Zhu

Bayesian matrix completion has been studied based on a low-rank matrix factorization formulation with promising results. However, little work has been done on Bayesian matrix completion based on the more direct spectral regularization formulation. We fill this gap by presenting a novel Bayesian matrix completion method based on spectral regularization. In order to circumvent the difficulties of dealing with the orthonormality constraints of singular vectors, we derive a new equivalent form with relaxed constraints, which then leads us to design an adaptive version of spectral regularization feasible for Bayesian inference. Our Bayesian method requires no parameter tuning and can infer the number of latent factors automatically. Experiments on synthetic and real datasets demonstrate encouraging results on rank recovery and collaborative filtering, with notably good results for very sparse matrices.

NeurIPS Conference 2016 Conference Paper

Conditional Generative Moment-Matching Networks

  • Yong Ren
  • Jun Zhu
  • Jialian Li
  • Yucen Luo

Maximum mean discrepancy (MMD) has been successfully applied to learn deep generative models for characterizing a joint distribution of variables via kernel mean embedding. In this paper, we present conditional generative moment-matching networks (CGMMN), which learn a conditional distribution given some input variables based on a conditional maximum mean discrepancy (CMMD) criterion. The learning is performed by stochastic gradient descent with the gradient calculated by back-propagation. We evaluate CGMMN on a wide range of tasks, including predictive modeling, contextual generation, and Bayesian dark knowledge, which distills knowledge from a Bayesian model by learning a relatively small CGMMN student network. Our results demonstrate competitive performance in all the tasks.

IJCAI Conference 2016 Conference Paper

Crowd Scene Understanding with Coherent Recurrent Neural Networks

  • Hang Su
  • Yinpeng Dong
  • Jun Zhu
  • Haibin Ling
  • Bo Zhang

Exploring crowd dynamics is essential in understanding crowd scenes, which still remains a challenging task due to the nonlinear characteristics and coherent spatio-temporal motion patterns in crowd behaviors. To address these issues, we present a Coherent Long Short Term Memory (cLSTM) network to capture the nonlinear crowd dynamics by learning an informative representation of crowd motions, which facilitates the critical tasks in crowd scene analysis. By describing the crowd motion patterns with a cloud of keypoint tracklets, we explore the nonlinear crowd dynamics embedded in the tracklets with a stacked LSTM model, which is further improved to capture the collective properties by introducing a coherent regularization term; finally, we adopt an unsupervised encoder-decoder framework to learn a hidden feature for each input tracklet that embeds its inherent dynamics. With the learnt features properly harnessed, crowd scene understanding is conducted effectively in predicting the future paths of agents, estimating group states, and classifying crowd events. Extensive experiments on hundreds of public crowd videos demonstrate that our method achieves state-of-the-art performance by exploring the coherent spatio-temporal structures in crowd behaviors.

AAAI Conference 2016 Conference Paper

Discriminative Nonparametric Latent Feature Relational Models with Data Augmentation

  • Bei Chen
  • Ning Chen
  • Jun Zhu
  • Jiaming Song
  • Bo Zhang

We present a discriminative nonparametric latent feature relational model (LFRM) for link prediction to automatically infer the dimensionality of latent features. Under the generic RegBayes (regularized Bayesian inference) framework, we handily incorporate the prediction loss with probabilistic inference of a Bayesian model; set distinct regularization parameters for different types of links to handle the imbalance issue in real networks; and unify the analysis of both the smooth logistic log-loss and the piecewise linear hinge loss. For the nonconjugate posterior inference, we present a simple Gibbs sampler via data augmentation, without making restricting assumptions as done in variational methods. We further develop an approximate sampler using stochastic gradient Langevin dynamics to handle large networks with hundreds of thousands of entities and millions of links, orders of magnitude larger than what existing LFRM models can process. Extensive studies on various real networks show promising performance.

AAAI Conference 2016 Conference Paper

Jointly Modeling Topics and Intents with Global Order Structure

  • Bei Chen
  • Jun Zhu
  • Nan Yang
  • Tian Tian
  • Ming Zhou
  • Bo Zhang

Modeling document structure is of great importance for discourse analysis and related applications. The goal of this research is to capture the document intent structure by modeling documents as a mixture of topic words and rhetorical words. While the topics are relatively unchanged through one document, the rhetorical functions of sentences usually change following certain orders in discourse. We propose GMM-LDA, a topic-modeling-based Bayesian unsupervised model, to analyze the document intent structure cooperating with order information. Our model is flexible and has the ability to incorporate annotations for supervised learning. Additionally, entropic regularization can be introduced to model the significant divergence between topics and intents. We perform experiments in both unsupervised and supervised settings; results show the superiority of our model over several state-of-the-art baselines.

NeurIPS Conference 2016 Conference Paper

Kernel Bayesian Inference with Posterior Regularization

  • Yang Song
  • Jun Zhu
  • Yong Ren

We propose a vector-valued regression problem whose solution is equivalent to the reproducing kernel Hilbert space (RKHS) embedding of the Bayesian posterior distribution. This equivalence provides a new understanding of kernel Bayesian inference. Moreover, the optimization problem induces a new regularization for the posterior embedding estimator, which is faster and has comparable performance to the squared regularization in kernel Bayes' rule. This regularization coincides with a former thresholding approach used in kernel POMDPs whose consistency remained to be established. Our theoretical work solves this open problem and provides consistency analysis in regression settings. Based on our optimization formulation, we propose a flexible Bayesian posterior regularization framework which for the first time enables us to put regularization at the distribution level. We apply this method to nonparametric state-space filtering tasks with extremely nonlinear dynamics and show performance gains over all other baselines.

AAAI Conference 2016 Conference Paper

Pose-Guided Human Parsing by an AND/OR Graph Using Pose-Context Features

  • Fangting Xia
  • Jun Zhu
  • Peng Wang
  • Alan Yuille

Parsing humans into semantic parts is crucial to human-centric analysis. In this paper, we propose a human parsing pipeline that uses pose cues, i.e., estimates of human joint locations, to provide pose-guided segment proposals for semantic parts. These segment proposals are ranked using standard appearance cues, deep-learned semantic features, and a novel pose feature called pose-context. Then these proposals are selected and assembled using an And-Or graph to output a parse of the person. The And-Or graph is able to deal with large human appearance variability due to pose, choice of clothes, etc. We evaluate our approach on the popular Penn-Fudan pedestrian parsing dataset, showing that it significantly outperforms the state of the art, and perform diagnostics to demonstrate the effectiveness of different stages of our pipeline.

NeurIPS Conference 2016 Conference Paper

Stochastic Gradient Geodesic MCMC Methods

  • Chang Liu
  • Jun Zhu
  • Yang Song

We propose two stochastic gradient MCMC methods for sampling from Bayesian posterior distributions defined on Riemann manifolds with a known geodesic flow, e.g., hyperspheres. Our methods are the first scalable sampling methods on these manifolds, with the aid of stochastic gradients. Novel dynamics are conceived and 2nd-order integrators are developed. By adopting embedding techniques and the geodesic integrator, the methods do not require a global coordinate system of the manifold and do not involve inner iterations. Synthetic experiments show the validity of the method, and its application to the challenging inference for spherical topic models indicates practical usability and efficiency.

IJCAI Conference 2015 Conference Paper

Adaptive Dropout Rates for Learning with Corrupted Features

  • Jingwei Zhuo
  • Jun Zhu
  • Bo Zhang

Feature noising is an effective mechanism for reducing the risk of overfitting. To avoid an explosive search space, existing work typically assumes that all features share a single noise level, which is often cross-validated. In this paper, we present a Bayesian feature noising model that flexibly allows for dimension-specific or group-specific noise levels, and we derive a learning algorithm that adaptively updates these noise levels. Our adaptive rule is simple and interpretable, drawing a direct connection to the fitness of each individual feature or feature group. Empirical results on various datasets demonstrate its effectiveness in avoiding extensive tuning and sometimes improving performance due to its flexibility.

NeurIPS Conference 2015 Conference Paper

Max-Margin Deep Generative Models

  • Chongxuan Li
  • Jun Zhu
  • Tianlin Shi
  • Bo Zhang

Deep generative models (DGMs) are effective at learning multilayered representations of complex data and performing inference on input data by exploiting their generative ability. However, little work has been done on examining or empowering the discriminative ability of DGMs for making accurate predictions. This paper presents max-margin deep generative models (mmDGMs), which explore the strongly discriminative principle of max-margin learning to improve the discriminative power of DGMs while retaining the generative capability. We develop an efficient doubly stochastic subgradient algorithm for the piecewise linear objective. Empirical results on MNIST and SVHN datasets demonstrate that (1) max-margin learning can significantly improve the prediction performance of DGMs and meanwhile retain the generative ability; and (2) mmDGMs are competitive with state-of-the-art fully discriminative networks by employing deep convolutional neural networks (CNNs) as both recognition and generative models.

NeurIPS Conference 2015 Conference Paper

Max-Margin Majority Voting for Learning from Crowds

  • Tian Tian
  • Jun Zhu

Learning-from-crowds aims to design proper aggregation strategies to infer the unknown true labels from the noisy labels provided by ordinary web workers. This paper presents max-margin majority voting (M^3V) to improve the discriminative ability of majority voting and further presents a Bayesian generalization to incorporate the flexibility of generative methods in modeling noisy observations with worker confusion matrices. We formulate the joint learning as a regularized Bayesian inference problem, where the posterior regularization is derived by maximizing the margin between the aggregated score of a potential true label and that of any alternative label. Our Bayesian model naturally covers the Dawid-Skene estimator and M^3V. Empirical results demonstrate that our methods are competitive, often achieving better results than state-of-the-art estimators.

IJCAI Conference 2015 Conference Paper

Modelling High-Dimensional Sequences with LSTM-RTRBM: Application to Polyphonic Music Generation

  • Qi Lyu
  • Zhiyong Wu
  • Jun Zhu
  • Helen Meng

We propose an automatic music generation demo based on artificial neural networks, which integrates the ability of Long Short-Term Memory (LSTM) in memorizing and retrieving useful history information, together with the advantage of Restricted Boltzmann Machine (RBM) in high dimensional data modelling. Our model can generalize to different musical styles and generate polyphonic music better than previous models.

JMLR Journal 2014 Journal Article

Bayesian Inference with Posterior Regularization and Applications to Infinite Latent SVMs

  • Jun Zhu
  • Ning Chen
  • Eric P. Xing

Existing Bayesian models, especially nonparametric Bayesian methods, rely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations. While priors affect posterior distributions through Bayes' rule, imposing posterior regularization is arguably more direct and in some cases more natural and general. In this paper, we present regularized Bayesian inference (RegBayes), a novel computational framework that performs posterior inference with a regularization term on the desired post-data posterior distribution under an information-theoretical formulation. RegBayes is more flexible than the procedure that elicits expert knowledge via priors, and it covers both directed Bayesian networks and undirected Markov networks. When the regularization is induced from a linear operator on the posterior distributions, such as the expectation operator, we present a general convex-analysis theorem to characterize the solution of RegBayes. Furthermore, we present two concrete examples of RegBayes, infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the large-margin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark data sets, which appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics. Such results contribute to pushing forward the interface between these two important subfields, which have largely been treated as isolated in the community.

NeurIPS Conference 2014 Conference Paper

Distributed Bayesian Posterior Sampling via Moment Sharing

  • Minjie Xu
  • Balaji Lakshminarayanan
  • Yee Whye Teh
  • Jun Zhu
  • Bo Zhang

We propose a distributed Markov chain Monte Carlo (MCMC) inference algorithm for large scale Bayesian posterior simulation. We assume that the dataset is partitioned and stored across nodes of a cluster. Our procedure involves an independent MCMC posterior sampler at each node based on its local partition of the data. Moment statistics of the local posteriors are collected from each sampler and propagated across the cluster using expectation propagation message passing with low communication costs. The moment sharing scheme improves posterior estimation quality by enforcing agreement among the samplers. We demonstrate the speed and inference quality of our method with empirical studies on Bayesian logistic regression and sparse linear regression with a spike-and-slab prior.

AAAI Conference 2014 Conference Paper

Dropout Training for Support Vector Machines

  • Ning Chen
  • Jun Zhu
  • Jianfei Chen
  • Bo Zhang

Dropout and other feature noising schemes have shown promising results in controlling over-fitting by artificially corrupting the training data. Though extensive theoretical and empirical studies have been performed for generalized linear models, little work has been done for support vector machines (SVMs), one of the most successful approaches for supervised learning. This paper presents dropout training for linear SVMs. To deal with the intractable expectation of the non-smooth hinge loss under corrupting distributions, we develop an iteratively re-weighted least square (IRLS) algorithm by exploring data augmentation techniques. Our algorithm iteratively minimizes the expectation of a re-weighted least square problem, where the re-weights have closed-form solutions. Similar ideas are applied to develop a new IRLS algorithm for the expected logistic loss under corrupting distributions. Our algorithms offer insights on the connection and difference between the hinge loss and logistic loss in dropout training. Empirical results on several real datasets demonstrate the effectiveness of dropout training in significantly boosting the classification accuracy of linear SVMs.

JMLR Journal 2014 Journal Article

Gibbs Max-margin Topic Models with Data Augmentation

  • Jun Zhu
  • Ning Chen
  • Hugh Perkins
  • Bo Zhang

Max-margin learning is a powerful approach to building classifiers and structured output predictors. Recent work on max-margin supervised topic models has successfully integrated it with Bayesian topic models to discover discriminative latent semantic structures and make accurate predictions for unseen testing data. However, the resulting learning problems are usually hard to solve because of the non-smoothness of the margin loss. Existing approaches to building max-margin supervised topic models rely on an iterative procedure to solve multiple latent SVM subproblems with additional mean-field assumptions on the desired posterior distributions. This paper presents an alternative approach by defining a new max-margin loss. Namely, we present Gibbs max-margin supervised topic models, a latent variable Gibbs classifier to discover hidden topic representations for various tasks, including classification, regression and multi-task learning. Gibbs max-margin supervised topic models minimize an expected margin loss, which is an upper bound of the existing margin loss derived from an expected prediction rule. By introducing augmented variables and integrating out the Dirichlet variables analytically by conjugacy, we develop simple Gibbs sampling algorithms with no restrictive assumptions and no need to solve SVM subproblems. Furthermore, each step of the "augment-and-collapse" Gibbs sampling algorithms has an analytical conditional distribution, from which samples can be easily drawn. Experimental results on several medium-sized and large-scale data sets demonstrate significant improvements in time efficiency. The classification performance is also improved over competitors on binary, multi-class and multi-label classification tasks.

NeurIPS Conference 2014 Conference Paper

Learning From Weakly Supervised Data by The Expectation Loss SVM (e-SVM) algorithm

  • Jun Zhu
  • Junhua Mao
  • Alan Yuille

In many situations we have some measurement of confidence on positiveness for a binary label. The "positiveness" is a continuous value whose range is a bounded interval; it quantifies the affiliation of each training example with the positive class. We propose a novel learning algorithm called expectation loss SVM (e-SVM) that is devoted to problems where only the "positiveness", instead of a binary label, of each training sample is available. Our e-SVM algorithm can also be readily extended to learn segment classifiers under weak supervision, where the exact positiveness value of each training example is unobserved. In experiments, we show that the e-SVM algorithm can effectively address the segment proposal classification task under both strong supervision (e.g., when pixel-level annotations are available) and weak supervision (e.g., when only bounding-box annotations are available), and outperforms the alternative approaches. Besides, we further validate this method on two major tasks of computer vision: semantic segmentation and object detection. Our method achieves state-of-the-art object detection performance on the PASCAL VOC 2007 dataset.

NeurIPS Conference 2014 Conference Paper

Robust Bayesian Max-Margin Clustering

  • Changyou Chen
  • Jun Zhu
  • Xinhua Zhang

We present Bayesian max-margin clustering (BMC), a general and robust framework that incorporates the max-margin criterion into Bayesian clustering models, as well as two concrete models of BMC to demonstrate its flexibility and effectiveness in dealing with different clustering tasks. The Dirichlet process max-margin Gaussian mixture is a nonparametric Bayesian clustering model that relaxes the underlying Gaussian assumption of Dirichlet process Gaussian mixtures by incorporating max-margin posterior constraints, and is able to infer the number of clusters from data. We further extend the ideas to present the max-margin clustering topic model, which can learn the latent topic representation of each document while at the same time clustering documents in the max-margin fashion. Extensive experiments are performed on a number of real datasets, and the results indicate superior clustering performance of our methods compared to related baselines.

AAAI Conference 2014 Conference Paper

Small-Variance Asymptotics for Dirichlet Process Mixtures of SVMs

  • Yining Wang
  • Jun Zhu

Infinite SVM (iSVM) is a Dirichlet process (DP) mixture of large-margin classifiers. Though flexible in learning nonlinear classifiers and discovering latent clustering structures, iSVM has a difficult inference task and existing methods could hinder its applicability to large-scale problems. This paper presents a small-variance asymptotic analysis to derive a simple and efficient algorithm, which monotonically optimizes a max-margin DP-means (M2DPM) problem, an extension of DP-means for both predictive learning and descriptive clustering. Our analysis is built on Gibbs infinite SVMs, an alternative DP mixture of large-margin machines, which admits a partially collapsed Gibbs sampler without truncation by exploring data augmentation techniques. Experimental results show that M2DPM runs much faster than similar algorithms without sacrificing prediction accuracies.

NeurIPS Conference 2014 Conference Paper

Spectral Methods for Supervised Topic Models

  • Yining Wang
  • Jun Zhu

Supervised topic models simultaneously model the latent topic structure of large collections of documents and a response variable associated with each document. Existing inference methods are based on either variational approximation or Monte Carlo sampling. This paper presents a novel spectral decomposition algorithm to recover the parameters of supervised latent Dirichlet allocation (sLDA) models. The Spectral-sLDA algorithm is provably correct and computationally efficient. We prove a sample complexity bound and subsequently derive a sufficient condition for the identifiability of sLDA. Thorough experiments on a diverse range of synthetic and real-world datasets verify the theory and demonstrate the practical effectiveness of the algorithm.

IJCAI Conference 2013 Conference Paper

Generalized Relational Topic Models with Data Augmentation

  • Ning Chen
  • Jun Zhu
  • Fei Xia
  • Bo Zhang

Relational topic models have shown promise for analyzing document network structures and discovering latent topic representations. This paper presents three extensions: 1) unlike the common link likelihood with a diagonal weight matrix that allows the-same-topic interactions only, we generalize it to use a full weight matrix that captures all pairwise topic interactions and is applicable to asymmetric networks; 2) instead of doing standard Bayesian inference, we perform regularized Bayesian inference with a regularization parameter to deal with the imbalanced link structure issue in common real networks; and 3) instead of doing variational approximation with strict mean-field assumptions, we present a collapsed Gibbs sampling algorithm for the generalized relational topic models without making restrictive assumptions. Experimental results demonstrate the significance of these extensions on improving the prediction performance, and the time efficiency can be dramatically improved with a simple fast approximation method.

NeurIPS Conference 2013 Conference Paper

Scalable Inference for Logistic-Normal Topic Models

  • Jianfei Chen
  • Jun Zhu
  • Zi Wang
  • Xun Zheng
  • Bo Zhang

Logistic-normal topic models can effectively discover correlation structures among latent topics. However, their inference remains a challenge because of the non-conjugacy between the logistic-normal prior and multinomial topic mixing proportions. Existing algorithms either make restricting mean-field assumptions or are not scalable to large-scale applications. This paper presents a partially collapsed Gibbs sampling algorithm that approaches the provably correct distribution by exploring the ideas of data augmentation. To improve time efficiency, we further present a parallel implementation that can deal with large-scale applications and learn the correlation structures of thousands of topics from millions of documents. Extensive empirical results demonstrate the promise.

JMLR Journal 2012 Journal Article

MedLDA: Maximum Margin Supervised Topic Models

  • Jun Zhu
  • Amr Ahmed
  • Eric P. Xing

A supervised topic model can use side information such as ratings or labels associated with documents or images to discover more predictive low dimensional topical representations of the data. However, existing supervised topic models predominantly employ likelihood-driven objective functions for learning and inference, leaving the popular and potentially powerful max-margin principle unexploited for seeking predictive representations of data and more discriminative topic bases for the corpus. In this paper, we propose the maximum entropy discrimination latent Dirichlet allocation (MedLDA) model, which integrates the mechanism behind the max-margin prediction models (e.g., SVMs) with the mechanism behind the hierarchical Bayesian topic models (e.g., LDA) under a unified constrained optimization framework, and yields latent topical representations that are more discriminative and more suitable for prediction tasks such as document classification or regression. The principle underlying the MedLDA formalism is quite general and can be applied for jointly max-margin and maximum likelihood learning of directed or undirected topic models when supervising side information is available. Efficient variational methods for posterior inference and parameter estimation are derived and extensive empirical studies on several real data sets are also provided. Our experimental results demonstrate qualitatively and quantitatively that MedLDA could: 1) discover sparse and highly discriminative topical representations; 2) achieve state of the art prediction performance; and 3) be more efficient than existing supervised topic models, especially for classification.

NeurIPS Conference 2012 Conference Paper

Monte Carlo Methods for Maximum Margin Supervised Topic Models

  • Qixia Jiang
  • Jun Zhu
  • Maosong Sun
  • Eric Xing

An effective strategy to exploit the supervising side information for discovering predictive topic representations is to impose discriminative constraints induced by such information on the posterior distributions under a topic model. This strategy has been adopted by a number of supervised topic models, such as MedLDA, which employs max-margin posterior constraints. However, unlike the likelihood-based supervised topic models, for which posterior inference can be carried out using the Bayes' rule, the max-margin posterior constraints have made Monte Carlo methods infeasible or at least not directly applicable, thereby limiting the choice of inference algorithms to variational approximation with strict mean-field assumptions. In this paper, we develop two efficient Monte Carlo methods under much weaker assumptions for max-margin supervised topic models based on an importance sampler and a collapsed Gibbs sampler, respectively, in a convex dual formulation. We report thorough experimental results that compare our approach favorably against existing alternatives in both accuracy and efficiency.

NeurIPS Conference 2012 Conference Paper

Nonparametric Max-Margin Matrix Factorization for Collaborative Prediction

  • Minjie Xu
  • Jun Zhu
  • Bo Zhang

We present a probabilistic formulation of max-margin matrix factorization and build accordingly a nonparametric Bayesian model which automatically resolves the unknown number of latent factors. Our work demonstrates a successful example that integrates Bayesian nonparametrics and max-margin learning, which are conventionally two separate paradigms and enjoy complementary advantages. We develop an efficient variational algorithm for posterior inference, and our extensive empirical studies on large-scale MovieLens and EachMovie data sets appear to justify the aforementioned dual advantages.

NeurIPS Conference 2011 Conference Paper

Infinite Latent SVM for Classification and Multi-task Learning

  • Jun Zhu
  • Ning Chen
  • Eric Xing

Unlike existing nonparametric Bayesian models, which rely solely on specially conceived priors to incorporate domain knowledge for discovering improved latent representations, we study nonparametric Bayesian inference with regularization on the desired posterior distributions. While priors can indirectly affect posterior distributions through Bayes' theorem, imposing posterior regularization is arguably more direct and in some cases can be much easier. We particularly focus on developing infinite latent support vector machines (iLSVM) and multi-task infinite latent support vector machines (MT-iLSVM), which explore the large-margin idea in combination with a nonparametric Bayesian model for discovering predictive latent features for classification and multi-task learning, respectively. We present efficient inference methods and report empirical studies on several benchmark datasets. Our results appear to demonstrate the merits inherited from both large-margin learning and Bayesian nonparametrics.

NeurIPS Conference 2010 Conference Paper

Adaptive Multi-Task Lasso: with Application to eQTL Detection

  • Seunghak Lee
  • Jun Zhu
  • Eric Xing

To understand the relationship between genomic variations among population and complex diseases, it is essential to detect eQTLs which are associated with phenotypic effects. However, detecting eQTLs remains a challenge due to complex underlying mechanisms and the very large number of genetic loci involved compared to the number of samples. Thus, to address the problem, it is desirable to take advantage of the structure of the data and prior information about genomic locations such as conservation scores and transcription factor binding sites. In this paper, we propose a novel regularized regression approach for detecting eQTLs which takes into account related traits simultaneously while incorporating many regulatory features. We first present a Bayesian network for a multi-task learning problem that includes priors on SNPs, making it possible to estimate the significance of each covariate adaptively. Then we find the maximum a posteriori (MAP) estimation of regression coefficients and estimate weights of covariates jointly. This optimization procedure is efficient since it can be achieved by using convex optimization and a coordinate descent procedure iteratively. Experimental results on simulated and real yeast datasets confirm that our model outperforms previous methods for finding eQTLs.

NeurIPS Conference 2010 Conference Paper

Efficient Relational Learning with Hidden Variable Detection

  • Ni Lao
  • Jun Zhu
  • Liu Liu
  • Yandong Liu
  • William Cohen

Markov networks (MNs) can incorporate arbitrarily complex features in modeling relational data. However, this flexibility comes at a sharp price of training an exponentially complex model. To address this challenge, we propose a novel relational learning approach, which consists of a restricted class of relational MNs (RMNs) called relation tree-based RMN (treeRMN), and an efficient Hidden Variable Detection algorithm called Contrastive Variable Induction (CVI). On one hand, the restricted treeRMN only considers simple (e.g., unary and pairwise) features in relational data and thus achieves computational efficiency; and on the other hand, the CVI algorithm efficiently detects hidden variables which can capture long range dependencies. Therefore, the resultant approach is highly efficient yet does not sacrifice its expressive power. Empirical results on four real datasets show that the proposed relational learning method can achieve similar prediction quality as the state-of-the-art approaches, but is significantly more efficient in training; and the induced hidden variables are semantically meaningful and crucial to improve the training speed and prediction qualities of treeRMNs.

NeurIPS Conference 2010 Conference Paper

Large Margin Learning of Upstream Scene Understanding Models

  • Jun Zhu
  • Li-Jia Li
  • Li Fei-Fei
  • Eric Xing

Upstream supervised topic models have been widely used for complicated scene understanding. However, existing maximum likelihood estimation (MLE) schemes can make the prediction model learning independent of latent topic discovery and result in an imbalanced prediction rule for scene classification. This paper presents a joint max-margin and max-likelihood learning method for upstream scene understanding models, in which latent topic discovery and prediction model estimation are closely coupled and well-balanced. The optimization problem is efficiently solved with a variational EM procedure, which iteratively solves an online loss-augmented SVM. We demonstrate the advantages of the large-margin approach on both an 8-category sports dataset and the 67-class MIT indoor scene dataset for scene categorization.

NeurIPS Conference 2010 Conference Paper

Predictive Subspace Learning for Multi-view Data: a Large Margin Approach

  • Ning Chen
  • Jun Zhu
  • Eric Xing

Learning from multi-view data is important in many applications, such as image classification and annotation. In this paper, we present a large-margin learning framework to discover a predictive latent subspace representation shared by multiple views. Our approach is based on an undirected latent space Markov network that fulfills a weak conditional independence assumption that multi-view observations and response variables are independent given a set of latent variables. We provide efficient inference and parameter estimation methods for the latent subspace model. Finally, we demonstrate the advantages of large-margin learning on real video and web image data for discovering predictive latent representations and improving the performance on image classification, annotation and retrieval.

JMLR Journal 2009 Journal Article

Maximum Entropy Discrimination Markov Networks

  • Jun Zhu
  • Eric P. Xing

The standard maximum margin approach for structured prediction lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore its unique advantages such as dual sparseness and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and ability to model hidden variables. In this paper, we present a new general framework called maximum entropy discrimination Markov networks (MaxEnDNet, or simply, MEDN), which integrates these two approaches and combines and extends their merits. Major innovations of this approach include: 1) It extends the conventional max-entropy discrimination learning of classification rules to a new structural max-entropy discrimination paradigm of learning a distribution of Markov networks. 2) It generalizes the extant Markov network structured-prediction rule based on a point estimator of model coefficients to an averaging model akin to a Bayesian predictor that integrates over a learned posterior distribution of model coefficients. 3) It admits flexible entropic regularization of the model during learning. By plugging in different prior distributions of the model coefficients, it subsumes the well-known maximum margin Markov networks (M^3N) as a special case, and leads to a model similar to an L1-regularized M^3N that is simultaneously primal and dual sparse, or other new types of Markov networks. 4) It applies a modular learning algorithm that combines existing variational inference techniques and convex-optimization-based M^3N solvers as subroutines. Essentially, MEDN can be understood as a jointly maximum likelihood and maximum margin estimate of Markov networks. It represents the first successful attempt to combine maximum entropy learning (a dual form of maximum likelihood learning) with maximum margin learning of Markov networks for structured input/output problems; and the basic principle can be generalized to learning arbitrary graphical models, such as generative Bayesian networks or models with structured hidden variables. We discuss a number of theoretical properties of this approach, and show that empirically it outperforms a wide array of competing methods for structured input/output learning on both synthetic and real OCR and web data extraction data sets.

JMLR Journal 2008 Journal Article

Dynamic Hierarchical Markov Random Fields for Integrated Web Data Extraction

  • Jun Zhu
  • Zaiqing Nie
  • Bo Zhang
  • Ji-Rong Wen

Existing template-independent web data extraction approaches adopt highly ineffective decoupled strategies---attempting to do data record detection and attribute labeling in two separate phases. In this paper, we propose an integrated web data extraction paradigm with hierarchical models. The proposed model is called Dynamic Hierarchical Markov Random Fields (DHMRFs). DHMRFs take structural uncertainty into consideration and define a joint distribution of both model structure and class labels. The joint distribution is an exponential family distribution. As a conditional model, DHMRFs relax the independence assumption as made in directed models. Since exact inference is intractable, a variational method is developed to learn the model's parameters and to find the MAP model structure and label assignments. We apply DHMRFs to a real-world web data extraction task. Experimental results show that: (1) integrated web data extraction models can achieve significant improvements on both record detection and attribute labeling compared to decoupled models; (2) in diverse web data extraction DHMRFs can potentially address the blocky artifact issue which is suffered by fixed-structured hierarchical models.

NeurIPS Conference 2008 Conference Paper

Partially Observed Maximum Entropy Discrimination Markov Networks

  • Jun Zhu
  • Eric Xing
  • Bo Zhang

Learning graphical models with hidden variables can offer semantic insights to complex data and lead to salient structured predictors without relying on expensive, sometimes unattainable fully annotated training data. While likelihood-based methods have been extensively explored, to our knowledge, learning structured prediction models with latent variables based on the max-margin principle remains largely an open problem. In this paper, we present a partially observed Maximum Entropy Discrimination Markov Network (PoMEN) model that attempts to combine the advantages of Bayesian and margin-based paradigms for learning Markov networks from partially labeled data. PoMEN leads to an averaging prediction rule that resembles a Bayes predictor and is more robust to overfitting, while also retaining the desirable discriminative properties of the M^3N. We develop an EM-style algorithm utilizing existing convex optimization algorithms for the M^3N as a subroutine. We demonstrate competent performance of PoMEN over existing methods on a real-world web data extraction task.

IJCAI Conference 1999 Conference Paper

Remembering to Add: Competence-preserving Case-Addition Policies for Case- Base Maintenance

  • Jun Zhu
  • Qiang Yang

Case-base maintenance is gaining increasing recognition in research and the practical applications of case-based reasoning (CBR). This intense interest is highlighted by Smyth and Keane's research on case deletion policies. In their work, Smyth and Keane advocated a case deletion policy, whereby the cases in a case base are classified and deleted based on their coverage potential and adaptation power. The algorithm was empirically shown to improve the competence of a CBR system and outperform a number of previous deletion-based strategies. In this paper, we present a different case-base maintenance policy that is based on case addition rather than deletion. The advantage of our algorithm is that we can place a lower bound on the competence of the resulting case base; we demonstrate that the computed case base cannot be worse in coverage than the optimal case base by more than a fixed bound, and the coverage is often much closer to optimum. We also show that Smyth and Keane's deletion-based policy cannot guarantee any such lower bound. Our result highlights the importance of finding the right case-base maintenance algorithm in order to guarantee the best case-base coverage. We demonstrate the effectiveness of our algorithm through an experiment in case-based planning.