Author name cluster

Wei Chen 0034

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

15 papers

1 author row

ICLR Conference 2025 Conference Paper

CLIPure: Purification in Latent Space via CLIP for Adversarially Robust Zero-Shot Classification

Mingkun Zhang
Keping Bi
Wei Chen 0034
Jiafeng Guo
Xueqi Cheng

In this paper, we aim to build an adversarially robust zero-shot image classifier that can accurately and efficiently classify unseen examples while defending against unforeseen adversarial attacks, addressing critical challenges in real-world safety-sensitive scenarios. To achieve this, we focus on two key challenges: zero-shot classification and defense against unforeseen attacks. We ground our work on CLIP, a vision-language pre-trained model to perform zero-shot classification. To defend against unforeseen attacks, we adopt a purification approach, as it is independent of specific attack types. We then define a purification risk as the KL divergence between the joint distributions of the purification and attack process. The derived lower bound of purification risk inspires us to explore purification in CLIP's multi-modal latent space. We propose a CLIP-based purification method called CLIPure, which has two variants: _CLIPure-Diff_, which models image likelihood with a generative process of its latent vector, and _CLIPure-Cos_, which models the likelihood based on the similarity between embeddings of the image and a blank template. As far as we know, CLIPure is the first purification method in latent space and _CLIPure-Cos_ is the first purification method not relying on generative models, substantially improving defense efficiency. Extensive experimental results show that the robustness achieved by CLIPure is within a small gap of clean accuracy, outperforming SOTA robustness by a large margin, e.g., from 71.7\% to **91.1\%** on CIFAR10, from 59.6\% to **72.6\%** on ImageNet, and **108\%** relative improvements of average robustness on the 13 datasets over previous SOTA, with only 14\% extra inference cost and no additional training.

ECAI Conference 2024 Conference Paper

Classifier Guidance Enhances Diffusion-Based Adversarial Purification by Preserving Predictive Information

Mingkun Zhang
Jianing Li
Wei Chen 0034
Jiafeng Guo
Xueqi Cheng

Adversarial purification is one of the promising approaches to defend neural networks against adversarial attacks. Recently, methods utilizing diffusion probabilistic models have achieved great success for adversarial purification in image classification tasks. However, such methods fall into the dilemma of balancing the needs for noise removal and information preservation. This paper points out that existing adversarial purification methods based on diffusion models gradually lose sample information during the core denoising process, causing occasional label shift in subsequent classification tasks. As a remedy, we suggest to suppress such information loss by introducing guidance from the classifier confidence. Specifically, we propose Classifier-cOnfidence gUided Purification (COUP) algorithm, which purifies adversarial examples while keeping away from the classifier decision boundary. Experimental results show that COUP can achieve better adversarial robustness under strong attack methods.

ICLR Conference 2023 Conference Paper

Collaborative Pure Exploration in Kernel Bandit

Yihan Du
Wei Chen 0034
Yuko Kuroki
Longbo Huang

In this paper, we propose a novel Collaborative Pure Exploration in Kernel Bandit model (CoPE-KB), where multiple agents collaborate to complete different but related tasks with limited communication. Our model generalizes prior CoPE formulation with the single-task and classic MAB setting to allow multiple tasks and general reward structures. We propose a novel communication scheme with an efficient kernelized estimator, and design optimal algorithms CoKernelFC and CoKernelFB for CoPE-KB with fixed-confidence and fixed-budget objectives, respectively. Nearly matching upper and lower bounds in both sampling and communication complexity are established to demonstrate the optimality of our algorithms. Our theoretical results explicitly quantify how task similarities influence learning speedup, and only depend on the effective dimension of feature space. Our novel techniques including an efficient kernelized estimator and linear structured instance transformation, which overcome the communication difficulty in high-dimensional feature space and derive communication round lower bounds, can be of independent interests.

ICLR Conference 2022 Conference Paper

Gradient Information Matters in Policy Optimization by Back-propagating through Model

Chongchong Li
Yue Wang 0017
Wei Chen 0034
Yuting Liu 0002
Zhiming Ma
Tie-Yan Liu

Model-based reinforcement learning provides an efficient mechanism to find the optimal policy by interacting with the learned environment. In addition to treating the learned environment like a black-box simulator, a more effective way to use the model is to exploit its differentiability. Such methods require the gradient information of the learned environment model when calculating the policy gradient. However, since the error of gradient is not considered in the model learning phase, there is no guarantee for the model's accuracy. To address this problem, we first analyze the convergence rate for the policy optimization methods when the policy gradient is calculated using the learned environment model. The theoretical results show that the model gradient error matters in the policy optimization phrase. Then we propose a two-model-based learning method to control the prediction error and the gradient error. We separate the different roles of these two models at the model learning phase and coordinate them at the policy optimization phase. After proposing the method, we introduce the directional derivative projection policy optimization (DDPPO) algorithm as a practical implementation to find the optimal policy. Finally, we empirically demonstrate the proposed algorithm has better sample efficiency when achieving a comparable or better performance on benchmark continuous control tasks.

ICLR Conference 2022 Conference Paper

PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Dependent Adaptive Prior

Sang-gil Lee
Heeseung Kim
Chaehun Shin
Xu Tan 0003
Chang Liu 0030
Qi Meng
Tao Qin 0001
Wei Chen 0034

Denoising diffusion probabilistic models have been recently proposed to generate high-quality samples by estimating the gradient of the data density. The framework assumes the prior noise as a standard Gaussian distribution, whereas the corresponding data distribution may be more complicated than the standard Gaussian distribution, which potentially introduces inefficiency in denoising the prior noise into the data sample because of the discrepancy between the data and the prior. In this paper, we propose PriorGrad to improve the efficiency of the conditional diffusion model (for example, a vocoder using a mel-spectrogram as the condition) by applying an adaptive prior derived from the data statistics based on the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the audio domain, we consider the recently proposed diffusion-based audio generative models based on both the spectral and time domains and show that PriorGrad achieves faster convergence and superior performance, leading to an improved perceptual quality and tolerance to a smaller network capacity, and thereby demonstrating the efficiency of a data-dependent adaptive prior.

ICML Conference 2022 Conference Paper

SE(3) Equivariant Graph Neural Networks with Complete Local Frames

Weitao Du
He Zhang
Yuanqi Du
Qi Meng
Wei Chen 0034
Nanning Zheng 0001
Bin Shao
Tie-Yan Liu

Group equivariance (e. g. SE(3) equivariance) is a critical physical symmetry in science, from classical and quantum physics to computational biology. It enables robust and accurate prediction under arbitrary reference transformations. In light of this, great efforts have been put on encoding this symmetry into deep neural networks, which has been shown to improve the generalization performance and data efficiency for downstream tasks. Constructing an equivariant neural network generally brings high computational costs to ensure expressiveness. Therefore, how to better trade-off the expressiveness and computational efficiency plays a core role in the design of the equivariant deep learning models. In this paper, we propose a framework to construct SE(3) equivariant graph neural networks that can approximate the geometric quantities efficiently. Inspired by differential geometry and physics, we introduce equivariant local complete frames to graph neural networks, such that tensor information at given orders can be projected onto the frames. The local frame is constructed to form an orthonormal basis that avoids direction degeneration and ensure completeness. Since the frames are built only by cross product operations, our method is computationally efficient. We evaluate our method on two tasks: Newton mechanics modeling and equilibrium molecule conformation generation. Extensive experimental results demonstrate that our model achieves the best or competitive performance in two types of datasets.

ICLR Conference 2021 Conference Paper

Do not Let Privacy Overbill Utility: Gradient Embedding Perturbation for Private Learning

Da Yu
Huishuai Zhang
Wei Chen 0034
Tie-Yan Liu

The privacy leakage of the model about the training data can be bounded in the differential privacy mechanism. However, for meaningful privacy parameters, a differentially private model degrades the utility drastically when the model comprises a large number of trainable parameters. In this paper, we propose an algorithm \emph{Gradient Embedding Perturbation (GEP)} towards training differentially private deep models with decent accuracy. Specifically, in each gradient descent step, GEP first projects individual private gradient into a non-sensitive anchor subspace, producing a low-dimensional gradient embedding and a small-norm residual gradient. Then, GEP perturbs the low-dimensional embedding and the residual gradient separately according to the privacy budget. Such a decomposition permits a small perturbation variance, which greatly helps to break the dimensional barrier of private learning. With GEP, we achieve decent accuracy with low computational cost and modest privacy guarantee for deep models. Especially, with privacy bound $\epsilon=8$, we achieve $74.9\%$ test accuracy on CIFAR10 and $95.1\%$ test accuracy on SVHN, significantly improving over existing results.

ICML Conference 2021 Conference Paper

Large Scale Private Learning via Low-rank Reparametrization

Da Yu
Huishuai Zhang
Wei Chen 0034
Jian Yin 0001
Tie-Yan Liu

We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks, which are 1) the huge memory cost of storing individual gradients, 2) the added noise suffering notorious dimensional dependence. Specifically, we reparametrize each weight matrix with two \emph{gradient-carrier} matrices of small dimension and a \emph{residual weight} matrix. We argue that such reparametrization keeps the forward/backward process unchanged while enabling us to compute the projected gradient without computing the gradient itself. To learn with differential privacy, we design \emph{reparametrized gradient perturbation (RGP)} that perturbs the gradients on gradient-carrier matrices and reconstructs an update for the original weight from the noisy gradients. Importantly, we use historical updates to find the gradient-carrier matrices, whose optimality is rigorously justified under linear regression and empirically verified with deep learning tasks. RGP significantly reduces the memory cost and improves the utility. For example, we are the first able to apply differential privacy on the BERT model and achieve an average accuracy of $83. 9%$ on four downstream tasks with $\epsilon=8$, which is within $5%$ loss compared to the non-private baseline but enjoys much lower privacy leakage risk.

UAI Conference 2021 Conference Paper

Path-BN: Towards effective batch normalization in the Path Space for ReLU networks

Xufang Luo
Qi Meng
Wei Chen 0034
Yunhong Wang 0001
Tie-Yan Liu

Neural networks with ReLU activation functions (abbrev. ReLU Networks), have demonstrated their success in many applications. Recently, researchers noticed that ReLU networks are positively scale-invariant (PSI) while the weights are not. This mismatch may lead to undesirable behaviors in the optimization process. Hence, some new algorithms that conduct optimization directly in the path space (the path space is proven to be PSI) were developed, such as Stochastic Gradient Descent (SGD) in the path space. %nd it was shown that, SGD in the path space is superior to that in the weight space. However, it is still unknown that whether other deep learning techniques such as batch normalization (BN), could also have their counterparts in the path space. In this paper, we conduct a formal study on the design of BN in the path space. First, we propose path-reparameterization of ReLU networks, in which the weights in the networks are reparameterized by path-values. Then, the feedforward and backward propagation of the path-reparameterized networks can calculate the values of the hidden nodes and the gradients in the path space, respectively. Next, we design the a novel way to do batch normalization for the path-reparameterized ReLU networks, called Path-BN. Specifically, we notice that, path-reparameterized ReLU NNs have a portion of constant weights which play more critical roles to form the basis of the path space. We propose to exclude these constant weights when doing batch normalization and prove that, by doing so, the scale and the direction of the trained parameters can be more effectively decoupled during training. Finally, we conduct experiments on benchmark datasets. The results show that our proposed Path-BN can improve the performance of the optimization algorithms in the path space.

ICML Conference 2021 Conference Paper

The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Bohan Wang
Qi Meng
Wei Chen 0034
Tie-Yan Liu

Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize relatively well to unseen data. Recently, researchers explained it by investigating the implicit bias of optimization algorithms. A remarkable progress is the work (Lyu & Li, 2019), which proves gradient descent (GD) maximizes the margin of homogeneous deep neural networks. Except the first-order optimization algorithms like GD, adaptive algorithms such as AdaGrad, RMSProp and Adam are popular owing to their rapid training process. Mean-while, numerous works have provided empirical evidence that adaptive methods may suffer from poor generalization performance. However, theoretical explanation for the generalization of adaptive optimization algorithms is still lacking. In this paper, we study the implicit bias of adaptive optimization algorithms on homogeneous neural networks. In particular, we study the convergent direction of parameters when they are optimizing the logistic loss. We prove that the convergent direction of Adam and RMSProp is the same as GD, while for AdaGrad, the convergent direction depends on the adaptive conditioner. Technically, we provide a unified framework to analyze convergent direction of adaptive optimization algorithms by constructing novel and nontrivial adaptive gradient flow and surrogate margin. The theoretical findings explain the superiority on generalization of exponential moving average strategy that is adopted by RMSProp and Adam. To the best of knowledge, it is the first work to study the convergent direction of adaptive optimizations on non-linear deep neural networks

ICML Conference 2020 Conference Paper

(Locally) Differentially Private Combinatorial Semi-Bandits

Xiaoyu Chen 0008
Kai Zheng 0007
Zixin Zhou
Yunchang Yang
Wei Chen 0034
Liwei Wang 0001

In this paper, we study Combinatorial Semi-Bandits (CSB) that is an extension of classic Multi-Armed Bandits (MAB) under Differential Privacy (DP) and stronger Local Differential Privacy (LDP) setting. Since the server receives more information from users in CSB, it usually causes additional dependence on the dimension of data, which is a notorious side-effect for privacy preserving learning. However for CSB under two common smoothness assumptions, we show it is possible to remove this side-effect. In detail, for $B_{\infty}$-bounded smooth CSB under either $\varepsilon$-LDP or $\varepsilon$-DP, we prove the optimal regret bound is $\Theta(\frac{mB^2_{\infty}\ln T } {\Delta\varepsilon^2})$ or $\tilde{\Theta}(\frac{mB^2_{\infty}\ln T} { \Delta\varepsilon})$ respectively, where $T$ is time period, $\Delta$ is the gap of rewards and $m$ is the number of base arms, by proposing novel algorithms and matching lower bounds. For $B_1$-bounded smooth CSB under $\varepsilon$-DP, we also prove the optimal regret bound is $\tilde{\Theta}(\frac{mKB^2_1\ln T} {\Delta\varepsilon})$ with both upper bound and lower bound, where $K$ is the maximum number of feedback in each round. All above results nearly match corresponding non-private optimal rates, which imply there is no additional price for (locally) differentially private CSB in above common settings.

ICML Conference 2020 Conference Paper

Combinatorial Pure Exploration for Dueling Bandit

Wei Chen 0034
Yihan Du
Longbo Huang
Haoyu Zhao

In this paper, we study combinatorial pure exploration for dueling bandits (CPE-DB): we have multiple candidates for multiple positions as modeled by a bipartite graph, and in each round we sample a duel of two candidates on one position and observe who wins in the duel, with the goal of finding the best candidate-position matching with high probability after multiple rounds of samples. CPE-DB is an adaptation of the original combinatorial pure exploration for multi-armed bandit (CPE-MAB) problem to the dueling bandit setting. We consider both the Borda winner and the Condorcet winner cases. For Borda winner, we establish a reduction of the problem to the original CPE-MAB setting and design PAC and exact algorithms that achieve both the sample complexity similar to that in the CPE-MAB setting (which is nearly optimal for a subclass of problems) and polynomial running time per round. For Condorcet winner, we first design a fully polynomial time approximation scheme (FPTAS) for the offline problem of finding the Condorcet winner with known winning probabilities, and then use the FPTAS as an oracle to design a novel pure exploration algorithm CAR-Cond with sample complexity analysis. CAR-Cond is the first algorithm with polynomial running time per round for identifying the Condorcet winner in CPE-DB.

ICML Conference 2018 Conference Paper

Towards Binary-Valued Gates for Robust LSTM Training

Zhuohan Li 0001
Di He 0001
Fei Tian
Wei Chen 0034
Tao Qin 0001
Liwei Wang 0001
Tie-Yan Liu

Long Short-Term Memory (LSTM) is one of the most widely used recurrent structures in sequence modeling. It aims to use gates to control information flow (e. g. , whether to skip some information or not) in the recurrent computations, although its practical implementation based on soft gates only partially achieves this goal. In this paper, we propose a new way for LSTM training, which pushes the output values of the gates towards 0 or 1. By doing so, we can better control the information flow: the gates are mostly open or closed, instead of in a middle state, which makes the results more interpretable. Empirical studies show that (1) Although it seems that we restrict the model capacity, there is no performance drop: we achieve better or comparable performances due to its better generalization ability; (2) The outputs of gates are not sensitive to their inputs: we can easily compress the LSTM unit in multiple ways, e. g. , low-rank approximation and low-precision approximation. The compressed models are even better than the baseline models without compression.

ICML Conference 2017 Conference Paper

Asynchronous Stochastic Gradient Descent with Delay Compensation

Shuxin Zheng
Qi Meng
Taifeng Wang
Wei Chen 0034
Nenghai Yu
Zhiming Ma
Tie-Yan Liu

With the fast development of deep learning, it has become common to learn big neural networks using massive training data. Asynchronous Stochastic Gradient Descent (ASGD) is widely adopted to fulfill this task for its efficiency, which is, however, known to suffer from the problem of delayed gradients. That is, when a local worker adds its gradient to the global model, the global model may have been updated by other workers and this gradient becomes “delayed”. We propose a novel technology to compensate this delay, so as to make the optimization behavior of ASGD closer to that of sequential SGD. This is achieved by leveraging Taylor expansion of the gradient function and efficient approximators to the Hessian matrix of the loss function. We call the new algorithm Delay Compensated ASGD (DC-ASGD). We evaluated the proposed algorithm on CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD.

ICML Conference 2017 Conference Paper

Dual Supervised Learning

Yingce Xia
Tao Qin 0001
Wei Chen 0034
Jiang Bian 0002
Nenghai Yu
Tie-Yan Liu

Many supervised learning tasks are emerged in dual forms, e. g. , English-to-French translation vs. French-to-English translation, speech recognition vs. text to speech, and image classification vs. image generation. Two dual tasks have intrinsic connections with each other due to the probabilistic correlation between their models. This connection is, however, not effectively utilized today, since people usually train the models of two dual tasks separately and independently. In this work, we propose training the models of two dual tasks simultaneously, and explicitly exploiting the probabilistic correlation between them to regularize the training process. For ease of reference, we call the proposed approach dual supervised learning. We demonstrate that dual supervised learning can improve the practical performances of both tasks, for various applications including machine translation, image processing, and sentiment analysis.