Author name cluster

Zeming Wei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers

2 author rows

ICML Conference 2025 Conference Paper

Identifying and Understanding Cross-Class Features in Adversarial Training

Zeming Wei
Steven Y. Guo
Yisen Wang 0001

Adversarial training (AT) has been considered one of the most effective methods for making deep neural networks robust against adversarial attacks, while the training mechanisms and dynamics of AT remain open research problems. In this paper, we present a novel perspective on studying AT through the lens of class-wise feature attribution. Specifically, we identify the impact of a key family of features on AT that are shared by multiple classes, which we call cross-class features. These features are typically useful for robust classification, which we offer theoretical evidence to illustrate through a synthetic data model. Through systematic studies across multiple model architectures and settings, we find that during the initial stage of AT, the model tends to learn more cross-class features until the best robustness checkpoint. As AT further squeezes the training robust loss and causes robust overfitting, the model tends to make decisions based on more class-specific features. Based on these discoveries, we further provide a unified view of two existing properties of AT, including the advantage of soft-label training and robust overfitting. Overall, these insights refine the current understanding of AT mechanisms and provide new perspectives on studying them. Our code is available at https: //github. com/PKU-ML/Cross-Class-Features-AT.

Details

NeurIPS Conference 2024 Conference Paper

A Theoretical Understanding of Self-Correction through In-context Alignment

Yifei Wang
Yuyang Wu
Zeming Wei
Stefanie Jegelka
Yisen Wang

Going beyond mimicking limited human experiences, recent studies show initial evidence that, like humans, large language models (LLMs) are capable of improving their abilities purely by self-correction, i. e. , correcting previous responses through self-examination, as seen in models like OpenAI o1. Nevertheless, little is known about how such capabilities arise. In this work, based on a simplified setup akin to an alignment task, we theoretically analyze self-correction from an in-context learning perspective, showing that when LLMs give relatively accurate self-examinations as rewards, they are capable of refining responses in an in-context way. Notably, going beyond previous theories on over-simplified linear transformers, our theoretical construction underpins the roles of several key designs of realistic transformers for self-correction: softmax attention, multi-head attention, and the MLP block. We validate these findings extensively on synthetic datasets. Inspired by these findings, we propose a simple self-correction strategy, Checking as Context (CaC), which finds novel applications in alleviating social bias and defending against LLM jailbreaks. We believe that these findings will inspire further research on understanding, exploiting, and enhancing self-correction for building better foundation models. Code is at https: //github. com/yifeiwang77/Self-Correction.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Adversarial Representation Engineering: A General Model Editing Framework for Large Language Models

Yihao Zhang
Zeming Wei
Jun Sun
Meng Sun

Since the rapid development of Large Language Models (LLMs) has achieved remarkable success, understanding and rectifying their internal complex mechanisms has become an urgent issue. Recent research has attempted to interpret their behaviors through the lens of inner representation. However, developing practical and efficient methods for applying these representations for general and flexible model editing remains challenging. In this work, we explore how to leverage insights from representation engineering to guide the editing of LLMs by deploying a representation discriminator as an editing oracle. We first identify the importance of a robust and reliable discriminator during editing, then propose an \textbf{A}dversarial \textbf{R}epresentation \textbf{E}ngineering (\textbf{ARE}) framework to provide a unified and interpretable approach for conceptual model editing without compromising baseline performance. Experiments on multiple tasks demonstrate the effectiveness of ARE in various model editing scenarios. Our code and data are available at \url{https: //github. com/Zhang-Yihao/Adversarial-Representation-Engineering}.

PDF Details DOI

NeurIPS Conference 2024 Conference Paper

Fight Back Against Jailbreaking via Prompt Adversarial Tuning

Yichuan Mo
Yuji Wang
Zeming Wei
Yisen Wang

While Large Language Models (LLMs) have achieved tremendous success in various applications, they are also susceptible to jailbreaking attacks. Several primary defense strategies have been proposed to protect LLMs from producing harmful information, mostly focusing on model fine-tuning or heuristical defense designs. However, how to achieve intrinsic robustness through prompt optimization remains an open problem. In this paper, motivated by adversarial training paradigms for achieving reliable robustness, we propose an approach named Prompt Adversarial Tuning (PAT) that trains a prompt control attached to the user prompt as a guard prefix. To achieve our defense goal whilst maintaining natural performance, we optimize the control prompt with both adversarial and benign prompts. Comprehensive experiments show that our method is effective against both grey-box and black-box attacks, reducing the success rate of advanced attacks to nearly 0, while maintaining the model's utility on the benign task and incurring only negligible computational overhead, charting a new perspective for future explorations in LLM security. Our code is available at https: //github. com/PKU-ML/PAT.

PDF Details DOI

ICML Conference 2024 Conference Paper

On the Duality Between Sharpness-Aware Minimization and Adversarial Training

Yihao Zhang
Hangzhou He
Jingyu Zhu
Huanran Chen
Yifei Wang
Zeming Wei

Adversarial Training (AT), which adversarially perturb the input samples during training, has been acknowledged as one of the most effective defenses against adversarial attacks, yet suffers from inevitably decreased clean accuracy. Instead of perturbing the samples, Sharpness-Aware Minimization (SAM) perturbs the model weights during training to find a more flat loss landscape and improve generalization. However, as SAM is designed for better clean accuracy, its effectiveness in enhancing adversarial robustness remains unexplored. In this work, considering the duality between SAM and AT, we investigate the adversarial robustness derived from SAM. Intriguingly, we find that using SAM alone can improve adversarial robustness. To understand this unexpected property of SAM, we first provide empirical and theoretical insights into how SAM can implicitly learn more robust features, and conduct comprehensive experiments to show that SAM can improve adversarial robustness notably without sacrificing any clean accuracy, shedding light on the potential of SAM to be a substitute for AT when accuracy comes at a higher priority. Code is available at https: //github. com/weizeming/SAM_AT.

Details

NeurIPS Conference 2023 Conference Paper

Architecture Matters: Uncovering Implicit Mechanisms in Graph Contrastive Learning

Xiaojun Guo
Yifei Wang
Zeming Wei
Yisen Wang

With the prosperity of contrastive learning for visual representation learning (VCL), it is also adapted to the graph domain and yields promising performance. However, through a systematic study of various graph contrastive learning (GCL) methods, we observe that some common phenomena among existing GCL methods that are quite different from the original VCL methods, including 1) positive samples are not a must for GCL; 2) negative samples are not necessary for graph classification, neither for node classification when adopting specific normalization modules; 3) data augmentations have much less influence on GCL, as simple domain-agnostic augmentations (e. g. , Gaussian noise) can also attain fairly good performance. By uncovering how the implicit inductive bias of GNNs works in contrastive learning, we theoretically provide insights into the above intriguing properties of GCL. Rather than directly porting existing VCL methods to GCL, we advocate for more attention toward the unique architecture of graph learning and consider its implicit influence when designing GCL methods. Code is available at https: //github. com/PKU-ML/ArchitectureMattersGCL.

PDF Details