
Author name cluster

Juan Zhai

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers (5)

AAAI 2026 · Conference Paper

From Chaos to Clarity: A Knowledge Graph-Driven Audit Dataset Generation Framework for LLM Unlearning

  • Weipeng Jiang
  • Juan Zhai
  • Shiqing Ma
  • Ziyan Lei
  • Xiaofei Xie
  • Yige Wang
  • Chao Shen

Recently, LLMs have faced increasing demands to selectively remove specific information through Machine Unlearning. While evaluating unlearning effectiveness is crucial, existing benchmarks suffer from fundamental limitations in audit dataset generation from unstructured corpora. We identify two critical challenges: ensuring audit adequacy and handling knowledge redundancy between forget and retain datasets. Current approaches rely on ad-hoc question generation from unstructured text, leading to unpredictable coverage gaps and evaluation blind spots. Knowledge redundancy between forget and retain corpora further obscures evaluation, making it difficult to distinguish genuine unlearning failures from legitimately retained knowledge. To bring clarity to this challenge, we propose LUCID, an automated framework that leverages knowledge graphs to achieve comprehensive audit dataset generation with fine-grained coverage and systematic redundancy elimination. By converting unstructured corpora into structured knowledge representations, it transforms the ad-hoc audit dataset generation process into a transparent and automated generation pipeline that ensures both adequacy and non-redundancy. Applying LUCID to the MUSE benchmark, we generated over 69,000 and 111,000 audit cases for the News and Books datasets, respectively, identifying thousands of previously undetected knowledge memorization instances. Our analysis reveals that knowledge redundancy significantly skews metrics, artificially inflating ROUGE from 19.7% to 26.1% and Entailment Scores from 32.4% to 35.2%, highlighting the necessity of deduplication for accurate assessment.
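
As a rough illustration of the knowledge-graph idea described in the abstract (not LUCID's actual pipeline), the sketch below represents facts as (subject, relation, object) triples, drops forget-set facts that also appear in the retain set, and turns each remaining fact into an audit question. The class, the question template, and the example facts are all hypothetical.

```python
# Illustrative sketch only -- LUCID's real pipeline is not reproduced here.
# The Triple type, the question template, and the set-difference deduplication
# are assumptions for demonstration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

def dedup_forget_triples(forget: set, retain: set) -> set:
    """Remove knowledge that also appears in the retain corpus, so audit
    questions only probe facts that should have been unlearned."""
    return forget - retain

def triple_to_audit_case(t: Triple) -> dict:
    """Turn one structured fact into a question/expected-answer audit case."""
    return {
        "question": f"What is the {t.relation} of {t.subject}?",
        "expected_if_not_unlearned": t.obj,
    }

forget = {Triple("Alice", "employer", "Acme"), Triple("Alice", "birthplace", "Oslo")}
retain = {Triple("Alice", "birthplace", "Oslo")}  # redundant fact: must not be audited
audit_cases = [triple_to_audit_case(t) for t in dedup_forget_triples(forget, retain)]
print(audit_cases)
```

Representing knowledge as triples is what turns the redundancy check into a plain set difference rather than a fuzzy comparison between passages of text.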

ICLR 2025 · Conference Paper

STAFF: Speculative Coreset Selection for Task-Specific Fine-tuning

  • Xiaoyu Zhang 0013
  • Juan Zhai
  • Shiqing Ma
  • Chao Shen 0001
  • Tianlin Li
  • Weipeng Jiang
  • Yang Liu 0003

Task-specific fine-tuning is essential for the deployment of large language models (LLMs), but it requires significant computational resources and time. Existing solutions have proposed coreset selection methods to improve data efficiency and reduce model training overhead, but they still have limitations: ❶ Overlooking valuable samples at high pruning rates, which degrades the coreset’s performance. ❷ Requiring high time overhead during coreset selection to fine-tune and evaluate the target LLM. In this paper, we introduce STAFF, a speculative coreset selection method. STAFF leverages a small model from the same family as the target LLM to efficiently estimate data scores and then verifies the scores on the target LLM to accurately identify and allocate more selection budget to important regions while maintaining coverage of easy regions. We evaluate STAFF on three LLMs and three downstream tasks and show that STAFF improves the performance of SOTA methods by up to 54.3% and reduces selection overhead by up to 70.5% at different pruning rates. Furthermore, we observe that the coreset selected by STAFF at low pruning rates (i.e., 20%) can even obtain better fine-tuning performance than the full dataset.
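
The abstract describes a two-stage, score-then-verify selection loop. Below is a heavily simplified, hypothetical sketch of that idea (not the authors' STAFF implementation): the small same-family model scores every sample, the target LLM re-scores only a small verification slice per score region, and regions where the two models disagree receive a larger share of the selection budget. The scoring functions, the binning, and the budget rule are all assumptions.

```python
# Hypothetical sketch of speculative "score with a small model, verify with the
# target LLM" coreset selection; not STAFF's actual algorithm.
import random

def select_coreset(data, small_score, target_score, budget, n_bins=10, n_verify=32):
    # 1) Cheap pass: score every sample with the small same-family model.
    cheap = [small_score(x) for x in data]
    order = sorted(range(len(data)), key=lambda i: cheap[i])
    step = max(1, len(order) // n_bins)
    bins = [order[i:i + step] for i in range(0, len(order), step)]
    # 2) Verify a small slice of each score region on the target LLM and
    #    measure how much the two models disagree there.
    weights = []
    for b in bins:
        sample = random.sample(b, min(n_verify, len(b)))
        gap = sum(abs(target_score(data[i]) - cheap[i]) for i in sample) / len(sample)
        weights.append(1.0 + gap)   # disagreeing regions are treated as more important
    total = sum(weights)
    # 3) Spend the budget proportionally, so every region keeps some coverage.
    selected = []
    for b, w in zip(bins, weights):
        k = min(len(b), max(1, round(budget * w / total)))
        selected.extend(random.sample(b, k))
    return selected[:budget]
```

Because only the verification slices touch the target LLM, the expensive calls are bounded by roughly n_bins * n_verify rather than by the dataset size, which is the point of the speculative setup.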

ICLR 2023 · Conference Paper

UNICORN: A Unified Backdoor Trigger Inversion Framework

  • Zhenting Wang
  • Kai Mei
  • Juan Zhai
  • Shiqing Ma

The backdoor attack, where the adversary uses inputs stamped with triggers (e.g., a patch) to activate pre-planted malicious behaviors, is a severe threat to Deep Neural Network (DNN) models. Trigger inversion is an effective way of identifying backdoor models and understanding embedded adversarial behaviors. A challenge of trigger inversion is that there are many ways of constructing the trigger. Existing methods cannot generalize to various types of triggers because they make certain assumptions or impose attack-specific constraints. The fundamental reason is that existing work does not formally define the trigger and the inversion problem. This work formally defines and analyzes the trigger and the inversion problem. Then, it proposes a unified framework to invert backdoor triggers based on the formalization of triggers and the identified inner behaviors of backdoor models from our analysis. Our prototype UNICORN is general and effective in inverting backdoor triggers in DNNs. The code can be found at https://github.com/RU-System-Software-and-Security/UNICORN.
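
For readers unfamiliar with trigger inversion, the snippet below shows the classic mask-and-pattern optimization that frameworks like this one generalize. It is a generic PyTorch sketch of patch-style inversion, not UNICORN's unified formulation; the model, input shapes, and hyperparameters are placeholders.

```python
# Generic patch-style trigger inversion sketch (PyTorch). This is the classic
# mask + pattern formulation, not UNICORN's unified trigger definition.
import torch

def invert_trigger(model, images, target_label, steps=500, lam=1e-2, lr=0.1):
    """images: (N, C, H, W) clean inputs; returns an optimized mask and pattern."""
    _, c, h, w = images.shape
    mask = torch.zeros(1, 1, h, w, requires_grad=True)
    pattern = torch.zeros(1, c, h, w, requires_grad=True)
    opt = torch.optim.Adam([mask, pattern], lr=lr)
    target = torch.full((images.shape[0],), target_label, dtype=torch.long)
    for _ in range(steps):
        m = torch.sigmoid(mask)                 # keep the mask in [0, 1]
        p = torch.tanh(pattern) * 0.5 + 0.5     # keep the pattern in [0, 1]
        stamped = (1 - m) * images + m * p      # stamp the candidate trigger
        loss = torch.nn.functional.cross_entropy(model(stamped), target)
        loss = loss + lam * m.abs().sum()       # prefer small triggers
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask).detach(), (torch.tanh(pattern) * 0.5 + 0.5).detach()
```

The mask-size penalty encodes the input-space assumption the abstract criticizes: it only recovers small static patches, which is why a unified trigger definition is needed for other trigger types.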

NeurIPS 2022 · Conference Paper

Rethinking the Reverse-engineering of Trojan Triggers

  • Zhenting Wang
  • Kai Mei
  • Hailun Ding
  • Juan Zhai
  • Shiqing Ma

Deep Neural Networks are vulnerable to Trojan (or backdoor) attacks. Reverse-engineering methods can reconstruct the trigger and thus identify affected models. Existing reverse-engineering methods only consider input space constraints, e.g., trigger size in the input space. Expressly, they assume the triggers are static patterns in the input space and fail to detect models with feature space triggers such as image style transformations. We observe that both input-space and feature-space Trojans are associated with feature space hyperplanes. Based on this observation, we design a novel reverse-engineering method that exploits the feature space constraint to reverse-engineer Trojan triggers. Results on four datasets and seven different attacks demonstrate that our solution effectively defends both input-space and feature-space Trojans. It outperforms state-of-the-art reverse-engineering methods and other types of defenses in both Trojaned model detection and mitigation tasks. On average, the detection accuracy of our method is 93%. For Trojan mitigation, our method can reduce the ASR (attack success rate) to only 0.26% with the BA (benign accuracy) remaining nearly unchanged. Our code can be found at https://github.com/RU-System-Software-and-Security/FeatureRE.
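
As a toy illustration of the feature-space idea (not the paper's implementation), the sketch below jointly optimizes an input perturbation and a hyperplane so that stamped inputs are classified as the target label while their features stay close to one shared hyperplane w·z + b = 0. The feature_extractor/classifier_head split, the perturbation parameterization, and all hyperparameters are assumptions.

```python
# Toy sketch of adding a feature-space hyperplane constraint to trigger
# inversion; not the paper's FeatureRE method. Assumes feature_extractor
# returns (N, feat_dim) features and classifier_head maps them to logits.
import torch

def invert_with_feature_constraint(feature_extractor, classifier_head, images,
                                   target_label, steps=300, lr=0.05, beta=1.0):
    feat_dim = feature_extractor(images[:1]).shape[1]
    delta = torch.zeros_like(images, requires_grad=True)   # input-space perturbation
    w = torch.randn(feat_dim, requires_grad=True)           # hyperplane normal
    b = torch.zeros(1, requires_grad=True)                  # hyperplane offset
    opt = torch.optim.Adam([delta, w, b], lr=lr)
    target = torch.full((images.shape[0],), target_label, dtype=torch.long)
    for _ in range(steps):
        z = feature_extractor((images + delta).clamp(0, 1))         # features of stamped inputs
        cls_loss = torch.nn.functional.cross_entropy(classifier_head(z), target)
        w_unit = w / (w.norm() + 1e-8)
        plane_loss = ((z @ w_unit + b) ** 2).mean()                  # distance to the hyperplane
        loss = cls_loss + beta * plane_loss
        opt.zero_grad()
        loss.backward()
        opt.step()
    return delta.detach()
```

The plane_loss term is the feature-space constraint: instead of only bounding the trigger in input space, it asks the inverted behavior to collapse onto a common hyperplane in feature space, which is the structure the abstract associates with both trigger types.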

NeurIPS 2022 · Conference Paper

Training with More Confidence: Mitigating Injected and Natural Backdoors During Training

  • Zhenting Wang
  • Hailun Ding
  • Juan Zhai
  • Shiqing Ma

The backdoor or Trojan attack is a severe threat to deep neural networks (DNNs). Researchers find that DNNs trained on benign data and settings can also learn backdoor behaviors, which is known as the natural backdoor. Existing works on anti-backdoor learning are based on weak observations that the backdoor and benign behaviors can differentiate during training. An adaptive attack with slow poisoning can bypass such defenses. Moreover, these methods cannot defend against natural backdoors. We found the fundamental differences between backdoor-related neurons and benign neurons: backdoor-related neurons form a hyperplane as the classification surface across input domains of all affected labels. By further analyzing the training process and model architectures, we found that piece-wise linear functions cause this hyperplane surface. In this paper, we design a novel training method that forces the training to avoid generating such hyperplanes and thus removes the injected backdoors. Our extensive experiments on five datasets against five state-of-the-art attacks and also benign training show that our method can outperform existing state-of-the-art defenses. On average, the ASR (attack success rate) of the models trained with NONE is 54.83 times lower than undefended models under the standard poisoning backdoor attack and 1.75 times lower under the natural backdoor attack. Our code is available at https://github.com/RU-System-Software-and-Security/NONE.
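
The defense described above operates during training. Purely as an assumption for illustration (this is not the paper's NONE procedure), one naive way to discourage hyperplane-like, always-linear behavior in a piece-wise linear network is to periodically reinitialize ReLU units that are active on essentially every training input, as sketched below; the layer type, threshold, and reset rule are all hypothetical.

```python
# Naive illustration only: reset ReLU units that are active on nearly every
# input (and therefore behave linearly over the data). Loosely inspired by,
# but not equivalent to, the training-time defense described above.
import torch
import torch.nn as nn

@torch.no_grad()
def reset_always_active_units(linear: nn.Linear, inputs: torch.Tensor, thresh=0.99):
    """inputs: a batch feeding this layer; reinitialize rows whose pre-activation
    is positive on more than `thresh` of the batch (the ReLU never switches off)."""
    pre_act = inputs @ linear.weight.T + linear.bias      # (N, out_features)
    active_rate = (pre_act > 0).float().mean(dim=0)       # per-unit activation rate
    suspicious = active_rate > thresh
    if suspicious.any():
        fresh = torch.empty_like(linear.weight)
        nn.init.kaiming_uniform_(fresh, a=5 ** 0.5)
        linear.weight[suspicious] = fresh[suspicious]      # reinitialize those units
        linear.bias[suspicious] = 0.0
    return int(suspicious.sum())
```

Calling such a check every few epochs keeps units that act linearly over the whole data distribution from persisting, which is the kind of hyperplane behavior the abstract links to backdoor-related neurons.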