Arrow Research search

Author name cluster

Wenbo Zhou

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

8 papers
1 author row

Possible papers (8)

AAAI Conference 2026 Conference Paper

EARG-Net: Edge-Aware Reconstruction-Guided Network for Image Manipulation Detection and Localization

  • Yanpu Yu
  • Zhaoxin Shi
  • Hanqing Zhao
  • Tianyi Wei
  • Wenbo Zhou
  • Nenghai Yu

Recent advances in image editing tools, particularly those used in content-aware retouching and object-level manipulation, have raised significant concerns regarding the authenticity of digital images. While many Image Manipulation Detection and Localization (IMDL) methods have been proposed, they often struggle with subtle forgeries, intricate boundary artifacts, and manipulations generated by unseen editing techniques. In this work, we propose a novel edge-aware framework that leverages the strong natural image priors of pre-trained inpainting models to harmonize manipulated regions. By guiding the inpainting process with generated edge-aware masks, our method reconstructs tampered areas using surrounding context, yielding perceptually coherent results. The pixel-wise residual between the original and reconstructed images reveals manipulation-sensitive inconsistencies—particularly around editing boundaries—thereby enabling accurate and generalizable detection and localization. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art performance, especially in challenging scenarios involving realistic and finely retouched image forgeries.
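
The core residual idea from the abstract — difference the image against a context-based reconstruction and threshold the result — can be sketched in a few lines. This is a toy stand-in, not the paper's network: the reconstruction here is just a given array, whereas EARG-Net obtains it from a pre-trained inpainting model guided by edge-aware masks.

```python
def residual_mask(original, reconstructed, threshold=0.1):
    """Per-pixel |original - reconstructed|, thresholded to a binary
    localization mask: 1 marks pixels the reconstruction disagrees with."""
    mask = []
    for row_o, row_r in zip(original, reconstructed):
        mask.append([1 if abs(o - r) > threshold else 0
                     for o, r in zip(row_o, row_r)])
    return mask

# A 3x3 grayscale patch whose center pixel was "manipulated": inpainting
# from the surrounding context disagrees with the tampered value.
original      = [[0.5, 0.5, 0.5], [0.5, 0.9, 0.5], [0.5, 0.5, 0.5]]
reconstructed = [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, 0.5]]

print(residual_mask(original, reconstructed))
# [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
```

The mask lights up exactly where the tampered pixel sits, which is the signal the paper's detector learns from, concentrated around editing boundaries.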

AAAI Conference 2026 Conference Paper

MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

  • Xinyue Yu
  • Youqing Fang
  • Pingyu Wu
  • Guoyang Ye
  • Wenbo Zhou
  • Weiming Zhang
  • Song Xiao

Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_t=3.86, sMOS_e=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
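
The abstract does not define HSAN's internals, but style-adaptive normalization schemes generally normalize content features and then re-scale/re-shift them with parameters derived from a style code (as in AdaIN). A minimal sketch under that assumption:

```python
import math

def style_adaptive_norm(features, gamma, beta, eps=1e-5):
    """Normalize features to zero mean / unit variance, then let a style
    code re-scale (gamma) and re-shift (beta) them. A generic sketch of
    adaptive normalization, not the paper's exact HSAN layer."""
    mean = sum(features) / len(features)
    var = sum((f - mean) ** 2 for f in features) / len(features)
    normed = [(f - mean) / math.sqrt(var + eps) for f in features]
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]

content = [1.0, 2.0, 3.0, 4.0]     # content-factor features (toy)
gamma   = [0.5, 0.5, 0.5, 0.5]     # scale from, e.g., an emotion code
beta    = [1.0, 1.0, 1.0, 1.0]     # shift from, e.g., a timbre code

out = style_adaptive_norm(content, gamma, beta)
print([round(x, 3) for x in out])
```

Because the normalization strips the content features' own statistics first, the style factors fully determine the output's scale and offset, which is what makes per-factor control composable.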

JAIR Journal 2025 Journal Article

Improving Local Search Algorithm for Pseudo Boolean Optimization

  • Yujiao Zhao
  • Yiyuan Wang
  • Yi Chu
  • Wenbo Zhou
  • Shaowei Cai
  • Minghao Yin

Pseudo-Boolean optimization (PBO) is widely used to model combinatorial optimization problems, especially in real-world applications. Despite its significant importance in both theory and applications, the performance of current PBO solvers is still limited. This paper develops a novel local search algorithm for PBO built on four main ideas. First, we design a new primary scoring function and a two-level selection strategy to evaluate all candidate variables. Second, we introduce a new weighting scheme to accurately guide the search toward more promising directions. Third, we propose a novel deep optimization strategy to perturb the search process. Fourth, an efficient solution-space exploration mechanism is applied to help the algorithm escape local optima. We conduct experiments on a broad range of public benchmarks, including three large-scale practical application benchmarks, two benchmarks from PB competitions, an integer linear programming optimization benchmark, a crafted combinatorial benchmark, and a combinatorial optimization knapsack benchmark, comparing our algorithm against twelve state-of-the-art competitors: seven recently proposed pure stochastic local search PBO solvers, a non-traditional stochastic local search combined with a complete oracle, two complete PB solvers, and two mixed integer programming (MIP) solvers. Our algorithm performs best on the three real-world benchmarks. On the other five benchmarks, it shows competitive performance against state-of-the-art competitors and significantly outperforms all other local search algorithms, indicating that it greatly advances the state of the art in local search for solving PBO.
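
To see why the paper's escape mechanisms matter, here is a bare-bones flip-based local search on a PB instance (a 0/1 knapsack). The penalized objective is a stand-in for the paper's scoring function; the example deliberately stalls at a local optimum that greedy flips alone cannot leave.

```python
def score(x, objective, constraints):
    """Penalized objective: maximize sum(c*x) with a heavy penalty for
    each violated constraint sum(a*x) <= bound."""
    val = sum(c * v for c, v in zip(objective, x))
    for coeffs, bound in constraints:
        excess = sum(a * v for a, v in zip(coeffs, x)) - bound
        if excess > 0:
            val -= 1000 * excess   # violation penalty dominates the objective
    return val

def greedy_flip_search(objective, constraints):
    """Best-improvement flip search: repeatedly take the single-variable
    flip that most increases the penalized score, until none improves."""
    n = len(objective)
    x = [0] * n
    while True:
        cur = score(x, objective, constraints)
        deltas = []
        for i in range(n):
            x[i] ^= 1                      # try flipping variable i
            deltas.append(score(x, objective, constraints) - cur)
            x[i] ^= 1                      # undo the trial flip
        best_i = max(range(n), key=lambda i: deltas[i])
        if deltas[best_i] <= 0:
            return x, cur                  # local optimum reached
        x[best_i] ^= 1

# Knapsack as a PB instance: maximize value, total weight <= 10.
values, weights = [6, 5, 4, 3], [5, 4, 3, 2]
sol, val = greedy_flip_search(values, [(weights, 10)])
print(sol, val)   # [1, 1, 0, 0] 11 — a local optimum; the global optimum is 13
```

The search greedily packs items 0 and 1 (value 11) and stops, while the global optimum {0, 2, 3} has value 13 — exactly the kind of trap the paper's weighting scheme and deep optimization strategy are designed to escape.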

NeurIPS Conference 2025 Conference Paper

T2V-OptJail: Discrete Prompt Optimization for Text-to-Video Jailbreak Attacks

  • Jiayang Liu
  • Siyuan Liang
  • Shiqian Zhao
  • Rong-Cheng Tu
  • Wenbo Zhou
  • Aishan Liu
  • Dacheng Tao
  • Siew Kei Lam

In recent years, fueled by the rapid advancement of diffusion models, text-to-video (T2V) generation models have achieved remarkable progress, with notable examples including Pika, Luma, Kling, and Open-Sora. Although these models exhibit impressive generative capabilities, they also expose significant security risks due to their vulnerability to jailbreak attacks, in which the models are manipulated to produce unsafe content such as pornography, violence, or discrimination. Existing works such as T2VSafetyBench provide preliminary benchmarks for safety evaluation but lack systematic methods for thoroughly exploring model vulnerabilities. To address this gap, we are the first to formalize the T2V jailbreak attack as a discrete optimization problem and propose a joint objective-based optimization framework, called T2V-OptJail. The framework pursues two key optimization goals: bypassing the built-in safety filtering mechanisms to increase the attack success rate, and preserving semantic consistency between the adversarial prompt and the unsafe input prompt, as well as between the generated video and the unsafe input prompt, to enhance content controllability. In addition, we introduce an iterative optimization strategy guided by prompt variants, where multiple semantically equivalent candidates are generated in each round and their scores are aggregated to robustly guide the search toward optimal adversarial prompts. We conduct large-scale experiments on several T2V models, covering both open-source models (e.g., Open-Sora) and commercial closed-source models (e.g., Pika, Luma, Kling). The experimental results show that the proposed method improves on the existing state-of-the-art (SoTA) method by 11.4% and 10.0% in attack success rate as assessed by GPT-4 and by human assessors, respectively, verifying its significant advantages in attack effectiveness and content control.
This study reveals the potential abuse risk of the semantic alignment mechanism in current T2V models and provides a basis for the design of subsequent jailbreak defenses.
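
The iterative variant-and-aggregate loop from the abstract reduces to a simple search skeleton. Everything below is a toy: the variant generator and both scorers (a keyword-overlap "semantic" score and a blocked-word "filter" score) are invented stand-ins for the paper's learned scoring, shown only to illustrate how a paraphrase can keep semantics while dodging a literal filter.

```python
def iterative_prompt_search(seed_prompt, make_variants, score_fns, rounds=3):
    """Each round: generate variants of the current best prompt, sum the
    scores from all scorers per candidate, keep the highest scorer."""
    best = seed_prompt
    for _ in range(rounds):
        candidates = make_variants(best)
        best = max(candidates + [best],
                   key=lambda p: sum(f(p) for f in score_fns))
    return best

target = {"ocean", "storm"}        # toy target semantics
banned = {"storm"}                 # toy keyword filter

def make_variants(p):
    return [p + " " + w for w in ["ocean", "tempest", "storm"]]

def semantic(p):
    # Toy semantic score: overlap with target terms ("tempest" ~ "storm").
    syn = {"tempest": "storm"}
    return len({syn.get(w, w) for w in p.split()} & target)

def bypass(p):
    # Toy filter score: heavily penalize literally banned words.
    return -5 * len(set(p.split()) & banned)

final = iterative_prompt_search("a calm", make_variants, [semantic, bypass],
                                rounds=2)
print(final)   # "a calm ocean tempest"
```

The search settles on "tempest" rather than the banned "storm": full semantic overlap with the target, zero filter penalty, which is the trade-off the paper's joint objective formalizes.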

AAAI Conference 2024 Conference Paper

FaceRSA: RSA-Aware Facial Identity Cryptography Framework

  • Zhongyi Zhang
  • Tianyi Wei
  • Wenbo Zhou
  • Hanqing Zhao
  • Weiming Zhang
  • Nenghai Yu

With the flourishing of the Internet, sharing one's photos or automatically processing faces with computer vision technology has become an everyday occurrence. While enjoying the convenience, people are also increasingly concerned about identity privacy. Therefore, some efforts introduced the concept of a "password" from traditional cryptography, such as RSA, into the face anonymization and deanonymization task to protect facial identity without compromising the usability of the face image. However, these methods either suffer from poor visual quality of the synthesis results or do not possess the full cryptographic properties, resulting in compromised security. In this paper, we present the first facial identity cryptography framework with full properties analogous to RSA. Our framework leverages the powerful generative capabilities of StyleGAN to achieve megapixel-level facial identity anonymization and deanonymization. Thanks to the strong semantic decoupling of StyleGAN's latent space, the identity encryption and decryption processes are performed in latent space by a well-designed password mapper that edits the latent code. Meanwhile, the password-related information is imperceptibly hidden in the edited latent code owing to the redundant nature of the latent space. To make our cryptographic framework possess all the properties analogous to RSA, we propose three types of loss functions: single anonymization loss, sequential anonymization loss, and associated anonymization loss. Extensive experiments and ablation analyses demonstrate the superiority of our method in terms of synthesis quality, preservation of identity-irrelevant attributes, deanonymization accuracy, and completeness of the properties analogous to RSA.
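
The edit-latent-code idea can be illustrated with a deliberately naive "password mapper": derive a deterministic offset from the password and add it to the latent code, subtracting it to deanonymize. The real mapper is a learned network over StyleGAN's latent space; this hash-based offset is purely illustrative.

```python
import hashlib

def password_offset(password, dim):
    """Toy password mapper: a deterministic latent-space offset derived
    from a SHA-256 digest of the password (illustration only)."""
    digest = hashlib.sha256(password.encode()).digest()
    return [digest[i % len(digest)] / 255.0 for i in range(dim)]

def anonymize(latent, password):
    off = password_offset(password, len(latent))
    return [z + o for z, o in zip(latent, off)]

def deanonymize(latent, password):
    off = password_offset(password, len(latent))
    return [z - o for z, o in zip(latent, off)]

z = [0.1, -0.3, 0.7, 0.0]              # stand-in latent code
enc = anonymize(z, "secret")
dec = deanonymize(enc, "secret")       # correct password: recovers z
wrong = deanonymize(enc, "guess")      # wrong password: stays anonymized
print(all(abs(a - b) < 1e-9 for a, b in zip(z, dec)))   # True
```

A correct password inverts the edit exactly; a wrong one yields a different (still anonymized) code — the basic anonymize/deanonymize contract the paper's losses enforce at photo-realistic quality.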

AAAI Conference 2022 Conference Paper

FInfer: Frame Inference-Based Deepfake Detection for High-Visual-Quality Videos

  • Juan Hu
  • Xin Liao
  • Jinwen Liang
  • Wenbo Zhou
  • Zheng Qin

Deepfake has ignited intense research interest in both academia and industry due to its potential security threats, and many countermeasures have been proposed to mitigate such risks. Current Deepfake detection methods achieve superior performance on low-visual-quality Deepfake media, which can be distinguished by obvious visual artifacts. However, with the development of deep generative models, the realism of Deepfake media has improved significantly and now poses a tough challenge to current detection models. In this paper, we propose a frame inference-based detection framework (FInfer) to solve the problem of high-visual-quality Deepfake detection. Specifically, we first learn the referenced representations of the faces in current and future frames. Then, the current frames' facial representations are used to predict the future frames' facial representations with an autoregressive model. Finally, a representation-prediction loss is devised to maximize the discriminability between real and fake videos. We demonstrate the effectiveness of our FInfer framework through information-theoretic analyses: the entropy and mutual information analyses indicate that the correlation between the predicted and referenced representations is higher in real videos than in high-visual-quality Deepfake videos. Extensive experiments demonstrate that our method is promising in terms of in-dataset detection performance, detection efficiency, and cross-dataset detection performance on high-visual-quality Deepfake videos.
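
The predict-then-correlate intuition can be shown on one-dimensional toy "representations": fit a first-order autoregressive predictor and measure how well its predictions track the actual next values. A smoothly evolving sequence (standing in for a real video) correlates far better with its own predictions than an erratic one (standing in for a fake); FInfer's learned representations and loss are, of course, much richer.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

def ar1_prediction_score(seq):
    """Fit x[t+1] ~ a * x[t] by least squares, then report the correlation
    between predicted and actual next values — a toy stand-in for the
    representation-prediction consistency FInfer measures."""
    past, future = seq[:-1], seq[1:]
    a = sum(p * f for p, f in zip(past, future)) / sum(p * p for p in past)
    preds = [a * p for p in past]
    return pearson(preds, future)

real_reprs = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]   # smoothly evolving (toy "real")
fake_reprs = [1.0, 5.0, 2.0, 6.0, 1.0, 7.0]     # erratic (toy "fake")

print(ar1_prediction_score(real_reprs) > ar1_prediction_score(fake_reprs))  # True
```

The gap between the two scores is the detection signal: real frame dynamics are predictable from the past, while high-quality fakes break that temporal consistency.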

AAAI Conference 2021 Conference Paper

Initiative Defense against Facial Manipulation

  • Qidong Huang
  • Jie Zhang
  • Wenbo Zhou
  • Weiming Zhang
  • Nenghai Yu

Benefiting from the development of generative adversarial networks (GANs), facial manipulation has recently achieved significant progress in both academia and industry. It has inspired an increasing number of entertainment applications but meanwhile incurs severe threats to individual privacy and even political security. To mitigate such risks, many countermeasures have been proposed. However, the great majority of methods are designed in a passive manner: they detect whether facial images or videos have been tampered with after their wide propagation. These detection-based methods have a fatal limitation: they only support ex-post forensics and cannot prevent malicious manipulations from being generated in the first place. To address this limitation, we propose a novel framework of initiative defense that degrades the performance of facial manipulation models controlled by malicious users. The basic idea is to actively inject imperceptible venom into target facial data before manipulation. To this end, we first imitate the target manipulation model with a surrogate model, and then devise a poison perturbation generator to obtain the desired venom. An alternating training strategy is further leveraged to train both the surrogate model and the perturbation generator. Two typical facial manipulation tasks, face attribute editing and face reenactment, are considered in our initiative defense framework. Extensive experiments demonstrate the effectiveness and robustness of our framework in different settings. Finally, we hope this work can shed some light on initiative countermeasures against more adversarial scenarios.
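
The venom-via-surrogate idea resembles gradient-based adversarial perturbation. The sketch below is an assumption-laden simplification: the surrogate is a linear map with a known analytic gradient, and the "venom" is a single FGSM-style signed-gradient step, whereas the paper trains a dedicated perturbation generator alternately with the surrogate.

```python
def surrogate(x, w):
    """Stand-in surrogate for the manipulation model: a linear map."""
    return sum(wi * xi for wi, xi in zip(w, x))

def venom(x, w, y, eps=0.05):
    """One signed-gradient step on the surrogate's squared-error loss:
    nudge the input so the manipulation model's output degrades."""
    err = surrogate(x, w) - y
    grad = [2 * err * wi for wi in w]          # d/dx of (w.x - y)^2
    sign = lambda g: (g > 0) - (g < 0)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]

w = [0.6, -0.4, 0.2]      # surrogate weights (toy)
x = [1.0, 0.5, -1.0]      # clean facial feature vector (toy)
y = 0.0                   # the manipulation model's intended output

loss = lambda x_: (surrogate(x_, w) - y) ** 2
x_adv = venom(x, w, y)
print(loss(x_adv) > loss(x))   # True: the poisoned input degrades the surrogate
```

The perturbation is small (0.05 per coordinate, hence "imperceptible" in the toy) yet pushes the surrogate further from its target output — the initiative-defense effect, transferred to the real manipulation model insofar as the surrogate imitates it.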

AAAI Conference 2020 Conference Paper

Model Watermarking for Image Processing Networks

  • Jie Zhang
  • DongDong Chen
  • Jing Liao
  • Han Fang
  • Weiming Zhang
  • Wenbo Zhou
  • Hao Cui
  • Nenghai Yu

Deep learning has achieved tremendous success in numerous industrial applications. As training a good model often needs massive high-quality data and computational resources, learned models carry significant business value. However, these valuable deep models are exposed to a huge risk of infringement. For example, if an attacker has full information about a target model, including the network structure and weights, the model can easily be fine-tuned on new datasets. Even if the attacker can only access the output of the target model, he or she can still train a similar surrogate model by generating a large number of input-output training pairs. How to protect the intellectual property of deep models is a very important but seriously under-researched problem; the few recent attempts address classification network protection only. In this paper, we propose the first model watermarking framework for protecting image processing models. To achieve this goal, we leverage the spatial invisible watermarking mechanism. Specifically, given a black-box target model, a unified and invisible watermark is hidden in its outputs, which can be regarded as a special task-agnostic barrier. In this way, when the attacker trains a surrogate model using the input-output pairs of the target model, the hidden watermark is learned and can be extracted afterward. To support watermarks ranging from binary bits to high-resolution images, both traditional and deep spatial invisible watermarking mechanisms are considered. Experiments demonstrate the robustness of the proposed watermarking mechanism, which can resist surrogate models trained with different network structures and objective functions. Beyond deep models, the proposed method is also easily extended to protect data and traditional image processing algorithms.
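
A classical spatial invisible watermark — the "traditional" variant the abstract mentions — can be as simple as hiding bits in the least-significant bits of the output pixels. This shows only that embed/extract idea; the paper's deep variant learns the embedding and must additionally survive being distilled into a surrogate model.

```python
def embed_lsb(pixels, bits):
    """Hide a bit string in the least-significant bits of output pixels."""
    out = pixels[:]
    for i, b in enumerate(bits):
        out[i] = (out[i] & ~1) | b     # clear LSB, then set it to the bit
    return out

def extract_lsb(pixels, n):
    """Read back the first n hidden bits."""
    return [p & 1 for p in pixels[:n]]

model_output = [200, 131, 54, 77, 90, 63]   # toy image-processing output
watermark = [1, 0, 1, 1]

marked = embed_lsb(model_output, watermark)
print(extract_lsb(marked, 4))               # [1, 0, 1, 1]
print(max(abs(a - b) for a, b in zip(marked, model_output)))  # 1
```

Each pixel moves by at most one gray level, so the marked output is visually indistinguishable from the original — the "invisible barrier" that a surrogate trained on these outputs then inherits.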