Arrow Research search

Author name cluster

Nenghai Yu

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

54 papers
2 author rows

Possible papers (54)

AAAI Conference 2026 Conference Paper

EARG-Net: Edge-Aware Reconstruction-Guided Network for Image Manipulation Detection and Localization

  • Yanpu Yu
  • Zhaoxin Shi
  • Hanqing Zhao
  • Tianyi Wei
  • Wenbo Zhou
  • Nenghai Yu

Recent advances in image editing tools, particularly those used in content-aware retouching and object-level manipulation, have raised significant concerns regarding the authenticity of digital images. While many Image Manipulation Detection and Localization (IMDL) methods have been proposed, they often struggle with subtle forgeries, intricate boundary artifacts, and manipulations generated by unseen editing techniques. In this work, we propose a novel edge-aware framework that leverages the strong natural image priors of pre-trained inpainting models to harmonize manipulated regions. By guiding the inpainting process with generated edge-aware masks, our method reconstructs tampered areas using surrounding context, yielding perceptually coherent results. The pixel-wise residual between the original and reconstructed images reveals manipulation-sensitive inconsistencies—particularly around editing boundaries—thereby enabling accurate and generalizable detection and localization. Extensive experiments across multiple benchmarks demonstrate that our approach achieves state-of-the-art performance, especially in challenging scenarios involving realistic and finely retouched image forgeries.
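
For intuition, a minimal, hypothetical sketch of the reconstruct-and-compare idea (the naive neighborhood fill stands in for the paper's pre-trained inpainting model; all names are illustrative):

    import numpy as np

    def naive_inpaint(img, mask, iters=50):
        """Stand-in for a pre-trained inpainting model: iteratively average
        each masked pixel with its 4-neighborhood until it blends in."""
        out = img.astype(np.float64).copy()
        m = mask.astype(bool)
        for _ in range(iters):
            avg = (np.roll(out, 1, 0) + np.roll(out, -1, 0) +
                   np.roll(out, 1, 1) + np.roll(out, -1, 1)) / 4.0
            out[m] = avg[m]
        return out

    rng = np.random.default_rng(0)
    clean = rng.uniform(0, 1, (64, 64))
    tampered = clean.copy()
    tampered[20:30, 20:30] += 0.4              # simulated manipulation
    mask = np.zeros_like(clean, bool)
    mask[18:32, 18:32] = True                  # edge-aware mask around the edit
    recon = naive_inpaint(tampered, mask)
    residual = np.abs(tampered - recon)        # large inside/around the edit
    print(residual[mask].mean(), residual[~mask].mean())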

AAAI Conference 2026 Conference Paper

MagicPaint: Operate Anything for Image Inpainting with Diffusion Model

  • Qinhong Yang
  • DongDong Chen
  • Qi Chu
  • Tao Gong
  • Qiankun Liu
  • Zhentao Tan
  • Xulin Li
  • Huamin Feng

Recent diffusion-based models have significantly improved inpainting quality. However, existing methods struggle with multi-task inpainting due to conflicting optimization objectives, and current datasets are typically limited to task-specific scenarios, hindering joint training. To address these challenges, we propose MagicPaint, a unified diffusion-based inpainting model that supports object addition, removal, and unconditional inpainting across both text and image modalities. MagicPaint semantically decouples operation types and target content through learnable tokens in the MMToken Module, effectively reconciling conflicting optimization objectives and enabling robust multi-task, multi-modal inpainting. Besides, a novel inpainting paradigm named MagicMask encodes operating intent directly into the mask and applies a mask loss for spatially precise supervision. In addition, existing inpainting datasets are insufficient for multi-task and multi-modal scenarios, limiting the capability of inpainting models. Thus, we further introduce a new dataset comprising 2.1M image tuples. It is specifically designed to support diverse inpainting scenarios and significantly improves upon existing datasets, particularly for object removal. Through efforts from both the model and data perspectives, MagicPaint enables users to operate anything: add, remove, or inpaint content specified through either text or image modalities, in a seamless and unified manner. Extensive experiments demonstrate that MagicPaint achieves state-of-the-art performance across three key tasks (i.e., text-guided addition, image-guided addition, and object removal) and produces outputs with superior visual consistency and contextual fidelity compared to existing methods.

TMLR Journal 2025 Journal Article

©Plug-in Authorization for Human Copyright Protection in Text-to-Image Model

  • Chao Zhou
  • Huishuai Zhang
  • Jiang Bian
  • Weiming Zhang
  • Nenghai Yu

This paper addresses the contentious issue of copyright infringement in images generated by text-to-image models, sparking debates among AI developers, content creators, and legal entities. State-of-the-art models create high-quality content without crediting original creators, causing concern in the artistic community and among model providers. To mitigate this, we propose the ©Plug-in Authorization framework, introducing three operations: addition, extraction, and combination. Addition involves training a ©plug-in for a specific copyright, facilitating proper credit attribution. Extraction allows creators to reclaim copyright from infringing models, and combination enables users to merge different ©plug-ins. These operations act as permits, incentivizing fair use and providing flexibility in authorization. We present innovative approaches, "Reverse LoRA" for extraction and "EasyMerge" for seamless combination. Experiments in artist-style replication and cartoon IP recreation demonstrate ©plug-ins' effectiveness, offering a valuable solution for human copyright protection in the age of generative AI. The code is available at https://github.com/zc1023/-Plug-in-Authorization.git
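
As a toy illustration of the addition and combination operations, treating each ©plug-in as a low-rank weight delta (a sketch of generic LoRA arithmetic, not the paper's Reverse LoRA algorithm):

    import numpy as np

    d, r = 8, 2
    rng = np.random.default_rng(1)
    W = rng.normal(size=(d, d))                 # a base text-to-image weight (toy)
    # two ©plug-ins, each a low-rank delta B @ A trained for one protected style
    plug1 = (rng.normal(size=(d, r)), rng.normal(size=(r, d)))
    plug2 = (rng.normal(size=(d, r)), rng.normal(size=(r, d)))

    def combine(W, plugins, weights):
        """'Combination': merge several authorized plug-ins into the base weight."""
        out = W.copy()
        for (B, A), a in zip(plugins, weights):
            out += a * (B @ A)                  # 'addition' is the one-plug-in case
        return out

    W_authorized = combine(W, [plug1, plug2], weights=[1.0, 0.5])
    # removing known plug-ins recovers the base weight exactly (toy analogue of
    # extraction; the paper's Reverse LoRA handles the unknown-delta case)
    W_restored = W_authorized - 1.0 * (plug1[0] @ plug1[1]) - 0.5 * (plug2[0] @ plug2[1])
    print(np.allclose(W_restored, W))           # True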

IJCAI Conference 2025 Conference Paper

BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models

  • Xiuwei Shang
  • Guoqiang Chen
  • Shaoyin Cheng
  • Benlong Wu
  • Li Hu
  • Gangyang Li
  • Weiming Zhang
  • Nenghai Yu

Binary analysis is crucial for software security, offering insights into compiled programs without source code. As large language models (LLMs) excel at language tasks, their potential for decoding complex binary data structures is growing. However, the lack of standardized benchmarks hinders their evaluation and progress in this domain. To bridge this gap, we introduce BinMetric, the first comprehensive benchmark designed specifically to evaluate LLM performance on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, including decompilation, code summarization, etc., which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations. The findings indicate that while LLMs show strong potential, challenges still exist, particularly in the areas of precise binary lifting and assembly synthesis. In summary, BinMetric makes a significant step forward in measuring the binary analysis capabilities of LLMs, establishing a new benchmark leaderboard, and our study offers valuable insights for advancing LLMs in software security.

ICML Conference 2025 Conference Paper

De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks

  • Wei Fan
  • Kejiang Chen
  • Chang Liu 0089
  • Weiming Zhang 0001
  • Nenghai Yu

The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) purify the perturbed speech; (2) refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC. The code and audio samples are available at https://de-antifake.github.io.

NeurIPS Conference 2025 Conference Paper

LD-RoViS: Training-free Robust Video Steganography for Deterministic Latent Diffusion Model

  • Xiangkun Wang
  • Kejiang Chen
  • Lincong Li
  • Weiming Zhang
  • Nenghai Yu

Existing video steganography methods primarily embed secret information by modifying video content in the spatial or compressed domains. However, such methods are prone to distortion drift and are easily detected by steganalysis. Generative steganography, which avoids direct modification of the cover data, offers a promising alternative. Despite recent advances, most generative steganography studies focus on images and are difficult to extend to videos because of compression-induced distortions and the unique architecture of video generation models. To address these challenges, we propose LD-RoViS, a training-free and robust video steganography framework for the deterministic latent diffusion model. By modulating implicit conditional parameters during the diffusion process, LD-RoViS constructs a dedicated steganographic channel. Additionally, we introduce a novel multi-mask mechanism to mitigate errors caused by video compression and post-processing. The experimental results demonstrate that LD-RoViS can embed approximately 12,000 bits of data into a 5-second video with an extraction accuracy exceeding 99%. Our implementation is available at https://github.com/xiangkun1999/LD-RoViS.

AAAI Conference 2025 Conference Paper

Rethinking Masked Data Reconstruction Pretraining for Strong 3D Action Representation Learning

  • Tao Gong
  • Qi Chu
  • Bin Liu
  • Nenghai Yu

In 3D human action recognition, limited supervised data makes it challenging to fully tap the modeling potential of powerful networks such as transformers. As a result, researchers have been actively investigating effective self-supervised pre-training strategies. For example, MAMP shows that instead of following the prevalent masked joint reconstruction, explicit masked motion reconstruction is key to learning effective feature representations for 3D action recognition. However, we find that with a simple and effective change to the reconstruction target, masked joint reconstruction can achieve the same results as masked motion reconstruction. The devil is in the special characteristics of 3D skeleton data and the normalization process of the training targets: all effective information in the targets must be preserved during normalization. Besides, considering that masked data reconstruction focuses more on learning local relations in the input data to fulfill the reconstruction task, rather than modeling relations among samples, we further employ contrastive learning to learn more discriminative 3D action representations. We show that contrastive learning can consistently boost the performance of models pre-trained by masked joint prediction under various settings, especially in the semi-supervised setting with a very limited number of labeled samples. Extensive experiments on the NTU-60, NTU-120, and PKU-MMD datasets show that the proposed pre-training strategy achieves state-of-the-art results without bells and whistles.

NeurIPS Conference 2025 Conference Paper

STEAD: Robust Provably Secure Linguistic Steganography with Diffusion Language Model

  • Yuang Qi
  • Na Zhao
  • Qiyi Yao
  • Benlong Wu
  • Weiming Zhang
  • Nenghai Yu
  • Kejiang Chen

Recent provably secure linguistic steganography (PSLS) methods rely on mainstream autoregressive language models (ARMs) to address a historically challenging task: disguising covert communication as "innocuous" natural language communication. However, because ARMs generate tokens sequentially, the stegotext generated by ARM-based PSLS methods suffers severe error propagation once it is altered, making existing methods unusable under an active tampering attack. To address this, we propose a robust, provably secure linguistic steganography with diffusion language models (DLMs). Unlike ARMs, DLMs can generate text in a partially parallel manner, allowing us to find robust positions for steganographic embedding that can be combined with error-correcting codes. Furthermore, we introduce error correction strategies, including pseudo-random error correction and neighborhood search correction, during steganographic extraction. Theoretical proof and experimental results demonstrate that our method is secure and robust. It can resist token ambiguity in stegotext segmentation and, to some extent, withstand token-level insertion, deletion, and substitution attacks.

NeurIPS Conference 2025 Conference Paper

StegoZip: Enhancing Linguistic Steganography Payload in Practice with Large Language Models

  • Jun Jiang
  • Zijin Yang
  • Weiming Zhang
  • Nenghai Yu
  • Kejiang Chen

Generative steganography has emerged as an active research area, yet its practical systems are constrained by the inherent secret payload limitation caused by low entropy in generating stego texts. This payload limitation necessitates the use of lengthy stego texts or frequent transmissions, which increases the risk of suspicion by adversaries. Previous studies have mainly focused on payload enhancement through optimized entropy utilization while overlooking the crucial role of secret message processing. To address this gap, we propose StegoZip, a framework that leverages large language models to optimize secret message processing. StegoZip consists of two core components: semantic redundancy pruning and index-based compression coding. The former dynamically prunes the secret message to extract a low-semantic representation, whereas the latter further compresses it into compact binary codes. When integrated with state-of-the-art steganographic methods under lossless decoding, StegoZip achieves 2.5× the payload of the baselines while maintaining comparable processing time in practice. This enhanced payload significantly improves covertness by mitigating the risks associated with frequent transmissions while maintaining provable content security.

NeurIPS Conference 2025 Conference Paper

T2SMark: Balancing Robustness and Diversity in Noise-as-Watermark for Diffusion Models

  • Jindong Yang
  • Han Fang
  • Weiming Zhang
  • Nenghai Yu
  • Kejiang Chen

Diffusion models have advanced rapidly in recent years, producing high-fidelity images while raising concerns about intellectual property protection and the misuse of generative AI. Image watermarking for diffusion models, particularly Noise-as-Watermark (NaW) methods, encodes the watermark as a specific standard Gaussian noise vector used for image generation, embedding the information seamlessly while maintaining image quality. For detection, the generation process is inverted to recover the initial noise vector containing the watermark before extraction. However, existing NaW methods struggle to balance watermark robustness with generation diversity. Some methods achieve strong robustness by heavily constraining initial noise sampling, which degrades user experience, while others preserve diversity but prove too fragile for real-world deployment. To address this issue, we propose T2SMark, a two-stage watermarking scheme based on Tail-Truncated Sampling (TTS). Unlike prior methods that simply map bits to positive or negative values, TTS enhances robustness by embedding bits exclusively in the reliable tail regions while randomly sampling the central zone to preserve the latent distribution. Our two-stage framework then ensures sampling diversity by integrating a randomly generated session key into both encryption pipelines. We evaluate T2SMark on diffusion models with both U-Net and DiT backbones. Extensive experiments show that it achieves an optimal balance between robustness and diversity.
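
A minimal sketch of the tail-truncated sampling idea (the session key, the two-stage encryption pipeline, and the actual threshold choice are omitted; assumes scipy):

    import numpy as np
    from scipy.stats import truncnorm

    def tts_sample(bits, tau=1.0, seed=0):
        """Tail-Truncated Sampling sketch: each payload bit is drawn from one
        tail of N(0,1) beyond +/-tau; positions without payload sample the
        central zone so the latent distribution keeps its diversity."""
        rng = np.random.default_rng(seed)
        z = np.empty(len(bits))
        for i, b in enumerate(bits):
            if b == 1:                                  # positive tail [tau, inf)
                z[i] = truncnorm.rvs(tau, np.inf, random_state=rng)
            elif b == 0:                                # negative tail (-inf, -tau]
                z[i] = truncnorm.rvs(-np.inf, -tau, random_state=rng)
            else:                                       # no payload at this position
                z[i] = truncnorm.rvs(-tau, tau, random_state=rng)
        return z

    z = tts_sample([1, 0, None, 1])
    decoded = [1 if v > 0 else 0 for v in z]    # sign decoding after inversion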

IJCAI Conference 2025 Conference Paper

Towards Anytime Retrieval: A Benchmark for Anytime Person Re-Identification

  • Xulin Li
  • Yan Lu
  • Bin Liu
  • Jiaze Li
  • Qinhong Yang
  • Tao Gong
  • Qi Chu
  • Mang Ye

In real applications, person re-identification (ReID) is expected to retrieve the target person at any time, including both daytime and nighttime, and over both short and long terms. However, existing ReID tasks and datasets cannot meet this requirement, as they are constrained by the time spans they cover and only provide training and evaluation for specific scenarios. Therefore, we investigate a new task called Anytime Person Re-identification (AT-ReID), which aims to achieve effective retrieval in multiple scenarios based on variations in time. To address the AT-ReID problem, we collect the first large-scale dataset, AT-USTC, which contains 135k images of individuals wearing multiple outfits captured by RGB and IR cameras. Our data collection spans an entire year, and 270 volunteers were photographed 29.1 times on average across different dates or scenes, 4-15 times more than in current datasets, providing the conditions for follow-up investigation of AT-ReID. Further, to tackle the new challenge of multi-scenario retrieval, we propose a unified model named Uni-AT, which comprises a multi-scenario ReID (MS-ReID) framework for scenario-specific feature learning, a Mixture-of-Attribute-Experts (MoAE) module to alleviate inter-scenario interference, and a Hierarchical Dynamic Weighting (HDW) strategy to ensure balanced training across all scenarios. Extensive experiments show that our model achieves satisfactory results and exhibits excellent generalization to all scenarios.

AAAI Conference 2025 Conference Paper

Training-free Open-Vocabulary Semantic Segmentation via Diverse Prototype Construction and Sub-region Matching

  • Xuanpu Zhao
  • Dianmo Sheng
  • Zhentao Tan
  • Zhiwei Zhao
  • Tao Gong
  • Qi Chu
  • Bin Liu
  • Nenghai Yu

Open-vocabulary semantic segmentation (OVSS) aims to segment images into arbitrary categories specified by class labels. While previous approaches relied on extensive image-text pairs or dense semantic annotations, recent training-free methods attempt to overcome these limitations by constructing semantic prototypes in a construction stage and performing image-to-image matching (i.e., prototype matching) during testing. However, these methods often struggle to effectively capture the visual characteristics of categories and fail to utilize local features during prototype matching. To deal with these problems, we propose a novel training-free framework for OVSS that constructs diverse prototypes and performs fine-grained sub-region matching. Specifically, our method leverages Large Language Models (LLMs) to guide support image generation with descriptions of different category attributes, and employs coarse-to-fine clustering to obtain diverse and robust part-level prototypes in the construction stage. During testing, we propose a sub-region matching method, which assigns part-level prototypes to sub-regions using optimal transport, to fully utilize local image features among part-level prototypes. Extensive experiments demonstrate the effectiveness of our method and show that it achieves state-of-the-art performance, outperforming previous methods across five datasets.
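
A compact sketch of the optimal-transport assignment step, using entropic (Sinkhorn) OT between sub-region features and part-level prototypes (uniform marginals and the final scoring rule are illustrative assumptions):

    import numpy as np

    def sinkhorn(cost, iters=50, eps=0.1):
        """Entropic OT: soft assignment between rows (sub-regions) and
        columns (part-level prototypes) with uniform marginals."""
        K = np.exp(-cost / eps)
        m, n = cost.shape
        r, c = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
        v = np.ones(n)
        for _ in range(iters):
            u = r / (K @ v)
            v = c / (K.T @ u)
        return u[:, None] * K * v[None, :]     # transport plan

    rng = np.random.default_rng(0)
    subregions = rng.normal(size=(6, 32))      # pooled local image features
    prototypes = rng.normal(size=(4, 32))      # part-level class prototypes
    subregions /= np.linalg.norm(subregions, axis=1, keepdims=True)
    prototypes /= np.linalg.norm(prototypes, axis=1, keepdims=True)
    cost = 1.0 - subregions @ prototypes.T     # cosine distance
    plan = sinkhorn(cost)
    score = (plan * (1.0 - cost)).sum()        # class score from matched parts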

NeurIPS Conference 2025 Conference Paper

Vector Database Watermarking

  • Zhiwen Ren
  • Wei Fan
  • Qiyi Yao
  • Jing Qiu
  • Weiming Zhang
  • Nenghai Yu

Vector databases support machine learning tasks through Approximate Nearest Neighbour (ANN) query functionality, making them highly valuable digital assets. However, they also face security threats such as unauthorized replication. By embedding hidden information, watermarking technology can be used for ownership authentication. This paper introduces a watermarking scheme specifically designed for vector databases. The scheme consists of four steps: generating identifiers, grouping, cryptographic mapping, and modification. Since watermark embedding requires modifying certain vectors, it may negatively affect ANN query results. Further investigation reveals that in the widely used Hierarchical Navigable Small World (HNSW) indexing structure, heuristic edge selection and pruning strategies leave some vectors with few or even no edges. These vectors exhibit significantly lower query frequencies than others, so modifying them has less impact on query results. Based on this observation, we propose the Transparent Vector Priority (TVP) watermarking scheme, which prioritizes embedding the watermark in these low-query-frequency "transparent" vectors to minimize the impact of watermark embedding on query results. Experimental results show that, compared to the most effective relevant watermarking schemes, the TVP scheme reduces the number of missed and false queries by approximately 75%.
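
A simplified sketch of the transparent-vector idea (identifier generation, grouping details, and detection are omitted; random degree counts stand in for real HNSW edge statistics, and the keyed mapping is illustrative):

    import hashlib, hmac
    import numpy as np

    def keyed_index(vec_id, key, n):
        """Cryptographic mapping sketch: HMAC maps a vector id to an index."""
        tag = hmac.new(key, str(vec_id).encode(), hashlib.sha256).digest()
        return tag[0] % n

    def embed_tvp(vectors, degrees, bits, key, frac=0.1, delta=1e-3):
        """TVP-style sketch: embed watermark bits only into the lowest-degree
        ('transparent') vectors, which are queried least often."""
        order = np.argsort(degrees)                     # fewest edges first
        carriers = order[: max(1, int(frac * len(order)))]
        wm = vectors.copy()
        for vid in carriers:
            bit = bits[keyed_index(vid, key, len(bits))]
            dim = keyed_index(vid, key + b"d", vectors.shape[1])
            # encode the bit in the sign of a tiny keyed perturbation
            wm[vid, dim] += delta if bit else -delta
        return wm, carriers

    rng = np.random.default_rng(0)
    vecs = rng.normal(size=(1000, 64)).astype(np.float32)
    deg = rng.integers(0, 32, size=1000)                # per-node edge counts
    wm_vecs, carriers = embed_tvp(vecs, deg, [1, 0, 1, 1], b"secret-key")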

ICML Conference 2024 Conference Paper

AquaLoRA: Toward White-box Protection for Customized Stable Diffusion Models via Watermark LoRA

  • Weitao Feng 0001
  • Wenbo Zhou 0004
  • Jiyan He
  • Jie Zhang 0073
  • Tianyi Wei
  • Guanlin Li
  • Tianwei Zhang 0004
  • Weiming Zhang 0001

Diffusion models have achieved remarkable success in generating high-quality images. Recently, open-source models represented by Stable Diffusion (SD) are thriving and accessible for customization, giving rise to a vibrant community of creators and enthusiasts. However, the widespread availability of customized SD models has led to copyright concerns, like unauthorized model distribution and unconsented commercial use. To address this, recent works aim to let SD models output watermarked content for post-hoc forensics. Unfortunately, none of them can achieve the challenging white-box protection, wherein the malicious user can easily remove or replace the watermarking module to defeat subsequent verification. For this, we propose AquaLoRA as the first implementation under this scenario. Briefly, we merge watermark information into the U-Net of Stable Diffusion models via a watermark Low-Rank Adaptation (LoRA) module in a two-stage manner. For the watermark LoRA module, we devise a scaling matrix to achieve flexible message updates without retraining. To guarantee fidelity, we design Prior Preserving Fine-Tuning (PPFT) to ensure watermark learning with minimal impact on the model distribution, validated by proofs. Finally, we conduct extensive experiments and ablation studies to verify our design. Our code is available at github.com/Georgefwt/AquaLoRA.
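
A toy sketch of the scaling-matrix idea: the message only sets a diagonal scale between trained LoRA factors, so updating the watermark needs no retraining (the bit-to-scale mapping shown is an illustrative assumption, not the paper's design):

    import numpy as np

    d, r = 16, 4
    rng = np.random.default_rng(0)
    W = rng.normal(size=(d, d))                 # a U-Net weight (toy)
    B = rng.normal(size=(d, r))                 # trained watermark LoRA factors
    A = rng.normal(size=(r, d))

    def scale_from_message(bits):
        """Map an r-bit message to a diagonal scaling matrix; changing the
        message changes only this diagonal, never A or B."""
        s = np.array([1.0 if b else -1.0 for b in bits[:r]])
        return np.diag(s)

    W_marked = W + B @ scale_from_message([1, 0, 1, 1]) @ A
    W_remarked = W + B @ scale_from_message([0, 0, 1, 0]) @ A   # new message, same A/B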

ICLR Conference 2024 Conference Paper

Boosting Vanilla Lightweight Vision Transformers via Re-parameterization

  • Zhentao Tan
  • Xiaodan Li
  • Yue Wu
  • Qi Chu 0001
  • Le Lu 0001
  • Nenghai Yu
  • Jieping Ye

Large-scale Vision Transformers have achieved promising performance on downstream tasks through feature pre-training. However, the performance of vanilla lightweight Vision Transformers (ViTs) is still far from satisfactory compared to that of recent lightweight CNNs or hybrid networks. In this paper, we aim to unlock the potential of vanilla lightweight ViTs by adapting the widely-used re-parameterization technology to ViTs, improving learning ability during training without increasing the inference cost. The main challenge comes from the fact that CNNs pair naturally with re-parameterization over convolution and batch normalization, while vanilla Transformer architectures are mainly comprised of linear and layer normalization layers. We propose to incorporate a nonlinear ensemble into linear layers by expanding the depth of the linear layers with batch normalization and fusing multiple linear features with hierarchical representation ability through a pyramid structure. We also discover and solve a new transformer-specific distribution rectification problem caused by multi-branch re-parameterization. Finally, we propose our Two-Dimensional Re-parameterized Linear module (TDRL) for ViTs. Under the popular self-supervised pre-training and supervised fine-tuning strategy, TDRL can be used in both stages to enhance both generic and task-specific representations. Experiments demonstrate that our proposed method not only boosts the performance of vanilla ViT-Tiny on various vision tasks to new state-of-the-art (SOTA) levels but also shows promising generality on other networks. Code will be available.
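
The enabling primitive behind such schemes is folding normalization into a linear layer at inference. A minimal sketch of linear+BN fusion (TDRL's depth expansion and pyramid fusion are not shown; assumes PyTorch):

    import torch
    import torch.nn as nn

    def fuse_linear_bn(linear: nn.Linear, bn: nn.BatchNorm1d) -> nn.Linear:
        """Fold a BatchNorm that follows a Linear layer into one Linear,
        so training-time branches add no inference cost."""
        std = torch.sqrt(bn.running_var + bn.eps)
        scale = bn.weight / std                        # per-feature factor
        fused = nn.Linear(linear.in_features, linear.out_features)
        fused.weight.data = linear.weight * scale[:, None]
        fused.bias.data = (linear.bias - bn.running_mean) * scale + bn.bias
        return fused

    lin, bn = nn.Linear(8, 8), nn.BatchNorm1d(8)
    seq = nn.Sequential(lin, bn).eval()                # eval: use running stats
    x = torch.randn(4, 8)
    print(torch.allclose(seq(x), fuse_linear_bn(lin, bn)(x), atol=1e-6))  # True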

AAAI Conference 2024 Conference Paper

Data-Free Hard-Label Robustness Stealing Attack

  • Xiaojian Yuan
  • Kejiang Chen
  • Wen Huang
  • Jie Zhang
  • Weiming Zhang
  • Nenghai Yu

The popularity of Machine Learning as a Service (MLaaS) has led to increased concerns about Model Stealing Attacks (MSA), which aim to craft a clone model by querying MLaaS. Currently, most research on MSA assumes that MLaaS can provide soft labels and that the attacker has a proxy dataset with a similar distribution. However, this fails to encapsulate the more practical scenario where only hard labels are returned by MLaaS and the data distribution remains elusive. Furthermore, most existing work focuses solely on stealing the model accuracy, neglecting the model robustness, while robustness is essential in security-sensitive scenarios, e.g., face-scan payment. Notably, improving model robustness often necessitates the use of expensive techniques such as adversarial training, thereby making stealing robustness an even more lucrative prospect. In response to these identified gaps, we introduce a novel Data-Free Hard-Label Robustness Stealing (DFHL-RS) attack in this paper, which enables the stealing of both model accuracy and robustness by simply querying hard labels of the target model without the help of any natural data. Comprehensive experiments demonstrate the effectiveness of our method. The clone model achieves a clean accuracy of 77.86% and a robust accuracy of 39.51% against AutoAttack, which are only 4.71% and 8.40% lower than the target model on the CIFAR-10 dataset, significantly exceeding the baselines. Our code is available at: https://github.com/LetheSec/DFHL-RS-Attack.

NeurIPS Conference 2024 Conference Paper

DPIC: Decoupling Prompt and Intrinsic Characteristics for LLM Generated Text Detection

  • Xiao Yu
  • Yuang Qi
  • Kejiang Chen
  • Guoqiang Chen
  • Xi Yang
  • Pengyuan Zhu
  • Xiuwei Shang
  • Weiming Zhang

Large language models (LLMs) have the potential to generate texts that pose risks of misuse, such as plagiarism, planting fake reviews on e-commerce platforms, or creating inflammatory false tweets. Consequently, detecting whether a text is generated by LLMs has become increasingly important. Existing high-quality detection methods usually require access to the interior of the model to extract intrinsic characteristics. However, since we do not have access to the interior of a black-box model, we must resort to surrogate models, which impacts detection quality. To achieve high-quality detection of black-box models, we would like to extract deep intrinsic characteristics of black-box model generated texts. We view the generation process as a coupled process of the prompt and the intrinsic characteristics of the generative model. Based on this insight, we propose DPIC, a method that decouples prompt and intrinsic characteristics for LLM-generated text detection. Specifically, given a candidate text, DPIC employs an auxiliary LLM to reconstruct the prompt corresponding to the candidate text, then uses the prompt to regenerate text with the auxiliary LLM, which aligns the candidate text and the regenerated text with their respective prompts. The similarity between the candidate text and the regenerated text is then used as a detection feature, eliminating the prompt from the detection process and allowing the detector to focus on the intrinsic characteristics of the generative model. Compared to the baselines, DPIC achieves average improvements of 6.76% and 2.91% in detecting texts from different domains generated by GPT4 and Claude3, respectively.
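
A schematic sketch of the decoupling pipeline; `aux_llm` is a hypothetical callable wrapping any auxiliary LLM, and difflib stands in for the learned similarity feature used in the actual method:

    from difflib import SequenceMatcher

    def dpic_score(candidate, aux_llm):
        """Reconstruct a prompt for the candidate, regenerate from it, and
        measure candidate-vs-regeneration similarity as the feature."""
        prompt = aux_llm("Infer the prompt that produced this text:\n" + candidate)
        regenerated = aux_llm(prompt)          # regenerate from the inferred prompt
        # candidate and regeneration now share a prompt, so their similarity
        # mostly reflects the generator's intrinsic characteristics
        return SequenceMatcher(None, candidate, regenerated).ratio()

    # a score above a threshold tuned on held-out data suggests machine text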

AAAI Conference 2024 Conference Paper

FaceRSA: RSA-Aware Facial Identity Cryptography Framework

  • Zhongyi Zhang
  • Tianyi Wei
  • Wenbo Zhou
  • Hanqing Zhao
  • Weiming Zhang
  • Nenghai Yu

With the flourishing of the Internet, sharing one's photos or automatically processing faces with computer vision technology has become an everyday occurrence. While enjoying the convenience, concern for identity privacy is also emerging. Therefore, some efforts introduced the concept of a "password" from traditional cryptography, such as RSA, into the face anonymization and deanonymization task to protect facial identity without compromising the usability of the face image. However, these methods either suffer from the poor visual quality of the synthesis results or do not possess full cryptographic properties, resulting in compromised security. In this paper, we present the first facial identity cryptography framework with full properties analogous to RSA. Our framework leverages the powerful generative capabilities of StyleGAN to achieve megapixel-level facial identity anonymization and deanonymization. Thanks to the strong semantic decoupling of StyleGAN's latent space, the identity encryption and decryption processes are performed in latent space by a well-designed password mapper that edits the latent code. Meanwhile, the password-related information is imperceptibly hidden in the edited latent code owing to the redundant nature of the latent space. To make our cryptographic framework possess all the properties analogous to RSA, we propose three types of loss functions: single anonymization loss, sequential anonymization loss, and associated anonymization loss. Extensive experiments and ablation analyses demonstrate the superiority of our method in terms of the quality of synthesis results, preservation of identity-irrelevant attributes, deanonymization accuracy, and completeness of properties analogous to RSA.

AAAI Conference 2024 Conference Paper

MotionGPT: Finetuned LLMs Are General-Purpose Motion Generators

  • Yaqi Zhang
  • Di Huang
  • Bin Liu
  • Shixiang Tang
  • Yan Lu
  • Lu Chen
  • Lei Bai
  • Qi Chu

Generating realistic human motion from given action descriptions has experienced significant advancements because of the emerging requirement of digital humans. While recent works have achieved impressive results in generating motion directly from textual action descriptions, they often support only a single modality of the control signal, which limits their application in the real digital human industry. This paper presents a Motion General-Purpose generaTor (MotionGPT) that can use multimodal control signals, e.g., text and single-frame poses, for generating consecutive human motions by treating multimodal signals as special input tokens in large language models (LLMs). Specifically, we first quantize multimodal control signals into discrete codes and then formulate them in a unified prompt instruction to ask the LLMs to generate the motion answer. Our MotionGPT demonstrates a unified human motion generation model with multimodal control signals by tuning a mere 0.4% of LLM parameters. To the best of our knowledge, MotionGPT is the first method to generate human motion by multimodal control signals, which we hope can shed light on this new direction. Visit our webpage at https://qiqiapink.github.io/MotionGPT/.
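
A toy sketch of how quantized control signals can be formatted as special tokens inside an ordinary LLM instruction (the token format and wording here are illustrative, not the paper's exact vocabulary or template):

    def build_motion_prompt(text, pose_codes):
        """Quantized control signals become special tokens in the prompt;
        the LLM is asked to answer with a motion token sequence."""
        pose_str = " ".join(f"<motion_{c}>" for c in pose_codes)
        return ("Below is a description and an initial pose. "
                "Generate the motion token sequence.\n"
                f"Description: {text}\nInitial pose: {pose_str}\nMotion:")

    print(build_motion_prompt("a person waves", [12, 407, 33]))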

AAAI Conference 2024 Conference Paper

MuST: Robust Image Watermarking for Multi-Source Tracing

  • Guanjie Wang
  • Zehua Ma
  • Chang Liu
  • Xi Yang
  • Han Fang
  • Weiming Zhang
  • Nenghai Yu

In recent years, with the popularity of social media applications, massive numbers of digital images are available online, which brings great convenience to image recreation. However, the use of unauthorized image materials in multi-source composite images is still inadequately regulated, which may cause significant loss and discouragement to the copyright owners of the source image materials. Ideally, deep watermarking techniques could provide a solution for protecting these copyrights based on their encoder-noise-decoder training strategy. Yet existing image watermarking schemes, which are mostly designed for single images, cannot well address the copyright protection requirements in this scenario, since the multi-source image composing process commonly includes distortions that are not well investigated by previous methods, e.g., extreme downsizing. To meet such demands, we propose MuST, a multi-source tracing robust watermarking scheme, whose architecture includes a multi-source image detector and a minimum external rectangle operation for multiple watermark resynchronization and extraction. Furthermore, we construct an image material dataset covering common image categories and design a simulation model of the multi-source image composing process as the noise layer. Experiments demonstrate the excellent performance of MuST in tracing the sources of image materials from composite images compared with SOTA watermarking methods: it maintains extraction accuracy above 98% when tracing the sources of at least 3 different image materials while keeping the average PSNR of watermarked image materials above 42.51 dB. We release our code at https://github.com/MrCrims/MuST

AAAI Conference 2024 Conference Paper

TCI-Former: Thermal Conduction-Inspired Transformer for Infrared Small Target Detection

  • Tianxiang Chen
  • Zhentao Tan
  • Qi Chu
  • Yue Wu
  • Bin Liu
  • Nenghai Yu

Infrared small target detection (ISTD) is critical to national security and has been extensively applied in military areas. ISTD aims to segment small target pixels from the background. Most ISTD networks focus on designing feature extraction blocks or feature fusion modules, but rarely describe the ISTD process from the feature map evolution perspective. In the ISTD process, the network's attention gradually shifts towards target areas. We abstract this process as the directional movement of feature map pixels toward target areas through convolution, pooling and interactions with surrounding pixels, which is analogous to the movement of thermal particles constrained by surrounding variables and particles. In light of this analogy, we propose the Thermal Conduction-Inspired Transformer (TCI-Former) based on the theoretical principles of thermal conduction. From the thermal conduction differential equation in heat dynamics, we derive the pixel movement differential equation (PMDE) in the image domain and further develop two modules: Thermal Conduction-Inspired Attention (TCIA) and the Thermal Conduction Boundary Module (TCBM). TCIA incorporates the finite difference method with the PMDE to reach a numerical approximation so that target body features can be extracted. To further remove errors in boundary areas, TCBM is designed and supervised by boundary masks to refine target body features with fine boundary details. Experiments on IRSTD-1k and NUAA-SIRST demonstrate the superiority of our method.
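
For reference, a minimal explicit finite-difference step of the 2-D heat equation, the numerical scheme from which such a PMDE-style update is derived (a sketch of the principle, not the TCIA module itself):

    import numpy as np

    def heat_step(u, alpha=0.2):
        """One explicit finite-difference step of du/dt = alpha * laplacian(u);
        alpha < 0.25 keeps the scheme stable on a unit grid."""
        lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
               np.roll(u, 1, 1) + np.roll(u, -1, 1) - 4 * u)
        return u + alpha * lap

    u = np.zeros((32, 32))
    u[16, 16] = 1.0                  # a point 'target' of heat
    for _ in range(10):
        u = heat_step(u)             # the signal diffuses to its neighborhood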

ICML Conference 2024 Conference Paper

Transferable Facial Privacy Protection against Blind Face Restoration via Domain-Consistent Adversarial Obfuscation

  • Kui Zhang
  • Hang Zhou 0007
  • Jie Zhang 0073
  • Wenbo Zhou 0004
  • Weiming Zhang 0001
  • Nenghai Yu

With the rise of social media and the proliferation of facial recognition surveillance, concerns surrounding privacy have escalated significantly. While numerous studies have concentrated on safeguarding users against unauthorized face recognition, a new and often overlooked issue has emerged due to advances in facial restoration techniques: traditional methods of facial obfuscation may no longer provide a secure shield, as they can potentially expose anonymous information to human perception. Our empirical study shows that blind face restoration (BFR) models can restore obfuscated faces with high probability simply by retraining them on obfuscated (e.g., pixelated) faces. To address this, we propose a transferable adversarial obfuscation method for privacy protection against BFR models. Specifically, we observed a common characteristic among BFR models, namely, their capability to approximate an inverse mapping of a transformation from a high-quality image domain to a low-quality image domain. Leveraging this shared model attribute, we developed a domain-consistent adversarial method for generating obfuscated images. In essence, our method is designed to minimize overfitting to surrogate models during the perturbation generation process, thereby enhancing the generalization of adversarially obfuscated facial images. Extensive experiments on various BFR models demonstrate the effectiveness and transferability of the proposed method.

AAAI Conference 2024 Conference Paper

Unifying Multi-Modal Uncertainty Modeling and Semantic Alignment for Text-to-Image Person Re-identification

  • Zhiwei Zhao
  • Bin Liu
  • Yan Lu
  • Qi Chu
  • Nenghai Yu

Text-to-Image person re-identification (TI-ReID) aims to retrieve images of a target identity according to a given textual description. Existing TI-ReID methods focus on aligning the visual and textual modalities through contrastive feature alignment or reconstructive masked language modeling (MLM). However, these methods parameterize image/text instances as deterministic embeddings and do not explicitly consider the inherent uncertainty in pedestrian images and their textual descriptions, leading to limited image-text relationship expression and semantic alignment. To address this problem, we propose a novel method that unifies multi-modal uncertainty modeling and semantic alignment for TI-ReID. Specifically, we model the image and textual feature vectors of pedestrians as Gaussian distributions, where the multi-granularity uncertainty of the distribution is estimated by incorporating batch-level and identity-level feature variances for each modality. The multi-modal uncertainty modeling acts as a feature augmentation and provides a richer image-text semantic relationship. We then present a bi-directional cross-modal circle loss to more effectively align the probabilistic features between image and text in a self-paced manner. To further promote comprehensive image-text semantic alignment, we design a task that complements masked language modeling, focusing on the cross-modality semantic recovery of the global masked token after cross-modal interaction. Extensive experiments conducted on three TI-ReID datasets highlight the effectiveness and superiority of our method over state-of-the-art methods.

AAAI Conference 2023 Conference Paper

AutoStegaFont: Synthesizing Vector Fonts for Hiding Information in Documents

  • Xi Yang
  • Jie Zhang
  • Han Fang
  • Chang Liu
  • Zehua Ma
  • Weiming Zhang
  • Nenghai Yu

Hiding information in text documents has been a hot topic recently, with the most typical schemes utilizing fonts. By constructing several fonts with similar appearances, information can be effectively represented and embedded in documents. However, due to their unstructured characteristics, font vectors are more difficult to synthesize than font images. Existing methods mainly use handcrafted features to design the fonts manually, which is time-consuming and labor-intensive. Moreover, due to the diversity of fonts, handcrafted features do not generalize to different fonts. Besides, in practice, since documents might be distorted through transmission, ensuring extractability under distortion is also an important requirement. Therefore, three requirements are imposed on vector font generation in this domain: automaticity, generalizability, and robustness. However, none of the existing methods can satisfy these requirements well and simultaneously. To satisfy the above requirements, we propose AutoStegaFont, an automatic vector font synthesis scheme for hiding information in documents. Specifically, we design a two-stage, dual-modality learning framework. In the first stage, we jointly train an encoder and a decoder to invisibly encode font images with different information. To ensure robustness, we design a noise layer to work with the encoder and decoder during training. In the second stage, we employ a differentiable rasterizer to establish a connection between the image and vector modalities. Then, we design an optimization algorithm to convey the information from the encoded image to the corresponding vector. Thus, the encoded font vectors can be automatically generated. Extensive experiments demonstrate the superior performance of our scheme in automatically synthesizing vector fonts for hiding information in documents, with robustness to distortions caused by low-resolution screenshots, printing, and photography. Besides, the proposed framework generalizes better to fonts with diverse styles and languages.

AAAI Conference 2023 Conference Paper

DeAR: A Deep-Learning-Based Audio Re-recording Resilient Watermarking

  • Chang Liu
  • Jie Zhang
  • Han Fang
  • Zehua Ma
  • Weiming Zhang
  • Nenghai Yu

Audio watermarking is widely used for tracing the source of leaks, and the robustness of the watermark determines the traceability of the algorithm. With the development of digital technology, audio re-recording (AR) has become an efficient and covert means of stealing secrets. The AR process can drastically destroy the watermark signal while preserving the original information. This puts forward a new requirement for audio watermarking at this stage, namely robustness to AR distortions. Unfortunately, none of the existing algorithms can effectively resist AR attacks due to the complexity of the AR process. To address this limitation, this paper proposes DeAR, a deep-learning-based audio re-recording resilient watermarking scheme. Inspired by DNN-based image watermarking, we pioneer a deep learning framework for audio carriers, based on which the watermark signal can be effectively embedded and extracted. Meanwhile, to resist AR attacks, we carefully analyze the distortions that occur in the AR process and design a corresponding distortion layer to cooperate with the proposed watermarking framework. Extensive experiments show that the proposed algorithm can resist not only common electronic channel distortions but also AR distortions. Under the premise of high-quality embedding (SNR = 25.86 dB) and at a common re-recording distance (20 cm), the algorithm achieves an average bit recovery accuracy of 98.55%.

ICLR Conference 2023 Conference Paper

Exploring the Limits of Differentially Private Deep Learning with Group-wise Clipping

  • Jiyan He
  • Xuechen Li
  • Da Yu
  • Huishuai Zhang
  • Janardhan Kulkarni
  • Yin Tat Lee
  • Arturs Backurs
  • Nenghai Yu

Differentially private deep learning has recently witnessed advances in computational efficiency and privacy-utility trade-off. We explore whether further improvements along the two axes are possible and provide affirmative answers leveraging two instantiations of group-wise clipping. To reduce the compute time overhead of private learning, we show that per-layer clipping, where the gradient of each neural network layer is clipped separately, allows clipping to be performed in conjunction with backpropagation in differentially private optimization. This results in private learning that is as memory-efficient and almost as fast per training update as non-private learning for many workflows of interest. While per-layer clipping with constant thresholds tends to underperform standard flat clipping, per-layer clipping with adaptive thresholds matches or outperforms flat clipping under given training epoch constraints, hence attaining similar or better task performance within less wall time. To explore the limits of scaling (pretrained) models in differentially private deep learning, we privately fine-tune the 175 billion-parameter GPT-3. We bypass scaling challenges associated with clipping gradients that are distributed across multiple devices with per-device clipping that clips the gradient of each model piece separately on its host device. Privately fine-tuning GPT-3 with per-device clipping achieves a task performance at ε=1 better than what is attainable by non-privately fine-tuning the largest GPT-2 on a summarization task.
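
A minimal sketch of per-layer (group-wise) clipping; real DP-SGD clips per example and adds Gaussian noise before the optimizer step, both omitted here for brevity (assumes PyTorch):

    import torch

    def per_layer_clip_(model, thresholds):
        """Group-wise clipping sketch: clip each layer's gradient to its own
        threshold as soon as it exists, instead of first forming one flat
        norm over all layers."""
        for p, c in zip(model.parameters(), thresholds):
            if p.grad is not None:
                scale = torch.clamp(c / (p.grad.norm() + 1e-12), max=1.0)
                p.grad.mul_(scale)

    model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.Linear(8, 2))
    x, y = torch.randn(4, 8), torch.randn(4, 2)
    ((model(x) - y) ** 2).mean().backward()
    per_layer_clip_(model, [1.0] * len(list(model.parameters())))
    # in true DP-SGD: per-example clipping + Gaussian noise, then step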

IJCAI Conference 2023 Conference Paper

Fluid Dynamics-Inspired Network for Infrared Small Target Detection

  • Tianxiang Chen
  • Qi Chu
  • Bin Liu
  • Nenghai Yu

Most infrared small target detection (ISTD) networks focus on building effective neural blocks or feature fusion modules, but none describes the ISTD process from the image evolution perspective. The directional evolution of image pixels, influenced by convolution, pooling and surrounding pixels, is analogous to the movement of fluid elements constrained by surrounding variables and particles. Inspired by this, we explore a novel research routine by abstracting the movement of pixels in the ISTD process as the flow of fluid in fluid dynamics (FD). Specifically, a new Fluid Dynamics-Inspired Network (FDI-Net) is devised for ISTD. Based on the Taylor Central Difference (TCD) method, the TCD feature extraction block is designed, where convolution and Transformer structures are combined for local and global information. The pixel motion equation for the ISTD process is derived from the Navier–Stokes (N-S) equation, yielding an N-S Refinement Module that refines extracted features with edge details. Thus, the TCD feature extraction block determines the primary movement direction of pixels during detection, while the N-S Refinement Module corrects skewed directions of the pixel stream to supplement edge details. Experiments on IRSTD-1k and SIRST demonstrate that our method achieves SOTA performance in terms of evaluation metrics.

AAAI Conference 2023 Conference Paper

PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers

  • Xiaoyi Dong
  • Jianmin Bao
  • Ting Zhang
  • DongDong Chen
  • Weiming Zhang
  • Lu Yuan
  • Dong Chen
  • Fang Wen

This paper explores a better prediction target for BERT pre-training of vision transformers. We observe that current prediction targets disagree with human perception judgment. This contradiction motivates us to learn a perceptual prediction target. We argue that perceptually similar images should stay close to each other in the prediction target space. We surprisingly find one simple yet effective idea: enforcing perceptual similarity during the dVAE training. Moreover, we adopt a self-supervised transformer model for deep feature extraction and show that it works well for calculating perceptual similarity. We demonstrate that such learned visual tokens indeed exhibit better semantic meanings, and help pre-training achieve superior transfer performance in various downstream tasks. For example, we achieve 84.5% Top-1 accuracy on ImageNet-1K with ViT-B backbone, outperforming the competitive method BEiT by +1.3% under the same pre-training epochs. Our approach also gets significant improvement on object detection and segmentation on COCO and semantic segmentation on ADE20K. Equipped with a larger backbone ViT-H, we achieve the state-of-the-art ImageNet accuracy (88.3%) among methods using only ImageNet-1K data.

AAAI Conference 2023 Conference Paper

Pseudo Label-Guided Model Inversion Attack via Conditional Generative Adversarial Network

  • Xiaojian Yuan
  • Kejiang Chen
  • Jie Zhang
  • Weiming Zhang
  • Nenghai Yu
  • Yang Zhang

Model inversion (MI) attacks have raised increasing concerns about privacy, as they can reconstruct training data from public models. Indeed, MI attacks can be formalized as an optimization problem that seeks private data in a certain space. Recent MI attacks leverage a generative adversarial network (GAN) as an image prior to narrow the search space, and can successfully reconstruct even high-dimensional data (e.g., face images). However, these generative MI attacks do not fully exploit the potential capabilities of the target model, still leading to a vague and coupled search space, i.e., different classes of images are coupled in the search space. Besides, the widely used cross-entropy loss in these attacks suffers from gradient vanishing. To address these problems, we propose the Pseudo Label-Guided MI (PLG-MI) attack via conditional GAN (cGAN). First, a top-n selection strategy is proposed to provide pseudo-labels for public data, and these pseudo-labels guide the training of the cGAN. In this way, the search space is decoupled for different classes of images. Then a max-margin loss is introduced to improve the search process on the subspace of a target class. Extensive experiments demonstrate that our PLG-MI attack significantly improves the attack success rate and visual quality for various datasets and models, notably, 2-3× better than state-of-the-art attacks under large distributional shifts. Our code is available at: https://github.com/LetheSec/PLG-MI-Attack.
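
A sketch of the top-n selection strategy for pseudo-labeling public data (score thresholds and the cGAN training loop are omitted; shapes are illustrative):

    import numpy as np

    def top_n_select(logits, n=50):
        """For each class, keep the n public images the target model scores
        highest, and use that class index as their pseudo-label for
        conditional GAN training."""
        pseudo = {}
        for cls in range(logits.shape[1]):
            pseudo[cls] = np.argsort(logits[:, cls])[-n:]   # image indices
        return pseudo

    rng = np.random.default_rng(0)
    logits = rng.normal(size=(10000, 5))    # target model outputs on public data
    pseudo_labels = top_n_select(logits, n=50)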

ICML Conference 2023 Conference Paper

X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion

  • Hanqing Zhao
  • Dianmo Sheng
  • Jianmin Bao
  • Dongdong Chen 0001
  • Dong Chen 0003
  • Fang Wen 0001
  • Lu Yuan
  • Ce Liu 0001

Copy-Paste is a simple and effective data augmentation strategy for instance segmentation. By randomly pasting object instances onto new background images, it creates new training data for free and significantly boosts segmentation performance, especially for rare object categories. Although diverse, high-quality object instances used in Copy-Paste yield more performance gain, previous works utilize object instances either from human-annotated instance segmentation datasets or rendered from 3D object models, and both approaches are too expensive to scale up for good diversity. In this paper, we revisit Copy-Paste at scale with the power of newly emerged zero-shot recognition models (e.g., CLIP) and text2image models (e.g., StableDiffusion). We demonstrate for the first time that using a text2image model to generate images, or a zero-shot recognition model to filter noisily crawled images, for different object categories is a feasible way to make Copy-Paste truly scalable. To make this possible, we design a data acquisition and processing framework, dubbed "X-Paste", upon which a systematic study is conducted. On the LVIS dataset, X-Paste provides impressive improvements over the strong baseline CenterNet2 with Swin-L as the backbone. Specifically, it achieves gains of +2.6 box AP and +2.1 mask AP on all classes, and even larger gains of +6.8 box AP and +6.5 mask AP on long-tail classes.
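
A minimal sketch of the core Copy-Paste composite, where the pasted mask doubles as a free segmentation label (instance sourcing via CLIP filtering or StableDiffusion generation is not shown):

    import numpy as np

    def copy_paste(background, instance, inst_mask, top, left):
        """Composite a generated or web-crawled object instance onto a new
        background; the paste location yields a free instance mask."""
        out = background.copy()
        h, w = inst_mask.shape
        region = out[top:top + h, left:left + w]
        region[inst_mask] = instance[inst_mask]
        label = np.zeros(background.shape[:2], bool)
        label[top:top + h, left:left + w] = inst_mask
        return out, label

    bg = np.zeros((128, 128, 3), np.uint8)
    inst = np.full((32, 32, 3), 255, np.uint8)
    mask = np.ones((32, 32), bool)
    aug_img, aug_mask = copy_paste(bg, inst, mask, top=40, left=60)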

AAAI Conference 2022 Conference Paper

Tracing Text Provenance via Context-Aware Lexical Substitution

  • Xi Yang
  • Jie Zhang
  • Kejiang Chen
  • Weiming Zhang
  • Zehua Ma
  • Feng Wang
  • Nenghai Yu

Text content created by humans or language models is often stolen or misused by adversaries. Tracing text provenance can help claim ownership of text content or identify the malicious users who distribute misleading content like machine-generated fake news. There have been some attempts to achieve this, mainly based on watermarking techniques. Specifically, traditional text watermarking methods embed watermarks by slightly altering text format such as line spacing and font, which, however, are fragile to cross-media transmissions like OCR. Considering this, natural language watermarking methods represent watermarks by replacing words in original sentences with synonyms from handcrafted lexical resources (e.g., WordNet), but they do not consider the substitution's impact on the overall meaning of the sentence. Recently, a transformer-based network was proposed to embed watermarks by modifying unobtrusive words (e.g., function words), which also impairs the sentence's logical and semantic coherence. Besides, one well-trained network fails on other types of text content. To address the limitations mentioned above, we propose a natural language watermarking scheme based on context-aware lexical substitution (LS). Specifically, we employ BERT to suggest LS candidates by inferring the semantic relatedness between the candidates and the original sentence. Based on this, a selection strategy in terms of synchronicity and substitutability is further designed to test whether a word is exactly suitable for carrying the watermark signal. Extensive experiments demonstrate that, under both objective and subjective metrics, our watermarking scheme can well preserve the semantic integrity of original sentences and has better transferability than existing methods. Besides, the proposed LS approach outperforms the state-of-the-art approach on the Stanford Word Substitution Benchmark.
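
A minimal sketch of the context-aware candidate-generation step using an off-the-shelf fill-mask pipeline (the paper's synchronicity and substitutability tests on these candidates are omitted; assumes Hugging Face transformers is installed):

    from transformers import pipeline

    # Mask the target word and let BERT rank in-context substitutes.
    fill = pipeline("fill-mask", model="bert-base-uncased")
    sentence = "The committee reached a unanimous [MASK] on the proposal."
    for cand in fill(sentence, top_k=5):
        print(cand["token_str"], round(cand["score"], 3))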

AAAI Conference 2021 Conference Paper

Initiative Defense against Facial Manipulation

  • Qidong Huang
  • Jie Zhang
  • Wenbo Zhou
  • Weiming Zhang
  • Nenghai Yu

Benefiting from the development of generative adversarial networks (GANs), facial manipulation has achieved significant progress in both academia and industry recently. It inspires an increasing number of entertainment applications but also incurs severe threats to individual privacy and even political security. To mitigate such risks, many countermeasures have been proposed. However, the great majority of methods are designed in a passive manner, detecting whether facial images or videos have been tampered with after their wide propagation. These detection-based methods have a fatal limitation: they only work for ex-post forensics and cannot prevent malicious behavior from occurring in the first place. To address this limitation, in this paper we propose a novel framework of initiative defense to degrade the performance of facial manipulation models controlled by malicious users. The basic idea is to actively inject imperceptible venom into target facial data before manipulation. To this end, we first imitate the target manipulation model with a surrogate model, and then devise a poison perturbation generator to obtain the desired venom. An alternating training strategy is further leveraged to train both the surrogate model and the perturbation generator. Two typical facial manipulation tasks, face attribute editing and face reenactment, are considered in our initiative defense framework. Extensive experiments demonstrate the effectiveness and robustness of our framework in different settings. Finally, we hope this work can shed some light on initiative countermeasures against more adversarial scenarios.

AAAI Conference 2021 Conference Paper

Joint Color-irrelevant Consistency Learning and Identity-aware Modality Adaptation for Visible-infrared Cross Modality Person Re-identification

  • Zhiwei Zhao
  • Bin Liu
  • Qi Chu
  • Yan Lu
  • Nenghai Yu

Visible-infrared cross modality person re-identification (VI-ReID) is a core but challenging technology in 24-hour intelligent surveillance systems. How to eliminate the large modality gap lies at the heart of VI-ReID. Conventional methods mainly focus on directly aligning the heterogeneous modalities into the same space. However, due to the unbalanced color information between visible and infrared images, the features of visible images tend to overfit clothing color information, which is harmful to modality alignment. Besides, these methods mainly align the heterogeneous feature distributions at the dataset level while ignoring valuable identity information, which may cause feature misalignment for some identities and weaken the discrimination of features. To tackle these problems, we propose a novel approach for VI-ReID. It learns color-irrelevant features through color-irrelevant consistency learning (CICL) and aligns identity-level feature distributions through identity-aware modality adaptation (IAMA). CICL and IAMA are integrated into a joint learning framework and promote each other. Extensive experiments on two popular datasets, SYSU-MM01 and RegDB, demonstrate the superiority and effectiveness of our approach against state-of-the-art methods.

ICLR Conference 2021 Conference Paper

Return-Based Contrastive Representation Learning for Reinforcement Learning

  • Guoqing Liu
  • Chuheng Zhang
  • Li Zhao 0007
  • Tao Qin 0001
  • Jinhua Zhu 0001
  • Jian Li 0015
  • Nenghai Yu
  • Tie-Yan Liu

Recently, various auxiliary tasks have been proposed to accelerate representation learning and improve sample efficiency in deep reinforcement learning (RL). However, existing auxiliary tasks do not take the characteristics of RL problems into consideration and are unsupervised. By leveraging returns, the most important feedback signals in RL, we propose a novel auxiliary task that forces the learnt representations to discriminate state-action pairs with different returns. Our auxiliary loss is theoretically justified to learn representations that capture the structure of a new form of state-action abstraction, under which state-action pairs with similar return distributions are aggregated together. Empirically, our algorithm outperforms strong baselines on complex tasks in Atari games and DeepMind Control suite, and achieves even better performance when combined with existing auxiliary tasks.
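
A toy sketch of a return-discriminating contrastive auxiliary loss (a simple margin form; the paper's abstraction is defined over return distributions, not raw return differences):

    import torch
    import torch.nn.functional as F

    def return_contrastive_loss(emb, returns, margin=1.0, thresh=0.1):
        """Pull together state-action embeddings with similar returns,
        push apart those with clearly different returns."""
        d = torch.cdist(emb, emb)                          # pairwise distances
        same = (returns[:, None] - returns[None, :]).abs() < thresh
        pos = d[same].pow(2).mean()                        # similar: be close
        neg = F.relu(margin - d[~same]).pow(2).mean()      # different: keep margin
        return pos + neg

    emb = torch.randn(32, 16, requires_grad=True)          # encoder outputs
    returns = torch.randn(32)
    loss = return_contrastive_loss(emb, returns)
    loss.backward()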

AAAI Conference 2021 Conference Paper

Temporal ROI Align for Video Object Recognition

  • Tao Gong
  • Kai Chen
  • Xinjiang Wang
  • Qi Chu
  • Feng Zhu
  • Dahua Lin
  • Nenghai Yu
  • Huamin Feng

Video object detection is challenging in the presence of appearance deterioration in certain video frames. It is therefore natural to aggregate temporal information from other frames of the same video into the current frame. However, ROI Align, one of the core procedures of video detectors, still extracts features for proposals from a single-frame feature map, so the extracted ROI features lack temporal information from the video. In this work, observing that the features of the same object instance are highly similar across frames of a video, we propose a novel Temporal ROI Align operator that extracts features from other frames' feature maps for current-frame proposals by exploiting feature similarity. The proposed Temporal ROI Align operator can extract temporal information from the entire video for proposals. We integrate it into single-frame video detectors and other state-of-the-art video detectors, and conduct quantitative experiments to demonstrate that it consistently and significantly boosts performance. Besides, the proposed Temporal ROI Align can also be applied to video instance segmentation.
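
The following is a simplified, hypothetical rendering of similarity-guided temporal aggregation in PyTorch; the cosine similarity, top-K fusion, and tensor shapes are assumptions meant only to convey the mechanism, not the operator's exact definition.

```python
# Hedged sketch: for each spatial location of a current-frame ROI feature,
# gather the top-K most similar features from support-frame feature maps and
# fuse them by similarity-weighted averaging.
import torch

def temporal_roi_align(roi_feat, support_feats, k=4):
    # roi_feat: (C, H, W) ROI feature from the current frame
    # support_feats: (T, C, Hf, Wf) feature maps of T other frames
    C, H, W = roi_feat.shape
    q = roi_feat.reshape(C, H * W).t()                          # (HW, C)
    kv = support_feats.permute(1, 0, 2, 3).reshape(C, -1).t()   # (T*Hf*Wf, C)
    sim = (torch.nn.functional.normalize(q, dim=1)
           @ torch.nn.functional.normalize(kv, dim=1).t())      # (HW, N)
    topv, topi = sim.topk(k, dim=1)                             # (HW, K)
    w = torch.softmax(topv, dim=1)                              # fusion weights
    agg = (w.unsqueeze(-1) * kv[topi]).sum(1)                   # (HW, C)
    return agg.t().reshape(C, H, W)
```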

AAAI Conference 2020 Conference Paper

DASOT: A Unified Framework Integrating Data Association and Single Object Tracking for Online Multi-Object Tracking

  • Qi Chu
  • Wanli Ouyang
  • Bin Liu
  • Feng Zhu
  • Nenghai Yu

In this paper, we propose an online multi-object tracking (MOT) approach that integrates data association and single object tracking (SOT) within a unified convolutional network (ConvNet), named DASOTNet. The intuition behind integrating data association and SOT is that they can complement each other. Following the Siamese network architecture, DASOTNet consists of a shared feature ConvNet, a data association branch, and an SOT branch. Data association is treated as a special re-identification task and solved by learning discriminative features for different targets in the data association branch. To handle the problem that the computational cost of SOT grows intolerably as the number of tracked objects increases, we propose an efficient two-stage tracking method in the SOT branch, which exploits the merits of correlation features and can simultaneously track all existing targets within one forward propagation. With feature sharing and the interaction between them, the data association branch and the SOT branch learn to better complement each other. Using a multi-task objective, the whole network can be trained end-to-end. Compared with state-of-the-art online MOT methods, our method is much faster while maintaining comparable performance.

NeurIPS Conference 2020 Conference Paper

GreedyFool: Distortion-Aware Sparse Adversarial Attack

  • Xiaoyi Dong
  • DongDong Chen
  • Jianmin Bao
  • Chuan Qin
  • Lu Yuan
  • Weiming Zhang
  • Nenghai Yu
  • Dong Chen

Modern deep neural networks (DNNs) are vulnerable to adversarial samples. Sparse adversarial samples are a special branch of adversarial samples that can fool the target model by perturbing only a few pixels. The existence of sparse adversarial attacks shows that DNNs are much more vulnerable than previously believed, and offers a new angle for analyzing them. However, current sparse adversarial attack methods still have shortcomings in both sparsity and invisibility. In this paper, we propose a novel two-stage distortion-aware greedy method dubbed "GreedyFool". Specifically, it first selects the most effective candidate positions to modify by considering both the gradient (for the adversarial objective) and the distortion map (for invisibility), and then drops some less important points in a reduce stage. Experiments demonstrate that, compared with the state-of-the-art method, we need to modify three times fewer pixels under the same sparse perturbation setting. For targeted attacks, the success rate of our method is 9.96% higher than the state-of-the-art method under the same pixel budget.
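
A schematic sketch of the two-stage greedy idea follows; the helper callbacks `score_pixels`, `perturb`, and `is_adversarial` are hypothetical stand-ins, since the actual procedure scores positions from model gradients and a learned distortion map.

```python
# Hedged, simplified rendering of the two-stage greedy procedure: greedily
# add the best-scoring pixels (high gradient, low visible distortion) until
# the sample is adversarial, then try to drop redundant pixels.
def greedy_sparse_attack(score_pixels, perturb, is_adversarial, budget):
    chosen = []
    for _ in range(budget):
        p = score_pixels(exclude=chosen)   # argmax of |grad| / distortion
        chosen.append(p)
        perturb(chosen)
        if is_adversarial():
            break
    for p in list(chosen):                 # reduce stage: drop unneeded pixels
        if is_adversarial(without=p):
            chosen.remove(p)
            perturb(chosen)
    return chosen
```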

IJCAI Conference 2020 Conference Paper

GSM: Graph Similarity Model for Multi-Object Tracking

  • Qiankun Liu
  • Qi Chu
  • Bin Liu
  • Nenghai Yu

The popular tracking-by-detection paradigm for multi-object tracking (MOT) focuses on solving the data association problem, at the heart of which lies a robust similarity model. Most previous works strive to improve the feature representation of individual objects while leaving the relations among objects less explored, which may be problematic in some complex scenarios. In this paper, we focus on leveraging the relations among objects to improve the robustness of the similarity model. To this end, we propose a novel graph representation that takes both the features of individual objects and the relations among objects into consideration. Besides, a graph matching module is specially designed for the proposed graph representation to alleviate the impact of unreliable relations. With the help of the graph representation and the graph matching module, the proposed graph similarity model, named GSM, is more robust to occlusion and to targets sharing similar appearance. We conduct extensive experiments on challenging MOT benchmarks, and the experimental results demonstrate the effectiveness of the proposed method.

AAAI Conference 2020 Conference Paper

Model Watermarking for Image Processing Networks

  • Jie Zhang
  • DongDong Chen
  • Jing Liao
  • Han Fang
  • Weiming Zhang
  • Wenbo Zhou
  • Hao Cui
  • Nenghai Yu

Deep learning has achieved tremendous success in numerous industrial applications. As training a good model often needs massive high-quality data and computation resources, the learned models often have significant business value. However, these valuable deep models are exposed to a huge risk of infringement. For example, if the attacker has full information about a target model, including the network structure and weights, the model can easily be fine-tuned on new datasets. Even if the attacker can only access the output of the target model, he/she can still train a similar surrogate model by generating a large number of input-output training pairs. How to protect the intellectual property of deep models is a very important but seriously under-researched problem. There are a few recent attempts, but they address classification network protection only. In this paper, we propose the first model watermarking framework for protecting image processing models. To achieve this goal, we leverage the spatial invisible watermarking mechanism. Specifically, given a black-box target model, a unified and invisible watermark is hidden in its outputs, which can be regarded as a special task-agnostic barrier. In this way, when the attacker trains a surrogate model on the input-output pairs of the target model, the hidden watermark is learned along with the task and can be extracted afterward. To enable watermarks ranging from binary bits to high-resolution images, both traditional and deep spatial invisible watermarking mechanisms are considered. Experiments demonstrate the robustness of the proposed watermarking mechanism, which can resist surrogate models trained with different network structures and objective functions. Beyond deep models, the proposed method is also easily extended to protect data and traditional image processing algorithms.
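
As a toy illustration of the kind of traditional spatial invisible watermark the framework builds on (the actual framework uses learned embedding and extraction networks), one can hide a low-amplitude pseudorandom pattern and detect it by correlation; sizes, amplitude, and threshold below are assumptions.

```python
# Toy additive spatial watermark: embed a secret +/-1 pattern at low
# amplitude, detect it by normalized correlation with the secret pattern.
import numpy as np

rng = np.random.default_rng(0)
wm = rng.choice([-1.0, 1.0], size=(256, 256))   # secret zero-mean pattern

def embed(img, alpha=1.5):
    # img: (256, 256) grayscale array in [0, 255]
    return np.clip(img + alpha * wm, 0, 255)

def detect(img, alpha=1.5, thresh=0.5):
    # For a watermarked image, the correlation concentrates near 1;
    # for a clean image it stays near 0.
    corr = np.mean((img - img.mean()) * wm) / alpha
    return corr > thresh
```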

NeurIPS Conference 2020 Conference Paper

Passport-aware Normalization for Deep Model Protection

  • Jie Zhang
  • DongDong Chen
  • Jing Liao
  • Weiming Zhang
  • Gang Hua
  • Nenghai Yu

Despite tremendous success in many application scenarios, deep learning faces serious intellectual property (IP) infringement threats. Considering the cost of designing and training a good model, infringement significantly damages the interests of the original model owner. Recently, many impressive works have emerged for deep model IP protection. However, they are either vulnerable to ambiguity attacks, or require changes to the target network structure by replacing its original normalization layers, which causes significant performance drops. To this end, we propose a new passport-aware normalization formulation, which is generally applicable to most existing normalization layers and only needs an additional passport-aware branch for IP protection. This new branch is jointly trained with the target model but discarded at the inference stage; therefore it causes no structure change in the target model. Only when the model IP is suspected of being stolen is the private passport-aware branch added back for ownership verification. Through extensive experiments, we verify its effectiveness on both image and 3D point recognition models. It is demonstrated to be robust not only to common attack techniques like fine-tuning and model compression, but also to ambiguity attacks. By further combining it with trigger-set based methods, both black-box and white-box verification can be achieved for enhanced security of deep learning models deployed in real systems.
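
A minimal sketch of the structural idea, assuming a batch-norm backbone and a vector passport; the passport encoding and branch design here are illustrative, not the paper's exact formulation.

```python
# Hedged sketch: a normalization layer with an extra passport-conditioned
# affine branch that is trained jointly but only attached for ownership
# verification; deployed inference uses the plain branch unchanged.
import torch
import torch.nn as nn

class PassportAwareBN2d(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.bn = nn.BatchNorm2d(c)             # normal branch (deployed)
        self.passport_fc = nn.Linear(c, 2 * c)  # verification branch

    def forward(self, x, passport=None):
        if passport is None:
            return self.bn(x)                   # inference: no structure change
        y = self.bn(x)
        gamma, beta = self.passport_fc(passport).chunk(2, dim=-1)
        return gamma[..., None, None] * y + beta[..., None, None]
```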

AAAI Conference 2019 Conference Paper

Capacity Control of ReLU Neural Networks by Basis-Path Norm

  • Shuxin Zheng
  • Qi Meng
  • Huishuai Zhang
  • Wei Chen
  • Nenghai Yu
  • Tie-Yan Liu

Recently, the path norm was proposed as a new capacity measure for neural networks with the Rectified Linear Unit (ReLU) activation function, which takes the rescaling-invariant property of ReLU into account. It has been shown that the generalization error bound in terms of the path norm explains the empirical generalization behavior of ReLU neural networks better than other capacity measures. Moreover, optimization algorithms that take the path norm as a regularization term in the loss function, like Path-SGD, have been shown to achieve better generalization performance. However, the path norm counts the values of all paths, and hence a capacity measure based on it can be improperly influenced by the dependency among different paths. It is also known that each path of a ReLU network can be represented by a small group of linearly independent basis paths with multiplication and division operations, which indicates that the generalization behavior of the network depends on only a few basis paths. Motivated by this, we propose a new norm, the Basis-path Norm, based on a group of linearly independent paths, to measure the capacity of neural networks more accurately. We establish a generalization error bound based on this basis-path norm, and show via extensive experiments that it explains the generalization behavior of ReLU networks more accurately than previous capacity measures. In addition, we develop optimization algorithms that minimize the empirical risk regularized by the basis-path norm. Our experiments on benchmark datasets demonstrate that the proposed regularization method achieves clearly better performance on the test set than previous regularization approaches.
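
For context, the path norm of a ReLU network (as stated in the path-norm literature; given here as a hedged reference, not this paper's exact definition) sums the absolute products of weights along all input-output paths, while the basis-path norm restricts attention to a set of linearly independent basis paths $\mathcal{B}$:

$$\|\theta\|_{\mathrm{path}} = \sum_{p \in \mathcal{P}} \Big|\prod_{e \in p} w_e\Big|, \qquad \|\theta\|_{\mathrm{BP}} \;\text{(schematically)}\; = \sum_{p \in \mathcal{B} \subset \mathcal{P}} f\Big(\prod_{e \in p} w_e\Big),$$

where every path value can be recovered from the basis-path values via multiplication and division, so measuring only $\mathcal{B}$ avoids double-counting dependent paths.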

AAAI Conference 2019 Conference Paper

Trust Region Evolution Strategies

  • Guoqing Liu
  • Li Zhao
  • Feidiao Yang
  • Jiang Bian
  • Tao Qin
  • Nenghai Yu
  • Tie-Yan Liu

Evolution Strategies (ES), a class of black-box optimization algorithms, have recently been demonstrated to be a viable alternative to popular MDP-based RL techniques such as Q-learning and policy gradients. ES achieves fairly good performance on challenging reinforcement learning problems and is easier to scale in a distributed setting. However, standard ES algorithms perform one gradient update per data sample, which is not very efficient. In this paper, with the goal of using sampled data more efficiently, we propose a novel iterative procedure that optimizes a surrogate objective function, enabling data samples to be reused for multiple epochs of updates. We prove a monotonic improvement guarantee for this procedure. By making several approximations to the theoretically justified procedure, we further develop a practical algorithm called Trust Region Evolution Strategies (TRES). Our experiments demonstrate the effectiveness of TRES on a range of popular MuJoCo locomotion tasks in the OpenAI Gym, where it achieves better performance than the ES algorithm.

ICML Conference 2018 Conference Paper

Model-Level Dual Learning

  • Yingce Xia
  • Xu Tan 0003
  • Fei Tian
  • Tao Qin 0001
  • Nenghai Yu
  • Tie-Yan Liu

Many artificial intelligence tasks appear in dual forms, such as English$\leftrightarrow$French translation and speech$\leftrightarrow$text transformation. Existing dual learning schemes, which are proposed to solve a pair of such dual tasks, explore how to leverage such dualities at the data level. In this work, we propose a new learning framework, model-level dual learning, which takes the duality of tasks into consideration while designing the architectures of the primal/dual models, and ties the model parameters that play similar roles in the two tasks. We study both symmetric and asymmetric model-level dual learning. Our algorithms achieve significant improvements on neural machine translation and sentiment analysis.

ICML Conference 2017 Conference Paper

Asynchronous Stochastic Gradient Descent with Delay Compensation

  • Shuxin Zheng
  • Qi Meng
  • Taifeng Wang
  • Wei Chen 0034
  • Nenghai Yu
  • Zhiming Ma
  • Tie-Yan Liu

With the fast development of deep learning, it has become common to learn big neural networks using massive training data. Asynchronous Stochastic Gradient Descent (ASGD) is widely adopted to fulfill this task for its efficiency, but it is known to suffer from the problem of delayed gradients: when a local worker adds its gradient to the global model, the global model may already have been updated by other workers, and this gradient becomes “delayed”. We propose a novel technique to compensate for this delay, so as to make the optimization behavior of ASGD closer to that of sequential SGD. This is achieved by leveraging a Taylor expansion of the gradient function and efficient approximators of the Hessian matrix of the loss function. We call the new algorithm Delay Compensated ASGD (DC-ASGD). We evaluated the proposed algorithm on the CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD.
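
A hedged sketch of the compensation described above: if a worker computed its gradient at a stale snapshot $w_{\text{bak}}$ while the global model has since moved to $w_t$, a first-order Taylor expansion of the gradient gives

$$g(w_t) \approx g(w_{\text{bak}}) + H(w_{\text{bak}})(w_t - w_{\text{bak}}) \approx g(w_{\text{bak}}) + \lambda\, g(w_{\text{bak}}) \odot g(w_{\text{bak}}) \odot (w_t - w_{\text{bak}}),$$

where the second step uses a cheap diagonal outer-product approximation of the Hessian controlled by a coefficient $\lambda$ (our paraphrase of the efficient approximator the abstract mentions).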

NeurIPS Conference 2017 Conference Paper

Deliberation Networks: Sequence Generation Beyond One-Pass Decoding

  • Yingce Xia
  • Fei Tian
  • Lijun Wu
  • Jianxin Lin
  • Tao Qin
  • Nenghai Yu
  • Tie-Yan Liu

The encoder-decoder framework has achieved promising progress on many sequence generation tasks, including machine translation, text summarization, dialog systems, image captioning, etc. Such a framework adopts a one-pass forward process while decoding and generating a sequence, but lacks a deliberation process: a generated sequence is used directly as the final output without further polishing. However, deliberation is a common behavior in daily human life, for example when reading news or writing papers, articles, and books. In this work, we introduce the deliberation process into the encoder-decoder framework and propose deliberation networks for sequence generation. A deliberation network has two levels of decoders, where the first-pass decoder generates a raw sequence and the second-pass decoder polishes and refines the raw sequence with deliberation. Since the second-pass deliberation decoder has global information about what the sequence to be generated might be, it has the potential to produce a better sequence by looking into future words of the raw sequence. Experiments on neural machine translation and text summarization demonstrate the effectiveness of the proposed deliberation networks. On the WMT 2014 English-to-French translation task, our model establishes a new state-of-the-art BLEU score of 41.5.

IJCAI Conference 2017 Conference Paper

Dual Inference for Machine Learning

  • Yingce Xia
  • Jiang Bian
  • Tao Qin
  • Nenghai Yu
  • Tie-Yan Liu

Recent years have witnessed the rapid development of machine learning in solving artificial intelligence (AI) tasks in many domains, including translation, speech, images, etc. Within these domains, AI tasks are usually not independent. As a specific type of relationship, structural duality exists between many pairs of AI tasks, such as translation from one language to another vs. the opposite direction, speech recognition vs. speech synthesis, image classification vs. image generation, etc. The importance of such duality has been highlighted by recent studies, which revealed that it can boost the learning of the two tasks in the dual form. However, there has been little investigation of how to leverage this invaluable relationship at the inference stage of AI tasks. In this paper, we propose a general framework of dual inference, which can take advantage of the existing models of two dual tasks, without re-training, to conduct inference for an individual task. Empirical studies on three pairs of specific dual tasks, covering machine translation, sentiment analysis, and image processing, illustrate that dual inference can significantly improve the performance of each individual task.
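
One natural way to instantiate this (a hedged reading of the framework, not necessarily its exact scoring rule) follows from Bayes' rule, $P(y \mid x) \propto P(x \mid y)P(y)$: the primal and dual models can be interpolated at inference time,

$$\hat{y} = \arg\max_{y}\; \alpha \log P(y \mid x; \theta_{xy}) + (1 - \alpha)\big[\log P(x \mid y; \theta_{yx}) + \log P(y)\big],$$

where $\theta_{xy}$ and $\theta_{yx}$ are the pre-trained primal and dual models, $P(y)$ is a marginal (e.g., language-model) estimate, and $\alpha$ trades off the two directions; no re-training is involved.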

ICML Conference 2017 Conference Paper

Dual Supervised Learning

  • Yingce Xia
  • Tao Qin 0001
  • Wei Chen 0034
  • Jiang Bian 0002
  • Nenghai Yu
  • Tie-Yan Liu

Many supervised learning tasks emerge in dual forms, e.g., English-to-French translation vs. French-to-English translation, speech recognition vs. text-to-speech, and image classification vs. image generation. Two dual tasks have intrinsic connections with each other due to the probabilistic correlation between their models. This connection is, however, not effectively utilized today, since people usually train the models of two dual tasks separately and independently. In this work, we propose training the models of two dual tasks simultaneously, explicitly exploiting the probabilistic correlation between them to regularize the training process. For ease of reference, we call the proposed approach dual supervised learning. We demonstrate that dual supervised learning can improve the practical performance of both tasks, for various applications including machine translation, image processing, and sentiment analysis.
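
Concretely (as we understand the probabilistic duality; the marginal estimates $\hat P$ would typically come from separately trained marginal/language models), the joint distribution can be factorized in two ways, yielding an equality constraint and a corresponding regularizer:

$$P(x)\,P(y \mid x; \theta_{xy}) = P(x, y) = P(y)\,P(x \mid y; \theta_{yx}),$$

$$\ell_{\mathrm{dual}} = \big(\log \hat P(x) + \log P(y \mid x; \theta_{xy}) - \log \hat P(y) - \log P(x \mid y; \theta_{yx})\big)^2,$$

added to the two models' likelihood losses during joint training so that each model's violations of duality are penalized.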

AAMAS Conference 2016 Conference Paper

Best Action Selection in a Stochastic Environment

  • Yingce Xia
  • Tao Qin
  • Nenghai Yu
  • Tie-Yan Liu

We study the problem of selecting the best action from multiple candidates in a stochastic environment. In such a setting, taking an action yields a random reward and incurs a random cost, drawn from two unknown distributions. We aim to select the best action, the one with the maximum ratio of expected reward to expected cost, after exploring the actions for n rounds. In particular, we study three mechanisms: (i) the uniform exploration mechanism MU; (ii) the successive elimination mechanism MSE; and (iii) the ratio confidence bound exploration mechanism MRCB. We prove that for all three mechanisms, the probability that the best action is not selected (i.e., the error probability) can be upper bounded by O(exp{−cn}), where c is a constant related to the mechanism and to coefficients of the actions. We then give an asymptotic lower bound on the error probability of consistent mechanisms in the Bernoulli setting, and discuss its relationship with the upper bounds in different aspects. Our proposed mechanisms can be specialized to cover the cases where only the rewards or only the costs are random. We also test the proposed mechanisms through numerical experiments.
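
As a schematic illustration of the third mechanism (our hedged rendering, with $\hat\mu^r_{i,t}, \hat\mu^c_{i,t}$ the empirical reward and cost means and $\mathrm{rad}_{i,t}$ a confidence radius such as $\sqrt{\alpha \ln t / n_{i,t}}$), a ratio confidence bound rule explores the action that is optimistic about the reward-to-cost ratio:

$$I_t = \arg\max_{i} \frac{\hat\mu^r_{i,t} + \mathrm{rad}_{i,t}}{\max\{\hat\mu^c_{i,t} - \mathrm{rad}_{i,t},\, \epsilon\}},$$

with a truncation $\epsilon > 0$ keeping the denominator positive.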

IJCAI Conference 2016 Conference Paper

Budgeted Multi-Armed Bandits with Multiple Plays

  • Yingce Xia
  • Tao Qin
  • Weidong Ma
  • Nenghai Yu
  • Tie-Yan Liu

We study the multi-play budgeted multi-armed bandit (MP-BMAB) problem, in which pulling an arm yields both a random reward and a random cost, and a player pulls L (≥ 1) arms at each round. The player aims to maximize her total expected reward under a budget constraint B on the pulling costs. We present a multiple ratio confidence bound policy: at each round, we first calculate a truncated upper (lower) confidence bound for the expected reward (cost) of each arm, and then pull the L arms with the maximum ratio of the sum of the upper confidence bounds of rewards to the sum of the lower confidence bounds of costs. We design a 0-1 integer linear fractional programming oracle that can pick these L arms within polynomial time. We prove that the regret of our policy is sublinear in general and log-linear for certain parameter settings. We further consider two special cases of MP-BMABs: (1) we derive a lower bound for any consistent policy for MP-BMABs with Bernoulli reward and cost distributions; (2) we show that the proposed policy can also solve the conventional budgeted MAB problem (a special case of MP-BMABs with L = 1) and provides better theoretical results than existing UCB-based pulling policies.

NeurIPS Conference 2016 Conference Paper

Dual Learning for Machine Translation

  • Di He
  • Yingce Xia
  • Tao Qin
  • Liwei Wang
  • Nenghai Yu
  • Tie-Yan Liu
  • Wei-Ying Ma

While neural machine translation (NMT) has made good progress over the past two years, tens of millions of bilingual sentence pairs are needed for its training. However, human labeling is very costly. To tackle this training-data bottleneck, we develop a dual-learning mechanism that enables an NMT system to learn automatically from unlabeled data through a dual-learning game. The mechanism is inspired by the following observation: any machine translation task has a dual task, e.g., English-to-French translation (primal) versus French-to-English translation (dual); the primal and dual tasks can form a closed loop and generate informative feedback signals to train the translation models, even without the involvement of a human labeler. In the dual-learning mechanism, we use one agent to represent the model of the primal task and another agent to represent the model of the dual task, and ask them to teach each other through a reinforcement learning process. Based on the feedback signals generated during this process (e.g., the language-model likelihood of a model's output, and the reconstruction error of the original sentence after the primal and dual translations), we can iteratively update the two models until convergence (e.g., using policy gradient methods). We call the corresponding approach to neural machine translation \emph{dual-NMT}. Experiments show that dual-NMT works very well on English$\leftrightarrow$French translation; in particular, by learning from monolingual data (with 10\% bilingual data for warm start), it achieves accuracy comparable to NMT trained on the full bilingual data for the French-to-English translation task.
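
In pseudocode, one round of the dual-learning game might look as follows; every API name here (`sample`, `log_prob`, `policy_gradient_update`, `mle_update`) is hypothetical shorthand for the components the abstract describes, not a real library interface.

```python
# Hedged pseudocode of the closed-loop dual-learning game: two translation
# models teach each other from monolingual text using a language-model
# reward plus a reconstruction reward.
def dual_learning_step(f_en2fr, f_fr2en, lm_fr, sentence_en, alpha=0.5):
    mid = f_en2fr.sample(sentence_en)                 # primal: En -> Fr
    r_lm = lm_fr.log_prob(mid)                        # is `mid` fluent French?
    r_rec = f_fr2en.log_prob(sentence_en, given=mid)  # can we reconstruct the source?
    reward = alpha * r_lm + (1 - alpha) * r_rec
    f_en2fr.policy_gradient_update(sentence_en, mid, reward)
    f_fr2en.mle_update(src=mid, tgt=sentence_en)      # reconstruction signal
```

A symmetric step starting from monolingual French text updates the two models in the other direction.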

IJCAI Conference 2015 Conference Paper

Thompson Sampling for Budgeted Multi-Armed Bandits

  • Yingce Xia
  • Haifang Li
  • Tao Qin
  • Nenghai Yu
  • Tie-Yan Liu

Thompson sampling is one of the earliest randomized algorithms for multi-armed bandits (MAB). In this paper, we extend Thompson sampling to budgeted MAB, where pulling an arm incurs a random cost and the total cost is constrained by a budget. We start with the case of Bernoulli bandits, in which the random rewards (costs) of an arm are independently sampled from a Bernoulli distribution. To implement Thompson sampling in this case, at each round we sample two numbers from the posterior distributions of the reward and the cost for each arm, obtain their ratio, select the arm with the maximum ratio, and then update the posterior distributions. We prove that the distribution-dependent regret bound of this algorithm is O(ln B), where B denotes the budget. By introducing a Bernoulli trial, we further extend this algorithm to the setting where the rewards (costs) are drawn from general distributions, and prove that its regret bound remains almost the same. Our simulation results demonstrate the effectiveness of the proposed algorithm.
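
A minimal runnable sketch of the Bernoulli case described above, assuming a `pull(i)` environment callback that returns a binary (reward, cost) pair; the uniform Beta(1, 1) priors and the small denominator floor are illustrative choices.

```python
# Thompson sampling for budgeted Bernoulli bandits: sample from the Beta
# posteriors of reward and cost for each arm, pull the arm with the largest
# sampled reward/cost ratio, and update the pulled arm's posteriors.
import numpy as np

def budgeted_thompson(pull, n_arms, budget, rng=np.random.default_rng()):
    s_r = np.ones((n_arms, 2))   # per-arm reward Beta params [successes, failures]
    s_c = np.ones((n_arms, 2))   # per-arm cost Beta params
    spent, total_reward = 0.0, 0.0
    while spent < budget:
        theta_r = rng.beta(s_r[:, 0], s_r[:, 1])
        theta_c = rng.beta(s_c[:, 0], s_c[:, 1])
        i = int(np.argmax(theta_r / np.maximum(theta_c, 1e-9)))
        reward, cost = pull(i)              # both in {0, 1}
        s_r[i, 1 - reward] += 1             # reward=1 -> success count
        s_c[i, 1 - cost] += 1
        spent += cost
        total_reward += reward
    return total_reward
```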

AAAI Conference 2014 Conference Paper

Incentivizing High-Quality Content from Heterogeneous Users: On the Existence of Nash Equilibrium

  • Yingce Xia
  • Tao Qin
  • Nenghai Yu
  • Tie-Yan Liu

We study the existence of pure Nash equilibria (PNE) for the mechanisms used by Internet services (e.g., online reviews and question-answering websites) to incentivize users to generate high-quality content. Most existing work assumes that users are homogeneous and have the same ability. However, real-world users are heterogeneous, and their abilities can differ greatly due to their diversity in background, culture, and profession. In this work, we consider the following setting: (1) users are heterogeneous and each has a private type indicating the best quality of content he/she can generate; (2) all users share a fixed total reward. In this setting, we study the existence of pure Nash equilibria for several mechanisms composed of different allocation rules, action spaces, and information availability. We prove the existence of a PNE for some mechanisms and the non-existence for others. We also discuss how to find a PNE (if one exists), through either a constructive method or a search algorithm.

TIST Journal 2011 Journal Article

Distance metric learning from uncertain side information for automated photo tagging

  • Lei Wu
  • Steven C.H. Hoi
  • Rong Jin
  • Jianke Zhu
  • Nenghai Yu

Automated photo tagging is an important technique for many intelligent multimedia information systems, for example smart photo management systems and intelligent digital media libraries. To tackle this challenge, several machine learning techniques have been developed and applied to automated photo tagging; for example, supervised learning techniques train statistical classifiers from a collection of manually labeled examples. Although the existing approaches work well on small testbeds with a relatively small number of annotation words, due to the long-standing challenge of object recognition they often perform poorly on large-scale problems. Another limitation of the existing approaches is that they require a set of high-quality labeled data, which is not only expensive to collect but also time-consuming. In this article, we investigate a social-image-based annotation scheme that exploits the implicit side information available for a large number of social photos on social web sites. The key challenge of our intelligent annotation scheme is how to learn an effective distance metric based on the implicit side information (visual or textual) of social photos. To this end, we present a novel "Probabilistic Distance Metric Learning" (PDML) framework, which can learn optimized metrics by effectively exploiting the implicit side information vastly available on the social web. We apply the proposed technique to photo annotation tasks on a large social image testbed with over 1 million tagged photos crawled from a social photo sharing portal. Encouraging results show that the proposed technique is effective and promising for social-photo-based annotation tasks.

NeurIPS Conference 2009 Conference Paper

Learning Bregman Distance Functions and Its Application for Semi-Supervised Clustering

  • Lei Wu
  • Rong Jin
  • Steven Hoi
  • Jianke Zhu
  • Nenghai Yu

Learning distance functions with side information plays a key role in many machine learning and data mining applications. Conventional approaches often assume a Mahalanobis distance function. These approaches are limited in two aspects: (i) they are computationally expensive (even infeasible) for high-dimensional data because the size of the metric grows quadratically with the dimensionality; (ii) they assume a fixed metric for the entire input space and are therefore unable to handle heterogeneous data. In this paper, we propose a novel scheme that learns nonlinear Bregman distance functions from side information using a non-parametric approach similar to support vector machines. The proposed scheme avoids the assumption of a fixed metric because its local distance metric is implicitly derived from the Hessian matrix of the convex function that generates the Bregman distance function. We present an efficient learning algorithm for the proposed scheme of distance function learning. Extensive experiments with semi-supervised clustering show that the proposed technique (i) outperforms state-of-the-art approaches for distance function learning, and (ii) is computationally efficient for high-dimensional data.
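
For reference, the Bregman distance generated by a strictly convex function $\phi$ is

$$d_\phi(x, y) = \phi(x) - \phi(y) - \nabla \phi(y)^\top (x - y),$$

and a second-order expansion $d_\phi(x, x + \delta) \approx \tfrac{1}{2}\,\delta^\top \nabla^2 \phi(x)\,\delta$ shows why the local metric is governed by the Hessian of $\phi$, as the abstract notes: learning $\phi$ implicitly learns a position-dependent metric rather than a single fixed one.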