Arrow Research search

Author name cluster

Zhenan Sun

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

17 papers
2 author rows

Possible papers

17

AAAI Conference 2026 Conference Paper

Artificial Immune System of Secure Face Recognition Against Adversarial Attacks (Abstract Reprint)

  • Min Ren
  • Yunlong Wang
  • Yuhao Zhu
  • Yongzhen Huang
  • Zhenan Sun
  • Qi Li
  • Tieniu Tan

Deep learning-based face recognition models are vulnerable to adversarial attacks. In contrast to general noises, the presence of imperceptible adversarial noises can lead to catastrophic errors in deep face recognition models. The primary difference between adversarial noise and general noise lies in its specificity. Adversarial attack methods give rise to noises tailored to the characteristics of the individual image and recognition model at hand. Diverse samples and recognition models can engender specific adversarial noise patterns, which pose significant challenges for adversarial defense. Addressing this challenge in the realm of face recognition presents a more formidable endeavor due to the inherent nature of face recognition as an open set task. In order to tackle this challenge, it is imperative to employ customized processing for each individual input sample. Drawing inspiration from the biological immune system, which can identify and respond to various threats, this paper aims to create an artificial immune system to provide adversarial defense for face recognition. The proposed defense model incorporates the principles of antibody cloning, mutation, selection, and memory mechanisms to generate a distinct antibody for each input sample, wherein the term antibody refers to a specialized noise removal manner. Furthermore, we introduce a self-supervised adversarial training mechanism that serves as a simulated rehearsal of immune system invasions. Extensive experimental results demonstrate the efficacy of the proposed method, surpassing state-of-the-art adversarial defense methods.
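
The clonal-selection loop sketched below illustrates the mechanism the abstract describes: an "antibody" is a parameter vector for a sample-specific noise-removal operation, which is cloned, mutated, and selected to maximize a fitness score, with elites retained as memory. This is a generic clonal-selection sketch; the toy fitness function and Gaussian mutation are illustrative assumptions, not the paper's operators.

    import numpy as np

    rng = np.random.default_rng(0)

    def fitness(antibody, x):
        # stand-in for "recognition confidence after denoising x with antibody"
        return -np.linalg.norm(antibody - x)

    def clonal_selection(x, pop=8, clones=4, steps=20, sigma0=0.5):
        antibodies = rng.normal(size=(pop, x.size))
        for step in range(steps):
            scores = np.array([fitness(a, x) for a in antibodies])
            elites = antibodies[np.argsort(scores)[-pop // 2:]]   # selection
            sigma = sigma0 * (1.0 - step / steps)                 # decaying mutation
            mutants = (np.repeat(elites, clones, axis=0)
                       + sigma * rng.normal(size=(len(elites) * clones, x.size)))
            antibodies = np.vstack([elites, mutants])             # memory keeps elites
        return max(antibodies, key=lambda a: fitness(a, x))

    x = rng.normal(size=16)
    print(fitness(clonal_selection(x), x))  # near 0 => a well-adapted antibody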

AAAI Conference 2026 Conference Paper

UniAlignment: Semantic Alignment for Unified Image Generation, Understanding, Manipulation and Perception

  • Xinyang Song
  • Libin Wang
  • Weining Wang
  • Shaozhen Liu
  • DanDan Zheng
  • Jingdong Chen
  • Qi Li
  • Zhenan Sun

The remarkable success of diffusion models in text-to-image generation has sparked growing interest in expanding their capabilities to a variety of multi-modal tasks, including image understanding, manipulation, and perception. These tasks require advanced semantic comprehension across both visual and textual modalities, especially in scenarios involving complex semantic instructions. However, existing approaches often rely heavily on vision-language models (VLMs) or modular designs for semantic guidance, leading to fragmented architectures and computational inefficiency. To address these challenges, we propose UniAlignment, a unified multimodal generation framework within a single diffusion transformer. UniAlignment introduces a dual-stream diffusion training strategy that incorporates both intrinsic-modal semantic alignment and cross-modal semantic alignment, thereby enhancing the model's cross-modal consistency and instruction-following robustness. Additionally, we present SemGen-Bench, a new benchmark specifically designed to evaluate multimodal semantic consistency under complex textual instructions. Extensive experiments across multiple tasks and benchmarks demonstrate that UniAlignment outperforms existing baselines, underscoring the significant potential of diffusion models in unified multimodal generation.

AAAI Conference 2026 Conference Paper

VividListener: Expressive and Controllable Listener Dynamics Modeling for Multi-Modal Responsive Interaction

  • Shiying Li
  • Xingqun Qi
  • Bingkun Yang
  • Weile Chen
  • Zezhao Tian
  • Muyi Sun
  • Qifeng Liu
  • Man Zhang

Generating responsive listener head dynamics with nuanced emotions and expressive reactions is crucial for dialogue modeling in various virtual avatar animations. Previous studies mainly focus on the direct short-term production of listener behavior, overlooking fine-grained control over motion variations and emotional intensity, especially in long-sequence modeling. Moreover, the lack of long-term, large-scale paired speaker-listener corpora incorporating head dynamics and fine-grained multi-modality annotations limits the application of dialogue modeling. Therefore, we first collect a new large-scale multi-turn dataset of 3D dyadic conversation containing more than 1.4M valid frames for multi-modal responsive interaction, dubbed ListenerX. Additionally, we propose VividListener, a novel framework enabling fine-grained, expressive, and controllable listener dynamics modeling. This framework leverages multi-modal conditions as guiding principles for fostering coherent interactions between speakers and listeners. Specifically, we design the Responsive Interaction Module (RIM) to adaptively represent the multi-modal interactive embeddings. RIM ensures that the listener dynamics achieve fine-grained semantic coordination with textual descriptions and adjustments, while preserving expressive reactions to speaker behavior. Meanwhile, we propose Emotional Intensity Tags (EIT) for emotion intensity editing with multi-modal information integration, applied to both text descriptions and listener motion amplitude. Extensive experiments conducted on our newly collected ListenerX dataset demonstrate that VividListener achieves state-of-the-art performance, realizing expressive and controllable listener dynamics.

NeurIPS Conference 2024 Conference Paper

DuQuant: Distributing Outliers via Dual Transformation Makes Stronger Quantized LLMs

  • Haokun Lin
  • Haobo Xu
  • Yichen Wu
  • Jingzhi Cui
  • Yingtao Zhang
  • Linzhan Mou
  • Linqi Song
  • Zhenan Sun

Quantization of large language models (LLMs) faces significant challenges, particularly due to the presence of outlier activations that impede efficient low-bit representation. Traditional approaches predominantly address Normal Outliers, which are activations across all tokens with relatively large magnitudes. However, these methods struggle to smooth Massive Outliers that display significantly larger values, which leads to significant performance degradation in low-bit quantization. In this paper, we introduce DuQuant, a novel approach that utilizes rotation and permutation transformations to more effectively mitigate both massive and normal outliers. First, DuQuant constructs rotation matrices, using specific outlier dimensions as prior knowledge, to redistribute outliers to adjacent channels via block-wise rotation. Second, we employ a zigzag permutation to balance the distribution of outliers across blocks, thereby reducing block-wise variance. A subsequent rotation further smooths the activation landscape, enhancing model performance. DuQuant simplifies the quantization process and excels in managing outliers, outperforming state-of-the-art baselines across various sizes and types of LLMs on multiple tasks, even with 4-bit weight-activation quantization. Our code is available at https://github.com/Hsu1023/DuQuant.
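
A minimal NumPy sketch of the two transformations named in the abstract: a block-wise orthogonal rotation spreads an outlier channel across its block, and a zigzag permutation deals the largest-magnitude channels out across blocks to balance the per-block outlier budget. The block size, the random (rather than outlier-aware) rotation, and the toy activations are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    tokens, channels, block = 128, 16, 4

    x = rng.normal(size=(tokens, channels))
    x[:, 3] *= 50.0  # plant one "massive outlier" channel

    # 1) Block-wise rotation: an orthogonal matrix per block mixes the
    #    outlier channel with its neighbours, lowering the per-channel max.
    def random_rotation(n):
        q, _ = np.linalg.qr(rng.normal(size=(n, n)))
        return q

    R = np.zeros((channels, channels))
    for b in range(0, channels, block):
        R[b:b + block, b:b + block] = random_rotation(block)
    x_rot = x @ R  # exactly invertible: x = x_rot @ R.T

    # 2) Zigzag permutation: rank channels by magnitude, then deal them
    #    out back and forth so every block gets a similar outlier budget.
    order = np.argsort(np.abs(x_rot).max(axis=0))[::-1]
    n_blocks = channels // block
    zigzag = list(range(n_blocks)) + list(range(n_blocks - 1, -1, -1))
    slots = [[] for _ in range(n_blocks)]
    for rank, ch in enumerate(order):
        slots[zigzag[rank % len(zigzag)]].append(ch)
    x_bal = x_rot[:, np.concatenate(slots)]

    print("max |x| before/after rotation:",
          np.abs(x).max().round(1), np.abs(x_rot).max().round(1))
    print("per-block max after zigzag:",
          [np.abs(x_bal[:, b:b + block]).max().round(1)
           for b in range(0, channels, block)])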

JBHI Journal 2024 Journal Article

Exploring Generalizable Distillation for Efficient Medical Image Segmentation

  • Xingqun Qi
  • Zhuojie Wu
  • Wenxuan Zou
  • Min Ren
  • Yifan Gao
  • Muyi Sun
  • Shanghang Zhang
  • Caifeng Shan

Efficient medical image segmentation aims to provide accurate pixel-wise predictions with a lightweight implementation framework. However, existing lightweight networks generally overlook the generalizability of cross-domain medical segmentation tasks. In this paper, we propose Generalizable Knowledge Distillation (GKD), a novel framework for enhancing the performance of lightweight networks on cross-domain medical segmentation by distilling generalizable knowledge from powerful teacher networks. Considering the domain gaps between different medical datasets, we propose Model-Specific Alignment Networks (MSAN) to obtain domain-invariant representations. Meanwhile, a customized Alignment Consistency Training (ACT) strategy is designed to promote MSAN training. Based on the domain-invariant vectors in MSAN, we propose two generalizable distillation schemes: Dual Contrastive Graph Distillation (DCGD) and Domain-Invariant Cross Distillation (DICD). In DCGD, two implicit contrastive graphs are designed to model the intra-coupling and inter-coupling semantic correlations. Then, in DICD, the domain-invariant semantic vectors are reconstructed from the two networks (i.e., teacher and student) in a crossover manner to achieve simultaneous, hierarchical generalization of the lightweight networks. Moreover, a metric named Fréchet Semantic Distance (FSD) is tailored to verify the effectiveness of the regularized domain-invariant features. Extensive experiments conducted on the Liver, Retinal Vessel, and Colonoscopy segmentation datasets demonstrate the superiority of our method in terms of performance and generalization ability on lightweight networks.
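
By name, the Fréchet Semantic Distance is a Fréchet distance between Gaussians fitted to semantic feature vectors: d² = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^(1/2)). The standard closed form is sketched below (the same formula FID uses); the paper's exact feature choice and any tailoring are assumptions not reproduced here.

    import numpy as np
    from scipy.linalg import sqrtm

    def frechet_distance(feats_a, feats_b):
        """Frechet distance between Gaussians fitted to two feature sets."""
        mu_a, mu_b = feats_a.mean(0), feats_b.mean(0)
        cov_a = np.cov(feats_a, rowvar=False)
        cov_b = np.cov(feats_b, rowvar=False)
        covmean = sqrtm(cov_a @ cov_b)
        if np.iscomplexobj(covmean):  # numerical noise can leave tiny imaginary parts
            covmean = covmean.real
        diff = mu_a - mu_b
        return diff @ diff + np.trace(cov_a + cov_b - 2.0 * covmean)

    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(512, 64))           # e.g. teacher semantic vectors
    student = rng.normal(loc=0.1, size=(512, 64))  # e.g. student semantic vectors
    print(frechet_distance(teacher, student))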

AAAI Conference 2024 Conference Paper

Learning Explicit Contact for Implicit Reconstruction of Hand-Held Objects from Monocular Images

  • Junxing Hu
  • Hongwen Zhang
  • Zerui Chen
  • Mengcheng Li
  • Yunlong Wang
  • Yebin Liu
  • Zhenan Sun

Reconstructing hand-held objects from monocular RGB images is an appealing yet challenging task. In this task, contacts between hands and objects provide important cues for recovering the 3D geometry of the hand-held objects. Though recent works have employed implicit functions to achieve impressive progress, they ignore formulating contacts in their frameworks, which results in less realistic object meshes. In this work, we explore how to model contacts in an explicit way to benefit the implicit reconstruction of hand-held objects. Our method consists of two components: explicit contact prediction and implicit shape reconstruction. In the first part, we propose a new subtask of directly estimating 3D hand-object contacts from a single image. The part-level and vertex-level graph-based transformers are cascaded and jointly learned in a coarse-to-fine manner for more accurate contact probabilities. In the second part, we introduce a novel method to diffuse estimated contact states from the hand mesh surface to nearby 3D space and leverage the diffused contact probabilities to construct the implicit neural representation for the manipulated object. Benefiting from estimating the interaction patterns between the hand and the object, our method can reconstruct more realistic object meshes, especially for object parts that are in contact with hands. Extensive experiments on challenging benchmarks show that the proposed method outperforms the current state of the art by a large margin. Our code is publicly available at https://junxinghu.github.io/projects/hoi.html.
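
A minimal sketch of the contact-diffusion step described above: each 3D query point takes a distance-weighted average of the per-vertex contact probabilities on the hand mesh, so contact evidence spreads smoothly into nearby space. The Gaussian kernel and its bandwidth are illustrative assumptions; 778 is the MANO hand-mesh vertex count.

    import numpy as np

    def diffuse_contact(query_pts, hand_verts, vert_contact, sigma=0.01):
        """Weight per-vertex contact by a Gaussian of distance to each query."""
        d2 = ((query_pts[:, None, :] - hand_verts[None, :, :]) ** 2).sum(-1)
        w = np.exp(-d2 / (2.0 * sigma ** 2))  # (num_queries, num_verts)
        return (w * vert_contact[None, :]).sum(-1) / (w.sum(-1) + 1e-8)

    rng = np.random.default_rng(0)
    hand = rng.uniform(-0.1, 0.1, size=(778, 3))           # 778 = MANO vertex count
    contact = (rng.uniform(size=778) > 0.8).astype(float)  # per-vertex contact state
    queries = rng.uniform(-0.12, 0.12, size=(1000, 3))     # points near the object
    print(diffuse_contact(queries, hand, contact).shape)   # (1000,)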

ICML Conference 2022 Conference Paper

Disentangled Federated Learning for Tackling Attributes Skew via Invariant Aggregation and Diversity Transferring

  • Zhengquan Luo
  • Yunlong Wang 0003
  • Zilei Wang
  • Zhenan Sun
  • Tieniu Tan

Attributes skew hinders current federated learning (FL) frameworks from maintaining consistent optimization directions among the clients, which inevitably leads to performance reduction and unstable convergence. The core problems are that: 1) domain-specific attributes, which are non-causal and only locally valid, are inadvertently mixed into global aggregation; and 2) the one-stage optimization of entangled attributes cannot simultaneously satisfy two conflicting objectives, i.e., generalization and personalization. To cope with these, we propose disentangled federated learning (DFL) to disentangle the domain-specific and cross-invariant attributes into two complementary branches, which are trained independently by the proposed alternating local-global optimization. Importantly, convergence analysis proves that the FL system converges stably even if incomplete client models participate in the global aggregation, which greatly expands the application scope of FL. Extensive experiments verify that DFL facilitates FL with higher performance, better interpretability, and a faster convergence rate, compared with SOTA FL methods on both manually synthesized and realistic attributes-skew datasets.
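
A minimal sketch of the aggregation rule the two-branch design implies: the cross-invariant branch is averaged across clients and broadcast back, while each domain-specific branch never leaves its client. The FedAvg-style plain mean and the parameter layout are illustrative assumptions.

    import numpy as np

    def aggregate(clients):
        """clients: list of dicts holding 'invariant' and 'specific' params."""
        inv_mean = {
            k: np.mean([c["invariant"][k] for c in clients], axis=0)
            for k in clients[0]["invariant"]
        }
        for c in clients:
            c["invariant"] = {k: v.copy() for k, v in inv_mean.items()}
            # c["specific"] is untouched: personalization stays on-device
        return clients

    rng = np.random.default_rng(0)
    clients = [{"invariant": {"w": rng.normal(size=(4, 4))},
                "specific": {"w": rng.normal(size=(4, 4))}} for _ in range(3)]
    clients = aggregate(clients)
    assert np.allclose(clients[0]["invariant"]["w"], clients[1]["invariant"]["w"])
    assert not np.allclose(clients[0]["specific"]["w"], clients[1]["specific"]["w"])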

AAAI Conference 2020 Conference Paper

Age Progression and Regression with Spatial Attention Modules

  • Qi Li
  • Yunfan Liu
  • Zhenan Sun

Age progression and regression refers to aesthetically rendering a given face image to present the effects of face aging and rejuvenation, respectively. Although numerous studies have been conducted on this topic, there are two major problems: 1) multiple models are usually trained to simulate different age mappings, and 2) the photo-realism of generated face images is heavily influenced by the variation of training images in terms of pose, illumination, and background. To address these issues, in this paper, we propose a framework based on conditional Generative Adversarial Networks (cGANs) to achieve age progression and regression simultaneously. Particularly, since face aging and rejuvenation are largely different in terms of image translation patterns, we model these two processes using two separate generators, each dedicated to one age-changing process. In addition, we exploit spatial attention mechanisms to limit image modifications to regions closely related to age changes, so that images with high visual fidelity can be synthesized for in-the-wild cases. Experiments on multiple datasets demonstrate the ability of our model to synthesize lifelike face images at desired ages with personalized features well preserved, keeping age-irrelevant regions unchanged.
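
A minimal sketch of spatial-attention blending in the spirit of the abstract: the generator emits a raw output image plus a single-channel attention mask, and only high-attention pixels are modified, leaving age-irrelevant regions untouched. The blending rule follows the common attention-GAN formulation, which is an assumption about this paper's exact design.

    import numpy as np

    def blend(input_img, gen_img, attn_mask):
        """attn_mask in [0, 1]; 1 where age-related edits should apply."""
        return attn_mask * gen_img + (1.0 - attn_mask) * input_img

    rng = np.random.default_rng(0)
    x = rng.uniform(size=(128, 128, 3))                  # input face
    g = rng.uniform(size=(128, 128, 3))                  # generator's raw output
    m = np.zeros((128, 128, 1))
    m[40:90, 30:100] = 1.0                               # e.g. mouth/cheek region
    y = blend(x, g, m)
    assert np.allclose(y[0, 0], x[0, 0])                 # background pixel untouched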

AAAI Conference 2020 Conference Paper

Dynamic Graph Representation for Occlusion Handling in Biometrics

  • Min Ren
  • Yunlong Wang
  • Zhenan Sun
  • Tieniu Tan

The generalization ability of convolutional neural networks (CNNs) for biometrics drops greatly due to the adverse effects of various occlusions. To this end, we propose a novel unified framework that integrates the merits of both CNNs and graphical models to learn dynamic graph representations for occlusion problems in biometrics, called Dynamic Graph Representation (DGR). Convolutional features over certain regions are re-crafted by a graph generator to establish the connections among the spatial parts of biometrics and to build Feature Graphs based on these node representations. Each node of a Feature Graph corresponds to a specific part of the input image, and the edges express the spatial relationships between parts. By analyzing the similarities between the nodes, the framework is able to adaptively remove the nodes representing the occluded parts. During dynamic graph matching, we propose a novel strategy to measure the distances of both nodes and adjacency matrices. In this way, the proposed method is more convincing than CNN-based methods because the dynamic graph method implies a more illustrative and reasonable inference of the biometric decision. Experiments conducted on iris and face recognition demonstrate the superiority of the proposed framework, which boosts the accuracy of occluded biometric recognition by a large margin compared with baseline methods.
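
A crude sketch of the dynamic-graph matching idea: each image yields node features plus an adjacency matrix, suspected-occluded nodes are dropped, and the distance combines a node term with an adjacency term. The pruning rule and the 0.5 weighting below are illustrative assumptions, not the paper's strategy.

    import numpy as np

    def graph_distance(nodes_a, adj_a, nodes_b, adj_b, keep):
        """Distance over surviving nodes and their adjacency sub-matrices."""
        na, nb = nodes_a[keep], nodes_b[keep]
        aa, ab = adj_a[np.ix_(keep, keep)], adj_b[np.ix_(keep, keep)]
        node_term = np.linalg.norm(na - nb) / len(keep)
        adj_term = np.linalg.norm(aa - ab) / len(keep)
        return node_term + 0.5 * adj_term

    rng = np.random.default_rng(0)
    n, d = 16, 32                                     # 16 spatial parts per image
    feat_a, feat_b = rng.normal(size=(n, d)), rng.normal(size=(n, d))
    adj_a, adj_b = rng.uniform(size=(n, n)), rng.uniform(size=(n, n))
    # drop the 4 nodes that deviate most between probe and gallery
    keep = np.argsort(np.linalg.norm(feat_a - feat_b, axis=1))[: n - 4]
    print(graph_distance(feat_a, adj_a, feat_b, adj_b, keep))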

IJCAI Conference 2020 Conference Paper

Reference Guided Face Component Editing

  • Qiyao Deng
  • Jie Cao
  • Yunfan Liu
  • Zhenhua Chai
  • Qi Li
  • Zhenan Sun

Face portrait editing has achieved great progress in recent years. However, previous methods either 1) operate on pre-defined face attributes, lacking the flexibility to control the shapes of high-level semantic facial components (e.g., eyes, nose, mouth), or 2) take a manually edited mask or sketch as an intermediate representation for observable changes, but such additional input usually requires extra effort to obtain. To break the limitations (e.g., shape, mask, or sketch) of the existing methods, we propose a novel framework termed r-FACE (Reference guided FAce Component Editing) for diverse and controllable face component editing with geometric changes. Specifically, r-FACE takes an image inpainting model as the backbone, utilizing reference images as conditions for controlling the shape of face components. In order to encourage the framework to concentrate on the target face components, an example-guided attention module is designed to fuse attention features and the target face component features extracted from the reference image. Through extensive experimental validation and comparisons, we verify the effectiveness of the proposed framework.
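
A minimal sketch of an example-guided attention module of the kind described above: target-component feature tokens attend over reference-image feature tokens so the edit inherits the reference shape. Single-head dot-product attention with residual fusion; all dimensions are illustrative assumptions.

    import numpy as np

    def attention(q, k, v):
        """Single-head dot-product attention over reference tokens."""
        scores = q @ k.T / np.sqrt(q.shape[1])
        w = np.exp(scores - scores.max(axis=1, keepdims=True))
        w = w / w.sum(axis=1, keepdims=True)          # softmax over reference tokens
        return w @ v

    rng = np.random.default_rng(0)
    target = rng.normal(size=(64, 32))   # target face-component feature tokens
    ref = rng.normal(size=(49, 32))      # reference-image feature tokens
    fused = target + attention(target, ref, ref)      # residual fusion
    print(fused.shape)                                # (64, 32)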

AAAI Conference 2019 Conference Paper

Disentangled Variational Representation for Heterogeneous Face Recognition

  • Xiang Wu
  • Huaibo Huang
  • Vishal M. Patel
  • Ran He
  • Zhenan Sun

Visible (VIS) to near infrared (NIR) face matching is a challenging problem due to the significant discrepancy between the two domains and a lack of sufficient data for training cross-modal matching algorithms. Existing approaches attempt to tackle this problem by either synthesizing visible faces from NIR faces, extracting domain-invariant features from these modalities, or projecting heterogeneous data onto a common latent space for cross-modal matching. In this paper, we take a different approach in which we make use of the Disentangled Variational Representation (DVR) for cross-modal matching. First, we model a face representation with intrinsic identity information and its within-person variations. By exploring the disentangled latent variable space, a variational lower bound is employed to optimize the approximate posterior for NIR and VIS representations. Second, aiming to obtain a more compact and discriminative disentangled latent space, we impose a minimization of the identity information for the same subject and a relaxed correlation alignment constraint between the NIR and VIS modality variations. An alternating optimization scheme is proposed for the disentangled variational representation part and the heterogeneous face recognition network part. The mutual promotion between these two parts effectively reduces the NIR and VIS domain discrepancy and alleviates over-fitting. Extensive experiments on three challenging NIR-VIS heterogeneous face recognition databases demonstrate that the proposed method achieves significant improvements over state-of-the-art methods.
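
A minimal sketch of a correlation-alignment penalty between the NIR and VIS variation embeddings, in the spirit of the "relaxed correlation alignment constraint" mentioned above; the CORAL-style Frobenius form below is an assumption, since the paper's exact relaxation is not reproduced.

    import numpy as np

    def coral(feats_a, feats_b):
        """Frobenius distance between second-order statistics (CORAL-style)."""
        ca = np.cov(feats_a, rowvar=False)
        cb = np.cov(feats_b, rowvar=False)
        d = feats_a.shape[1]
        return np.sum((ca - cb) ** 2) / (4.0 * d * d)

    rng = np.random.default_rng(0)
    nir_var = rng.normal(scale=1.0, size=(256, 128))  # NIR variation embeddings
    vis_var = rng.normal(scale=1.5, size=(256, 128))  # VIS variation embeddings
    print(coral(nir_var, vis_var))  # shrinks as the two distributions align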

NeurIPS Conference 2018 Conference Paper

IntroVAE: Introspective Variational Autoencoders for Photographic Image Synthesis

  • Huaibo Huang
  • Zhihang Li
  • Ran He
  • Zhenan Sun
  • Tieniu Tan

We present a novel introspective variational autoencoder (IntroVAE) model for synthesizing high-resolution photographic images. IntroVAE is capable of self-evaluating the quality of its generated samples and improving itself accordingly. Its inference and generator models are jointly trained in an introspective way. On one hand, the generator is required to reconstruct the input images from the noisy outputs of the inference model, as in standard VAEs. On the other hand, the inference model is encouraged to classify between the generated and real samples, while the generator tries to fool it, as in GANs. These two famous generative frameworks are integrated in a simple yet efficient single-stream architecture that can be trained in a single stage. IntroVAE preserves the advantages of VAEs, such as stable training and a nice latent manifold. Unlike most other hybrids of VAEs and GANs, IntroVAE requires no extra discriminators, because the inference model itself serves as a discriminator to distinguish between the generated and real samples. Experiments demonstrate that our method produces high-resolution photo-realistic images (e.g., CelebA images at 1024×1024) that are comparable to or better than those of state-of-the-art GANs.
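
A minimal PyTorch sketch of the introspective objective: the encoder's KL term doubles as a discriminator energy, pushed down on real images and up (hinged at a margin m) on generated samples, while the generator tries to lower it on its own outputs. Network sizes, the margin, and the omitted terms (e.g., the hinge on newly sampled images and the generator's reconstruction term) are simplifications of the paper's full losses.

    import torch
    import torch.nn.functional as F

    class Enc(torch.nn.Module):
        def __init__(self, d=64, z=8):
            super().__init__()
            self.f = torch.nn.Linear(d, 2 * z)
        def forward(self, x):
            mu, logvar = self.f(x).chunk(2, dim=1)
            return mu, logvar

    def kl_energy(mu, logvar):
        # KL(q(z|x) || N(0, I)): the encoder's "discriminator energy"
        return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(1).mean()

    def encoder_loss(enc, dec, x, m=10.0, beta=1.0):
        mu, logvar = enc(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        x_rec = dec(z)
        mu_r, lv_r = enc(x_rec.detach())            # reconstructions act as fakes
        return F.mse_loss(x_rec, x) + beta * (
            kl_energy(mu, logvar)                   # real images: low energy
            + F.relu(m - kl_energy(mu_r, lv_r)))    # fakes: energy above margin

    def generator_loss(enc, dec, z):
        mu_f, lv_f = enc(dec(z))                    # generator tries to lower it
        return kl_energy(mu_f, lv_f)

    enc, dec = Enc(), torch.nn.Linear(8, 64)
    x = torch.randn(16, 64)
    print(encoder_loss(enc, dec, x).item(),
          generator_loss(enc, dec, torch.randn(16, 8)).item())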

NeurIPS Conference 2018 Conference Paper

Learning a High Fidelity Pose Invariant Model for High-resolution Face Frontalization

  • Jie Cao
  • Yibo Hu
  • Hongwen Zhang
  • Ran He
  • Zhenan Sun

Face frontalization refers to the process of synthesizing the frontal view of a face from a given profile. Due to self-occlusion and appearance distortion in the wild, it is extremely challenging to recover faithful results and preserve texture details at high resolution. This paper proposes a High Fidelity Pose Invariant Model (HF-PIM) to produce photographic and identity-preserving results. HF-PIM frontalizes profiles through a novel texture warping procedure and leverages a dense correspondence field to bind the 2D and 3D surface spaces. We decompose the prerequisite of warping into dense correspondence field estimation and facial texture map recovery, both of which are well addressed by deep networks. Different from reconstruction methods relying on 3D data, we also propose Adversarial Residual Dictionary Learning (ARDL) to supervise facial texture map recovery with only monocular images. Exhaustive experiments in both controlled and uncontrolled environments demonstrate that the proposed method not only boosts the performance of pose-invariant face recognition but also dramatically improves high-resolution frontalization appearance.
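
A minimal sketch of warping by a dense correspondence field as described above: every output pixel looks up a (u, v) coordinate in a facial texture map. Nearest-neighbour sampling stands in for the paper's differentiable warping, and the recovery of the field and texture by deep networks is not reproduced.

    import numpy as np

    def warp(texture, corr_field):
        """texture: (H, W, 3); corr_field: (H, W, 2) with (u, v) in [0, 1)."""
        h, w = texture.shape[:2]
        u = (corr_field[..., 0] * (w - 1)).round().astype(int)
        v = (corr_field[..., 1] * (h - 1)).round().astype(int)
        return texture[v, u]  # gather: output pixel <- texture(u, v)

    rng = np.random.default_rng(0)
    tex = rng.uniform(size=(64, 64, 3))     # recovered facial texture map
    field = rng.uniform(size=(64, 64, 2))   # dense correspondence field
    print(warp(tex, field).shape)           # (64, 64, 3)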

NeurIPS Conference 2017 Conference Paper

Deep Supervised Discrete Hashing

  • Qi Li
  • Zhenan Sun
  • Ran He
  • Tieniu Tan

With the rapid growth of image and video data on the web, hashing has been extensively studied for image and video search in recent years. Benefiting from recent advances in deep learning, deep hashing methods have achieved promising results for image retrieval. However, previous deep hashing methods have some limitations (e.g., the semantic information is not fully exploited). In this paper, we develop a deep supervised discrete hashing algorithm based on the assumption that the learned binary codes should be ideal for classification. Both the pairwise label information and the classification information are used to learn the hash codes within a one-stream framework. We constrain the outputs of the last layer to be binary codes directly, which is rarely investigated in deep hashing algorithms. Because of the discrete nature of hash codes, an alternating minimization method is used to optimize the objective function. Experimental results show that our method outperforms current state-of-the-art methods on benchmark datasets.
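
A minimal PyTorch sketch of the joint objective described above: real-valued hashing-layer outputs are supervised by a pairwise-label likelihood term and by a linear classifier on the codes, with a quantization penalty pulling outputs toward ±1. The continuous relaxation below stands in for the paper's discrete alternating minimization; the weights alpha and mu are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def dsdh_loss(h, labels, classifier, alpha=1.0, mu=0.1):
        """h: (N, bits) real-valued outputs of the hashing layer."""
        sim = (labels[:, None] == labels[None, :]).float()   # pairwise labels
        inner = 0.5 * (h @ h.t())
        pairwise = (F.softplus(inner) - sim * inner).mean()  # pairwise likelihood
        cls = F.cross_entropy(classifier(h), labels)         # codes should classify
        quant = (h.sign().detach() - h).pow(2).mean()        # pull h toward +-1
        return pairwise + alpha * cls + mu * quant

    bits, classes = 32, 10
    h = torch.tanh(torch.randn(64, bits))     # stand-in for network outputs
    labels = torch.randint(0, classes, (64,))
    clf = torch.nn.Linear(bits, classes)
    print(dsdh_loss(h, labels, clf).item())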

AAAI Conference 2017 Conference Paper

Learning Invariant Deep Representation for NIR-VIS Face Recognition

  • Ran He
  • Xiang Wu
  • Zhenan Sun
  • Tieniu Tan

Visible versus near infrared (VIS-NIR) face recognition is still a challenging heterogeneous task due to the large appearance difference between the VIS and NIR modalities. This paper presents a deep convolutional network approach that uses only one network to map both NIR and VIS images to a compact Euclidean space. The low-level layers of this network are trained only on large-scale VIS data. Each convolutional layer is implemented by the simplest case of the maxout operator. The high-level layer is divided into two orthogonal subspaces that contain modality-invariant identity information and modality-variant spectrum information, respectively. Our joint formulation leads to an alternating minimization approach for deep representation at training time and an efficient computation for heterogeneous data at testing time. Experimental evaluations show that our method achieves a 94% verification rate at FAR=0.1% on the challenging CASIA NIR-VIS 2.0 face recognition dataset. Compared with state-of-the-art methods, it reduces the error rate by 58% with only a compact 64-D representation.
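
A minimal sketch of the high-level layer split described above: one projection carries modality-invariant identity information, the other modality-variant spectrum information, with a soft penalty encouraging the two subspaces to stay orthogonal. Dimensions and the penalty form are illustrative assumptions.

    import torch

    feat_dim, id_dim, spec_dim = 256, 64, 32
    W_id = torch.randn(feat_dim, id_dim, requires_grad=True)      # identity subspace
    W_spec = torch.randn(feat_dim, spec_dim, requires_grad=True)  # spectrum subspace

    def split_heads(features):
        return features @ W_id, features @ W_spec

    def orthogonality_penalty():
        # push the two subspaces to carry disjoint information
        return (W_id.t() @ W_spec).pow(2).sum()

    x = torch.randn(8, feat_dim)             # shared high-level features
    identity, spectrum = split_heads(x)
    print(identity.shape, spectrum.shape, orthogonality_penalty().item())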

IJCAI Conference 2016 Conference Paper

Group-Invariant Cross-Modal Subspace Learning

  • Jian Liang
  • Ran He
  • Zhenan Sun
  • Tieniu Tan

Cross-modal learning tries to find various types of heterogeneous data (e.g., images) from a given query (e.g., text). Most cross-modal algorithms heavily rely on semantic labels and benefit from a semantic-preserving aggregation of pairs of heterogeneous data. However, semantic labels are not readily obtained in many real-world applications. This paper studies the aggregation of these pairs in an unsupervised manner. Apart from lower pairwise correspondences that force the data from one pair to be close to each other, we propose a novel concept, referred to as groupwise correspondences, supposing that each pair of heterogeneous data is drawn from an identical latent group. We incorporate these groupwise correspondences into a canonical correlation analysis (CCA) model and seek a latent common subspace where data are naturally clustered into several latent groups. To simplify this nonconvex and nonsmooth problem, we introduce a non-negative orthogonal variable to represent the soft group membership; two coupled, computationally efficient subproblems (a generalized ratio-trace problem and a non-negative problem) are then alternately minimized to guarantee that the proposed algorithm converges locally. Experimental results on two benchmark datasets demonstrate that the proposed unsupervised algorithm achieves performance comparable even to some state-of-the-art supervised cross-modal algorithms.
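
A heavily simplified sketch of the idea: project paired modalities into a common subspace with CCA, then estimate a group membership for each pair in that space. The paper couples these two steps in one alternating solver with a non-negative orthogonal membership variable; here each step is run once, with scikit-learn's CCA and k-means as stand-ins.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    n, d_img, d_txt, groups = 200, 40, 30, 5
    z = rng.normal(size=(n, 8))  # shared latent generating both modalities
    X = z @ rng.normal(size=(8, d_img)) + 0.1 * rng.normal(size=(n, d_img))
    Y = z @ rng.normal(size=(8, d_txt)) + 0.1 * rng.normal(size=(n, d_txt))

    cca = CCA(n_components=8).fit(X, Y)
    Xc, Yc = cca.transform(X, Y)             # common subspace for both modalities
    membership = KMeans(n_clusters=groups, n_init=10, random_state=0).fit_predict(
        np.hstack([Xc, Yc]))                 # latent groups over paired data
    print(np.bincount(membership))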

AAAI Conference 2016 Conference Paper

Simultaneous Feature and Sample Reduction for Image-Set Classification

  • Man Zhang
  • Ran He
  • Dong Cao
  • Zhenan Sun
  • Tieniu Tan

Image-set classification is the assignment of a label to a given image set. In real-life scenarios such as surveillance videos, each image set often contains much redundancy in terms of features and samples. This paper introduces a joint learning method for image-set classification that simultaneously learns compact binary codes and removes redundant samples. The joint objective function of our model mainly includes two parts. The first part seeks a hashing function to generate binary codes that have larger inter-class and smaller intra-class distances. The second reduces redundant samples with discrete constraints in a low-rank way. A kernel method based on anchor points is further used to reduce sample variations. The proposed discrete objective function is simplified into a series of sub-problems that admit an analytical solution, resulting in a high-quality discrete solution with a low computational cost. Experiments on three commonly used image-set datasets show that the proposed method is efficient and effective for face recognition from image sets.