Arrow Research search

Author name cluster

Kun Wei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

20 papers
2 author rows

Possible papers

20

AAAI Conference 2026 Conference Paper

Backtrace Mamba: Reviving Critical Temporal Contexts via Hierarchical Memory Compression for Online Action Detection

  • Su Yan
  • Jiahua Li
  • Kun Wei
  • Cheng Deng

Online Action Detection (OAD) requires real-time prediction of ongoing actions without access to future frames, posing challenges in balancing computational efficiency and long-term dependency modeling. Existing methods either suffer from slow training and limited temporal receptive fields, or face high computational costs and delayed inference, lacking the capability to tackle extra-long video inputs. Thus, we present a novel Mamba-based OAD framework (MOAD) that efficiently and effectively performs OAD. A hierarchical memory mechanism is introduced to intelligently store high-value scene and action frames based on motion-aware similarity metrics, preserving essential historical knowledge in an online manner. To further reduce storage space, we design a memory quantization method to compress the stored historical features. Additionally, a temporal soft pruning strategy built upon the memory bank is proposed to dynamically remove redundant features, reducing temporal redundancy while maintaining temporal coherence. Extensive experiments on four challenging benchmarks show that our method significantly outperforms existing methods.

AAAI Conference 2026 Conference Paper

Diffusion-calibrated Continual Test-time Adaptation

  • Xu Yang
  • Moqi Li
  • Kun Wei

Continual Test-Time Domain Adaptation (CTTA) aims to adapt a pre-trained source model to a dynamically evolving target domain without requiring additional data collection or labeling efforts. A key challenge in this setting is to achieve rapid performance improvement in the current domain using unlabeled data, while avoiding impairing generalization to future domains in complex scenarios. To enhance the discriminative capability of the inference models, we propose a novel framework that integrates an external auxiliary generative model with a test-time adaptive method, leveraging cross-validation to identify reliable supervisory signals. Specifically, for each test instance, we utilize a diffusion module to generate a calibrated instance under the textual description of its predicted category. Based on the generated instance, we design a learning strategy with the following components: (1) the calibrated instance and its category are used to form a supervisory signal; (2) the predicted category of the calibrated instance is compared with that of the test instance to select reliable signals. For these generated and selected instances, adaptive weighting is applied during optimization to stabilize the category distribution and preserve prediction diversity. Finally, based on the inverse process of diffusion, we construct the negative instance of the generated instance and introduce robust contrastive learning to further calibrate model optimization. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple benchmarks. Ablation studies further validate the effectiveness of each proposed component.

AAAI Conference 2026 Conference Paper

Hearing More with Less: Multi-Modal Retrieval-and-Selection Augmented Conversational LLM-Based ASR

  • Bingshen Mu
  • Hexin Liu
  • Hongfei Xue
  • Kun Wei
  • Lei Xie

Automatic Speech Recognition (ASR) aims to convert human speech content into corresponding text. In conversational scenarios, effectively utilizing context can enhance its accuracy. Large Language Models' (LLMs) exceptional long-context understanding and reasoning abilities enable LLM-based ASR (LLM-ASR) to leverage historical context for recognizing conversational speech, which has a high degree of contextual relevance. However, existing conversational LLM-ASR methods use a fixed number of preceding utterances or the entire conversation history as context, resulting in significant ASR confusion and computational costs due to massive irrelevant and redundant information. This paper proposes a multi-modal retrieval-and-selection method named MARS that augments conversational LLM-ASR by enabling it to retrieve and select the most relevant acoustic and textual historical context for the current utterance. Specifically, multi-modal retrieval obtains a set of candidate historical contexts, each exhibiting high acoustic or textual similarity to the current utterance. Multi-modal selection calculates the acoustic and textual similarities for each retrieved candidate historical context and, by employing our proposed near-ideal ranking method to consider both similarities, selects the best historical context. Evaluations on the Interspeech 2025 Multilingual Conversational Speech Language Model Challenge dataset show that the LLM-ASR, when trained on only 1.5K hours of data and equipped with the MARS, outperforms the state-of-the-art top-ranking system trained on 179K hours of data.

AAAI Conference 2026 Conference Paper

Towards Robust Edge Model Adaptation via Elastic Architecture Search

  • Xianhang Chu
  • Xu Yang
  • Kun Wei
  • Xi Wang

Continual test-time adaptation (CTTA) enables online model adjustment under dynamic distribution shifts in real-world environments. However, most existing CTTA frameworks adopt fixed model architectures, lacking the structural flexibility required for deployment across heterogeneous edge devices with varying computational capacities. To address this, we propose an elastic framework for edge CTTA that performs resource-aware dynamic model search based on a pre-trained binary Supernet. This enables architectural flexibility by generating personalized models tailored to the resource constraints of different edge devices. Considering the evolving distribution of unlabeled data on edge devices during deployment, we introduce a pluggable lightweight fine-tuning mechanism. By inserting low-rank adapters into the frozen binary backbone, the model enables continual self-supervised adaptation with minimal computational overhead. In addition, we propose a structure-aware knowledge reflux mechanism that transfers the adaptation experience from fine-tuned edge models back into the Supernet. By distilling knowledge into structurally aligned Supernet paths, future architecture search is improved without requiring retraining. Experiments on multiple benchmarks validate that our method achieves state-of-the-art performance while significantly reducing resource consumption, with re-searched models after knowledge reflux showing further improvements.

AAAI Conference 2025 Conference Paper

Compress to One Point: Neural Collapse for Pre-Trained Model-Based Class-Incremental Learning

  • Kun Wei
  • Zhe Xu
  • Cheng Deng

Class-Incremental Learning (CIL) requires an artificial intelligence system to continually learn different tasks without class overlaps. To achieve CIL, some methods introduce a Pre-Trained Model (PTM) and leverage its generalized feature representation to continually learn downstream incremental tasks. However, the generalized feature representations of the PTM are not adaptive and discriminative for these various incremental classes, which may be out of distribution for the pre-training dataset. In addition, since the incremental classes cannot be learned at once, the class relationships cannot be constructed optimally, leading to undiscriminative feature representations for downstream tasks. Thus, we propose a novel Pre-Trained Model-based Class-Incremental Learning (PTM-CIL) method to explore the potential of the PTM and obtain optimal class relationships. Inspired by Neural Collapse theory, we introduce a frozen Equiangular Tight Frame classifier to construct an optimal classifier structure for all seen classes, guiding feature representation adaptation for downstream continual tasks. Specifically, Task-Related Adaptation is proposed to modulate the generalized feature representation to bridge the gap between the pre-training dataset and various downstream datasets. Then, a Feature Compression Module is introduced to compress various features to the specific classifier weights, constructing the feature transfer pattern and satisfying the characteristics of Neural Collapse. Optimal Structural Alignment is designed to supervise the feature compression process, helping to achieve optimal class relationships across different tasks. Extensive experiments on seven datasets prove the effectiveness of our method.

NeurIPS Conference 2025 Conference Paper

Dual-Space Semantic Synergy Distillation for Continual Learning of Unlabeled Streams

  • Donghao Sun
  • Xi Wang
  • Xu Yang
  • Kun Wei
  • Cheng Deng

Continual learning from unlabeled data streams while effectively combating catastrophic forgetting poses an intractable challenge. Traditional methods predominantly rely on visual clustering techniques to generate pseudo labels, which are frequently plagued by problems such as noise and suboptimal quality, profoundly affecting model evolution. To surmount these obstacles, we introduce an innovative approach that synergistically combines both visual and textual information to generate dual-space hybrid pseudo labels for reliable continual model evolution. Specifically, by harnessing the capabilities of large multimodal models, we initially generate generalizable text descriptions for a few representative samples. These descriptions then undergo a "Coarse-to-Fine" refinement process to capture the subtle nuances between different data points, significantly enhancing the semantic accuracy of the descriptions. Simultaneously, a novel cross-modal hybrid approach seamlessly integrates these fine-grained textual descriptions with visual features, thereby creating a more robust and reliable supervisory signal. Finally, such descriptions are employed to alleviate the catastrophic forgetting issue via a semantic alignment distillation, which capitalizes on the stability inherent in language knowledge to effectively prevent the model from forgetting previously learned information. Comprehensive experiments conducted on a variety of benchmarks demonstrate that our proposed method attains state-of-the-art performance, and ablation studies further substantiate the effectiveness and superiority of the proposed method.

AAAI Conference 2025 Conference Paper

Energy vs. Noise: Towards Robust Temporal Action Localization in Open-World

  • Chenyu Mu
  • Jiahua Li
  • Kun Wei
  • Cheng Deng

Temporal Action Localization (TAL) aims to accurately identify the start and end times of actions in untrimmed videos and classify them according to specific labels. However, the complexity and imbalance between target actions and background in video data make this task particularly challenging. Although relying on large amounts of finely annotated data has led to some progress in existing methods, the presence of noisy labels in large-scale annotations limits their application in open-world scenarios. To address this issue, we take the perspective of the data itself, modeling the different energy patterns exhibited by the action foreground and background in video data to enhance video content inference. Specifically, we propose the Energy-Driven Meta Purifier (EDMP) method, which utilizes a meta-learning training paradigm to avoid dependence on extensive and precise manual annotations. Under this pipeline, we use energy modeling to distinguish between different actions and backgrounds from the perspective of energy differences, thereby improving the model's robustness to category noise. Additionally, these energy-based distinctions are employed to further refine action boundaries, enhancing the model's robustness to boundary noise. Experiments on THUMOS14 and ActivityNet1.3 datasets show that EDMP effectively enhances the robustness of TAL models.

IJCAI Conference 2025 Conference Paper

Outstanding Orthodontist: No More Artifactual Teeth in Talking Face

  • Zibo Su
  • Ziqi Zhang
  • Kun Wei
  • Xu Yang
  • Cheng Deng

Audio-driven talking face synthesis (TFS) enables the creation of realistic speaking videos by combining a single facial image with a speech audio clip. Unlike other facial features that naturally deform during speech, teeth represent unique rigid structures whose shape and size should remain constant throughout the video sequence. However, current methods often produce temporal inconsistencies and artifacts in the teeth region, resulting in a less realistic appearance of the generated videos. To address this, we propose OrthoNet, a plug-and-play framework designed to eliminate unrealistic teeth effects in audio-driven TFS. Our method introduces a Detail-oriented Teeth Aligner module, designed to preserve teeth details and adapt to their shape. It works with a Memory-guided Teeth Stabilizer that integrates a long-term memory bank for global teeth structure and a short-term memory module for local temporal dynamics. Through this framework, OrthoNet acts like an orthodontist for existing Audio2Video methods, ensuring that teeth maintain natural rigidity and temporal consistency even under varying degrees of teeth occlusion. Extensive experiments demonstrate that our method makes the teeth in generated videos appear more natural during speech, significantly enhancing the temporal consistency and structural stability of audio-driven video generation.

IJCAI Conference 2025 Conference Paper

Q-MiniSAM2: A Quantization-based Benchmark for Resource-Efficient Video Segmentation

  • Xuanxuan Ren
  • Xiangyu Li
  • Kun Wei
  • Xu Yang
  • Yanhua Yang

Segment Anything Model 2 (SAM2) is a new-generation, high-precision model for image and video segmentation, offering extensive application prospects across numerous computer vision fields. However, as a large-scale model, its huge memory demands and expansive computing costs pose challenges for practical deployment. This paper presents Q-MiniSAM2, an efficient Quantization-based segmentation benchmark tailored to optimize SAM2 by Minimizing memory consumption and accelerating computations. We begin with applying Post-Training Quantization (PTQ) to SAM2, requiring only a relatively small dataset for network calibration, thereby eliminating the need for retraining. Building upon PTQ, we further introduce a Hierarchy-based Video Quantization method to enhance the model’s capacity to capture video semantics and temporal correlations across different time scales. Furthermore, we observe that SAM2’s memory overhead is predominantly concentrated on processing historical frames, and the redundant cross-attention computations significantly increase memory and computational costs due to the imperceptible change of the short time intervals between these frames. To tackle this issue, an Adaptive Mutual-KV mechanism is proposed to mitigate excessive cross-attention by leveraging inter-frame similarities. Comprehensive experiments demonstrate that the proposed approach achieves superior performance compared to state-of-the-art methods, underscoring its potential for efficient and scalable video segmentation.

NeurIPS Conference 2024 Conference Paper

ACFun: Abstract-Concrete Fusion Facial Stylization

  • Jiapeng Ji
  • Kun Wei
  • Ziqi Zhang
  • Cheng Deng

Owing to advancements in image synthesis techniques, stylization methodologies for large models have garnered remarkable outcomes. However, when it comes to processing facial images, the outcomes frequently fall short of expectations. Facial stylization is predominantly challenged by two significant hurdles. Firstly, obtaining a large dataset of high-quality stylized images is difficult. The scarcity and diversity of artistic styles make it impractical to compile comprehensive datasets for each style. Secondly, while many methods can transfer colors and strokes from style images, these elements alone cannot fully capture a specific style, which encompasses both concrete and abstract visual elements. Additionally, facial stylization often alters the visual features of the face, making it challenging to balance these changes with the need to retain facial information. To address these issues, we propose a novel method called ACFun, which uses only one style image and one facial image for facial stylization. ACFun comprises an Abstract Fusion Module (AFun) and a Concrete Fusion Module (CFun), which separately learn the abstract and concrete features of the style and face. We also design a Face and Style Imagery Alignment Loss to align the style image with the face image in the latent space. Finally, we generate styled facial images from noise directly to complete the facial stylization task. Experiments show that our method outperforms others in facial stylization, producing highly artistic and visually pleasing results.

IJCAI Conference 2024 Conference Paper

Navigating Continual Test-time Adaptation with Symbiosis Knowledge

  • Xu Yang
  • Moqi Li
  • Jie Yin
  • Kun Wei
  • Cheng Deng

Continual test-time domain adaptation seeks to adapt the source pre-trained model to a continually changing target domain without incurring additional data acquisition or labeling costs. Unfortunately, existing mainstream methods may result in a detrimental cycle. This is attributed to noisy pseudo-labels caused by the domain shift, which immediately negatively impacts the model's knowledge. The long-term accumulation of these negative effects exacerbates the model's difficulty in generalizing to future domain shifts and contributes to catastrophic forgetting. To address these challenges, this paper introduces a Dual-stream Network that independently optimizes different parameters in each stream to capture symbiotic knowledge from continual domains, thereby ensuring generalization while enhancing instantaneous discrimination. Furthermore, to prevent catastrophic forgetting, a weighted soft parameter alignment method is designed to leverage knowledge from the source model. Finally, efforts are made to calibrate and explore reliable supervision signals to mitigate instantaneous negative optimization. These include label calibration with prior knowledge, label selection using self-adaptive confidence thresholds, and a soft-weighted contrastive module for capturing potential semantics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on several benchmark datasets.

IJCAI Conference 2023 Conference Paper

Exploring Safety Supervision for Continual Test-time Domain Adaptation

  • Xu Yang
  • Yanan Gu
  • Kun Wei
  • Cheng Deng

Continual test-time domain adaptation aims to adapt a source pre-trained model to a continually changing target domain without using any source data. Unfortunately, existing methods based on pseudo-label learning suffer from the changing target domain environment, and the quality of generated pseudo-labels is attenuated due to the domain shift, leading to instantaneous negative learning and long-term knowledge forgetting. To solve these problems, in this paper, we propose a simple yet effective framework for exploring safety supervision with three elaborate strategies: Label Safety, Sample Safety, and Parameter Safety. Firstly, to select reliable pseudo-labels, we define and adjust the confidence threshold in a self-adaptive manner according to the test-time learning status. Secondly, a soft-weighted contrastive learning module is presented to explore the highly-correlated samples and discriminate uncorrelated ones, improving the instantaneous efficiency of the model. Finally, we frame a Soft Weight Alignment strategy to normalize the distance between the parameters of the adapted model and the source pre-trained model, which alleviates the long-term problem of knowledge forgetting and significantly improves the accuracy of the adapted model in the late adaptation stage. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on several benchmark datasets.

IJCAI Conference 2023 Conference Paper

Hierarchical Prompt Learning for Compositional Zero-Shot Recognition

  • Henan Wang
  • Muli Yang
  • Kun Wei
  • Cheng Deng

Compositional Zero-Shot Learning (CZSL) aims to imitate the powerful generalization ability of human beings to recognize novel compositions of known primitive concepts that correspond to a state and an object, e.g., purple apple. To fully capture the intra- and inter-class correlations between compositional concepts, in this paper, we propose to learn them in a hierarchical manner. Specifically, we set up three hierarchical embedding spaces that respectively model the states, the objects, and their compositions, which serve as three “experts” that can be combined in inference for more accurate predictions. We achieve this based on the recent success of large-scale pretrained vision-language models, e.g., CLIP, which provides a strong initial knowledge of image-text relationships. To better adapt this knowledge to CZSL, we propose to learn three hierarchical prompts by explicitly fixing the unrelated word tokens in the three embedding spaces. Despite its simplicity, our proposed method consistently yields superior performance over current state-of-the-art approaches on three widely-used CZSL benchmarks.

AAAI Conference 2022 Conference Paper

Learning Universal Adversarial Perturbation by Adversarial Example

  • Maosen Li
  • Yanhua Yang
  • Kun Wei
  • Xu Yang
  • Heng Huang

Deep learning models have been shown to be susceptible to universal adversarial perturbation (UAP), which has aroused wide concern in the community. Compared with conventional adversarial attacks that generate adversarial samples at the instance level, UAP can fool the target model for different instances with only a single perturbation, enabling us to evaluate the robustness of the model from a more effective and accurate perspective. Existing universal attack methods fail to exploit the differences and connections between the instance and universal levels to produce dominant perturbations. To address this challenge, we propose a new universal attack method that unifies instance-specific and universal attacks from a feature perspective to generate a more dominant UAP. Specifically, we reformulate the UAP generation task as a minimax optimization problem and then utilize the instance-specific attack method to solve the minimization problem, thereby obtaining better training data for generating the UAP. At the same time, we also introduce a consistency regularizer to explore the relationship between training data, thus further improving the dominance of the generated UAP. Furthermore, our method is generic with no additional assumptions about the training data and hence can be applied in both data-dependent (supervised) and data-independent (unsupervised) manners. Extensive experiments demonstrate that the proposed method improves performance by a significant margin over existing methods in both data-dependent and data-independent settings. Code is available at https://github.com/lisenxd/AT-UAP.

ICML Conference 2022 Conference Paper

ME-GAN: Learning Panoptic Electrocardio Representations for Multi-view ECG Synthesis Conditioned on Heart Diseases

  • Jintai Chen
  • Kuanlun Liao
  • Kun Wei
  • Haochao Ying
  • Danny Z. Chen
  • Jian Wu 0001

Electrocardiogram (ECG) is a widely used non-invasive diagnostic tool for heart diseases. Many studies have devised ECG analysis models (e.g., classifiers) to assist diagnosis. As an upstream task, researchers have built generative models to synthesize ECG data, which are beneficial for providing training samples, privacy protection, and annotation reduction. However, previous generative methods for ECG often neither synthesized multi-view data nor dealt with heart disease conditions. In this paper, we propose a novel disease-aware generative adversarial network for multi-view ECG synthesis called ME-GAN, which attains panoptic electrocardio representations conditioned on heart diseases and projects the representations onto multiple standard views to yield ECG signals. Since ECG manifestations of heart diseases are often localized in specific waveforms, we propose a new "mixup normalization" to inject disease information precisely into suitable locations. In addition, we propose a "view discriminator" to revert disordered ECG views into a pre-determined order, supervising the generator to obtain ECG representing correct view characteristics. Besides, a new metric, rFID, is presented to assess the quality of the synthesized ECG signals. Comprehensive experiments verify that our ME-GAN performs well on multi-view ECG signal synthesis with trustworthy morbid manifestations.

AAAI Conference 2021 Conference Paper

Class-Incremental Instance Segmentation via Multi-Teacher Networks

  • Yanan Gu
  • Cheng Deng
  • Kun Wei

Although deep neural networks have achieved amazing results on instance segmentation, they are still ill-equipped when they are required to learn new tasks incrementally. Concretely, they suffer from “catastrophic forgetting”, an abrupt degradation of performance on old classes with the initial training data missing. Moreover, they are subjected to a negative transfer problem on new classes, which renders the model unable to update its knowledge while preserving the previous knowledge. To address these problems, we propose an incremental instance segmentation method that consists of three networks: Former Teacher Network (FTN), Current Student Network (CSN) and Current Teacher Network (CTN). Specifically, FTN supervises CSN to preserve the previous knowledge, and CTN supervises CSN to adapt to new classes. The supervision of two teacher networks is achieved by a distillation loss function for instances, bounding boxes, and classes. In addition, we adjust the supervision weights of different teacher networks to balance between the knowledge preservation for former classes and the adaption to new classes. Extensive experimental results on PASCAL 2012 SBD and COCO datasets show the effectiveness of the proposed method.

AAAI Conference 2021 Conference Paper

Generalized Zero-Shot Learning via Disentangled Representation

  • Xiangyu Li
  • Zhe Xu
  • Kun Wei
  • Cheng Deng

Zero-Shot Learning (ZSL) aims to recognize images belonging to unseen classes that are unavailable in the training process, while Generalized Zero-Shot Learning (GZSL) is a more realistic variant that both seen and unseen classes appear during testing. Most GZSL approaches achieve knowledge transfer based on the features of samples that inevitably contain information irrelevant to recognition, bringing negative influence for the performance. In this work, we propose a novel method, dubbed Disentangled-VAE, which aims to disentangle category-distilling factors and category-dispersing factors from visual as well as semantic features, respectively. In addition, a batch re-combining strategy on latent features is introduced to guide the disentanglement, encouraging the distilling latent features to be more discriminative for recognition. Extensive experiments demonstrate that our method outperforms the state-of-the-art approaches on four challenging benchmark datasets.

AAAI Conference 2021 Conference Paper

Incremental Embedding Learning via Zero-Shot Translation

  • Kun Wei
  • Cheng Deng
  • Xu Yang
  • Maosen Li

Modern deep learning methods have achieved great success in machine learning and computer vision by learning from a set of pre-defined datasets. However, these methods perform unsatisfactorily when applied to real-world situations. The reason is that learning new tasks leads the trained model to quickly forget the knowledge of old tasks, which is referred to as catastrophic forgetting. Current state-of-the-art incremental learning methods tackle the catastrophic forgetting problem in traditional classification networks and ignore the problem in embedding networks, which are the basic networks for image retrieval, face recognition, zero-shot learning, etc. Different from traditional incremental classification networks, the semantic gap between the embedding spaces of two adjacent tasks is the main challenge for embedding networks under the incremental learning setting. Thus, we propose a novel class-incremental method for embedding networks, named the zero-shot translation class-incremental method (ZSTCI), which leverages zero-shot translation to estimate the semantic gap without any exemplars. Then, we learn a unified representation for two adjacent tasks in the sequential learning process, which precisely captures the relationships of previous and current classes. In addition, ZSTCI can easily be combined with existing regularization-based incremental learning methods to further improve the performance of embedding networks. We conduct extensive experiments on CUB-200-2011 and CIFAR100, and the experimental results prove the effectiveness of our method. The code of our method has been released at https://github.com/Drkun/ZSTCI.

NeurIPS Conference 2020 Conference Paper

Adversarial Learning for Robust Deep Clustering

  • Xu Yang
  • Cheng Deng
  • Kun Wei
  • Junchi Yan
  • Wei Liu

Deep clustering integrates embedding and clustering together to obtain the optimal nonlinear embedding space, which is more effective in real-world scenarios compared with conventional clustering methods. However, the robustness of the clustering network is prone to being attenuated especially when it encounters an adversarial attack. A small perturbation in the embedding space will lead to diverse clustering results since the labels are absent. In this paper, we propose a robust deep clustering method based on adversarial learning. Specifically, we first attempt to define adversarial samples in the embedding space for the clustering network. Meanwhile, we devise an adversarial attack strategy to explore samples that easily fool the clustering layers but do not impact the performance of the deep embedding. We then provide a simple yet efficient defense algorithm to improve the robustness of the clustering network. Experimental results on two popular datasets show that the proposed adversarial learning method can significantly enhance the robustness and further improve the overall clustering performance. Particularly, the proposed method is generally applicable to multiple existing clustering frameworks to boost their robustness. The source code is available at https://github.com/xdxuyang/ALRDC.

IJCAI Conference 2020 Conference Paper

Lifelong Zero-Shot Learning

  • Kun Wei
  • Cheng Deng
  • Xu Yang

Zero-Shot Learning (ZSL) handles the problem that some testing classes never appear in the training set. Existing ZSL methods are designed for learning from a fixed training set and cannot capture and accumulate the knowledge of multiple training sets, making them infeasible for many real-world applications. In this paper, we propose a new ZSL setting, named Lifelong Zero-Shot Learning (LZSL), which aims to accumulate knowledge while learning from multiple datasets and to recognize unseen classes of all trained datasets. Besides, a novel method is proposed to realize LZSL, which effectively alleviates catastrophic forgetting in the continuous training process. Specifically, considering that these datasets contain different semantic embeddings, we utilize a Variational Auto-Encoder to obtain unified semantic representations. Then, we leverage a selective retraining strategy to preserve the trained weights of previous tasks and avoid negative transfer when fine-tuning the entire model. Finally, knowledge distillation is employed to transfer knowledge from previous training stages to the current stage. We also design the LZSL evaluation protocol and challenging benchmarks. Extensive experiments on these benchmarks indicate that our method tackles the LZSL problem effectively, while existing ZSL methods fail.