Arrow Research search

Author name cluster

Jiexi Yan

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
2 author rows

Possible papers (11)

AAAI 2026 · Conference Paper

Channel-masked Asymmetric Distribution Matching for Cross-Domain Generalized Dataset Distillation

  • Qi Liu
  • Chenghao Xu
  • Jiexi Yan
  • Guangtao Lyu
  • Erkun Yang
  • Guihai Chen
  • Yanhua Yang

Dataset distillation has achieved remarkable progress as an effective approach for data compression. However, real-world data often comes from diverse domains, leading to potential mismatches between the domains of synthesized images and those of the evaluation set. Existing methods primarily assume domain alignment between them, which limits their generalization ability in such cross-domain scenarios. In this paper, we aim to ensure that images synthesized from known domains maintain robust performance on unseen domains, and we propose a novel framework called Channel-masked Asymmetric Distribution Matching (CADM). During asymmetric distribution matching, domain-sensitive channels of real data are selectively masked at different layers to extract domain-invariant features that guide synthetic data optimization. To further improve synthetic data representation, we introduce a class-focused, domain-agnostic regularization to capture class-relevant knowledge while ignoring domain-specific information. Experiments show that our method produces domain-robust synthetic data and substantially improves generalization performance on unseen domains.
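
No code accompanies this listing; as a loose illustration of the channel-masking idea only (the sensitivity heuristic, function names, and mask ratio below are this editor's assumptions, not the authors' CADM implementation), a minimal NumPy sketch might mask the channels whose statistics shift most across source domains before matching distributions:

```python
import numpy as np

def channel_masked_matching_loss(real_feats_by_domain, syn_feats, mask_ratio=0.25):
    """Toy sketch of channel-masked distribution matching (hypothetical).

    real_feats_by_domain: list of (n_d, C) arrays, one per source domain.
    syn_feats: (m, C) array of features of the synthetic images.
    Channels whose mean shifts most across domains are treated as
    domain-sensitive and masked out before matching.
    """
    # Per-domain channel means: (D, C)
    domain_means = np.stack([f.mean(axis=0) for f in real_feats_by_domain])
    # Domain sensitivity of a channel = variance of its mean across domains.
    sensitivity = domain_means.var(axis=0)
    # Keep only the least domain-sensitive channels.
    C = sensitivity.shape[0]
    keep = np.argsort(sensitivity)[: C - int(mask_ratio * C)]
    # Simple mean-matching loss between masked real and synthetic features.
    real_mean = np.concatenate(real_feats_by_domain).mean(axis=0)[keep]
    syn_mean = syn_feats.mean(axis=0)[keep]
    return float(((real_mean - syn_mean) ** 2).sum())
```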

AAAI 2026 · Conference Paper

Decomposing Prompts, Composing Actions: A Multi-Granularity Prompting Approach for Incremental Action Learning

  • Xinyi Cheng
  • Chenghao Xu
  • Xi Wang
  • Jiexi Yan
  • Yanhua Yang

Continual learning for action recognition is a critical capability for next-generation Extended Reality (XR) systems. Yet it faces a severe real-world challenge: strict user privacy that prohibits data rehearsal. While recent prompt-based continual learning methods show promise, we argue that their core "flat", single-granularity design fundamentally misaligns with the complexity of human actions. This monolithic architecture fails to model the inherent hierarchical structure and overlooks standard action primitives shared across tasks, resulting in suboptimal performance and hindered knowledge transfer. To overcome this limitation, we propose DPCA, a novel spatio-temporal continual learning framework with multi-granularity adaptive prompting. DPCA learns three synergistic components to resolve this mismatch. First, the task-specific prompter employs a multi-granularity query system to capture the unique, compositional semantics of each action. Second, the task-agnostic prompter learns a globally shared vocabulary of "action primitives", providing a stable and generalizable knowledge base to mitigate catastrophic forgetting. Finally, we introduce a Dissimilarity Attention Rectification at each granularity level, leveraging a reverse attention mechanism to model class-agnostic background information and effectively alleviate overfitting. The synergy between these components enables robust model adaptation without requiring access to past data. Rigorous experiments on multiple large-scale benchmarks (including NTU RGB+D), under a strict rehearsal-free, few-shot protocol, confirm that DPCA establishes a new state-of-the-art. This advance paves the way for truly adaptive and privacy-respecting XR systems.
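
For a rough picture of what query-based prompt selection looks like in general (this mirrors L2P-style prompt pools and is not DPCA itself; the class name, pool sizes, and top-k value are hypothetical):

```python
import torch
import torch.nn.functional as F

class PromptPool(torch.nn.Module):
    """Toy query-key prompt selection, in the spirit of prompt-based
    continual learning (hypothetical; not the authors' DPCA code)."""
    def __init__(self, pool_size=10, prompt_len=5, dim=256, top_k=3):
        super().__init__()
        self.keys = torch.nn.Parameter(torch.randn(pool_size, dim))
        self.prompts = torch.nn.Parameter(torch.randn(pool_size, prompt_len, dim))
        self.top_k = top_k

    def forward(self, query):  # query: (B, dim), e.g. a frozen [CLS] feature
        # Cosine similarity between each query and every prompt key: (B, pool)
        sim = F.cosine_similarity(query.unsqueeze(1), self.keys.unsqueeze(0), dim=-1)
        idx = sim.topk(self.top_k, dim=1).indices        # (B, top_k)
        selected = self.prompts[idx]                     # (B, top_k, L, dim)
        return selected.flatten(1, 2)                    # tokens to prepend
```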

AAAI 2026 · Conference Paper

Editing Is a Bargaining Game: Balanced Knowledge Editing in Large Language Models

  • Chenghao Xu
  • Jiexi Yan
  • Muli Yang
  • Fen Fang
  • Huilin Chen
  • Cheng Deng

Large Language Models (LLMs) are prone to generating incorrect or outdated information, thereby necessitating efficient and precise mechanisms for knowledge updates. Existing knowledge editing approaches, however, often encounter conflicts between two competing objectives: maintaining existing knowledge (preservation) and incorporating new information (editing). During gradient-based optimization, these conflicting objectives can lead to imbalanced update directions, where one gradient dominates, ultimately resulting in suboptimal learning dynamics. To address this challenge, we propose a balanced knowledge editing framework inspired by Nash bargaining theory. Our method guides the optimization process toward a Pareto stationary point, ensuring an equilibrium solution wherein any deviation from the final state would degrade the overall performance with respect to both objectives. This guarantees optimality in preserving prior knowledge while integrating new information. We empirically validate the effectiveness of our approach across a range of evaluation metrics on standard benchmark datasets. Extensive experiments show that our method consistently outperforms state-of-the-art techniques, achieving a superior balance between knowledge preservation and update accuracy.
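
The balanced update can be pictured with a two-gradient toy. The sketch below searches for a direction whose inner products with the two objectives' gradients are equalized in a Nash-bargaining sense, in the spirit of Nash-MTL; the fixed-point scheme, damping, and all names are this editor's assumptions, not the paper's algorithm:

```python
import numpy as np

def nash_balanced_update(g_pres, g_edit, n_iters=100, damping=0.5):
    """Toy fixed-point solver for a two-objective bargaining direction
    d = a1*g1 + a2*g2 with a_i ~= 1 / (g_i . d)  (cf. Nash-MTL).
    Illustrative sketch only."""
    G = np.stack([g_pres, g_edit])                 # (2, dim)
    alpha = np.ones(2)
    for _ in range(n_iters):
        d = alpha @ G                              # current combined direction
        dots = np.clip(G @ d, 1e-8, None)          # g_i . d, kept positive
        alpha = (1 - damping) * alpha + damping / dots  # damped fixed point
    return alpha @ G                               # balanced update direction
```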

IJCAI 2025 · Conference Paper

Dual-Perspective United Transformer for Object Segmentation in Optical Remote Sensing Images

  • Yanguang Sun
  • Jiexi Yan
  • Jianjun Qian
  • Chunyan Xu
  • Jian Yang
  • Lei Luo

Automatically segmenting objects from optical remote sensing images (ORSIs) is an important task. Most existing models are based primarily on either convolutional or Transformer features, each offering distinct advantages. Exploiting both is a valuable line of research, but it presents several challenges, including the heterogeneity between the two types of features, high model complexity, and large parameter counts. These issues are often overlooked by existing ORSI methods, causing sub-optimal segmentation. To this end, we propose a novel Dual-Perspective United Transformer (DPU-Former) with a unique structure designed to simultaneously integrate long-range dependencies and spatial details. In particular, we design a global-local mixed attention, which captures diverse information through two perspectives and introduces a Fourier-space merging strategy to obviate deviations for efficient fusion. Furthermore, we present a gated linear feed-forward network to increase expressive capacity. Additionally, we construct a DPU-Former decoder to aggregate and strengthen features at different layers. Consequently, the DPU-Former model outperforms state-of-the-art methods on multiple datasets. Code: https://github.com/CSYSI/DPU-Former.
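
For intuition only, merging two feature maps in Fourier space might look like the following PyTorch sketch (a guess at the general idea, not the DPU-Former code; the simple spectral blend and the `alpha` weight are hypothetical):

```python
import torch

def fourier_merge(conv_feat, attn_feat, alpha=0.5):
    """Illustrative Fourier-space fusion of CNN and Transformer features.
    Both inputs: (B, C, H, W). Hypothetical sketch, not the paper's method."""
    Fc = torch.fft.rfft2(conv_feat, norm="ortho")   # spectrum of CNN branch
    Fa = torch.fft.rfft2(attn_feat, norm="ortho")   # spectrum of attention branch
    merged = alpha * Fc + (1 - alpha) * Fa          # blend the two spectra
    return torch.fft.irfft2(merged, s=conv_feat.shape[-2:], norm="ortho")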

NeurIPS 2025 · Conference Paper

Smooth and Flexible Camera Movement Synthesis via Temporal Masked Generative Modeling

  • Chenghao Xu
  • Guangtao Lyu
  • Jiexi Yan
  • Muli Yang
  • Cheng Deng

In dance performances, choreographers define the visual expression of movement, while cinematographers shape its final presentation through camera work. Consequently, the synthesis of camera movements informed by both music and dance has garnered increasing research interest. While recent advancements have led to notable progress in this area, existing methods predominantly operate in an offline manner: they require access to the entire dance sequence before generating the corresponding camera motions. This constraint renders them impractical for real-time applications, particularly in live stage performances, where immediate responsiveness is essential. To address this limitation, we introduce a more practical yet challenging task: online camera movement synthesis, in which camera trajectories must be generated using only the current and preceding segments of dance and music. In this paper, we propose TemMEGA (Temporal Masked Generative Modeling), a unified framework capable of handling both online and offline camera movement generation. TemMEGA consists of three key components. First, a camera tokenizer encodes camera motions as discrete tokens via a quantization scheme. Second, a consecutive memory encoder captures historical context by jointly modeling long- and short-term temporal dependencies across dance and music sequences. Finally, a temporal conditional masked transformer predicts future camera motions by leveraging masked token prediction. Extensive experimental evaluations demonstrate the effectiveness of TemMEGA, highlighting its superiority in both online and offline camera movement synthesis.
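
A generic masked-token training step looks roughly like the sketch below (illustrative only: `transformer` stands for any (B, T) -> (B, T, vocab) sequence model, and the masking probability is a placeholder, not TemMEGA's):

```python
import torch
import torch.nn.functional as F

def masked_token_loss(token_ids, transformer, mask_token_id, p=0.4):
    """Sketch of masked token prediction over discrete camera tokens.
    token_ids: (B, T) long tensor from some tokenizer. Hypothetical."""
    mask = torch.rand_like(token_ids, dtype=torch.float) < p   # which tokens to hide
    inputs = token_ids.masked_fill(mask, mask_token_id)        # replace with [MASK]
    logits = transformer(inputs)                               # (B, T, vocab)
    # Cross-entropy only on the masked positions.
    return F.cross_entropy(logits[mask], token_ids[mask])
```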

ICLR 2025 · Conference Paper

Towards Unified Human Motion-Language Understanding via Sparse Interpretable Characterization

  • Guangtao Lyu
  • Chenghao Xu
  • Jiexi Yan
  • Muli Yang
  • Cheng Deng 0002

Recently, the comprehensive understanding of human motion has been a prominent area of research due to its critical importance in many fields. However, existing methods often prioritize specific downstream tasks and only roughly align text and motion features within a CLIP-like framework. This results in a lack of rich semantic information, which restricts a more profound comprehension of human motions and ultimately leads to unsatisfactory performance. Therefore, we propose a novel motion-language representation paradigm that enhances the interpretability of motion representations by constructing a universal motion-language space, where both motion and text features are concretely lexicalized, ensuring that each element of a feature carries a specific semantic meaning. Specifically, we introduce a multi-phase strategy: Lexical Bottlenecked Masked Language Modeling enhances the language model's focus on the high-entropy words crucial for motion semantics; Contrastive Masked Motion Modeling strengthens motion feature extraction by capturing spatiotemporal dynamics directly from skeletal motion; Lexical Bottlenecked Masked Motion Modeling enables the motion model to capture the underlying semantic features of motion for improved cross-modal understanding; and Lexical Contrastive Motion-Language Pretraining aligns motion and text lexicon representations, ensuring enhanced cross-modal coherence. Comprehensive analyses and extensive experiments across multiple public datasets demonstrate that our model achieves state-of-the-art performance across various tasks and scenarios.
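
For a flavor of what a "lexicalized" feature means, here is a SPLADE-style sketch in which each output dimension scores one vocabulary word (an analogy chosen by this editor, not the paper's architecture; all names are hypothetical):

```python
import torch

def lexicalize(token_feats, vocab_proj):
    """Sketch of a lexical-bottleneck representation: every dimension of
    the output corresponds to one vocabulary word. Hypothetical.
    token_feats: (B, T, D); vocab_proj: torch.nn.Linear(D, vocab_size)."""
    logits = vocab_proj(token_feats)            # (B, T, V) per-token word scores
    weights = torch.log1p(torch.relu(logits))   # sparse, non-negative activations
    return weights.max(dim=1).values            # (B, V) sequence-level lexicon
```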

AAAI 2024 · Conference Paper

Asymmetric Mutual Alignment for Unsupervised Zero-Shot Sketch-Based Image Retrieval

  • Zhihui Yin
  • Jiexi Yan
  • Chenghao Xu
  • Cheng Deng

In recent years, many methods have been proposed to address the zero-shot sketch-based image retrieval (ZS-SBIR) task, which is a practical problem in many applications. However, in real-world scenarios, we cannot obtain training data with the same distribution as the test data, and the labels of the training data are usually unavailable. To tackle this issue, we focus on a new problem, namely unsupervised zero-shot sketch-based image retrieval (UZS-SBIR), where the available training data is unlabeled and the training and testing categories do not overlap. In this paper, we introduce a new asymmetric mutual alignment method (AMA) comprising a self-distillation module and a cross-modality mutual alignment module. First, we conduct self-distillation to extract feature embeddings from unlabeled data. Given the limited information available in the unsupervised setting, we employ the cross-modality mutual alignment module to further excavate the underlying intra-modality and inter-modality relationships in the unlabeled data, and we take full advantage of these correlations to align the feature embeddings of the image and sketch domains. Meanwhile, the feature representations are enhanced by the intra-modality clustering relations, leading to better generalization to unseen classes. Moreover, we adopt an asymmetric strategy to update the teacher and student networks. Extensive experimental results on several benchmark datasets demonstrate the superiority of our method.
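
The asymmetric teacher update can be illustrated with a plain exponential-moving-average step called with different momenta per branch (a minimal sketch; the momentum values and branch names are hypothetical, not the AMA code):

```python
import torch

@torch.no_grad()
def ema_update(student, teacher, momentum):
    """One EMA step: teacher <- m * teacher + (1 - m) * student."""
    for ps, pt in zip(student.parameters(), teacher.parameters()):
        pt.mul_(momentum).add_(ps, alpha=1.0 - momentum)

# An *asymmetric* schedule would call this with different momenta per branch,
# e.g. ema_update(image_student, image_teacher, momentum=0.999) but
#      ema_update(sketch_student, sketch_teacher, momentum=0.99).
```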

ICML 2024 · Conference Paper

Retrieval Across Any Domains via Large-scale Pre-trained Model

  • Jiexi Yan
  • Zhihui Yin
  • Chenghao Xu
  • Cheng Deng 0002
  • Heng Huang 0001

In order to enhance the generalization ability towards unseen domains, universal cross-domain image retrieval methods require a training dataset encompassing diverse domains, which is costly to assemble. Given this constraint, we introduce a novel problem of data-free adaptive cross-domain retrieval, eliminating the need for real images during training. Towards this goal, we propose a novel Text-driven Knowledge Integration (TKI) method, which exclusively utilizes a pre-trained vision-language model to implement an "aggregation after expansion" training strategy. Specifically, we extract diverse implicit domain-specific information through a set of learnable domain word vectors. Subsequently, a domain-agnostic universal projection, equipped with a non-Euclidean multi-layer perceptron, can be optimized using these assorted text descriptions through the text-proxied domain aggregation. Leveraging the cross-modal transferability phenomenon of the shared latent space, we can integrate the trained domain-agnostic universal projection with the pre-trained visual encoder to extract the features of the input image for retrieval during testing. Extensive experimental results on several benchmark datasets demonstrate the superiority of our method.
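
Learnable domain word vectors resemble CoOp-style context tuning: the sketch below prepends per-domain learnable tokens to class-name embeddings before a frozen text encoder (all sizes and names are this editor's assumptions, not the TKI code):

```python
import torch

class DomainWords(torch.nn.Module):
    """Sketch of learnable domain word vectors (CoOp-style; hypothetical)."""
    def __init__(self, n_domains=8, n_ctx=4, dim=512):
        super().__init__()
        # One small bank of learnable context tokens per domain.
        self.ctx = torch.nn.Parameter(torch.randn(n_domains, n_ctx, dim) * 0.02)

    def forward(self, class_emb):                    # class_emb: (K, L, dim)
        D, n_ctx, dim = self.ctx.shape
        K = class_emb.shape[0]
        ctx = self.ctx.unsqueeze(1).expand(D, K, n_ctx, dim)
        cls = class_emb.unsqueeze(0).expand(D, K, -1, dim)
        # (D, K, n_ctx + L, dim): feed each row to the frozen text encoder.
        return torch.cat([ctx, cls], dim=2)
```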

NeurIPS 2022 · Conference Paper

MetricFormer: A Unified Perspective of Correlation Exploring in Similarity Learning

  • Jiexi Yan
  • Erkun Yang
  • Cheng Deng
  • Heng Huang

Similarity learning can be significantly advanced by informative relationships among different samples and features. Current methods try to excavate these multiple correlations in different aspects but cannot integrate them into a unified framework. In this paper, we propose to consider the multiple correlations from a unified perspective and present a new method called MetricFormer, which can effectively capture and model the multiple correlations with an elaborate metric transformer. In MetricFormer, a feature decoupling block is adopted to learn an ensemble of distinct and diverse features with different discriminative characteristics. After that, we apply a batch-wise correlation block along the batch dimension of each mini-batch to implicitly explore sample relationships. Finally, a feature-wise correlation block is performed to discover the intrinsic structural pattern of the ensemble of features and obtain the aggregated feature embedding for similarity measuring. With these three kinds of transformer blocks, we can learn more representative features through the proposed MetricFormer. Moreover, our method can be flexibly integrated with any metric learning framework. Extensive experiments on three widely-used datasets demonstrate the superiority of our proposed method over state-of-the-art methods.
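
Attention "along the batch dimension" can be pictured as treating the mini-batch itself as a sequence, so each embedding attends to the other samples in the batch (a minimal sketch under that reading, not the MetricFormer block; the class name and sizes are hypothetical):

```python
import torch

class BatchwiseCorrelation(torch.nn.Module):
    """Sketch of self-attention over the batch dimension (hypothetical)."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):              # x: (B, dim) mini-batch of embeddings
        x = x.unsqueeze(0)             # treat the batch as one sequence: (1, B, dim)
        out, _ = self.attn(x, x, x)    # each sample attends to the others
        return out.squeeze(0)          # (B, dim), refined by sample relations
```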

IJCAI 2021 · Conference Paper

Domain-Smoothing Network for Zero-Shot Sketch-Based Image Retrieval

  • Zhipeng Wang
  • Hao Wang
  • Jiexi Yan
  • Aming Wu
  • Cheng Deng

Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is a novel cross-modal retrieval task in which abstract sketches are used as queries to retrieve natural images under a zero-shot scenario. Most existing methods regard ZS-SBIR as a traditional classification problem and employ a cross-entropy or triplet-based loss to achieve retrieval, neglecting both the domain gap between sketches and natural images and the large intra-class diversity of sketches. To this end, we propose a novel Domain-Smoothing Network (DSN) for ZS-SBIR. Specifically, a cross-modal contrastive method is proposed to learn generalized representations that smooth the domain gap by mining relations with additional augmented samples. Furthermore, a category-specific memory bank of sketch features is explored to reduce intra-class diversity in the sketch domain. Extensive experiments demonstrate that our approach notably outperforms state-of-the-art methods on both the Sketchy and TU-Berlin datasets.
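
A category-specific memory bank is typically an EMA table of per-class features; a minimal sketch follows (the class name, momentum, and update rule are assumptions, not the DSN implementation):

```python
import torch

class SketchMemoryBank:
    """Sketch of a category-specific memory bank of sketch features
    (hypothetical; one EMA slot per class)."""
    def __init__(self, n_classes, dim, momentum=0.9):
        self.bank = torch.zeros(n_classes, dim)
        self.m = momentum

    @torch.no_grad()
    def update(self, feats, labels):   # feats: (B, dim), labels: (B,)
        for f, y in zip(feats, labels):
            self.bank[y] = self.m * self.bank[y] + (1 - self.m) * f

    def prototype(self, labels):       # class centers to pull sketches toward
        return self.bank[labels]
```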

AAAI 2018 · Conference Paper

Dictionary Learning in Optimal Metric Space

  • Jiexi Yan
  • Cheng Deng
  • Xianglong Liu

Dictionary learning has been widely used in the machine learning field to address many real-world applications, such as classification and denoising. In recent years, many new dictionary learning methods have been proposed. Most are designed either for unsupervised problems without any prior information or for supervised problems with label information. In the real world, however, we can usually obtain only limited side information as prior knowledge rather than label information. Existing methods do not take this side information into account, let alone learn a good dictionary by exploiting it. To tackle this, we propose a new unified unsupervised model that naturally integrates metric learning into dictionary learning to fully utilize the side information. The proposed method updates the metric space and the dictionary adaptively and alternately, ensuring that an optimal metric space and dictionary are learned simultaneously. Besides, our method also deals well with high-dimensional data. Extensive experiments show the effectiveness of our proposed method, which achieves better performance in real-world image clustering applications.
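
Alternating the metric and dictionary updates can be sketched with least-squares steps plus a must-link nudge on the metric (a toy under strong simplifying assumptions made by this editor; not the paper's optimization):

```python
import numpy as np

def alternate_dict_metric(X, pairs, n_atoms=32, n_iters=10, lam=0.01):
    """Toy alternating optimization: with metric L (distances ||Lx - Ly||),
    update codes and dictionary by least squares, then shrink the distance
    between must-link pairs. Hypothetical sketch. X: (d, n)."""
    d, n = X.shape
    rng = np.random.default_rng(0)
    L = np.eye(d)                                   # learned metric transform
    D = rng.standard_normal((d, n_atoms))           # dictionary atoms
    for _ in range(n_iters):
        # Codes Z for fixed D and L: solve (L D) Z ~= (L X).
        Z = np.linalg.lstsq(L @ D, L @ X, rcond=None)[0]
        # Dictionary for fixed Z: solve D Z ~= X.
        D = np.linalg.lstsq(Z.T, X.T, rcond=None)[0].T
        # Gradient step shrinking distances of must-link side-info pairs.
        for i, j in pairs:
            diff = (X[:, i] - X[:, j])[:, None]
            L -= lam * L @ (diff @ diff.T)
    return D, L
```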