Arrow Research search

Author name cluster

Xu Yang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

59 papers
2 author rows

Possible papers

59

AAAI Conference 2026 Conference Paper

Adaptive-Learngene: Continual Expansion and Task-Aware Selection of Learngenes for Dynamic Environments

  • Shuxia Lin
  • Qiufeng Wang
  • Chang Liu
  • Xu Yang
  • Xin Geng

Pre-trained Vision Transformer (ViT) models have achieved impressive performance across various computer vision tasks. However, most existing pre-trained models are built on fixed datasets and lack the flexibility to incorporate new pre-training data. When additional data becomes available, previous models must typically be retrained on both old and new data, which is costly and impractical, especially in privacy-sensitive or resource-constrained environments. Moreover, direct fine-tuning on downstream tasks does not provide mechanisms to adapt to the specific data distributions of those tasks, and it only supports fixed model sizes. To address these challenges, we propose Adaptive-Learngene, a novel framework in which the ancestry model is trained solely on newly available data, and a new component, termed a learngene, is extracted and added to a global learngene pool that expands incrementally. This design enables a dynamically evolving pool of learngenes without requiring access to previous data. For each new downstream task, the Task-Adaptive Learngene Selector (TALS) retrieves a sparse combination of learngenes that best matches the data distribution of the target task. TALS requires only a small amount of downstream data for this selection, enabling descendant models of different sizes to be efficiently initialized and tailored to specific data distributions and resource constraints. Extensive experiments on diverse downstream tasks demonstrate that our method matches or outperforms existing approaches while offering superior scalability, adaptability, and efficiency in dynamic learning environments.

AAAI Conference 2026 Conference Paper

Diffusion-calibrated Continual Test-time Adaptation

  • Xu Yang
  • Moqi Li
  • Kun Wei

Continual Test-Time Domain Adaptation (CTTA) aims to adapt a pre-trained source model to a dynamically evolving target domain without requiring additional data collection or labeling efforts. A key challenge in this setting is to achieve rapid performance improvement in the current domain using unlabeled data, without impairing generalization to future domains in complex scenarios. To enhance the discriminative capability of the inference models, we propose a novel framework that integrates an external auxiliary generative model with a test-time adaptive method, leveraging cross-validation to identify reliable supervisory signals. Specifically, for each test instance, we utilize a diffusion module to generate a calibrated instance under the textual description of its predicted category. Based on the generated instance, we design a learning strategy with the following components: (1) the calibrated instance and its category are used to form a supervisory signal; (2) the predicted category of the calibrated instance is compared with that of the test instance to select reliable signals. For these generated and selected instances, adaptive weighting is applied during optimization to stabilize the category distribution and preserve prediction diversity. Finally, based on the inverse process of diffusion, we construct the negative instance of the generated instance and introduce robust contrastive learning to further calibrate model optimization. Extensive experiments demonstrate that our method achieves state-of-the-art performance across multiple benchmarks. Ablation studies further validate the effectiveness of each proposed component.

AAAI Conference 2026 Conference Paper

Efficient and Effective In-context Demonstration Selection with Coreset

  • Zihua Wang
  • Jiarui Wang
  • Haiyang Xu
  • Ming Yan
  • Fei Huang
  • Xu Yang
  • Xiu-Shen Wei
  • Siya Mi

In-context learning (ICL) has emerged as a powerful paradigm for Large Visual Language Models (LVLMs), enabling them to leverage a few examples directly from input contexts. However, the effectiveness of this approach is heavily reliant on the selection of demonstrations, a process that is NP-hard. Traditional strategies, including random sampling, similarity-based sampling, and InfoScore-based sampling, often lead to inefficiencies or suboptimal performance, struggling to balance both efficiency and effectiveness in demonstration selection. In this paper, we propose a novel demonstration selection framework named Coreset-based Dual Retrieval (CoDR). We show that samples within a diverse subset achieve a higher expected mutual information. To implement this, we introduce a cluster-pruning method to construct a diverse coreset that aligns more effectively with the query while maintaining diversity. Additionally, we develop a dual retrieval mechanism that enhances the selection process by achieving global demonstration selection while preserving efficiency. Experimental results demonstrate that our method significantly improves ICL performance compared to the existing strategies, providing a robust solution for effective and efficient demonstration selection.

AAAI Conference 2026 Conference Paper

Extracting Multimodal Learngene in CLIP: Unveiling the Multimodal Generalizable Knowledge

  • Ruiming Chen
  • Junming Yang
  • Shiyu Xia
  • Xu Yang
  • Xin Geng

CLIP (Contrastive Language-Image Pre-training) has attracted widespread attention for its multimodal generalizable knowledge, which is significant for downstream tasks. However, the computational overhead of its large parameter count and large-scale pre-training makes it challenging to pre-train CLIP at different scales. Learngene extracts generalizable components, termed learngene, from an ancestry model and initializes diverse descendant models with them. Previous Learngene paradigms fail to handle the generalizable knowledge in multimodal scenarios. In this paper, we put forward the idea of utilizing a multimodal block to extract the multimodal generalizable knowledge, which inspires us to propose MM-LG (Multimodal Learngene), a novel framework designed to extract and leverage generalizable components from CLIP. Specifically, we first establish multimodal and unimodal blocks to extract the multimodal and unimodal generalizable knowledge in a weighted-sum manner. Subsequently, we employ these components to numerically initialize descendant models of varying scales and modalities. Extensive experiments demonstrate MM-LG's effectiveness, which achieves performance gains over existing learngene approaches (e.g., +3.1% on Oxford-IIIT PET and +4.13% on Flickr30k) and comparable or superior results to the pre-training and fine-tuning paradigm (e.g., +1.9% on Oxford-IIIT PET and +3.65% on Flickr30k). Notably, MM-LG requires only around 25% of the parameter storage while reducing pre-training costs by around 2.8× for diverse model scales compared to the pre-training and fine-tuning paradigm, making it particularly suitable for efficient deployment across diverse downstream tasks.

AAAI Conference 2026 Conference Paper

GraphIC: A Graph-Based In-Context Example Retrieval Model for Multi-Step Reasoning

  • Jiale Fu
  • Yaqing Wang
  • Simeng Han
  • Jiaming Fan
  • Xu Yang

In-context learning (ICL) enhances large language models (LLMs) by incorporating demonstration examples, yet its effectiveness heavily depends on the quality of selected examples. Current methods typically use text embeddings to measure semantic similarity, which often introduces bias in multi-step reasoning tasks. This occurs because text embeddings contain irrelevant semantic information and lack deeper reasoning structures. To address this, we propose GraphIC, a graph-based retrieval model that leverages reasoning-aware representations and a specialized similarity metric for in-context example retrieval. GraphIC first constructs thought graphs—directed, node-attributed graphs that explicitly model reasoning steps and their dependencies—for candidate examples and queries. This approach filters out superficial semantics while preserving essential reasoning processes. Next, GraphIC retrieves examples using a novel similarity metric tailored for these graphs, capturing sequential reasoning patterns and asymmetry between examples. Comprehensive evaluations across mathematical reasoning, code generation, and logical reasoning tasks demonstrate that GraphIC outperforms 10 baseline methods. Our results highlight the importance of reasoning-aware retrieval in ICL, offering a robust solution for enhancing LLM performance in multi-step reasoning scenarios.

AAAI Conference 2026 Conference Paper

Learngene: Inheritable ‘Genes’ in Intelligent Agents (Abstract Reprint)

  • Fu Feng
  • Jing Wang
  • Xu Yang
  • Xin Geng

Biological intelligence has driven significant progress in artificial intelligence (AI), but a critical gap remains: biological systems inherit innate abilities from genes, with brains initialized by blueprints refined over 3.5 billion years of evolution, while machines rely heavily on inefficient, data-driven learning from scratch. This gap arises from the lack of a genetic mechanism in machines to transfer and accumulate inheritable knowledge across generations. To bridge this gap, we propose learngenes, network fragments that act as inheritable 'genes' for machines. Unlike conventional knowledge transfer methods, learngenes enable efficient and universal knowledge transfer by selectively encapsulating task-agnostic knowledge. To facilitate the transfer and accumulation of task-agnostic knowledge across generations, we introduce Genetic Reinforcement Learning (GRL), a framework that simulates the learning and evolution of organisms in intelligent agents following Lamarckian principles. Through GRL, we identify learngenes as network fragments within agents' policy networks, equipping newborn agents with innate abilities for rapid adaptation to novel tasks. We demonstrate the advantages of learngene-based knowledge transfer over evolution-based search and traditional pre-trained models, and show how learngenes evolve through the accumulation of task-agnostic knowledge. Overall, this work establishes a novel paradigm for knowledge transfer and model initialization in AI, offering new possibilities for more adaptive, efficient, and scalable learning systems.

AAAI Conference 2026 Conference Paper

Mix-QSAM2: Mixed-Precision Quantization for High Fidelity Segmentation in Resource Constrained Scenarios

  • Yuzhe Duan
  • Xuanxuan Ren
  • Guizhe Dong
  • Xu Yang
  • Yanhua Yang

The Segment Anything Model 2 (SAM2) has established a new benchmark for high-precision image and video segmentation, offering significant potential for a wide range of computer vision tasks. Despite its impressive performance, the model's substantial computational and memory requirements present a significant obstacle to its practical deployment on resource-constrained devices. In this paper, we introduce a novel framework for optimizing SAM2 through two synergistic, importance-driven strategies: quantization and memory management. Specifically, an Importance-driven Mixed-Precision Quantization scheme, which analyzes the sensitivity of each layer using a Weight-Activation Importance Score, is employed to enable a targeted bit-width assignment, preserving model accuracy by keeping critical layers at higher precision. Then, the Selective Importance-driven Synthesis (SIS) mechanism is proposed to address the inefficient accumulation of redundant data in the memory bank. SIS intelligently compresses the memory by identifying the most contextually similar historical frames and synthesizing them into a single, representative feature, thereby preserving informational diversity while enhancing temporal context understanding. Extensive experiments on the COCO and SA-V benchmarks validate our approach, showing that our optimized model consistently outperforms state-of-the-art quantization methods. Our work provides a principled framework for the co-design of quantization and dynamic memory management, offering a practical path toward deploying powerful video segmentation models in real-world applications.
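The importance-driven bit-width assignment described above can be sketched as a greedy budgeted promotion: every layer starts at low precision, and the most sensitive layers are promoted to higher precision until an average-bit budget is spent. This is an illustrative sketch only; the layer names and scores below are hypothetical stand-ins, and the paper's actual Weight-Activation Importance Score is not reproduced here.

```python
def assign_bits(importance, avg_budget, choices=(4, 8)):
    # Start every layer at the lowest precision, then promote layers to
    # the higher bit-width in order of importance until the average
    # bit budget across all layers is exhausted.
    low, high = min(choices), max(choices)
    bits = {name: low for name in importance}
    spare = (avg_budget - low) * len(importance)
    for name in sorted(importance, key=importance.get, reverse=True):
        if spare >= high - low:
            bits[name] = high
            spare -= high - low
    return bits

# Hypothetical per-layer sensitivity scores (not the paper's values).
scores = {"enc.0": 0.9, "enc.1": 0.2, "dec.0": 0.7, "dec.1": 0.1}
bits = assign_bits(scores, avg_budget=6)  # average 6 bits over 4 layers
```

With an average budget of 6 bits over the 4/8-bit choices, exactly half of the layers, the two most important ones, end up at 8 bits.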

AAAI Conference 2026 Conference Paper

Self-Indexing KVCache: Predicting Sparse Attention from Compressed Keys

  • Xu Yang
  • Jiapeng Zhang
  • Dongyang Zhao
  • Guo Chen
  • Zhuo Tang

The KV cache in self-attention has emerged as a major bottleneck in long-context and large-batch inference for LLMs. Existing approaches often treat sparsity prediction and compression as separate modules—relying on auxiliary index structures to select relevant tokens, and on complex quantization schemes to reduce memory usage. This fragmented design introduces redundant overhead and limits scalability. In this paper, we propose a novel paradigm: treating the compressed key representation not merely as storage, but as a self-indexing structure that directly enables efficient sparse attention. By designing a sign-based 1-bit vector quantization (VQ) scheme, our method unifies compression and retrieval in a single, hardware-friendly format. This approach eliminates the need for external indices or learning-based predictors, offering a lightweight yet robust solution for memory-constrained inference. All components are designed to be hardware-efficient and easy to implement. By implementing custom CUDA kernels, our method integrates seamlessly with FlashAttention, minimizing additional runtime and memory overhead. Experimental results demonstrate that our approach delivers both effectiveness and efficiency.
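The "compression doubles as the index" idea above can be illustrated with a minimal NumPy sketch: keys are stored as 1-bit sign vectors, and attention sparsity is predicted by scoring each cached token's sign agreement with the query (a cheap proxy for the dot product) before exact attention runs on the survivors. This is an assumed reading of the abstract, not the paper's CUDA kernels, and the sign-agreement scoring rule is an illustrative stand-in.

```python
import numpy as np

def pack_signs(keys):
    # 1-bit compression: keep only the sign of each key dimension.
    return keys > 0

def select_sparse(query, sign_cache, k):
    # Score tokens by sign agreement with the query (a cheap proxy
    # for the dot product) and keep the top-k for exact attention.
    agree = (query > 0) == sign_cache      # (num_tokens, d) booleans
    scores = agree.sum(axis=1)             # popcount per cached token
    return np.argsort(scores)[-k:]

rng = np.random.default_rng(0)
keys = rng.normal(size=(1024, 64))         # full-precision cached keys
q = rng.normal(size=64)

sign_cache = pack_signs(keys)              # compressed store doubles as index
topk = select_sparse(q, sign_cache, k=32)  # retrieval straight from the bits
logits = keys[topk] @ q                    # exact attention on the subset only
```

No auxiliary index exists here: the same boolean array serves as both the compressed representation and the retrieval structure, which is the unification the abstract describes.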

AAAI Conference 2026 Conference Paper

S²Flow: Towards Fast and Authentic Training-Free High-Resolution Video Generation

  • Chaoqun Wang
  • Shaobo Min
  • Xu Yang

Rectified flow models have shown strong potential in high-fidelity video generation, yet extending them to high-resolution remains challenging due to the high cost of full attention and error accumulation in the ODE-solving process. In this paper, we propose S²Flow, a training-free framework that enables efficient and authentic high-resolution video generation by jointly exploring Flow-guided Sparse attention and Second-order ODE solution. Specifically, S²Flow exploits and transfers the semantic and structural information from the low-resolution flow trajectory to guide the high-resolution flow in two aspects. First, S²Flow dynamically captures the sparse patterns of the spatio-temporal attention maps from low-resolution videos to construct localized 3D windows, enabling efficient window attention in high-resolution inference. This can significantly reduce redundant computation while preserving contextual dependencies. Second, S²Flow adopts a second-order ODE solver based on Taylor expansion, where the high-order derivative is approximated via central difference from the low-resolution flow, facilitating accurate high-resolution denoising. Extensive experiments on the VBench dataset demonstrate that S²Flow outperforms prior methods in both visual quality and inference speed, enabling 4× acceleration on 2560×1536 video generation.
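The second-order solver idea can be shown on a toy 1-D flow ODE: a Taylor step x + h·v + (h²/2)·dv/dt, with dv/dt estimated by central difference from cheap auxiliary velocity evaluations (the role the low-resolution trajectory plays in the abstract). The velocity field below is hypothetical and chosen only because its ODE has a closed-form solution to compare against; this is not the paper's video model.

```python
import numpy as np

def v(x, t):
    # Hypothetical 1-D velocity field standing in for the flow model.
    return -x + np.sin(t)

def x_exact(t):
    # Closed-form solution of dx/dt = -x + sin(t) with x(0) = 1.
    return 1.5 * np.exp(-t) + 0.5 * (np.sin(t) - np.cos(t))

def central_diff(v_prev, v_next, dt):
    # dv/dt estimated from two extra velocity evaluations, playing the
    # role of the cheap low-resolution trajectory in the abstract.
    return (v_next - v_prev) / (2 * dt)

def second_order_step(x, t, h, dv_dt):
    # Taylor step: x + h*v + (h^2 / 2) * dv/dt.
    return x + h * v(x, t) + 0.5 * h**2 * dv_dt

h, dt = 0.1, 0.05
x0, t0 = 1.0, 0.0
dv_dt = central_diff(v(x_exact(t0 - dt), t0 - dt),
                     v(x_exact(t0 + dt), t0 + dt), dt)
euler = x0 + h * v(x0, t0)                    # first-order step
taylor = second_order_step(x0, t0, h, dv_dt)  # second-order step
```

On this toy problem the Taylor step lands roughly an order of magnitude closer to the exact solution than the first-order Euler step, which is the accuracy gain the solver is after.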

AAAI Conference 2026 Conference Paper

Towards Robust Edge Model Adaptation via Elastic Architecture Search

  • Xianhang Chu
  • Xu Yang
  • Kun Wei
  • Xi Wang

Continual test-time adaptation (CTTA) enables online model adjustment under dynamic distribution shifts in real-world environments. However, most existing CTTA frameworks adopt fixed model architectures, lacking the structural flexibility required for deployment across heterogeneous edge devices with varying computational capacities. To address this, we propose an elastic framework for edge CTTA that performs resource-aware dynamic model search based on a pre-trained binary Supernet. This enables architectural flexibility by generating personalized models tailored to the resource constraints of different edge devices. Considering the evolving distribution of unlabeled data on edge devices during deployment, we introduce a pluggable lightweight fine-tuning mechanism. By inserting low-rank adapters into the frozen binary backbone, the model enables continual self-supervised adaptation with minimal computational overhead. In addition, we propose a structure-aware knowledge reflux mechanism that transfers the adaptation experience from fine-tuned edge models back into the Supernet. By distilling knowledge into structurally aligned Supernet paths, future architecture search is improved without requiring retraining. Experiments on multiple benchmarks validate that our method achieves state-of-the-art performance while significantly reducing resource consumption, with re-searched models after knowledge reflux showing further improvements.

NeurIPS Conference 2025 Conference Paper

Adversarial Graph Fusion for Incomplete Multi-view Semi-supervised Learning with Tensorial Imputation

  • Zhangqi Jiang
  • Tingjin Luo
  • Xu Yang
  • Xinyan Liang

View missing remains a significant challenge in graph-based multi-view semi-supervised learning, hindering its real-world application. To address this issue, traditional methods introduce a missing indicator matrix and focus on mining partial structure among existing samples in each view for label propagation (LP). However, we argue that these disregarded missing samples sometimes induce discontinuous local structures, i.e., sub-clusters, breaking the fundamental smoothness assumption in LP. Consequently, such a Sub-Cluster Problem (SCP) would distort graph fusion and degrade classification performance. To alleviate SCP, we propose a novel incomplete multi-view semi-supervised learning method, termed AGF-TI. Firstly, we design an adversarial graph fusion scheme to learn a robust consensus graph against the distorted local structure through a min-max framework. By stacking all similarity matrices into a tensor, we further recover the incomplete structure from the high-order consistency information based on the low-rank tensor learning. Additionally, the anchor-based strategy is incorporated to reduce the computational complexity. An efficient alternative optimization algorithm combining a reduced gradient descent method is developed to solve the formulated objective, with theoretical convergence guarantees. Extensive experimental results on various datasets validate the superiority of our proposed AGF-TI as compared to state-of-the-art methods. Code is available at https://github.com/ZhangqiJiang07/AGF_TI.

NeurIPS Conference 2025 Conference Paper

Data Selection Matters: Towards Robust Instruction Tuning of Large Multimodal Models

  • Xu Yang
  • Chen Liu
  • Ying Wei

Selecting a compact subset of visual instruction-following data has emerged as an effective way to align large multimodal models with human intentions while avoiding the high cost of full-dataset training. Yet we observe that both full-data training and existing state-of-the-art data selection methods tend to inherit underlying dataset biases such as position bias and spurious correlations, leading to biased model behaviors. To address this issue, we introduce ARDS, a robustness-aware targeted visual instruction-selection framework that explicitly mitigates these weaknesses, sidestepping the need for access to downstream data or time-consuming gradient computation. Specifically, we first identify the worst-case evaluation subgroups through visual and textual task-specific perturbations. The robust training mixture is then constructed by prioritizing samples that are semantically closer to these subgroups in a rich multimodal embedding space. Extensive experiments demonstrate that ARDS substantially boosts both robustness and data efficiency for visual instruction tuning. We also showcase that the robust mixtures produced with a smaller model transfer effectively to larger architectures. Our code and the selected datasets, which we show transfer across models, are available at https://github.com/xyang583/ARDS.

NeurIPS Conference 2025 Conference Paper

Dual-Space Semantic Synergy Distillation for Continual Learning of Unlabeled Streams

  • Donghao Sun
  • Xi Wang
  • Xu Yang
  • Kun Wei
  • Cheng Deng

Continual learning from unlabeled data streams while effectively combating catastrophic forgetting poses an intractable challenge. Traditional methods predominantly rely on visual clustering techniques to generate pseudo labels, which are frequently plagued by problems such as noise and suboptimal quality, profoundly degrading model evolution. To surmount these obstacles, we introduce an innovative approach that synergistically combines both visual and textual information to generate dual-space hybrid pseudo labels for reliable model continual evolution. Specifically, by harnessing the capabilities of large multimodal models, we initially generate generalizable text descriptions for a few representative samples. These descriptions then undergo a 'Coarse to Fine' refinement process to capture the subtle nuances between different data points, significantly enhancing the semantic accuracy of the descriptions. Simultaneously, a novel cross-modal hybrid approach seamlessly integrates these fine-grained textual descriptions with visual features, thereby creating a more robust and reliable supervisory signal. Finally, such descriptions are employed to alleviate the catastrophic forgetting issue via a semantic alignment distillation, which capitalizes on the stability inherent in language knowledge to effectively prevent the model from forgetting previously learned information. Comprehensive experiments conducted on a variety of benchmarks demonstrate that our proposed method attains state-of-the-art performance, and ablation studies further substantiate the effectiveness and superiority of the proposed method.

NeurIPS Conference 2025 Conference Paper

FlowPrune: Accelerating Attention Flow Calculation by Pruning Flow Network

  • Shuo Xu
  • Yu Chen
  • Shuxia Lin
  • Xin Geng
  • Xu Yang

The Transformer architecture serves as the foundation of modern AI systems, powering recent advances in Large Language Models (LLMs) and Large Multimodal Models (LMMs). Central to these models, attention mechanisms capture contextual dependencies via token interactions. Beyond inference, attention has been widely adopted for interpretability, offering insights into model behavior. Among interpretability techniques, attention flow --- which traces global information transfer across layers --- provides a more comprehensive perspective than single-layer attention maps. However, computing attention flow is computationally intensive due to the high complexity of max-flow algorithms. To address this challenge, we introduce FlowPrune, a novel framework that accelerates attention flow analysis by pruning the attention graph before applying max-flow computations. FlowPrune uses the Max-Flow Min-Cut Theorem and two structural properties of the Transformer architecture to identify and eliminate non-critical graph regions. It comprises two components: Edge Pruning, which removes insignificant attention edges, and Layer Compression, which discards layers with minimal contributions to the flow. We conduct extensive experiments on LLaMA and LLaVA to evaluate the robustness and effectiveness of FlowPrune. Our results show that FlowPrune achieves high agreement with the original attention flow in both absolute and relative error metrics, as well as in identifying influential input tokens. Finally, case studies in both NLP and vision domains demonstrate that FlowPrune produces interpretability outcomes consistent with the original attention flow, validating its practical utility. The code for this paper is publicly available.
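The setting above can be made concrete with a small sketch: attention maps define a layered capacity graph, attention flow from an output token to an input token is a max-flow computation, and Edge Pruning drops low-weight edges before the (expensive) max-flow runs. The toy attention maps, node naming, and fixed pruning threshold below are illustrative assumptions; FlowPrune's actual pruning criteria and its Layer Compression component are not reproduced here.

```python
from collections import deque

def max_flow(cap, s, t):
    """Edmonds-Karp max-flow on a dict-of-dicts capacity graph."""
    # Add zero-capacity reverse edges so residual updates are uniform.
    for u in list(cap):
        for w in list(cap[u]):
            cap.setdefault(w, {}).setdefault(u, 0.0)
    flow = 0.0
    while True:
        # BFS for a shortest augmenting path in the residual graph.
        parent, q = {s: None}, deque([s])
        while q:
            u = q.popleft()
            for w, c in cap[u].items():
                if c > 1e-12 and w not in parent:
                    parent[w] = u
                    q.append(w)
        if t not in parent:
            return flow
        # Trace the path, push the bottleneck, update residuals.
        path, node = [], t
        while parent[node] is not None:
            path.append((parent[node], node))
            node = parent[node]
        b = min(cap[u][w] for u, w in path)
        for u, w in path:
            cap[u][w] -= b
            cap[w][u] += b
        flow += b

def attention_graph(layers, tau=0.0):
    """Layered graph from per-layer attention maps, with edges below
    the pruning threshold tau removed before any max-flow runs."""
    cap = {}
    for l, A in enumerate(layers):  # nodes are (layer_index, token_index)
        for i, row in enumerate(A):
            cap[(l + 1, i)] = {(l, j): w for j, w in enumerate(row) if w >= tau}
    return cap

# Two toy attention layers over three tokens (each row sums to 1).
layers = [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
          [[0.6, 0.3, 0.1], [0.2, 0.6, 0.2], [0.1, 0.1, 0.8]]]

# Flow from top-layer token 0 down to input token 0.
full = max_flow(attention_graph(layers), (2, 0), (0, 0))
pruned = max_flow(attention_graph(layers, tau=0.15), (2, 0), (0, 0))
```

Pruning shrinks the graph the max-flow algorithm must search, and on this example the pruned flow stays close to the full one while never exceeding it (removing capacity can only lower a max-flow).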

AAAI Conference 2025 Conference Paper

Inheriting Generalized Learngene for Efficient Knowledge Transfer across Multiple Tasks

  • Yuankun Zu
  • Shiyu Xia
  • Xu Yang
  • Qiufeng Wang
  • Han Zhang
  • Xin Geng

In practical applications, it is often necessary to transfer knowledge from large pretrained models to small ones with various architectures for tackling different tasks. The recently proposed Learngene framework first extracts a compact module, termed the learngene, from a large well-trained model; the learngene is then used to build descendant models for handling diverse tasks. In this paper, we aim to explore extracting and inheriting a learngene that generalizes across different model architectures and tasks, a setting that remains understudied in previous works. Inspired by the existing observations that large kernel convolutional neural networks (CNNs) exhibit significant generalization potential across various architectures and tasks, we propose a novel two-stage Learngene method termed CLKG (Convolutional Learngene for Knowledge Generalization), which inherits convolutional kernels containing generalized knowledge as the learngene to build diverse models for multiple tasks. Specifically, we construct an auxiliary model comprised of small kernels and train it through dense feature distillation to inherit the feature extraction ability from large kernel CNNs. After distillation, we select certain kernels from the auxiliary model as the learngene based on three criteria: direct kernel extraction, priority to edge kernels, and continuous kernel selection. Subsequently, we adapt the learngene according to the width of the descendant models and use it to initialize the backbone part of descendant models. Experiments on diverse vision tasks such as image classification, object detection and semantic segmentation demonstrate the superiority of CLKG. For example, compared with training from scratch, it brings a 2.89% improvement on VOC12+SBD and achieves better results with around 2× less training data and fewer training epochs. Furthermore, compared to the knowledge distillation method, CLKG significantly reduces negative transfer on certain datasets, e.g., achieving a 1.88% performance improvement on the NAO dataset despite domain differences.

NeurIPS Conference 2025 Conference Paper

KRIS-Bench: Benchmarking Next-Level Intelligent Image Editing Models

  • Yongliang Wu
  • Zonghui Li
  • Xinting Hu
  • Xinyu Ye
  • Xianfang Zeng
  • Gang Yu
  • Wenbo Zhu
  • Bernt Schiele

Recent advances in multi-modal generative models have enabled significant progress in instruction-based image editing. However, while these models produce visually plausible outputs, their capacity for knowledge-based reasoning editing tasks remains under-explored. In this paper, we introduce KRIS-Bench (Knowledge-based Reasoning in Image-editing Systems Benchmark), a diagnostic benchmark designed to assess models through a cognitively informed lens. Drawing from educational theory, KRIS-Bench categorizes editing tasks across three foundational knowledge types: Factual, Conceptual, and Procedural. Based on this taxonomy, we design 22 representative tasks spanning 7 reasoning dimensions and release 1,267 high-quality annotated editing instances. To support fine-grained evaluation, we propose a comprehensive protocol that incorporates a novel Knowledge Plausibility metric, enhanced by knowledge hints and calibrated through human studies. Empirical results on nine state-of-the-art models reveal significant gaps in reasoning performance, highlighting the need for knowledge-centric benchmarks to advance the development of intelligent image editing systems.

IJCAI Conference 2025 Conference Paper

Outstanding Orthodontist: No More Artifactual Teeth in Talking Face

  • Zibo Su
  • Ziqi Zhang
  • Kun Wei
  • Xu Yang
  • Cheng Deng

Audio-driven talking face synthesis (TFS) enables the creation of realistic speaking videos by combining a single facial image with a speech audio clip. Unlike other facial features that naturally deform during speech, teeth represent unique rigid structures whose shape and size should remain constant throughout the video sequence. However, current methods often produce temporal inconsistencies and artifacts in the teeth region, resulting in a less realistic appearance of the generated videos. To address this, we propose OrthoNet, a plug-and-play framework designed to eliminate unrealistic teeth effects in audio-driven TFS. Our method introduces a Detail-oriented Teeth Aligner module, designed to preserve teeth details and adapt to their shape. It works with a Memory-guided Teeth Stabilizer that integrates a long-term memory bank for global teeth structure and a short-term memory module for local temporal dynamics. Through this framework, OrthoNet acts like an orthodontist for existing Audio2Video methods, ensuring that teeth maintain natural rigidity and temporal consistency even under varying degrees of teeth occlusion. Extensive experiments demonstrate that our method makes the teeth in generated videos appear more natural during speech, significantly enhancing the temporal consistency and structural stability of audio-driven video generation.

IJCAI Conference 2025 Conference Paper

Q-MiniSAM2: A Quantization-based Benchmark for Resource-Efficient Video Segmentation

  • Xuanxuan Ren
  • Xiangyu Li
  • Kun Wei
  • Xu Yang
  • Yanhua Yang

Segment Anything Model 2 (SAM2) is a new-generation, high-precision model for image and video segmentation, offering extensive application prospects across numerous computer vision fields. However, as a large-scale model, its huge memory demands and expansive computing costs pose challenges for practical deployment. This paper presents Q-MiniSAM2, an efficient Quantization-based segmentation benchmark tailored to optimize SAM2 by Minimizing memory consumption and accelerating computations. We begin with applying Post-Training Quantization (PTQ) to SAM2, requiring only a relatively small dataset for network calibration, thereby eliminating the need for retraining. Building upon PTQ, we further introduce a Hierarchy-based Video Quantization method to enhance the model’s capacity to capture video semantics and temporal correlations across different time scales. Furthermore, we observe that SAM2’s memory overhead is predominantly concentrated on processing historical frames, and the redundant cross-attention computations significantly increase memory and computational costs, since frames separated by short time intervals change imperceptibly. To tackle this issue, an Adaptive Mutual-KV mechanism is proposed to mitigate excessive cross-attention by leveraging inter-frame similarities. Comprehensive experiments demonstrate that the proposed approach achieves superior performance compared to state-of-the-art methods, underscoring its potential for efficient and scalable video segmentation.

NeurIPS Conference 2025 Conference Paper

R&D-Agent-Quant: A Multi-Agent Framework for Data-Centric Factors and Model Joint Optimization

  • Yuante Li
  • Xu Yang
  • Xiao Yang
  • Xisen Wang
  • Weiqing Liu
  • Jiang Bian

Financial markets pose fundamental challenges for asset return prediction due to their high dimensionality, non-stationarity, and persistent volatility. Despite advances in large language models and multi-agent systems, current quantitative research pipelines suffer from limited automation, weak interpretability, and fragmented coordination across key components such as factor mining and model innovation. In this paper, we propose R&D-Agent for Quantitative Finance, in short R&D-Agent(Q), the first data-centric multi-agent framework designed to automate the full-stack research and development of quantitative strategies via coordinated factor-model co-optimization. R&D-Agent(Q) decomposes the quant process into two iterative stages: a Research stage that dynamically sets goal-aligned prompts, formulates hypotheses based on domain priors, and maps them to concrete tasks, and a Development stage that employs a code-generation agent, Co-STEER, to implement task-specific code, which is then executed in real-market backtests. The two stages are connected through a feedback stage that thoroughly evaluates experimental outcomes and informs subsequent iterations, with a multi-armed bandit scheduler for adaptive direction selection. Empirically, R&D-Agent(Q) achieves up to 2× higher annualized returns than classical factor libraries using 70% fewer factors, and outperforms state-of-the-art deep time-series models on real markets. Its joint factor–model optimization delivers a strong balance between predictive accuracy and strategy robustness. Our code is available at: https://github.com/microsoft/RD-Agent.

NeurIPS Conference 2025 Conference Paper

RAPID Hand: Robust, Affordable, Perception-Integrated, Dexterous Manipulation Platform for Embodied Intelligence

  • Zhaoliang Wan
  • Zetong Bi
  • Zida Zhou
  • Hao Ren
  • Yiming Zeng
  • Yihan Li
  • Lu Qi
  • Xu Yang

This paper addresses the scarcity of low-cost but high-dexterity platforms for collecting real-world multi-fingered robot manipulation data towards generalist robot autonomy. To achieve this, we propose the RAPID Hand, a co-optimized hardware and software platform in which the compact 20-DoF hand, robust whole-hand perception, and high-DoF teleoperation interface are jointly designed. Specifically, RAPID Hand adopts a compact and practical hand ontology and a hardware-level perception framework that stably integrates wrist-mounted vision, fingertip tactile sensing, and proprioception with sub-7 ms latency and spatial alignment. Collecting high-quality demonstrations on high-DoF hands is challenging, as existing teleoperation methods struggle with precision and stability on complex multi-fingered systems. We address this by co-optimizing hand design, perception integration, and the teleoperation interface through a universal actuation scheme, custom perception electronics, and two retargeting constraints. We evaluate the platform's hardware, perception, and teleoperation interface. Training a diffusion policy on the collected data shows superior performance over prior works, validating the system's capability for reliable, high-quality data collection. The platform is constructed from low-cost, off-the-shelf components and will be made public to ensure reproducibility and ease of adoption.

IJCAI Conference 2025 Conference Paper

Tackling Long-Tailed Data Challenges in Spiking Neural Networks via Heterogeneous Knowledge Distillation

  • Moqi Li
  • Xu Yang
  • Cheng Deng

Spiking Neural Networks (SNNs), inspired by the behavior of biological neurons, have gained significant research interest for resource-constrained edge devices and neuromorphic hardware due to their use of binary spike signals for inter-unit communication with low power consumption. However, the absence of research on applying spiking neural networks to long-tailed data has severely limited the deployment and application of these emerging networks in practical scenarios. To fill this gap, this paper proposes a long-tail learning framework based on spiking neural networks, named LT-SpikingFormer, to alleviate the distribution bias between head and tail classes. LT-SpikingFormer adopts a widely trained Convolutional Neural Network to construct a heterogeneous knowledge distillation paradigm, offering balanced and reliable prior knowledge. Moreover, a multi-granularity hierarchical feature distillation objective is proposed that leverages cross-layer local features and global network predictions for refined knowledge distillation, specifically improving performance on the tail classes. Extensive experimental results demonstrate that our method performs well on several benchmark datasets.

AAAI Conference 2025 Conference Paper

Unlearning Concepts in Diffusion Model via Concept Domain Correction and Concept Preserving Gradient

  • Yongliang Wu
  • Shiji Zhou
  • Mingzhuo Yang
  • Lianzhe Wang
  • Heng Chang
  • Wenbo Zhu
  • Xinting Hu
  • Xiao Zhou

Text-to-image diffusion models have achieved remarkable success in generating photorealistic images. However, the inclusion of sensitive information during pre-training poses significant risks. Machine Unlearning (MU) offers a promising solution to eliminate sensitive concepts from these models. Despite its potential, existing MU methods face two main challenges: 1) limited generalization, where concept erasure is effective only within the unlearned set, failing to prevent sensitive concept generation from out-of-set prompts; and 2) utility degradation, where removing target concepts significantly impacts the model's overall performance. To address these issues, we propose a novel concept domain correction framework named DoCo (Domain Correction). By aligning the output domains of sensitive and anchor concepts through adversarial training, our approach ensures comprehensive unlearning of target concepts. Additionally, we introduce a concept-preserving gradient surgery technique that mitigates conflicting gradient components, thereby preserving the model's utility while unlearning specific concepts. Extensive experiments across various instances, styles, and offensive concepts demonstrate the effectiveness of our method in unlearning targeted concepts with minimal impact on related concepts, outperforming previous approaches even for out-of-distribution prompts.

AAAI Conference 2025 Conference Paper

Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark

  • Yongliang Wu
  • Wenbo Zhu
  • Jiawang Cao
  • Yi Lu
  • Bozheng Li
  • Weiheng Chi
  • Zihan Qiu
  • Lirian Su

The demand for producing short-form videos for sharing on social media platforms has experienced significant growth in recent times. Despite notable advancements in the fields of video summarization and highlight detection, which can create partially usable short films from raw videos, these approaches are often domain-specific and require an in-depth understanding of real-world video content. To tackle this predicament, we propose Repurpose-10K, an extensive dataset comprising over 10,000 videos with more than 120,000 annotated clips aimed at resolving the video long-to-short task. Recognizing the inherent constraints posed by untrained human annotators, which can result in inaccurate annotations for repurposed videos, we propose a two-stage solution to obtain annotations from real-world user-generated content. Furthermore, we offer a baseline model to address this challenging task by integrating audio, visual, and caption aspects through a cross-modal fusion and alignment framework. We aspire for our work to ignite groundbreaking research in the lesser-explored realms of video repurposing.

JBHI Journal 2024 Journal Article

An Automated Analysis Framework for Epidemiological Survey on COVID-19

  • Zichao Lin
  • Xialv Lin
  • Xu Yang

For a long time, the prevention and control of COVID-19 has received significant attention. A crucial aspect of controlling the disease's spread is the epidemiological survey of patients and the subsequent analysis of epidemiological survey reports (case reports). However, current mainstream analysis approaches are all manual. This manual method is time-consuming and manpower-intensive. This paper designs an automated visual epidemiological survey analysis (AVESA) framework for the epidemiological survey on COVID-19. AVESA designs a deep neural network for information extraction from case reports and automatically constructs an epidemiological knowledge graph based on predefined patterns. Moreover, a multi-dimensional knowledge reasoning model is developed for conducting knowledge reasoning over the complete COVID-19 epidemiological knowledge graph. In the entity extraction sub-task and multi-task extraction sub-task, AVESA achieved F1 scores of 85.12% and 92.29% respectively on the constructed dataset, significantly outperforming the standalone information extraction models. In full-graph computing, all three experiments align closely with manual analysis standards. In the risk analysis experiment, the weighted PageRank algorithm showed an average improvement of 11.21% in Top_Recall_n% over the standard PageRank algorithm. In the community detection experiment, the weighted Louvain algorithm showed a mere 4.34% community difference rate compared to manual analysis.
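
The weighted PageRank used in the risk analysis above differs from the standard algorithm only in how a node's rank is distributed: in proportion to edge weights rather than uniformly. A minimal sketch under assumed data (the toy contact graph, damping factor, and function name are illustrative, not AVESA's actual code):

```python
# Hedged sketch of weighted PageRank over a contact graph: mass flows
# along edges in proportion to contact weight instead of uniformly.

def weighted_pagerank(edges, damping=0.85, iters=50):
    """edges: {src: {dst: weight}}; returns a score per node."""
    nodes = set(edges)
    for dsts in edges.values():
        nodes.update(dsts)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for src, dsts in edges.items():
            total = sum(dsts.values())
            for dst, w in dsts.items():
                # Weighted (not uniform) distribution of src's rank.
                nxt[dst] += damping * rank[src] * w / total
        rank = nxt
    return rank

# Toy case graph: patient A had heavy contact with B, light contact with C.
scores = weighted_pagerank({"A": {"B": 5.0, "C": 1.0},
                            "B": {"A": 1.0},
                            "C": {"A": 1.0}})
```

With uniform (unweighted) distribution B and C would tie; the weighted variant ranks the heavy contact B higher, which is the behavior the risk-analysis experiment exploits.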

NeurIPS Conference 2024 Conference Paper

BPQP: A Differentiable Convex Optimization Framework for Efficient End-to-End Learning

  • Jianming Pan
  • Zeqi Ye
  • Xiao Yang
  • Xu Yang
  • Weiqing Liu
  • Lewen Wang
  • Jiang Bian

Data-driven decision-making processes increasingly utilize end-to-end learnable deep neural networks to render final decisions. Sometimes, the output of the forward functions in certain layers is determined by the solutions to mathematical optimization problems, leading to the emergence of differentiable optimization layers that permit gradient back-propagation. However, real-world scenarios often involve large-scale datasets and numerous constraints, presenting significant challenges. Current methods for differentiating optimization problems typically rely on implicit differentiation, which necessitates costly computations on the Jacobian matrices, resulting in low efficiency. In this paper, we introduce BPQP, a differentiable convex optimization framework designed for efficient end-to-end learning. To enhance efficiency, we reformulate the backward pass as a simplified and decoupled quadratic programming problem by leveraging the structural properties of the Karush–Kuhn–Tucker (KKT) matrix. This reformulation enables the use of first-order optimization algorithms in calculating the backward pass gradients, allowing our framework to potentially utilize any state-of-the-art solver. As solver technologies evolve, BPQP can continuously adapt and improve its efficiency. Extensive experiments on both simulated and real-world datasets demonstrate that BPQP achieves a significant improvement in efficiency—typically an order of magnitude faster in overall execution time compared to other differentiable optimization layers. Our results not only highlight the efficiency gains of BPQP but also underscore its superiority over differential optimization layer baselines.
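
The core observation in BPQP, that the backward pass of a differentiable optimization layer can reuse the KKT structure instead of forming Jacobians, can be illustrated on a toy equality-constrained QP. This is a hand-rolled sketch under simplifying assumptions (tiny dense problem, equality constraints only, illustrative data), not the paper's solver:

```python
# Toy illustration of the BPQP idea: for an equality-constrained QP,
# both the forward solve and the backward gradient reduce to linear
# systems built from the SAME KKT matrix, so no Jacobian is formed.

def solve_linear(A, rhs):
    """Tiny Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

# QP: minimize 0.5*x'Qx + p'x  subject to  a'x = b
Q = [[2.0, 0.0], [0.0, 2.0]]
p = [-2.0, -4.0]
a = [1.0, 1.0]
b = 1.0

# Shared KKT matrix: [[Q, a], [a', 0]]
KKT = [Q[0] + [a[0]], Q[1] + [a[1]], a + [0.0]]

# Forward pass: primal solution and multiplier from one KKT solve.
x1, x2, lam = solve_linear(KKT, [-p[0], -p[1], b])

# Backward pass: d(x*)/db comes from the same matrix, new right-hand side.
dx1_db, dx2_db, _ = solve_linear(KKT, [0.0, 0.0, 1.0])
```

Because the backward system shares the KKT matrix with the forward one, any first-order or factorization-based solver applied forward can be reused backward, which is the efficiency lever the abstract describes.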

AAAI Conference 2024 Conference Paper

Building Variable-Sized Models via Learngene Pool

  • Boyu Shi
  • Shiyu Xia
  • Xu Yang
  • Haokun Chen
  • Zhiqiang Kou
  • Xin Geng

Recently, Stitchable Neural Networks (SN-Net) is proposed to stitch some pre-trained networks for quickly building numerous networks with different complexity and performance trade-offs. In this way, the burdens of designing or training the variable-sized networks, which can be used in application scenarios with diverse resource constraints, are alleviated. However, SN-Net still faces a few challenges. 1) Stitching from multiple independently pre-trained anchors introduces high storage resource consumption. 2) SN-Net faces challenges to build smaller models for low resource constraints. 3). SN-Net uses an unlearned initialization method for stitch layers, limiting the final performance. To overcome these challenges, motivated by the recently proposed Learngene framework, we propose a novel method called Learngene Pool. Briefly, Learngene distills the critical knowledge from a large pre-trained model into a small part (termed as learngene) and then expands this small part into a few variable-sized models. In our proposed method, we distill one pre-trained large model into multiple small models whose network blocks are used as learngene instances to construct the learngene pool. Since only one large model is used, we do not need to store more large models as SN-Net and after distilling, smaller learngene instances can be created to build small models to satisfy low resource constraints. We also insert learnable transformation matrices between the instances to stitch them into variable-sized models to improve the performance of these models. Exhaustive experiments have been implemented and the results validate the effectiveness of the proposed Learngene Pool compared with SN-Net.

IROS Conference 2024 Conference Paper

C³P-VoxelMap: Compact, Cumulative and Coalescible Probabilistic Voxel Mapping

  • Xu Yang
  • Wenhao Li
  • Qijie Ge
  • Lulu Suo
  • Weijie Tang
  • Zhengyu Wei
  • Longxiang Huang
  • Bo Wang

This work presents a compact, cumulative, and coalescible probabilistic voxel mapping method to enhance performance, accuracy, and memory efficiency in LiDAR odometry. Probabilistic voxel mapping requires storing past point clouds and re-iterating over them to update the uncertainty at every iteration, which consumes substantial memory and CPU cycles. To solve this problem, we propose a two-fold strategy. First, we introduce a compact point-free representation for probabilistic voxels and derive a cumulative update of the planar uncertainty without caching original point clouds. Our voxel structure only keeps track of a predetermined set of statistics for the points that lie inside it. This method reduces the runtime complexity from O(MN) to O(N) and the space complexity from O(N) to O(1), where M is the number of iterations and N is the number of points. Second, to further minimize memory usage and enhance mapping accuracy, we provide a strategy to dynamically merge voxels associated with the same physical planes by taking advantage of the geometric features of the real world. Rather than constantly scanning for these coalescible voxels at every iteration, our merging strategy accumulates voxels in a locality-sensitive hash and triggers merging lazily. On-demand merging reduces the memory footprint with minimal computational overhead and improves localization accuracy thanks to cross-voxel denoising. Experiments show 20% higher accuracy, 20% faster runtime, and 70% lower memory consumption than the state of the art.
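
The point-free cumulative representation above amounts to keeping running moments instead of the points themselves. A minimal sketch (field names and the toy points are illustrative, not the paper's actual voxel struct):

```python
# Hedged sketch of cumulative, point-free voxel statistics: store only
# the count, coordinate sums, and sums of outer products, so the plane
# mean/covariance can be updated in O(1) space without caching points.

class CumulativeVoxel:
    def __init__(self):
        self.n = 0
        self.s = [0.0, 0.0, 0.0]                 # sum of points
        self.ss = [[0.0] * 3 for _ in range(3)]  # sum of outer products

    def insert(self, p):
        self.n += 1
        for i in range(3):
            self.s[i] += p[i]
            for j in range(3):
                self.ss[i][j] += p[i] * p[j]

    def mean_cov(self):
        m = [si / self.n for si in self.s]
        # cov = E[p p'] - m m', recovered from the running moments alone
        cov = [[self.ss[i][j] / self.n - m[i] * m[j] for j in range(3)]
               for i in range(3)]
        return m, cov

voxel = CumulativeVoxel()
for pt in [(1.0, 0.0, 0.1), (2.0, 1.0, 0.1), (3.0, -1.0, 0.1)]:
    voxel.insert(pt)
mean, cov = voxel.mean_cov()
```

Because every inserted point is folded into fixed-size sums, re-iterating the point cloud at each odometry iteration is unnecessary, which is where the O(MN) to O(N) runtime reduction comes from.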

NeurIPS Conference 2024 Conference Paper

Cluster-Learngene: Inheriting Adaptive Clusters for Vision Transformers

  • Qiufeng Wang
  • Xu Yang
  • Fu Feng
  • Jing Wang
  • Xin Geng

In recent years, the merging of vast datasets with powerful computational resources has led to the emergence of large pre-trained models in the field of deep learning. However, common practice often overgeneralizes the applicability of these models, overlooking task-specific resource constraints. To mitigate this issue, we propose Cluster-Learngene, which effectively clusters critical internal modules from a large ancestry model and then inherits them to initialize descendant models of elastic scales. Specifically, based on the density characteristics of attention heads, our method adaptively clusters attention heads of each layer and position-wise feed-forward networks (FFNs) in the ancestry model as the learngene. Moreover, we introduce priority weight-sharing and learnable parameter transformations that expand the learngene to initialize descendant models of elastic scales. Through extensive experimentation, we demonstrate that Cluster-Learngene not only is more efficient than other initialization methods but also customizes models of elastic scales according to downstream task resources.

AAAI Conference 2024 Conference Paper

Dynamic Reactive Spiking Graph Neural Network

  • Han Zhao
  • Xu Yang
  • Cheng Deng
  • Junchi Yan

Spiking Graph Neural Networks are emerging tools for analyzing graph data with low energy consumption and a degree of biological fidelity. Existing methods directly integrate same-reactive spiking neurons into graph neural networks for processing propagated graphs. However, such same-reactive neurons lack the biological functionality of the brain's dynamic-reactive ones, limiting the model's expressive power. Meanwhile, only limited long-range neighbor information can be extracted from a few-step propagated graph, restricting the discriminability of graph spiking embeddings. Inspired by the dynamic cognition in the brain, we propose a Dynamic Reactive Spiking Graph Neural Network that enhances the model's expressive ability with higher biological fidelity. Specifically, we design dynamic reactive spiking neurons to process spiking graph inputs, which have unique optimizable thresholds to spontaneously explore dynamic reactive states between neurons. Moreover, discriminative graph positional spikes are learned and integrated adaptively into spiking outputs through our neurons, thereby exploring long-range neighbors more thoroughly. Finally, with the dynamic reactive mechanism and learnable positional integration, we obtain a powerful model with high biological fidelity and low energy consumption. Experiments on various domain-related datasets demonstrate the effectiveness of our model. Our code is available at https://github.com/hzhao98/DRSGNN.
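
The "optimizable threshold" ingredient above can be illustrated with a plain leaky integrate-and-fire neuron whose firing threshold is a tunable parameter. This is a generic LIF sketch with illustrative constants, not the paper's neuron model:

```python
# Minimal leaky integrate-and-fire (LIF) sketch: a per-neuron tunable
# threshold makes the same input stream produce different spike trains,
# i.e., different "reactive states".

def lif_step(v, inp, threshold, decay=0.5):
    """One timestep: leak, integrate, spike if v crosses the threshold."""
    v = decay * v + inp
    spike = 1 if v >= threshold else 0
    if spike:
        v = 0.0  # hard reset after firing
    return v, spike

def run(inputs, threshold):
    v, spikes = 0.0, []
    for x in inputs:
        v, s = lif_step(v, x, threshold)
        spikes.append(s)
    return spikes

inputs = [0.6, 0.6, 0.6, 0.6]
# Different thresholds give different reactive behavior on the same input.
eager = run(inputs, threshold=0.8)
sluggish = run(inputs, threshold=1.2)
```

In a trainable network the hard spike is paired with a surrogate gradient so the threshold can be optimized end to end; this sketch only shows the forward dynamics.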

IJCAI Conference 2024 Conference Paper

Exploring Learngene via Stage-wise Weight Sharing for Initializing Variable-sized Models

  • Shi-Yu Xia
  • Wenxuan Zhu
  • Xu Yang
  • Xin Geng

In practice, we usually need to build variable-sized models adapting to diverse resource constraints in different application scenarios, where weight initialization is an important step prior to training. The recently introduced Learngene framework first learns one compact part, termed the learngene, from a large well-trained model, after which the learngene is expanded to initialize variable-sized models. In this paper, we start from analysing the importance of guidance for the expansion of well-trained learngene layers, inspiring the design of a simple but highly effective Learngene approach termed SWS (Stage-wise Weight Sharing), where both learngene layers and their learning process critically contribute to providing knowledge and guidance for initializing models at varying scales. Specifically, to learn learngene layers, we build an auxiliary model comprising multiple stages where the layer weights in each stage are shared, after which we train it through distillation. Subsequently, we expand these learngene layers containing stage information at their corresponding stage to initialize models of variable depths. Extensive experiments on ImageNet-1K demonstrate that SWS achieves consistently better performance compared to many models trained from scratch, while reducing around 6.6× total training costs. In some cases, SWS performs better after only 1 epoch of tuning. When initializing variable-sized models adapting to different resource constraints, SWS achieves better results while reducing around 20× the parameters stored to initialize these models and around 10× the pre-training costs, in contrast to the pre-training and fine-tuning approach.

NeurIPS Conference 2024 Conference Paper

Initializing Variable-sized Vision Transformers from Learngene with Learnable Transformation

  • Shiyu Xia
  • Yuankun Zu
  • Xu Yang
  • Xin Geng

In practical scenarios, it is necessary to build variable-sized models to accommodate diverse resource constraints, where weight initialization serves as a crucial step preceding training. The recently introduced Learngene framework first learns one compact module, termed the learngene, from a large well-trained model, and then transforms the learngene to initialize variable-sized models. However, existing Learngene methods provide limited guidance on transforming the learngene: transformation mechanisms are manually designed and generally lack a learnable component. Moreover, these methods only consider transforming the learngene along the depth dimension, thus constraining its flexibility. Motivated by these concerns, we propose a novel and effective Learngene approach termed LeTs (Learnable Transformation), where we transform the learngene module along both the width and depth dimensions with a set of learnable matrices for flexible variable-sized model initialization. Specifically, we construct an auxiliary model comprising the compact learngene module and learnable transformation matrices, enabling both components to be trained. To meet the varying size requirements of target models, we select specific parameters from well-trained transformation matrices to adaptively transform the learngene, guided by strategies such as continuous selection and magnitude-wise selection. Extensive experiments on ImageNet-1K demonstrate that Des-Nets initialized via LeTs outperform those trained from scratch for 100 epochs after only 1 epoch of tuning. When transferring to downstream image classification tasks, LeTs achieves better results, outperforming from-scratch training after about 10 epochs within a 300-epoch training schedule.

NeurIPS Conference 2024 Conference Paper

Lever LM: Configuring In-Context Sequence to Lever Large Vision Language Models

  • Xu Yang
  • Yingzhe Peng
  • Haoxuan Ma
  • Shuo Xu
  • Chi Zhang
  • Yucheng Han
  • Hanwang Zhang

As Archimedes famously said, "Give me a lever long enough and a fulcrum on which to place it, and I shall move the world". In this study, we propose to use a tiny Language Model (LM), e.g., a Transformer with 67M parameters, to lever much larger Vision-Language Models (LVLMs) with 9B parameters. Specifically, we use this tiny Lever-LM to configure effective in-context demonstration (ICD) sequences to improve the In-Context Learning (ICL) performance of LVLMs. Previous studies show that diverse ICD configurations, such as the selection and ordering of the demonstrations, heavily affect the ICL performance, highlighting the significance of configuring effective ICD sequences. Motivated by this, and by re-considering the process of configuring an ICD sequence, we find that it mirrors human sentence composition, and further assume that effective ICD configurations may contain internal statistical patterns that can be captured by Lever-LM. A dataset with effective ICD sequences is then constructed to train Lever-LM. After training, given novel queries, new ICD sequences are configured by the trained Lever-LM to solve vision-language tasks through ICL. Experiments show that these ICD sequences can improve the ICL performance of two LVLMs compared with strong baselines in Visual Question Answering and Image Captioning, validating that Lever-LM can indeed capture the statistical patterns for levering LVLMs. The code is available at https://anonymous.4open.science/r/Lever-LM-604A/.

NeurIPS Conference 2024 Conference Paper

Linearly Decomposing and Recomposing Vision Transformers for Diverse-Scale Models

  • Shuxia Lin
  • Miaosen Zhang
  • Ruiming Chen
  • Qiufeng Wang
  • Xu Yang
  • Xin Geng

Vision Transformers (ViTs) are widely used in a variety of applications, yet they usually have a fixed architecture that may not match the varying computational resources of different deployment environments. Thus, it is necessary to adapt ViT architectures to devices with diverse computational overheads to achieve an accuracy-efficiency trade-off. This concept is consistent with the motivation behind Learngene. To achieve this, inspired by polynomial decomposition in calculus, where a function can be approximated by linearly combining several basic components, we propose to linearly decompose the ViT model into a set of components called learngenes during element-wise training. These learngenes can then be recomposed into differently scaled, pre-initialized models to satisfy different computational resource constraints. Such a decomposition-recomposition strategy provides an economical and flexible approach to generating different scales of ViT models for different deployment scenarios. Compared to model compression or training from scratch, which require repeated training on large datasets for diverse-scale models, this strategy reduces computational costs since it requires training on a large dataset only once. Extensive experiments validate the effectiveness of our method: ViTs can be decomposed, and the decomposed learngenes can be recomposed into diverse-scale ViTs that achieve comparable or better performance compared to traditional model compression and pre-training methods. The code for our experiments is available in the supplemental material.

NeurIPS Conference 2024 Conference Paper

LIVE: Learnable In-Context Vector for Visual Question Answering

  • Yingzhe Peng
  • Chenduo Hao
  • Xinting Hu
  • Jiawei Peng
  • Xin Geng
  • Xu Yang

As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, applying ICL usually faces two major challenges: 1) using more ICDs will largely increase the inference time and 2) the performance is sensitive to the selection of ICDs. These challenges are further exacerbated in LMMs due to the integration of multiple data types and the combinational complexity of multimodal ICDs. Recently, to address these challenges, some NLP studies introduce non-learnable In-Context Vectors (ICVs) which extract useful task information from ICDs into a single vector and then insert it into the LLM to help solve the corresponding task. However, although useful in simple NLP tasks, these non-learnable methods fail to handle complex multimodal tasks like Visual Question Answering (VQA). In this study, we propose Learnable In-Context Vector (LIVE) to distill essential task information from demonstrations, improving ICL performance in LMMs. Experiments show that LIVE can significantly reduce computational costs while enhancing accuracy in VQA tasks compared to traditional ICL and other non-learnable ICV methods.
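
The non-learnable ICV baseline that LIVE improves on can be sketched in a few lines: summarize the hidden-state shift induced by demonstrations into one vector, then add it at inference instead of prefixing ICDs. Hidden states here are toy lists and the function names are illustrative; LIVE additionally makes the vector and its weighting learnable, which this sketch omits:

```python
# Hedged sketch of a non-learnable in-context vector (ICV): the average
# difference between demo-conditioned and plain hidden states becomes a
# reusable task vector, avoiding long ICD prefixes at inference time.

def extract_icv(states_with_icds, states_without):
    """Average difference between demo-conditioned and plain states."""
    n = len(states_with_icds)
    dim = len(states_with_icds[0])
    return [sum(w[i] - o[i] for w, o in zip(states_with_icds, states_without)) / n
            for i in range(dim)]

def apply_icv(hidden, icv, alpha=1.0):
    """Shift a query's hidden state by the task vector."""
    return [h + alpha * v for h, v in zip(hidden, icv)]

# Toy 2-d hidden states from two probe inputs, with and without ICDs.
with_icds = [[1.0, 2.0], [3.0, 0.0]]
without = [[0.0, 1.0], [2.0, 1.0]]
icv = extract_icv(with_icds, without)
shifted = apply_icv([0.5, 0.5], icv)
```

Because the vector is computed once and reused, inference cost no longer grows with the number of demonstrations, which is the efficiency argument in the abstract.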

NeurIPS Conference 2024 Conference Paper

Mixture of Adversarial LoRAs: Boosting Robust Generalization in Meta-Tuning

  • Xu Yang
  • Chen Liu
  • Ying Wei

This paper introduces AMT, an Adversarial Meta-Tuning methodology, to boost the robust generalization of pre-trained models in out-of-domain (OOD) few-shot learning. To address the challenge of transferring knowledge from source domains to unseen target domains, we construct the robust LoRAPool by meta-tuning LoRAs with dual perturbations applied not only to the inputs but also to the singular values and vectors of the weight matrices at various robustness levels. On top of that, we introduce a simple yet effective test-time merging mechanism to dynamically merge discriminative LoRAs for test-time task customization. Extensive evaluations demonstrate that AMT yields significant improvements, up to 12.92% in clean generalization and up to 49.72% in adversarial generalization, over previous state-of-the-art methods across a diverse range of OOD few-shot image classification tasks on three benchmarks, confirming the effectiveness of our approach in boosting the robust generalization of pre-trained models. Our code is available at https://github.com/xyang583/AMT.
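
The test-time merging mechanism above operates on low-rank adapters. As a generic sketch (the fixed merge weights and helper names are illustrative assumptions, standing in for AMT's dynamically computed task-customization weights):

```python
# Hedged sketch of merging a pool of LoRA adapters: each LoRA is a
# low-rank pair (B, A), and the merged weight update is a weighted sum
# of the B_i @ A_i products.

def matmul(B, A):
    return [[sum(B[i][k] * A[k][j] for k in range(len(A)))
             for j in range(len(A[0]))] for i in range(len(B))]

def merge_loras(loras, weights):
    """Merged delta-W = sum_i w_i * (B_i @ A_i)."""
    rows, cols = len(loras[0][0]), len(loras[0][1][0])
    delta = [[0.0] * cols for _ in range(rows)]
    for (B, A), w in zip(loras, weights):
        BA = matmul(B, A)
        for i in range(rows):
            for j in range(cols):
                delta[i][j] += w * BA[i][j]
    return delta

# Two rank-1 LoRAs on a 2x2 weight, merged with weights favoring the first.
lora1 = ([[1.0], [0.0]], [[1.0, 0.0]])  # B1 (2x1), A1 (1x2)
lora2 = ([[0.0], [1.0]], [[0.0, 1.0]])
delta_w = merge_loras([lora1, lora2], [0.8, 0.2])
```

Because merging happens in weight space, the customized model runs at the same inference cost as the base model regardless of how many adapters were pooled.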

IJCAI Conference 2024 Conference Paper

Navigating Continual Test-time Adaptation with Symbiosis Knowledge

  • Xu Yang
  • Moqi Li
  • Jie Yin
  • Kun Wei
  • Cheng Deng

Continual test-time domain adaptation seeks to adapt the source pre-trained model to a continually changing target domain without incurring additional data acquisition or labeling costs. Unfortunately, existing mainstream methods may result in a detrimental cycle. This is attributed to noisy pseudo-labels caused by the domain shift, which immediately negatively impacts the model's knowledge. The long-term accumulation of these negative effects exacerbates the model's difficulty in generalizing to future domain shifts and contributes to catastrophic forgetting. To address these challenges, this paper introduces a Dual-stream Network that independently optimizes different parameters in each stream to capture symbiotic knowledge from continual domains, thereby ensuring generalization while enhancing instantaneous discrimination. Furthermore, to prevent catastrophic forgetting, a weighted soft parameter alignment method is designed to leverage knowledge from the source model. Finally, efforts are made to calibrate and explore reliable supervision signals to mitigate instantaneous negative optimization. These include label calibration with prior knowledge, label selection using self-adaptive confidence thresholds, and a soft-weighted contrastive module for capturing potential semantics. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on several benchmark datasets.

ICML Conference 2024 Conference Paper

One Meta-tuned Transformer is What You Need for Few-shot Learning

  • Xu Yang
  • Huaxiu Yao
  • Ying Wei

Pre-trained vision transformers have revolutionized few-shot image classification, and it has recently been demonstrated that the previous common practice of meta-learning in synergy with these pre-trained transformers still holds significance. In this work, we design a new framework centered exclusively on self-attention, called MetaFormer, which extends vision transformers beyond patch token interactions to encompass relationships between samples and tasks simultaneously, further advancing their downstream task performance. Leveraging the intrinsic property of ViTs in handling local patch relationships, we propose Masked Sample Attention (MSA) to efficiently embed sample relationships into the network, where an adaptive mask is attached for enhancing task-specific feature consistency and providing flexibility in switching between few-shot learning setups. To encapsulate task relationships while filtering out background noise, Patch-grained Task Attention (PTA) is designed to maintain a dynamic knowledge pool consolidating diverse patterns from historical tasks. MetaFormer demonstrates coherence and compatibility with off-the-shelf pre-trained vision transformers and shows significant improvements in both inductive and transductive few-shot learning scenarios, outperforming state-of-the-art methods by up to 8.77% and 6.25% on 12 in-domain and 10 cross-domain datasets, respectively.

AAAI Conference 2024 Conference Paper

Transformer as Linear Expansion of Learngene

  • Shiyu Xia
  • Miaosen Zhang
  • Xu Yang
  • Ruiming Chen
  • Haokun Chen
  • Xin Geng

We propose expanding a shared Transformer module to produce and initialize Transformers of varying depths, enabling adaptation to diverse resource constraints. Drawing an analogy to genetic expansibility, we term such a module a learngene. To identify the expansion mechanism, we delve into the relationship between a layer's position and its corresponding weight values, and find that a linear function appropriately approximates this relationship. Building on this insight, we present Transformer as Linear Expansion of learnGene (TLEG), a novel approach for flexibly producing and initializing Transformers of diverse depths. Specifically, to learn the learngene, we first construct an auxiliary Transformer linearly expanded from the learngene, and then train it through soft distillation. Subsequently, we can produce and initialize Transformers of varying depths via linearly expanding the well-trained learngene, thereby supporting diverse downstream scenarios. Extensive experiments on ImageNet-1K demonstrate that TLEG achieves comparable or better performance than many individual models trained from scratch, while reducing around 2× training cost. When transferring to several downstream classification datasets, TLEG surpasses existing initialization methods by a large margin (e.g., +6.87% on iNat 2019 and +7.66% on CIFAR-100). In situations where we need to produce models of varying depths for different resource constraints, TLEG achieves comparable results while reducing by around 19× the parameters stored to initialize these models and by around 5× the pre-training costs, in contrast to the pre-training and fine-tuning approach. When transferring a fixed set of parameters to initialize different models, TLEG presents better flexibility and competitive performance while reducing by around 2.9× the parameters stored to initialize, compared to the pre-training approach.
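
The linear-expansion rule above can be sketched directly: the weights at layer position l are approximated as a linear function of l, so a depth-L model is materialized from just two learngene tensors. Matrices here are toy lists and the normalization of the position is an illustrative assumption, not the paper's exact parameterization:

```python
# Hedged sketch of linear learngene expansion: W_l = base + t_l * delta,
# with t_l the normalized layer position, so one (base, delta) pair can
# initialize Transformers of any depth.

def expand_learngene(base, delta, depth):
    """Initialize `depth` layers as a linear function of layer position."""
    layers = []
    for l in range(depth):
        t = l / (depth - 1) if depth > 1 else 0.0
        layers.append([[b + t * d for b, d in zip(brow, drow)]
                       for brow, drow in zip(base, delta)])
    return layers

base = [[1.0, 0.0], [0.0, 1.0]]
delta = [[0.5, 0.0], [0.0, -0.5]]

# The same learngene pair initializes models of different depths.
shallow = expand_learngene(base, delta, 4)
deep = expand_learngene(base, delta, 12)
```

Storing only base and delta, rather than one weight set per depth, is where the roughly 19× reduction in stored initialization parameters comes from.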

AILAW Journal 2023 Journal Article

Automating petition classification in Brazil’s legal system: a two-step deep learning approach

  • Yuri D. R. Costa
  • Hugo Oliveira
  • Valério Nogueira
  • Lucas Massa
  • Xu Yang
  • Adriano Barbosa
  • Krerley Oliveira
  • Thales Vieira

Abstract Automated classification of legal documents has been the subject of extensive research in recent years. However, this is still a challenging task for long documents, since it is difficult for a model to identify the most relevant information for classification. In this paper, we propose a two-stage supervised learning approach for the classification of petitions, a type of legal document that requests a court order. The proposed approach is based on a word-level encoder–decoder Seq2Seq deep neural network, such as a Bidirectional Long Short-Term Memory (BiLSTM) or a Bidirectional Encoder Representations from Transformers (BERT) model, and a document-level Support Vector Machine classifier. To address the challenges posed by lengthy legal documents, the approach introduces a human-in-the-loop step, in which relevant segments of text are localized and tagged for the word-level training part, which dramatically reduces the dimension of the document classifier input vector. We performed experiments to validate our approach using a real-world dataset comprising 270 intermediate petitions, which were carefully annotated by specialists from the 15th civil unit of the State of Alagoas, Brazil. Our results revealed that both BiLSTM and BERT-Convolutional Neural Network variants achieved an accuracy of up to 95.49%, and also outperformed baseline classifiers based on the Term Frequency–Inverse Document Frequency text vectorizer. The proposed approach is currently being utilized to automate the aforementioned justice unit, thereby increasing its efficiency in handling repetitive tasks.

NeurIPS Conference 2023 Conference Paper

Exploring Diverse In-Context Configurations for Image Captioning

  • Xu Yang
  • Yongliang Wu
  • Mingzhuo Yang
  • Haokun Chen
  • Xin Geng

After discovering that Language Models (LMs) can be good in-context few-shot learners, numerous strategies have been proposed to optimize in-context sequence configurations. Recently, researchers in Vision-Language (VL) domains have also developed their own few-shot learners, but they use only the simplest approach, i.e., random sampling, to configure in-context image-text pairs. In order to explore the effects of varying configurations on VL in-context learning, we devised four strategies for image selection and four for caption assignment to configure in-context image-text pairs for image captioning. Here, image captioning is used as the case study since it can be seen as a visually conditioned LM. Our comprehensive experiments yield two counter-intuitive but valuable insights, highlighting the distinct characteristics of VL in-context learning due to multi-modal synergy, as compared to the NLP case. Furthermore, in our exploration of optimal combination strategies, we observed an average performance enhancement of 20.9 in CIDEr scores compared to the baseline. The code is given in https://github.com/yongliang-wu/ExploreCfg.

IJCAI Conference 2023 Conference Paper

Exploring Safety Supervision for Continual Test-time Domain Adaptation

  • Xu Yang
  • Yanan Gu
  • Kun Wei
  • Cheng Deng

Continual test-time domain adaptation aims to adapt a source pre-trained model to a continually changing target domain without using any source data. Unfortunately, existing methods based on pseudo-label learning suffer from the changing target domain environment, and the quality of generated pseudo-labels is attenuated due to the domain shift, leading to instantaneous negative learning and long-term knowledge forgetting. To solve these problems, in this paper, we propose a simple yet effective framework for exploring safety supervision with three elaborate strategies: Label Safety, Sample Safety, and Parameter Safety. Firstly, to select reliable pseudo-labels, we define and adjust the confidence threshold in a self-adaptive manner according to the test-time learning status. Secondly, a soft-weighted contrastive learning module is presented to explore the highly-correlated samples and discriminate uncorrelated ones, improving the instantaneous efficiency of the model. Finally, we frame a Soft Weight Alignment strategy to normalize the distance between the parameters of the adapted model and the source pre-trained model, which alleviates the long-term problem of knowledge forgetting and significantly improves the accuracy of the adapted model in the late adaptation stage. Extensive experimental results demonstrate that our method achieves state-of-the-art performance on several benchmark datasets.

NeurIPS Conference 2023 Conference Paper

Learning From Biased Soft Labels

  • Hua Yuan
  • Yu Shi
  • Ning Xu
  • Xu Yang
  • Xin Geng
  • Yong Rui

Since the advent of knowledge distillation, many researchers have been intrigued by the dark knowledge hidden in the soft labels generated by the teacher model. This prompts us to scrutinize the circumstances under which these soft labels are effective. Predominant existing theories implicitly require that the soft labels are close to the ground-truth labels. In this paper, however, we investigate whether biased soft labels are still effective. Here, bias refers to the discrepancy between the soft labels and the ground-truth labels. We present two indicators to measure the effectiveness of the soft labels. Based on the two indicators, we propose moderate conditions to ensure that the biased soft label learning problem is both classifier-consistent and Empirical Risk Minimization (ERM) learnable, which can be applicable even for large-biased soft labels. We further design a heuristic method to train Skillful but Bad Teachers (SBTs), and these teachers with accuracy less than 30% can teach students to achieve accuracy over 90% on CIFAR-10, which is comparable to models trained on the original data. The proposed indicators adequately measure the effectiveness of the soft labels generated in this process. Moreover, our theoretical framework can be adapted to elucidate the effectiveness of soft labels in three weakly-supervised learning paradigms, namely incomplete supervision, partial label learning and learning with additive noise. Experimental results demonstrate that our indicators can measure the effectiveness of biased soft labels generated by teachers or in these weakly-supervised learning paradigms.

IJCAI Conference 2022 Conference Paper

Attention-guided Contrastive Hashing for Long-tailed Image Retrieval

  • Xuan Kou
  • Chenghao Xu
  • Xu Yang
  • Cheng Deng

Image hashing represents an image as a binary code for efficient storage and accurate retrieval. Recently, deep hashing methods have shown great improvements on ideally balanced datasets; however, long-tailed data is more common in the real world due to rare samples or data collection costs. Toward that end, this paper introduces a simple yet effective model named Attention-guided Contrastive Hashing Network (ACHNet) for long-tailed hashing. Specifically, a cross attention feature enhancement module is proposed to predict the importance of features for hashing, alleviating the loss of information originating from data dimension reduction. Moreover, unlike recent state-of-the-art contrastive methods that focus on instance-level discrimination, we optimize an innovative category-centered contrastive hashing to obtain discriminative results, which is more suitable for long-tailed scenarios. Experiments on two popular benchmarks verify the superiority of the proposed method. Our code is available at: https://github.com/KUXN98/ACHNet.

AAAI Conference 2022 Conference Paper

Learning Universal Adversarial Perturbation by Adversarial Example

  • Maosen Li
  • Yanhua Yang
  • Kun Wei
  • Xu Yang
  • Heng Huang

Deep learning models have been shown to be susceptible to universal adversarial perturbation (UAP), which has aroused wide concerns in the community. Compared with the conventional adversarial attacks that generate adversarial samples at the instance level, UAP can fool the target model for different instances with only a single perturbation, enabling us to evaluate the robustness of the model from a more effective and accurate perspective. The existing universal attack methods fail to exploit the differences and connections between the instance and universal levels to produce dominant perturbations. To address this challenge, we propose a new universal attack method that unifies instance-specific and universal attacks from a feature perspective to generate a more dominant UAP. Specifically, we reformulate the UAP generation task as a minimax optimization problem and then utilize the instance-specific attack method to solve the minimization problem, thereby obtaining better training data for generating UAP. At the same time, we also introduce a consistency regularizer to explore the relationship between training data, thus further improving the dominance of the generated UAP. Furthermore, our method is generic with no additional assumptions about the training data and hence can be applied in both data-dependent (supervised) and data-independent (unsupervised) manners. Extensive experiments demonstrate that the proposed method improves the performance by a significant margin over the existing methods in both data-dependent and data-independent settings. Code is available at https://github.com/lisenxd/AT-UAP.

JBHI Journal 2022 Journal Article

PCXRNet: Pneumonia Diagnosis From Chest X-Ray Images Using Condense Attention Block and Multiconvolution Attention Block

  • Yibo Feng
  • Xu Yang
  • Dawei Qiu
  • Huan Zhang
  • Dejian Wei
  • Jing Liu

Coronavirus disease 2019 (COVID-19) has become a global pandemic. Many recognition approaches based on convolutional neural networks have been proposed for COVID-19 chest X-ray images. However, only a few of them make good use of the potential inter- and intra-relationships of feature maps. Considering the limitation mentioned above, this paper proposes an attention-based convolutional neural network, called PCXRNet, for diagnosis of pneumonia using chest X-ray images. To utilize the information from the channels of the feature maps, we added a novel condense attention module (CDSE) that comprises two steps: a condensation step and a squeeze-excitation step. Unlike traditional channel attention modules, CDSE first downsamples the feature map channel by channel to condense the information, followed by the squeeze-excitation step, in which the channel weights are calculated. To make the model pay more attention to informative spatial parts in every feature map, we proposed a multi-convolution spatial attention module (MCSA). It reduces the number of parameters and introduces more nonlinearity. The CDSE and MCSA complement each other in series to tackle the problem of redundancy in feature maps and provide useful information from and between feature maps. We used the ChestXRay2017 dataset to explore the internal structure of PCXRNet, and the proposed network was applied to COVID-19 diagnosis. As a result, the network achieves an accuracy of 94.619%, recall of 94.753%, precision of 95.286%, and F1-score of 94.996% on the COVID-19 dataset.

AAAI Conference 2022 Conference Paper

Towards End-to-End Image Compression and Analysis with Transformers

  • Yuanchao Bai
  • Xu Yang
  • Xianming Liu
  • Junjun Jiang
  • Yaowei Wang
  • Xiangyang Ji
  • Wen Gao

We propose an end-to-end image compression and analysis model with Transformers, targeting the cloud-based image classification application. Instead of placing an existing Transformer-based image classification model directly after an image codec, we aim to redesign the Vision Transformer (ViT) model to perform image classification from the compressed features and facilitate image compression with the long-term information from the Transformer. Specifically, we first replace the patchify stem (i.e., image splitting and embedding) of the ViT model with a lightweight image encoder modelled by a convolutional neural network. The compressed features generated by the image encoder are injected with convolutional inductive bias and are fed to the Transformer for image classification, bypassing image reconstruction. Meanwhile, we propose a feature aggregation module to fuse the compressed features with the selected intermediate features of the Transformer, and feed the aggregated features to a deconvolutional neural network for image reconstruction. The aggregated features can obtain the long-term information from the self-attention mechanism of the Transformer and improve the compression performance. The rate-distortion-accuracy optimization problem is finally solved by a two-step training strategy. Experimental results demonstrate the effectiveness of the proposed model in both the image compression and the classification tasks.

JBHI Journal 2021 Journal Article

Adaptive Stimulation Profiles Modulation for Foot Drop Correction Using Functional Electrical Stimulation: A Proof of Concept Study

  • Yurong Li
  • Xu Yang
  • Yuezhu Zhou
  • Jun Chen
  • Min Du
  • Yuan Yang

Functional electrical stimulation (FES) provides an effective way for foot drop (FD) correction. To overcome the redundant and blind stimulation problems in the state-of-the-art methods, this study proposes a closed-loop scheme for an adaptive electromyography (EMG)-modulated stimulation profile. The developed method detects real-time angular velocity during walking. It provides feedback to a long short-term memory (LSTM) neural network for predicting synchronous tibialis anterior (TA) EMG. Based on the prediction, it modulates the stimulation intensity, taking into account the subject-specific dead zone and saturation of the electrically evoked activation. The proposed method is tested on ten able-bodied participants and six FD subjects as proof of concept. The experimental results show that the proposed method can successfully induce the dorsiflexion of the ankle joint and generate an activation pattern similar to a natural gait, with a mean Correlation Coefficient of 0.9021. Thus, the proposed method has the potential to help patients regain a normal gait.

IJCAI Conference 2021 Conference Paper

Graph Debiased Contrastive Learning with Joint Representation Clustering

  • Han Zhao
  • Xu Yang
  • Zhenru Wang
  • Erkun Yang
  • Cheng Deng

By contrasting positive-negative counterparts, graph contrastive learning has become a prominent technique for unsupervised graph representation learning. However, existing methods fail to consider the class information and will introduce false-negative samples in the random negative sampling, causing poor performance. To this end, we propose a graph debiased contrastive learning framework, which can jointly perform representation learning and clustering. Specifically, representations can be optimized by aligning with clustered class information, and simultaneously, the optimized representations can promote clustering, leading to more powerful representations and clustering results. More importantly, we randomly select negative samples from the clusters which are different from the positive sample's cluster. In this way, as the supervisory signals, the clustering results can be utilized to effectively decrease the false-negative samples. Extensive experiments on five datasets demonstrate that our method achieves new state-of-the-art results on graph clustering and classification tasks.

AAAI Conference 2021 Conference Paper

Incremental Embedding Learning via Zero-Shot Translation

  • Kun Wei
  • Cheng Deng
  • Xu Yang
  • Maosen Li

Modern deep learning methods have achieved great success in machine learning and computer vision fields by learning a set of pre-defined datasets. However, these methods perform unsatisfactorily when applied to real-world situations. The reason is that learning new tasks causes the trained model to quickly forget the knowledge of old tasks, which is referred to as catastrophic forgetting. Current state-of-the-art incremental learning methods tackle the catastrophic forgetting problem in traditional classification networks and ignore the problem existing in embedding networks, which are the basic networks for image retrieval, face recognition, zero-shot learning, etc. Different from traditional incremental classification networks, the semantic gap between the embedding spaces of two adjacent tasks is the main challenge for embedding networks under the incremental learning setting. Thus, we propose a novel class-incremental method for embedding networks, named zero-shot translation class-incremental method (ZSTCI), which leverages zero-shot translation to estimate the semantic gap without any exemplars. Then, we try to learn a unified representation for two adjacent tasks in the sequential learning process, which captures the relationships of previous classes and current classes precisely. In addition, ZSTCI can easily be combined with existing regularization-based incremental learning methods to further improve the performance of embedding networks. We conduct extensive experiments on CUB-200-2011 and CIFAR100, and the experimental results prove the effectiveness of our method. The code of our method has been released at https://github.com/Drkun/ZSTCI.

NeurIPS Conference 2020 Conference Paper

Adversarial Learning for Robust Deep Clustering

  • Xu Yang
  • Cheng Deng
  • Kun Wei
  • Junchi Yan
  • Wei Liu

Deep clustering integrates embedding and clustering together to obtain the optimal nonlinear embedding space, which is more effective in real-world scenarios compared with conventional clustering methods. However, the robustness of the clustering network is prone to being attenuated especially when it encounters an adversarial attack. A small perturbation in the embedding space will lead to diverse clustering results since the labels are absent. In this paper, we propose a robust deep clustering method based on adversarial learning. Specifically, we first attempt to define adversarial samples in the embedding space for the clustering network. Meanwhile, we devise an adversarial attack strategy to explore samples that easily fool the clustering layers but do not impact the performance of the deep embedding. We then provide a simple yet efficient defense algorithm to improve the robustness of the clustering network. Experimental results on two popular datasets show that the proposed adversarial learning method can significantly enhance the robustness and further improve the overall clustering performance. Particularly, the proposed method is generally applicable to multiple existing clustering frameworks to boost their robustness. The source code is available at https://github.com/xdxuyang/ALRDC.

IJCAI Conference 2020 Conference Paper

Lifelong Zero-Shot Learning

  • Kun Wei
  • Cheng Deng
  • Xu Yang

Zero-Shot Learning (ZSL) handles the problem that some testing classes never appear in the training set. Existing ZSL methods are designed for learning from a fixed training set and do not have the ability to capture and accumulate the knowledge of multiple training sets, making them infeasible for many real-world applications. In this paper, we propose a new ZSL setting, named Lifelong Zero-Shot Learning (LZSL), which aims to accumulate knowledge while learning from multiple datasets and recognize unseen classes of all trained datasets. Besides, a novel method is proposed to realize LZSL, which effectively alleviates Catastrophic Forgetting in the continuous training process. Specifically, considering that those datasets contain different semantic embeddings, we utilize a Variational Auto-Encoder to obtain unified semantic representations. Then, we leverage a selective retraining strategy to preserve the trained weights of previous tasks and avoid negative transfer when fine-tuning the entire model. Finally, knowledge distillation is employed to transfer knowledge from previous training stages to the current stage. We also design the LZSL evaluation protocol and the challenging benchmarks. Extensive experiments on these benchmarks indicate that our method tackles the LZSL problem effectively, while existing ZSL methods fail.

AAAI Conference 2018 Conference Paper

New ℓ2,1-Norm Relaxation of Multi-Way Graph Cut for Clustering

  • Xu Yang
  • Cheng Deng
  • Xianglong Liu
  • Feiping Nie

Clustering methods have attracted ever-increasing attention in the machine learning and computer vision communities in recent years. Exploring manifold information in multi-way graph cut clustering, such as ratio cut clustering, has shown promising performance. However, the traditional multi-way ratio cut clustering problem is NP-hard and thus the spectral solution may deviate from the optimal one. In this paper, we propose a new relaxed multi-way graph cut clustering method, where the ℓ2,1-norm distance instead of the squared distance is utilized to preserve a solution with much clearer cluster structures. Furthermore, the resulting solution is constrained with normalization to obtain a more sparse representation, which can encourage the solution to contain more discrete values with many zeros. The objective function is very difficult to optimize because it minimizes the ratio of two non-smooth terms. To address this problem, we transform the objective function into a quadratic problem on the Stiefel manifold (QPSM), and introduce a novel yet efficient iterative algorithm to solve it. Experimental results on several benchmark datasets show that our method significantly outperforms several state-of-the-art clustering approaches.

IJCAI Conference 2016 Conference Paper

Sparsity Conditional Energy Label Distribution Learning for Age Estimation

  • Xu Yang
  • Xin Geng
  • Deyu Zhou

By observing that faces at close ages are similar, some Label Distribution Learning (LDL) methods have been proposed to solve age estimation tasks, treating age distributions as the training targets. However, these existing LDL methods are limited because they can hardly extract enough useful information from complex image features. In this paper, Sparsity Conditional Energy Label Distribution Learning (SCE-LDL) is proposed to solve this problem. In the proposed SCE-LDL, age distributions are used as the training targets and an energy function is utilized to define the age distribution. By assigning a suitable energy function, SCE-LDL can learn distributed representations, which provides the model with strong expressiveness for capturing enough of the complexity of interest from image features. Sparsity constraints are also incorporated to ameliorate the model. Experimental results on two age datasets show remarkable advantages of the proposed SCE-LDL model over previously proposed age estimation methods.

IJCAI Conference 2007 Conference Paper

  • Xu Yang
  • John Bigham

This paper proposes an approach for learning call admission control (CAC) policies in a cellular network that handles several classes of traffic with different resource requirements. The performance measures in cellular networks are long-term revenue, utility, call blocking rate (CBR) and handoff failure rate (CDR). Reinforcement Learning (RL) can be used to provide the optimal solution; however, such methods fail when the state space and action space are huge. We apply a form of NeuroEvolution (NE) algorithm to inductively learn the CAC policies, which we call CN (Call Admission Control scheme using NE). A comparison with the Q-learning-based CAC scheme under constant traffic load shows that CN can not only approximate the optimal solution very well but also optimize the CBR and CDR in a more flexible way. Additionally, the simulation results demonstrate that the proposed scheme is capable of keeping the handoff dropping rate below a prespecified value while still maintaining an acceptable CBR in the presence of smoothly varying arrival rates of traffic, in which the state space is too large for practical deployment of the other learning scheme.