Arrow Research search

Author name cluster

Kai Jiang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

10 papers
2 author rows

Possible papers (10)

AAAI Conference 2026 · Conference Paper

TOP-RL: Task-Optimized Progressive Token Pruning with Reinforcement Learning for Vision Language Models

  • Hengyi Wang
  • Weiying Xie
  • Hui Jiang
  • Yaotao Wei
  • Kai Jiang
  • Mingxiang Cao
  • Chenhe Hao
  • Leyuan Fang

In recent years, Large Vision-Language Models (LVLMs) have significantly advanced multimodal tasks. However, their inference requires intensive processing of numerous visual tokens and incurs substantial computational overhead. Existing methods typically compress visual tokens either at the input stage or in early model layers, ignoring variations across tasks and depths. To address these limitations, we introduce TOP-RL, a Task-Optimized Progressive token pruning framework based on Reinforcement Learning. TOP-RL formulates visual token pruning as a multi-stage Markov Decision Process (MDP). It employs an agent trained with dense and fine-grained reward signals to progressively generate differentiable binary masks. This enables TOP-RL to adaptively select crucial visual tokens tailored to each task, effectively balancing accuracy and computational efficiency. Extensive experiments on leading multimodal datasets and advanced LVLMs validate that TOP-RL effectively learns task-optimized pruning policies, significantly boosting inference efficiency while preserving robust performance. For instance, LLaVA-NeXT equipped with TOP-RL achieves a 1.9x speedup in inference time and a 9.3x reduction in FLOPs, with 96% performance preserved.
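The abstract hinges on progressively generated "differentiable binary masks" over visual tokens. The sketch below illustrates that general mechanism with a straight-through sigmoid mask; the class name, scorer, and tensor shapes are assumptions for illustration, not the authors' TOP-RL code.

    # Hypothetical sketch: differentiable binary masking of visual tokens
    # (the general mechanism the abstract describes, not the authors' code).
    import torch
    import torch.nn as nn

    class TokenMaskPolicy(nn.Module):
        def __init__(self, dim: int):
            super().__init__()
            self.scorer = nn.Linear(dim, 1)  # per-token keep/drop logit

        def forward(self, tokens: torch.Tensor, tau: float = 1.0):
            # tokens: (batch, num_tokens, dim)
            logits = self.scorer(tokens).squeeze(-1)   # (batch, num_tokens)
            soft = torch.sigmoid(logits / tau)         # relaxed mask in (0, 1)
            hard = (soft > 0.5).float()                # binary keep/drop decision
            mask = hard + soft - soft.detach()         # straight-through estimator
            return tokens * mask.unsqueeze(-1), mask

    tokens = torch.randn(2, 196, 768)
    pruned, mask = TokenMaskPolicy(768)(tokens)
    print(pruned.shape, mask.mean().item())            # fraction of tokens kept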

IROS Conference 2025 · Conference Paper

Compact LED-Based Displacement Sensing for Robot Fingers

  • Amr El-Azizi
  • Sharfin Islam
  • Pedro Piacenza
  • Kai Jiang
  • Ioannis Kymissis
  • Matei Ciocarlie

In this paper, we introduce a sensor designed for robotic fingers which can provide information on the displacements induced by external forces. Our sensor uses LEDs to sense the displacement between two plates connected by a transparent elastomer; when a force is applied to the finger, the elastomer displaces and the LED signals change. We show that using LEDs as both light emitters and receivers in this context provides high sensitivity, allowing such emitter-receiver pairs to detect very small displacements. We characterize the standalone performance of the sensor by testing the ability of a supervised learning model to predict complete force and torque data from its raw signals, and obtain a mean error between 0.05 and 0.07 N across the three directions of force applied to the finger. Our method allows for compact packaging (fitting at the base of a finger) with no amplification electronics, low-cost manufacturing, easy integration into a complete hand, and tolerance of high overload shear forces and bending torques, suggesting future applicability to complete manipulation tasks.
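As a rough analogue of the characterization step (a supervised model regressing force/torque components from raw LED signals), here is a minimal sketch on synthetic data; the 8-channel input, the model choice, and the data are placeholders, not the paper's sensor or dataset.

    # Illustrative only: regress force/torque components from raw LED channels
    # with a generic supervised model on synthetic placeholder data.
    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_absolute_error

    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 8))                    # e.g. 8 raw LED intensity channels
    W = rng.normal(size=(8, 6))
    y = X @ W + 0.01 * rng.normal(size=(2000, 6))     # 3 force + 3 torque components

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
    model.fit(X_tr, y_tr)
    print(mean_absolute_error(y_te, model.predict(X_te)))  # rough analogue of the per-axis error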

AAAI Conference 2025 · Conference Paper

DiffCLIP: Few-shot Language-driven Multimodal Classifier

  • Jiaqing Zhang
  • Mingxiang Cao
  • Xue Yang
  • Kai Jiang
  • Yunsong Li

Visual language models like Contrastive Language-Image Pretraining (CLIP) have shown impressive performance in analyzing natural images with language information. However, these models often encounter challenges when applied to specialized domains such as remote sensing due to the limited availability of image-text pairs for training. To tackle this issue, we introduce DiffCLIP, a novel framework that extends CLIP to effectively convey comprehensive language-driven semantic information for accurate classification of high-dimensional multimodal remote sensing images. DiffCLIP is a few-shot learning method that leverages unlabeled images for pretraining. It employs unsupervised mask diffusion learning to capture the distribution of diverse modalities without requiring labels. The modality-shared image encoder maps multimodal data into a unified subspace, extracting shared features with consistent parameters across modalities. A well-trained image encoder further enhances learning by aligning visual representations with class-label text information from CLIP. By integrating these approaches, DiffCLIP significantly boosts CLIP performance using a minimal number of image-text pairs. We evaluate DiffCLIP on widely used high-dimensional multimodal datasets, demonstrating its effectiveness in addressing few-shot annotated classification tasks. DiffCLIP achieves an overall accuracy improvement of 10.65% across three remote sensing datasets compared with CLIP, while utilizing only 2-shot image-text pairs.
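DiffCLIP builds on CLIP's image-text alignment for classification. A minimal sketch of that alignment step follows, with random tensors standing in for the modality-shared image encoder and CLIP's class-label text features; the function name and dimensions are assumptions.

    # CLIP-style classification step: match image features to class-name text
    # embeddings by cosine similarity. Random vectors stand in for real encoders.
    import torch
    import torch.nn.functional as F

    def classify(image_feats, text_feats):
        img = F.normalize(image_feats, dim=-1)
        txt = F.normalize(text_feats, dim=-1)
        logits = img @ txt.t()                  # (num_images, num_classes)
        return logits.argmax(dim=-1)

    image_feats = torch.randn(4, 512)           # from a modality-shared image encoder
    text_feats = torch.randn(10, 512)           # from CLIP's text encoder, one row per class label
    print(classify(image_feats, text_feats))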

NeurIPS Conference 2025 · Conference Paper

Mixture of Noise for Pre-Trained Model-Based Class-Incremental Learning

  • Kai Jiang
  • Zhengyan Shi
  • Dell Zhang
  • Hongyuan Zhang
  • Xuelong Li

Class Incremental Learning (CIL) aims to continuously learn new categories while retaining the knowledge of old ones. Pre-trained models (PTMs) show promising capabilities in CIL. However, existing approaches that apply lightweight fine-tuning to backbones still induce parameter drift, thereby compromising the generalization capability of pre-trained models. Parameter drift can be conceptualized as a form of noise that obscures critical patterns learned for previous tasks. However, recent research has shown that noise is not always harmful. For example, the large number of visual patterns learned from pre-training can easily be over-used by a single task, and introducing appropriate noise can suppress some low-correlation features, thus leaving a margin for future tasks. To this end, guided by information theory, we propose learning beneficial noise for CIL with Mixture of Noise (MiN), aiming to mitigate the degradation of backbone generalization caused by adapting to new tasks. Specifically, task-specific noise is learned from the high-dimensional features of new tasks. Then, a set of weights is adjusted dynamically for an optimal mixture of the different tasks' noise. Finally, MiN embeds the beneficial noise into the intermediate features to mask the responses of inefficient patterns. Extensive experiments on six benchmark datasets demonstrate that MiN achieves state-of-the-art performance in most incremental settings, with particularly outstanding results in 50-step incremental settings. This demonstrates the significant potential of beneficial noise in continual learning. Code is available at https://github.com/ASCIIJK/MiN-NeurIPS2025.
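A hedged sketch of the mixture-of-noise idea as described: learned per-task noise vectors are combined with dynamically adjusted weights and added to intermediate features. The module name, gating choice, and shapes are assumptions, not the released MiN implementation.

    # Hypothetical sketch of "mixture of noise" for intermediate features.
    import torch
    import torch.nn as nn

    class MixtureOfNoise(nn.Module):
        def __init__(self, num_tasks: int, dim: int):
            super().__init__()
            self.noise = nn.Parameter(torch.zeros(num_tasks, dim))  # one learned noise vector per task
            self.gate = nn.Linear(dim, num_tasks)                   # dynamic mixture weights

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (batch, dim) intermediate backbone features
            weights = torch.softmax(self.gate(feats), dim=-1)       # (batch, num_tasks)
            mixed = weights @ self.noise                             # (batch, dim) mixed noise
            return feats + mixed                                     # mask inefficient responses

    feats = torch.randn(8, 768)
    print(MixtureOfNoise(num_tasks=5, dim=768)(feats).shape)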

NeurIPS Conference 2025 · Conference Paper

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

  • Jintao Zhang
  • Jia wei
  • Haoxu Wang
  • Pengle Zhang
  • Xiaoming Xu
  • Haofeng Huang
  • Kai Jiang
  • Jianfei Chen

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions: First, we leverage the new $\texttt{FP4}$ Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves $\textbf{1038}$ $\texttt{TOPS}$ on $\texttt{RTX5090}$, which is a $\textbf{5}\times$ speedup over the fastest FlashAttention on $\texttt{RTX5090}$. Experiments show that our $\texttt{FP4}$ attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works such as FlashAttention3 and SageAttention focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient $\texttt{8-bit}$ attention for both forward and backward propagation. Experiments indicate that $\texttt{8-bit}$ attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
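To make the low-bit idea concrete, here is an emulated sketch of INT8-quantized attention with per-tensor scales; the matmul is done in float for clarity, whereas the paper's kernels run on FP4/INT8 tensor cores. The function names and the per-tensor quantization granularity are assumptions, not the SageAttention implementation.

    # Emulated low-bit attention: quantize Q and K to INT8, dequantize the scores.
    import torch

    def int8_quantize(x):
        scale = x.abs().amax() / 127.0
        q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
        return q, scale

    def quantized_attention(q, k, v):
        # q, k, v: (batch, heads, seq, dim)
        q8, sq = int8_quantize(q)
        k8, sk = int8_quantize(k)
        scores = (q8.float() @ k8.float().transpose(-1, -2)) * sq * sk  # dequantized QK^T
        attn = torch.softmax(scores / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v

    q, k, v = (torch.randn(1, 4, 128, 64) for _ in range(3))
    out = quantized_attention(q, k, v)
    ref = torch.softmax(q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5, dim=-1) @ v
    print(out.shape, (out - ref).abs().max().item())  # quantization error vs. full precision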

ICML Conference 2025 · Conference Paper

Visual Generation Without Guidance

  • Huayu Chen
  • Kai Jiang
  • Kaiwen Zheng
  • Jianfei Chen 0001
  • Hang Su 0006
  • Jun Zhu 0001

Classifier-Free Guidance (CFG) has been a default technique in various visual generative models, yet it requires inference from both conditional and unconditional models during sampling. We propose to build visual models that are free from guided sampling. The resulting algorithm, Guidance-Free Training (GFT), matches the performance of CFG while reducing sampling to a single model, halving the computational cost. Unlike previous distillation-based approaches that rely on pretrained CFG networks, GFT enables training directly from scratch. GFT is simple to implement. It retains the same maximum likelihood objective as CFG and differs mainly in the parameterization of conditional models. Implementing GFT requires only minimal modifications to existing codebases, as most design choices and hyperparameters are directly inherited from CFG. Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free.
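For contrast with guidance-free sampling, the sketch below shows the standard classifier-free guidance combination (two model evaluations per step) next to the single-pass prediction GFT aims for; `model` is a dummy placeholder, not the paper's architecture or training objective.

    # CFG needs two forward passes per step; a guidance-free model needs one.
    import torch

    def cfg_predict(model, x, cond, w: float = 3.0):
        eps_cond = model(x, cond)
        eps_uncond = model(x, None)
        return eps_uncond + w * (eps_cond - eps_uncond)  # guided prediction, 2 passes

    def guidance_free_predict(model, x, cond):
        return model(x, cond)                            # single forward pass

    model = lambda x, cond: x * (0.9 if cond is not None else 1.0)  # dummy denoiser
    x = torch.randn(2, 3, 8, 8)
    print(cfg_predict(model, x, cond="cat").shape,
          guidance_free_predict(model, x, cond="cat").shape)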

NeurIPS Conference 2024 · Conference Paper

Domain Adaptation for Large-Vocabulary Object Detectors

  • Kai Jiang
  • Jiaxing Huang
  • Weiying Xie
  • Jie Lei
  • Yunsong Li
  • Ling Shao
  • Shijian Lu

Large-vocabulary object detectors (LVDs) aim to detect objects of many categories; they learn strong objectness features and can locate objects accurately when applied to various downstream data. However, LVDs often struggle to recognize the located objects due to domain discrepancies in data distribution and object vocabulary. On the other hand, recent vision-language foundation models such as CLIP demonstrate superior open-vocabulary recognition capability. This paper presents KGD, a Knowledge Graph Distillation technique that exploits the implicit knowledge graph (KG) in CLIP to effectively adapt LVDs to various downstream domains. KGD consists of two consecutive stages: 1) KG extraction, which employs CLIP to encode downstream domain data as nodes and their feature distances as edges, constructing a KG that explicitly inherits the rich semantic relations in CLIP; and 2) KG encapsulation, which transfers the extracted KG into LVDs to enable accurate cross-domain object classification. In addition, KGD can extract both visual and textual KGs independently, providing complementary vision and language knowledge for object localization and object classification in detection tasks over various downstream domains. Experiments over multiple widely adopted detection benchmarks show that KGD consistently outperforms the state of the art by large margins. Code will be released.
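A small illustration of the KG-extraction stage as described (nodes from embeddings, edges from feature distances); random vectors stand in for real CLIP features, and the similarity threshold is an assumption made for the sketch.

    # Build a toy knowledge graph: nodes are embeddings, edges come from similarity.
    import torch
    import torch.nn.functional as F

    def build_knowledge_graph(embeddings, threshold: float = 0.2):
        z = F.normalize(embeddings, dim=-1)
        sim = z @ z.t()                         # node-to-node similarity (edge weights)
        adjacency = (sim > threshold).float()
        adjacency.fill_diagonal_(0)
        return adjacency, sim

    nodes = torch.randn(20, 512)                # e.g. embeddings of downstream categories
    adj, sim = build_knowledge_graph(nodes)
    print(adj.sum().item(), "edges")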

NeurIPS Conference 2024 · Conference Paper

Open-Vocabulary Object Detection via Language Hierarchy

  • Jiaxing Huang
  • Jingyi Zhang
  • Kai Jiang
  • Shijian Lu

Recent studies on generalizable object detection have attracted increasing attention by drawing additional weak supervision from large-scale datasets with image-level labels. However, weakly-supervised detection learning often suffers from image-to-box label mismatch, i.e., image-level labels do not convey precise object information. We design Language Hierarchical Self-training (LHST), which introduces language hierarchy into weakly-supervised detector training to learn more generalizable detectors. LHST expands the image-level labels with language hierarchy and enables co-regularization between the expanded labels and self-training. Specifically, the expanded labels regularize self-training by providing richer supervision and mitigating the image-to-box label mismatch, while self-training allows assessing and selecting the expanded labels according to the predicted reliability. In addition, we design language hierarchical prompt generation, which introduces language hierarchy into prompt generation and helps bridge the vocabulary gaps between training and testing. Extensive experiments show that the proposed techniques achieve superior generalization performance consistently across 14 widely studied object detection datasets.
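A toy sketch of the two mechanisms in the abstract: expanding image-level labels through a class hierarchy, then selecting self-training pseudo-labels by predicted reliability. The hierarchy, class names, and confidence threshold are invented for illustration only.

    # Toy label expansion via a class hierarchy plus confidence-based selection.
    hierarchy = {"dog": ["animal", "mammal"], "car": ["vehicle"]}

    def expand_labels(image_labels):
        expanded = set(image_labels)
        for label in image_labels:
            expanded.update(hierarchy.get(label, []))
        return expanded

    def select_pseudo_labels(predictions, threshold=0.7):
        # predictions: list of (class_name, confidence) from the current detector
        return [cls for cls, conf in predictions if conf >= threshold]

    print(expand_labels({"dog"}))                              # contains 'dog', 'animal', 'mammal'
    print(select_pseudo_labels([("dog", 0.9), ("car", 0.4)]))  # ['dog']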

AAAI Conference 2021 · Conference Paper

LREN: Low-Rank Embedded Network for Sample-Free Hyperspectral Anomaly Detection

  • Kai Jiang
  • Weiying Xie
  • Jie Lei
  • Tao Jiang
  • Yunsong Li

Hyperspectral anomaly detection (HAD) is a challenging task because it explores the intrinsic structure of complex high-dimensional signals without any samples at training time. Deep neural networks (DNNs) can dig out the underlying distribution of hyperspectral data but are limited by the labeling of large-scale hyperspectral datasets, especially given the low spatial resolution of hyperspectral data, which makes labeling more difficult. To tackle this problem while ensuring detection performance, we present an unsupervised low-rank embedded network (LREN) in this paper. LREN is a joint learning network in which the latent representation is specifically designed for HAD, rather than merely serving as a feature input for the detector. It searches for the lowest-rank representation based on a representative and discriminative dictionary in the deep latent space to estimate the residual efficiently. Considering the physical mixing properties in hyperspectral imaging, we develop a trainable density estimation module based on a Gaussian mixture model (GMM) in the deep latent space to construct a dictionary that can better characterize complex hyperspectral images (HSIs). The closed-form solution of the proposed low-rank learner surpasses existing approaches on four real hyperspectral datasets with different anomalies. We argue that this unified framework paves a novel way to combine feature-extraction and anomaly-estimation-based methods for HAD, aiming to learn the underlying representation tailored for HAD without the prerequisite of manually labeled data. Code is available at https://github.com/xdjiangkai/LREN.
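A simplified analogue of the density-estimation module: fit a Gaussian mixture to placeholder latent features and score low-likelihood samples as anomalous. This omits the joint low-rank learning in LREN; the synthetic data and component count are assumptions.

    # GMM density estimation in a (placeholder) latent space for anomaly scoring.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(0)
    background = rng.normal(0.0, 1.0, size=(1000, 16))  # latent features of background pixels
    anomalies = rng.normal(5.0, 1.0, size=(10, 16))      # a few anomalous pixels
    latent = np.vstack([background, anomalies])

    gmm = GaussianMixture(n_components=3, random_state=0).fit(latent)
    scores = -gmm.score_samples(latent)                  # higher score = more anomalous
    print(scores[:5].round(2), scores[-5:].round(2))     # background vs. anomaly scores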