
Author name cluster

Jia Wei

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

ICML 2025 · Conference Paper

SageAttention2: Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization

  • Jintao Zhang
  • Haofeng Huang
  • Pengle Zhang
  • Jia Wei
  • Jun Zhu 0001
  • Jianfei Chen 0001

Although quantization for linear layers has been widely used, its application to accelerating the attention process remains limited. To further enhance the efficiency of attention computation compared to SageAttention while maintaining precision, we propose SageAttention2, which utilizes significantly faster 4-bit matrix multiplication (Matmul) alongside additional precision-enhancing techniques. First, we propose to quantize the matrices $(Q, K)$ to INT4 at a hardware-friendly thread-level granularity and quantize the matrices $(\widetilde P, V)$ to FP8. Second, we propose a method to smooth $Q$, enhancing the accuracy of the INT4 $QK^\top$. Third, we propose a two-level accumulation strategy for $\widetilde PV$ to enhance the accuracy of the FP8 $\widetilde PV$. The operations per second (OPS) of SageAttention2 surpass FlashAttention2 and xformers by about 3x and 4.5x, respectively. Moreover, SageAttention2 matches the speed of FlashAttention3 (FP8) on Hopper GPUs while delivering much higher accuracy. Comprehensive experiments confirm that our approach incurs negligible end-to-end metric loss across diverse models, including those for language, image, and video generation.
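To make the quantization step in this abstract more concrete, here is a minimal numpy sketch (not the paper's CUDA kernel): a channel-wise mean is removed from $Q$ (the smoothing idea), and each fixed-size group of rows is then quantized symmetrically to INT4 with its own scale. The group size and the max-based scale rule are illustrative assumptions.

```python
import numpy as np

def smooth_and_quantize_int4(x, group_size=64):
    """Subtract a channel-wise mean (smoothing), then quantize each group of
    rows symmetrically to INT4 with its own scale. The group size is an
    assumed value, not the paper's thread-level granularity."""
    mean = x.mean(axis=0, keepdims=True)          # smoothing term, kept in high precision
    x_s = x - mean
    q = np.empty(x_s.shape, dtype=np.int8)        # INT4 values stored in int8 containers
    scales = []
    for start in range(0, x_s.shape[0], group_size):
        block = x_s[start:start + group_size]
        scale = np.abs(block).max() / 7.0 + 1e-8  # symmetric INT4 range [-7, 7]
        q[start:start + group_size] = np.clip(np.rint(block / scale), -7, 7).astype(np.int8)
        scales.append(scale)
    # the removed mean has to be compensated separately when forming Q K^T
    return q, np.array(scales), mean
```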

NeurIPS 2025 · Conference Paper

SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

  • Jintao Zhang
  • Jia wei
  • Haoxu Wang
  • Pengle Zhang
  • Xiaoming Xu
  • Haofeng Huang
  • Kai Jiang
  • Jianfei Chen

The efficiency of attention is important due to its quadratic time complexity. We enhance the efficiency of attention through two key contributions. First, we leverage the new $\texttt{FP4}$ Tensor Cores in Blackwell GPUs to accelerate attention computation. Our implementation achieves $\textbf{1038}$ $\texttt{TOPS}$ on $\texttt{RTX5090}$, which is a $\textbf{5}\times$ speedup over the fastest FlashAttention on $\texttt{RTX5090}$. Experiments show that our $\texttt{FP4}$ attention can accelerate inference of various models in a plug-and-play way. Second, we pioneer the application of low-bit attention to training tasks. Existing low-bit attention works, such as FlashAttention3 and SageAttention, focus only on inference. However, the efficiency of training large models is also important. To explore whether low-bit attention can be effectively applied to training tasks, we design an accurate and efficient $\texttt{8-bit}$ attention for both forward and backward propagation. Experiments indicate that $\texttt{8-bit}$ attention achieves lossless performance in fine-tuning tasks but exhibits slower convergence in pretraining tasks. The code is available at https://github.com/thu-ml/SageAttention.
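As a rough picture of what microscaling FP4 quantization does (a sketch under an assumed block size and scale rule, not the paper's Blackwell kernel), the snippet below fake-quantizes a tensor: every block of contiguous values shares one scale, and each value snaps to the nearest representable FP4 (e2m1) magnitude.

```python
import numpy as np

# Representable magnitudes of the FP4 (e2m1) format used by microscaling schemes.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quantize_mx_fp4(x, block_size=32):
    """Fake-quantize a tensor with microscaling FP4: each block of contiguous
    values shares one scale, and each value snaps to the nearest FP4 magnitude.
    Block size and the max-based scale rule are assumptions."""
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    for start in range(0, x.size, block_size):
        block = x.flat[start:start + block_size]
        scale = np.abs(block).max() / FP4_GRID[-1] + 1e-12
        scaled = block / scale
        nearest = FP4_GRID[np.abs(np.abs(scaled)[:, None] - FP4_GRID[None, :]).argmin(axis=1)]
        out.flat[start:start + block_size] = np.sign(scaled) * nearest * scale
    return out
```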

ICLR 2025 · Conference Paper

SageAttention: Accurate 8-Bit Attention for Plug-and-play Inference Acceleration

  • Jintao Zhang
  • Jia Wei
  • Pengle Zhang
  • Jun Zhu 0001
  • Jianfei Chen 0001

The transformer architecture predominates across various models. As the heart of the transformer, attention has a computational complexity of $O(N^2)$, compared to $O(N)$ for linear transformations. When handling large sequence lengths, attention becomes the primary time-consuming component. Although quantization has proven to be an effective method for accelerating model inference, existing quantization methods primarily focus on optimizing the linear layer. In response, we first analyze the feasibility of quantization in attention in detail. Following that, we propose SageAttention, a highly efficient and accurate quantization method for attention. The OPS (operations per second) of our approach outperforms FlashAttention2 and xformers by about 2.1x and 2.7x, respectively. SageAttention also achieves superior accuracy over FlashAttention3. Comprehensive experiments confirm that our approach incurs almost no end-to-end metric loss across diverse models, including those for language processing, image generation, and video generation. The code is available at https://github.com/thu-ml/SageAttention.
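For intuition about what 8-bit attention means here, the reference sketch below (plain numpy, not a fused kernel) removes the per-channel mean from $K$, which leaves the softmax over keys unchanged, quantizes $Q$ and $K$ per token to INT8 for the $QK^\top$ matmul, and keeps the $PV$ product in FP16. Per-token scaling and the FP16 accumulator are simplifying assumptions, not the paper's exact granularity.

```python
import numpy as np

def int8_attention_reference(Q, K, V):
    """Toy reference of 8-bit attention: INT8 QK^T with INT32 accumulation,
    softmax in FP32, PV in FP16. Scaling granularity is an assumption."""
    K = K - K.mean(axis=0, keepdims=True)   # shifts each row of QK^T by a per-query
                                            # constant, so the softmax is unchanged
    def quant(x):
        s = np.abs(x).max(axis=-1, keepdims=True) / 127.0 + 1e-8   # per-token scale
        return np.clip(np.rint(x / s), -127, 127).astype(np.int8), s

    q8, sq = quant(Q)
    k8, sk = quant(K)
    S = (q8.astype(np.int32) @ k8.astype(np.int32).T).astype(np.float32)
    S = S * (sq * sk.T) / np.sqrt(Q.shape[-1])              # dequantize and apply 1/sqrt(d)
    P = np.exp(S - S.max(axis=-1, keepdims=True))
    P /= P.sum(axis=-1, keepdims=True)
    return (P.astype(np.float16) @ V.astype(np.float16)).astype(np.float32)
```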

ICML 2025 · Conference Paper

SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference

  • Jintao Zhang
  • Chendong Xiang
  • Haofeng Huang
  • Jia Wei
  • Haocheng Xi
  • Jun Zhu 0001
  • Jianfei Chen 0001

An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing the corresponding computations to be omitted. Many studies have utilized this sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and the end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling some matrix multiplications in attention to be skipped. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including those for language, image, and video generation, without sacrificing end-to-end metrics.
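One way to picture the first-stage filter is the block-level sketch below: query and key blocks are summarized by their mean vectors, the resulting coarse score matrix serves as a proxy for the attention map, and only the highest-scoring block pairs are kept. The block size, keep ratio, and scoring rule are illustrative assumptions rather than the paper's predictor.

```python
import numpy as np

def coarse_block_mask(Q, K, block=64, keep_ratio=0.3):
    """Score query/key block pairs with each block's mean vector and keep only
    the highest-scoring pairs; skipped pairs need no QK^T or PV matmul.
    Block size, keep ratio, and the scoring rule are assumptions."""
    qb = np.stack([Q[i:i + block].mean(axis=0) for i in range(0, len(Q), block)])
    kb = np.stack([K[i:i + block].mean(axis=0) for i in range(0, len(K), block)])
    scores = qb @ kb.T                              # coarse proxy for the attention map
    thresh = np.quantile(scores, 1.0 - keep_ratio)  # keep roughly the top keep_ratio of pairs
    return scores >= thresh                         # True -> compute this block pair
```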

IJCAI 2022 · Conference Paper

Unsupervised Multi-Modal Medical Image Registration via Discriminator-Free Image-to-Image Translation

  • Zekang Chen
  • Jia wei
  • Rui Li

In clinical practice, well-aligned multi-modal images, such as Magnetic Resonance (MR) and Computed Tomography (CT), can together provide complementary information for image-guided therapies. Multi-modal image registration is essential for the accurate alignment of these images. However, it remains a very challenging task due to the complicated and unknown spatial correspondence between different modalities. In this paper, we propose a novel translation-based unsupervised deformable image registration approach that converts the multi-modal registration problem into a mono-modal one. Specifically, our approach incorporates a discriminator-free translation network to facilitate the training of the registration network and a patchwise contrastive loss to encourage the translation network to preserve object shapes. Furthermore, we propose to replace the adversarial loss widely used in previous multi-modal image registration methods with a pixel loss, in order to integrate the output of the translation into the target modality. This leads to an unsupervised method requiring no ground-truth deformations or pairs of aligned images for training. We evaluate four variants of our approach on the public Learn2Reg 2021 datasets. The experimental results demonstrate that the proposed architecture achieves state-of-the-art performance. Our code is available at https://github.com/heyblackC/DFMIR.
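As a rough sketch of the loss structure implied above (the L1 form and the weighting are assumptions, not the paper's exact objective), the snippet below pairs a pixel-wise term between the translated moving image and the fixed image, standing in for the discarded adversarial loss, with a smoothness penalty on the predicted displacement field.

```python
import numpy as np

def registration_loss(translated_moving, fixed, displacement, smooth_weight=0.1):
    """Hypothetical loss sketch: L1 pixel loss (replacing an adversarial loss)
    plus finite-difference smoothness of the displacement field (H, W, 2)."""
    pixel = np.abs(translated_moving - fixed).mean()
    dx = np.diff(displacement, axis=0)              # gradients along image height
    dy = np.diff(displacement, axis=1)              # gradients along image width
    smooth = (dx ** 2).mean() + (dy ** 2).mean()
    return pixel + smooth_weight * smooth
```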

ECAI 2020 · Conference Paper

Inter-Slice Image Augmentation Based on Frame Interpolation for Boosting Medical Image Segmentation Accuracy

  • Zhaotao Wu
  • Jia Wei
  • Wenguang Yuan
  • Jiabing Wang
  • Tolga Tasdizen

We introduce the idea of inter-slice image augmentation, whereby the number of medical images and corresponding segmentation labels is increased between two consecutive images in order to boost medical image segmentation accuracy. Unlike conventional data augmentation methods in medical imaging, which only increase the number of training samples directly by adding new virtual samples using simple parameterized transformations such as rotation, flipping, scaling, etc., we aim to augment data based on the relationship between two consecutive images, which increases not only the number but also the information content of the training samples. For this purpose, we propose a frame-interpolation-based data augmentation method to generate intermediate medical images and the corresponding segmentation labels between two consecutive images. We train and test a supervised U-Net liver segmentation network on SLIVER07 and CHAOS2019, respectively, with the augmented training samples, and obtain segmentation scores that show significant improvement over conventional augmentation methods.
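To illustrate the augmentation idea (not the paper's learned frame-interpolation model), the sketch below inserts synthetic slices between two consecutive slices by simple linear blending; blending the labels and thresholding them is likewise an assumption standing in for the interpolated segmentation labels.

```python
import numpy as np

def augment_between_slices(slice_a, slice_b, label_a, label_b, num_new=3):
    """Insert num_new synthetic slices between two consecutive slices.
    Linear blending stands in for the learned frame-interpolation model;
    thresholding the blended labels is likewise an assumption."""
    augmented = []
    for k in range(1, num_new + 1):
        t = k / (num_new + 1)
        image = (1 - t) * slice_a + t * slice_b
        label = ((1 - t) * label_a + t * label_b) >= 0.5   # soft blend back to a binary mask
        augmented.append((image, label.astype(label_a.dtype)))
    return augmented
```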

JBHI 2014 · Journal Article

Local and Global Preserving Semisupervised Dimensionality Reduction Based on Random Subspace for Cancer Classification

  • Xianfa Cai
  • Jia wei
  • Guihua Wen
  • Zhiwen Yu

Precise cancer classification is essential to the successful diagnosis and treatment of cancers. Although semisupervised dimensionality reduction approaches perform very well on clean datasets, the topology of the neighborhood constructed with most existing approaches is unstable in the presence of high-dimensional data with noise. To solve this problem, we propose a novel local and global preserving semisupervised dimensionality reduction algorithm based on random subspaces, denoted RSLGSSDR. The algorithm first designs multiple diverse graphs on different random subspaces of the dataset and then fuses these graphs into a mixture graph on which dimensionality reduction is performed. As the mixture graph is constructed in lower dimensionality, it eases the difficulty of graph construction on high-dimensional samples, and the diversity of the random subspaces allows it to capture the complicated geometric distribution of the data. Experimental results on public gene expression datasets demonstrate that the proposed RSLGSSDR not only has superior recognition performance to competing methods but is also robust over a wide range of input parameter values.
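A minimal sketch of the graph-construction step described above, under assumed choices for the number of subspaces, subspace dimensionality, neighborhood size, and fusion rule: build a kNN graph on each random feature subspace and average the graphs into the mixture graph used for dimensionality reduction.

```python
import numpy as np

def mixture_graph(X, n_subspaces=10, subspace_dim=50, k=5, seed=None):
    """Build a kNN graph on each random feature subspace and average the graphs
    into a mixture graph. Subspace count, dimensionality, k, and the averaging
    fusion rule are assumed choices."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = np.zeros((n, n))
    for _ in range(n_subspaces):
        feats = rng.choice(d, size=min(subspace_dim, d), replace=False)
        Xs = X[:, feats]
        dist = ((Xs[:, None, :] - Xs[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
        np.fill_diagonal(dist, np.inf)
        nbrs = np.argsort(dist, axis=1)[:, :k]
        A = np.zeros((n, n))
        A[np.repeat(np.arange(n), k), nbrs.ravel()] = 1.0
        W += np.maximum(A, A.T)                                   # symmetrize the subspace graph
    return W / n_subspaces                                        # mixture graph for the reduction step
```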