Arrow Research Search

Author name cluster

Jiayin Wang

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
1 author row

Possible papers (4)

AAAI Conference 2026 · Conference Paper

Making Every Head Count: Sparse Attention Without the Speed-Performance Trade-off

  • Mingkuan Zhao
  • Wentao Hu
  • Jiayin Wang
  • Xin Lai
  • Tianchen Huang
  • Yuheng Min
  • Rui Yan
  • Xiaoyan Zhu

The design of Large Language Models (LLMs) has long been hampered by a fundamental conflict within their core attention mechanism: its remarkable expressivity is built upon a computational complexity of O(H·N²) that grows quadratically with the context size (N) and linearly with the number of heads (H). This standard implementation harbors significant computational redundancy, as all heads independently compute attention over the same sequence space. Existing sparse methods, meanwhile, often trade information integrity for computational efficiency. To resolve this efficiency-performance trade-off, we propose SPAttention, whose core contribution is a new paradigm we term Principled Structural Sparsity. SPAttention does not merely drop connections; instead, it reorganizes the computational task, partitioning the total attention workload into balanced, non-overlapping distance bands and assigning each head a unique segment. This transforms the multi-head attention mechanism from H independent O(N²) computations into a single, collaborative O(N²) computation, fundamentally reducing complexity by a factor of H. The structured inductive bias compels functional specialization among heads, reallocating computational resources from redundant modeling to distinct dependencies across the entire sequence span. Extensive empirical validation on the OLMoE-1B-7B and 0.25B-1.75B model series demonstrates that SPAttention delivers an approximately two-fold increase in training throughput while performing on par with standard dense attention, surpassing it on select key metrics, and it consistently outperforms representative sparse attention methods, including Longformer, Reformer, and BigBird, across all evaluation metrics. Our work demonstrates that thoughtfully designed structural sparsity can serve as an effective inductive bias that simultaneously improves computational efficiency and model performance, opening a new avenue for the architectural design of next-generation, high-performance LLMs.
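
The banded-head idea described above can be made concrete with a small sketch. The code below is a toy reading of the abstract, not the paper's implementation: it splits the causal query-key distances into H contiguous bands (the band boundaries, function names, and the dense mask are illustrative assumptions) and lets each head attend only within its band, so the heads jointly cover each query-key pair exactly once. A real kernel would compute only the roughly N²/H in-band scores per head instead of masking a full score matrix.

    import numpy as np

    def band_mask(n, lo, hi):
        # True where lo <= (query_pos - key_pos) < hi, i.e. a causal distance band.
        dist = np.arange(n)[:, None] - np.arange(n)[None, :]
        return (dist >= lo) & (dist < hi)

    def sp_attention(q, k, v, n_heads):
        # q, k, v: (n_heads, N, d_head). Each head sees only its distance band,
        # so the H heads together perform one collaborative O(N^2) computation.
        n, d = q.shape[1], q.shape[2]
        edges = np.linspace(0, n, n_heads + 1).astype(int)  # balanced band boundaries
        out = np.zeros_like(v)
        for h in range(n_heads):
            mask = band_mask(n, edges[h], edges[h + 1])
            scores = np.where(mask, q[h] @ k[h].T / np.sqrt(d), -np.inf)
            valid = mask.any(axis=1)  # queries with at least one in-band key
            w = np.zeros((n, n))
            w[valid] = np.exp(scores[valid] - scores[valid].max(axis=1, keepdims=True))
            w[valid] /= w[valid].sum(axis=1, keepdims=True)
            out[h] = w @ v[h]  # queries with no in-band key get a zero output
        return out

    # Toy usage: 4 heads jointly cover all causal distances of a length-16 sequence.
    q, k, v = np.random.default_rng(0).normal(size=(3, 4, 16, 8))
    print(sp_attention(q, k, v, n_heads=4).shape)  # (4, 16, 8)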

AAAI Conference 2026 · Conference Paper

Mosaic Pruning: A Hierarchical Framework for Generalizable Pruning of Mixture-of-Experts Models

  • Wentao Hu
  • Mingkuan Zhao
  • Shuangyong Song
  • Xiaoyan Zhu
  • Xin Lai
  • Jiayin Wang

Sparse Mixture-of-Experts (SMoE) architectures have enabled a new frontier in scaling Large Language Models (LLMs), offering superior performance by activating only a fraction of their total parameters during inference. However, their practical deployment is severely hampered by substantial static memory overhead, as all experts must be loaded into memory. Existing post-training pruning methods, while reducing model size, often derive their pruning criteria from a single, general-purpose corpus. This leads to a critical limitation: catastrophic performance degradation when the pruned model is applied to other domains, necessitating costly re-pruning for each new domain. To address this generalization gap, we introduce Mosaic Pruning (MoP). The core idea of MoP is to construct a functionally comprehensive set of experts through a structured "cluster-then-select" process. This process leverages a similarity metric that captures expert performance across different task domains to functionally cluster the experts, and subsequently selects the most representative expert from each cluster based on our proposed Activation Variability Score. Unlike methods that optimize for a single corpus, Mosaic Pruning ensures that the pruned model retains a functionally complementary set of experts, much like the tiles of a mosaic that together form a complete picture of the original model's capabilities, enabling it to handle diverse downstream tasks. Extensive experiments on various MoE models demonstrate the superiority of our approach. MoP significantly outperforms prior work, achieving a 7.24% gain on general tasks and an 8.92% gain on specialized tasks like math reasoning and code generation.
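
As a rough illustration of the "cluster-then-select" recipe, the sketch below clusters experts by a per-domain activation profile and keeps, from each cluster, the expert whose activation varies most across domains. The profile matrix, the plain k-means step, and the variance proxy are stand-ins invented here; the paper's similarity metric and Activation Variability Score are not specified in this abstract.

    import numpy as np

    def kmeans(x, k, iters=50, seed=0):
        # Plain Lloyd's algorithm; stands in for whatever clustering MoP uses.
        rng = np.random.default_rng(seed)
        centers = x[rng.choice(len(x), k, replace=False)]
        for _ in range(iters):
            labels = ((x[:, None] - centers[None]) ** 2).sum(-1).argmin(1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = x[labels == j].mean(axis=0)
        return labels

    def mosaic_prune(profiles, n_keep):
        # profiles: (n_experts, n_domains) mean activation of each expert per domain.
        labels = kmeans(profiles, n_keep)
        score = profiles.var(axis=1)  # proxy score: variability across domains
        keep = [int(np.argmax(np.where(labels == j, score, -np.inf)))
                for j in range(n_keep)]
        return sorted(set(keep))      # one representative per non-empty cluster

    kept = mosaic_prune(np.random.default_rng(1).random((64, 6)), n_keep=16)
    print(kept)  # indices of the experts retained after pruning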

AAAI Conference 2025 · Conference Paper

Multi-Label Ranking Loss Minimization for Matrix Completion

  • Jiaxuan Li
  • Xiaoyan Zhu
  • Hongrui Wang
  • Yu Zhang
  • Xin Lai
  • Jiayin Wang

Common matrix completion methods minimize the rank of the matrix to be completed in addition to the Hamming loss between the incomplete and completed matrices. The rank of a matrix measures the linear relations among its vectors, which may introduce ambiguity in data recovery. To cope with this issue, we extend the multi-label ranking loss to matrix completion and propose multi-label ranking loss minimization (MLRM) to exploit the relative correlation among matrix vectors. In MLRM, the original incomplete matrix is converted into a pairwise ranking matrix, and the approximation of this newly generated matrix can be viewed as a surrogate for the multi-label ranking loss, replacing the Hamming loss used in existing methods. Extensive experiments demonstrate that MLRM outperforms state-of-the-art matrix completion methods in a variety of applications, including movie recommendation, drug-target interaction prediction, and multi-label learning.
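
The conversion step can be sketched in a few lines. Below, a partially observed matrix becomes a row-wise pairwise ranking matrix, with one column per column pair (i, j) holding sign(m[r, i] - m[r, j]) where both cells are observed, and that matrix is then completed by a soft-impute-style low-rank iteration. The pair indexing and the SVD-based completion are illustrative assumptions; the abstract does not spell out the exact surrogate loss.

    import numpy as np
    from itertools import combinations

    def pairwise_ranking_matrix(m, observed):
        # One column per (i, j) pair; entry is sign(m[r, i] - m[r, j]) when both
        # cells are observed in row r, NaN otherwise.
        rows, cols = m.shape
        pairs = list(combinations(range(cols), 2))
        p = np.full((rows, len(pairs)), np.nan)
        for c, (i, j) in enumerate(pairs):
            both = observed[:, i] & observed[:, j]
            p[both, c] = np.sign(m[both, i] - m[both, j])
        return p

    def soft_impute(p, rank=2, iters=100):
        # Complete the ranking matrix by alternating a rank-truncated SVD with
        # re-imposing the observed entries.
        mask = ~np.isnan(p)
        filled = np.where(mask, p, 0.0)
        for _ in range(iters):
            u, s, vt = np.linalg.svd(filled, full_matrices=False)
            low = (u[:, :rank] * s[:rank]) @ vt[:rank]
            filled = np.where(mask, p, low)
        return low

    rng = np.random.default_rng(2)
    m, observed = rng.random((8, 5)), rng.random((8, 5)) < 0.7
    print(soft_impute(pairwise_ranking_matrix(m, observed)).shape)  # (8, 10)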

AAAI Conference 2023 · Conference Paper

AdaBoost.C2: Boosting Classifiers Chains for Multi-Label Classification

  • Jiaxuan Li
  • Xiaoyan Zhu
  • Jiayin Wang

Over the last few decades, multi-label classification (MLC) has attracted the attention of more and more researchers due to its wide range of real-world applications. Many boosting methods for MLC have been proposed and have achieved great success. However, these methods only extend existing boosting frameworks to MLC and adopt multi-label versions of loss functions to guide the iteration. These loss functions generally evaluate the label set as a whole, so the characteristics of individual labels are ignored. In this paper, we propose a multi-path AdaBoost framework specific to MLC, where each boosting path is established for a distinct label and their combination provides a maximal optimization of Hamming loss. In each iteration, a classifier chain is taken as the base classifier to strengthen the connection between the multiple AdaBoost paths and exploit label correlation. Extensive experiments demonstrate the effectiveness of the proposed method.
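
A toy rendering of the multi-path idea: one AdaBoost-style weight distribution per label, with a classifier chain (each label's learner also sees the predictions for earlier labels) as the shared base classifier in every round. The tree depth, the chain's use of its own predictions during fitting, and the per-label exponential weight update are assumptions made for this sketch, not the paper's exact algorithm.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    class ChainRound:
        # One classifier chain: label l is predicted from X augmented with the
        # chain's predictions for labels 0..l-1, exploiting label correlation.
        def fit(self, X, Y, weights):
            self.models, feats = [], X
            for l in range(Y.shape[1]):
                m = DecisionTreeClassifier(max_depth=3)
                m.fit(feats, Y[:, l], sample_weight=weights[l])  # path-l weights
                self.models.append(m)
                feats = np.column_stack([feats, m.predict(feats)])
            return self

        def predict(self, X):
            feats, preds = X, []
            for m in self.models:
                p = m.predict(feats)
                preds.append(p)
                feats = np.column_stack([feats, p])
            return np.column_stack(preds)

    def adaboost_c2(X, Y, rounds=10):
        n, L = Y.shape
        weights = np.full((L, n), 1.0 / n)  # one boosting path per label
        ensemble = []
        for _ in range(rounds):
            chain = ChainRound().fit(X, Y, weights)
            pred = chain.predict(X)
            alphas = np.zeros(L)
            for l in range(L):  # independent AdaBoost update along each path
                err = np.clip(weights[l][pred[:, l] != Y[:, l]].sum(), 1e-10, 1 - 1e-10)
                alphas[l] = 0.5 * np.log((1 - err) / err)
                agree = np.where(pred[:, l] == Y[:, l], 1.0, -1.0)
                weights[l] *= np.exp(-alphas[l] * agree)
                weights[l] /= weights[l].sum()
            ensemble.append((alphas, chain))
        return ensemble  # predict by an alpha-weighted per-label vote of the chains

    rng = np.random.default_rng(3)
    X = rng.normal(size=(200, 10))
    Y = (X[:, :3] + rng.normal(scale=0.5, size=(200, 3)) > 0).astype(int)
    models = adaboost_c2(X, Y, rounds=5)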