Arrow Research

Author name cluster

Yuanbo Wen

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

11 papers
1 author row

Possible papers (11)

AAAI Conference 2026 · Conference Paper

QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

  • Xinguo Zhu
  • Shaohui Peng
  • Jiaming Guo
  • Yunji Chen
  • Qi Guo
  • Yuanbo Wen
  • Hang Qin
  • Ruizhi Chen

Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While large language models (LLMs) offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation code. To address the challenge of exploring an intractable space, we propose Macro-Thinking Micro-Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs to generate high-performance GPU kernels. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves nearly 100% and 70% accuracy at Levels 1-2 and 3 respectively, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3× speedup over LLMs and 2.2× over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and a 34× speedup. All models and datasets will be made publicly available.
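
As a rough illustration of the hierarchy the abstract describes (not the authors' released code), the sketch below shows how a Macro-Thinking policy and a Micro-Coding implementer could interleave; `propose_strategy_step`, `implement_step`, and `compile_and_test` are hypothetical stand-ins.

```python
from typing import Callable, Optional

def mtmc_generate_kernel(
    task: str,
    propose_strategy_step: Callable[[str, list[str]], Optional[str]],  # Macro Thinking: next optimization proposal
    implement_step: Callable[[str, str], str],                         # Micro Coding: apply one proposal to the kernel
    compile_and_test: Callable[[str], bool],                           # correctness check for each increment
    max_steps: int = 8,
) -> str:
    """Hierarchical generation: high-level strategy steps, each realized incrementally."""
    kernel = implement_step(task, "naive baseline implementation")
    history: list[str] = []
    for _ in range(max_steps):
        proposal = propose_strategy_step(task, history)  # e.g. "tile the reduction into shared memory"
        if proposal is None:                             # policy decides no further optimization helps
            break
        candidate = implement_step(kernel, proposal)     # small local edit instead of full-kernel regeneration
        if compile_and_test(candidate):                  # only keep increments that stay correct
            kernel = candidate
            history.append(proposal)
    return kernel
```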

IJCAI Conference 2025 · Conference Paper

Automated Superscalar Processor Design by Learning Data Dependencies

  • Shuyao Cheng
  • Rui Zhang
  • Wenkai He
  • Pengwei Jin
  • Chongxiao Li
  • Zidong Du
  • Xing Hu
  • Yifan Hao

Automated processor design, which can significantly reduce human effort and accelerate design cycles, has received considerable attention. While recent advancements have automatically designed single-cycle processors that execute one instruction per cycle, their performance cannot compete with modern superscalar processors that execute multiple instructions per cycle. Previous methods fail on superscalar processor design because they cannot address inter-instruction data dependencies, leading to inefficient sequential instruction execution. This paper proposes a novel approach to automatically designing superscalar processors using a hardware-friendly model called the Stateful Binary Speculation Diagram (State-BSD). We observe that processor parallelism can be enhanced through on-the-fly predictors of inter-instruction dependent data, which reuse the processor's internal states to learn the data dependencies. To meet the challenges of both hardware-resource limitation and design functional correctness, State-BSD consists of two components: 1) a lightweight state-selector, trained by a simulated annealing method, that detects the most reusable processor states and stores them in a small buffer; and 2) a highly precise state-speculator, trained by a BSD expansion method, that predicts inter-instruction dependent data using the selected states. This is the first work to achieve automated superscalar processor design, i.e., QiMeng-CPU-v2, which improves performance by about 380× over the state-of-the-art automated design and is comparable to human-designed superscalar processors such as the ARM Cortex-A53.
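
The abstract only names the training method for the state-selector, so the following is a generic simulated-annealing sketch of choosing a small buffer of reusable states; `reuse_score` is a hypothetical stand-in for the paper's actual objective.

```python
import math
import random
from typing import Callable

def anneal_state_selection(
    num_states: int,
    buffer_size: int,
    reuse_score: Callable[[frozenset[int]], float],  # stand-in objective: predictive value of a state subset
    steps: int = 10_000,
    t0: float = 1.0,
    cooling: float = 0.999,
) -> frozenset[int]:
    """Pick which processor states to keep in the small buffer via simulated annealing."""
    current = frozenset(random.sample(range(num_states), buffer_size))
    cur_score = reuse_score(current)
    best, best_score, temp = current, cur_score, t0
    for _ in range(steps):
        # neighbor move: swap one selected state for an unselected one
        out = random.choice(sorted(current))
        inn = random.choice([s for s in range(num_states) if s not in current])
        candidate = (current - {out}) | {inn}
        cand_score = reuse_score(candidate)
        # accept uphill moves always, downhill moves with Boltzmann probability
        if cand_score >= cur_score or random.random() < math.exp((cand_score - cur_score) / temp):
            current, cur_score = candidate, cand_score
            if cur_score > best_score:
                best, best_score = current, cur_score
        temp *= cooling
    return best
```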

NeurIPS Conference 2025 · Conference Paper

MigGPT: Harnessing Large Language Models for Automated Migration of Out-of-Tree Linux Kernel Patches Across Versions

  • Pucheng Dang
  • Di Huang
  • Dong Li
  • Kang Chen
  • Yuanbo Wen
  • Qi Guo
  • Xing Hu

Out-of-tree kernel patches are essential for adapting the Linux kernel to new hardware or enabling specific functionalities. Maintaining and updating these patches across different kernel versions demands significant effort from experienced engineers. Large language models (LLMs) have shown remarkable progress across various domains, suggesting their potential for automating out-of-tree kernel patch migration. However, our findings reveal that LLMs, while promising, struggle with incomplete code context understanding and inaccurate migration point identification. In this work, we propose MigGPT, a framework that employs a novel code fingerprint structure to retain code snippet information and incorporates three meticulously designed modules to improve the migration accuracy and efficiency of out-of-tree kernel patches. Furthermore, we establish a robust benchmark using real-world out-of-tree kernel patch projects to evaluate LLM capabilities. Evaluations show that MigGPT significantly outperforms the direct application of vanilla LLMs, achieving an average completion rate of 72.59% (↑50.74%) for migration tasks.
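
The abstract does not spell out the fingerprint format, so the sketch below is a hypothetical illustration of the general idea: record enough normalized context around a patch hunk to re-locate the migration point in a newer kernel tree. All names (`CodeFingerprint`, `locate`, the 0.8 match threshold) are invented for illustration.

```python
from dataclasses import dataclass
import hashlib

@dataclass(frozen=True)
class CodeFingerprint:
    """Hypothetical fingerprint for one patch hunk: enough context to re-locate it later."""
    file_path: str
    symbol: str                    # enclosing function or struct name
    anchor_lines: tuple[str, ...]  # normalized lines surrounding the migration point
    digest: str                    # short hash for quick equality checks

def make_fingerprint(file_path: str, symbol: str, context: list[str]) -> CodeFingerprint:
    normalized = tuple(line.strip() for line in context if line.strip())
    digest = hashlib.sha256("\n".join(normalized).encode()).hexdigest()[:16]
    return CodeFingerprint(file_path, symbol, normalized, digest)

def locate(fp: CodeFingerprint, new_source: list[str]) -> int | None:
    """Find the best anchor match in a newer kernel version; None if it moved beyond recognition."""
    needle = list(fp.anchor_lines)
    for i in range(len(new_source) - len(needle) + 1):
        window = [new_source[i + j].strip() for j in range(len(needle))]
        hits = sum(a == b for a, b in zip(window, needle))
        if hits >= max(1, int(0.8 * len(needle))):  # tolerate small drift between versions
            return i
    return None
```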

AAAI Conference 2025 · Conference Paper

Multi-axis Prompt and Multi-dimension Fusion Network for All-in-one Weather-degraded Image Restoration

  • Yuanbo Wen
  • Tao Gao
  • Jing Zhang
  • Ziqi Li
  • Ting Chen

Existing approaches to removing adverse weather degradations compromise image quality and incur long processing times. To address this, we introduce a multi-axis prompt and multi-dimension fusion network (MPMF-Net). Specifically, we develop a multi-axis prompt learning block (MPLB), which learns the prompts along three separate axis planes, requiring fewer parameters while achieving superior performance. Moreover, we present a multi-dimension feature interaction block (MFIB), which optimizes intra-scale feature fusion by segregating features along the height, width, and channel dimensions. This strategy enables more accurate mutual attention and adaptive weight determination. Additionally, we propose coarse-scale degradation-free implicit neural representations (CDINR) to normalize the degradation levels of different weather conditions. Extensive experiments demonstrate the significant improvements of our model over recent well-performing approaches in both reconstruction fidelity and inference time.
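
As a minimal PyTorch sketch of the three-axis-plane prompt idea (assumed shapes and fusion, not the published architecture): one learnable 2-D prompt per plane of a (B, C, H, W) feature map, each broadcast along its missing axis, which needs far fewer parameters than a full (C, H, W) prompt tensor.

```python
import torch
import torch.nn as nn

class MultiAxisPromptBlock(nn.Module):
    """Illustrative sketch (not the paper's code): learnable prompts on three axis planes."""
    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.p_ch = nn.Parameter(torch.zeros(1, channels, height, 1))  # C-H plane
        self.p_cw = nn.Parameter(torch.zeros(1, channels, 1, width))   # C-W plane
        self.p_hw = nn.Parameter(torch.zeros(1, 1, height, width))     # H-W plane
        self.fuse = nn.Conv2d(channels, channels, kernel_size=1)       # mix the prompted features

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        prompted = x + self.p_ch + self.p_cw + self.p_hw  # broadcast over the missing axes
        return self.fuse(prompted)

# usage: feats = torch.randn(2, 32, 64, 64); out = MultiAxisPromptBlock(32, 64, 64)(feats)
```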

AAAI Conference 2025 · Conference Paper

QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models

  • Qirui Zhou
  • Yuanbo Wen
  • Ruizhi Chen
  • Ke Gao
  • Weiqiang Xiong
  • Ling Li
  • Qi Guo
  • Yanjun Wu

As a crucial operator in numerous scientific and engineering computing applications, the automatic optimization of General Matrix Multiplication (GEMM) with full utilization of ever-evolving hardware architectures (e.g., GPUs and RISC-V) is of paramount importance. While Large Language Models (LLMs) can generate functionally correct code for simple tasks, they have yet to produce high-performance code. The key challenge resides in deeply understanding diverse hardware architectures and crafting prompts that effectively unleash the potential of LLMs to generate high-performance code. In this paper, we propose a novel prompt mechanism called QiMeng-GEMM, which enables LLMs to comprehend the architectural characteristics of different hardware platforms and automatically search for optimization combinations for GEMM. The key to QiMeng-GEMM is a set of informative, adaptive, and iterative meta-prompts. Based on these, a search strategy for optimal combinations of meta-prompts is used to iteratively generate high-performance code. Extensive experiments conducted on 4 leading LLMs, various paradigmatic hardware platforms, and representative matrix dimensions unequivocally demonstrate QiMeng-GEMM's superior performance in auto-generating optimized GEMM code. Compared to vanilla prompts, our method achieves a performance enhancement of up to 113×. Even when compared to human experts, our method can reach 115% of cuBLAS performance on NVIDIA GPUs and 211% of OpenBLAS on RISC-V CPUs. Notably, while human experts often take months to optimize GEMM, our approach reduces the development cost by over 240×.
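
A greedy search over meta-prompt combinations, sketched under assumptions (the paper's actual search strategy may differ); `generate_gemm` and `benchmark` are hypothetical callables for the LLM invocation and the performance measurement.

```python
from typing import Callable

def search_meta_prompts(
    meta_prompts: list[str],                    # candidate meta-prompts (arch hints, tiling, vectorization, ...)
    generate_gemm: Callable[[list[str]], str],  # LLM call: prompt combination -> GEMM source
    benchmark: Callable[[str], float],          # measured GFLOPS of compiled code; 0.0 if incorrect
    max_rounds: int = 5,
) -> tuple[list[str], float]:
    """Greedy illustration of iteratively composing meta-prompts toward the best-performing code."""
    chosen: list[str] = []
    best_perf = benchmark(generate_gemm(chosen))  # baseline: vanilla prompt
    for _ in range(max_rounds):
        round_best, round_perf = None, best_perf
        for mp in meta_prompts:
            if mp in chosen:
                continue
            perf = benchmark(generate_gemm(chosen + [mp]))
            if perf > round_perf:
                round_best, round_perf = mp, perf
        if round_best is None:                    # no meta-prompt improves performance further
            break
        chosen.append(round_best)
        best_perf = round_perf
    return chosen, best_perf
```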

NeurIPS Conference 2025 · Conference Paper

QiMeng-MuPa: Mutual-Supervised Learning for Sequential-to-Parallel Code Translation

  • Changxin Ke
  • Rui Zhang
  • Shuo Wang
  • Li Ding
  • Guangli Li
  • Yuanbo Wen
  • Shuoming Zhang
  • Ruiyuan Xu

The rise of GPU-based high-performance computing (HPC) has driven the widespread adoption of parallel programming models such as CUDA. Yet the inherent complexity of parallel programming creates demand for automated sequential-to-parallel approaches. However, data scarcity poses a significant challenge for machine-learning-based sequential-to-parallel code translation. Although recent back-translation methods show promise, they still fail to ensure functional equivalence in the translated code. In this paper, we propose QiMeng-MuPa, a novel Mutual-Supervised Learning framework for Sequential-to-Parallel code translation, to address the functional equivalence issue. QiMeng-MuPa consists of two models, a Translator and a Tester. Through an iterative loop consisting of Co-verify and Co-evolve steps, the Translator and the Tester mutually generate data for each other and improve collectively. The Tester generates unit tests to verify and filter functionally equivalent translated code, thereby evolving the Translator, while the Translator generates translated code as augmented input to evolve the Tester. Experimental results demonstrate that QiMeng-MuPa significantly enhances the performance of the base models: when applied to Qwen2.5-Coder, it not only improves Pass@1 by up to 28.91% and boosts Tester performance by 68.90%, but also outperforms the previous state-of-the-art method CodeRosetta by 1.56 and 6.92 in BLEU and CodeBLEU scores, while achieving performance comparable to DeepSeek-R1 and GPT-4.1. Our code is available at https://github.com/kcxain/mupa.
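
One Co-verify/Co-evolve round could look roughly like this; the sketch assumes abstract `translator`, `tester`, and `run_tests` callables rather than the paper's concrete models.

```python
from typing import Callable

def mutual_supervised_round(
    seq_programs: list[str],
    translator: Callable[[str], str],                  # sequential source -> parallel candidate
    tester: Callable[[str], list[str]],                # program -> generated unit tests
    run_tests: Callable[[str, str, list[str]], bool],  # do source and translation agree on all tests?
) -> tuple[list[tuple[str, str]], list[str]]:
    """One Co-verify/Co-evolve round, sketched: each model produces training data for the other."""
    verified_pairs: list[tuple[str, str]] = []  # fine-tuning data for the Translator
    augmented_inputs: list[str] = []            # fresh translated code to harden the Tester
    for src in seq_programs:
        candidate = translator(src)
        tests = tester(src)
        if run_tests(src, candidate, tests):    # Co-verify: keep only functionally equivalent pairs
            verified_pairs.append((src, candidate))
        augmented_inputs.append(candidate)      # Co-evolve: Tester trains on Translator outputs
    return verified_pairs, augmented_inputs
```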

NeurIPS Conference 2025 · Conference Paper

QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code

  • Hainan Fang
  • Yuanbo Wen
  • Jun Bi
  • Yihan Wang
  • Tonghui He
  • Yanlin Tang
  • Di Huang
  • Jiaming Guo

Compilers, while essential, are notoriously complex systems that demand prohibitively expensive human expertise to develop and maintain. Recent advancements in Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation, which could potentially simplify compiler development for new architectures and facilitate the discovery of innovative optimization techniques. However, several critical obstacles impede its practical adoption. First, a significant lack of dedicated benchmarks and robust evaluation methodologies hinders objective assessment and tracking of progress in the field. Second, systematically enhancing the reliability and performance of LLM-generated assembly remains a critical challenge. Addressing these challenges, this paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation. Leveraging this dataset, we first define a foundational Neural Compilation workflow and conduct a comprehensive evaluation of the capabilities of recent frontier LLMs on Neural Compilation, establishing new performance baselines. We further propose a self-evolving prompt optimization method that enables LLMs to iteratively evolve their internal prompt strategies by extracting insights from prior self-debugging traces, thereby enhancing their neural compilation capabilities. Experiments demonstrate that our method significantly improves both the functional correctness and the performance of LLM-generated assembly code. Compared to baseline prompts, functional correctness rates improved from 44% to 64% on x86_64 and from 36% to 58% on aarch64. More significantly, among the 16 correctly generated x86_64 programs using our method, 14 (87.5%) surpassed clang -O3 performance. These consistent improvements across diverse architectures (x86_64 and aarch64) and program distributions (NeuComBack L1 and L2) validate our method's superiority over conventional approaches and its potential for broader adoption in low-level neural compilation.
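
The self-evolving loop, as far as the abstract describes it, can be sketched like so; `generate_asm`, `check`, and `extract_insight` are hypothetical interfaces, and the real method's prompt-update rule is surely richer.

```python
from typing import Callable

def self_evolving_compile(
    ir_program: str,
    base_prompt: str,
    generate_asm: Callable[[str, str], str],    # (prompt, IR) -> assembly attempt
    check: Callable[[str], tuple[bool, str]],   # assemble + run tests -> (ok, debug trace)
    extract_insight: Callable[[str, str], str], # (prompt, trace) -> revised prompt strategy
    max_iters: int = 4,
) -> str | None:
    """Sketch of the self-debugging loop: failures feed back into the prompt itself."""
    prompt = base_prompt
    for _ in range(max_iters):
        asm = generate_asm(prompt, ir_program)
        ok, trace = check(asm)
        if ok:
            return asm
        prompt = extract_insight(prompt, trace)  # evolve the strategy from the failure trace
    return None                                  # give up after the iteration budget
```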

IJCAI Conference 2025 · Conference Paper

QiMeng-TensorOp: One-Line Prompt is Enough for High-Performance Tensor Operator Generation with Hardware Primitives

  • Xuzhi Zhang
  • Shaohui Peng
  • Qirui Zhou
  • Yuanbo Wen
  • Qi Guo
  • Ruizhi Chen
  • Xinguo Zhu
  • Weiqiang Xiong

Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementations take at least months to develop and lack portability. LLMs excel at generating high-level language code, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce QiMeng-TensorOp, a tensor-operator auto-generation framework driven by a one-line user prompt, which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives and to tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to a 1291× performance improvement. Even compared with human experts, QiMeng-TensorOp can reach 251% of OpenBLAS performance on RISC-V CPUs and 124% of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp significantly reduces development costs by 200× compared with human experts.
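
Since the abstract mentions a parameter-tuning stage, here is a deliberately simple random-search tuner as a stand-in (the framework's actual tuner is not described); `render_kernel`, `benchmark`, and the search-space keys are invented for illustration.

```python
import itertools
import random
from typing import Callable

def tune_operator(
    render_kernel: Callable[[dict[str, int]], str],  # hypothetical: params -> operator source with hw primitives
    benchmark: Callable[[str], float],               # measured GFLOPS; 0.0 if the candidate fails
    search_space: dict[str, list[int]],              # e.g. {"tile_m": [4, 8, 16], "unroll": [1, 2, 4]}
    budget: int = 64,
) -> tuple[dict[str, int], float]:
    """Random-search tuning pass, as a stand-in for the parameter-tuning stage."""
    keys = list(search_space)
    all_configs = [dict(zip(keys, vals)) for vals in itertools.product(*search_space.values())]
    random.shuffle(all_configs)
    best_cfg, best_perf = all_configs[0], 0.0
    for cfg in all_configs[:budget]:                 # sample at most `budget` configurations
        perf = benchmark(render_kernel(cfg))
        if perf > best_perf:
            best_cfg, best_perf = cfg, perf
    return best_cfg, best_perf
```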

NeurIPS Conference 2023 · Conference Paper

ANPL: Towards Natural Programming with Interactive Decomposition

  • Di Huang
  • Ziyuan Nan
  • Xing Hu
  • Pengwei Jin
  • Shaohui Peng
  • Yuanbo Wen
  • Rui Zhang
  • Zidong Du

Though LLMs are capable of generating plausible programs, it is challenging to interact with an LLM further to revise the program, especially when the user's specific requirements differ from the initial proposal. In this paper, we introduce ANPL, an interactive programming system that ensures users can always refine the generated code towards their specific programmatic intents via structured decompositions. Borrowing the paradigm of sketching from program synthesis, an ANPL program consists of a set of input-outputs that it must satisfy, a "sketch" (control/data flow expressed in precise code, e.g., Python), and "holes" (sub-modules to be implemented by the LLM, specified in natural language). The user revises an ANPL program by either modifying the sketch, changing the language used to describe the holes, or providing additional input-outputs to a particular hole, turning it into a sub-ANPL program that can be solved recursively. This workflow allows users to offload programming burdens to the LLM as much as possible while retaining the ability to pinpoint and resolve bugs locally, without exposing the rest of the program to the LLM. We deploy ANPL on the Abstraction and Reasoning Corpus (ARC), a set of unique tasks that are challenging for state-of-the-art AI systems, showing it outperforms baseline programming systems that (a) cannot decompose tasks interactively and (b) cannot guarantee that the modules are correctly composed together. Additional evaluations on APPS, HumanEval, and real-world programming tasks validate that the ANPL framework is applicable to multiple programming domains. We release the ANPL solutions to the ARC tasks as a dataset, providing insights into how humans decompose novel tasks programmatically.
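
A toy rendering of the sketch-and-holes idea in Python (illustrative, not ANPL's actual syntax): the sketch pins down control/data flow in precise code, while each hole is only a natural-language spec handed to an LLM synthesizer.

```python
from typing import Any, Callable

# A "hole" is a sub-module specified in natural language; an LLM synthesizes its body.
Hole = Callable[..., Any]

def anpl_sketch(grid: list[list[int]], synthesize: Callable[[str], Hole]) -> Any:
    """Hypothetical ANPL-style program for an ARC-like task: the sketch fixes the
    data flow, while each hole carries only a natural-language spec for the LLM."""
    find_objects = synthesize("segment the grid into connected same-color objects")
    recolor = synthesize("recolor every object with the color of the largest object")
    objects = find_objects(grid)   # hole 1: implemented by the LLM
    return recolor(grid, objects)  # hole 2: composed with hole 1's output
```

If `recolor` misbehaves, the user can revise that hole alone, by rewording its description or attaching input-output examples to it, without regenerating or even exposing the rest of the sketch.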

NeurIPS Conference 2023 · Conference Paper

Emergent Communication for Rules Reasoning

  • Yuxuan Guo
  • Yifan Hao
  • Rui Zhang
  • Enshuai Zhou
  • Zidong Du
  • Xishan Zhang
  • Xinkai Song
  • Yuanbo Wen

Research on emergent communication between deep-learning-based agents has received extensive attention due to its inspiration for linguistics and artificial intelligence. However, previous attempts have centered on emergent communication under perception-oriented environmental settings, which force agents to describe low-level perceptual features within image or symbol contexts. In this work, inspired by the classic human reasoning test (namely Raven's Progressive Matrices), we propose the Reasoning Game, a cognition-oriented environment that encourages agents to reason about and communicate high-level rules rather than perceived low-level contexts. Moreover, we propose 1) an unbiased dataset (namely rule-RAVEN) as a benchmark to avoid overfitting, and 2) a two-stage curriculum agent-training method as a baseline for more stable convergence in the Reasoning Game, where contexts and semantics are bilaterally drifting. Experimental results show that, in the Reasoning Game, a semantically stable and compositional language emerges to solve reasoning problems. The emerged language helps agents apply extracted rules to generalize to unseen context attributes, and to transfer between different context attributes or even tasks.
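
One round of a Reasoning-Game-style referential game, sketched under assumptions (message and context encodings are left abstract, and all callables are hypothetical): the listener must pick a fresh context generated from the same hidden rule, so copying low-level perceptual features cannot win.

```python
import random
from typing import Callable

def reasoning_game_round(
    rules: list[str],
    make_context: Callable[[str], list[int]],               # render a rule into a concrete panel context
    speaker: Callable[[list[int]], list[int]],              # context -> discrete message (the emergent language)
    listener: Callable[[list[int], list[list[int]]], int],  # message + candidate contexts -> chosen index
) -> bool:
    """One round: the speaker sees one context for a hidden rule; the listener must pick
    a *different* context generated from the same rule among distractors."""
    rule = random.choice(rules)
    speaker_ctx = make_context(rule)
    target = make_context(rule)                             # same rule, fresh context attributes
    distractors = [make_context(r) for r in rules if r != rule]
    candidates = distractors + [target]
    random.shuffle(candidates)
    msg = speaker(speaker_ctx)
    return candidates[listener(msg, candidates)] is target  # success only if the rule was communicated
```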