Arrow Research search

Author name cluster

Ke Gao

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
1 author row

Possible papers

EAAI Journal 2026 Journal Article

A combined air volume prediction model for hazardous mine tunnels using directed graph convolutional neural networks

  • Zhen Wang
  • Erkan Topal
  • Yongli Li
  • Ke Gao
  • Chen Yang

During actual mining, certain hazardous areas are difficult for personnel to access due to safety concerns, resulting in critical blind spots for air volume prediction. To address this issue, this research proposes a combined air volume prediction model for hazardous mine tunnels. First, Time-Variant Filter Empirical Mode Decomposition is applied to the raw tunnel air volume data within the data augmentation module to reduce noise. Subsequently, the processed data is integrated into the graph as initial node features, transforming the two-dimensional data into graph-based data. Then, Granger causality is employed to determine the weights and directions between pairs of tunnels within the graph. A Directed Graph Convolutional Neural Network is used to learn spatial feature relationships within the graph data. To overcome the computational burden of full-graph convolutional operations in the Directed Graph Convolutional Neural Network and the limitations of the fixed adjacency matrix assumption, Bidirectional Long Short-Term Memory, Bidirectional Gated Recurrent Unit, and Bidirectional Temporal Convolutional Network are used to learn temporal features from the raw air volume data of hazardous tunnels. Simultaneously, multiple attention mechanisms are integrated into the three bidirectional deep learning algorithms to enhance the prediction capabilities of the individual models. Finally, the three individual prediction models are combined into a composite prediction model. The Sparrow Search Algorithm is employed to adjust the weight of each individual model within the combined prediction model, aiming to minimize prediction error and obtain the final predicted air volume for hazardous tunnels.
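The final combination step described above can be illustrated with a minimal sketch. It assumes three hypothetical model outputs and made-up air volume readings, and it swaps the Sparrow Search Algorithm for a plain grid search over non-negative weights that sum to 1, purely to keep the example short. The function names are illustrative, not from the paper.

```python
from itertools import product

def combine(preds, weights):
    """Weighted sum of per-model predictions, one weight per model."""
    return [sum(w * p[t] for w, p in zip(weights, preds))
            for t in range(len(preds[0]))]

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def search_weights(preds, y_true, steps=20):
    """Grid search over three non-negative weights summing to 1.

    Stand-in for a metaheuristic such as Sparrow Search (assumption:
    the objective is the MSE of the combined prediction).
    """
    best_w, best_err = None, float("inf")
    for i, j in product(range(steps + 1), repeat=2):
        if i + j > steps:
            continue
        w = (i / steps, j / steps, (steps - i - j) / steps)
        err = mse(y_true, combine(preds, w))
        if err < best_err:
            best_w, best_err = w, err
    return best_w, best_err

# Hypothetical readings and three hypothetical model outputs.
y_true = [10.0, 12.0, 11.0, 13.0]
preds = [
    [10.5, 12.5, 11.5, 13.5],  # model A: biased high by 0.5
    [9.5, 11.5, 10.5, 12.5],   # model B: biased low by 0.5
    [11.0, 13.0, 12.0, 14.0],  # model C: biased high by 1.0
]
weights, err = search_weights(preds, y_true)
print(weights, err)
```

Because models A and B err in opposite directions, a suitable weighting cancels their biases, which is the kind of gain a combined model aims for over any single predictor.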

AAAI Conference 2026 Conference Paper

QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

  • Xinguo Zhu
  • Shaohui Peng
  • Jiaming Guo
  • Yunji Chen
  • Qi Guo
  • Yuanbo Wen
  • Hang Qin
  • Ruizhi Chen

Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While large language models (LLMs) offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation code. To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs to generate high-performance GPU kernels. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves nearly 100% and 70% accuracy at Levels 1-2 and Level 3, respectively, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3× speedup over LLMs and 2.2× over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34× speedup. All models and datasets will be made publicly available.

NeurIPS Conference 2025 Conference Paper

EasySpec: Layer-Parallel Speculative Decoding for Efficient Multi-GPU Utilization

  • Yize Wu
  • Ke Gao
  • Ling Li
  • Yanjun Wu

Speculative decoding is an effective and lossless method for Large Language Model (LLM) inference acceleration. It employs a smaller model to generate a draft token sequence, which is then verified by the original base model. In multi-GPU systems, inference latency can be further reduced through tensor parallelism (TP), while the optimal TP size of the draft model is typically smaller than that of the base model, leading to GPU idling during the drafting stage. We observe that such inefficiency stems from the sequential execution of layers, which is seemingly natural but actually unnecessary. Therefore, we propose EasySpec, a layer-parallel speculation strategy that optimizes the efficiency of multi-GPU utilization. EasySpec breaks the inter-layer data dependencies in the draft model, enabling multiple layers to run simultaneously across multiple devices as "fuzzy" speculation. After each drafting-and-verification iteration, the draft model’s key-value cache is calibrated in a single forward pass, preventing long-term fuzzy-error accumulation at minimal additional latency. EasySpec is a training-free and plug-in method. We evaluated EasySpec on several mainstream open-source LLMs, using smaller versions of models from the same series as drafters. The results demonstrate that EasySpec can achieve a peak speedup of 4.17x compared to vanilla decoding, while preserving the original distributions of the base LLMs. Specifically, the drafting stage can be accelerated by up to 1.62x with a maximum speculation accuracy drop of only 7%. The code is available at https://github.com/Yize-Wu/EasySpec.
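The draft-then-verify loop that speculative decoding relies on can be sketched as follows. This is a toy, greedy-decoding illustration of the general technique only, not EasySpec's layer-parallel strategy. `base_next` and `draft_next` are hypothetical stand-ins for the base and draft models, and a real system would verify all drafted positions in one batched forward pass.

```python
def base_next(tokens):
    """Toy 'base model': the next token is a deterministic function of the context."""
    return (sum(tokens) * 31 + len(tokens)) % 50

def draft_next(tokens):
    """Toy 'draft model': agrees with the base model except on some contexts."""
    guess = base_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 50

def greedy_decode(prompt, n_new):
    """Reference decoding with the base model alone."""
    tokens = list(prompt)
    for _ in range(n_new):
        tokens.append(base_next(tokens))
    return tokens

def speculative_decode(prompt, n_new, k=4):
    """Draft k tokens with the small model, then verify them with the base model."""
    tokens = list(prompt)
    target_len = len(prompt) + n_new
    while len(tokens) < target_len:
        # Drafting stage: the small model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(tokens + draft))
        # Verification stage: keep the longest prefix the base model agrees with.
        accepted = []
        for tok in draft:
            expected = base_next(tokens + accepted)
            if tok != expected:
                accepted.append(expected)  # replace the first mismatch
                break
            accepted.append(tok)
        else:
            # Every draft token was accepted, so the base model adds one more.
            accepted.append(base_next(tokens + accepted))
        tokens.extend(accepted)
    return tokens[:target_len]

prompt = [1, 2, 3]
out = speculative_decode(prompt, 20)
print(out == greedy_decode(prompt, 20))
```

Every emitted token is one the base model would have chosen in that context, which is why the output matches plain greedy decoding regardless of how often the draft model is wrong.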

AAAI Conference 2025 Conference Paper

QiMeng-GEMM: Automatically Generating High-Performance Matrix Multiplication Code by Exploiting Large Language Models

  • Qirui Zhou
  • Yuanbo Wen
  • Ruizhi Chen
  • Ke Gao
  • Weiqiang Xiong
  • Ling Li
  • Qi Guo
  • Yanjun Wu

General Matrix Multiplication (GEMM) is a crucial operator in numerous scientific and engineering computing applications, so automatically optimizing it to fully utilize ever-evolving hardware architectures (e.g., GPUs and RISC-V) is of paramount importance. While Large Language Models (LLMs) can generate functionally correct code for simple tasks, they have yet to produce high-performance code. The key challenge lies in deeply understanding diverse hardware architectures and crafting prompts that effectively unleash the potential of LLMs to generate high-performance code. In this paper, we propose a novel prompt mechanism called QiMeng-GEMM, which enables LLMs to comprehend the architectural characteristics of different hardware platforms and automatically search for optimization combinations for GEMM. The key to QiMeng-GEMM is a set of informative, adaptive, and iterative meta-prompts. Based on this, a search strategy for optimal combinations of meta-prompts is used to iteratively generate high-performance code. Extensive experiments conducted on 4 leading LLMs, various paradigmatic hardware platforms, and representative matrix dimensions demonstrate QiMeng-GEMM’s superior performance in auto-generating optimized GEMM code. Compared to vanilla prompts, our method achieves a performance enhancement of up to 113×. Even when compared to human experts, our method can reach 115% of cuBLAS on NVIDIA GPUs and 211% of OpenBLAS on RISC-V CPUs. Notably, while human experts often take months to optimize GEMM, our approach reduces the development cost by over 240×.
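As a rough illustration of the kind of GEMM optimization such a search might combine, the sketch below contrasts a naive triple loop with loop tiling (blocking). Tiling is chosen here as an assumption, since the abstract does not list the specific optimizations. The function names and the `tile` parameter are hypothetical, and pure Python only demonstrates the loop structure, not the performance.

```python
def matmul_naive(a, b):
    """Reference GEMM: C[i][j] = sum over p of A[i][p] * B[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

def matmul_tiled(a, b, tile=2):
    """Same result, computed block by block so each tile can stay in fast memory."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile):          # tile over rows of C
        for j0 in range(0, m, tile):      # tile over columns of C
            for p0 in range(0, k, tile):  # tile over the reduction dimension
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(p0, min(p0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c

a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]]
b = [[1, 0, 2, 0, 1], [0, 1, 0, 2, 1], [3, 0, 1, 0, 1], [0, 3, 0, 1, 1]]
print(matmul_tiled(a, b) == matmul_naive(a, b))
```

The tile size is exactly the sort of hardware-dependent parameter that an automated search would tune per platform.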

IJCAI Conference 2025 Conference Paper

QiMeng-TensorOp: One-Line Prompt is Enough for High-Performance Tensor Operator Generation with Hardware Primitives

  • Xuzhi Zhang
  • Shaohui Peng
  • Qirui Zhou
  • Yuanbo Wen
  • Qi Guo
  • Ruizhi Chen
  • Xinguo Zhu
  • Weiqiang Xiong

Computation-intensive tensor operators constitute over 90% of the computations in Large Language Models (LLMs) and Deep Neural Networks. Automatically and efficiently generating high-performance tensor operators with hardware primitives is crucial for diverse and ever-evolving hardware architectures like RISC-V, ARM, and GPUs, as manually optimized implementations take at least months and lack portability. LLMs excel at generating high-level language code, but they struggle to fully comprehend hardware characteristics and produce high-performance tensor operators. We introduce a tensor-operator auto-generation framework with a one-line user prompt (QiMeng-TensorOp), which enables LLMs to automatically exploit hardware characteristics to generate tensor operators with hardware primitives, and tune parameters for optimal performance across diverse hardware. Experimental results on various hardware platforms, SOTA LLMs, and typical tensor operators demonstrate that QiMeng-TensorOp effectively unleashes the computing capability of various hardware platforms, and automatically generates tensor operators of superior performance. Compared with vanilla LLMs, QiMeng-TensorOp achieves up to 1291× performance improvement. Even compared with human experts, QiMeng-TensorOp reaches 251% of OpenBLAS on RISC-V CPUs and 124% of cuBLAS on NVIDIA GPUs. Additionally, QiMeng-TensorOp reduces development costs by 200× compared with human experts.

EAAI Journal 2023 Journal Article

CSCMOT: Multi-object tracking based on channel spatial cooperative attention mechanism

  • Fei Wang
  • Hao Yan
  • Libo Zhang
  • Ke Gao

Multi-object tracking has made good progress in recent years. Most mainstream methods fuse detection and Re-ID to perform multi-object tracking. However, current multi-object tracking algorithms are slow and cannot meet real-time requirements, which makes them difficult to deploy in practical scenarios. In addition, current mainstream multi-object tracking methods often suffer from identity switches. Frequent identity switches can cause serious problems in demanding practical applications, resulting in poor tracking performance. To solve these problems, we propose a simple framework, CSCMOT. A non-parametric attention mechanism is adopted to focus on feature points of the target without increasing the amount of computation, which improves the real-time performance of the algorithm. In addition, identity switches are reduced by randomly simulated occlusion, which improves tracking performance. Experiments show that the proposed CSCMOT framework runs at 32.5 FPS, which exceeds most mainstream methods. In addition, the number of ID switches is reduced to 2493 on the MOT17 dataset, a substantial improvement on the identity switch problem. The tracking accuracy reaches 71.5, a competitive result that exceeds most mainstream methods. The results show that the framework improves the real-time performance of the algorithm, mitigates identity switches between targets, and is better suited to practical deployment, as it is easy to combine with a mobile robot platform.
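The random simulated occlusion mentioned above can be pictured as a training-time augmentation that blanks out a random rectangle of the input. The sketch below is one plausible reading under stated assumptions, not the paper's implementation. The function name, the `max_frac` limit, and the fill value are hypothetical.

```python
import random

def random_occlude(image, max_frac=0.5, fill=0, rng=None):
    """Return a copy of a 2D image with one random rectangle set to `fill`.

    max_frac caps the rectangle's height and width as a fraction of the
    image size (an assumed parameter, chosen for illustration).
    """
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    occ_h = rng.randint(1, max(1, int(h * max_frac)))
    occ_w = rng.randint(1, max(1, int(w * max_frac)))
    top = rng.randint(0, h - occ_h)
    left = rng.randint(0, w - occ_w)
    out = [row[:] for row in image]  # copy so the input stays untouched
    for y in range(top, top + occ_h):
        for x in range(left, left + occ_w):
            out[y][x] = fill
    return out, (top, left, occ_h, occ_w)

image = [[1] * 8 for _ in range(6)]
occluded, box = random_occlude(image, rng=random.Random(0))
for row in occluded:
    print(row)
```

Training on such partially hidden targets is a common way to make appearance features more robust when objects pass behind one another.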

NeurIPS Conference 2023 Conference Paper

NeRF-IBVS: Visual Servo Based on NeRF for Visual Localization and Navigation

  • Yuanze Wang
  • Yichao Yan
  • Dianxi Shi
  • Wenhan Zhu
  • Jianqiang Xia
  • Tan Jeff
  • Songchang Jin
  • Ke Gao

Visual localization is a fundamental task in computer vision and robotics. Training existing visual localization methods requires a large number of posed images to generalize to novel views, while state-of-the-art methods generally require dense ground truth 3D labels for supervision. However, acquiring a large number of posed images and dense 3D labels in the real world is challenging and costly. In this paper, we present a novel visual localization method that achieves accurate localization while using only a few posed images compared to other localization methods. To achieve this, we first use a few posed images with coarse pseudo-3D labels provided by NeRF to train a coordinate regression network. Then a coarse pose is estimated from the regression network with PnP. Finally, we use image-based visual servo (IBVS) with the scene prior provided by NeRF for pose optimization. Furthermore, our method can provide an effective navigation prior, which enables navigation based on IBVS without using custom markers or depth sensors. Extensive experiments on the 7-Scenes and 12-Scenes datasets demonstrate that our method outperforms state-of-the-art methods under the same setting, with only 5% to 25% of the training data. Furthermore, our framework can be naturally extended to the visual navigation task based on IBVS, and its effectiveness is verified in simulation experiments.
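For orientation, the classical IBVS control law can be written as below. This is the textbook formulation for point features, given here as background. The abstract does not state which variant the paper uses, and all symbols are introduced for this sketch: $\mathbf{s}$ are the current image features, $\mathbf{s}^{*}$ the desired ones, $\lambda > 0$ a gain, $Z$ the point depth, and $(x, y)$ normalized image coordinates.

```latex
% Feature error and camera velocity command
\mathbf{e} = \mathbf{s} - \mathbf{s}^{*}, \qquad
\mathbf{v}_c = -\lambda \, \widehat{\mathbf{L}}_{\mathbf{e}}^{+} \, \mathbf{e}

% Interaction matrix of a single normalized image point (x, y) at depth Z
\mathbf{L}_{\mathbf{x}} =
\begin{bmatrix}
-\dfrac{1}{Z} & 0 & \dfrac{x}{Z} & xy & -(1 + x^{2}) & y \\[6pt]
0 & -\dfrac{1}{Z} & \dfrac{y}{Z} & 1 + y^{2} & -xy & -x
\end{bmatrix}
```

The depth $Z$ in the interaction matrix is usually the hard quantity to obtain, which suggests why a scene prior such as the one NeRF provides can stand in for a depth sensor.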