Arrow Research search

Author name cluster

Jason Cong

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers (5)

AAAI Conference 2025 Conference Paper

Dynamic-Width Speculative Beam Decoding for LLM Inference

  • Zongyue Qin
  • Zifan He
  • Neha Prakriya
  • Jason Cong
  • Yizhou Sun

Large language models (LLMs) based on the transformer architecture have shown outstanding performance across numerous real-world tasks. However, the autoregressive nature of these models makes the inference process slow and costly. Speculative decoding has emerged as a promising solution, leveraging a smaller auxiliary model to draft future tokens, which are then validated simultaneously by the larger model, achieving a speed-up of 1-2x. Although speculative decoding matches the same distribution as multinomial sampling, multinomial sampling itself is prone to suboptimal outputs, whereas beam sampling is widely recognized for producing higher-quality results by maintaining multiple candidate sequences at each step. This paper explores the novel integration of speculative decoding with beam sampling. However, there are four key challenges: (1) how to generate multiple sequences from the larger model's distribution given draft sequences from the small model; (2) how to dynamically optimize the number of beams to balance efficiency and accuracy; (3) how to efficiently verify the multiple drafts in parallel; and (4) how to address the extra memory costs inherent in beam sampling. To address these challenges, we propose dynamic-width speculative beam decoding (DSBD). Specifically, we first introduce a novel draft-and-verification scheme that generates multiple sequences following the large model's distribution based on beam sampling trajectories from the small model. Then, we introduce an adaptive mechanism to dynamically tune the number of beams based on the context, optimizing efficiency and effectiveness. In addition, we extend tree-based parallel verification to handle multiple trees simultaneously, accelerating the verification process. Finally, we describe a simple modification to our algorithm that mitigates the memory overhead of beam sampling. Experimental results show that our approach achieves a 1.5-1.9x speed-up and 1.8-2.5x lower energy consumption compared to beam sampling, with no loss in downstream performance. Moreover, it can produce significantly higher-quality outputs than speculative decoding, while maintaining similar time, memory, and energy costs. In summary, our method offers a more efficient and effective inference process for LLMs.
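
To make the draft-then-verify-with-a-beam idea concrete, below is a minimal Python sketch of one decoding step: a small model drafts top tokens per beam, a speculative acceptance test against the large model filters them, and a crude heuristic adjusts the beam width between steps. The toy seeded distributions and the width heuristic are illustrative assumptions, not the paper's DSBD algorithm.

```python
import numpy as np

# Hedged toy sketch of draft-then-verify decoding with an adaptive beam width.
# The fake "models" (seeded random distributions) and the width rule are
# illustrative assumptions only.

VOCAB = 50
rng = np.random.default_rng(0)


def toy_distribution(model_seed, prefix):
    """Stand-in for a language model's next-token distribution."""
    r = np.random.default_rng((model_seed + hash(tuple(prefix))) % (2**32))
    logits = r.normal(size=VOCAB)
    z = np.exp(logits - logits.max())
    return z / z.sum()


def speculative_beam_step(beams, width):
    """Draft top tokens per beam with the small model (seed 1), keep only the
    ones the acceptance test against the large model (seed 2) lets through."""
    candidates = []
    for prefix, score in beams:
        q = toy_distribution(1, prefix)               # draft distribution
        p = toy_distribution(2, prefix)               # target distribution
        for tok in np.argsort(q)[-width:]:            # draft's top-`width` tokens
            if rng.random() < min(1.0, p[tok] / q[tok]):   # speculative accept test
                candidates.append((prefix + [int(tok)], score + np.log(p[tok])))
    candidates.sort(key=lambda c: c[1], reverse=True)
    return candidates[:width] if candidates else beams


def decode(steps=6, width=4):
    beams = [([0], 0.0)]
    for _ in range(steps):
        beams = speculative_beam_step(beams, width)
        # Crude width adaptation: widen when beam scores diverge, narrow otherwise.
        spread = float(np.std([s for _, s in beams])) if len(beams) > 1 else 0.0
        width = int(np.clip(round(2 + 2 * spread), 2, 8))
    return beams


if __name__ == "__main__":
    for seq, score in decode():
        print(seq, round(score, 3))
```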

AAAI Conference 2025 Conference Paper

Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis

  • Weikai Li
  • Ding Wang
  • Zijian Ding
  • Atefeh Sohrabizadeh
  • Zongyue Qin
  • Jason Cong
  • Yizhou Sun

High-level synthesis (HLS) is a widely used tool for designing Field Programmable Gate Arrays (FPGAs). HLS enables FPGA design with software programming languages by compiling the source code into an FPGA circuit. The source code includes a program (called a "kernel") and several pragmas that instruct hardware synthesis, such as parallelization and pipelining. While it is relatively easy for software developers to design the program, designing the pragmas requires substantial hardware knowledge, posing a big challenge for software developers. Recently, different machine learning algorithms, such as GNNs, have been proposed to automate the pragma design via performance prediction. However, when applying the trained model to new kernels, the significant domain shift often leads to unsatisfactory performance. We propose a more domain-generalizable model structure: a two-level hierarchical Mixture of Experts (MoE) that can be flexibly adapted to any GNN model. Different expert networks can learn to handle different regions of the representation space, and they can exploit patterns shared between old and new kernels. In the low-level MoE, we apply MoE at three natural granularities of a program: node, basic block, and graph. The high-level MoE learns to aggregate the three granularities for the final decision. To stably train the hierarchical MoE, we further propose a two-stage training method. Extensive experiments verify the effectiveness of the hierarchical MoE.
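
As a rough illustration of the two-level gating, here is a small numpy sketch: three low-level MoE blocks operate on node-, basic-block-, and graph-level embeddings, and a high-level gate mixes their outputs before a prediction head. The dimensions, gating form, and regression head are assumptions for illustration, not the paper's architecture.

```python
import numpy as np

# Minimal numpy sketch of a two-level mixture of experts over three program
# granularities (node, basic block, graph). All shapes and the gating form
# are illustrative assumptions.

rng = np.random.default_rng(0)
DIM, N_EXPERTS = 16, 4


def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)


class MoE:
    """Low-level MoE: a gate picks a soft combination of expert networks."""

    def __init__(self):
        self.gate = rng.normal(size=(DIM, N_EXPERTS))
        self.experts = rng.normal(size=(N_EXPERTS, DIM, DIM))

    def __call__(self, x):                       # x: (DIM,)
        weights = softmax(x @ self.gate)         # (N_EXPERTS,)
        outs = np.tanh(self.experts @ x)         # (N_EXPERTS, DIM)
        return weights @ outs                    # (DIM,)


class HierarchicalMoE:
    """High-level gate aggregates node-, block-, and graph-level MoE outputs."""

    def __init__(self):
        self.levels = {k: MoE() for k in ("node", "block", "graph")}
        self.high_gate = rng.normal(size=(3 * DIM, 3))
        self.head = rng.normal(size=DIM)         # final regression head

    def predict(self, node_emb, block_emb, graph_emb):
        outs = [self.levels["node"](node_emb),
                self.levels["block"](block_emb),
                self.levels["graph"](graph_emb)]
        alpha = softmax(np.concatenate(outs) @ self.high_gate)  # (3,) mixing weights
        fused = sum(a * o for a, o in zip(alpha, outs))
        return float(fused @ self.head)          # predicted design quality


if __name__ == "__main__":
    model = HierarchicalMoE()
    embeddings = [rng.normal(size=DIM) for _ in range(3)]
    print("predicted quality:", round(model.predict(*embeddings), 3))
```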

ICLR Conference 2025 Conference Paper

Optimized Multi-Token Joint Decoding With Auxiliary Model for LLM Inference

  • Zongyue Qin
  • Ziniu Hu
  • Zifan He
  • Neha Prakriya
  • Jason Cong
  • Yizhou Sun

Large language models (LLMs) have achieved remarkable success across diverse tasks, yet their inference processes are hindered by substantial time and energy demands due to single-token generation at each decoding step. While previous methods such as speculative decoding mitigate these inefficiencies by producing multiple tokens per step, each token is still generated from its single-token distribution, thereby enhancing speed without improving effectiveness. In contrast, our work simultaneously enhances inference speed and improves output effectiveness. We consider multi-token joint decoding (MTJD), which generates multiple tokens from their joint distribution at each iteration, theoretically reducing perplexity and enhancing task performance. However, MTJD suffers from the high cost of sampling from the joint distribution of multiple tokens. Inspired by speculative decoding, we introduce multi-token assisted decoding (MTAD), a novel framework designed to accelerate MTJD. MTAD leverages a smaller auxiliary model to approximate the joint distribution of a larger model, incorporating a verification mechanism that not only ensures the accuracy of this approximation but also improves decoding efficiency over conventional speculative decoding. Theoretically, we demonstrate that MTAD closely approximates exact MTJD with bounded error. Empirical evaluations using Llama-2 and OPT models ranging from 13B to 70B parameters across various tasks reveal that MTAD reduces perplexity by 21.2% and improves downstream performance compared to standard single-token sampling. Furthermore, MTAD achieves a 1.42× speed-up and consumes 1.54× less energy than conventional speculative decoding methods. These results highlight MTAD's ability to make multi-token joint decoding both effective and efficient, promoting more sustainable and high-performance deployment of LLMs.
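
The block-level accept/reject intuition can be sketched as follows: the auxiliary model drafts a k-token block, both models score the block's joint log-probability via the chain rule, and the block is accepted with probability tied to the likelihood ratio. The toy models and the single-token fallback rule below are assumptions; this is not the paper's MTAD verification mechanism.

```python
import numpy as np

# Hedged sketch of block-level draft and verification. The seeded toy models
# and the fallback rule are placeholders, not the MTAD algorithm itself.

VOCAB, K = 50, 4
rng = np.random.default_rng(0)


def conditional(model_seed, prefix):
    """Toy per-step conditional distribution p(x_t | prefix) for a fake model."""
    r = np.random.default_rng((model_seed + hash(tuple(prefix))) % (2**32))
    logits = r.normal(size=VOCAB)
    z = np.exp(logits - logits.max())
    return z / z.sum()


def joint_logprob(model_seed, prefix, block):
    """Chain rule: log p(block | prefix) = sum_i log p(block_i | prefix, block_<i)."""
    lp, ctx = 0.0, list(prefix)
    for tok in block:
        lp += np.log(conditional(model_seed, ctx)[tok])
        ctx.append(tok)
    return lp


def draft_block(prefix):
    """Greedy k-token draft from the small auxiliary model (seed 1)."""
    ctx, block = list(prefix), []
    for _ in range(K):
        tok = int(np.argmax(conditional(1, ctx)))
        block.append(tok)
        ctx.append(tok)
    return block


def mtad_like_step(prefix):
    block = draft_block(prefix)
    lp_target = joint_logprob(2, prefix, block)     # large model (seed 2)
    lp_draft = joint_logprob(1, prefix, block)      # small model (seed 1)
    accept = rng.random() < min(1.0, np.exp(lp_target - lp_draft))
    return block if accept else block[:1]           # fall back to a single token


if __name__ == "__main__":
    prefix = [0]
    for _ in range(3):
        prefix += mtad_like_step(prefix)
    print(prefix)
```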

NeurIPS Conference 2023 Conference Paper

Towards a Comprehensive Benchmark for High-Level Synthesis Targeted to FPGAs

  • Yunsheng Bai
  • Atefeh Sohrabizadeh
  • Zongyue Qin
  • Ziniu Hu
  • Yizhou Sun
  • Jason Cong

High-level synthesis (HLS) aims to raise the abstraction layer in hardware design, enabling the design of domain-specific accelerators (DSAs) like field-programmable gate arrays (FPGAs) using C/C++ instead of hardware description languages (HDLs). Compiler directives in the form of pragmas play a crucial role in modifying the microarchitecture within the HLS framework. However, the space of possible microarchitectures grows exponentially with the number of pragmas. Moreover, the evaluation of each candidate design using the HLS tool consumes significant time, ranging from minutes to hours, leading to a time-consuming optimization process. To accelerate this process, machine learning models have been used to predict design quality in milliseconds. However, existing open-source datasets for training such models are limited in terms of design complexity and available optimizations. In this paper, we present HLSyn, the first benchmark that addresses these limitations. It contains more complex programs with a wider range of optimization pragmas, making it a comprehensive dataset for training and evaluating design quality prediction models. The HLSyn benchmark consists of 42 unique programs/kernels, resulting in over 42,000 labeled designs. We conduct an extensive comparison of state-of-the-art baselines to assess their effectiveness in predicting design quality. As an ongoing project, we anticipate expanding the HLSyn benchmark in terms of both quantity and variety of programs to further support the development of this field.
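
A tiny, purely illustrative Python sketch of why such predictors matter: even a handful of pragmas yields a multiplicative configuration space, and a millisecond surrogate can rank every candidate where per-design HLS runs would take minutes to hours. The pragma names, values, and scoring rule below are hypothetical and are not taken from HLSyn.

```python
import itertools
import time

# Illustrative sketch (not the HLSyn tooling): enumerate a hypothetical pragma
# space and rank it with a stand-in surrogate predictor.

PRAGMA_SPACE = {
    "loop_i.unroll": [1, 2, 4, 8],
    "loop_j.unroll": [1, 2, 4, 8],
    "loop_j.pipeline": ["off", "on"],
    "array_a.partition": [1, 2, 4],
}


def predicted_quality(config):
    """Stand-in for a trained predictor: a cheap, made-up scoring rule."""
    score = config["loop_i.unroll"] * config["loop_j.unroll"]
    score *= 1.5 if config["loop_j.pipeline"] == "on" else 1.0
    return score / (1 + abs(config["array_a.partition"] - config["loop_j.unroll"]))


def enumerate_configs(space):
    keys = list(space)
    for values in itertools.product(*(space[k] for k in keys)):
        yield dict(zip(keys, values))


if __name__ == "__main__":
    configs = list(enumerate_configs(PRAGMA_SPACE))
    sizes = " x ".join(str(len(v)) for v in PRAGMA_SPACE.values())
    print(f"{len(configs)} candidate designs (= {sizes})")
    start = time.perf_counter()
    best = max(configs, key=predicted_quality)
    print("best predicted config:", best)
    print(f"ranked all candidates in {time.perf_counter() - start:.4f}s")
```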

JBHI Journal 2014 Journal Article

System Light-Loading Technology for mHealth: Manifold-Learning-Based Medical Data Cleansing and Clinical Trials in WE-CARE Project

  • Anpeng Huang
  • Wenyao Xu
  • Zhinan Li
  • Linzhen Xie
  • Majid Sarrafzadeh
  • Xiaoming Li
  • Jason Cong

Cardiovascular disease (CVD) is a major public health issue, accounting for 41% of deaths in China each year. This huge loss motivated us to develop a Wearable Efficient teleCARdiology systEm (WE-CARE) for early warning and prevention of CVD risks in real time. WE-CARE is expected to work 24/7 online for mobile health (mHealth) applications. Unfortunately, this goal is often disrupted in system experiments and clinical trials, even when the related enabling technologies work properly. The root cause is overload from complex electrocardiogram (ECG) data at the system integration level. In this study, our main objective is to develop a system light-loading technology that enables mHealth with a benchmarked ECG anomaly recognition rate. To achieve this objective, we propose an approach to purify clinical features from raw ECG data based on manifold learning, called the Manifold-based ECG-feature Purification algorithm. Our clinical trials verify that our proposal can detect anomalies with a recognition rate of up to 94%, which is highly valuable in daily public health-risk alert applications based on clinical criteria. Most importantly, the experimental results demonstrate that the WE-CARE system enabled by our proposal enhances system reliability by at least two times, reduces the false negative rate to 0.76%, and extends battery life by 40.54% at the system integration level.
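
As a loose illustration of the manifold-learning idea (not the paper's Manifold-based ECG-feature Purification algorithm), the sketch below projects synthetic one-second ECG segments onto a low-dimensional manifold with scikit-learn's Isomap, so a downstream detector only has to handle compact "purified" features. The synthetic beat generator and the choice of Isomap are assumptions.

```python
import numpy as np
from sklearn.manifold import Isomap

# Hedged illustration: reduce high-dimensional ECG segments to a few
# manifold coordinates before anomaly detection. Synthetic data only.

rng = np.random.default_rng(0)
SEG = 250  # samples per 1-second segment at 250 Hz


def synthetic_beat(anomalous=False):
    """Generate a crude ECG-like segment; not clinical data."""
    t = np.linspace(0, 1, SEG)
    qrs = np.exp(-((t - 0.4) ** 2) / 0.0005)           # sharp QRS-like spike
    twave = 0.3 * np.exp(-((t - 0.65) ** 2) / 0.004)   # broader T-like wave
    beat = qrs + twave + 0.05 * rng.normal(size=SEG)
    if anomalous:
        beat += 0.4 * np.sin(2 * np.pi * 8 * t)         # crude arrhythmia-like wobble
    return beat


# Small dataset of normal and anomalous segments.
X = np.stack([synthetic_beat(anomalous=(i % 5 == 0)) for i in range(200)])

# Manifold projection: 250-dimensional segments -> 3 purified features.
features = Isomap(n_neighbors=10, n_components=3).fit_transform(X)
print("raw segment dim:", X.shape[1], "-> purified feature dim:", features.shape[1])
```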