Arrow Research search

Author name cluster

Sehoon Kim

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

7 papers
2 author rows

Possible papers (7)

NeurIPS Conference 2025 Conference Paper

Multipole Attention for Efficient Long Context Reasoning

  • Coleman Hooper
  • Sebastian Zhao
  • Luca Manolache
  • Sehoon Kim
  • Michael Mahoney
  • Sophia Shao
  • Kurt Keutzer
  • Amir Gholami

Large Reasoning Models (LRMs) have shown promising accuracy improvements on complex problem-solving tasks. While these models have attained high accuracy by leveraging additional computation at test time, they need to generate long chain-of-thought reasoning in order to think before answering, which requires generating thousands of tokens. While sparse attention methods can help reduce the KV cache pressure induced by this long autoregressive reasoning, these methods can introduce errors which disrupt the reasoning process. Our work addresses these challenges by introducing Multipole Attention, which accelerates autoregressive reasoning by only computing exact attention for the most important tokens, while maintaining approximate representations for the remaining tokens. Our method first performs clustering to group together semantically similar key vectors, and then uses the cluster centroids both to identify important key vectors and to approximate the remaining key vectors in order to retain high accuracy. Additionally, in order to accelerate long generation tasks, we design a fast cluster update process to quickly re-cluster the input and previously generated tokens, thereby allowing for accelerating attention to the previous output tokens. We evaluate our method using emerging LRMs such as Qwen-8B and Deepseek-R1-Distil-Qwen2.5-14B, demonstrating that our approach can maintain accuracy on complex reasoning tasks even with aggressive attention sparsity settings. We also provide kernel implementations to demonstrate the practical efficiency gains from our method, achieving up to 4.5$\times$ speedup for attention in long-context reasoning applications.
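
To make the centroid idea concrete, here is a minimal NumPy sketch of centroid-based sparse attention in the spirit described above: keys are clustered, the clusters scoring highest against the query are attended exactly, and every other cluster is summarized by its centroid. The cluster count, top-cluster setting, and plain k-means routine are illustrative choices, not the paper's implementation or kernels.

```python
import numpy as np

def kmeans(x, n_clusters, n_iters=10, seed=0):
    """Plain k-means over key vectors; returns centroids and assignments."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), n_clusters, replace=False)]
    for _ in range(n_iters):
        # Assign each key to its nearest centroid, then recompute centroids.
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=-1)
        assign = dists.argmin(axis=1)
        for c in range(n_clusters):
            members = x[assign == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    return centroids, assign

def centroid_sparse_attention(q, K, V, n_clusters=8, top_clusters=2):
    """Exact attention for keys in the clusters scoring highest against q;
    each remaining cluster is summarized by one centroid logit (weighted by
    its size) and the mean of its value vectors."""
    centroids, assign = kmeans(K, n_clusters)
    cluster_scores = centroids @ q
    keep = np.argsort(cluster_scores)[-top_clusters:]
    exact = np.isin(assign, keep)

    logits = [K[exact] @ q]
    values = [V[exact]]
    for c in range(n_clusters):
        if c in keep:
            continue
        members = assign == c
        if members.sum() == 0:
            continue
        # One centroid logit stands in for all member keys.
        logits.append(np.array([centroids[c] @ q + np.log(members.sum())]))
        values.append(V[members].mean(axis=0, keepdims=True))
    logits = np.concatenate(logits)
    values = np.concatenate(values)
    w = np.exp(logits - logits.max())
    w /= w.sum()
    return w @ values

d = 64
q, K, V = np.random.randn(d), np.random.randn(512, d), np.random.randn(512, d)
print(centroid_sparse_attention(q, K, V).shape)  # (64,)
```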

NeurIPS Conference 2024 Conference Paper

KVQuant: Towards 10 Million Context Length LLM Inference with KV Cache Quantization

  • Coleman Hooper
  • Sehoon Kim
  • Hiva Mohammadzadeh
  • Michael W. Mahoney
  • Yakun S. Shao
  • Kurt Keutzer
  • Amir Gholami

LLMs are seeing growing use for applications which require large context windows, and with these large context windows KV cache activations surface as the dominant contributor to memory consumption during inference. Quantization is a promising approach for compressing KV cache activations; however, existing solutions fail to represent activations accurately in sub-4-bit precision. Our work, KVQuant, facilitates low precision KV cache quantization by incorporating several novel methods: (i) Per-Channel Key Quantization, where we adjust the dimension along which we quantize the Key activations to better match the distribution; (ii) Pre-RoPE Key Quantization, where we quantize Key activations before the rotary positional embedding to mitigate its impact on quantization; (iii) Non-Uniform KV Cache Quantization, where we derive per-layer sensitivity-weighted non-uniform datatypes that better represent the distributions; and (iv) Per-Vector Dense-and-Sparse Quantization, where we isolate outliers separately for each vector to minimize skews in quantization ranges. By applying our method to the LLaMA, Llama-2, Llama-3, and Mistral models, we achieve < 0.1 perplexity degradation with 3-bit quantization on both Wikitext-2 and C4, outperforming existing approaches. Our method enables serving LLaMA-7B with a context length of up to 1 million on a single A100-80GB GPU and up to 10 million on an 8-GPU system. We develop custom CUDA kernels for KVQuant, showing that we can achieve up to ~1.7x speedups, compared to baseline fp16 matrix-vector multiplications, for the LLaMA-7B model.
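
As a rough illustration of two of the ideas above, the sketch below applies per-channel quantization to a key matrix after pulling the largest-magnitude entries per channel into a sparse full-precision set (the dense-and-sparse idea). It uses a uniform 3-bit codebook rather than the paper's sensitivity-weighted non-uniform datatypes, and the outlier fraction and tensor shapes are arbitrary; this is not the KVQuant implementation.

```python
import numpy as np

def quantize_per_channel(K, n_bits=3, outlier_frac=0.01):
    """K: (seq_len, head_dim) key activations. Quantize each channel (column)
    with a uniform n_bits codebook after removing the largest-magnitude
    entries per channel into a sparse full-precision dictionary."""
    K = K.copy()
    n_out = max(1, int(outlier_frac * K.shape[0]))
    outlier_rows = np.argsort(np.abs(K), axis=0)[-n_out:, :]
    sparse = {}
    for ch in range(K.shape[1]):
        for row in outlier_rows[:, ch]:
            sparse[(row, ch)] = K[row, ch]
            K[row, ch] = 0.0  # keep outliers out of the dense quantization range
    lo, hi = K.min(axis=0), K.max(axis=0)
    scale = (hi - lo) / (2 ** n_bits - 1)
    scale[scale == 0] = 1.0
    codes = np.round((K - lo) / scale).astype(np.int32)
    return codes, scale, lo, sparse

def dequantize(codes, scale, lo, sparse):
    K_hat = codes * scale + lo
    for (row, ch), v in sparse.items():
        K_hat[row, ch] = v  # restore full-precision outliers
    return K_hat

# A few artificially heavy channels illustrate why per-channel ranges matter.
K = np.random.randn(1024, 128) * np.array([1.0] * 120 + [8.0] * 8)
codes, scale, lo, sparse = quantize_per_channel(K)
err = np.abs(K - dequantize(codes, scale, lo, sparse)).mean()
print(f"mean abs reconstruction error: {err:.4f}")
```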

NeurIPS Conference 2023 Conference Paper

Speculative Decoding with Big Little Decoder

  • Sehoon Kim
  • Karttikeya Mangalam
  • Suhong Moon
  • Jitendra Malik
  • Michael W. Mahoney
  • Amir Gholami
  • Kurt Keutzer

The recent emergence of Large Language Models based on the Transformer architecture has enabled dramatic advancements in the field of Natural Language Processing. However, these models have long inference latency, which limits their deployment and makes them prohibitively expensive for various real-time applications. The inference latency is further exacerbated by autoregressive generative tasks, as models need to run iteratively to generate tokens sequentially without leveraging token-level parallelization. To address this, we propose Big Little Decoder (BiLD), a framework that can improve inference efficiency and latency for a wide range of text generation applications. The BiLD framework contains two models with different sizes that collaboratively generate text. The small model runs autoregressively to generate text with a low inference cost, and the large model is only invoked occasionally to refine the small model's inaccurate predictions in a non-autoregressive manner. To coordinate the small and large models, BiLD introduces two simple yet effective policies: (1) the fallback policy that determines when to hand control over to the large model; and (2) the rollback policy that determines when the large model needs to correct the small model's inaccurate predictions. To evaluate our framework across different tasks and models, we apply BiLD to various text generation scenarios encompassing machine translation on IWSLT 2017 De-En and WMT 2014 De-En, and summarization on XSUM and CNN/DailyMail. On an NVIDIA T4 GPU, our framework achieves a speedup of up to 2.12x with minimal generation quality degradation. Furthermore, our framework is fully plug-and-play and can be applied without any modifications in the training process or model architecture. Our code is open-sourced.
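
The control flow of the fallback/rollback policies can be sketched as below; the models are stand-in callables, the thresholds are arbitrary, and the real framework's span-level check is simplified here to a single-token check, so this only illustrates the policy structure, not the paper's system.

```python
import numpy as np

def bild_like_decode(small_step, large_score, large_fix, prompt,
                     max_new_tokens=32, fallback_thresh=0.4, rollback_thresh=2.0):
    """small_step(tokens) -> (next_token, prob of that token), small model.
    large_score(tokens, token) -> large-model negative log-prob of `token`.
    large_fix(tokens) -> replacement token from the large model."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        tok, p = small_step(tokens)
        if p < fallback_thresh:
            # Fallback: the small model is unsure, so hand the draft to the large model.
            nll = large_score(tokens, tok)
            if nll > rollback_thresh:
                # Rollback: discard the small model's token and take the
                # large model's correction instead.
                tok = large_fix(tokens)
        tokens.append(tok)
    return tokens

# Stub "models" so the control flow runs end to end; replace with real LMs.
rng = np.random.default_rng(0)
small_step = lambda tokens: (int(rng.integers(0, 100)), float(rng.random()))
large_score = lambda tokens, tok: float(-np.log(rng.random()))
large_fix = lambda tokens: int(rng.integers(0, 100))

print(bild_like_decode(small_step, large_score, large_fix, prompt=[1, 2, 3])[:10])
```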

NeurIPS Conference 2022 Conference Paper

A Fast Post-Training Pruning Framework for Transformers

  • Woosuk Kwon
  • Sehoon Kim
  • Michael W. Mahoney
  • Joseph Hassoun
  • Kurt Keutzer
  • Amir Gholami

Pruning is an effective way to reduce the huge inference cost of Transformer models. However, prior work on pruning Transformers requires retraining the models. This can add high training cost and high complexity to model deployment, making it difficult to use in many practical situations. To address this, we propose a fast post-training pruning framework for Transformers that does not require any retraining. Given a resource constraint and a sample dataset, our framework automatically prunes the Transformer model using structured sparsity methods. To retain high accuracy without retraining, we introduce three novel techniques: (i) a lightweight mask search algorithm that finds which heads and filters to prune based on the Fisher information; (ii) mask rearrangement that complements the search algorithm; and (iii) mask tuning that reconstructs the output activations for each layer. We apply our method to BERT-base and DistilBERT, and we evaluate its effectiveness on GLUE and SQuAD benchmarks. Our framework achieves up to 2.0x reduction in FLOPs and 1.56x speedup in inference latency, while maintaining < 1% loss in accuracy. Importantly, our framework prunes Transformers in less than 3 minutes on a single GPU, which is over two orders of magnitude faster than existing pruning approaches that retrain the models.
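
A hedged sketch of the Fisher-information-style importance score behind the mask search is shown below, using a tiny stand-in layer, a small calibration set, and an arbitrary head budget; the paper's actual mask rearrangement and mask tuning steps are omitted.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_heads, head_dim, d_model = 8, 16, 128
proj = nn.Linear(d_model, d_model)   # stand-in for an attention output projection
x = torch.randn(32, d_model)         # small calibration sample
y = torch.randn(32, d_model)

# Empirical-Fisher-style score: accumulate squared gradients of the loss with
# respect to a per-head gate over the calibration examples.
fisher = torch.zeros(n_heads)
for i in range(x.shape[0]):
    gates = torch.ones(n_heads, requires_grad=True)
    h = proj(x[i:i + 1]).view(1, n_heads, head_dim) * gates[None, :, None]
    loss = ((h.reshape(1, d_model) - y[i:i + 1]) ** 2).mean()
    loss.backward()
    fisher += gates.grad.detach() ** 2

budget = 4                                  # e.g. keep half of the heads
keep = torch.topk(fisher, budget).indices   # highest-importance heads survive
mask = torch.zeros(n_heads)
mask[keep] = 1.0
print("per-head importance:", [round(v, 4) for v in fisher.tolist()])
print("pruning mask:", mask.tolist())
```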

NeurIPS Conference 2022 Conference Paper

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

  • Sehoon Kim
  • Amir Gholami
  • Albert Shaw
  • Nicholas Lee
  • Karttikeya Mangalam
  • Jitendra Malik
  • Michael W. Mahoney
  • Kurt Keutzer

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture's design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed by a feed-forward module, instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to efficiently sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.
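
The simplified block ordering and the depthwise down-sampling layer can be sketched structurally as follows; the dimensions, kernel sizes, and single-block scope are illustrative simplifications rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class DepthwiseDownsample(nn.Module):
    """Stride-2 depthwise convolution that halves the temporal length."""
    def __init__(self, d_model, kernel=3):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel, stride=2,
                              padding=kernel // 2, groups=d_model)

    def forward(self, x):                       # x: (batch, time, d_model)
        return self.conv(x.transpose(1, 2)).transpose(1, 2)

class SimplifiedBlock(nn.Module):
    """Attention (or, equally, a conv module) followed by one feed-forward
    module, rather than the Macaron FFN-attn-conv-FFN layout."""
    def __init__(self, d_model=144, n_heads=4, ff_mult=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, ff_mult * d_model),
                                nn.SiLU(),
                                nn.Linear(ff_mult * d_model, d_model))

    def forward(self, x):
        xn = self.ln1(x)
        attn_out, _ = self.attn(xn, xn, xn)
        x = x + attn_out
        return x + self.ff(self.ln2(x))

x = torch.randn(2, 100, 144)                    # (batch, frames, features)
down, block = DepthwiseDownsample(144), SimplifiedBlock()
print(block(down(x)).shape)                     # torch.Size([2, 50, 144])
```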

NeurIPS Conference 2021 Conference Paper

Terra: Imperative-Symbolic Co-Execution of Imperative Deep Learning Programs

  • Taebum Kim
  • Eunji Jeong
  • Geon-Woo Kim
  • Yunmo Koo
  • Sehoon Kim
  • Gyeongin Yu
  • Byung-Gon Chun

Imperative programming allows users to implement their deep neural networks (DNNs) easily and has become an essential part of recent deep learning (DL) frameworks. Recently, several systems have been proposed to combine the usability of imperative programming with the optimized performance of symbolic graph execution. Such systems convert imperative Python DL programs to optimized symbolic graphs and execute them. However, they cannot fully support the usability of imperative programming. For example, if an imperative DL program contains a Python feature with no corresponding symbolic representation (e.g., third-party library calls or unsupported dynamic control flows), they fail to execute the program. To overcome this limitation, we propose Terra, an imperative-symbolic co-execution system that can handle any imperative DL programs while achieving the optimized performance of symbolic graph execution. To achieve this, Terra builds a symbolic graph by decoupling DL operations from Python features. Then, Terra conducts the imperative execution to support all Python features, while delegating the decoupled operations to the symbolic execution. We evaluated Terra's performance improvement and coverage with ten imperative DL programs for several DNN architectures. The results show that Terra can speed up the execution of all ten imperative DL programs, whereas AutoGraph, one of the state-of-the-art systems, fails to execute five of them.
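
The decoupling idea can be illustrated with a toy lazy-tensor sketch: tensor operations are recorded into a deferred graph and executed in a batch, while ordinary Python (here, a branch on a concrete value) runs imperatively and forces materialization at the boundary. This is a conceptual toy, not Terra's design or API.

```python
import numpy as np

class LazyTensor:
    """Records ops instead of running them; materialize() executes the graph."""
    def __init__(self, value=None, op=None, inputs=()):
        self.value, self.op, self.inputs = value, op, inputs

    def __add__(self, other):
        return LazyTensor(op=np.add, inputs=(self, other))

    def __matmul__(self, other):
        return LazyTensor(op=np.matmul, inputs=(self, other))

    def materialize(self):
        # The "symbolic" side: run the recorded sub-graph on demand.
        if self.value is None:
            self.value = self.op(*[t.materialize() for t in self.inputs])
        return self.value

W = LazyTensor(np.random.randn(4, 4))
x = LazyTensor(np.random.randn(4))
h = W @ x + x                      # recorded, not yet executed

# The "imperative" side: an arbitrary Python feature (a branch on concrete
# data) forces the recorded ops to run before the branch can be evaluated.
if h.materialize().sum() > 0:
    print("positive activation sum")
else:
    print("non-positive activation sum")
```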

IROS Conference 2002 Conference Paper

KAIST interactive bicycle racing simulator: the 2nd version with advanced features

  • Dong-Soo Kwon
  • Gi-Hun Yang
  • Youngjin Park
  • Sunmin Kim
  • Chong-Won Lee
  • Jae-Cheol Shin
  • Soonhung Han
  • Jonghwan Lee

This paper presents the KAIST interactive bicycle racing simulator system, which consists of a pair of bicycle simulators. The rider on the racing simulator experiences realistic sensations of motion, while being able to see the other bicycle simulator and having the audio-visual experience of riding in a velodrome or on the KAIST campus. The 2nd version of the bicycle racing simulator consists of a bicycle, a 4-DOF platform, a handlebar, and a pedal resistance system to generate the sensation of motion; a real-time visual simulator with an HMD and beam projection system; and a 3D sound system. The system has an integrated control network with an AOIM (Area Of Interest Management) based network structure for multiple simulators.