Arrow Research search

Author name cluster

Nicholas Lee

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

4 papers
2 author rows

Possible papers

ICML Conference 2025 Conference Paper

Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

  • Lutfi Eren Erdogan
  • Nicholas Lee
  • Sehoon Kim 0001
  • Suhong Moon
  • Hiroki Furuta
  • Gopala Anumanchipalli
  • Kurt Keutzer
  • Amir Gholami

Large language models (LLMs) have shown remarkable advancements in enabling language agents to tackle simple tasks. However, applying them to complex, multi-step, long-horizon tasks remains a challenge. Recent work has found success by separating high-level planning from low-level execution, which enables the model to effectively balance high-level planning objectives and low-level execution details. However, generating accurate plans remains difficult since LLMs are not inherently trained for this task. To address this, we propose Plan-and-Act, a novel framework that incorporates explicit planning into LLM-based agents and introduces a scalable method to enhance plan generation through a novel synthetic data generation method. Plan-and-Act consists of a Planner model which generates structured, high-level plans to achieve user goals, and an Executor model that translates these plans into environment-specific actions. To train the Planner effectively, we introduce a synthetic data generation method that annotates ground-truth trajectories with feasible plans, augmented with diverse and extensive examples to enhance generalization. We evaluate Plan-and-Act using web navigation as a representative long-horizon planning environment, demonstrating a state-of-the-art 57.58% success rate on the WebArena-Lite benchmark as well as a text-only state-of-the-art 81.36% success rate on WebVoyager.

ICLR Conference 2025 Conference Paper

Sylber: Syllabic Embedding Representation of Speech from Raw Audio

  • Cheol Jun Cho
  • Nicholas Lee
  • Akshat Gupta
  • Dhruv Agarwal 0005
  • Ethan Chen
  • Alan W. Black
  • Gopala Anumanchipalli

Syllables are compositional units of spoken language that efficiently structure human speech perception and production. However, current neural speech representations lack such structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised learning (SSL) framework that bootstraps syllabic embeddings by distilling from its own initial unsupervised syllabic segmentation. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) novel phonological units suited for efficient spoken language modeling. Our proposed segmentation method is highly robust and generalizes to out-of-domain data and unseen languages without any tuning. By training token-to-speech generative models, fully intelligible speech can be reconstructed from Sylber tokens with a significantly lower bitrate than baseline SSL tokens. This suggests that our model effectively compresses speech into a compact sequence of tokens with minimal information loss. Lastly, we demonstrate that categorical perception—a linguistic phenomenon in speech perception—emerges naturally in Sylber, making the embedding space more categorical and sparse than previous speech features and thus supporting the high efficiency of our tokenization. Together, we present a novel SSL approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.

ICML Conference 2024 Conference Paper

An LLM Compiler for Parallel Function Calling

  • Sehoon Kim 0001
  • Suhong Moon
  • Ryan Tabrizi
  • Nicholas Lee
  • Michael W. Mahoney
  • Kurt Keutzer
  • Amir Gholami

The reasoning capabilities of the recent LLMs enable them to execute external function calls to overcome their inherent limitations, such as knowledge cutoffs, poor arithmetic skills, or lack of access to private data. This development has allowed LLMs to select and coordinate multiple functions based on the context to tackle more complex problems. However, current methods for function calling often require sequential reasoning and acting for each function which can result in high latency, cost, and sometimes inaccurate behavior. To address this, we introduce LLMCompiler, which executes functions in parallel to efficiently orchestrate multiple function calls. Drawing inspiration from the principles of classical compilers, LLMCompiler enables parallel function calling with three components: (i) a Function Calling Planner, formulating execution plans for function calling; (ii) a Task Fetching Unit, dispatching function calling tasks; and (iii) an Executor, executing these tasks in parallel. LLMCompiler automatically generates an optimized orchestration for the function calls and can be used with both open-source and closed-source models. We have benchmarked LLMCompiler on a range of tasks with different patterns of function calling. We observe consistent latency speedup of up to $3.7\times$, cost savings of up to $6.7\times$, and accuracy improvement of up to $\sim 9\%$ compared to ReAct. Our code is available at https://github.com/SqueezeAILab/LLMCompiler.
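
The general idea of planning function calls as a dependency graph and running independent calls in parallel can be illustrated with a minimal sketch. This is not the LLMCompiler code or its API: the plan format, the `run_plan` helper, and the toy `search` and `add` tools are all hypothetical, and the plan is hard-coded rather than produced by an LLM planner. The sketch also dispatches tasks in waves, which is a simplification and may differ from how the Task Fetching Unit described in the abstract schedules work.

```python
import asyncio

# Hypothetical stand-ins for external tools; the sleep simulates call latency.
async def search(query):
    await asyncio.sleep(0.1)
    return len(query)

async def add(a, b):
    await asyncio.sleep(0.1)
    return a + b

TOOLS = {"search": search, "add": add}

# Hypothetical plan format (an assumption, not taken from the paper):
# task id -> (tool name, literal args, ids of tasks whose results are appended as args)
PLAN = {
    1: ("search", ["alpha"], []),
    2: ("search", ["be"], []),
    3: ("add", [], [1, 2]),  # depends on the results of tasks 1 and 2
}

async def run_plan(plan, tools):
    """Run every task whose dependencies are satisfied, one parallel wave at a time."""
    results = {}
    pending = dict(plan)

    async def call(task_id):
        name, args, deps = pending[task_id]
        dep_values = [results[d] for d in deps]
        return await tools[name](*args, *dep_values)

    while pending:
        # A task is ready once all of its dependencies have results.
        ready = [
            task_id
            for task_id, (_, _, deps) in pending.items()
            if all(d in results for d in deps)
        ]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        # Independent tasks in the same wave run concurrently.
        outputs = await asyncio.gather(*(call(task_id) for task_id in ready))
        for task_id, output in zip(ready, outputs):
            results[task_id] = output
            del pending[task_id]
    return results

if __name__ == "__main__":
    print(asyncio.run(run_plan(PLAN, TOOLS)))  # prints {1: 5, 2: 2, 3: 7}
```

In this toy plan the two `search` calls run concurrently, so the whole plan takes roughly two call latencies instead of three.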

NeurIPS Conference 2022 Conference Paper

Squeezeformer: An Efficient Transformer for Automatic Speech Recognition

  • Sehoon Kim
  • Amir Gholami
  • Albert Shaw
  • Nicholas Lee
  • Karttikeya Mangalam
  • Jitendra Malik
  • Michael W. Mahoney
  • Kurt Keutzer

The recently proposed Conformer model has become the de facto backbone model for various downstream speech tasks based on its hybrid attention-convolution architecture that captures both local and global features. However, through a series of systematic studies, we find that the Conformer architecture’s design choices are not optimal. After re-examining the design choices for both the macro and micro-architecture of Conformer, we propose Squeezeformer which consistently outperforms the state-of-the-art ASR models under the same training schemes. In particular, for the macro-architecture, Squeezeformer incorporates (i) the Temporal U-Net structure which reduces the cost of the multi-head attention modules on long sequences, and (ii) a simpler block structure of multi-head attention or convolution modules followed by a feed-forward module instead of the Macaron structure proposed in Conformer. Furthermore, for the micro-architecture, Squeezeformer (i) simplifies the activations in the convolutional block, (ii) removes redundant Layer Normalization operations, and (iii) incorporates an efficient depthwise down-sampling layer to sub-sample the input signal. Squeezeformer achieves state-of-the-art results of 7.5%, 6.5%, and 6.0% word-error-rate (WER) on LibriSpeech test-other without external language models, which are 3.1%, 1.4%, and 0.6% better than Conformer-CTC with the same number of FLOPs. Our code is open-sourced and available online.