Arrow Research

Author name cluster

Oren Tropp

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
2 author rows

Possible papers (2)

NeurIPS 2025 Conference Paper

FFN Fusion: Rethinking Sequential Computation in Large Language Models

  • Akhiad Bercovich
  • Mohammed Dabbah
  • Omri Puny
  • Ido Galil
  • Amnon Geifman
  • Yonatan Geifman
  • Izik Golan
  • Ehud Karpas

We introduce \textit{FFN Fusion}, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create a 253B model (253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71$\times$ speedup in inference latency and 35$\times$ lower per-token cost while maintaining strong performance across benchmarks. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
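
The core transformation the abstract describes is replacing a chain of residual FFN blocks, each reading the previous block's output, with a single parallel pass over a shared input. Below is a minimal, hypothetical PyTorch sketch of that idea; the module names and sizes are illustrative and are not the authors' implementation.

```python
# Hypothetical sketch: replace sequential residual FFN blocks, x <- x + FFN_i(x),
# with one parallel pass, x <- x + sum_i FFN_i(x). Names/sizes are illustrative.
import torch
import torch.nn as nn


class FFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))


def sequential_ffns(x: torch.Tensor, ffns: list[FFN]) -> torch.Tensor:
    # Baseline: each FFN reads the output of the previous one (residual stream).
    for ffn in ffns:
        x = x + ffn(x)
    return x


def fused_ffns(x: torch.Tensor, ffns: list[FFN]) -> torch.Tensor:
    # Fused approximation: every FFN reads the same input, so the branches can be
    # merged into one wider FFN (concatenated up/down projections) and run at once.
    return x + sum(ffn(x) for ffn in ffns)


x = torch.randn(2, 16, 512)
ffns = [FFN(512, 2048) for _ in range(3)]
print(sequential_ffns(x, ffns).shape, fused_ffns(x, ffns).shape)
```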

ICML 2025 Conference Paper

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

  • Akhiad Bercovich
  • Tomer Ronen
  • Talor Abramovich
  • Nir Ailon
  • Nave Assaf
  • Mohammad Dabbah
  • Ido Galil
  • Amnon Geifman

Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. While increasing parameter counts improves accuracy, it also broadens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. Using neural architecture search (NAS) at large scale, Puzzle optimizes models with tens of billions of parameters. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We showcase our framework’s impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B) and Llama-3.3-Nemotron-49B, two publicly available models derived from Llama-70B-Instruct. Both models achieve a 2.17x inference throughput speedup, fitting on a single NVIDIA H100 GPU while retaining 98.4% of the original model’s benchmark accuracies. These are the most accurate models supporting single H100 GPU inference with large batch sizes, despite training on 45B tokens at most, far fewer than the 15T used to train Llama-70B. Lastly, we show that lightweight alignment on these derived models allows them to surpass the parent model in specific capabilities. Our work establishes that powerful LLMs can be optimized for efficient deployment with only negligible loss in quality, underscoring that inference performance, not parameter count alone, should guide model selection.
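
As a rough illustration of the constrained selection step the abstract mentions (one architecture variant per transformer block, chosen under a hardware budget), here is a small, self-contained Python sketch. The paper frames this as a mixed-integer program over distillation-scored block variants; this hypothetical version solves an equivalent knapsack-style dynamic program, and all names and numbers below are made up.

```python
# Hypothetical sketch of per-block architecture selection under a hardware budget.
# Each transformer block has candidate variants scored by a quality proxy (e.g.,
# blockwise local distillation) with an integer latency/memory cost; we pick one
# variant per block to maximize total score within the budget.
from dataclasses import dataclass


@dataclass
class Variant:
    name: str      # e.g. "full-attention", "no-attention", "narrow-ffn"
    score: float   # quality proxy from blockwise distillation (illustrative)
    cost: int      # integer latency/memory units on the target GPU (illustrative)


def select_variants(blocks: list[list[Variant]], budget: int) -> list[Variant]:
    # dp[c] = (best total score using exactly cost c, variants chosen so far)
    dp = [(0.0, [])] + [(-float("inf"), [])] * budget
    for candidates in blocks:
        new_dp = [(-float("inf"), []) for _ in range(budget + 1)]
        for c in range(budget + 1):
            score_so_far, chosen = dp[c]
            if score_so_far == -float("inf"):
                continue  # cost level c is unreachable
            for v in candidates:
                nc = c + v.cost
                if nc <= budget and score_so_far + v.score > new_dp[nc][0]:
                    new_dp[nc] = (score_so_far + v.score, chosen + [v])
        dp = new_dp
    best_score, best_choice = max(dp, key=lambda t: t[0])
    return best_choice


blocks = [
    [Variant("full", 1.00, 4), Variant("no-attn", 0.97, 2), Variant("narrow-ffn", 0.93, 1)]
    for _ in range(4)
]
for v in select_variants(blocks, budget=10):
    print(v.name, v.score, v.cost)
```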