Arrow Research

Author name cluster

Oren Tropp

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

2 papers
2 author rows

Possible papers (2)

NeurIPS 2025 Conference Paper

FFN Fusion: Rethinking Sequential Computation in Large Language Models

  • Akhiad Bercovich
  • Mohammed Dabbah
  • Omri Puny
  • Ido Galil
  • Amnon Geifman
  • Yonatan Geifman
  • Izik Golan
  • Ehud Karpas

We introduce \textit{FFN Fusion}, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create a 253B model (253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71$\times$ speedup in inference latency and 35$\times$ lower per-token cost while maintaining strong performance across benchmarks. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
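
The core transformation the abstract describes is replacing a chain of residual FFN blocks, each reading the previous block's output, with a single parallel pass over a shared input. Below is a minimal, hypothetical PyTorch sketch of that idea; the module names and sizes are illustrative and are not the authors' implementation.

```python
# Hypothetical sketch: replace sequential residual FFN blocks, x <- x + FFN_i(x),
# with one parallel pass, x <- x + sum_i FFN_i(x). Names/sizes are illustrative.
import torch
import torch.nn as nn


class FFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down(self.act(self.up(x)))


def sequential_ffns(x: torch.Tensor, ffns: list[FFN]) -> torch.Tensor:
    # Baseline: each FFN reads the output of the previous one (residual stream).
    for ffn in ffns:
        x = x + ffn(x)
    return x


def fused_ffns(x: torch.Tensor, ffns: list[FFN]) -> torch.Tensor:
    # Fused approximation: every FFN reads the same input, so the branches can be
    # merged into one wider FFN (concatenated up/down projections) and run at once.
    return x + sum(ffn(x) for ffn in ffns)


x = torch.randn(2, 16, 512)
ffns = [FFN(512, 2048) for _ in range(3)]
print(sequential_ffns(x, ffns).shape, fused_ffns(x, ffns).shape)
```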

ICML 2025 Conference Paper

Puzzle: Distillation-Based NAS for Inference-Optimized LLMs

  • Akhiad Bercovich
  • Tomer Ronen
  • Talor Abramovich
  • Nir Ailon
  • Nave Assaf
  • Mohammad Dabbah
  • Ido Galil
  • Amnon Geifman

Large language models (LLMs) offer remarkable capabilities, yet their high inference costs restrict wider adoption. While increasing parameter counts improves accuracy, it also broadens the gap between state-of-the-art capabilities and practical deployability. We present Puzzle, a hardware-aware framework that accelerates the inference of LLMs while preserving their capabilities. Using neural architecture search (NAS) at large scale, Puzzle optimizes models with tens of billions of parameters. Our approach utilizes blockwise local knowledge distillation (BLD) for parallel architecture exploration and employs mixed-integer programming for precise constraint optimization. We showcase our framework’s impact via Llama-3.1-Nemotron-51B-Instruct (Nemotron-51B) and Llama-3.3-Nemotron-49B, two publicly available models derived from Llama-70B-Instruct. Both models achieve a 2.17x inference throughput speedup, fitting on a single NVIDIA H100 GPU while retaining 98.4% of the original model’s benchmark accuracies. These are the most accurate models supporting single H100 GPU inference with large batch sizes, despite training on 45B tokens at most, far fewer than the 15T used to train Llama-70B. Lastly, we show that lightweight alignment on these derived models allows them to surpass the parent model in specific capabilities. Our work establishes that powerful LLMs can be optimized for efficient deployment with only negligible loss in quality, underscoring that inference performance, not parameter count alone, should guide model selection.
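
As a rough illustration of the constrained selection step the abstract mentions (one architecture variant per transformer block, chosen under a hardware budget), here is a small, self-contained Python sketch. The paper frames this as a mixed-integer program over distillation-scored block variants; this hypothetical version solves an equivalent knapsack-style dynamic program, and all names and numbers below are made up.

```python
# Hypothetical sketch of per-block architecture selection under a hardware budget.
# Each transformer block has candidate variants scored by a quality proxy (e.g.,
# blockwise local distillation) with an integer latency/memory cost; we pick one
# variant per block to maximize total score within the budget.
from dataclasses import dataclass


@dataclass
class Variant:
    name: str      # e.g. "full-attention", "no-attention", "narrow-ffn"
    score: float   # quality proxy from blockwise distillation (illustrative)
    cost: int      # integer latency/memory units on the target GPU (illustrative)


def select_variants(blocks: list[list[Variant]], budget: int) -> list[Variant]:
    # dp[c] = (best total score using exactly cost c, variants chosen so far)
    dp = [(0.0, [])] + [(-float("inf"), [])] * budget
    for candidates in blocks:
        new_dp = [(-float("inf"), []) for _ in range(budget + 1)]
        for c in range(budget + 1):
            score_so_far, chosen = dp[c]
            if score_so_far == -float("inf"):
                continue  # cost level c is unreachable
            for v in candidates:
                nc = c + v.cost
                if nc <= budget and score_so_far + v.score > new_dp[nc][0]:
                    new_dp[nc] = (score_so_far + v.score, chosen + [v])
        dp = new_dp
    best_score, best_choice = max(dp, key=lambda t: t[0])
    return best_choice


blocks = [
    [Variant("full", 1.00, 4), Variant("no-attn", 0.97, 2), Variant("narrow-ffn", 0.93, 1)]
    for _ in range(4)
]
for v in select_variants(blocks, budget=10):
    print(v.name, v.score, v.cost)
```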