Author name cluster

Ravi Kumar

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

36 papers

2 author rows

ICML Conference 2025 Conference Paper

LAuReL: Learned Augmented Residual Layer

Gaurav Menghani
Ravi Kumar
Sanjiv Kumar

One of the core pillars of efficient deep learning methods is architectural improvements, such as residual/skip connections, which have led to significantly better model convergence and quality. Since their introduction, residual connections have become ubiquitous not only in convolutional neural networks but also in transformer-based architectures, the backbone of LLMs. In this paper, we introduce the Learned Augmented Residual Layer (LAuReL), a novel generalization of the canonical residual connection, designed to serve as an in-situ replacement while outperforming it in both model quality and footprint metrics. Our experiments show that LAuReL can enhance quality for both vision and language models while adding fewer parameters and incurring less latency and memory overhead than naively increasing parameter count. For example, on the ImageNet-1K task, LAuReL achieves the same model quality improvements as naively adding an extra layer while using $2.6\times$ fewer parameters. Similarly, when pre-training 1B and 4B parameter LLMs, LAuReL improves performance on a variety of challenging downstream evaluation tasks by 2.54% to 20.05%, while adding only 0.012% and 0.1% additional parameters, respectively.

NeurIPS Conference 2025 Conference Paper

Length Generalization via Auxiliary Tasks

Pranjal Awasthi
Anupam Gupta
Ravi Kumar

Length generalization, the ability of sequence models to generalize to sequences longer than those encountered during training, remains a key challenge for transformers, especially in tasks requiring algorithmic reasoning. Existing theoretical understanding of length generalization is limited, often providing only asymptotic results or focusing on specific problem classes or architectural variants, while empirical approaches frequently rely on ad hoc and often fragile techniques. In this work, we introduce a novel framework for analyzing and proving length generalization bounds under specified, verifiable assumptions. A key outcome of the theory is the identification of a natural set of auxiliary tasks, intricately related to the primary task structure, such that strong performance on these auxiliary tasks, alongside the primary task, provably guarantees length generalization within the framework. This motivates a multi-task training procedure that explicitly optimizes performance on both the primary and the identified auxiliary tasks. Empirical evaluations on a variety of synthetic benchmarks known to be challenging for length generalization, including sequence sorting and reversal, demonstrate that our proposed method yields significant improvements in generalization to substantially longer sequences.