Arrow Research search

Author name cluster

Alexander Long

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers (6)

NeurIPS Conference 2025 · Conference Paper

Mixtures of Subspaces for Bandwidth Efficient Context Parallel Training

  • Sameera Ramasinghe
  • Thalaiyasingam Ajanthan
  • Hadi Mohaghegh Dolatabadi
  • Gil Avraham
  • Violetta Shevchenko
  • Yan Zuo
  • Chamin P Hewa Koneputugodage
  • Alexander Long

Pretraining language models with extended context windows enhances their ability to leverage rich information during generation. Existing methods split input sequences into chunks, broadcast them across multiple devices, and compute attention block by block, which incurs significant communication overhead. While feasible in high-speed clusters, these methods are impractical for decentralized training over low-bandwidth connections. We propose a compression method for communication-efficient context parallelism in decentralized settings, achieving a remarkable compression rate of over 95% with negligible overhead and no loss in convergence. Our key insight is to exploit the intrinsic low-rank structure of activation outputs by dynamically constraining them to learned mixtures of subspaces via efficient reparameterizations. We demonstrate scaling billion-parameter decentralized models to context lengths exceeding 100K tokens on networks as slow as 300Mbps, matching the wall-clock convergence speed of centralized models on 100Gbps interconnects.
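
As a rough illustration of the mechanism the abstract describes, the numpy sketch below routes each activation vector to one of a few low-rank subspaces and transmits only the subspace coefficients plus a routing index. The dimensions, the nearest-subspace routing rule, and the random (rather than learned) bases are assumptions for illustration, not the paper's implementation; in the actual method the activations are constrained to the subspaces, so nothing is lost.

```python
import numpy as np

rng = np.random.default_rng(0)

d, r, K, n_tokens = 1024, 32, 4, 512   # hidden dim, subspace rank, #subspaces, tokens (assumed)
# Random orthonormal bases stand in for the learned mixture of subspaces.
bases = [np.linalg.qr(rng.standard_normal((d, r)))[0] for _ in range(K)]

def compress(acts):
    """Route each token to its best-fitting subspace; keep only r coefficients per token."""
    ids, coeffs = [], []
    for x in acts:                                    # illustrative nearest-subspace routing
        proj = [B.T @ x for B in bases]
        errs = [np.linalg.norm(x - B @ c) for B, c in zip(bases, proj)]
        k = int(np.argmin(errs))
        ids.append(k)
        coeffs.append(proj[k])
    return np.array(ids), np.stack(coeffs)            # only this crosses the slow link

def decompress(ids, coeffs):
    return np.stack([bases[k] @ c for k, c in zip(ids, coeffs)])

acts = rng.standard_normal((n_tokens, d))
ids, coeffs = compress(acts)
recon = decompress(ids, coeffs)     # lossy here; the method constrains activations to the subspaces
sent = coeffs.size + ids.size
print(f"fraction of values not sent: {1 - sent / acts.size:.1%}")   # ~97% with these sizes
```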

ICML Conference 2025 · Conference Paper

Nesterov Method for Asynchronous Pipeline Parallel Optimization

  • Thalaiyasingam Ajanthan
  • Sameera Ramasinghe
  • Yan Zuo
  • Gil Avraham
  • Alexander Long

Pipeline Parallelism (PP) enables large neural network training on small, interconnected devices by splitting the model into multiple stages. To maximize pipeline utilization, asynchronous optimization is appealing as it offers 100% pipeline utilization by construction. However, it is inherently challenging as the weights and gradients are no longer synchronized, leading to stale (or delayed) gradients. To alleviate this, we introduce a variant of Nesterov Accelerated Gradient (NAG) for asynchronous optimization in PP. Specifically, we modify the look-ahead step in NAG to effectively address the staleness in gradients. We theoretically prove that our approach converges at a sublinear rate in the presence of fixed delay in gradients. Our experiments on large-scale language modelling tasks using decoder-only architectures with up to 1B parameters demonstrate that our approach significantly outperforms existing asynchronous methods, even surpassing the synchronous baseline.
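
For intuition on the setting, here is a toy numpy sketch of Nesterov-style momentum applied when gradients arrive with a fixed delay, as in asynchronous pipeline parallelism. The particular look-ahead scaling by the delay, the quadratic objective, and the hyperparameters are illustrative guesses, not the update rule proposed in the paper.

```python
import numpy as np

def grad(w):                        # toy objective: f(w) = 0.5 * ||w||^2
    return w

D      = 2                          # fixed gradient delay (e.g. number of in-flight micro-batches), assumed
mu, lr = 0.5, 0.05                  # illustrative momentum and step size
w      = np.ones(8)
v      = np.zeros_like(w)
queue  = [np.zeros_like(w)] * D     # gradients still "in flight" through the pipeline

for step in range(200):
    # Look-ahead evaluated further ahead than vanilla NAG, to compensate for the delay
    # (an illustrative modification of the usual single-step look-ahead).
    lookahead = w + (1 + D) * mu * v
    queue.append(grad(lookahead))
    g = queue.pop(0)                # the gradient applied now was computed D steps ago
    v = mu * v - lr * g
    w = w + v

print("distance to optimum after 200 steps:", np.linalg.norm(w))
```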

NeurIPS Conference 2025 · Conference Paper

Subspace Networks: Scaling Decentralized Training with Communication-Efficient Model Parallelism

  • Sameera Ramasinghe
  • Thalaiyasingam Ajanthan
  • Gil Avraham
  • Yan Zuo
  • Alexander Long

Scaling models has led to significant advancements in deep learning, but training these models in decentralized settings remains challenging due to communication bottlenecks. While existing compression techniques are effective in data-parallel training, they do not extend to model parallelism. Unlike data-parallel training, where weight gradients are exchanged, model-parallel training requires compressing activations and activation gradients as they propagate through layers, accumulating compression errors. We propose a novel compression algorithm that compresses both forward and backward passes, enabling up to 99% compression with no convergence degradation and negligible memory/compute overhead. By leveraging a recursive structure in transformer networks, we predefine a low-dimensional subspace to confine the activations and gradients, allowing full reconstruction in subsequent layers. Our method achieves up to 100x improvement in communication efficiency and enables training billion-parameter-scale models over low-end GPUs connected via consumer-grade internet speeds as low as 80Mbps, matching the convergence of centralized datacenter systems with 100Gbps model-parallel interconnects.
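
The boundary mechanic described in the abstract can be sketched in a few lines of numpy: if the activations crossing a stage boundary are confined to a predefined low-dimensional subspace, only the coefficients need to be sent, and the next stage can reconstruct them exactly. The fixed random basis and the sizes below are illustrative assumptions; the same idea applies symmetrically to activation gradients in the backward pass.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 2048, 64, 256                         # hidden dim, subspace dim, tokens (illustrative)

# Predefined orthonormal basis shared by both pipeline stages (never transmitted).
B = np.linalg.qr(rng.standard_normal((d, k)))[0]

def boundary_send(h):
    """Sender: keep only the subspace coefficients of the activations."""
    return h @ B                                 # (n, d) -> (n, k): ~97% fewer values on the wire

def boundary_recv(coeffs):
    """Receiver: exact reconstruction, since h was confined to span(B) to begin with."""
    return coeffs @ B.T

# Activations produced by a layer whose outputs are constrained to the subspace:
h = (rng.standard_normal((n, d)) @ B) @ B.T      # lies in span(B) by construction
h_hat = boundary_recv(boundary_send(h))
print("max reconstruction error:", np.abs(h - h_hat).max())   # ~0: lossless for in-subspace activations
print("values sent per token:", k, "instead of", d)
```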

NeurIPS Conference 2025 · Conference Paper

Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization

  • Alexander Long
  • Chamin P Hewa Koneputugodage
  • Thalaiyasingam Ajanthan
  • Yan Zuo
  • Gil Avraham
  • Violetta Shevchenko
  • Hadi Mohaghegh Dolatabadi
  • Sameera Ramasinghe

We consider a decentralized setup in which the participants collaboratively train and serve a large neural network, and where each participant only processes a subset of the model. In this setup, we explore the possibility of unmaterializable weights, where a full weight set is never available to any one participant. We introduce Unextractable Protocol Models (UPMs): a training and inference framework that leverages the sharded model setup to ensure model shards (i.e., subsets) held by participants are incompatible at different time steps. UPMs periodically inject time-varying, random, invertible transforms at participant boundaries, preserving the overall network function yet rendering cross-time assemblies incoherent. On Qwen-2.5-0.5B and Llama-3.2-1B, 10,000 transforms leave FP32 perplexity unchanged ($\Delta$PPL $< 0.01$; Jensen–Shannon drift $< 4 \times 10^{-5}$), and we show how to control growth for lower-precision datatypes. Applying a transform every 30s adds 3% latency, 0.1% bandwidth, and 10% GPU-memory overhead at inference, while training overhead falls to 1.6% time and <1% memory. We consider several attacks, showing that the requirements of direct attacks are impractical and easy to defend against, and that gradient-based fine-tuning of stitched partitions consumes $\geq 60\%$ of the tokens required to train from scratch. By enabling models to be collaboratively trained yet not extracted, UPMs make it practical to embed programmatic incentive mechanisms in community-driven decentralized training.
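
A minimal sketch of the function-preserving boundary transform the abstract describes, using two linear shards and a single random invertible matrix: one participant's shard absorbs the transform and the other absorbs its inverse, so the composed function is unchanged while shards from different time steps no longer fit together. This is an illustration of the trick, not the UPM protocol itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64

# Two "participants", each holding one shard of a toy two-layer linear network.
W1 = rng.standard_normal((d, d)) / np.sqrt(d)    # shard held by participant A
W2 = rng.standard_normal((d, d)) / np.sqrt(d)    # shard held by participant B

def forward(x, A, B):
    return (x @ A) @ B

# Inject a random invertible transform T at the A -> B boundary:
# A's shard absorbs T, B's shard absorbs T^{-1}; the network function is preserved,
# but A's new shard is incompatible with B's old shard (and vice versa).
T = rng.standard_normal((d, d)) + d * np.eye(d)  # diagonally dominant, hence invertible (illustrative)
W1_new = W1 @ T
W2_new = np.linalg.inv(T) @ W2

x = rng.standard_normal((8, d))
print("function preserved:    ", np.allclose(forward(x, W1, W2), forward(x, W1_new, W2_new)))
print("cross-time stitch drifts:", not np.allclose(forward(x, W1_new, W2), forward(x, W1, W2)))
```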

ICML Conference 2024 · Conference Paper

A sampling theory perspective on activations for implicit neural representations

  • Hemanth Saratchandran
  • Sameera Ramasinghe
  • Violetta Shevchenko
  • Alexander Long
  • Simon Lucey

Implicit Neural Representations (INRs) have gained popularity for encoding signals as compact, differentiable entities. While INRs commonly use techniques like Fourier positional encodings or non-traditional activation functions (e.g., Gaussian, sinusoid, or wavelets) to capture high-frequency content, the properties of these activations have not been explored within a unified theoretical framework. Addressing this gap, we conduct a comprehensive analysis of these activations from a sampling theory perspective. Our investigation reveals that, especially in shallow INRs, $\mathrm{sinc}$ activations—previously unused in conjunction with INRs—are theoretically optimal for signal encoding. Additionally, we establish a connection between dynamical systems and INRs, leveraging sampling theory to bridge these two paradigms.
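
For concreteness, a tiny numpy sketch of the kind of model under study: a shallow implicit neural representation whose hidden layer uses a sinc activation, mapping coordinates to signal values. The width, frequency-like scaling, and initialization are assumptions for illustration, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Shallow INR: coordinate -> sinc hidden layer -> linear readout (illustrative sizes).
in_dim, hidden, out_dim = 1, 256, 1
W1 = rng.standard_normal((in_dim, hidden)) * 30.0   # frequency-like input scaling, an assumption
b1 = rng.uniform(-np.pi, np.pi, hidden)
W2 = rng.standard_normal((hidden, out_dim)) / np.sqrt(hidden)

def inr(x):
    """f(x) = sinc(x W1 + b1) W2, using numpy's normalized sinc(t) = sin(pi t) / (pi t)."""
    return np.sinc(x @ W1 + b1) @ W2

coords = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
print(inr(coords).ravel())    # untrained network evaluated at a few coordinates
```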

AAAI Conference 2022 · Conference Paper

Fast and Data Efficient Reinforcement Learning from Pixels via Non-parametric Value Approximation

  • Alexander Long
  • Alan Blair
  • Herke van Hoof

We present Nonparametric Approximation of Inter-Trace returns (NAIT), a Reinforcement Learning algorithm for discrete-action, pixel-based environments that is both highly sample- and computation-efficient. NAIT is a lazy-learning approach with an update that is equivalent to episodic Monte Carlo on episode completion, but that allows the stable incorporation of rewards while an episode is ongoing. We make use of a fixed domain-agnostic representation, simple distance-based exploration, and a proximity-graph-based lookup to facilitate extremely fast execution. We empirically evaluate NAIT on both the 26 and 57 game variants of ATARI100k where, despite its simplicity, it achieves competitive performance in the online setting with greater than 100x speedup in wall-time.
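
The non-parametric value estimate at the heart of the abstract can be sketched as below: a fixed random-projection representation of pixel frames plus a k-nearest-neighbour average of stored returns per action. The feature map, memory layout, and plain k-NN search stand in for NAIT's domain-agnostic representation and proximity-graph lookup and are assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

n_actions, feat_dim = 6, 128
P = rng.standard_normal((84 * 84, feat_dim)) / np.sqrt(84 * 84)  # fixed random projection (assumed)

# Per-action memory of (feature, return) pairs gathered from completed episodes.
memory = {a: ([], []) for a in range(n_actions)}

def featurize(frame):
    return frame.reshape(-1) @ P                 # fixed, domain-agnostic representation

def store(frame, action, ret):
    feats, rets = memory[int(action)]
    feats.append(featurize(frame))
    rets.append(float(ret))

def q_value(frame, action, k=5):
    feats, rets = memory[action]
    if not feats:
        return 0.0                               # unvisited action: neutral default
    z = featurize(frame)
    dist = np.linalg.norm(np.stack(feats) - z, axis=1)
    nearest = np.argsort(dist)[:k]
    return float(np.mean(np.array(rets)[nearest]))   # k-NN average of stored returns

# Toy usage: store a few transitions, then query a greedy action for a new frame.
for _ in range(20):
    store(rng.random((84, 84)), rng.integers(n_actions), rng.random())
frame = rng.random((84, 84))
print("greedy action:", int(np.argmax([q_value(frame, a) for a in range(n_actions)])))
```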