Arrow Research search

Author name cluster

Brian Lester

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers
2 author rows

Possible papers (5)

NeurIPS 2025 Conference Paper

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

  • Nikhil Kandpal
  • Brian Lester
  • Colin Raffel
  • Sebastian Majstorovic
  • Stella Biderman
  • Baber Abbasi
  • Luca Soldaini
  • Enrico Shippole

Large language models (LLMs) are typically trained on enormous quantities of unlicensed text, a practice that has led to scrutiny due to possible intellectual property infringement and ethical concerns. Training LLMs on openly licensed text presents a first step towards addressing these issues, but prior data collection efforts have yielded datasets too small or low-quality to produce performant LLMs. To address this gap, we collect, curate, and release the Common Pile v0.1, an eight terabyte collection of openly licensed text designed for LLM pretraining. The Common Pile comprises content from 30 sources that span diverse domains including research papers, code, books, encyclopedias, educational materials, audio transcripts, and more. Crucially, we validate our efforts by training two 7 billion parameter LLMs on text from the Common Pile: Comma v0.1-1T and Comma v0.1-2T, trained on 1 and 2 trillion tokens respectively. Both models attain competitive performance to LLMs trained on unlicensed text with similar computational budgets, such as Llama 1 and 2 7B. In addition to releasing the Common Pile v0.1 itself, we also release the code used in its creation as well as the training mixture and checkpoints for the Comma v0.1 models.
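For a sense of how a corpus like this is typically consumed, the sketch below streams a single source with the Hugging Face `datasets` library. The repository id and the `text` field name are illustrative placeholders, not confirmed details of the Common Pile release.

```python
# Minimal sketch: stream one Common Pile source rather than downloading the
# full multi-terabyte collection. The repo id and field name are placeholders.
from datasets import load_dataset

ds = load_dataset(
    "common-pile/example-source",  # hypothetical repository id
    split="train",
    streaming=True,                # avoid materializing terabytes locally
)

for i, record in enumerate(ds):
    print(record["text"][:200])    # assumes a "text" column, as in most pretraining corpora
    if i >= 2:
        break
```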

TMLR 2024 Journal Article

Training LLMs over Neurally Compressed Text

  • Brian Lester
  • Jaehoon Lee
  • Alexander A Alemi
  • Jeffrey Pennington
  • Adam Roberts
  • Jascha Sohl-Dickstein
  • Noah Constant

In this paper, we explore the idea of training large language models (LLMs) over highly compressed text. While standard subword tokenizers compress text by a small factor, neural text compressors can achieve much higher rates of compression. If it were possible to train LLMs directly over neurally compressed text, this would confer advantages in training and serving efficiency, as well as easier handling of long text spans. The main obstacle to this goal is that strong compression tends to produce opaque outputs that are not well-suited for learning. In particular, we find that text naïvely compressed via Arithmetic Coding is not readily learnable by LLMs. To overcome this, we propose Equal-Info Windows, a novel compression technique whereby text is segmented into blocks that each compress to the same bit length. Using this method, we demonstrate effective learning over neurally compressed text that improves with scale, and outperforms byte-level baselines by a wide margin on perplexity and inference speed benchmarks. While our method delivers worse perplexity than subword tokenizers for models trained with the same parameter count, it has the benefit of shorter sequence lengths. Shorter sequence lengths require fewer autoregressive generation steps, often reducing latency. Finally, we provide extensive analysis of the properties that contribute to learnability, and offer concrete suggestions for how to further improve the performance of high-compression tokenizers.
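The core idea is concrete enough to sketch: segment the text so that every segment compresses to the same number of bits, resetting the compressor at each window boundary. The toy sketch below uses `zlib` as a stand-in for the paper's neural arithmetic coder, so the window contents differ from the published method; only the segmentation logic is illustrated.

```python
# Equal-Info Windows, illustrated with zlib as a stand-in compressor: grow a
# window until its (from-scratch) compressed size reaches the bit budget, emit
# it, and reset. Re-compressing each candidate window is quadratic in text
# length; it keeps the sketch short, not fast.
import zlib

def equal_info_windows(text: str, bits_per_window: int = 512) -> list[str]:
    windows, start = [], 0
    for end in range(1, len(text) + 1):
        compressed_bits = 8 * len(zlib.compress(text[start:end].encode("utf-8")))
        if compressed_bits >= bits_per_window:
            windows.append(text[start:end])
            start = end  # compressor state is "reset" by starting a new window
    if start < len(text):
        windows.append(text[start:])  # trailing partial window
    return windows

segments = equal_info_windows("the quick brown fox jumps over the lazy dog " * 16)
print(len(segments), [len(s) for s in segments])
```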

ICML 2023 Conference Paper

Git-Theta: A Git Extension for Collaborative Development of Machine Learning Models

  • Nikhil Kandpal
  • Brian Lester
  • Mohammed Muqeeth
  • Anisha Mascarenhas
  • Monty Evans
  • Vishal Baskaran
  • Tenghao Huang
  • Haokun Liu

Currently, most machine learning models are trained by centralized teams and are rarely updated. In contrast, open-source software development involves the iterative development of a shared artifact through distributed collaboration using a version control system. In the interest of enabling collaborative and continual improvement of machine learning models (Raffel, 2023), we introduce Git-Theta, a version control system for machine learning models. Git-Theta is an extension to Git, the most widely used version control software, that allows fine-grained tracking of changes to model parameters alongside code and other artifacts. Unlike existing version control systems that treat a model checkpoint as a blob of data, Git-Theta leverages the structure of checkpoints to support communication-efficient updates, automatic model merges, and meaningful reporting about the difference between two versions of a model. In addition, Git-Theta includes a plug-in system that enables users to easily add support for new functionality. In this paper, we introduce Git-Theta’s design and features and include an example use-case of Git-Theta where a pre-trained model is continually adapted and modified. We publicly release Git-Theta in hopes of kickstarting a new era of collaborative model development. https://github.com/r-three/git-theta/
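The sketch below illustrates the underlying idea rather than Git-Theta's actual API: because a checkpoint has per-parameter structure, two versions can be compared group by group, which is what makes communication-efficient updates and meaningful diffs possible. The function and checkpoint names here are invented for illustration.

```python
# Conceptual sketch of a parameter-aware diff between two model checkpoints.
# This is not Git-Theta's storage format or CLI; it only shows why treating a
# checkpoint as structured parameter groups (rather than an opaque blob) yields
# small, human-readable changes.
import numpy as np

def checkpoint_diff(old: dict[str, np.ndarray], new: dict[str, np.ndarray]) -> dict[str, str]:
    report = {}
    for name in sorted(set(old) | set(new)):
        if name not in old:
            report[name] = "added"
        elif name not in new:
            report[name] = "removed"
        elif old[name].shape != new[name].shape:
            report[name] = f"reshaped {old[name].shape} -> {new[name].shape}"
        elif np.array_equal(old[name], new[name]):
            report[name] = "unchanged"  # could be stored as a reference, not a copy
        else:
            report[name] = f"modified (mean |delta| = {np.abs(new[name] - old[name]).mean():.2e})"
    return report

old_ckpt = {"embedding": np.zeros((4, 8)), "lm_head": np.ones((8, 4))}
new_ckpt = {"embedding": np.zeros((4, 8)), "lm_head": np.ones((8, 4)) * 1.01}
print(checkpoint_diff(old_ckpt, new_ckpt))
```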

JMLR 2023 Journal Article

Scaling Up Models and Data with t5x and seqio

  • Adam Roberts
  • Hyung Won Chung
  • Gaurav Mishra
  • Anselm Levskaya
  • James Bradbury
  • Daniel Andor
  • Sharan Narang
  • Brian Lester

Scaling up training datasets and model parameters has benefited neural network-based language models, but also presents challenges like distributed compute, input data bottlenecks, and reproducibility of results. We introduce two simple and scalable software libraries that address these issues: t5x enables training large language models at scale, while seqio enables reproducible input and evaluation pipelines. These open-source libraries have been used to train models with hundreds of billions of parameters on multi-terabyte datasets. Configurations and instructions for T5-like and GPT-like models are also provided. The libraries can be found at https://github.com/google-research/t5x and https://github.com/google/seqio.
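As a rough illustration of the seqio side of this pairing, the sketch below registers a named Task that bundles a TFDS source, deterministic preprocessing, and a vocabulary, so any consumer can rebuild the same input pipeline by name. The task name, vocabulary path, and TFDS version string are placeholders, and the argument names follow the seqio README from memory; consult the repository for the authoritative API.

```python
# Hedged sketch of a seqio-style reproducible pipeline: a registered Task names
# the raw source, the preprocessing steps, and the vocabulary, so training and
# evaluation jobs can request the identical data stream by task name.
import seqio

vocab = seqio.SentencePieceVocabulary("/path/to/spm.model")  # placeholder model path

def extract_text(dataset):
    # Map the raw TFDS "text" field onto the "targets" feature the Task declares.
    return dataset.map(lambda ex: {"targets": ex["text"]})

seqio.TaskRegistry.add(
    "example_lm_task",  # hypothetical task name
    source=seqio.TfdsDataSource(tfds_name="c4/en:3.0.1"),  # version string may differ
    preprocessors=[
        extract_text,
        seqio.preprocessors.tokenize,
        seqio.preprocessors.append_eos,
    ],
    output_features={"targets": seqio.Feature(vocabulary=vocab, add_eos=True)},
)

# A t5x training job (or any other consumer) can then materialize the pipeline:
ds = seqio.get_mixture_or_task("example_lm_task").get_dataset(
    sequence_length={"targets": 512}, split="train", shuffle=True
)
```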

ICLR 2022 Conference Paper

Finetuned Language Models are Zero-Shot Learners

  • Jason Wei
  • Maarten P. Bosma
  • Vincent Y. Zhao
  • Kelvin Guu
  • Adams Wei Yu
  • Brian Lester
  • Nan Du 0002
  • Andrew M. Dai

This paper explores a simple method for improving the zero-shot learning abilities of language models. We show that instruction tuning—finetuning language models on a collection of datasets described via instructions—substantially improves zero-shot performance on unseen tasks. We take a 137B parameter pretrained language model and instruction tune it on over 60 NLP datasets verbalized via natural language instruction templates. We evaluate this instruction-tuned model, which we call FLAN, on unseen task types. FLAN substantially improves the performance of its unmodified counterpart and surpasses zero-shot 175B GPT-3 on 20 of 25 datasets that we evaluate. FLAN even outperforms few-shot GPT-3 by a large margin on ANLI, RTE, BoolQ, AI2-ARC, OpenbookQA, and StoryCloze. Ablation studies reveal that the number of finetuning datasets, model scale, and natural language instructions are key to the success of instruction tuning.
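A toy version of the verbalization step makes the recipe concrete: each dataset example is rendered through one of several natural-language instruction templates, producing (instruction, answer) pairs for finetuning. The templates and label wording below are invented for illustration; they are not FLAN's actual templates.

```python
# Toy instruction-tuning verbalizer for an NLI example: pick one of several
# handwritten templates, fill in the fields, and map the label to a natural-
# language answer. Templates here are illustrative, not FLAN's.
import random

NLI_TEMPLATES = [
    "Premise: {premise}\nHypothesis: {hypothesis}\nDoes the premise entail the hypothesis?",
    "{premise}\nBased on the paragraph above, can we conclude that \"{hypothesis}\"?",
    "Read the premise and decide whether the hypothesis follows.\nPremise: {premise}\nHypothesis: {hypothesis}",
]

LABEL_WORDS = {0: "yes", 1: "it is not possible to tell", 2: "no"}

def verbalize_nli(example: dict, rng: random.Random) -> dict:
    template = rng.choice(NLI_TEMPLATES)
    return {
        "inputs": template.format(premise=example["premise"], hypothesis=example["hypothesis"]),
        "targets": LABEL_WORDS[example["label"]],
    }

rng = random.Random(0)
example = {"premise": "A dog is running in the park.", "hypothesis": "An animal is outside.", "label": 0}
print(verbalize_nli(example, rng))
```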