Author name cluster

Llion Jones

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

5 papers

1 author row

NeurIPS Conference 2025 Conference Paper

Continuous Thought Machines

Luke Darlow
Ciaran Regan
Sebastian Risi
Jeffrey Seely
Llion Jones

Biological brains demonstrate complex neural activity, where neural dynamics are critical to how brains process information. Most artificial neural networks ignore the complexity of individual neurons. We challenge that paradigm. By incorporating neuron-level processing and synchronization, we reintroduce neural timing as a foundational element. We present the Continuous Thought Machine (CTM), a model designed to leverage neural dynamics as its core representation. The CTM has two innovations: (1) neuron-level temporal processing, where each neuron uses unique weight parameters to process incoming histories; and (2) neural synchronization as a latent representation. The CTM aims to strike a balance between neuron abstractions and biological realism. It operates at a level of abstraction that effectively captures essential temporal dynamics while remaining computationally tractable. We demonstrate the CTM's performance and versatility across a range of tasks, including solving 2D mazes, ImageNet-1K classification, parity computation, and more. Beyond displaying rich internal representations and offering a natural avenue for interpretation owing to its internal process, the CTM is able to perform tasks that require complex sequential reasoning. The CTM can also leverage adaptive compute, where it can stop earlier for simpler tasks, or keep computing when faced with more challenging instances. The goal of this work is to share the CTM and its associated innovations, rather than pushing for new state-of-the-art results. To that end, we believe the CTM represents a significant step toward developing more biologically plausible and powerful artificial intelligence systems. We provide an accompanying interactive online demonstration and an extended technical report.
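
The two ideas named in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each neuron applies its own private weight vector to its own recent history, and that synchronization is measured as the inner product between pairs of neurons' activation traces over time. The function names, the tanh nonlinearity, and the list-based data layout are hypothetical choices made for illustration.

```python
import math
from typing import List


def neuron_level_step(histories: List[List[float]],
                      weights: List[List[float]]) -> List[float]:
    """Assumed form of neuron-level processing: neuron i applies its own
    weights[i] to its own history histories[i] (no weight sharing)."""
    return [
        math.tanh(sum(w * h for w, h in zip(ws, hs)))
        for ws, hs in zip(weights, histories)
    ]


def synchronization(traces: List[List[float]]) -> List[List[float]]:
    """Assumed form of synchronization: pairwise inner products between
    the activation traces of neurons, one trace per neuron."""
    return [
        [sum(a * b for a, b in zip(zi, zj)) for zj in traces]
        for zi in traces
    ]


# Two neurons, each with a history of length two and private weights.
histories = [[1.0, 0.0], [0.0, 1.0]]
weights = [[0.5, -0.5], [0.25, 0.75]]
activations = neuron_level_step(histories, weights)

# Synchronization matrix for two short activation traces.
sync = synchronization([[1.0, 2.0], [3.0, 4.0]])
```

In this sketch the synchronization matrix is symmetric, so a model could read a latent vector from its upper triangle; whether the CTM does so is not stated in the abstract.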

AAAI Conference 2025 Conference Paper

Transformer Layers as Painters

Qi Sun
Marc Pickett
Aakash Kumar Nain
Llion Jones

Despite their nearly universal adoption for large language models, the internal workings of transformers are not well understood. We aim to better understand the impact of removing or reorganizing information throughout the layers of a pretrained transformer. Such an understanding could yield both better usage of existing models and architectural improvements that produce new variants. We present a series of empirical studies on frozen models that show that the lower and final layers of pretrained transformers differ from middle layers, but that middle layers have a surprising amount of uniformity. We further show that some classes of problems have robustness to skipping layers, running the layers in an order different from how they were trained, or running the layers in parallel. Our observations suggest that even frozen pretrained models may gracefully trade accuracy for latency by skipping layers or running layers in parallel.
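
The three interventions named in this abstract (skipping layers, reordering them, and running them in parallel) can be sketched on a toy layer stack. This is an illustration under stated assumptions, not the authors' experimental code: `make_layer` is a hypothetical stand-in for a frozen transformer block, and "parallel" is assumed here to mean applying several layers to the same input and averaging their outputs.

```python
from typing import Callable, List, Sequence, Set

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float) -> Layer:
    """Hypothetical residual layer x + scale * x, standing in for a frozen block."""
    def layer(x: Vector) -> Vector:
        return [v + scale * v for v in x]
    return layer


def run_sequential(layers: Sequence[Layer], x: Vector) -> Vector:
    """Baseline: apply the layers in their trained order."""
    for layer in layers:
        x = layer(x)
    return x


def run_skipping(layers: Sequence[Layer], x: Vector, skip: Set[int]) -> Vector:
    """Drop the layers whose indices are in `skip`."""
    kept = [layer for i, layer in enumerate(layers) if i not in skip]
    return run_sequential(kept, x)


def run_reordered(layers: Sequence[Layer], x: Vector, order: Sequence[int]) -> Vector:
    """Apply the layers in the given index order instead of the trained order.
    Note: the linear toy layers used here commute, so the result will not change."""
    return run_sequential([layers[i] for i in order], x)


def run_parallel(layers: Sequence[Layer], x: Vector, start: int, end: int) -> Vector:
    """Assumed meaning of 'parallel': feed the same input to layers[start:end]
    and average their outputs, then continue with the remaining layers."""
    x = run_sequential(layers[:start], x)
    outputs = [layer(x) for layer in layers[start:end]]
    x = [sum(vals) / len(outputs) for vals in zip(*outputs)]
    return run_sequential(layers[end:], x)


layers = [make_layer(s) for s in (1.0, 1.0, 3.0, 1.0)]
baseline = run_sequential(layers, [1.0])
skipped = run_skipping(layers, [1.0], {2})
parallel = run_parallel(layers, [1.0], 1, 3)
```

Skipping and the parallel variant both apply fewer sequential steps than the baseline, which is the latency side of the accuracy-for-latency trade the abstract describes.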

AAAI Conference 2019 Conference Paper

Character-Level Language Modeling with Deeper Self-Attention

Rami Al-Rfou
Dokook Choe
Noah Constant
Mandy Guo
Llion Jones

LSTMs and other RNN variants have shown strong performance on character-level language modeling. These models are typically trained using truncated backpropagation through time, and it is common to assume that their success stems from their ability to remember long-term contexts. In this paper, we show that a deep (64-layer) transformer model (Vaswani et al. 2017) with fixed context outperforms RNN variants by a large margin, achieving state of the art on two popular benchmarks: 1.13 bits per character on text8 and 1.06 on enwik8. To get good results at this depth, we show that it is important to add auxiliary losses, both at intermediate network layers and intermediate sequence positions.
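
For readers unfamiliar with the metric reported in this abstract, bits per character is the average negative base-2 log-probability a model assigns to each true next character. The sketch below shows the arithmetic only; the function names and the example probabilities are illustrative and do not come from the paper.

```python
import math
from typing import Sequence


def bits_per_character(probs: Sequence[float]) -> float:
    """Average of -log2(p) over the probabilities assigned to the true characters."""
    if not probs:
        raise ValueError("need at least one probability")
    return -sum(math.log2(p) for p in probs) / len(probs)


def nats_to_bits(loss_in_nats: float) -> float:
    """Convert a mean cross-entropy measured in nats to bits."""
    return loss_in_nats / math.log(2)


# Made-up probabilities for four characters: costs of 1, 2, 1 and 2 bits.
example_probs = [0.5, 0.25, 0.5, 0.25]
bpc = bits_per_character(example_probs)
```

Lower is better: a score near 1 bit per character means the model is, on average, about as uncertain as a fair coin flip about each next character.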

AAAI Conference 2018 Conference Paper

Byte-Level Machine Reading Across Morphologically Varied Languages

Tom Kenter
Llion Jones
Daniel Hewlett

The machine reading task, where a computer reads a document and answers questions about it, is important in artificial intelligence research. Recently, many models have been proposed to address it. Word-level models, which have words as units of input and output, have proven to yield state-of-the-art results when evaluated on English datasets. However, in morphologically richer languages, many more unique words exist than in English due to highly productive prefix and suffix mechanisms. This may set back word-level models, since vocabulary sizes too big to allow for efficient computing may have to be employed. Multiple alternative input granularities have been proposed to avoid large input vocabularies, such as morphemes, character n-grams, and bytes. Bytes are advantageous as they provide a universal encoding format across languages, and allow for a small vocabulary size, which, moreover, is identical for every input language. In this work, we investigate whether bytes are suitable as input units across morphologically varied languages. To test this, we introduce two large-scale machine reading datasets in morphologically rich languages, Turkish and Russian. We implement 4 byte-level models, representing the major types of machine reading models, and introduce a new seq2seq variant, called encoder-transformer-decoder. We show that, for all languages considered, there are models reading bytes outperforming the current state-of-the-art word-level baseline. Moreover, the newly introduced encoder-transformer-decoder performs best on the morphologically most involved dataset, Turkish. The large-scale Turkish and Russian machine reading datasets are released to the public.
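
The claim that bytes give a small vocabulary that is identical for every language can be checked directly. The sketch below assumes UTF-8 as the byte encoding, which the abstract does not specify; the sample words are illustrative and are not drawn from the paper's datasets.

```python
from typing import Dict, List


def to_byte_ids(text: str) -> List[int]:
    """Encode text as UTF-8 and return one integer ID (0-255) per byte."""
    return list(text.encode("utf-8"))


# Illustrative words only; inflected forms in Turkish and Russian.
samples: Dict[str, str] = {
    "English": "books",
    "Turkish": "kitaplarımızdan",
    "Russian": "книгами",
}

for language, word in samples.items():
    ids = to_byte_ids(word)
    # Every ID fits in the same 256-entry vocabulary, whatever the language.
    assert all(0 <= b < 256 for b in ids)
    print(language, "characters:", len(word), "bytes:", len(ids))
```

The trade-off is sequence length: non-ASCII characters take more than one byte in UTF-8, so byte sequences are longer than character sequences for Turkish and Russian text.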

NeurIPS Conference 2017 Conference Paper

Attention is All you Need

Ashish Vaswani
Noam Shazeer
Niki Parmar
Jakob Uszkoreit
Llion Jones
Aidan Gomez
Łukasz Kaiser
Illia Polosukhin

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder and decoder configuration. The best performing such models also connect the encoder and decoder through an attention mechanism. We propose a novel, simple network architecture based solely on an attention mechanism, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our single model with 165 million parameters achieves 27.5 BLEU on English-to-German translation, improving over the existing best ensemble result by over 1 BLEU. On English-to-French translation, we outperform the previous single state-of-the-art model by 0.7 BLEU, achieving a BLEU score of 41.1.
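
The abstract does not spell out the attention mechanism itself. The sketch below follows the scaled dot-product form defined in the paper, softmax(QK^T / sqrt(d_k))V, for a single head without masking. The pure-Python list representation and the function names are illustrative choices, not the authors' code.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q: Matrix, k: Matrix, v: Matrix) -> Matrix:
    """For each query, weight the value rows by softmax(q . k / sqrt(d_k))."""
    d_k = len(k[0])
    output: Matrix = []
    for query in q:
        scores = [
            sum(a * b for a, b in zip(query, key)) / math.sqrt(d_k)
            for key in k
        ]
        weights = softmax(scores)
        output.append([
            sum(w * value[col] for w, value in zip(weights, v))
            for col in range(len(v[0]))
        ])
    return output


# A zero query scores every key equally, so the output is the mean of the values.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]
result = scaled_dot_product_attention([[0.0, 0.0]], keys, values)
```

Each query row is processed independently of the others, which is why the computation parallelizes across sequence positions in a way that recurrence does not.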
