Arrow Research search

Author name cluster

Jonathan May

Possible papers associated with this exact author name in Arrow. This page groups case-insensitive exact name matches and is not a full identity disambiguation profile.

6 papers
2 author rows

Possible papers

6

NeurIPS Conference 2025 Conference Paper

Language Models Can Predict Their Own Behavior

  • Dhananjay Ashok
  • Jonathan May

The text produced by language models (LMs) can exhibit specific `behaviors, ' such as a failure to follow alignment training, that we hope to detect and react to during deployment. Identifying these behaviors can often only be done post facto, i. e. , after the entire text of the output has been generated. We provide evidence that there are times when we can predict how an LM will behave early in computation, before even a single token is generated. We show that probes trained on the internal representation of input tokens alone can predict a wide range of eventual behaviors over the entire output sequence. Using methods from conformal prediction, we provide provable bounds on the estimation error of our probes, creating precise early warning systems for these behaviors. The conformal probes can identify instances that will trigger alignment failures (jailbreaking) and instruction-following failures, without requiring a single token to be generated. An early warning system built on the probes reduces jailbreaking by 91%. Our probes also show promise in pre-emptively estimating how confident the model will be in its response, a behavior that cannot be detected using the output text alone. Conformal probes can preemptively estimate the final prediction of an LM that uses Chain-of-Thought (CoT) prompting, hence accelerating inference. When applied to an LM that uses CoT to perform text classification, the probes drastically reduce inference costs (65% on average across 27 datasets), with negligible accuracy loss. Encouragingly, probes generalize to unseen datasets and perform better on larger models, suggesting applicability to the largest of models in real-world settings.

NeurIPS Conference 2024 Conference Paper

Megalodon: Efficient LLM Pretraining and Inference with Unlimited Context Length

  • Xuezhe Ma
  • Xiaomeng Yang
  • Wenhan Xiong
  • Beidi Chen
  • Lili Yu
  • Hao Zhang
  • Jonathan May
  • Luke Zettlemoyer

The quadratic complexity and weak length extrapolation of Transformers limits their ability to scale to long sequences, and while sub-quadratic solutions like linear attention and state space models exist, they empirically underperform Transformers in pretraining efficiency and downstream task accuracy. We introduce MEGALODON, an neural architecture for efficient sequence modeling with unlimited context length. MEGALODON inherits the architecture of MEGA (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability and stability, including complex exponential moving average (CEMA), timestep normalization layer, normalized attention mechanism and pre-norm with two-hop residual configuration. In a controlled head-to-head comparison with LLAMA2, MEGALODON achieves better efficiency than Transformer in the scale of 7 billion parameters and 2 trillion training tokens. MEGALODON reaches a training loss of 1. 70, landing mid-way between LLAMA2-7B (1. 75) and LLAMA2-13B (1. 67). This result is robust throughout a wide range of benchmarks, where MEGALODON consistently outperforms Transformers across different tasks, domains, and modalities.

ICLR Conference 2023 Conference Paper

Mega: Moving Average Equipped Gated Attention

  • Xuezhe Ma
  • Chunting Zhou
  • Xiang Kong
  • Junxian He
  • Liangke Gui
  • Graham Neubig
  • Jonathan May
  • Luke Zettlemoyer

The design choices in the Transformer attention mechanism, including weak inductive bias and quadratic computational complexity, have limited its application for modeling long sequences. In this paper, we introduce Mega, a simple, theoretically grounded, single-head gated attention mechanism equipped with (exponential) moving average to incorporate inductive bias of position-aware local dependencies into the position-agnostic attention mechanism. We further propose a variant of Mega that offers linear time and space complexity yet yields only minimal quality loss, by efficiently splitting the whole sequence into multiple chunks with fixed length. Extensive experiments on a wide range of sequence modeling benchmarks, including the Long Range Arena, neural machine translation, auto-regressive language modeling, and image and speech classification, show that Mega achieves significant improvements over other sequence models, including variants of Transformers and recent state space models.

NeurIPS Conference 2021 Conference Paper

Luna: Linear Unified Nested Attention

  • Xuezhe Ma
  • Xiang Kong
  • Sinong Wang
  • Chunting Zhou
  • Jonathan May
  • Hao Ma
  • Luke Zettlemoyer

The quadratic computational and memory complexities of the Transformer's attention mechanism have limited its scalability for modeling long sequences. In this paper, we propose Luna, a linear unified nested attention mechanism that approximates softmax attention with two nested linear attention functions, yielding only linear (as opposed to quadratic) time and space complexity. Specifically, with the first attention function, Luna packs the input sequence into a sequence of fixed length. Then, the packed sequence is unpacked using the second attention function. As compared to a more traditional attention mechanism, Luna introduces an additional sequence with a fixed length as input and an additional corresponding output, which allows Luna to perform attention operation linearly, while also storing adequate contextual information. We perform extensive evaluations on three benchmarks of sequence modeling tasks: long-context sequence modelling, neural machine translation and masked language modeling for large-scale pretraining. Competitive or even better experimental results demonstrate both the effectiveness and efficiency of Luna compared to a variety of strong baseline methods including the full-rank attention and other efficient sparse and dense attention methods.

IJCAI Conference 2013 Conference Paper

Identifying Useful Human Correction Feedback from an On-line Machine Translation Service

  • Alberto Barrón-Cedeño
  • Lluís Màrquez
  • Carlos A. Henríquez Q.
  • Lluís Formiga
  • Enrique Romero
  • Jonathan May

Post-editing feedback provided by users of on-line translation services offers an excellent opportunity for automatic improvement of statistical machine translation (SMT) systems. However, feedback provided by casual users is very noisy, and must be automatically filtered in order to identify the potentially useful cases. We present a study on automatic feedback filtering in a real weblog collected from Reverso. net. We extend and re-annotate a training corpus, define an extended set of simple features and approach the problem as a binary classification task, experimenting with linear and kernelbased classifiers and feature selection. Results on the feedback filtering task show a significant improvement over the majority class, but also a precision ceiling around 70-80%. This reflects the inherent difficulty of the problem and indicates that shallow features cannot fully capture the semantic nature of the problem. Despite the modest results on the filtering task, the classifiers are proven effective in an application-based evaluation. The incorporation of a filtered set of feedback instances selected from a larger corpus significantly improves the performance of a phrase-based SMT system, according to a set of standard evaluation metrics.

TCS Journal 2009 Journal Article

Backward and forward bisimulation minimization of tree automata

  • Johanna Högberg
  • Andreas Maletti
  • Jonathan May

We improve on an existing [P. A. Abdulla, J. Högberg, L. Kaati, Bisimulation minimization of tree automata, International Journal of Foundations of Computer Science 18(4) (2007) 699–713] bisimulation minimization algorithm for finite-state tree automata by introducing backward and forward bisimulation and developing minimization algorithms for them. Minimization via forward bisimulation is also effective on deterministic tree automata, faster than the previous algorithm, and yields the minimal equivalent deterministic tree automaton. Minimization via backward bisimulation generalizes the previous algorithm and can yield smaller automata but is just as fast. We demonstrate implementations of these algorithms on a typical task in natural language processing.