Eric Winsor Papers

NeurIPS Conference 2025 Conference Paper

Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

Xander Davies
Eric Winsor
Alexandra Souly
Tomek Korbak
Robert Kirk
Christian Schroeder de Witt
Yarin Gal

LLM developers deploy technical mitigations to prevent fine-tuning misuse attacks, attacks in which adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences; however, prior attacks training and/or inference samples can be easily flagged as suspicious. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We demonstrate a class of 'pointwise-undetectable' attacks that repurpose semantic or syntactic variations in benign model outputs to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. Our results showing fundamental limitations of defending against pointwise attacks suggest focusing research efforts on mitigations towards multi-point defences.

PDF Details

ICLR Conference 2025 Conference Paper

Look Before You Leap: Universal Emergent Mechanism for Retrieval in Language Models

Alexandre Variengien
Eric Winsor

When solving challenging problems, language models (LMs) are able to identify relevant information from long and complicated contexts. To study how LMs solve retrieval tasks in diverse situations, we introduce ORION, a collection of structured retrieval tasks spanning six domains, from text understanding to coding. Each task in ORION can be represented abstractly by a request (e.g. a question) that retrieves an attribute (e.g. the character name) from a context (e.g. a story). We apply causal analysis on 18 open-source language models with sizes ranging from 125 million to 70 billion parameters. We find that LMs internally decompose retrieval tasks in a modular way: middle layers at the last token position process the request, while late layers retrieve the correct entity from the context. After causally enforcing this decomposition, models are still able to solve the original task, preserving 70% of the original correct token probability in 98 of the 106 studied model-task pairs. We connect our macroscopic decomposition with a microscopic description by performing a fine-grained case study of a question-answering task on Pythia-2.8b. Building on our high-level understanding, we demonstrate a proof of concept application for scalable internal oversight of LMs to mitigate prompt-injection while requiring human supervision on only a single input. Our solution improves accuracy drastically (from 15.5% to 97.5% on Pythia-12b). This work presents evidence of a universal emergent modular processing of tasks across varied domains and models and is a pioneering effort in applying interpretability for scalable internal oversight of LMs.

Details

NeurIPS Conference 2021 Conference Paper

Scatterbrain: Unifying Sparse and Low-rank Attention

Beidi Chen
Tri Dao
Eric Winsor
Zhao Song
Atri Rudra
Christopher Ré

Recent advances in efficient Transformers have exploited either the sparsity or low-rank properties of attention matrices to reduce the computational and memory bottlenecks of modeling long sequences. However, it is still challenging to balance the trade-off between model quality and efficiency to perform a one-size-fits-all approximation for different tasks. To better understand this trade-off, we observe that sparse and low-rank approximations excel in different regimes, determined by the softmax temperature in attention, and sparse + low-rank can outperform each individually. Inspired by the classical robust-PCA algorithm for sparse and low-rank decomposition, we propose Scatterbrain, a novel way to unify sparse (via locality sensitive hashing) and low-rank (via kernel feature map) attention for accurate and efficient approximation. The estimation is unbiased with provably low error. We empirically show that Scatterbrain can achieve $2. 1 \times$ lower error than baselines when serving as a drop-in replacement in BigGAN image generation and pre-trained T2T-ViT. On a pre-trained T2T Vision transformer, even without fine-tuning, Scatterbrain can reduce $98\%$ of attention memory at the cost of only $1\%$ drop in accuracy. We demonstrate Scatterbrain for end-to-end training with up to $4$ points better perplexity and 5 points better average accuracy than sparse or low-rank efficient transformers on language modeling and long-range-arena tasks.

PDF Details

TCS Journal 2020 Journal Article

Generalized Lyndon factorizations of infinite words

Amanda Burcroff
Eric Winsor

A generalized lexicographic order on words is a lexicographic order where the total order of the (possibly infinite) alphabet depends on the position of the comparison. A generalized Lyndon word is a finite word which is strictly smallest among its class of rotations with respect to a generalized lexicographic order. This notion can be extended to infinite words: an infinite generalized Lyndon word is an infinite word which is strictly smallest among its class of suffixes. We positively answer a question of Dolce, Restivo, and Reutenauer by proving that every infinite word has a unique nonincreasing factorization into finite and infinite generalized Lyndon words. When this factorization has finitely many terms, we characterize the last term of the factorization. Our methods also show that the infinite generalized Lyndon words are precisely the words with infinitely many generalized Lyndon prefixes.

Details DOI

TCS Journal 2020 Journal Article

Reprint of: Generalized Lyndon factorizations of infinite words

Amanda Burcroff
Eric Winsor

A generalized lexicographic order on words is a lexicographic order where the total order of the (possibly infinite) alphabet depends on the position of the comparison. A generalized Lyndon word is a finite word which is strictly smallest among its class of rotations with respect to a generalized lexicographic order. This notion can be extended to infinite words: an infinite generalized Lyndon word is an infinite word which is strictly smallest among its class of suffixes. We positively answer a question of Dolce, Restivo, and Reutenauer by proving that every infinite word has a unique nonincreasing factorization into finite and infinite generalized Lyndon words. When this factorization has finitely many terms, we characterize the last term of the factorization. Our methods also show that the infinite generalized Lyndon words are precisely the words with infinitely many generalized Lyndon prefixes.

Details DOI

Possible papers

Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

Look Before You Leap: Universal Emergent Mechanism for Retrieval in Language Models

Scatterbrain: Unifying Sparse and Low-rank Attention

Generalized Lyndon factorizations of infinite words

Reprint of: Generalized Lyndon factorizations of infinite words