Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

Xander Davies; Eric Winsor; Alexandra Souly; Tomek Korbak; Robert Kirk; Christian Schroeder de Witt; Yarin Gal

Back to NeurIPS

NeurIPS 2025

Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

Conference Paper Main Conference Track Artificial Intelligence · Machine Learning

PDF Details

Abstract

LLM developers deploy technical mitigations to prevent fine-tuning misuse attacks, attacks in which adversaries evade safeguards by fine-tuning the model using a public API. Previous work has established several successful attacks against specific fine-tuning API defences; however, prior attacks training and/or inference samples can be easily flagged as suspicious. In this work, we show that defences of fine-tuning APIs that seek to detect individual harmful training or inference samples ('pointwise' detection) are fundamentally limited in their ability to prevent fine-tuning attacks. We demonstrate a class of 'pointwise-undetectable' attacks that repurpose semantic or syntactic variations in benign model outputs to covertly transmit dangerous knowledge. Our attacks are composed solely of unsuspicious benign samples that can be collected from the model before fine-tuning, meaning training and inference samples are all individually benign and low-perplexity. We test our attacks against the OpenAI fine-tuning API, finding they succeed in eliciting answers to harmful multiple-choice questions, and that they evade an enhanced monitoring system we design that successfully detects other fine-tuning attacks. Our results showing fundamental limitations of defending against pointwise attacks suggest focusing research efforts on mitigations towards multi-point defences.

Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs

Abstract

Authors

Keywords

Context