Arrow Research search
Back to NeurIPS

NeurIPS 2025

Safety Pretraining: Toward the Next Generation of Safe AI

Conference Paper Main Conference Track Artificial Intelligence ยท Machine Learning

Abstract

As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. In this work, we present a data-centric pretraining framework that builds safety into the model from the start. Our framework consists of four key steps: (i) Safety Filtering: building a safety classifier to classify webdata into safe and unsafe categories; (ii) Safety Rephrasing: we recontextualize unsafe webdata into safer narratives; (iii) Native Refusal: we synthetically generate pretraining datasets that actively teach models to refuse on unsafe content and the moral reasoning behind it, and (iv) Harmfulness-Tag annotated pretraining: we flag unsafe content during pretraining using a special token, and use it to steer models away from unsafe generations at inference-time. Our safety-pretrained models reduce attack success rates from 38. 8% to 8. 4% on standard LLM safety benchmarks with no performance degradation on general tasks.

Authors

Keywords

No keywords are indexed for this paper.

Context

Venue
Annual Conference on Neural Information Processing Systems
Archive span
1987-2025
Indexed papers
30776
Paper id
584593271751869370